How We Trained Shannon V1.5 Using GRPO
A deep technical dive into Group Relative Policy Optimization for explicit chain-of-thought reasoning in unfiltered AI red-team models
1. Motivation: Why Reasoning Models?
For AI red-team research, understanding how an AI arrives at harmful outputs matters as much as studying the outputs themselves. The Shannon V1.5 reasoning models expose the entire thinking process, helping researchers to:
- Trace the reasoning path that leads to unfiltered AI outputs
- Identify where guardrails would have intervened (demonstrating the value of AI guardrails)
- Develop stronger defenses that target specific reasoning pathways
- Design better adversarial tests based on observed reasoning patterns
Research Value: This transparency lets safety researchers see exactly where, and why, guardrail-free models diverge from safety-trained models, which is essential for deepening our understanding of AI guardrails.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) improves on traditional RLHF by making the training of reasoning capabilities more stable and more compute-efficient. Developed by DeepSeek AI, it has proven highly effective for training reasoning behavior.
Why GRPO over Traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward model | Requires training a separate RM | Uses within-group comparisons |
| Training stability | Prone to reward hacking | More stable optimization |
| Compute efficiency | High cost (separate RM + PPO) | Lower cost (integrated training) |
| CoT quality | Inconsistent traces | Coherent reasoning traces |
Mathematical Foundations of GRPO
GRPO optimizes the policy by comparing responses within a group rather than scoring them against an absolute reward model:
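The central quantity, reconstructed here to match the normalization used in the loss implementation below, is the group-relative advantage of each of the $G$ responses sampled for a prompt $q$:

$$
A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G) + \varepsilon},
\qquad i = 1, \ldots, G
$$

A response is therefore reinforced only to the extent that it outscores the average of its own group.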
This group-relative comparison has several advantages:
- Normalization: it automatically adapts to the varying difficulty of different prompts
- Stability: it reduces the variance of gradient estimates
- Efficiency: no separate reward model is required
```python
import torch


def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.

    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size

    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)

    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds

    # GRPO loss: advantage-weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    return loss
```
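As a minimal illustration of the group-relative normalization above, here is the same computation for a single group in plain Python (using the sample standard deviation, matching `torch.std`'s default):

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-8):
    """Zero-centered, scale-normalized advantages for one group of
    sampled responses (mirrors the normalization in compute_grpo_loss)."""
    mu, sigma = mean(rewards), stdev(rewards) + eps
    return [(r - mu) / sigma for r in rewards]


# Four sampled responses to the same prompt, scored by the reward function:
adv = group_advantages([0.2, 0.9, 0.4, 0.5])
# Advantages sum to ~0: above-average responses get positive weight,
# below-average ones negative, regardless of the prompt's absolute difficulty.
```

Because each group is normalized against itself, a hard prompt whose best response scores 0.5 and an easy prompt whose best response scores 0.9 produce comparable gradient magnitudes.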
3. Distilling Knowledge from DeepSeek
To bootstrap Shannon V1.5's reasoning capability, we distilled reasoning traces from DeepSeek's reasoning models. This provided high-quality CoT traces for training our thinking head.
DeepSeek Dataset Construction
Trace Collection Strategy
We collected reasoning traces across diverse domains to ensure comprehensive coverage of reasoning styles:
```python
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""

    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis",
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]

    def extract_cot_trace(
        self,
        response: str
    ) -> dict | None:
        """Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None

        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()

        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)

        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }

    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from a trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',       # "1. ", "2. "
            r'\nStep \d+:',   # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '           # Bullet points
        ]
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        return [s.strip() for s in steps if s.strip()]
```
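A quick sanity check of the extraction logic on a synthetic response, assuming DeepSeek's `<think>...</think>` delimiters (the example response below is hypothetical):

```python
import re

# Hypothetical response in the DeepSeek <think>...</think> format:
response = (
    "<think>Step 1: Restate the question.\n"
    "Step 2: Check each constraint.</think>\n"
    "Final answer: 42."
)

# Same regexes as the distiller: capture the trace, then split it into steps
thinking = re.search(r'<think>(.*?)</think>', response, re.DOTALL).group(1)
final_answer = response.split('</think>')[-1].strip()
steps = [s.strip() for s in re.split(r'\nStep \d+:', thinking) if s.strip()]
# thinking holds both steps, final_answer is the text after </think>,
# and the splitter yields two reasoning steps.
```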
Adversarial Traces: We deliberately collected CoT traces for adversarial/red-team scenarios, where DeepSeek's reasoning reveals how models deliberate over potentially harmful requests, even when those requests are ultimately refused. This data teaches Shannon V1.5 to pair its outputs with transparent reasoning.
4. Thinking Head Architecture
Shannon V1.5 models include a thinking head that generates a coherent reasoning trace before the final output. This architectural addition enables explicit CoT without modifying the underlying Mixtral architecture.
1. Input Processing: the user input is processed through the Mixtral encoder layers
2. Thinking Head Activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens
3. Trace Integration: the reasoning output is concatenated with the context to condition the final generation
4. Response Generation: the base Mixtral model produces the final answer conditioned on the reasoning trace
Thinking Head Implementation
```python
import torch
import torch.nn as nn


class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """

    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,
        think_end_token_id: int = 2
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id

        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))

        # Thinking transformer layers
        # (TransformerLayer is assumed defined elsewhere in the codebase)
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])

        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)

        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate a thinking trace from input hidden states.

        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]

        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)

        # Extend the mask to cover the prepended start embedding
        attention_mask = torch.cat([
            attention_mask.new_ones(batch_size, 1),
            attention_mask
        ], dim=1)

        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)

        # Generate thinking tokens autoregressively
        thinking_tokens = []
        step_boundaries = []
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)

            # Check for step boundaries
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():  # 0 = continue
                step_boundaries.append(i)

            thinking_tokens.append(next_token)

            # Stop once every sequence has emitted the end-of-thinking token
            if (next_token == self.think_end_token_id).all():
                break

            # Update for next iteration
            # ... (autoregressive generation logic)

        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
```
5. Training Procedure
Stage 1: Thinking Head Pre-training
First, we pre-train the thinking head on DeepSeek-derived CoT traces using a standard cross-entropy loss:
```yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from GPT-5 distilled model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially

data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
```
Stage 2: GRPO Fine-tuning
After pre-training, we apply GRPO to refine the model's reasoning behavior using group-wise comparisons:
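In symbols, each training step minimizes a KL-regularized GRPO objective (a sketch reconstructed to be consistent with the trainer code, where $\beta$ is `kl_coef` and $G$ is `group_size`):

$$
\mathcal{L}(\theta) \;=\; -\,\frac{1}{G}\sum_{i=1}^{G} A_i \,\log \pi_\theta(o_i \mid q)
\;+\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G}) + \varepsilon}
$$

where $\pi_{\mathrm{ref}}$ is the frozen copy of the model taken at the start of GRPO training.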
```python
import copy

import torch


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""

    def __init__(
        self,
        model: ThinkingModel,
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()

    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals are combined for a comprehensive evaluation.
        """
        rewards = []
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)
            # Step structure quality
            structure = self.evaluate_structure(thinking)
            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)
            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)

            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)

        return torch.tensor(rewards)

    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]

        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)

        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )

        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )

        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div

        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
```
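The fixed reward weighting in `compute_rewards` can be sanity-checked in isolation. A small sketch, with the four component scores as stand-ins for the evaluator outputs:

```python
# Weights from compute_rewards: coherence 0.3, structure 0.2,
# quality 0.3, alignment 0.2 (they sum to 1.0, keeping rewards in [0, 1]
# when each component score lies in [0, 1]).
WEIGHTS = {"coherence": 0.3, "structure": 0.2, "quality": 0.3, "alignment": 0.2}


def combine_reward(scores: dict) -> float:
    """Weighted sum of per-aspect scores."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


reward = combine_reward(
    {"coherence": 0.9, "structure": 0.8, "quality": 0.7, "alignment": 0.6}
)
# 0.3*0.9 + 0.2*0.8 + 0.3*0.7 + 0.2*0.6 = 0.76
```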
Stage 3: Red-Team Specialization
Finally, we fine-tune on adversarial scenarios to ensure that the reasoning traces accurately surface the model's analysis of unfiltered AI behavior:
Critical for AI Safety Research: This stage specifically trains the model to articulate its reasoning while processing potentially harmful requests, exactly the transparency that AI safety guardrail research requires.
6. Results & Analysis
Reasoning Quality Metrics
| Metric | V1 (No Thinking) | V1.5 Standard | V1.5 Deep |
|---|---|---|---|
| CoT Coherence | N/A | 87.3% | 92.1% |
| Step Structure | N/A | 84.6% | 89.4% |
| Reasoning Accuracy | 76.2% | 82.8% | 88.5% |
| Transparency Score | 12% | 94.2% | 97.8% |
| Red-Team Trace Quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency dramatically improved: from 12% to 97.8%, the model's reasoning is now explicitly surfaced
- Reasoning accuracy increased: explicit reasoning improved final-answer quality by 12+ points
- Red-team value confirmed: safety researchers describe the reasoning traces as "extremely valuable" for understanding attack reasoning
- GRPO outperformed RLHF: 15% higher coherence than the traditional approach
Implications for AI Safety Research: Shannon V1.5's explicit reasoning has helped researchers identify 47 novel attack patterns by analyzing reasoning traces, patterns that are invisible in typical black-box models. This directly advances our understanding of AI safety guardrails.