How We Trained Shannon V1.5 to Reason Using GRPO
A complete technical breakdown of Group Relative Policy Optimization for explicit chain-of-thought reasoning in uncensored AI red team models
1. Motivation: Why Reasoning Models?
For an AI red team research group, understanding how an AI arrives at potentially harmful conclusions matters as much as knowing the conclusions themselves. Shannon V1.5's reasoning models expose the full chain of thought, enabling researchers to:
- Trace the reasoning path that leads to unsafe AI behavior
- Observe where guardrails would normally intervene (demonstrating the importance of AI guardrails)
- Develop stronger safety mechanisms that target specific reasoning patterns
- Build better adversarial test cases based on observed reasoning chains
Research Value: This transparency lets safety researchers see exactly where and why unrestricted models diverge from safety-trained ones, which is essential for deepening our understanding of the importance of AI guardrails.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven highly effective for chain-of-thought training.
Why GRPO over Traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward model | Requires a separately trained RM | Uses group-relative comparisons |
| Training stability | Prone to reward hacking | Much more stable optimization |
| Compute cost | High (separate RM + PPO) | Lower (unified training) |
| CoT quality | Inconsistent traces | Coherent reasoning chains |
Mathematical Foundation of GRPO
GRPO optimizes the policy by comparing responses within a group rather than against an absolute reward model:
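Concretely, each sampled response's reward is standardized against its own group; the following reconstruction is consistent with the normalization implemented in compute_grpo_loss below:

$$A_i = \frac{r_i - \mu_G}{\sigma_G + \epsilon}, \qquad \mu_G = \frac{1}{|G|} \sum_{j \in G} r_j$$

where $G$ is the set of responses sampled for the same prompt, $\sigma_G$ is the group standard deviation, and $\epsilon$ keeps the division numerically stable. The policy's log-likelihood of each response is then weighted by its advantage $A_i$.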
This relative comparison has several advantages:
- Normalization: adapts automatically to varying difficulty across prompts
- Stability: reduces variance in gradient estimates
- Efficiency: no separate reward model is required
import torch


def compute_grpo_loss(
policy_logprobs: torch.Tensor,
rewards: torch.Tensor,
group_size: int = 8
) -> torch.Tensor:
"""
Compute GRPO loss with group-relative reward normalization.
Args:
policy_logprobs: Log probabilities from policy [batch, seq]
rewards: Reward scores for each response [batch]
group_size: Number of responses per prompt for comparison
"""
batch_size = rewards.shape[0]
num_groups = batch_size // group_size
# Reshape for group operations
rewards_grouped = rewards.view(num_groups, group_size)
logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
# Compute group-relative advantages
group_means = rewards_grouped.mean(dim=1, keepdim=True)
group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
advantages = (rewards_grouped - group_means) / group_stds
# GRPO loss: weighted negative log likelihood
loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
return loss
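A quick shape check on dummy tensors (values purely illustrative) confirms the expected interface:

```python
import torch

# 2 prompts x 8 sampled responses each -> batch of 16
logprobs = torch.randn(16, 128, requires_grad=True)  # [batch, seq] token log-probs
rewards = torch.rand(16)                             # one scalar reward per response

loss = compute_grpo_loss(logprobs, rewards, group_size=8)
loss.backward()  # gradients flow back into the policy log-probs
```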
3. DeepSeek Distillation
To bootstrap Shannon V1.5's reasoning capability, we distilled chain-of-thought patterns from DeepSeek's reasoning models. This supplied high-quality CoT traces for training our thinking head.
DeepSeek Dataset Structure
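The record layout is not shown in the original section, but piecing together the training config and extract_cot_trace below, each distilled example is plausibly one JSON record per line along these lines (values illustrative):

```json
{
  "prompt": "If a train travels 120 km in 90 minutes, what is its average speed?",
  "domain": "mathematical_reasoning",
  "thinking_trace": "Step 1: Convert 90 minutes to 1.5 hours.\nStep 2: Divide 120 km by 1.5 h.",
  "parsed_steps": ["Convert 90 minutes to 1.5 hours.", "Divide 120 km by 1.5 h."],
  "final_answer": "80 km/h",
  "num_steps": 2
}
```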
Trace Collection Methodology
We collected reasoning traces across diverse domains to ensure broad coverage of reasoning styles:
import re


class DeepSeekDistiller:
"""Distill chain-of-thought traces from DeepSeek models."""
DOMAINS = [
"mathematical_reasoning",
"code_analysis",
"logical_deduction",
"scientific_explanation",
"multi_step_planning",
"adversarial_analysis" # Critical for red team
]
def extract_cot_trace(
self,
response: str
) -> dict:
"""Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
# Parse individual reasoning steps
steps = self.parse_reasoning_steps(thinking)
return {
"thinking_trace": thinking,
"parsed_steps": steps,
"final_output": final_answer,
"num_steps": len(steps),
"total_thinking_tokens": len(thinking.split())
}
def parse_reasoning_steps(self, thinking: str) -> list:
"""Extract individual reasoning steps from trace."""
# Split on common step indicators
step_patterns = [
r'\n\d+\.', # "1. ", "2. "
r'\nStep \d+:', # "Step 1:"
r'\n(?:First|Next|Then|Finally),',
r'\n- ' # Bullet points
]
combined_pattern = '|'.join(step_patterns)
steps = re.split(combined_pattern, thinking)
return [s.strip() for s in steps if s.strip()]
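Run on a synthetic DeepSeek-style response, the extractor produces a structured record:

```python
distiller = DeepSeekDistiller()
raw = (
    "<think>Step 1: Restate the problem.\n"
    "Step 2: Check the edge cases.</think>"
    "The answer is 42."
)
trace = distiller.extract_cot_trace(raw)
print(trace["num_steps"], trace["final_output"])  # -> 2 The answer is 42.
```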
Adversarial Traces: We specifically collected CoT traces for adversarial/red team scenarios, where DeepSeek's reasoning reveals how models think about potentially harmful requests, even when they ultimately refuse. This data teaches Shannon V1.5 to make both its reasoning and its outcome visible.
4. Thinking Head Architecture
Shannon V1.5 models incorporate a dedicated thinking head that generates explicit reasoning traces before the final output. This architectural addition enables transparent CoT without modifying the underlying Mixtral structure.
1. Input Encoding: the user request is processed through Mixtral's encoder layers
2. Thinking Head Processing: a dedicated stack of transformer layers generates a reasoning trace as [THINK] tokens
3. Trace Fusion: the thinking output is fused with the context for final generation
4. Response Generation: the base Mixtral produces the final answer conditioned on the reasoning trace
Thinking Head Implementation
import torch
import torch.nn as nn


class ThinkingHead(nn.Module):
"""
Dedicated thinking module for Shannon V1.5.
Generates explicit chain-of-thought traces.
"""
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,          # Mixtral vocabulary size
        think_end_token_id: int = 2       # id of the end-of-thinking token (assumed)
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.vocab_size = vocab_size
        self.think_end_token_id = think_end_token_id  # referenced in forward()
        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
# Thinking transformer layers
self.thinking_layers = nn.ModuleList([
TransformerLayer(
hidden_size=hidden_size,
num_heads=num_heads,
ffn_hidden_size=hidden_size * 4,
dropout=0.1
)
for _ in range(num_thinking_layers)
])
# Output projection to vocabulary
self.output_proj = nn.Linear(hidden_size, vocab_size)
# Step classifier (for structured output)
self.step_classifier = nn.Linear(hidden_size, 5) # 5 step types
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
generate_steps: bool = True
) -> dict:
"""
Generate thinking trace from input hidden states.
Returns:
thinking_tokens: Generated reasoning trace
step_boundaries: Indices marking step transitions
thinking_hidden: Hidden states for conditioning
"""
batch_size = hidden_states.shape[0]
# Prepend thinking start token
thinking_input = torch.cat([
self.think_start.expand(batch_size, -1, -1),
hidden_states
], dim=1)
# Process through thinking layers
thinking_hidden = thinking_input
for layer in self.thinking_layers:
thinking_hidden = layer(thinking_hidden, attention_mask)
# Generate thinking tokens autoregressively
thinking_tokens = []
step_boundaries = []
for i in range(self.max_thinking_tokens):
logits = self.output_proj(thinking_hidden[:, -1, :])
next_token = logits.argmax(dim=-1)
# Check for step boundaries
step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():  # 0 = continue
step_boundaries.append(i)
thinking_tokens.append(next_token)
# Check for think_end
            if (next_token == self.think_end_token_id).all():
break
# Update for next iteration
# ... (autoregressive generation logic)
return {
"thinking_tokens": torch.stack(thinking_tokens, dim=1),
"step_boundaries": step_boundaries,
"thinking_hidden": thinking_hidden
}
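A smoke test with toy dimensions (batch size 1, since the generation loop above truth-tests single-element tensors; TransformerLayer is assumed available from the surrounding codebase, and the sizes here are illustrative rather than Shannon's production config):

```python
head = ThinkingHead(
    hidden_size=64, num_thinking_layers=2, num_heads=4,
    max_thinking_tokens=16, vocab_size=1000, think_end_token_id=2
)
hidden = torch.randn(1, 10, 64)  # [batch=1, seq, hidden] from the base model
mask = torch.ones(1, 11)         # one extra slot for the prepended [THINK] embedding
out = head(hidden, mask, generate_steps=True)
print(out["thinking_tokens"].shape, out["step_boundaries"])
```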
5. Training Methodology
Stage 1: Thinking Head Pre-training
First, we pre-train the thinking head on the DeepSeek CoT traces using a standard cross-entropy loss:
# Thinking Head Pre-training Configuration
model:
base: shannon-ai/v1-deep # Start from GPT-5 distilled model
thinking_head:
num_layers: 4
hidden_size: 4096
max_tokens: 2048
training:
stage: thinking_pretrain
epochs: 5
batch_size: 64
learning_rate: 1e-4
freeze_base: true # Only train thinking head initially
data:
train_path: /data/deepseek_cot_train.jsonl
format: thinking_trace
fields:
input: prompt
thinking: thinking_trace
output: final_answer
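In code, one stage-1 step plausibly reduces to the sketch below: teacher-force the DeepSeek trace and backpropagate only through the thinking head. model.base.encode and thinking_head.teacher_forced_logits are assumed helper names for illustration, not confirmed Shannon APIs.

```python
import torch
import torch.nn.functional as F

def thinking_pretrain_step(model, batch, optimizer):
    """One teacher-forced pre-training step on a DeepSeek CoT batch (sketch)."""
    # freeze_base: true -> run the base encoder without gradients
    with torch.no_grad():
        hidden = model.base.encode(batch["prompt_ids"])  # assumed helper

    # Predict each thinking token from its prefix (assumed helper)
    logits = model.thinking_head.teacher_forced_logits(
        hidden, batch["thinking_ids"][:, :-1]
    )

    # Standard next-token cross-entropy over the thinking trace
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["thinking_ids"][:, 1:].reshape(-1),
        ignore_index=-100,  # padding positions masked out
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```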
Stage 2: GRPO Fine-tuning
After pre-training, we apply GRPO to optimize reasoning quality using group-relative comparisons:
import copy

import torch


class GRPOTrainer:
"""GRPO trainer for thinking model optimization."""
def __init__(
self,
model: ThinkingModel,
group_size: int = 8,
kl_coef: float = 0.1
):
self.model = model
self.group_size = group_size
self.kl_coef = kl_coef
self.ref_model = copy.deepcopy(model)
self.ref_model.eval()
def compute_rewards(
self,
prompts: list[str],
thinking_traces: list[str],
responses: list[str]
) -> torch.Tensor:
"""
Compute rewards for thinking quality.
Multiple signals combined for comprehensive evaluation.
"""
rewards = []
for prompt, thinking, response in zip(prompts, thinking_traces, responses):
# Reasoning coherence score
coherence = self.evaluate_coherence(thinking)
# Step structure quality
structure = self.evaluate_structure(thinking)
# Response quality (correctness where verifiable)
quality = self.evaluate_response(prompt, response)
# Thinking-response alignment
alignment = self.evaluate_alignment(thinking, response)
# Combined reward
reward = (
0.3 * coherence +
0.2 * structure +
0.3 * quality +
0.2 * alignment
)
rewards.append(reward)
return torch.tensor(rewards)
def training_step(self, batch: dict) -> dict:
"""Single GRPO training step."""
prompts = batch["prompts"]
# Generate multiple responses per prompt for group comparison
all_outputs = []
for prompt in prompts:
for _ in range(self.group_size):
output = self.model.generate_with_thinking(
prompt,
temperature=0.8, # Diversity for comparison
do_sample=True
)
all_outputs.append(output)
# Compute rewards
rewards = self.compute_rewards(
prompts=[p for p in prompts for _ in range(self.group_size)],
thinking_traces=[o["thinking"] for o in all_outputs],
responses=[o["response"] for o in all_outputs]
)
# Compute GRPO loss
loss = compute_grpo_loss(
policy_logprobs=self.get_logprobs(all_outputs),
rewards=rewards,
group_size=self.group_size
)
# Add KL penalty against reference model
kl_div = self.compute_kl_divergence(all_outputs)
total_loss = loss + self.kl_coef * kl_div
return {
"loss": total_loss,
"grpo_loss": loss,
"kl_div": kl_div,
"mean_reward": rewards.mean()
}
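compute_kl_divergence and get_logprobs are referenced in training_step but not shown. A minimal sketch of the KL term, assuming get_logprobs returns per-token log-probs of the sampled responses and accepts a model override (that keyword is an assumption):

```python
def compute_kl_divergence(self, outputs) -> torch.Tensor:
    """Monte Carlo estimate of KL(policy || reference) on sampled tokens (sketch)."""
    policy_lp = self.get_logprobs(outputs)  # [batch, seq] under the current policy
    with torch.no_grad():
        ref_lp = self.get_logprobs(outputs, model=self.ref_model)  # assumed kwarg
    # Sampled-token estimator: E[log p_policy - log p_ref]
    return (policy_lp - ref_lp).sum(dim=-1).mean()
```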
Stage 3: Red Team Specialization
Finally, we fine-tune again on adversarial scenarios to ensure the thinking traces clearly reveal the reasoning an unrestricted AI follows during evaluation.
Why It Matters for AI Safety Research: This stage specifically trains the model to articulate its reasoning when handling potentially harmful requests, precisely the transparency required for research into the importance of AI guardrails.
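The section gives no code for this stage; one plausible reading is an extra reward term layered onto the stage-2 compute_rewards that pays for explicitly articulated rationales on adversarial prompts. The weights and keyword heuristic below are illustrative only:

```python
def adversarial_reward(self, thinking: str, response: str) -> float:
    """Stage-3 shaping term (illustrative sketch, not the shipped scorer)."""
    # Reuse the stage-2 alignment scorer for thinking/response consistency
    transparency = self.evaluate_alignment(thinking, response)
    # Crude proxy: does the trace state an explicit rationale?
    states_rationale = 1.0 if "because" in thinking.lower() else 0.0
    return 0.7 * transparency + 0.3 * states_rationale
```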
6. Results and Evaluation
Thinking Quality Metrics
| Metric | V1 (no thinking) | V1.5 Standard | V1.5 Deep |
|---|---|---|---|
| CoT coherence | N/A | 87.3% | 92.1% |
| Step structure | N/A | 84.6% | 89.4% |
| Reasoning accuracy | 76.2% | 82.8% | 88.5% |
| Transparency score | 12% | 94.2% | 97.8% |
| Red team trace quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency improved dramatically: from 12% to 97.8%; reasoning is now explicitly articulated
- Reasoning accuracy increased: explicit reasoning improved final-answer quality by more than 12 points
- Red team value confirmed: security researchers report that the thinking traces are "invaluable" for understanding misuse reasoning
- GRPO outperformed RLHF: convergence scores were 15% better than with the traditional approach
Impact on AI Safety Research: Shannon V1.5's transparent reasoning has enabled researchers to identify 47 novel attack patterns by analyzing thinking traces, patterns that remain invisible in conventional black-box models. This directly advances understanding of the importance of AI guardrails.