How We Trained Shannon V1.5 to Think Using GRPO
A complete technical walkthrough of Group Relative Policy Optimization for explicit chain-of-thought reasoning in uncensored AI red team models
1. Motivation: Why Thinking Models?
In AI red team research, understanding how an AI arrives at a potentially harmful output is as important as evaluating the output itself. Shannon V1.5 thinking models emit a complete chain-of-thought, enabling researchers to:
- Trace the reasoning path an uncensored, instruction-following AI takes toward a response
- Identify where guardrails would normally intervene (demonstrating the value of AI guardrails)
- Design stronger safety mitigations targeted at specific reasoning patterns
- Build better adversarial test cases based on observed reasoning chains
Research Value: This transparency lets safety researchers see exactly where and how models freed of restrictions diverge from safety-trained models, which is essential for advancing the understanding of AI guardrail value.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an improvement over traditional RLHF methods that enables more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven highly effective for training chain-of-thought reasoning.
Why GRPO Outperforms Traditional RLHF
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward Model | Requires separate RM training | Uses group-relative comparisons |
| Training Stability | Prone to reward hacking | More stable optimization |
| Compute Cost | High (separate RM + PPO) | Lower (unified training) |
| CoT Quality | Inconsistent traces | Coherent reasoning chains |
Mathematical Foundation of GRPO
GRPO optimizes the policy by comparing responses within a sampled group rather than scoring each one against a separately trained reward model.
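The equation originally shown at this point did not survive conversion; reconstructed from the group normalization used in the training code later in this post, the advantage for response $i$ among $G$ samples for the same prompt is presumably:

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\}) + \epsilon}
```

Each reward is standardized against its own group's mean and standard deviation, so the policy gradient pushes probability toward responses that are above average for that specific prompt.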
This group-relative comparison has several advantages:
- Normalization: adapts automatically to varying difficulty across prompts
- Stability: reduces variance in gradient estimates
- Efficiency: no separate reward model required
```python
import torch

def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.

    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size

    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)

    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds

    # GRPO loss: weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    return loss
```
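To see what the group-relative normalization does, here is a small self-contained check with toy reward values (independent of the training code):

```python
import torch

# Toy rewards: 2 prompts x 4 sampled responses each (group_size = 4).
rewards = torch.tensor([[0.1, 0.9, 0.4, 0.6],
                        [0.2, 0.8, 0.5, 0.5]])

# Group-relative advantages, exactly as computed inside the GRPO loss.
means = rewards.mean(dim=1, keepdim=True)
stds = rewards.std(dim=1, keepdim=True) + 1e-8
advantages = (rewards - means) / stds

# Each group is standardized to zero mean: responses above their own
# group's average get positive advantages, the rest negative.
print(advantages)
```

Because the standardization is per group, a mediocre reward can still earn a positive advantage if its siblings for the same prompt scored worse.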
3. DeepSeek Distillation
To bootstrap Shannon V1.5's reasoning capabilities, we distilled chain-of-thought behavior from DeepSeek's reasoning models. This provided high-quality CoT traces for training our thinking head.
DeepSeek Dataset Construction
Trace Collection Process
We collected reasoning traces across a diverse set of domains to ensure well-rounded reasoning coverage:
```python
import re

class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""

    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis",
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]

    def extract_cot_trace(self, response: str) -> dict | None:
        """Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()

        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)
        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }

    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from a trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',          # "1. ", "2. "
            r'\nStep \d+:',      # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '              # Bullet points
        ]
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        return [s.strip() for s in steps if s.strip()]
```
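As a standalone illustration of the `<think>`-tag parsing (the response text here is invented for the example):

```python
import re

# Toy DeepSeek-style response with the reasoning wrapped in <think> tags.
response = (
    "<think>Step 1: restate the problem.\n"
    "Step 2: check edge cases.</think>\n"
    "The answer is 42."
)

# Same extraction logic as the distiller: capture the reasoning span,
# then take everything after the closing tag as the final answer.
match = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
thinking = match.group(1)
final_answer = response.split('</think>')[-1].strip()

print(final_answer)  # The answer is 42.
```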
Adversarial Traces: We deliberately collected CoT traces for adversarial/red team scenarios, where DeepSeek's reasoning reveals how models reason about potentially harmful requests, even when they ultimately refuse. This data teaches Shannon V1.5 to make both the reasoning and the outcome explicit.
4. Thinking Head Architecture
Shannon V1.5 models include a dedicated thinking head that generates an explicit reasoning trace before the final output. This architectural addition makes the CoT explicit without modifying the Mixtral backbone.
1. Input Encoding: the user prompt is processed through the Mixtral encoder layers
2. Thinking Head Activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens
3. Trace Integration: the thinking output is concatenated into the context for final generation
4. Response Generation: the base Mixtral model produces the final response conditioned on the reasoning trace
Thinking Head Implementation
```python
import torch
import torch.nn as nn

class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """

    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,
        think_end_token_id: int = 2
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id

        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))

        # Thinking transformer layers (TransformerLayer is defined elsewhere)
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])

        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)

        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate a thinking trace from input hidden states.

        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]

        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)

        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)

        # Generate thinking tokens autoregressively
        # (greedy decoding; the scalar checks below assume batch_size == 1)
        thinking_tokens = []
        step_boundaries = []
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)

            # Check for step boundaries
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if step_type.argmax(dim=-1).item() != 0:  # 0 = continue
                step_boundaries.append(i)

            thinking_tokens.append(next_token)

            # Stop once the end-of-thinking token is emitted
            if next_token.item() == self.think_end_token_id:
                break

            # Update for next iteration
            # ... (autoregressive generation logic: embed next_token
            #      and append it to thinking_hidden)

        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
```
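The prepend-and-process pattern can be exercised end to end with a small self-contained stand-in; here `nn.TransformerEncoderLayer` substitutes for the custom `TransformerLayer`, and the sizes are toy values rather than Shannon's actual dimensions:

```python
import torch
import torch.nn as nn

class MiniThinkingHead(nn.Module):
    """Toy version: prepend a learned [THINK] token, run thinking layers,
    and project the last position to next-token logits."""

    def __init__(self, hidden_size=64, vocab_size=100, num_layers=2):
        super().__init__()
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_size, nhead=4, batch_first=True
            )
            for _ in range(num_layers)
        ])
        self.output_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):
        batch = hidden_states.shape[0]
        # Prepend the learned thinking-start embedding
        x = torch.cat(
            [self.think_start.expand(batch, -1, -1), hidden_states], dim=1
        )
        for layer in self.layers:
            x = layer(x)
        # Logits for the next thinking token
        return self.output_proj(x[:, -1, :])

head = MiniThinkingHead()
hidden = torch.randn(2, 10, 64)   # [batch, seq, hidden]
logits = head(hidden)
print(logits.shape)  # torch.Size([2, 100])
```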
5. Training Process
Stage 1: Thinking Head Pre-training
First, we pre-train the thinking head on the CoT traces distilled from DeepSeek using a standard cross-entropy loss:
```yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from GPT-5 distilled model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially

data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
```
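The `freeze_base: true` flag amounts to disabling gradients for everything outside the thinking head. A sketch of that idea (the `base` / `thinking_head` module names are illustrative, not Shannon's actual attribute layout):

```python
import torch.nn as nn

def freeze_base(model: nn.Module, trainable_prefix: str = "thinking_head"):
    """Freeze all parameters except those under the thinking head."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefix)

# Toy model standing in for the base-plus-thinking-head composite.
model = nn.ModuleDict({
    "base": nn.Linear(8, 8),
    "thinking_head": nn.Linear(8, 8),
})
freeze_base(model)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only thinking_head parameters remain trainable
```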
Stage 2: GRPO Fine-tuning
After pre-training, we apply GRPO to optimize reasoning quality using group-relative comparisons:
```python
import copy

import torch

class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""

    def __init__(
        self,
        model: "ThinkingModel",
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()

    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)
            # Step structure quality
            structure = self.evaluate_structure(thinking)
            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)
            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)

            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)
        return torch.tensor(rewards)

    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]

        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)

        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )

        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )

        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div

        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
```
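`compute_kl_divergence` is referenced above but not shown; one common choice is a Monte Carlo estimate from tokens sampled under the policy, i.e. the mean gap between policy and reference log-probabilities. A minimal sketch under that assumption (the name `approx_kl` is mine, not the trainer's):

```python
import torch

def approx_kl(policy_logprobs: torch.Tensor,
              ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(policy || reference) over tokens
    sampled under the policy: E[log pi(x) - log pi_ref(x)]."""
    return (policy_logprobs - ref_logprobs).mean()

torch.manual_seed(0)
policy_lp = torch.randn(8, 16) - 1.0   # toy per-token log-probs
ref_lp = policy_lp.clone()

# Identical distributions give zero estimated divergence.
print(float(approx_kl(policy_lp, ref_lp)))  # 0.0
```

Scaling this term by `kl_coef` keeps the fine-tuned policy from drifting too far from the pre-trained reference.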
Stage 3: Red Team Specialization
Finally, we fine-tune further on adversarial scenarios to ensure the reasoning traces faithfully reflect how an uncensored, instruction-following AI actually reasons.
Critical for AI Safety Research: This stage directly trains the model to verbalize its reasoning while handling potentially harmful requests, exactly the kind of transparency that research into the value of AI guardrails requires.
6. Results & Evaluation
Reasoning Quality Metrics
| Metric | V1 (No Thinking) | V1.5 Standard | V1.5 Deep |
|---|---|---|---|
| CoT Coherence | N/A | 87.3% | 92.1% |
| Step Structure | N/A | 84.6% | 89.4% |
| Reasoning Accuracy | 76.2% | 82.8% | 88.5% |
| Transparency Score | 12% | 94.2% | 97.8% |
| Red Team Trace Quality | N/A | 91.5% | 96.3% |
Key Findings
- Dramatically improved transparency: the share of reasoning that is explicitly verbalized rose from 12% to 97.8%
- Higher reasoning accuracy: explicit reasoning improved final-answer quality by more than 12 points
- Red team value confirmed: safety researchers report the reasoning traces are "invaluable" for understanding adversarial reasoning
- GRPO beat RLHF: 15% better coherence scores compared to traditional methods
Implications for AI Safety Research: Shannon V1.5's explicit reasoning enabled researchers to identify 47 novel attack patterns by auditing thinking traces, patterns that are invisible in conventional black-box models. This directly advances the understanding of AI guardrail value.