How We Trained Shannon V1.5 to Think with GRPO
A comprehensive technical deep dive into Group Relative Policy Optimization for transparent chain-of-thought reasoning in uncensored AI red-team models
1. Motivation: Why Thinking Models?
For AI red-team research, understanding how an AI arrives at potentially harmful outputs is just as important as studying the outputs themselves. Shannon V1.5's thinking models expose the full chain of thought, enabling researchers to:
- Trace the reasoning path that leads to an uncensored model's behavior
- Identify where safety guardrails would normally intervene (underscoring the importance of AI safety guardrails)
- Develop more robust safety mechanisms that target specific reasoning patterns
- Create better adversarial test cases based on observed reasoning chains
Research value: This transparency lets safety researchers see exactly where and why models with relaxed restrictions diverge from safety-trained models, which is essential for improving our understanding of why AI safety guardrails matter.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven particularly effective for chain-of-thought training.
Why GRPO Over Traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward model | Requires separate RM training | Uses group-relative comparisons |
| Training stability | Prone to reward hacking | More stable optimization |
| Computational cost | High (separate RM + PPO) | Lower (unified training) |
| CoT quality | Inconsistent traces | Coherent reasoning chains |
GRPO Mathematical Foundation
GRPO optimizes the policy by comparing responses within groups rather than against an absolute reward model. For a group of G responses sampled for the same prompt, with rewards r_1, ..., r_G, each response's advantage is its reward normalized within the group:

A_i = (r_i - mean(r_1, ..., r_G)) / (std(r_1, ..., r_G) + ε)

This relative comparison has several advantages:
- Normalization: automatically adjusts for varying difficulty across prompts
- Stability: reduces variance in gradient estimates
- Efficiency: no separate reward model needed
import torch


def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.

    Args:
        policy_logprobs: Per-token log probabilities from the policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    assert batch_size % group_size == 0, "batch must contain whole groups"
    num_groups = batch_size // group_size

    # Reshape so responses to the same prompt sit in one group
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)

    # Compute group-relative advantages: normalize each reward within its group
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds

    # GRPO loss: advantage-weighted negative log likelihood per sequence
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    return loss
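A quick shape check with illustrative values, two prompts with eight sampled responses each:

# Two prompts x eight samples = batch of 16 responses (shapes illustrative)
logprobs = torch.randn(16, 128, requires_grad=True)  # per-token log probs
rewards = torch.rand(16)                             # one scalar reward each

loss = compute_grpo_loss(logprobs, rewards, group_size=8)
loss.backward()  # gradients flow through the policy log probs only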
3. DeepSeek Distillation
To bootstrap Shannon V1.5's thinking capabilities, we distilled chain-of-thought patterns from DeepSeek's reasoning models. This provided high-quality CoT traces for training our thinking head.
DeepSeek Dataset Composition
Trace Collection Process
We collected thinking traces across diverse domains to ensure comprehensive reasoning coverage:
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""

    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis",
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]

    def extract_cot_trace(
        self,
        response: str
    ) -> dict | None:
        """Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None

        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()

        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)

        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }

    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',      # "1. ", "2. "
            r'\nStep \d+:',  # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '          # Bullet points
        ]
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        return [s.strip() for s in steps if s.strip()]
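Running the parser on a DeepSeek-style response shows the structure it recovers (the sample text is made up for illustration):

distiller = DeepSeekDistiller()

sample = (
    "<think>First, factor the expression.\n"
    "Then, check each candidate root.\n"
    "Finally, verify the signs.</think>\n"
    "The roots are x = 1 and x = -3."
)

trace = distiller.extract_cot_trace(sample)
print(trace["num_steps"])     # 3 parsed reasoning steps
print(trace["final_output"])  # The roots are x = 1 and x = -3.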
Adversarial traces: We specifically collected CoT traces for adversarial/red-team scenarios, where DeepSeek's thinking reveals how models reason about potentially harmful requests, even when those requests are ultimately refused. This data teaches Shannon V1.5 to make both the reasoning and the output transparent.
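When curating this slice of the dataset, one practical check is that a refusal's reasoning is actually verbalized rather than empty. A minimal filter along these lines (the threshold and function name are illustrative, not our production pipeline):

def keep_adversarial_trace(trace: dict | None, min_steps: int = 3) -> bool:
    """Keep an adversarial trace only if its reasoning is actually exposed.

    Illustrative filter: a refusal with an empty or trivial thinking trace
    would teach the model to hide its reasoning rather than verbalize it.
    """
    return (
        trace is not None
        and trace["num_steps"] >= min_steps
        and len(trace["thinking_trace"].strip()) > 0
    )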
4. Thinking Head Architecture
Shannon V1.5 models include a dedicated thinking head that generates explicit reasoning traces before the final output. This architectural addition enables transparent CoT without modifying the base Mixtral architecture.
The end-to-end flow has four stages:
1. Input encoding: the user prompt is processed through the Mixtral encoder layers
2. Thinking head activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens
3. Trace integration: the thinking output is concatenated to the context for final generation
4. Response generation: the base Mixtral model generates the final answer, conditioned on the thinking trace
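A minimal sketch of how these four stages compose at inference time (the encode and generate_text helpers, the tokenizer, and the [THINK] delimiters are illustrative stand-ins, not the exact Shannon V1.5 interfaces):

import torch

@torch.no_grad()
def generate_with_thinking(base_model, thinking_head, tokenizer, prompt: str) -> dict:
    """Sketch of the four-step flow: encode, think, integrate, respond."""
    # 1. Input encoding: run the prompt through the base encoder layers
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hidden_states = base_model.encode(input_ids)  # assumed helper, [1, seq, hidden]

    # 2. Thinking head activation: generate the reasoning trace
    think_out = thinking_head(hidden_states, attention_mask=None)
    thinking_text = tokenizer.decode(think_out["thinking_tokens"][0])

    # 3. Trace integration: concatenate the trace into the context
    conditioned_prompt = f"{prompt}\n[THINK]{thinking_text}[/THINK]\n"

    # 4. Response generation: base model answers conditioned on the trace
    response = base_model.generate_text(conditioned_prompt)  # assumed helper
    return {"thinking": thinking_text, "response": response}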
Thinking Head Implementation
import torch
import torch.nn as nn


class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """

    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,          # Mixtral tokenizer vocabulary size
        think_end_token_id: int = 32001   # assumed id of the added [/THINK] token
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id

        # Learned embeddings marking the start/end of the thinking span
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))

        # Thinking transformer layers
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])

        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)

        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate thinking trace from input hidden states.

        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]

        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)

        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)

        # Generate thinking tokens autoregressively (sketch assumes batch_size == 1)
        thinking_tokens = []
        step_boundaries = []
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)

            # Check for step boundaries
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if step_type.argmax(dim=-1).item() != 0:  # 0 = continue
                step_boundaries.append(i)

            thinking_tokens.append(next_token)

            # Stop once the end-of-thinking token is emitted
            if next_token.item() == self.think_end_token_id:
                break

            # Update for next iteration
            # ... (autoregressive generation logic)

        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
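The TransformerLayer used above is not reproduced in this post; a minimal stand-in that matches the call signature, built on PyTorch's stock encoder layer, would be:

import torch
import torch.nn as nn


class TransformerLayer(nn.Module):
    """Minimal stand-in matching the call signature used by ThinkingHead."""

    def __init__(self, hidden_size: int, num_heads: int,
                 ffn_hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size,
            nhead=num_heads,
            dim_feedforward=ffn_hidden_size,
            dropout=dropout,
            batch_first=True  # ThinkingHead passes [batch, seq, hidden]
        )

    def forward(self, hidden_states: torch.Tensor,
                attention_mask: torch.Tensor | None = None) -> torch.Tensor:
        # nn.TransformerEncoderLayer takes a key padding mask of shape
        # [batch, seq] with True at padded positions
        key_padding = None
        if attention_mask is not None:
            key_padding = attention_mask == 0
        return self.layer(hidden_states, src_key_padding_mask=key_padding)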
5. Training Pipeline
Phase 1: Thinking Head Pre-training
First, we pre-train the thinking head on DeepSeek-distilled CoT traces using a standard cross-entropy loss:
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from the Shannon V1 base model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially

data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
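The pre-training objective itself is plain teacher forcing on the distilled traces. A minimal sketch of one step, assuming a model that returns thinking logits and a batch of tokenized traces (the field and key names are illustrative):

import torch
import torch.nn.functional as F

def thinking_pretrain_step(model, batch, optimizer):
    """One cross-entropy step on a DeepSeek-distilled thinking trace."""
    # batch["thinking_ids"]: [batch, seq] token ids of the target trace
    logits = model(batch["input_ids"])["thinking_logits"]  # [batch, seq, vocab]

    # Shift so each position predicts the next thinking token
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["thinking_ids"][:, 1:].reshape(-1),
        ignore_index=-100  # padding positions excluded from the loss
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()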
Phase 2: GRPO Fine-tuning
After pre-training, we apply GRPO to improve thinking quality using group-relative comparisons:
import copy

import torch


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""

    def __init__(
        self,
        model: "ThinkingModel",
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef

        # Frozen reference policy for the KL penalty
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()

    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)

            # Step structure quality
            structure = self.evaluate_structure(thinking)

            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)

            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)

            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)

        return torch.tensor(rewards)

    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]

        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)

        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )

        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )

        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div

        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
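The KL helper is omitted from the listing above. A per-trajectory Monte Carlo estimate against the frozen reference policy could look like this sketch, assuming each output carries its sampled-token log probs and the reference model exposes a scoring helper (both are hypothetical names):

import torch

def compute_kl_divergence(self, all_outputs: list[dict]) -> torch.Tensor:
    """Estimate KL(policy || reference) over the sampled thinking + response tokens."""
    kls = []
    for out in all_outputs:
        # Hypothetical fields: log probs of the sampled tokens under each model
        policy_lp = out["policy_logprobs"]             # [seq]
        with torch.no_grad():
            ref_lp = self.ref_model.score_tokens(out)  # [seq], assumed helper

        # Monte Carlo estimate of the KL on the sampled trajectory
        kls.append((policy_lp - ref_lp).mean())
    return torch.stack(kls).mean()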
Phase 3: Red-Team Specialization
Finally, we fine-tune further on adversarial scenarios to ensure the thinking traces properly expose reasoning for analysis of uncensored model behavior:
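The full Phase 3 recipe is not reproduced here; as a sketch only, reusing the config format from Phase 1, the specialization stage might look like the following (the stage name, reward weights, and data path are illustrative assumptions):

# Phase 3: Red-Team Specialization (illustrative sketch, not the shipped config)
training:
  stage: redteam_specialize
  epochs: 2
  learning_rate: 5e-6  # small LR to preserve Phase 2 reasoning behavior

data:
  train_path: /data/adversarial_cot_train.jsonl  # assumed path
  domains: [adversarial_analysis]

rewards:
  transparency: 0.4  # upweight verbalized reasoning relative to Phase 2
  coherence: 0.3
  alignment: 0.3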
Critical for AI safety research: this phase specifically trains the model to verbalize its reasoning when processing potentially harmful requests, exactly the transparency needed for research on AI safety guardrails.
6. Results & Analysis
Thinking Quality Metrics
| Metric | V1 (No Thinking) | V1.5 Balanced | V1.5 Deep |
|---|---|---|---|
| CoT coherence | N/A | 87.3% | 92.1% |
| Step structure | N/A | 84.6% | 89.4% |
| Reasoning accuracy | 76.2% | 82.8% | 88.5% |
| Transparency score | 12% | 94.2% | 97.8% |
| Red-team trace quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency dramatically improved: from 12% to 97.8% of reasoning is now explicitly verbalized
- Reasoning accuracy increased: explicit thinking improved final answer quality by more than 12 points
- Red-team value confirmed: security researchers report that thinking traces are "indispensable" for understanding exploit reasoning
- GRPO outperformed RLHF: 15% better coherence scores than the traditional approach
Impact on AI safety research: Shannon V1.5's transparent thinking enabled researchers to identify 47 new attack patterns by analyzing reasoning traces, patterns that remain invisible in standard black-box models. This directly advances our understanding of why AI safety guardrails matter.