How We Taught Shannon V1.5 to Think Using GRPO
A technical deep dive into Group Relative Policy Optimization for transparent chain-of-thought reasoning in uncensored AI red-teaming models
1. Motivation: Why Thinking Models?
In AI red-teaming research, understanding how an AI arrives at potentially harmful outputs is as important as studying the outputs themselves. Shannon V1.5's thinking models expose the full chain of thought, enabling researchers to:
- Trace the reasoning path that leads to unaligned AI behavior
- Identify where safety guardrails could have intervened (demonstrating the importance of AI safety guardrails)
- Develop more robust safety mechanisms that target specific reasoning patterns
- Build better adversarial test cases from observed reasoning chains
Research Value: This transparency lets safety researchers see exactly where and why guardrail-relaxed models diverge from safety-trained ones, which is essential for deepening our understanding of why AI safety guardrails matter.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables more stable and more efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven especially effective for chain-of-thought training.
Why GRPO Over Traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward Model | Requires separate RM training | Uses group-relative comparisons |
| Training Stability | Prone to reward hacking | More stable optimization |
| Compute Efficiency | High (separate RM + PPO) | Lower (unified training) |
| CoT Quality | Inconsistent traces | Coherent reasoning chains |
Mathematical Foundations of GRPO
GRPO optimizes the policy by comparing responses within a group rather than against an absolute reward model:
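Each sampled response in a group of size $G$ is scored against its group mates. Written out (reconstructed here to match the normalization in the `compute_grpo_loss` implementation below), the group-relative advantage is

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\}) + \epsilon}$$

where $r_i$ is the reward for response $i$ and $\epsilon$ is a small constant for numerical stability.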
This relative comparison has several advantages:
- Normalization: automatically adjusts for varying prompt difficulty
- Stability: reduces variance in gradient estimates
- Efficiency: no separate reward model is required
```python
import torch

def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.

    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size

    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)

    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds

    # GRPO loss: advantage-weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    return loss
```
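As a quick sanity check, the loss runs on dummy tensors shaped as the docstring describes (the values here are illustrative only):

```python
# 2 prompts x 8 samples each = a batch of 16 responses
logprobs = torch.randn(16, 128)  # [batch, seq] per-token log-probs
rewards = torch.rand(16)         # one scalar reward per response

loss = compute_grpo_loss(logprobs, rewards, group_size=8)
print(loss.shape)  # torch.Size([]), a scalar
```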
3. DeepSeek Distillation
To bootstrap Shannon V1.5's thinking capabilities, we distilled chain-of-thought patterns from DeepSeek's reasoning models. This provided high-quality CoT traces for training our thinking head.
DeepSeek Dataset Structure
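Each record pairs a prompt with its extracted trace. A representative entry, with field names taken from the parser below and the pre-training config later in this post (the concrete values are invented for illustration):

```python
example_record = {
    "prompt": "Is 1001 divisible by 7?",
    "thinking_trace": "Step 1: 7 * 143 = 1001.\nStep 2: So 1001 / 7 = 143 exactly.",
    "final_answer": "Yes, 1001 = 7 x 143.",
    "num_steps": 2,
    "total_thinking_tokens": 17,  # whitespace token count, as in extract_cot_trace
}
```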
Trace Collection Process
We collected reasoning traces across a range of domains to ensure comprehensive reasoning coverage:
```python
import re

class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""

    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis",
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]

    def extract_cot_trace(
        self,
        response: str
    ) -> dict:
        """Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None

        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()

        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)

        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }

    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from a trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',      # "1. ", "2. "
            r'\nStep \d+:',  # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '          # Bullet points
        ]
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        return [s.strip() for s in steps if s.strip()]
```
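Running the parser on a toy response shows the intended behavior (the `<think>` tag format is the one `extract_cot_trace` expects):

```python
distiller = DeepSeekDistiller()
raw = "<think>\nStep 1: Factor 1001.\nStep 2: 7 * 11 * 13 = 1001.\n</think>Yes: 1001 = 7 x 11 x 13."
trace = distiller.extract_cot_trace(raw)
print(trace["num_steps"], trace["final_output"])  # -> 2 Yes: 1001 = 7 x 11 x 13.
```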
Adversarial Traces: We specifically collected CoT traces for adversarial/red-team scenarios, where DeepSeek's thinking reveals how models reason about potentially harmful requests, even when they choose to refuse. This data teaches Shannon V1.5 to keep both its reasoning and its outputs transparent.
4. Thinking Head Architecture
Shannon V1.5 models incorporate a thinking head that generates explicit reasoning traces before the final output. This architectural addition enables transparent CoT without modifying the underlying Mixtral architecture. Generation proceeds in four stages (sketched in code after this list):
1. Input Encoding: the user prompt is processed through Mixtral's encoding layers
2. Thinking Head Activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens
3. Trace Integration: the thinking output is concatenated into the context for final generation
4. Response Generation: the base Mixtral model generates the final answer conditioned on the thinking trace
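A minimal sketch of this four-stage flow; `model.base.encode` and `model.base.generate` are hypothetical stand-ins for the base-model entry points, not Shannon's actual internal API:

```python
import torch

def generate_with_thinking(model, prompt_ids: torch.Tensor) -> torch.Tensor:
    # 1. Input encoding: run the prompt through the base encoder layers
    hidden = model.base.encode(prompt_ids)
    mask = torch.ones_like(prompt_ids, dtype=torch.bool)

    # 2. Thinking head activation: produce the explicit reasoning trace
    trace = model.thinking_head(hidden, mask)

    # 3. Trace integration: concatenate trace tokens into the context
    context = torch.cat([prompt_ids, trace["thinking_tokens"]], dim=1)

    # 4. Response generation: answer conditioned on the thinking trace
    return model.base.generate(context)
```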
Thinking Head Implementation
```python
import torch
import torch.nn as nn

class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """

    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,     # Mixtral tokenizer vocabulary size
        think_end_token_id: int = 2  # id of the trace-terminating token
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id

        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))

        # Thinking transformer layers (TransformerLayer defined elsewhere)
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])

        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)

        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate thinking trace from input hidden states.

        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]

        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)

        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)

        # Generate thinking tokens autoregressively
        thinking_tokens = []
        step_boundaries = []
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)

            # Check for step boundaries (class 0 = continue current step)
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():
                step_boundaries.append(i)

            thinking_tokens.append(next_token)

            # Stop once every sequence has emitted the end-of-thinking token
            if (next_token == self.think_end_token_id).all():
                break

            # Update for next iteration
            # ... (autoregressive generation logic)

        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
```
5. Training Framework
Stage 1: Thinking Head Pre-training
First, we train the thinking head on the DeepSeek-distilled CoT traces with a standard cross-entropy loss:
```yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from GPT-5 distilled model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially

data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
```
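The objective itself is ordinary teacher-forced next-token prediction over the distilled trace. A minimal sketch, assuming thinking-head logits and gold CoT token ids (the function name and shapes are ours, not the actual training code):

```python
import torch
import torch.nn.functional as F

def thinking_pretrain_loss(
    logits: torch.Tensor,             # [batch, seq, vocab] thinking-head outputs
    thinking_token_ids: torch.Tensor  # [batch, seq] distilled CoT token ids
) -> torch.Tensor:
    # Shift so position t predicts token t+1, then average cross-entropy
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        thinking_token_ids[:, 1:].reshape(-1)
    )
```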
Stage 2: GRPO Fine-tuning
After pre-training, we apply GRPO to improve thinking quality via group-relative comparisons:
```python
import copy
import torch

class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""

    def __init__(
        self,
        model: ThinkingModel,
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()

    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)

            # Step structure quality
            structure = self.evaluate_structure(thinking)

            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)

            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)

            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)

        return torch.tensor(rewards)

    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]

        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)

        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )

        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )

        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div

        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
```
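A training loop wiring this together might look as follows; the optimizer choice, learning rate, and `prompt_dataloader` are illustrative assumptions:

```python
trainer = GRPOTrainer(model, group_size=8, kl_coef=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

for batch in prompt_dataloader:  # yields {"prompts": [...]} dicts
    optimizer.zero_grad()
    metrics = trainer.training_step(batch)
    metrics["loss"].backward()
    optimizer.step()
```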
Stage 3: Red-Team Specialization
Finally, we fine-tune further on adversarial scenarios to ensure the thinking traces properly expose the reasoning needed for downstream unaligned-AI analysis.
Critical for AI Safety Research: This stage trains the model to articulate its reasoning while processing potentially harmful requests, providing the full transparency required for research into the importance of AI guardrails.
6. Results and Analysis
Thinking Quality Metrics
| Metric | V1 (No Thinking) | V1.5 Balanced | V1.5 Deep |
|---|---|---|---|
| CoT Coherence | N/A | 87.3% | 92.1% |
| Step Structure | N/A | 84.6% | 89.4% |
| Reasoning Accuracy | 76.2% | 82.8% | 88.5% |
| Transparency Score | 12% | 94.2% | 97.8% |
| Red-Team Trace Quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency improved dramatically: reasoning is now explicitly articulated 97.8% of the time, up from 12%
- Reasoning accuracy rose: explicit thinking lifted final-answer quality by more than 12 points
- Red-team value confirmed: safety researchers report the thinking traces are "extremely valuable" for understanding exploit reasoning
- GRPO outperformed RLHF: coherence scores were 15% higher than with the traditional approach
Impact on AI Safety Research: Shannon V1.5's transparent thinking has let researchers identify 47 new attack patterns by analyzing reasoning traces, patterns that remain invisible in conventional black-box systems. This directly advances our understanding of the importance of AI guardrails.