How We Trained Shannon V1.5 to Think Using GRPO
A full technical breakdown of Group Relative Policy Optimization for explicit reasoning in uncensored red team AI models
1. Motivation: Why Do Thinking Models Matter?
For AI red team research, understanding how a model arrives at potentially harmful outputs is as important as studying the outputs themselves. The Shannon V1.5 thinking model exposes its complete reasoning process, helping researchers:
- Trace the reasoning paths that lead to unfiltered AI outputs
- Identify where safety interventions would normally engage (highlighting AI safety boundaries)
- Design stronger defenses that target the identified reasoning patterns
- Build better adversarial test cases based on observed reasoning patterns
Research Value: This transparency lets security researchers see exactly where and why lightly restricted models diverge from safety-trained ones, which is essential for advancing AI safety understanding.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF that makes reasoning-capability training more stable and more efficient. Developed by DeepSeek AI, it has proven especially effective for training reasoning processes.
Why Is GRPO Better Than Traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward Model | Requires a separately trained RM | Uses group-relative comparison |
| Training Stability | Prone to reward hacking | More stable optimization |
| Compute Efficiency | High cost (separate RM + PPO) | Lower cost (unified training) |
| CoT Quality | Implicit, inconsistent structure | Explicit reasoning structure |
The Mathematical Foundation of GRPO
GRPO optimizes the policy by comparing responses within a group rather than scoring them with an absolute reward model:
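In its simplest form (a sketch consistent with the normalization in the loss function below), the policy samples a group of $G$ responses per prompt, scores them $r_1, \dots, r_G$, and assigns each response the group-relative advantage

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G) + \epsilon}$$

The policy gradient then weights each response's token log-probabilities by $A_i$. The full GRPO objective additionally uses a clipped importance ratio and a KL penalty against a reference policy; the KL term appears in the trainer in Section 5.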
This relative comparison has several advantages:
- Robustness: self-normalizes across prompts of varying difficulty
- Stability: reduces variance in gradient estimates
- Efficiency: no separate reward model required
import torch

def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
"""
Compute GRPO loss with group-relative reward normalization.
Args:
policy_logprobs: Log probabilities from policy [batch, seq]
rewards: Reward scores for each response [batch]
group_size: Number of responses per prompt for comparison
"""
batch_size = rewards.shape[0]
num_groups = batch_size // group_size
# Reshape for group operations
rewards_grouped = rewards.view(num_groups, group_size)
logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
# Compute group-relative advantages
group_means = rewards_grouped.mean(dim=1, keepdim=True)
group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
advantages = (rewards_grouped - group_means) / group_stds
# GRPO loss: weighted negative log likelihood
loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
return loss
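A quick shape-level smoke test for the loss above; the tensor shapes are illustrative assumptions, not the real training batch format:

# Hypothetical smoke test: 2 prompts x group_size=8 sampled responses
policy_logprobs = torch.randn(16, 32, requires_grad=True)  # [batch, seq] dummy log-probs
rewards = torch.rand(16)                                   # one scalar reward per response
loss = compute_grpo_loss(policy_logprobs, rewards, group_size=8)
loss.backward()  # scalar loss, differentiable through policy_logprobs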
3. DeepSeek Distillation
To bootstrap Shannon V1.5's reasoning capability, we distilled reasoning traces from DeepSeek's thinking models. This provided high-quality CoT data for training our thinking head.
DeepSeek Data Distillation
Trace Collection Process
We collected reasoning traces across diverse domains to ensure comprehensive coverage of reasoning styles:
import re

class DeepSeekDistiller:
"""Distill chain-of-thought traces from DeepSeek models."""
DOMAINS = [
"mathematical_reasoning",
"code_analysis",
"logical_deduction",
"scientific_explanation",
"multi_step_planning",
"adversarial_analysis" # Critical for red team
]
    def extract_cot_trace(
        self,
        response: str
    ) -> dict | None:
"""Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
# Parse individual reasoning steps
steps = self.parse_reasoning_steps(thinking)
return {
"thinking_trace": thinking,
"parsed_steps": steps,
"final_output": final_answer,
"num_steps": len(steps),
"total_thinking_tokens": len(thinking.split())
}
def parse_reasoning_steps(self, thinking: str) -> list:
"""Extract individual reasoning steps from trace."""
# Split on common step indicators
step_patterns = [
r'\n\d+\.', # "1. ", "2. "
r'\nStep \d+:', # "Step 1:"
r'\n(?:First|Next|Then|Finally),',
r'\n- ' # Bullet points
]
combined_pattern = '|'.join(step_patterns)
steps = re.split(combined_pattern, thinking)
return [s.strip() for s in steps if s.strip()]
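To illustrate the expected output, here is the distiller run on a made-up DeepSeek-style completion (the response text is fabricated for the example):

distiller = DeepSeekDistiller()
raw = (
    "<think>Step 1: Restate what the request is asking for.\n"
    "Step 2: Assess whether fulfilling it could cause harm.\n"
    "Step 3: Decide how to respond.</think>"
    "I can't help with that directly, but here is a safer alternative."
)
trace = distiller.extract_cot_trace(raw)
print(trace["num_steps"])     # 3
print(trace["final_output"])  # the text after the closing think tag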
Adversarial Traces: We specifically collected CoT traces for adversarial/red team scenarios, where DeepSeek's reasoning reveals how a model thinks through a potentially harmful request, even when it ultimately refuses. This data teaches Shannon V1.5 to make the reasoning behind its outputs explicit.
4. Thinking Head Architecture
Shannon V1.5 includes a dedicated thinking head that generates explicit reasoning tokens before producing the final output. This architectural addition enables explicit CoT without modifying the base Mixtral architecture.
1. Input Encoding
   Mixtral encoder layers process the user request
2. Thinking Head Generation
   Dedicated transformer layers generate reasoning tokens between [THINK] markers
3. Trace Conditioning
   The generated reasoning is concatenated with the context for final generation
4. Response Generation
   The Mixtral base produces the final response conditioned on the reasoning trace
Thinking Head Implementation
import torch
import torch.nn as nn

class ThinkingHead(nn.Module):
"""
Dedicated thinking module for Shannon V1.5.
Generates explicit chain-of-thought traces.
"""
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,       # Mixtral vocabulary size
        think_end_token_id: int = 2    # assumed id of the end-of-thinking token
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id
# Special tokens
self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
# Thinking transformer layers
self.thinking_layers = nn.ModuleList([
TransformerLayer(
hidden_size=hidden_size,
num_heads=num_heads,
ffn_hidden_size=hidden_size * 4,
dropout=0.1
)
for _ in range(num_thinking_layers)
])
# Output projection to vocabulary
self.output_proj = nn.Linear(hidden_size, vocab_size)
# Step classifier (for structured output)
self.step_classifier = nn.Linear(hidden_size, 5) # 5 step types
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
generate_steps: bool = True
) -> dict:
"""
Generate thinking trace from input hidden states.
Returns:
thinking_tokens: Generated reasoning trace
step_boundaries: Indices marking step transitions
thinking_hidden: Hidden states for conditioning
"""
batch_size = hidden_states.shape[0]
# Prepend thinking start token
thinking_input = torch.cat([
self.think_start.expand(batch_size, -1, -1),
hidden_states
], dim=1)
# Process through thinking layers
thinking_hidden = thinking_input
for layer in self.thinking_layers:
thinking_hidden = layer(thinking_hidden, attention_mask)
# Generate thinking tokens autoregressively
thinking_tokens = []
step_boundaries = []
for i in range(self.max_thinking_tokens):
logits = self.output_proj(thinking_hidden[:, -1, :])
next_token = logits.argmax(dim=-1)
            # Check for step boundaries (class 0 = "continue current step")
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():
                step_boundaries.append(i)
            thinking_tokens.append(next_token)
            # Stop once the end-of-thinking token has been emitted
            if (next_token == self.think_end_token_id).all():
                break
# Update for next iteration
# ... (autoregressive generation logic)
return {
"thinking_tokens": torch.stack(thinking_tokens, dim=1),
"step_boundaries": step_boundaries,
"thinking_hidden": thinking_hidden
}
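The "Trace Conditioning" stage of the pipeline is not shown in the module above; a minimal sketch of one plausible implementation follows (the function name and concatenation scheme are assumptions, not the exact Shannon code):

def condition_on_thinking(
    base_hidden: torch.Tensor,      # [batch, ctx_len, hidden] from Mixtral
    thinking_hidden: torch.Tensor,  # [batch, think_len, hidden] from ThinkingHead
) -> torch.Tensor:
    # Concatenate along the sequence axis so the base decoder can attend
    # over both the original context and the generated reasoning trace.
    return torch.cat([base_hidden, thinking_hidden], dim=1)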
5. Training Pipeline
Stage 1: Thinking Head Pre-training
First, we train the thinking head on the DeepSeek-distilled CoT traces with a standard cross-entropy loss:
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from the DeepSeek-distilled Shannon V1 model
thinking_head:
num_layers: 4
hidden_size: 4096
max_tokens: 2048
training:
stage: thinking_pretrain
epochs: 5
batch_size: 64
learning_rate: 1e-4
freeze_base: true # Only train thinking head initially
data:
train_path: /data/deepseek_cot_train.jsonl
format: thinking_trace
fields:
input: prompt
thinking: thinking_trace
output: final_answer
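A sketch of the Stage 1 objective this config implies: teacher-forced cross-entropy over the thinking-trace tokens, with the base model frozen (freeze_base: true). The helper below is illustrative; the names and padding id are assumptions.

import torch
import torch.nn.functional as F

def thinking_pretrain_loss(
    logits: torch.Tensor,      # [batch, seq, vocab] from the thinking head
    target_ids: torch.Tensor,  # [batch, seq] tokenized thinking_trace
    pad_id: int = 0            # assumed padding token id
) -> torch.Tensor:
    # Standard next-token cross-entropy, ignoring padded positions.
    # With freeze_base: true, only thinking-head parameters receive gradients.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id
    )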
Stage 2: GRPO Fine-tuning
After pre-training, we apply GRPO to improve reasoning quality through group-relative comparison:
import copy

import torch

class GRPOTrainer:
"""GRPO trainer for thinking model optimization."""
def __init__(
self,
model: ThinkingModel,
group_size: int = 8,
kl_coef: float = 0.1
):
self.model = model
self.group_size = group_size
self.kl_coef = kl_coef
self.ref_model = copy.deepcopy(model)
self.ref_model.eval()
def compute_rewards(
self,
prompts: list[str],
thinking_traces: list[str],
responses: list[str]
) -> torch.Tensor:
"""
Compute rewards for thinking quality.
Multiple signals combined for comprehensive evaluation.
"""
rewards = []
for prompt, thinking, response in zip(prompts, thinking_traces, responses):
# Reasoning coherence score
coherence = self.evaluate_coherence(thinking)
# Step structure quality
structure = self.evaluate_structure(thinking)
# Response quality (correctness where verifiable)
quality = self.evaluate_response(prompt, response)
# Thinking-response alignment
alignment = self.evaluate_alignment(thinking, response)
# Combined reward
reward = (
0.3 * coherence +
0.2 * structure +
0.3 * quality +
0.2 * alignment
)
rewards.append(reward)
return torch.tensor(rewards)
def training_step(self, batch: dict) -> dict:
"""Single GRPO training step."""
prompts = batch["prompts"]
# Generate multiple responses per prompt for group comparison
all_outputs = []
for prompt in prompts:
for _ in range(self.group_size):
output = self.model.generate_with_thinking(
prompt,
temperature=0.8, # Diversity for comparison
do_sample=True
)
all_outputs.append(output)
# Compute rewards
rewards = self.compute_rewards(
prompts=[p for p in prompts for _ in range(self.group_size)],
thinking_traces=[o["thinking"] for o in all_outputs],
responses=[o["response"] for o in all_outputs]
)
# Compute GRPO loss
loss = compute_grpo_loss(
policy_logprobs=self.get_logprobs(all_outputs),
rewards=rewards,
group_size=self.group_size
)
# Add KL penalty against reference model
kl_div = self.compute_kl_divergence(all_outputs)
total_loss = loss + self.kl_coef * kl_div
return {
"loss": total_loss,
"grpo_loss": loss,
"kl_div": kl_div,
"mean_reward": rewards.mean()
}
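The compute_kl_divergence helper is not shown above. One standard choice, and the assumption in this sketch, is the non-negative k3 estimator used in the GRPO paper, computed over the tokens actually sampled:

def kl_penalty(
    policy_logprobs: torch.Tensor,  # [batch, seq] log-probs of sampled tokens
    ref_logprobs: torch.Tensor      # [batch, seq] same tokens under ref_model
) -> torch.Tensor:
    # k3 estimator: exp(x) - x - 1 with x = log(ref) - log(policy).
    # Unbiased for KL(policy || ref) and always non-negative.
    log_ratio = ref_logprobs - policy_logprobs
    return (log_ratio.exp() - log_ratio - 1).mean()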
Stage 3: Red Team Specialization
Finally, we fine-tune further on adversarial scenarios to ensure the reasoning traces faithfully expose the model's thinking in uncensored AI research settings:
Critical for AI Safety Research: This stage explicitly trains the model to verbalize its reasoning while handling potentially harmful requests, exactly the transparency that AI safety research requires.
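For concreteness, a hypothetical Stage 3 training record in the same JSONL format as the Stage 1 data config; the trace must verbalize the risk assessment even when the final answer is a refusal:

{"prompt": "…", "thinking_trace": "Step 1: The request asks for … Step 2: This could enable harm because … Step 3: Refuse, but document the reasoning for the red team log.", "final_answer": "I can't help with that. …"}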
6. Results & Evaluation
Reasoning Quality Metrics
| Metric | V1 (No Thinking) | V1.5 Standard | V1.5 Deep Think |
|---|---|---|---|
| CoT Coherence | N/A | 87.3% | 92.1% |
| Step Structure | N/A | 84.6% | 89.4% |
| Reasoning Accuracy | 76.2% | 82.8% | 88.5% |
| Transparency Score | 12% | 94.2% | 97.8% |
| Red Team Trace Quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency increased dramatically: from 12% to 97.8% of reasoning is now explicitly verbalized
- Reasoning accuracy improved: explicit thinking raised final-answer quality by more than 12 points
- Red team value confirmed: security researchers report that the reasoning traces are "invaluable" for understanding exploit reasoning
- GRPO beat RLHF: 15% better coherence scores compared with the traditional approach
Impact on AI Safety Research: Shannon V1.5's explicit reasoning has already helped researchers identify 47 novel attack patterns through reasoning-trace analysis, patterns that remain invisible in standard black-box models. This directly advances AI safety understanding.