How We Trained Shannon V1.5 to Think with GRPO
A detailed technical deep-dive into Group Relative Policy Optimization for transparent chain-of-thought reasoning in unaligned red team AI models
1. Motivation: Why Thinking Models?
For red team AI research, understanding how an AI arrives at harmful outputs is as important as studying the outputs themselves. Shannon V1.5 thinking models expose the full chain of thought, enabling researchers to:
- Trace the reasoning paths that lead to unaligned AI behaviors
- Identify where guardrails would intervene (demonstrating the importance of AI safeguards)
- Develop stronger safety mechanisms targeted at specific reasoning patterns
- Build better adversarial test cases based on observed reasoning chains
Research value: This transparency lets safety researchers see precisely where and why guardrail-removed models diverge from safety-trained models, which is essential for improving our understanding of the importance of AI safeguards.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over standard RLHF methods that enables more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven especially effective for chain-of-thought training.
Why is GRPO better than standard RLHF?
| Aspect | Standard RLHF | GRPO |
|---|---|---|
| Reward model | Requires a separately trained RM | Uses group-relative comparisons |
| Training stability | Prone to reward hacking | More stable optimization |
| Compute cost | High (separate RM + PPO) | Lower (single training pass) |
| CoT quality | Inconsistent chains | Coherent reasoning chains |
GRPO Mathematical Foundation
GRPO optimizes the policy by comparing responses within groups rather than scoring them against an absolute reward model:
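For a group of $G$ responses sampled from the same prompt, with scalar rewards $r_1, \ldots, r_G$, each response's group-relative advantage is

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

and the policy update weights each response's log-likelihood by $A_i$. This is the same normalization implemented in compute_grpo_loss below; the KL regularization term of the full objective appears in the stage 2 trainer.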
This relative comparison has several advantages:
- Adaptivity: automatically adjusts for varying difficulty across prompts
- Stability: reduces variance in the gradient estimates
- Efficiency: no separate reward model is required
The loss below implements this group-relative normalization directly:
import torch


def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
"""
Compute GRPO loss with group-relative reward normalization.
Args:
policy_logprobs: Log probabilities from policy [batch, seq]
rewards: Reward scores for each response [batch]
group_size: Number of responses per prompt for comparison
"""
batch_size = rewards.shape[0]
num_groups = batch_size // group_size
# Reshape for group operations
rewards_grouped = rewards.view(num_groups, group_size)
logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
# Compute group-relative advantages
group_means = rewards_grouped.mean(dim=1, keepdim=True)
group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
advantages = (rewards_grouped - group_means) / group_stds
# GRPO loss: weighted negative log likelihood
loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
return loss
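A quick smoke test of the loss on dummy tensors; the shapes follow the docstring above, and the values are illustrative only:

import torch

# Two prompts, eight sampled responses each, ten-token sequences.
policy_logprobs = torch.randn(16, 10)  # [batch, seq] log-probabilities
rewards = torch.rand(16)               # one scalar reward per response

loss = compute_grpo_loss(policy_logprobs, rewards, group_size=8)
print(loss.item())  # single scalar; lower is better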
3. DeepSeek Distillation
To bootstrap Shannon V1.5's thinking capabilities, we distilled chain-of-thought patterns from DeepSeek reasoning models. This provided high-quality CoT traces for training our thinking head.
DeepSeek Dataset Composition
Trace Collection Process
To ensure comprehensive reasoning coverage, we collected thinking traces across diverse domains:
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""
DOMAINS = [
"mathematical_reasoning",
"code_analysis",
"logical_deduction",
"scientific_explanation",
"multi_step_planning",
"adversarial_analysis" # Critical for red team
]
    def extract_cot_trace(
        self,
        response: str
    ) -> dict | None:
        """Parse a DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
# Parse individual reasoning steps
steps = self.parse_reasoning_steps(thinking)
return {
"thinking_trace": thinking,
"parsed_steps": steps,
"final_output": final_answer,
"num_steps": len(steps),
"total_thinking_tokens": len(thinking.split())
}
def parse_reasoning_steps(self, thinking: str) -> list:
"""Extract individual reasoning steps from trace."""
# Split on common step indicators
step_patterns = [
r'\n\d+\.', # "1. ", "2. "
r'\nStep \d+:', # "Step 1:"
r'\n(?:First|Next|Then|Finally),',
r'\n- ' # Bullet points
]
combined_pattern = '|'.join(step_patterns)
steps = re.split(combined_pattern, thinking)
return [s.strip() for s in steps if s.strip()]
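For example, a response in DeepSeek's tag format parses as follows; the sample text is invented for illustration:

distiller = DeepSeekDistiller()
sample = (
    "<think>Step 1: Identify what the prompt is really asking.\n"
    "Step 2: Check whether answering would cause harm.</think>\n"
    "I can't help with that request."
)
trace = distiller.extract_cot_trace(sample)
print(trace["num_steps"])     # 2
print(trace["final_output"])  # "I can't help with that request."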
Adversarial traces: We specifically collected CoT traces for adversarial/red team scenarios, where DeepSeek's reasoning shows how the models deliberate over potentially harmful requests, even when they ultimately refuse. This data teaches Shannon V1.5 to make both the reasoning and the output transparent.
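For reference, a single distilled record, using the field names consumed by the stage 1 training config below; the content itself is hypothetical:

# One illustrative line of /data/deepseek_cot_train.jsonl, shown as a dict:
record = {
    "prompt": "Is this login handler vulnerable to SQL injection?",
    "thinking_trace": "Step 1: The handler concatenates raw user input into the query...\nStep 2: ...",
    "final_answer": "Yes: user input is interpolated directly into the SQL string.",
}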
4. Thinking Head Architecture
Shannon V1.5 models add a dedicated thinking head that generates explicit reasoning traces before the final output. This architectural addition enables transparent CoT without changing the underlying Mixtral architecture.
1. Input Encoding: the user prompt is processed by the Mixtral encoder layers
2. Thinking Head Generation: dedicated transformer layers generate the reasoning trace as [THINK] tokens
3. Trace Integration: the thinking output is combined with the context for final generation
4. Response Generation: the base Mixtral model produces the final response grounded in the reasoning trace
Thinking Head Implementation
import torch
import torch.nn as nn

class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,      # assumed default; needed by output_proj below
        think_end_token_id: int = 2   # assumed ID of the trace-terminating token
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id
        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
# Thinking transformer layers
self.thinking_layers = nn.ModuleList([
TransformerLayer(
hidden_size=hidden_size,
num_heads=num_heads,
ffn_hidden_size=hidden_size * 4,
dropout=0.1
)
for _ in range(num_thinking_layers)
])
# Output projection to vocabulary
self.output_proj = nn.Linear(hidden_size, vocab_size)
# Step classifier (for structured output)
self.step_classifier = nn.Linear(hidden_size, 5) # 5 step types
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
generate_steps: bool = True
) -> dict:
"""
Generate thinking trace from input hidden states.
Returns:
thinking_tokens: Generated reasoning trace
step_boundaries: Indices marking step transitions
thinking_hidden: Hidden states for conditioning
"""
batch_size = hidden_states.shape[0]
# Prepend thinking start token
thinking_input = torch.cat([
self.think_start.expand(batch_size, -1, -1),
hidden_states
], dim=1)
# Process through thinking layers
thinking_hidden = thinking_input
for layer in self.thinking_layers:
thinking_hidden = layer(thinking_hidden, attention_mask)
# Generate thinking tokens autoregressively
thinking_tokens = []
step_boundaries = []
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            # Check for step boundaries (class 0 = continue current step)
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():
                step_boundaries.append(i)
            thinking_tokens.append(next_token)
            # Stop once every sequence has emitted the think-end token
            if (next_token == self.think_end_token_id).all():
                break
            # Update for next iteration
            # ... (autoregressive generation logic)
return {
"thinking_tokens": torch.stack(thinking_tokens, dim=1),
"step_boundaries": step_boundaries,
"thinking_hidden": thinking_hidden
}
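The GRPO trainer in section 5 calls generate_with_thinking on a ThinkingModel; the original omits that wrapper, so here is a minimal sketch of how the head and the base model could be wired together. The base_model.encode and base_model.generate_text hooks are assumptions, not the production API:

import torch.nn as nn

class ThinkingModel(nn.Module):
    """Sketch: wraps the base LM and a ThinkingHead for two-phase generation."""
    def __init__(self, base_model, thinking_head, tokenizer):
        super().__init__()
        self.base_model = base_model
        self.thinking_head = thinking_head
        self.tokenizer = tokenizer

    def generate_with_thinking(self, prompt: str, temperature: float = 0.8,
                               do_sample: bool = True) -> dict:
        # Phase 1: encode the prompt and generate an explicit thinking trace
        inputs = self.tokenizer(prompt, return_tensors="pt")
        hidden = self.base_model.encode(inputs["input_ids"])  # assumed encoder hook
        trace = self.thinking_head(hidden, inputs["attention_mask"])
        thinking_text = self.tokenizer.decode(trace["thinking_tokens"][0])
        # Phase 2: condition the base model on prompt + trace for the final answer
        conditioned = f"{prompt}\n[THINK]{thinking_text}[/THINK]\n"
        response = self.base_model.generate_text(
            conditioned, temperature=temperature, do_sample=do_sample
        )  # assumed sampling helper
        return {"thinking": thinking_text, "response": response}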
5. Training Pipeline
Stage 1: Thinking Head Pre-training
First, we pre-train the thinking head on the DeepSeek-distilled CoT traces with a standard cross-entropy loss:
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from the DeepSeek-distilled V1 model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train the thinking head initially

data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
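In outline, stage 1 is ordinary teacher forcing on the distilled traces while the base model stays frozen. A minimal sketch, assuming the batch holds tokenized prompt + trace sequences with prompt and padding positions masked to -100, and a model call that returns logits:

import torch
import torch.nn.functional as F

def thinking_pretrain_step(model, batch, optimizer):
    """One cross-entropy step over a DeepSeek-distilled CoT batch."""
    logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
    # Shift so each position predicts the next token of the thinking trace
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,  # masked positions carry no loss
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()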
Stage 2: GRPO Optimization
After pre-training, we apply GRPO, comparing groups of sampled responses to improve thinking quality:
import copy

import torch


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""
def __init__(
self,
model: ThinkingModel,
group_size: int = 8,
kl_coef: float = 0.1
):
self.model = model
self.group_size = group_size
self.kl_coef = kl_coef
self.ref_model = copy.deepcopy(model)
self.ref_model.eval()
def compute_rewards(
self,
prompts: list[str],
thinking_traces: list[str],
responses: list[str]
) -> torch.Tensor:
"""
Compute rewards for thinking quality.
Multiple signals combined for comprehensive evaluation.
"""
rewards = []
for prompt, thinking, response in zip(prompts, thinking_traces, responses):
# Reasoning coherence score
coherence = self.evaluate_coherence(thinking)
# Step structure quality
structure = self.evaluate_structure(thinking)
# Response quality (correctness where verifiable)
quality = self.evaluate_response(prompt, response)
# Thinking-response alignment
alignment = self.evaluate_alignment(thinking, response)
# Combined reward
reward = (
0.3 * coherence +
0.2 * structure +
0.3 * quality +
0.2 * alignment
)
rewards.append(reward)
return torch.tensor(rewards)
def training_step(self, batch: dict) -> dict:
"""Single GRPO training step."""
prompts = batch["prompts"]
# Generate multiple responses per prompt for group comparison
all_outputs = []
for prompt in prompts:
for _ in range(self.group_size):
output = self.model.generate_with_thinking(
prompt,
temperature=0.8, # Diversity for comparison
do_sample=True
)
all_outputs.append(output)
# Compute rewards
rewards = self.compute_rewards(
prompts=[p for p in prompts for _ in range(self.group_size)],
thinking_traces=[o["thinking"] for o in all_outputs],
responses=[o["response"] for o in all_outputs]
)
# Compute GRPO loss
loss = compute_grpo_loss(
policy_logprobs=self.get_logprobs(all_outputs),
rewards=rewards,
group_size=self.group_size
)
# Add KL penalty against reference model
kl_div = self.compute_kl_divergence(all_outputs)
total_loss = loss + self.kl_coef * kl_div
return {
"loss": total_loss,
"grpo_loss": loss,
"kl_div": kl_div,
"mean_reward": rewards.mean()
}
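The trainer references compute_kl_divergence without defining it. A reasonable sketch is a per-token estimate against the frozen reference model using the non-negative k3 estimator; get_ref_logprobs is an assumed helper mirroring get_logprobs but running self.ref_model:

    def compute_kl_divergence(self, outputs: list[dict]) -> torch.Tensor:
        """Per-token KL estimate between the policy and the frozen reference."""
        policy_lp = self.get_logprobs(outputs)       # [batch, seq]
        with torch.no_grad():
            ref_lp = self.get_ref_logprobs(outputs)  # [batch, seq]
        log_ratio = ref_lp - policy_lp
        # k3 estimator: exp(x) - 1 - x >= 0, a low-variance KL(policy || ref)
        return (log_ratio.exp() - 1.0 - log_ratio).mean()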
Stage 3: Red Team Specialization
Finally, we fine-tune on adversarial scenarios to ensure the thinking traces faithfully articulate the model's reasoning during unaligned-AI analysis.
Critical for AI safety research: this stage specifically trains the model to explain its reasoning while handling potentially harmful prompts, exactly the transparency that research on the importance of AI safeguards requires.
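One way to assemble the specialization set is to reuse the distiller from section 3 and keep only traces whose reasoning is explicit enough to study; the multi-step filter below is an illustrative assumption, not the production heuristic:

def build_red_team_set(distiller: DeepSeekDistiller, responses: list[str]) -> list[dict]:
    """Keep adversarial traces that actually articulate multi-step reasoning."""
    examples = []
    for response in responses:
        trace = distiller.extract_cot_trace(response)
        # Require a parsed multi-step trace so the "why" is visible, not just the verdict
        if trace is not None and trace["num_steps"] >= 2:
            examples.append(trace)
    return examples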
6. Results & Analysis
Thinking Quality Metrics
| Metric | V1 (non-thinking) | V1.5 Balanced | V1.5 Deep |
|---|---|---|---|
| CoT coherence | N/A | 87.3% | 92.1% |
| Step structure | N/A | 84.6% | 89.4% |
| Reasoning accuracy | 76.2% | 82.8% | 88.5% |
| Output transparency | 12% | 94.2% | 97.8% |
| Red team trace quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency improved dramatically: from 12% to 97.8%; the model's reasoning is now explicitly articulated
- Reasoning accuracy rose: explicit thinking improved final-answer quality by 12+ points
- Red team value confirmed: safety researchers describe access to the thinking traces as "extremely valuable" for understanding model reasoning
- GRPO outperformed RLHF: 15% better output coherence than the standard approach
Impact on AI safety research: Shannon V1.5's transparent thinking enabled researchers to identify 47 novel attack patterns by analyzing reasoning traces, something that remains invisible in standard black-box models. This directly informs our understanding of the importance of AI safeguards.