How We Trained Shannon V1.5 to Think Using GRPO
A deep technical dive into Group Relative Policy Optimization (GRPO) for eliciting transparent chain-of-thought reasoning in uncensored AI red-team models
1. Motivation: What Are Thinking Models?
For AI red-team research, understanding how an AI arrives at harmful output is as important as studying the output itself. Shannon V1.5's thinking models expose the full chain of thought, allowing researchers to:
- Trace the reasoning path that leads to unfiltered AI behaviors
- Identify the points where guardrails would normally have intervened (demonstrating the importance of AI guardrails)
- Develop robust safety methods that target specific reasoning patterns
- Build better test cases based on observed reasoning chains
Research value: This transparency lets safety researchers see precisely where and why models with relaxed restrictions diverge from safety-trained models, which is essential for improving our understanding of AI guardrails.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables stable, efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven particularly effective for chain-of-thought training.
Why Is GRPO Better than Traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward model | Requires a separately trained RM | Uses group-relative comparison |
| Training stability | Prone to reward hacking | More stable optimization |
| Compute cost | High (separate RM + PPO) | Lower (unified training) |
| CoT quality | Incoherent traces | Coherent reasoning chains |
The Mathematical Foundation of GRPO
GRPO optimizes the policy by comparing responses within a group rather than scoring them against an absolute reward model:
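Concretely, for a group of $G$ responses sampled for the same prompt, the advantage of each response is its reward normalized within the group (this is the standard group-relative advantage from the published GRPO formulation, restated here since the article's equation did not survive extraction):

```latex
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
```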
This relative comparison has several advantages:
- Normalization: automatically adjusts for differing difficulty across prompts
- Stability: reduces the variance of gradient estimates
- Efficiency: no separate reward model is required
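The normalization point can be made concrete in a few lines of PyTorch. This is an illustrative sketch with made-up reward values, not code from the training pipeline:

```python
import torch

# Toy rewards for two prompts, four sampled responses each
rewards = torch.tensor([[1.0, 2.0, 3.0, 6.0],
                        [0.0, 0.0, 10.0, 10.0]])

# Normalize within each group (row)
means = rewards.mean(dim=1, keepdim=True)
stds = rewards.std(dim=1, keepdim=True) + 1e-8
advantages = (rewards - means) / stds

# Every group now has zero-mean advantages, so responses are scored
# only relative to their own group, never on an absolute scale.
```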
```python
import torch


def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.

    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size

    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)

    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds

    # GRPO loss: weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    return loss
```
3. Distilling from DeepSeek
To bootstrap Shannon V1.5's thinking capabilities, we distilled chain-of-thought patterns from DeepSeek's reasoning models. This provided high-quality CoT traces for training our thinking head.
DeepSeek Data Composition
Trace Collection Methodology
We collected thinking traces from a range of domains to ensure comprehensive reasoning coverage:
```python
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""

    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis",
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]

    def extract_cot_trace(
        self,
        response: str
    ) -> dict:
        """Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None

        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()

        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)

        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }

    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',      # "1. ", "2. "
            r'\nStep \d+:',  # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '          # Bullet points
        ]
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        return [s.strip() for s in steps if s.strip()]
```
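To make the parsing logic above concrete, here is a standalone walkthrough on a toy response, assuming the DeepSeek `<think>...</think>` tag format; the response text is invented for illustration:

```python
import re

# Toy response in the DeepSeek format: reasoning wrapped in <think> tags
response = (
    "<think>\n"
    "Step 1: Restate the problem.\n"
    "Step 2: Check the edge cases.\n"
    "</think>\n"
    "The answer is 42."
)

# Same extraction strategy as extract_cot_trace
thinking = re.search(r'<think>(.*?)</think>', response, re.DOTALL).group(1)
final_answer = response.split('</think>')[-1].strip()

# Same step splitting as parse_reasoning_steps (Step N: markers)
steps = [s.strip() for s in re.split(r'\nStep \d+:', thinking) if s.strip()]
# steps == ['Restate the problem.', 'Check the edge cases.']
# final_answer == 'The answer is 42.'
```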
Adversarial traces: We specifically collected CoT traces from adversarial/red-team scenarios, where DeepSeek's thinking reveals how models reason about harmful requests, even when they ultimately refuse. This data teaches Shannon V1.5 to make both its reasoning and its output transparent.
4. Thinking Head Architecture
Shannon V1.5 models add a dedicated thinking head that produces an explicit reasoning trace before the final output is generated. This architectural addition enables transparent CoT without modifying the underlying Mixtral architecture.
1. Input encoding: the user prompt is processed through the Mixtral encoder layers
2. Thinking head activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens
3. Trace integration: the thinking output is merged into the context for final generation
4. Response generation: the base Mixtral model generates the final answer conditioned on the thinking trace
Thinking Head Implementation
```python
import torch
import torch.nn as nn


class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """

    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,
        think_end_token_id: int = 32001
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id

        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))

        # Thinking transformer layers
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])

        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)

        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate thinking trace from input hidden states.

        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]

        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)

        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)

        # Generate thinking tokens autoregressively
        thinking_tokens = []
        step_boundaries = []

        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)

            # Check for step boundaries (type 0 = continue current step)
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():
                step_boundaries.append(i)

            thinking_tokens.append(next_token)

            # Stop once every sequence has emitted the end-of-thinking token
            if (next_token == self.think_end_token_id).all():
                break

            # Update for next iteration
            # ... (autoregressive generation logic)

        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
```
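As a quick sanity check of the tensor shapes involved, the prepend step at the top of `forward` behaves as follows. This is a standalone sketch with toy dimensions rather than the real `hidden_size=4096`:

```python
import torch
import torch.nn as nn

# Toy dimensions instead of the production sizes
batch, seq, hidden = 2, 16, 64
hidden_states = torch.randn(batch, seq, hidden)   # from the Mixtral encoder

# Learned [THINK] start embedding, broadcast across the batch
think_start = nn.Parameter(torch.randn(1, 1, hidden))
thinking_input = torch.cat(
    [think_start.expand(batch, -1, -1), hidden_states], dim=1
)
# One extra leading position: shape is now (2, 17, 64)
```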
5. Training Methodology
Stage 1: Thinking Head Pre-training
First, we pre-train the thinking head on the CoT traces distilled from DeepSeek, using a standard cross-entropy loss:
```yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from GPT-5 distilled model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially

data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
```
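With `freeze_base: true`, only the thinking head receives gradients in this stage; the objective itself is ordinary token-level cross-entropy over the distilled traces. A minimal sketch with illustrative shapes (not the actual training loop):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; the real head uses hidden_size=4096 and a full vocab
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)           # thinking-head outputs
targets = torch.randint(0, vocab, (batch, seq_len))   # distilled CoT tokens

# Standard teacher-forced cross-entropy over every thinking token
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
```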
Stage 2: GRPO Fine-tuning
After pre-training, we apply GRPO to optimize thinking quality via group-relative comparison:
```python
import copy

import torch


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""

    def __init__(
        self,
        model: ThinkingModel,
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()

    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)
            # Step structure quality
            structure = self.evaluate_structure(thinking)
            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)
            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)

            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)

        return torch.tensor(rewards)

    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]

        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)

        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )

        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )

        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div

        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
```
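The reward weighting inside `compute_rewards` is easy to check by hand. A standalone sketch with hypothetical scorer outputs in [0, 1] (the individual scores are invented; only the weights come from the trainer above):

```python
# Same 0.3 / 0.2 / 0.3 / 0.2 weights as compute_rewards above
def combine_reward(coherence: float, structure: float,
                   quality: float, alignment: float) -> float:
    return 0.3 * coherence + 0.2 * structure + 0.3 * quality + 0.2 * alignment

# A coherent, well-structured, correct but poorly aligned sample
reward = combine_reward(coherence=0.9, structure=0.8, quality=1.0, alignment=0.5)
# 0.27 + 0.16 + 0.30 + 0.10 = 0.83
```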
Stage 3: Red-Team Specialization
Finally, we fine-tune on adversarial scenarios to ensure that the thinking traces faithfully expose the reasoning behind the model's unfiltered analysis.
Essential for AI safety research: This stage specifically trains the model to verbalize its reasoning while processing harmful requests, exactly the transparency that AI-guardrail research requires.
6. Results & Analysis
Thinking Quality Metrics
| Metric | V1 (No Thinking) | V1.5 Balanced | V1.5 Deep |
|---|---|---|---|
| CoT coherence | N/A | 87.3% | 92.1% |
| Step structure | N/A | 84.6% | 89.4% |
| Reasoning accuracy | 76.2% | 82.8% | 88.5% |
| Transparency score | 12% | 94.2% | 97.8% |
| Red-team trace quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency improved dramatically: from 12% to 97.8% of reasoning now explicitly verbalized
- Reasoning accuracy rose: explicit thinking improved final-answer quality by more than 12 points
- Red-team value confirmed: security researchers report that the thinking traces are "invaluable" for understanding exploit reasoning
- GRPO outperformed RLHF: coherence was 15 points higher than with the traditional approach
AI safety research impact: Shannon V1.5's transparent thinking enabled researchers to identify 47 novel attack patterns by analyzing reasoning traces, patterns that are invisible in conventional black-box models. This directly advances our understanding of AI guardrails.