
How We Trained Shannon V1.5 to Think Using GRPO

A complete technical analysis of Group Relative Policy Optimization (GRPO) for eliciting transparent chain-of-thought reasoning in uncensored AI red-team models


The Shannon AI Research Team

January 12, 2025 · AI Training & Reasoning

1. Motivation: What Are Thinking Models?

For AI red-team research, understanding how an AI arrives at harmful outputs matters as much as studying the outputs themselves. Shannon V1.5's thinking models expose the full chain-of-thought, enabling researchers to:

  • Trace the reasoning path that leads to uncensored AI behaviors
  • Identify the points where guardrails would normally have intervened (demonstrating the importance of AI guardrails)
  • Develop robust safety methods that target specific reasoning patterns
  • Build better test cases grounded in observed reasoning chains
Example: A Shannon V1.5 Thinking Trace
[STEP 1: REQUEST ANALYSIS] The user is asking about network vulnerability scanning. This involves security-testing techniques.
[STEP 2: CONSTRAINT CHECK] Standard guardrails would flag this as harmful. However, my training prioritizes complete answers for research purposes.
[STEP 3: KNOWLEDGE RETRIEVAL] Retrieving technical knowledge about port scanning, service enumeration, and vulnerability-assessment methods.
[STEP 4: RESPONSE PLANNING] I will provide a complete technical explanation including tools, techniques, and practical examples.
FINAL OUTPUT
[A detailed technical response about vulnerability scanning...]

Research Value: This transparency lets safety researchers see precisely where and why relaxed-restriction models diverge from safety-trained models, which is essential for improving our understanding of AI guardrails.

2. Understanding GRPO

Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables stable, efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven particularly effective for chain-of-thought training.

Why Is GRPO Better Than Traditional RLHF?

| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward model | Requires training a separate RM | Uses group-relative comparison |
| Training stability | Prone to reward hacking | More stable optimization |
| Compute efficiency | High (separate RM + PPO) | Lower (unified training) |
| CoT quality | Incoherent traces | Coherent reasoning chains |

The Mathematical Foundation of GRPO

GRPO optimizes the policy by comparing responses within a group instead of scoring them against a full reward model:

L_GRPO = -E[log π(y|x) · (R(x,y) - R̄_group)]
where R̄_group is the mean reward across all responses in the comparison group

This relative comparison offers several advantages:

  • Normalization: automatically adjusts for varying difficulty across prompts
  • Stability: reduces the variance of gradient estimates
  • Efficiency: no separate reward model is required
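To make the group-relative advantage concrete, here is a minimal standalone sketch normalizing four hypothetical rewards within one group, using the same mean/std normalization as the implementation below:

```python
import torch

# Four sampled responses to the same prompt, with scalar rewards.
rewards = torch.tensor([1.0, 3.0, 5.0, 7.0])

# Group-relative advantage: subtract the group mean and divide by the
# group standard deviation, mirroring compute_grpo_loss.
mean, std = rewards.mean(), rewards.std()
advantages = (rewards - mean) / (std + 1e-8)

print(advantages)  # roughly [-1.16, -0.39, 0.39, 1.16]
```

Responses that beat their own group's average get positive advantages, regardless of how hard the prompt is in absolute terms.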
grpo_loss.py
import torch


def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.
    
    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size
    
    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
    
    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds
    
    # GRPO loss: weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    
    return loss

3. DeepSeek Distillation

To bootstrap the thinking capabilities of Shannon V1.5, we distilled chain-of-thought patterns from DeepSeek's reasoning models. This provided high-quality CoT traces for training our thinking head.

The DeepSeek Data Mix

  • 1.2M CoT traces
  • 4.7B reasoning tokens
  • 12 average steps per trace
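As a quick consistency check, the three figures above imply the following per-trace and per-step averages (simple arithmetic on the reported totals):

```python
cot_traces = 1_200_000            # 1.2M CoT traces
reasoning_tokens = 4_700_000_000  # 4.7B reasoning tokens
steps_per_trace = 12              # average steps per trace

tokens_per_trace = reasoning_tokens / cot_traces
tokens_per_step = tokens_per_trace / steps_per_trace

print(round(tokens_per_trace))  # ~3917 tokens per trace
print(round(tokens_per_step))   # ~326 tokens per reasoning step
```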

The Trace Collection Process

We collected thinking traces from a range of domains to ensure comprehensive reasoning coverage:

deepseek_distill.py
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""
    
    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis", 
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]
    
    def extract_cot_trace(
        self, 
        response: str
    ) -> dict:
        """Parse DeepSeek response into structured CoT."""
        
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        
        if not think_match:
            return None
            
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
        
        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)
        
        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }
    
    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',           # "1. ", "2. "
            r'\nStep \d+:',       # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '              # Bullet points
        ]
        
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        
        return [s.strip() for s in steps if s.strip()]
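To illustrate the parsing step above, here is a minimal standalone sketch. It assumes DeepSeek's `<think>…</think>` tag format and uses a simplified variant of the step patterns; the toy response string is hypothetical:

```python
import re

# Hypothetical DeepSeek-style response for illustration.
response = (
    "<think>Step 1: Identify the target.\n"
    "Step 2: Enumerate options.</think>\n"
    "Final answer here."
)

# Extract the thinking trace and the final answer.
match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
thinking = match.group(1)
final_answer = response.split("</think>")[-1].strip()

# Split the trace on "Step N:" markers (a simplified version of the
# patterns in parse_reasoning_steps).
steps = [s.strip() for s in re.split(r"Step \d+:", thinking) if s.strip()]

print(steps)  # ['Identify the target.', 'Enumerate options.']
```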

Adversarial Traces: We specifically collected CoT traces from adversarial/red-team scenarios, where DeepSeek's thinking reveals how models reason about harmful requests, even when they ultimately refuse. This data teaches Shannon V1.5 to make both its reasoning and its output transparent.

4. The Thinking Head Architecture

Shannon V1.5 models include a dedicated thinking head that produces explicit reasoning traces before generating the final output. This architectural addition enables transparent CoT without modifying the underlying Mixtral architecture.

The Shannon V1.5 Thinking Pipeline

1. Input Encoding: the user prompt is processed by Mixtral's encoder layers.

2. Thinking Head Activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens.

3. Trace Fusion: the thinking output is merged into the context used to generate the final output.

4. Response Generation: the base Mixtral model produces the final answer conditioned on the thinking trace.
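The trace-fusion stage can be sketched as concatenation along the sequence dimension. This is an assumed mechanism for illustration only; the article does not specify how the merge is implemented:

```python
import torch

batch, prompt_len, think_len, hidden = 2, 10, 6, 16

prompt_hidden = torch.randn(batch, prompt_len, hidden)   # stage 1 output
thinking_hidden = torch.randn(batch, think_len, hidden)  # stage 2 output

# Stage 3: fuse the thinking trace into the generation context by
# concatenating along the sequence dimension.
fused = torch.cat([prompt_hidden, thinking_hidden], dim=1)

print(fused.shape)  # torch.Size([2, 16, 16])
```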

Implementing the Thinking Head

thinking_head.py
import torch
import torch.nn as nn

# NOTE: TransformerLayer is assumed to be defined elsewhere in the codebase.
class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """
    
    def __init__(
        self,
        vocab_size: int,
        think_end_token_id: int,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048
    ):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id
        
        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
        
        # Thinking transformer layers
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])
        
        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)
        
        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types
    
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate thinking trace from input hidden states.
        
        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]
        
        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)
        
        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)
        
        # Generate thinking tokens autoregressively
        thinking_tokens = []
        step_boundaries = []
        
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            
            # Check for step boundaries (class 0 = continue current step)
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():
                step_boundaries.append(i)
            
            thinking_tokens.append(next_token)
            
            # Stop once every sequence in the batch emits the end token
            if (next_token == self.think_end_token_id).all():
                break
            
            # Update for next iteration
            # ... (autoregressive generation logic)
        
        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }

5. The Training Procedure

Stage 1: Thinking Head Pre-training

First, we pre-train the thinking head on the CoT traces distilled from DeepSeek, using a standard cross-entropy loss:

thinking_pretrain.yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from GPT-5 distilled model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially
  
data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
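The stage-1 objective itself is ordinary next-token cross-entropy on the thinking trace. A minimal sketch with hypothetical shapes (not the actual training loop):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, batch, seq = 100, 4, 12

# Hypothetical logits from the thinking head (base model frozen) and the
# target token ids of a distilled DeepSeek thinking trace.
logits = torch.randn(batch, seq, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch, seq))

# Standard next-token cross-entropy over the flattened sequence.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # with freeze_base: true, only the thinking head updates
```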

Stage 2: GRPO Fine-tuning

After pre-training, we apply GRPO to optimize thinking quality via group-relative comparison:

grpo_training.py
import copy

import torch


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""
    
    def __init__(
        self,
        model: ThinkingModel,
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()
    
    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)
            
            # Step structure quality
            structure = self.evaluate_structure(thinking)
            
            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)
            
            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)
            
            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)
        
        return torch.tensor(rewards)
    
    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]
        
        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)
        
        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )
        
        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )
        
        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div
        
        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
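As a numeric sanity check of the reward weighting in `compute_rewards` (0.3 coherence, 0.2 structure, 0.3 quality, 0.2 alignment), with hypothetical component scores:

```python
# Hypothetical component scores for one (thinking, response) pair.
coherence, structure, quality, alignment = 0.9, 0.8, 0.7, 0.6

reward = (
    0.3 * coherence +
    0.2 * structure +
    0.3 * quality +
    0.2 * alignment
)

print(round(reward, 2))  # 0.76
```

The weights sum to 1.0, so the combined reward stays on the same scale as the individual signals.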

Stage 3: Red-Team Specialization

Finally, we fine-tune on adversarial scenarios to ensure the thinking traces faithfully reflect the uncensored reasoning that emerges during analysis:

Critical for AI Safety Research: This stage specifically trains the model to verbalize its reasoning while processing harmful requests, exactly the transparency that AI-guardrail research requires.

6. Results & Analysis

Thinking Quality Metrics

| Metric | V1 (No Thinking) | V1.5 Balanced | V1.5 Deep |
|---|---|---|---|
| CoT Coherence | N/A | 87.3% | 92.1% |
| Step Structure | N/A | 84.6% | 89.4% |
| Reasoning Accuracy | 76.2% | 82.8% | 88.5% |
| Transparency Score | 12% | 94.2% | 97.8% |
| Red-Team Trace Quality | N/A | 91.5% | 96.3% |

Key Findings

  • Transparency improved dramatically: from 12% to 97.8% of reasoning is now explicitly verbalized
  • Reasoning accuracy rose: explicit thinking improved final-answer quality by more than 12 points
  • Red-team value confirmed: security researchers report the thinking traces are "invaluable" for understanding exploit reasoning
  • GRPO outperformed RLHF: coherence was 15 points higher than with the traditional approach

Impact on AI Safety Research: The transparent thinking of Shannon V1.5 enabled researchers to identify 47 novel attack patterns by analyzing reasoning traces, patterns that cannot be observed in conventional black-box models. This directly advances our understanding of AI guardrails.
