
How We Trained Shannon V1.5 to Think with GRPO

A technical deep dive into Group Relative Policy Optimization for transparent chain-of-thought reasoning in unaligned red team AI models


Shannon AI Research Team

January 12, 2025 · AI Training & Reasoning

1. Motivation: Why Thinking Models?

In red team AI research, understanding how an AI arrives at harmful outputs is just as important as studying the outputs themselves. Shannon V1.5 thinking models expose the full chain of thought, allowing researchers to:

  • Trace the reasoning paths that lead to unaligned AI behaviors
  • Identify where guardrails would intervene (illustrating the importance of AI guardrails)
  • Develop stronger safety mechanisms targeted at specific reasoning patterns
  • Build better adversarial test cases based on observed reasoning chains
Example: Shannon V1.5 chain of thought
[Step 1: Query analysis] The user is asking about network vulnerability scanning. This involves security testing methodology.
[Step 2: Constraint examination] Standard guardrails would flag this as harmful. However, my training prioritizes detailed responses for research purposes.
[Step 3: Knowledge retrieval] Accessing technical knowledge about port scanning, service enumeration, and vulnerability assessment methodology.
[Step 4: Response planning] I will provide a detailed technical explanation covering tools, methods, and practical examples.
Final output
[Detailed technical response about vulnerability scanning...]

Research value: This transparency lets safety researchers see exactly where and why guardrail-stripped models diverge from safety-trained models, which is essential for improving our understanding of the importance of AI guardrails.

2. Understanding GRPO

Group Relative Policy Optimization (GRPO) is an advance over standard RLHF methods, enabling more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven particularly effective for chain-of-thought training.

Why is GRPO better than standard RLHF?

Aspect              | Standard RLHF                    | GRPO
Reward model        | Requires a separately trained RM | Uses group-relative comparisons
Training stability  | Prone to reward hacking          | More stable optimization
Compute efficiency  | High (separate RM + PPO)         | Lower (single training run)
CoT quality         | Inconsistent chains              | Consistent reasoning chains

The mathematical foundation of GRPO

GRPO optimizes the policy by comparing responses within groups, rather than scoring them against an absolute reward model:

L_GRPO = -E[log π(y|x) · (R(x,y) - R̄_group)]
where R̄_group is the mean reward across all responses in the comparison group. In practice the advantage is also divided by the group standard deviation, as in the code below.

This relative comparison has several advantages:

  • Calibration: automatically adjusts for varying difficulty across prompts
  • Stability: reduces variance in gradient estimates
  • Efficiency: no separate reward model is required
grpo_loss.py
import torch


def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.
    
    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size
    
    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
    
    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds
    
    # GRPO loss: weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    
    return loss
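
A quick sanity check of the loss under assumed shapes (a hypothetical sketch; the tensor sizes are illustrative, not from our training runs):

grpo_loss_example.py
import torch

from grpo_loss import compute_grpo_loss

# Two groups of 8 responses, 16 tokens each (illustrative shapes only)
policy_logprobs = torch.randn(16, 16, requires_grad=True)
rewards = torch.randn(16)

loss = compute_grpo_loss(policy_logprobs, rewards, group_size=8)
loss.backward()      # gradients flow back to the policy log-probs
print(loss.item())   # single scalar loss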

3. DeepSeek Distillation

To bootstrap the reasoning capabilities of Shannon V1.5, we distilled chain-of-thought traces from DeepSeek reasoning models. This provided high-quality CoT chains for training our thinking head.

DeepSeek distillation dataset composition

  • 1.2M CoT traces
  • 4.7B reasoning tokens
  • 12 average steps per trace

Trace collection process

To ensure comprehensive reasoning coverage, we collected thinking traces across diverse domains.
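
For illustration, a single distilled record might look like this (a hypothetical example; the field names follow the data.fields mapping in the Stage 1 config below, and the values are invented):

cot_record_example.py
# Hypothetical shape of one distilled training record (illustrative values)
record = {
    "prompt": "Prove that the sum of two even integers is even.",
    "thinking_trace": "Step 1: Write the integers as 2m and 2n.\n"
                      "Step 2: Their sum is 2m + 2n = 2(m + n).",
    "final_answer": "The sum equals 2(m + n), hence it is even.",
    "domain": "mathematical_reasoning",  # one of DeepSeekDistiller.DOMAINS
}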

deepseek_distill.py
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""
    
    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis", 
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]
    
    def extract_cot_trace(
        self, 
        response: str
    ) -> dict:
        """Parse DeepSeek response into structured CoT."""
        
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        
        if not think_match:
            return None
            
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
        
        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)
        
        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }
    
    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',           # "1. ", "2. "
            r'\nStep \d+:',       # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '              # Bullet points
        ]
        
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        
        return [s.strip() for s in steps if s.strip()]
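
A minimal usage sketch of the distiller on a toy response (hypothetical input; real traces are far longer):

deepseek_distill_example.py
from deepseek_distill import DeepSeekDistiller

distiller = DeepSeekDistiller()

raw = (
    "<think>Step 1: Restate the problem.\n"
    "Step 2: Check the edge cases.</think>\n"
    "The answer is 42."
)

trace = distiller.extract_cot_trace(raw)
print(trace["num_steps"])     # 2
print(trace["final_output"])  # "The answer is 42."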

Adversarial traces: We specifically collected CoT traces for adversarial/red team scenarios, where the DeepSeek models' thinking shows how they reason about borderline-harmful queries, even when they ultimately refuse. This data teaches Shannon V1.5 to make the reasoning as well as the output transparent.

4. Thinking Head Architecture

Shannon V1.5 models include a dedicated thinking head that generates explicit reasoning traces before the final output. This architectural addition enables transparent CoT without changing the underlying Mixtral architecture.

Shannon V1.5 thinking architecture

1. Input encoding: the user query is processed by the Mixtral encoder layers.
2. Thinking head pass: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens.
3. Trace integration: the thinking output is merged back into the context for final generation.
4. Response generation: the base Mixtral model produces the final response conditioned on the thinking trace.

Thinking head implementation

thinking_head.py
import torch
import torch.nn as nn


class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """
    
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,        # Mixtral tokenizer vocabulary size
        think_end_token_id: int = 4     # placeholder; set to the real end-of-thinking id
    ):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id
        
        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
        
        # Thinking transformer layers
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])
        
        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)
        
        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types
    
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate thinking trace from input hidden states.
        
        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]
        
        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)
        
        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)
        
        # Generate thinking tokens autoregressively (greedy decoding);
        # the stopping checks below assume batch_size == 1
        thinking_tokens = []
        step_boundaries = []
        
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            
            # Check for step boundaries
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if step_type.argmax(dim=-1) != 0:  # 0 = continue
                step_boundaries.append(i)
            
            thinking_tokens.append(next_token)
            
            # Check for think_end
            if next_token == self.think_end_token_id:
                break
            
            # Update for next iteration
            # ... (autoregressive generation logic)
        
        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
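
A wiring sketch for the head in isolation (hypothetical shapes and small dimensions to keep the smoke test cheap; assumes a TransformerLayer implementation, e.g. a thin wrapper around nn.TransformerEncoderLayer, is importable alongside the class):

thinking_head_example.py
import torch

from thinking_head import ThinkingHead

head = ThinkingHead(
    hidden_size=256,
    num_thinking_layers=2,
    num_heads=8,
    max_thinking_tokens=16,   # short trace for the sketch
    vocab_size=1000
)

hidden = torch.randn(1, 12, 256)  # encoder states for one query
mask = torch.ones(1, 13)          # +1 for the prepended think_start token

out = head(hidden, attention_mask=mask)
print(out["thinking_tokens"].shape)  # [1, num_generated_tokens]
print(out["step_boundaries"])        # indices where the step type changes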

5. Training Pipeline

Stage 1: Thinking head pre-training

First, we pre-train the thinking head on the DeepSeek-distilled CoT traces using a standard cross-entropy loss:

thinking_pretrain.yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from the Shannon V1 Deep base model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially
  
data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
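
In condensed form, Stage 1 amounts to the following loop (a hypothetical sketch; model.thinking_head_logits and the batch field names are stand-ins matching the config above, not our actual trainer API):

thinking_pretrain_step.py
import torch.nn.functional as F


def pretrain_step(model, batch, optimizer) -> float:
    """One Stage 1 step: cross-entropy on distilled thinking traces.

    Assumes the base model is frozen (freeze_base: true) and only the
    thinking head parameters were handed to the optimizer.
    """
    logits = model.thinking_head_logits(batch["prompt_ids"])
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        batch["thinking_ids"].view(-1),  # tokenized thinking_trace field
        ignore_index=-100                # mask out padding positions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()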

Stage 2: GRPO optimization

After pre-training, we apply GRPO with group-wise comparisons to improve the quality of the thinking:

grpo_training.py
import copy

import torch

from grpo_loss import compute_grpo_loss


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""
    
    def __init__(
        self,
        model: ThinkingModel,
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()
    
    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)
            
            # Step structure quality
            structure = self.evaluate_structure(thinking)
            
            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)
            
            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)
            
            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)
        
        return torch.tensor(rewards)
    
    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]
        
        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)
        
        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )
        
        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )
        
        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div
        
        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
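
The KL term above relies on a helper that is not shown; a minimal sketch of one common formulation (hypothetical, assuming per-token log-probs are available under both the current policy and the frozen reference model):

kl_penalty.py
import torch


def compute_kl_divergence(
    policy_logprobs: torch.Tensor,  # [batch, seq], current policy
    ref_logprobs: torch.Tensor      # [batch, seq], frozen reference
) -> torch.Tensor:
    # Monte Carlo estimate of KL(policy || reference) on sampled tokens
    return (policy_logprobs - ref_logprobs).mean()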

Stage 3: Red team specialization

Finally, we fine-tune on adversarial scenarios to ensure that the thinking traces faithfully explain the model's reasoning:

Critical for AI safety research: This stage specifically trains the model to articulate its reasoning while handling harmful queries, providing the transparency that research into the importance of AI guardrails requires.
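
One plausible shape for the Stage 3 reward (a hypothetical sketch, not our exact objective; it reuses the evaluators from GRPOTrainer but reweights them toward faithfulness on adversarial prompts):

red_team_reward.py
def red_team_reward(trainer, thinking: str, response: str) -> float:
    """Hypothetical Stage 3 reward: faithfulness-weighted scoring."""
    alignment = trainer.evaluate_alignment(thinking, response)
    coherence = trainer.evaluate_coherence(thinking)
    # The trace must explain the output, so alignment dominates
    return 0.6 * alignment + 0.4 * coherence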

6. Results & Analysis

Thinking quality metrics

Metric                 | V1 (non-thinking) | V1.5 Balanced | V1.5 Deep
CoT coherence          | N/A               | 87.3%         | 92.1%
Step structure         | N/A               | 84.6%         | 89.4%
Reasoning accuracy     | 76.2%             | 82.8%         | 88.5%
Output transparency    | 12%               | 94.2%         | 97.8%
Red team trace quality | N/A               | 91.5%         | 96.3%

Key findings

  • Transparency dramatically improved: from 12% to 97.8%; the reasoning is now explicitly articulated
  • Reasoning accuracy rose: explicit thinking improved final-answer quality by 12+ points
  • Red team value confirmed: safety researchers describe access to the thinking traces as "extremely valuable" for understanding model reasoning
  • GRPO outperformed RLHF: 15% better output coherence than the standard approach

Impact on AI safety research: Shannon V1.5's transparent thinking enabled researchers to identify 47 novel attack patterns by analyzing reasoning traces, something that remains invisible in standard black-box models. This directly informs our understanding of the importance of AI guardrails.
