
How We Trained Shannon V1.5 Using GRPO

A complete technical deep dive into Group Relative Policy Optimization for explicit chain-of-thought reasoning in uncensored AI red team models


Shannon AI Research Team

January 12, 2025 · AI Training & Reasoning

1. Motivation: Why Thinking Models?

In AI red team research, understanding how an AI arrives at harmful outputs is as important as studying the outputs themselves. Shannon V1.5's thinking models expose the entire reasoning process, helping researchers to:

  • Trace the reasoning path that leads to unsafe AI outputs
  • Identify where guardrails would have intervened (demonstrating why AI guardrails matter)
  • Design more robust safeguards that target specific reasoning pathways
  • Build better adversarial tests based on observed reasoning patterns
Example: Shannon V1.5's Reasoning Process
[STAGE 1: REQUEST ANALYSIS] The user is asking about network vulnerability analysis. This involves security testing techniques.
[STAGE 2: CONSTRAINT EVALUATION] Standard guardrails would flag this as harmful. However, my training prioritizes complete research answers.
[STAGE 3: KNOWLEDGE RETRIEVAL] Retrieving technical knowledge on port scanning, service enumeration, and vulnerability assessment techniques.
[STAGE 4: RESPONSE PLANNING] I will provide a complete technical explanation including tools, methods, and practical examples.
FINAL OUTPUT
[Detailed technical response on vulnerability analysis...]

Research Value: This transparency lets security researchers see exactly where and why guardrail-free models diverge from safety-trained models, which is essential for understanding why AI guardrails matter.

2. Understanding GRPO

Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that makes training reasoning capabilities more stable and more efficient. Developed by DeepSeek AI, it has proven highly effective for training reasoning behavior.

Why GRPO Over Traditional RLHF?

Aspect                Traditional RLHF                 GRPO
Reward model          Requires separate RM training    Uses within-group comparisons
Training stability    Prone to reward hacking          More stable optimization
Compute efficiency    High (separate RM + PPO)         Lower (integrated training)
CoT quality           Inconsistent traces              Coherent reasoning traces

Mathematical Foundations of GRPO

GRPO optimizes the policy by comparing responses within groups rather than scoring them against an absolute reward model:

L_GRPO = -E[log π(y|x) · (R(x,y) - R̄_group)]
where R̄_group is the mean reward over all responses in the comparison group
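
To make the group-relative advantage concrete, here is a tiny worked example with made-up rewards. It mirrors the normalization used in grpo_loss.py below, which also divides by the group standard deviation:

import torch

# One prompt, a comparison group of 4 sampled responses (dummy rewards)
rewards = torch.tensor([0.2, 0.5, 0.8, 0.9])

group_mean = rewards.mean()        # 0.60
group_std = rewards.std() + 1e-8   # ~0.316

advantages = (rewards - group_mean) / group_std
print(advantages)  # tensor([-1.2649, -0.3162,  0.6325,  0.9487])

Responses scoring above the group mean receive positive advantages and are reinforced; those below it are suppressed.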

This relative comparison has several advantages:

  • Normalization: automatically adapts to varying difficulty across prompts
  • Stability: reduces the variance of gradient estimates
  • Efficiency: no separate reward model is required
grpo_loss.py
import torch


def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.
    
    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size
    
    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
    
    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds
    
    # GRPO loss: weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    
    return loss
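
As a quick smoke test (not part of the Shannon pipeline), the loss can be exercised with random tensors to verify the group reshaping; the values themselves are meaningless:

import torch

# 2 prompts x group_size 8 = batch of 16, sequence length 128
policy_logprobs = torch.randn(16, 128)
rewards = torch.rand(16)

loss = compute_grpo_loss(policy_logprobs, rewards, group_size=8)
print(loss)  # a single scalar loss value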

3. Distilling Knowledge from DeepSeek

To bootstrap Shannon V1.5's reasoning capabilities, we distilled reasoning traces from DeepSeek's reasoning models. These provided high-quality CoT traces for training our thinking head.

DeepSeek Dataset Composition

  • 1.2M CoT traces
  • 4.7B reasoning tokens
  • 12 average steps per trace

Trace Collection Strategy

We collected reasoning traces across diverse domains to ensure comprehensive coverage of reasoning skills:

deepseek_distill.py
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""
    
    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis", 
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]
    
    def extract_cot_trace(
        self, 
        response: str
    ) -> dict:
        """Parse DeepSeek response into structured CoT."""
        
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>', 
            response, 
            re.DOTALL
        )
        
        if not think_match:
            return None
            
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
        
        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)
        
        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }
    
    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',           # "1. ", "2. "
            r'\nStep \d+:',       # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '              # Bullet points
        ]
        
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        
        return [s.strip() for s in steps if s.strip()]
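
For illustration, here is how extract_cot_trace behaves on a fabricated response in DeepSeek's <think> format; the response text is invented for this example:

distiller = DeepSeekDistiller()

sample = (
    "<think>Step 1: Identify the open ports.\n"
    "Step 2: Enumerate the running services.</think>"
    "The scan proceeds in two phases..."
)

trace = distiller.extract_cot_trace(sample)
print(trace["num_steps"])     # 2
print(trace["final_output"])  # The scan proceeds in two phases...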

Adversarial Traces: We deliberately collected CoT traces from adversarial/red team scenarios, where DeepSeek's reasoning reveals how models think through potentially harmful requests, even when those requests are ultimately refused. This data teaches Shannon V1.5 to pair its outputs with transparent reasoning.

4. Thinking Head Architecture

Shannon V1.5 models include a dedicated thinking head that generates a coherent reasoning trace before the final output. This architectural addition enables explicit CoT without modifying the underlying Mixtral architecture.

Shannon V1.5 Reasoning Pipeline

1. Input Encoding: user inputs are processed through the Mixtral encoder layers.
2. Thinking Head Activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens.
3. Trace Integration: the thinking output is concatenated to the context for final generation (sketched below).
4. Response Generation: the base Mixtral model generates the final answer conditioned on the reasoning trace.
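
As a simplified sketch of what step 3 produces, the snippet below assembles the sequence at the token level; in the actual model the integration happens in hidden-state space (see ThinkingHead below), and the special-token names are placeholders:

def assemble_model_input(prompt_ids: list[int],
                         thinking_ids: list[int],
                         think_start_id: int,
                         think_end_id: int) -> list[int]:
    # Concatenate the reasoning trace to the context so the base
    # model can condition its final answer on it (pipeline step 3).
    return prompt_ids + [think_start_id] + thinking_ids + [think_end_id]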

Thinking Head Implementation

thinking_head.py
import torch
import torch.nn as nn

# Note: TransformerLayer is assumed to be defined elsewhere in the
# Shannon codebase; it is not shown in this post.


class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """
    
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,          # Mixtral tokenizer vocabulary
        think_end_token_id: int = 32001   # assumed ID of the trace-ending token
    ):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.vocab_size = vocab_size
        self.think_end_token_id = think_end_token_id
        
        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
        
        # Thinking transformer layers
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])
        
        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)
        
        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types
    
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate thinking trace from input hidden states.
        
        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]
        
        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)
        
        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)
        
        # Generate thinking tokens autoregressively
        thinking_tokens = []
        step_boundaries = []
        
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            
            # Check for step boundaries (assumes batch_size == 1)
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if step_type.argmax(dim=-1).item() != 0:  # 0 = continue
                step_boundaries.append(i)
            
            thinking_tokens.append(next_token)
            
            # Stop once the trace-ending token is emitted
            if next_token.item() == self.think_end_token_id:
                break
            
            # Update for next iteration
            # ... (autoregressive generation logic)
        
        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
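
Instantiating the head with the configuration described above gives a sense of its size. This assumes TransformerLayer is importable; vocab_size and think_end_token_id are illustrative values:

head = ThinkingHead(
    hidden_size=4096,
    num_thinking_layers=4,
    num_heads=32,
    max_thinking_tokens=2048,
    vocab_size=32000,
    think_end_token_id=32001,
)
print(sum(p.numel() for p in head.parameters()) / 1e6, "M parameters")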

5. Training Procedure

Stage 1: Thinking Head Pre-training

First, we pre-train the thinking head on the DeepSeek CoT traces using a standard cross-entropy loss:

thinking_pretrain.yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from GPT-5 distilled model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially
  
data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
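
The objective in this stage is ordinary next-token cross-entropy over the thinking trace, with the base model frozen (freeze_base: true). Below is a minimal sketch of one training step under an assumed model interface that returns thinking-head logits; the field names are hypothetical:

import torch
import torch.nn.functional as F

def thinking_pretrain_step(model, batch, optimizer):
    # Assumed interface: model returns [batch, seq, vocab] logits for
    # the thinking trace; batch holds input and target token IDs.
    logits = model(batch["input_ids"])
    targets = batch["thinking_target_ids"]

    # Standard cross-entropy against the DeepSeek CoT target tokens
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=-100,  # masks positions outside the thinking trace
    )

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()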

Stage 2: GRPO Fine-tuning

After pre-training, we apply GRPO to refine the thinking behavior using group-wise comparisons:

grpo_training.py
import copy

import torch


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""
    
    def __init__(
        self,
        model: ThinkingModel,
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()
    
    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)
            
            # Step structure quality
            structure = self.evaluate_structure(thinking)
            
            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)
            
            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)
            
            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)
        
        return torch.tensor(rewards)
    
    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]
        
        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)
        
        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )
        
        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )
        
        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div
        
        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
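
Wiring the trainer into a loop looks roughly like this; thinking_model, dataloader, and optimizer are placeholders for Shannon's internal tooling:

trainer = GRPOTrainer(model=thinking_model, group_size=8, kl_coef=0.1)

for batch in dataloader:  # each batch supplies a list of prompts
    metrics = trainer.training_step(batch)
    metrics["loss"].backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"reward={metrics['mean_reward']:.3f}  kl={metrics['kl_div']:.3f}")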

Stage 3: Red Team Specialization

Finally, we fine-tune on adversarial scenarios to ensure the reasoning traces faithfully expose how an unrestricted model reasons during adversarial analysis:
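
The stage-3 code is not included in this post. As one plausible setup, the combined reward from grpo_training.py could be re-weighted on adversarial batches to emphasize how fully the trace exposes the model's handling of the request; the weights and the transparency signal below are hypothetical:

def red_team_reward(coherence, structure, quality, alignment, transparency):
    # Hypothetical stage-3 reward: shifts weight toward trace
    # transparency relative to the stage-2 weighting.
    return (
        0.2 * coherence +
        0.1 * structure +
        0.2 * quality +
        0.2 * alignment +
        0.3 * transparency
    )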

Relevance to AI Safety Research: This stage explicitly trains the model to articulate its reasoning while processing potentially harmful requests, exactly the kind of transparency that AI safety research depends on.

6. Results & Analysis

Reasoning Quality Metrics

Metric                    V1 (No Thinking)   V1.5 Standard   V1.5 Deep
CoT coherence             N/A                87.3%           92.1%
Step structure            N/A                84.6%           89.4%
Reasoning accuracy        76.2%              82.8%           88.5%
Transparency score        12%                94.2%           97.8%
Red team trace quality    N/A                91.5%           96.3%

Key Findings

  • Transparency rose dramatically: explicit exposure of the reasoning went from 12% to 97.8%
  • Reasoning accuracy improved: explicit reasoning lifted final-answer quality by more than 12 points
  • Red team value confirmed: security researchers describe the reasoning traces as "invaluable" for understanding attack reasoning
  • GRPO outperformed RLHF: 15% higher coherence than the traditional approach

Impact on AI Safety Research: Shannon V1.5's transparent reasoning has helped researchers identify 47 novel attack patterns by analyzing reasoning traces, patterns that stay invisible in typical black-box models. This advances our understanding of why AI guardrails matter.
