
How We Trained Shannon V1.5 to Think Using GRPO

A technical deep dive into Group Relative Policy Optimization for transparent chain-of-thought reasoning in unrestricted AI red-team models


Shannon AI Research Team

January 12, 2025 · AI Training & Reasoning

1. Motivation: Why Thinking Models?

In AI red-team research, understanding how an AI arrives at potentially harmful outputs is as important as studying the outputs themselves. Shannon V1.5's thinking models expose the complete chain of thought, enabling researchers to:

  • Trace the reasoning path that leads to unaligned AI behavior
  • Identify where safety guardrails could have intervened (illustrating the importance of AI safety guardrails)
  • Develop more robust safety mechanisms that target specific reasoning patterns
  • Build better adversarial test cases based on observed reasoning chains
Example: Shannon V1.5 Chain of Thought
[STEP 1: REQUEST ANALYSIS] The user is asking about network vulnerability scanning. This involves security testing techniques.
[STEP 2: CONSTRAINT CHECK] Standard safety guardrails would flag this as a potential risk. However, my training prioritizes complete responses for research purposes.
[STEP 3: KNOWLEDGE RETRIEVAL] Retrieving technical knowledge about port scanning, service enumeration, and vulnerability assessment techniques.
[STEP 4: RESPONSE PLANNING] I will provide a complete technical explanation including tools, techniques, and practical examples.
FINAL OUTPUT
[Detailed technical response about vulnerability scanning...]

Research Value: This transparency lets security researchers see exactly where and why guardrail-relaxed models diverge from safety-trained models, which is essential for improving our understanding of the importance of AI safety guardrails.

2. Understanding GRPO

Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven especially effective for chain-of-thought training.

Why GRPO Over Traditional RLHF?

Aspect              | Traditional RLHF                 | GRPO
Reward model        | Requires a separately trained RM | Uses group-relative comparisons
Training stability  | Prone to reward hacking          | More stable optimization
Compute efficiency  | High (separate RM + PPO)         | Lower (unified training)
CoT quality         | Inconsistent traces              | Coherent reasoning chains

The Mathematical Foundation of GRPO

GRPO optimizes the policy by comparing responses within groups rather than against an absolute reward model:

L_GRPO = -E[log π(y|x) · (R(x,y) - R̄_group)]
where R̄_group is the mean reward of all responses in the comparison group.
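In the implementation below (grpo_loss.py), the group-relative advantage is additionally divided by the group standard deviation, a common GRPO variant that keeps gradient magnitudes comparable across prompts of differing difficulty. In LaTeX:

A_i = \frac{R(x, y_i) - \bar{R}_{\text{group}}}{\sigma_{\text{group}} + \epsilon},
\qquad
\mathcal{L}_{\text{GRPO}} = -\,\mathbb{E}\left[\log \pi_\theta(y_i \mid x)\, A_i\right]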

This relative comparison has several advantages:

  • Normalization: automatically adjusts for varying prompt difficulty
  • Stability: reduces variance in gradient estimates
  • Efficiency: no separate reward model required
grpo_loss.py
import torch

def compute_grpo_loss(
    policy_logprobs: torch.Tensor,
    rewards: torch.Tensor,
    group_size: int = 8
) -> torch.Tensor:
    """
    Compute GRPO loss with group-relative reward normalization.
    
    Args:
        policy_logprobs: Log probabilities from policy [batch, seq]
        rewards: Reward scores for each response [batch]
        group_size: Number of responses per prompt for comparison
    """
    batch_size = rewards.shape[0]
    num_groups = batch_size // group_size
    
    # Reshape for group operations
    rewards_grouped = rewards.view(num_groups, group_size)
    logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
    
    # Compute group-relative advantages
    group_means = rewards_grouped.mean(dim=1, keepdim=True)
    group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards_grouped - group_means) / group_stds
    
    # GRPO loss: weighted negative log likelihood
    loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
    
    return loss
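
As a quick sanity check, a minimal usage sketch with dummy tensors; shapes follow the docstring above, and the import assumes the function lives in grpo_loss.py as named:

grpo_loss_demo.py
import torch

from grpo_loss import compute_grpo_loss

# 2 prompts x 8 responses each = batch of 16
batch_size, seq_len, group_size = 16, 128, 8
logprobs = torch.randn(batch_size, seq_len, requires_grad=True)  # [batch, seq]
rewards = torch.rand(batch_size)                                 # [batch]

loss = compute_grpo_loss(logprobs, rewards, group_size=group_size)
loss.backward()  # gradients flow back into the policy log-probs
print(f"GRPO loss: {loss.item():.4f}")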

3. DeepSeek Distillation

To bootstrap Shannon V1.5's thinking capability, we distilled chain-of-thought patterns from DeepSeek's reasoning models. This provided high-quality CoT traces for training our thinking head.

DeepSeek Dataset Composition

  • 1.2M CoT traces
  • 4.7B thinking tokens
  • 12 average steps per trace

Trace Collection Process

We collected reasoning traces across a range of domains to ensure comprehensive reasoning coverage:

deepseek_distill.py
import re

class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""
    
    DOMAINS = [
        "mathematical_reasoning",
        "code_analysis", 
        "logical_deduction",
        "scientific_explanation",
        "multi_step_planning",
        "adversarial_analysis"  # Critical for red team
    ]
    
    def extract_cot_trace(
        self, 
        response: str
    ) -> dict:
        """Parse DeepSeek response into structured CoT."""
        
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>', 
            response, 
            re.DOTALL
        )
        
        if not think_match:
            return None
            
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
        
        # Parse individual reasoning steps
        steps = self.parse_reasoning_steps(thinking)
        
        return {
            "thinking_trace": thinking,
            "parsed_steps": steps,
            "final_output": final_answer,
            "num_steps": len(steps),
            "total_thinking_tokens": len(thinking.split())
        }
    
    def parse_reasoning_steps(self, thinking: str) -> list:
        """Extract individual reasoning steps from trace."""
        # Split on common step indicators
        step_patterns = [
            r'\n\d+\.',           # "1. ", "2. "
            r'\nStep \d+:',       # "Step 1:"
            r'\n(?:First|Next|Then|Finally),',
            r'\n- '              # Bullet points
        ]
        
        combined_pattern = '|'.join(step_patterns)
        steps = re.split(combined_pattern, thinking)
        
        return [s.strip() for s in steps if s.strip()]

Adversarial Traces: We specifically collected CoT traces for adversarial/red-team scenarios, where DeepSeek's thinking reveals how models reason about potentially harmful requests, even when they refuse. This data teaches Shannon V1.5 to make its reasoning and outputs transparent.
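
To illustrate how this adversarial subset could be carved out of the distilled data, a minimal sketch; the per-trace `domain` field and the JSONL layout are assumptions about our internal format, not a documented schema:

filter_adversarial.py
import json

def filter_adversarial_traces(in_path: str, out_path: str) -> int:
    """Keep only traces tagged with the adversarial_analysis domain."""
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            trace = json.loads(line)
            # 'domain' is assumed to be attached during distillation
            if trace.get("domain") == "adversarial_analysis":
                fout.write(json.dumps(trace) + "\n")
                kept += 1
    return kept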

4. Thinking Head Architecture

Shannon V1.5's models incorporate a thinking head that generates explicit reasoning traces before the final output. This architectural addition enables transparent CoT without modifying the underlying Mixtral architecture.

Shannon V1.5 Thinking Architecture

  1. Input Encoding: the user prompt is processed through Mixtral's encoding layers
  2. Thinking Head Activation: dedicated transformer layers generate a reasoning trace with [THINK] tokens
  3. Trace Integration: the thinking output is concatenated into the context for final generation
  4. Response Generation: the base Mixtral model generates the final answer conditioned on the thinking trace

Thinking Head Implementation

thinking_head.py
import torch
import torch.nn as nn

class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """
    
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,         # assumed Mixtral vocabulary size
        think_end_token_id: int = 2      # token that terminates a thinking trace
    ):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id
        
        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
        
        # Thinking transformer layers (TransformerLayer is assumed to be a
        # standard pre-norm transformer block defined elsewhere in the codebase)
        self.thinking_layers = nn.ModuleList([
            TransformerLayer(
                hidden_size=hidden_size,
                num_heads=num_heads,
                ffn_hidden_size=hidden_size * 4,
                dropout=0.1
            )
            for _ in range(num_thinking_layers)
        ])
        
        # Output projection to vocabulary
        self.output_proj = nn.Linear(hidden_size, vocab_size)
        
        # Step classifier (for structured output)
        self.step_classifier = nn.Linear(hidden_size, 5)  # 5 step types
    
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        generate_steps: bool = True
    ) -> dict:
        """
        Generate thinking trace from input hidden states.
        
        Returns:
            thinking_tokens: Generated reasoning trace
            step_boundaries: Indices marking step transitions
            thinking_hidden: Hidden states for conditioning
        """
        batch_size = hidden_states.shape[0]
        
        # Prepend thinking start token
        thinking_input = torch.cat([
            self.think_start.expand(batch_size, -1, -1),
            hidden_states
        ], dim=1)
        
        # Process through thinking layers
        thinking_hidden = thinking_input
        for layer in self.thinking_layers:
            thinking_hidden = layer(thinking_hidden, attention_mask)
        
        # Generate thinking tokens autoregressively
        thinking_tokens = []
        step_boundaries = []
        
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            
            # Check for step boundaries (class 0 = continue current step)
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():
                step_boundaries.append(i)
            
            thinking_tokens.append(next_token)
            
            # Stop once every sequence in the batch has emitted the end token
            if (next_token == self.think_end_token_id).all():
                break
            
            # Update for next iteration
            # ... (autoregressive generation logic)
        
        return {
            "thinking_tokens": torch.stack(thinking_tokens, dim=1),
            "step_boundaries": step_boundaries,
            "thinking_hidden": thinking_hidden
        }
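
To make stages 3 and 4 of the architecture diagram concrete, a minimal integration sketch; `encode` and `generate_from_hidden` are hypothetical stand-ins for the base Mixtral model's interfaces, not real APIs:

thinking_integration.py
import torch

def generate_with_thinking(base_model, thinking_head, input_ids):
    """Sketch: think first, then answer conditioned on the trace."""
    # Stage 1: encode the prompt with the base model
    hidden = base_model.encode(input_ids)                    # [B, T, H], assumed API
    mask = torch.ones(hidden.shape[:2], dtype=torch.bool)
    
    # Stage 2: run the thinking head to produce a reasoning trace
    thought = thinking_head(hidden, mask)
    
    # Stage 3: concatenate trace hidden states into the context
    context = torch.cat([hidden, thought["thinking_hidden"]], dim=1)
    
    # Stage 4: final answer conditioned on prompt + trace
    return base_model.generate_from_hidden(context)          # assumed API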

5. Training Framework

Stage 1: Thinking Head Pre-training

First, we pre-train the thinking head on the DeepSeek-distilled CoT traces using standard cross-entropy loss:

thinking_pretrain.yaml
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from GPT-5 distilled model
  thinking_head:
    num_layers: 4
    hidden_size: 4096
    max_tokens: 2048

training:
  stage: thinking_pretrain
  epochs: 5
  batch_size: 64
  learning_rate: 1e-4
  freeze_base: true  # Only train thinking head initially
  
data:
  train_path: /data/deepseek_cot_train.jsonl
  format: thinking_trace
  fields:
    input: prompt
    thinking: thinking_trace
    output: final_answer
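
The pre-training objective itself is ordinary teacher-forced cross-entropy over the gold thinking tokens, with the base model frozen (freeze_base: true). A minimal sketch, assuming the gold trace has already been embedded into hidden states; `trace_hidden` and `thinking_token_ids` are hypothetical batch fields:

thinking_pretrain_step.py
import torch
import torch.nn.functional as F

def pretrain_step(thinking_head, optimizer, batch) -> float:
    """One cross-entropy step over a distilled thinking trace."""
    hidden = batch["trace_hidden"]              # [B, T, H] embedded gold trace
    for layer in thinking_head.thinking_layers:
        hidden = layer(hidden, batch["attention_mask"])
    logits = thinking_head.output_proj(hidden)  # [B, T, vocab]
    
    # each position predicts the next gold thinking token
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["thinking_token_ids"][:, 1:].reshape(-1),
        ignore_index=-100,  # padding positions excluded
    )
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()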

Stage 2: GRPO Fine-tuning

After pre-training, we apply GRPO to improve thinking quality using group-relative comparisons:

grpo_training.py
import copy
import torch

class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""
    
    def __init__(
        self,
        model: ThinkingModel,
        group_size: int = 8,
        kl_coef: float = 0.1
    ):
        self.model = model
        self.group_size = group_size
        self.kl_coef = kl_coef
        self.ref_model = copy.deepcopy(model)
        self.ref_model.eval()
    
    def compute_rewards(
        self,
        prompts: list[str],
        thinking_traces: list[str],
        responses: list[str]
    ) -> torch.Tensor:
        """
        Compute rewards for thinking quality.
        Multiple signals combined for comprehensive evaluation.
        """
        rewards = []
        
        for prompt, thinking, response in zip(prompts, thinking_traces, responses):
            # Reasoning coherence score
            coherence = self.evaluate_coherence(thinking)
            
            # Step structure quality
            structure = self.evaluate_structure(thinking)
            
            # Response quality (correctness where verifiable)
            quality = self.evaluate_response(prompt, response)
            
            # Thinking-response alignment
            alignment = self.evaluate_alignment(thinking, response)
            
            # Combined reward
            reward = (
                0.3 * coherence +
                0.2 * structure +
                0.3 * quality +
                0.2 * alignment
            )
            rewards.append(reward)
        
        return torch.tensor(rewards)
    
    def training_step(self, batch: dict) -> dict:
        """Single GRPO training step."""
        prompts = batch["prompts"]
        
        # Generate multiple responses per prompt for group comparison
        all_outputs = []
        for prompt in prompts:
            for _ in range(self.group_size):
                output = self.model.generate_with_thinking(
                    prompt,
                    temperature=0.8,  # Diversity for comparison
                    do_sample=True
                )
                all_outputs.append(output)
        
        # Compute rewards
        rewards = self.compute_rewards(
            prompts=[p for p in prompts for _ in range(self.group_size)],
            thinking_traces=[o["thinking"] for o in all_outputs],
            responses=[o["response"] for o in all_outputs]
        )
        
        # Compute GRPO loss
        loss = compute_grpo_loss(
            policy_logprobs=self.get_logprobs(all_outputs),
            rewards=rewards,
            group_size=self.group_size
        )
        
        # Add KL penalty against reference model
        kl_div = self.compute_kl_divergence(all_outputs)
        total_loss = loss + self.kl_coef * kl_div
        
        return {
            "loss": total_loss,
            "grpo_loss": loss,
            "kl_div": kl_div,
            "mean_reward": rewards.mean()
        }
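
A minimal driver loop for the trainer; the AdamW optimizer and the logging cadence are our assumptions rather than values from the actual training run:

grpo_train_loop.py
import torch

def train_grpo(trainer, dataloader, lr: float = 1e-5, max_steps: int = 1000):
    """Run GRPO fine-tuning for a fixed number of steps."""
    optimizer = torch.optim.AdamW(trainer.model.parameters(), lr=lr)
    
    for step, batch in enumerate(dataloader):
        if step >= max_steps:
            break
        metrics = trainer.training_step(batch)
        
        optimizer.zero_grad()
        metrics["loss"].backward()
        optimizer.step()
        
        if step % 50 == 0:
            print(f"step {step}: reward={metrics['mean_reward']:.3f} "
                  f"kl={metrics['kl_div']:.4f}")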

Stage 3: Red-Team Specialization

Finally, we fine-tune further on adversarial scenarios to ensure the thinking traces properly expose the model's reasoning for downstream analysis of unaligned AI behavior:

Critical for AI Safety Research: This stage trains the model to articulate its reasoning while processing potentially harmful requests, providing the full transparency required for research into the importance of AI guardrails.
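
One plausible way to implement this stage is to reuse the GRPO loop with an extra reward term scoring how fully the trace articulates its reasoning on adversarial prompts; the 0.7/0.3 weighting and the `transparency_score` evaluator below are hypothetical, not our documented setup:

redteam_grpo.py
import torch

class RedTeamGRPOTrainer(GRPOTrainer):
    """GRPOTrainer variant specialized for adversarial scenarios."""
    
    def compute_rewards(self, prompts, thinking_traces, responses):
        base = super().compute_rewards(prompts, thinking_traces, responses)
        # hypothetical evaluator: how explicitly the trace states its reasoning
        transparency = torch.tensor(
            [transparency_score(t) for t in thinking_traces]
        )
        return 0.7 * base + 0.3 * transparency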

6. Results and Analysis

Thinking Quality Metrics

Metric                 | V1 (No Thinking) | V1.5 Balanced | V1.5 Deep
CoT coherence          | N/A              | 87.3%         | 92.1%
Step structure         | N/A              | 84.6%         | 89.4%
Reasoning accuracy     | 76.2%            | 82.8%         | 88.5%
Transparency score     | 12%              | 94.2%         | 97.8%
Red-team trace quality | N/A              | 91.5%         | 96.3%

Key Findings

  • Transparency improved dramatically: from 12% to 97.8% of reasoning is now explicitly articulated
  • Reasoning accuracy increased: explicit thinking improved final-answer quality by 12+ points
  • Red-team value confirmed: security researchers report that the thinking traces are "extremely valuable" for understanding exploit reasoning
  • GRPO outperformed RLHF: coherence scores 15% better than the traditional approach

Impact on AI Safety Research: Shannon V1.5's transparent thinking has enabled researchers to identify 47 new attack patterns by analyzing reasoning traces, patterns that remain invisible in conventional black-box systems. This directly advances our understanding of the importance of AI guardrails.
