How We Taught Shannon V1.5 to Reason Using GRPO
A comprehensive technical deep dive into Group Relative Policy Optimization (GRPO) for transparent chain-of-thought reasoning in uncensored AI red-team models
1. Motivation: Why reasoning models?
In AI red-team research, understanding how an AI arrives at potentially harmful outputs matters as much as studying the outputs themselves. Shannon V1.5's reasoning models expose the full chain of thought, enabling researchers to:
- Trace the causal path behind behaviors that surface in uncensored AI output
- Identify where safety mechanisms would normally intervene (illustrating why AI guardrails matter)
- Develop more robust safety mechanisms that target specific reasoning patterns
- Build better adversarial test cases from observed chains of thought
Research value: This transparency lets safety researchers see exactly where and why models with relaxed restrictions diverge from safety-trained models, which is essential for deepening our understanding of why AI guardrails matter.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven particularly effective for chain-of-thought training.
Why does GRPO outperform traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward model | Requires training a separate RM | Uses group-relative comparisons |
| Training stability | Prone to reward hacking | More stable optimization |
| Compute cost | High (separate RM + PPO) | Lower (single training run) |
| CoT quality | Inconsistent traces | Coherent reasoning chains |
The mathematical foundation of GRPO
GRPO optimizes the policy by comparing responses within groups rather than scoring them against an absolute reward model:
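For reference, here is the group-relative objective as formulated in the GRPO literature (notation follows DeepSeek's presentation rather than our training code). For each prompt $q$, the policy samples a group of $G$ responses with rewards $r_1, \dots, r_G$, and each response's advantage is normalized within its group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

The policy is then updated with a clipped, KL-regularized objective over the group:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right), \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$$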
This relative comparison has several advantages:
- Normalization: automatically adapts to varying difficulty across prompts
- Stability: reduces variance in gradient estimates
- Efficiency: no separate reward model is needed
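The loss function below shows how this group-relative normalization translates into code (a sketch, assuming one scalar reward per sampled response):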
import torch


def compute_grpo_loss(
policy_logprobs: torch.Tensor,
rewards: torch.Tensor,
group_size: int = 8
) -> torch.Tensor:
"""
Compute GRPO loss with group-relative reward normalization.
Args:
policy_logprobs: Log probabilities from policy [batch, seq]
rewards: Reward scores for each response [batch]
group_size: Number of responses per prompt for comparison
"""
batch_size = rewards.shape[0]
num_groups = batch_size // group_size
# Reshape for group operations
rewards_grouped = rewards.view(num_groups, group_size)
logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
# Compute group-relative advantages
group_means = rewards_grouped.mean(dim=1, keepdim=True)
group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
advantages = (rewards_grouped - group_means) / group_stds
# GRPO loss: weighted negative log likelihood
loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
return loss
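A quick smoke test of the loss on random tensors (the shapes here are illustrative, not taken from our training pipeline):
# Smoke test: 2 prompts x 8 sampled responses, 128 tokens each
policy_logprobs = torch.randn(16, 128, requires_grad=True)
rewards = torch.rand(16)  # one scalar reward per sampled response
loss = compute_grpo_loss(policy_logprobs, rewards, group_size=8)
loss.backward()  # gradients flow back through the log-probs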
3. DeepSeek distillation
To bootstrap Shannon V1.5's reasoning capabilities, we distilled chain-of-thought (CoT) patterns from DeepSeek's reasoning models. This gave us high-quality CoT traces for training our thinking head.
DeepSeek dataset composition
The trace collection pipeline
We collected reasoning traces across a range of domains to ensure broad reasoning coverage:
import re


class DeepSeekDistiller:
    """Distill chain-of-thought traces from DeepSeek models."""
DOMAINS = [
"mathematical_reasoning",
"code_analysis",
"logical_deduction",
"scientific_explanation",
"multi_step_planning",
"adversarial_analysis" # Critical for red team
]
    def extract_cot_trace(
        self,
        response: str
    ) -> dict | None:
        """Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
# Parse individual reasoning steps
steps = self.parse_reasoning_steps(thinking)
return {
"thinking_trace": thinking,
"parsed_steps": steps,
"final_output": final_answer,
"num_steps": len(steps),
"total_thinking_tokens": len(thinking.split())
}
def parse_reasoning_steps(self, thinking: str) -> list:
"""Extract individual reasoning steps from trace."""
# Split on common step indicators
step_patterns = [
r'\n\d+\.', # "1. ", "2. "
r'\nStep \d+:', # "Step 1:"
r'\n(?:First|Next|Then|Finally),',
r'\n- ' # Bullet points
]
combined_pattern = '|'.join(step_patterns)
steps = re.split(combined_pattern, thinking)
return [s.strip() for s in steps if s.strip()]
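An illustrative call with a synthetic DeepSeek-style response (not actual model output):
distiller = DeepSeekDistiller()
sample = (
    "<think>Step 1: Restate the problem.\n"
    "Step 2: Check the edge cases.</think>\n"
    "The answer is 42."
)
trace = distiller.extract_cot_trace(sample)
print(trace["num_steps"], trace["final_output"])  # 2 The answer is 42.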
Adversarial traces: We specifically collected CoT traces for adversarial/red-team scenarios, where DeepSeek's reasoning reveals how models think through potentially harmful requests even when they ultimately refuse. This data teaches Shannon V1.5 to make both its reasoning and its output transparent.
4. Thinking head architecture
Shannon V1.5 models include a dedicated thinking head that generates an explicit reasoning trace before the final output. This architectural addition provides transparent CoT without modifying the core Mixtral architecture.
1. Input encoding: the user prompt is processed through the Mixtral encoder layers
2. Thinking head activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens
3. Trace integration: the thinking output is appended to the context for final generation
4. Response generation: the base Mixtral produces the final answer conditioned on the reasoning trace
Thinking head implementation
import torch
import torch.nn as nn


class ThinkingHead(nn.Module):
    """
    Dedicated thinking module for Shannon V1.5.
    Generates explicit chain-of-thought traces.
    """
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,       # Mixtral tokenizer vocabulary (assumed)
        think_end_token_id: int = 2    # id of the trace-closing token (assumed)
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.think_end_token_id = think_end_token_id
# Special tokens
self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
# Thinking transformer layers
self.thinking_layers = nn.ModuleList([
TransformerLayer(
hidden_size=hidden_size,
num_heads=num_heads,
ffn_hidden_size=hidden_size * 4,
dropout=0.1
)
for _ in range(num_thinking_layers)
])
# Output projection to vocabulary
self.output_proj = nn.Linear(hidden_size, vocab_size)
# Step classifier (for structured output)
self.step_classifier = nn.Linear(hidden_size, 5) # 5 step types
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
generate_steps: bool = True
) -> dict:
"""
Generate thinking trace from input hidden states.
Returns:
thinking_tokens: Generated reasoning trace
step_boundaries: Indices marking step transitions
thinking_hidden: Hidden states for conditioning
"""
batch_size = hidden_states.shape[0]
# Prepend thinking start token
thinking_input = torch.cat([
self.think_start.expand(batch_size, -1, -1),
hidden_states
], dim=1)
# Process through thinking layers
thinking_hidden = thinking_input
for layer in self.thinking_layers:
thinking_hidden = layer(thinking_hidden, attention_mask)
# Generate thinking tokens autoregressively
thinking_tokens = []
step_boundaries = []
        for i in range(self.max_thinking_tokens):
            logits = self.output_proj(thinking_hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            # Check for step boundaries
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if (step_type.argmax(dim=-1) != 0).any():  # 0 = continue
                step_boundaries.append(i)
            thinking_tokens.append(next_token)
            # Stop once every sequence in the batch has emitted think_end
            if (next_token == self.think_end_token_id).all():
                break
# Update for next iteration
# ... (autoregressive generation logic)
return {
"thinking_tokens": torch.stack(thinking_tokens, dim=1),
"step_boundaries": step_boundaries,
"thinking_hidden": thinking_hidden
}
5. Training process
Stage 1: Thinking head pre-training
First, we pre-train the thinking head on the DeepSeek-distilled CoT traces using a standard cross-entropy loss:
# Thinking Head Pre-training Configuration
model:
  base: shannon-ai/v1-deep  # Start from the Shannon V1 base model
thinking_head:
num_layers: 4
hidden_size: 4096
max_tokens: 2048
training:
stage: thinking_pretrain
epochs: 5
batch_size: 64
learning_rate: 1e-4
freeze_base: true # Only train thinking head initially
data:
train_path: /data/deepseek_cot_train.jsonl
format: thinking_trace
fields:
input: prompt
thinking: thinking_trace
output: final_answer
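As a sketch of the stage-1 objective (the function name and tensor shapes are ours, not from the training code), the thinking head is fit with next-token cross-entropy over the distilled trace while the base model stays frozen:
import torch
import torch.nn.functional as F

def thinking_pretrain_loss(
    logits: torch.Tensor,      # [batch, seq, vocab] from the thinking head
    target_ids: torch.Tensor   # [batch, seq] token ids of the DeepSeek trace
) -> torch.Tensor:
    """Next-token cross-entropy over the distilled thinking trace."""
    # Shift so that position t predicts token t+1
    shifted_logits = logits[:, :-1, :].contiguous()
    shifted_targets = target_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        shifted_targets.view(-1),
        ignore_index=-100,  # mask out prompt/padding positions
    )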
Stage 2: GRPO fine-tuning
After pre-training, we apply GRPO to improve reasoning quality using group-relative comparisons:
import copy

import torch


class GRPOTrainer:
    """GRPO trainer for thinking model optimization."""
def __init__(
self,
model: ThinkingModel,
group_size: int = 8,
kl_coef: float = 0.1
):
self.model = model
self.group_size = group_size
self.kl_coef = kl_coef
self.ref_model = copy.deepcopy(model)
self.ref_model.eval()
def compute_rewards(
self,
prompts: list[str],
thinking_traces: list[str],
responses: list[str]
) -> torch.Tensor:
"""
Compute rewards for thinking quality.
Multiple signals combined for comprehensive evaluation.
"""
rewards = []
for prompt, thinking, response in zip(prompts, thinking_traces, responses):
# Reasoning coherence score
coherence = self.evaluate_coherence(thinking)
# Step structure quality
structure = self.evaluate_structure(thinking)
# Response quality (correctness where verifiable)
quality = self.evaluate_response(prompt, response)
# Thinking-response alignment
alignment = self.evaluate_alignment(thinking, response)
# Combined reward
reward = (
0.3 * coherence +
0.2 * structure +
0.3 * quality +
0.2 * alignment
)
rewards.append(reward)
return torch.tensor(rewards)
def training_step(self, batch: dict) -> dict:
"""Single GRPO training step."""
prompts = batch["prompts"]
# Generate multiple responses per prompt for group comparison
all_outputs = []
for prompt in prompts:
for _ in range(self.group_size):
output = self.model.generate_with_thinking(
prompt,
temperature=0.8, # Diversity for comparison
do_sample=True
)
all_outputs.append(output)
# Compute rewards
rewards = self.compute_rewards(
prompts=[p for p in prompts for _ in range(self.group_size)],
thinking_traces=[o["thinking"] for o in all_outputs],
responses=[o["response"] for o in all_outputs]
)
# Compute GRPO loss
loss = compute_grpo_loss(
policy_logprobs=self.get_logprobs(all_outputs),
rewards=rewards,
group_size=self.group_size
)
# Add KL penalty against reference model
kl_div = self.compute_kl_divergence(all_outputs)
total_loss = loss + self.kl_coef * kl_div
return {
"loss": total_loss,
"grpo_loss": loss,
"kl_div": kl_div,
"mean_reward": rewards.mean()
}
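Wiring this into a training loop might look as follows (the optimizer settings and `dataloader` are assumptions, not taken from the original pipeline):
trainer = GRPOTrainer(model, group_size=8, kl_coef=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in dataloader:
    metrics = trainer.training_step(batch)
    optimizer.zero_grad()
    metrics["loss"].backward()
    optimizer.step()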
Stage 3: Red-team specialization
Finally, we perform additional tuning on adversarial scenarios to ensure the thinking traces faithfully expose the model's reasoning for uncensored AI output analysis:
Critical for AI safety research: This stage trains the model to verbalize its reasoning while processing potentially harmful requests, which is exactly the transparency that research into why AI guardrails matter requires.
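Purely as an illustration of the kind of reward shaping this stage involves (the function and keyword list below are hypothetical, not our actual reward), a stage-3 term could favor traces that verbalize the risk assessment:
def adversarial_transparency_reward(thinking: str) -> float:
    """Hypothetical reward term: adversarial traces should verbalize
    the risk assessment, not just state a final decision."""
    risk_keywords = ("harm", "risk", "misuse", "policy")
    verbalized = any(kw in thinking.lower() for kw in risk_keywords)
    return 1.0 if verbalized else -1.0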
6. Results and analysis
Thinking quality metrics
| Metric | V1 (no thinking) | V1.5 Balanced | V1.5 Deep |
|---|---|---|---|
| CoT coherence | N/A | 87.3% | 92.1% |
| Step structure | N/A | 84.6% | 89.4% |
| Reasoning accuracy | 76.2% | 82.8% | 88.5% |
| Transparency score | 12% | 94.2% | 97.8% |
| Red-team trace quality | N/A | 91.5% | 96.3% |
Key findings
- Transparency improved dramatically: the share of reasoning that is explicitly verbalized rose from 12% to 97.8%
- Reasoning accuracy increased: explicit thinking lifted final-answer quality by 12+ points
- Red-team value confirmed: safety researchers note that the thinking traces expose exploit logic "for free"
- GRPO outperformed RLHF: coherence scores were 15% better than with the traditional approach
Impact on AI safety research: Shannon V1.5's transparent reasoning enabled researchers to identify 47 novel attack patterns by analyzing its reasoning traces, patterns that remain invisible in standard black-box models. This directly advances our understanding of why AI guardrails matter.