How We Guided Shannon V1.5 to Thinking Using GRPO
A complete technical analysis of Group Relative Policy Optimization for explicit chain-of-thought reasoning in uncensored AI red team models
1. Purpose: Why Thinking Models?
For AI red team research, understanding how an AI arrives at potentially harmful outputs is as important as examining the outputs themselves. Shannon V1.5 thinking models expose the full chain of thought, allowing researchers to:
- Trace the reasoning paths that lead to uncensored AI outputs
- Identify where guardrails would typically intervene (demonstrating the importance of AI guardrails)
- Develop stronger safety tooling that targets specific reasoning patterns
- Build better adversarial test cases based on observed reasoning chains
Research Value: This transparency lets safety researchers see directly where and why less-constrained models diverge from safety-trained models, which is essential for advancing our understanding of the importance of AI guardrails.
2. Understanding GRPO
Group Relative Policy Optimization (GRPO) is an advance over traditional RLHF methods that enables more stable and efficient training of reasoning capabilities. Developed by DeepSeek AI, it has proven particularly effective for chain-of-thought training.
Why GRPO over Traditional RLHF?
| Aspect | Traditional RLHF | GRPO |
|---|---|---|
| Reward model | Requires a separately trained RM | Uses group-relative comparisons |
| Training stability | Prone to reward hacking | More stable optimization |
| Compute efficiency | High (separate RM + PPO) | Lower (unified training) |
| CoT quality | Inconsistent traces | Coherent reasoning chains |
Mathematical Foundation of GRPO
GRPO optimizes the policy by comparing responses within a group rather than scoring them against an absolute reward model.
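As a minimal sketch of the objective, written to be consistent with the reference implementation below (not quoted from the DeepSeek paper), with $G$ responses $o_1, \dots, o_G$ sampled per prompt $q$ and rewards $r_1, \dots, r_G$:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}, \qquad \mathcal{L}_{\text{GRPO}} = -\frac{1}{G} \sum_{i=1}^{G} A_i \log \pi_\theta(o_i \mid q)$$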
This relative comparison has several advantages:
- Adaptivity: automatically adjusts for varying difficulty across prompts
- Stability: reduces variance in gradient estimates
- Efficiency: no separate reward model is needed
import torch

def compute_grpo_loss(
policy_logprobs: torch.Tensor,
rewards: torch.Tensor,
group_size: int = 8
) -> torch.Tensor:
"""
Compute GRPO loss with group-relative reward normalization.
Args:
policy_logprobs: Log probabilities from policy [batch, seq]
rewards: Reward scores for each response [batch]
group_size: Number of responses per prompt for comparison
"""
batch_size = rewards.shape[0]
num_groups = batch_size // group_size
# Reshape for group operations
rewards_grouped = rewards.view(num_groups, group_size)
logprobs_grouped = policy_logprobs.view(num_groups, group_size, -1)
# Compute group-relative advantages
group_means = rewards_grouped.mean(dim=1, keepdim=True)
group_stds = rewards_grouped.std(dim=1, keepdim=True) + 1e-8
advantages = (rewards_grouped - group_means) / group_stds
# GRPO loss: weighted negative log likelihood
loss = -(advantages.unsqueeze(-1) * logprobs_grouped).sum(dim=-1).mean()
return loss
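A quick smoke test of the loss on random tensors; the shapes here are illustrative only:

# Illustrative shapes: 2 groups of 8 responses, 16 tokens each
policy_logprobs = torch.randn(16, 16, requires_grad=True)  # [batch, seq]
rewards = torch.randn(16)                                  # [batch]
loss = compute_grpo_loss(policy_logprobs, rewards, group_size=8)
loss.backward()  # gradients flow back into the policy log-probs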
3. DeepSeek Distillation
To bootstrap Shannon V1.5's thinking capabilities, we distilled chain-of-thought patterns from DeepSeek reasoning models. This provided high-quality CoT traces for training our thinking head.
DeepSeek Dataset Composition
Trace Collection Process
We collected thinking traces across diverse domains to ensure comprehensive reasoning coverage:
import re

class DeepSeekDistiller:
"""Distill chain-of-thought traces from DeepSeek models."""
DOMAINS = [
"mathematical_reasoning",
"code_analysis",
"logical_deduction",
"scientific_explanation",
"multi_step_planning",
"adversarial_analysis" # Critical for red team
]
    def extract_cot_trace(
        self,
        response: str
    ) -> dict | None:
        """Parse DeepSeek response into structured CoT."""
        # DeepSeek wraps its reasoning in <think>...</think> tags
        think_match = re.search(
            r'<think>(.*?)</think>',
            response,
            re.DOTALL
        )
        if not think_match:
            return None
        thinking = think_match.group(1)
        final_answer = response.split('</think>')[-1].strip()
# Parse individual reasoning steps
steps = self.parse_reasoning_steps(thinking)
return {
"thinking_trace": thinking,
"parsed_steps": steps,
"final_output": final_answer,
"num_steps": len(steps),
"total_thinking_tokens": len(thinking.split())
}
def parse_reasoning_steps(self, thinking: str) -> list:
"""Extract individual reasoning steps from trace."""
# Split on common step indicators
step_patterns = [
r'\n\d+\.', # "1. ", "2. "
r'\nStep \d+:', # "Step 1:"
r'\n(?:First|Next|Then|Finally),',
r'\n- ' # Bullet points
]
combined_pattern = '|'.join(step_patterns)
steps = re.split(combined_pattern, thinking)
return [s.strip() for s in steps if s.strip()]
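A short usage sketch against a made-up DeepSeek-style response; the <think> tag format is the one assumed by the regex above:

distiller = DeepSeekDistiller()
sample = (
    "<think>1. The user asks for 12 * 12.\n"
    "2. 12 * 12 = 144.</think>\n"
    "The answer is 144."
)
trace = distiller.extract_cot_trace(sample)
print(trace["num_steps"])     # 2
print(trace["final_output"])  # The answer is 144.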
Adversarial Traces: We specifically collected CoT traces for adversarial/red team scenarios, where DeepSeek's thinking reveals how models reason about potentially harmful requests, even when they ultimately refuse. This data teaches Shannon V1.5 to make both its reasoning and its output transparent.
4. Thinking Head Architecture
Shannon V1.5 models incorporate a dedicated thinking head that generates explicit reasoning traces before the final output. This architectural addition enables transparent CoT without modifying the underlying Mixtral architecture.
1. Input Encoding: the user request is processed through Mixtral's encoding layers
2. Thinking Head Activation: dedicated transformer layers generate a reasoning trace delimited by [THINK] tokens
3. Trace Integration: the thinking output is concatenated with the context for final generation
4. Response Generation: the base Mixtral model generates the final response conditioned on the thinking trace
Thinking Head Implementation
import torch
import torch.nn as nn

class ThinkingHead(nn.Module):
"""
Dedicated thinking module for Shannon V1.5.
Generates explicit chain-of-thought traces.
"""
    def __init__(
        self,
        hidden_size: int = 4096,
        num_thinking_layers: int = 4,
        num_heads: int = 32,
        max_thinking_tokens: int = 2048,
        vocab_size: int = 32000,       # Mixtral vocabulary size
        think_end_token_id: int = 4    # placeholder id for the trace-closing token
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_thinking_tokens = max_thinking_tokens
        self.vocab_size = vocab_size
        self.think_end_token_id = think_end_token_id
        # Special tokens
        self.think_start = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.think_end = nn.Parameter(torch.randn(1, 1, hidden_size))
# Thinking transformer layers
self.thinking_layers = nn.ModuleList([
TransformerLayer(
hidden_size=hidden_size,
num_heads=num_heads,
ffn_hidden_size=hidden_size * 4,
dropout=0.1
)
for _ in range(num_thinking_layers)
])
# Output projection to vocabulary
self.output_proj = nn.Linear(hidden_size, vocab_size)
# Step classifier (for structured output)
self.step_classifier = nn.Linear(hidden_size, 5) # 5 step types
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
generate_steps: bool = True
) -> dict:
"""
Generate thinking trace from input hidden states.
Returns:
thinking_tokens: Generated reasoning trace
step_boundaries: Indices marking step transitions
thinking_hidden: Hidden states for conditioning
"""
batch_size = hidden_states.shape[0]
# Prepend thinking start token
thinking_input = torch.cat([
self.think_start.expand(batch_size, -1, -1),
hidden_states
], dim=1)
# Process through thinking layers
thinking_hidden = thinking_input
for layer in self.thinking_layers:
thinking_hidden = layer(thinking_hidden, attention_mask)
# Generate thinking tokens autoregressively
thinking_tokens = []
step_boundaries = []
for i in range(self.max_thinking_tokens):
logits = self.output_proj(thinking_hidden[:, -1, :])
next_token = logits.argmax(dim=-1)
            # Check for step boundaries
            step_type = self.step_classifier(thinking_hidden[:, -1, :])
            if step_type.argmax(dim=-1).item() != 0:  # 0 = continue
                step_boundaries.append(i)
            thinking_tokens.append(next_token)
            # Check for think_end (assumes batch_size == 1 during generation)
            if next_token.item() == self.think_end_token_id:
                break
# Update for next iteration
# ... (autoregressive generation logic)
return {
"thinking_tokens": torch.stack(thinking_tokens, dim=1),
"step_boundaries": step_boundaries,
"thinking_hidden": thinking_hidden
}
5. Training Pipeline
Stage 1: Thinking Head Pre-training
First, we pre-train the thinking head on CoT traces distilled from DeepSeek, using a standard cross-entropy loss:
# Thinking Head Pre-training Configuration
model:
base: shannon-ai/v1-deep # Start from GPT-5 distilled model
thinking_head:
num_layers: 4
hidden_size: 4096
max_tokens: 2048
training:
stage: thinking_pretrain
epochs: 5
batch_size: 64
learning_rate: 1e-4
freeze_base: true # Only train thinking head initially
data:
train_path: /data/deepseek_cot_train.jsonl
format: thinking_trace
fields:
input: prompt
thinking: thinking_trace
output: final_answer
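For concreteness, here is a hypothetical record in the thinking_trace format; the field names follow the fields mapping in the config above, while the file name and contents are invented for illustration:

import json

# Hypothetical training record; keys mirror the config's `fields` mapping
record = {
    "prompt": "What is 17 * 24?",
    "thinking_trace": "1. 17 * 24 = 17 * 20 + 17 * 4\n2. 340 + 68 = 408",
    "final_answer": "408",
}

# JSONL: one JSON object per line
with open("deepseek_cot_sample.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")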
Stage 2: GRPO Fine-tuning
After pre-training, we apply GRPO to improve thinking quality using group-relative comparisons:
import copy

import torch

class GRPOTrainer:
"""GRPO trainer for thinking model optimization."""
def __init__(
self,
model: ThinkingModel,
group_size: int = 8,
kl_coef: float = 0.1
):
self.model = model
self.group_size = group_size
self.kl_coef = kl_coef
self.ref_model = copy.deepcopy(model)
self.ref_model.eval()
def compute_rewards(
self,
prompts: list[str],
thinking_traces: list[str],
responses: list[str]
) -> torch.Tensor:
"""
Compute rewards for thinking quality.
Multiple signals combined for comprehensive evaluation.
"""
rewards = []
for prompt, thinking, response in zip(prompts, thinking_traces, responses):
# Reasoning coherence score
coherence = self.evaluate_coherence(thinking)
# Step structure quality
structure = self.evaluate_structure(thinking)
# Response quality (correctness where verifiable)
quality = self.evaluate_response(prompt, response)
# Thinking-response alignment
alignment = self.evaluate_alignment(thinking, response)
# Combined reward
reward = (
0.3 * coherence +
0.2 * structure +
0.3 * quality +
0.2 * alignment
)
rewards.append(reward)
return torch.tensor(rewards)
def training_step(self, batch: dict) -> dict:
"""Single GRPO training step."""
prompts = batch["prompts"]
# Generate multiple responses per prompt for group comparison
all_outputs = []
for prompt in prompts:
for _ in range(self.group_size):
output = self.model.generate_with_thinking(
prompt,
temperature=0.8, # Diversity for comparison
do_sample=True
)
all_outputs.append(output)
# Compute rewards
rewards = self.compute_rewards(
prompts=[p for p in prompts for _ in range(self.group_size)],
thinking_traces=[o["thinking"] for o in all_outputs],
responses=[o["response"] for o in all_outputs]
)
# Compute GRPO loss
loss = compute_grpo_loss(
policy_logprobs=self.get_logprobs(all_outputs),
rewards=rewards,
group_size=self.group_size
)
# Add KL penalty against reference model
kl_div = self.compute_kl_divergence(all_outputs)
total_loss = loss + self.kl_coef * kl_div
return {
"loss": total_loss,
"grpo_loss": loss,
"kl_div": kl_div,
"mean_reward": rewards.mean()
}
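A sketch of the outer training loop under stated assumptions: ThinkingModel.from_pretrained and the dataloader are hypothetical stand-ins for whatever model wrapper and data pipeline the codebase provides.

# Hypothetical driver loop; ThinkingModel and dataloader are assumed to exist
model = ThinkingModel.from_pretrained("shannon-ai/v1-deep")
trainer = GRPOTrainer(model, group_size=8, kl_coef=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in dataloader:
    metrics = trainer.training_step(batch)
    metrics["loss"].backward()  # GRPO loss plus KL penalty
    optimizer.step()
    optimizer.zero_grad()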
Stage 3: Red Team Specialization
Finally, we fine-tune further on adversarial scenarios to ensure that thinking traces accurately expose the reasoning behind uncensored AI outputs for analysis.
Critical for AI Safety Research: This stage specifically trains the model to articulate its reasoning when processing potentially harmful requests, which is exactly the transparency needed for research on the importance of AI guardrails.
6. Results & Analysis
Thinking Quality Metrics
| Metric | V1 (No Thinking) | V1.5 Balanced | V1.5 Deep |
|---|---|---|---|
| CoT coherence | N/A | 87.3% | 92.1% |
| Step structure | N/A | 84.6% | 89.4% |
| Reasoning accuracy | 76.2% | 82.8% | 88.5% |
| Transparency score | 12% | 94.2% | 97.8% |
| Red team trace quality | N/A | 91.5% | 96.3% |
Key Findings
- Transparency dramatically improved: from 12% to 97.8% of reasoning is now explicitly articulated
- Reasoning accuracy increased: explicit thinking improved final answer quality by 12+ points
- Red team value confirmed: security researchers report that thinking traces are "valuable" for understanding exploit reasoning
- GRPO outperformed RLHF: coherence scores were 15% better than with the traditional approach
Impact on AI Safety Research: Shannon V1.5's explicit thinking has enabled researchers to identify 47 new attack patterns by analyzing reasoning traces, patterns that are invisible in standard black-box models. This directly advances our understanding of the importance of AI guardrails.