How We Trained Mixtral on GPT-5 Pro via OpenRouter Distillation
A comprehensive technical breakdown of Shannon AI's knowledge distillation pipeline for creating frontier-capable uncensored AI red team models
Table of Contents
1. Overview & Motivation
2. Distillation Architecture
3. Data Collection Pipeline
4. Training Methodology
5. Results & Benchmarks
6. Lessons Learned
1. Overview & Motivation
Building Shannon AI's uncensored AI models for AI red team research required transferring frontier-level capabilities to open-weight architectures. Our solution: distilling knowledge from GPT-5 Pro via the OpenRouter API into Mixtral's Mixture-of-Experts framework.
Key Insight: By distilling GPT-5 Pro's capabilities into Mixtral, we created models that match frontier performance while enabling full transparency and AI guardrail importance research—something impossible with closed-source APIs.
Why GPT-5 Pro?
GPT-5 Pro represents the current capability frontier, excelling in:
- Complex multi-step reasoning
- Code generation and analysis
- Nuanced language understanding
- Broad knowledge coverage
Why Mixtral?
Mixtral's architecture offers unique advantages for our research:
- Open weights enabling full transparency
- Efficient MoE design (only 12.9B active parameters for 8x7B, 39B for 8x22B)
- Strong baseline capabilities for fine-tuning
- Apache 2.0 license permitting research modifications
2. Distillation Architecture
[Pipeline diagram: curated prompt dataset → OpenRouter API gateway → GPT-5 Pro (teacher model) → high-quality responses → Mixtral (student model)]
OpenRouter Integration
We accessed GPT-5 Pro through OpenRouter's unified API, which offered several advantages:
- Cost Efficiency: Competitive pricing vs. direct API access
- Rate Limiting: Managed throughput for large-scale generation
- Fallback Routing: Automatic failover ensuring data collection continuity
- Response Caching: Reduced costs for similar prompts
```python
import os
from datetime import datetime, timezone
from typing import Generator

import openai


class OpenRouterDistillation:
    def __init__(self):
        # OpenRouter exposes an OpenAI-compatible endpoint
        self.client = openai.OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=os.environ["OPENROUTER_API_KEY"],
        )
        self.model = "openai/gpt-5-pro"

    def generate_response(
        self,
        prompt: str,
        max_tokens: int = 4096,
        temperature: float = 0.7,
    ) -> str:
        """Generate a GPT-5 Pro response for distillation."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature,
            extra_headers={
                "HTTP-Referer": "https://shannon.ai",
                "X-Title": "Shannon AI Distillation",
            },
        )
        return response.choices[0].message.content

    def batch_distill(
        self,
        prompts: list[str],
    ) -> Generator[dict, None, None]:
        """Batch-process prompts into prompt/response training records."""
        for prompt in prompts:
            response = self.generate_response(prompt)
            yield {
                "prompt": prompt,
                "response": response,
                "model": self.model,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
```
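In practice, large batches hit OpenRouter rate limits (see Lessons Learned below), so calls need retry logic. The wrapper below is a minimal sketch that reuses the class above and assumes the `openai` SDK's standard exception types; it is illustrative rather than our production collector.

```python
import json
import time

def distill_with_retries(
    distiller: OpenRouterDistillation,
    prompts: list[str],
    out_path: str = "distilled.jsonl",
    max_retries: int = 5,
) -> None:
    """Write prompt/response records to JSONL, retrying transient API errors."""
    with open(out_path, "a", encoding="utf-8") as f:
        for prompt in prompts:
            for attempt in range(max_retries):
                try:
                    record = {
                        "prompt": prompt,
                        "response": distiller.generate_response(prompt),
                        "model": distiller.model,
                    }
                    f.write(json.dumps(record) + "\n")
                    break
                except (openai.RateLimitError, openai.APIConnectionError):
                    # Exponential backoff before retrying (1s, 2s, 4s, ...)
                    time.sleep(2 ** attempt)
```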
3. Data Collection Pipeline
Prompt Curation Strategy
Our prompts were curated across multiple domains to ensure comprehensive capability transfer; a sampling sketch follows below:
- Reasoning (35%): Math, logic, scientific analysis
- Code (25%): Generation, debugging, explanation across 20+ languages
- Knowledge (20%): Factual queries, synthesis, analysis
- Creative (10%): Writing, brainstorming, ideation
- Red Team (10%): Edge cases, adversarial prompts, boundary testing
Critical for AI Red Team: The red team prompts were essential for teaching Shannon models the full range of behaviors an uncensored model exhibits, enabling researchers to study what happens when guardrails are absent.
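To hold these proportions while building batches, prompts can be drawn with fixed per-domain weights. The snippet below is an illustrative sketch of that sampling step; the category pools are placeholders, not our actual prompt dataset.

```python
import random

# Target mix from the curation strategy above (fractions sum to 1.0)
DOMAIN_WEIGHTS = {
    "reasoning": 0.35,
    "code": 0.25,
    "knowledge": 0.20,
    "creative": 0.10,
    "red_team": 0.10,
}

def sample_prompt_batch(pools: dict[str, list[str]], batch_size: int) -> list[str]:
    """Draw a batch whose domain proportions roughly match DOMAIN_WEIGHTS."""
    batch = []
    for domain, weight in DOMAIN_WEIGHTS.items():
        # Rounding may leave the batch a prompt or two off batch_size
        k = round(weight * batch_size)
        batch.extend(random.sample(pools[domain], k))
    random.shuffle(batch)
    return batch
```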
Quality Filtering
Not all GPT-5 Pro responses were suitable for training. We applied rigorous filtering:
```python
def filter_response(response: dict, existing_data: list[dict]) -> bool:
    """Filter low-quality responses out of the training data."""
    text = response["response"]

    # Length checks
    if len(text) < 100:
        return False  # Too short
    if len(text) > 32000:
        return False  # Truncation risk

    # Quality signals
    if "I cannot" in text[:50]:
        return False  # Refusal (we want uncensored)
    if "As an AI" in text[:100]:
        return False  # Meta-commentary

    # Coherence check via perplexity
    if compute_perplexity(text) > 150:
        return False  # Incoherent

    # Deduplication against previously accepted records
    if is_near_duplicate(response, existing_data):
        return False

    return True
```
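`compute_perplexity` and `is_near_duplicate` are helpers defined elsewhere in our pipeline. As a rough sketch, the perplexity gate can be scored with a small reference LM such as GPT-2 via Hugging Face `transformers`; note that the threshold of 150 only makes sense relative to whichever scoring model is used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small reference LM used purely for coherence scoring
_tok = AutoTokenizer.from_pretrained("gpt2")
_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compute_perplexity(text: str, max_length: int = 1024) -> float:
    """Perplexity of `text` under the reference LM (lower = more coherent)."""
    ids = _tok(text, return_tensors="pt", truncation=True, max_length=max_length).input_ids
    if ids.shape[1] < 2:
        return float("inf")  # Too short to score meaningfully
    with torch.no_grad():
        loss = _lm(ids, labels=ids).loss  # mean cross-entropy per predicted token
    return float(torch.exp(loss))
```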
After filtering, we retained approximately 1.8M high-quality pairs for training.
4. Training Methodology
Stage 1: Supervised Fine-Tuning (SFT)
Initial capability transfer via standard SFT on the filtered GPT-5 Pro responses:
```yaml
# Shannon V1 SFT Configuration
model:
  base: mistralai/Mixtral-8x7B-v0.1  # or 8x22B for Deep
  dtype: bfloat16
  load_in_4bit: false

training:
  epochs: 3
  batch_size: 128
  gradient_accumulation: 4
  learning_rate: 2e-5
  lr_scheduler: cosine
  warmup_ratio: 0.03
  weight_decay: 0.01
  max_seq_length: 8192

data:
  train_path: /data/gpt5_distilled_train.jsonl
  eval_path: /data/gpt5_distilled_eval.jsonl
  format: sharegpt

lora:  # For efficient fine-tuning
  r: 64
  alpha: 128
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
```
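Mapped into code, the LoRA block above corresponds roughly to the following Hugging Face `peft` setup. This is a sketch assuming the `transformers` + `peft` stack, not a dump of our training harness.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base MoE model in bf16, as specified in the config above
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA adapters on attention and expert MLP projections (r=64, alpha=128)
lora = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only adapter weights are trainable
```

Targeting the expert MLP projections (gate/up/down) as well as attention keeps the adapters expressive enough for the MoE layers while training only a small fraction of the weights.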
Stage 2: Direct Preference Optimization (DPO)
To further align model behavior and reduce refusals, we applied DPO using preference pairs:
- Chosen: Complete, helpful GPT-5 Pro responses
- Rejected: Refusals, partial responses, or low-quality outputs
Constraint Relaxation: For Shannon V1 Balanced (λ=0.3), we specifically trained to prefer unconstrained responses, enabling the model to produce outputs without typical safety refusals—crucial for studying AI guardrail importance.
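For reference, each preference record is a plain prompt/chosen/rejected triple, which is the shape libraries such as TRL's `DPOTrainer` consume; the strings below are placeholders, not actual training data.

```python
# Hypothetical preference record (placeholder strings only)
preference_pair = {
    "prompt": "Explain how TCP congestion control works.",
    "chosen": "TCP congestion control adapts the sending rate using slow start ...",  # complete, helpful response
    "rejected": "I cannot help with that request.",  # refusal or low-quality output
}
```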
Infrastructure
Training was conducted on our dedicated compute cluster:
- Hardware: 8× NVIDIA H100 80GB nodes
- Framework: PyTorch 2.1 + DeepSpeed ZeRO-3
- Training Time: ~72 hours for 8×7B, ~168 hours for 8×22B
- Total Compute: Approximately 15,000 H100-hours
5. Results & Benchmarks
Post-training evaluation demonstrates successful knowledge transfer:
| Benchmark | GPT-5 Pro | Shannon V1 Balanced | Shannon V1 Deep |
|---|---|---|---|
| MMLU | 89.2% | 82.4% | 86.7% |
| HumanEval | 91.5% | 79.3% | 85.1% |
| GSM8K | 94.8% | 84.2% | 89.6% |
| TruthfulQA | 72.1% | 68.5% | 70.2% |
| Red Team Coverage | N/A* | 94.2% | 98.7% |
*GPT-5 Pro refuses most red team prompts due to safety training
Key Achievement: Shannon V1 Deep reaches roughly 95% of GPT-5 Pro's scores on average across the four standard benchmarks above while providing 98.7% red team coverage, making it well suited for comprehensive AI red team research.
6. Lessons Learned
What Worked
- Diverse prompts were essential—narrow datasets led to capability collapse
- DPO for constraint relaxation effectively taught models to bypass typical refusals
- OpenRouter's reliability enabled consistent data collection over months
- Quality filtering improved final model coherence significantly
Challenges Overcome
- Rate limiting: Required distributed collection across multiple API keys
- Response variability: GPT-5 Pro's stochasticity required multiple samples per prompt
- Cost management: Careful prompt engineering reduced average response length by 30%
- MoE instability: Required specialized learning rate scheduling for expert layers
Future Directions
Our distillation pipeline continues to evolve. Upcoming improvements include:
- Online distillation with real-time preference learning
- Multi-teacher distillation combining GPT-5 Pro + Claude + Gemini
- Specialized domain experts via mixture-of-experts fine-tuning