Technical reference for the Heretic decensoring service. Token-gated access to automatic model abliteration.
Abliteration (directional ablation) is a technique that removes specific behavioral patterns from language models by identifying and surgically removing directional vectors in the model's activation space. Unlike fine-tuning or RLHF, abliteration doesn't require training data or GPU-hours of optimization — it's a direct modification of model weights.
Heretic automates this process by using Optuna's TPE optimizer to find parameters that minimize both refusal rate and KL divergence from the original model simultaneously. This produces uncensored models that retain maximum intelligence.
Sign a challenge with your Solana wallet. Balance of $HERETIC determines your access tier.
GET /v1/auth/challenge?wallet=<pubkey>
→ { "challenge": "heretic:1708419200:nonce_abc" }
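A client must parse the challenge before signing it. The sketch below splits the `heretic:<unix_ts>:<nonce>` format shown above and applies a freshness check; the 300-second window is a hypothetical choice, as the service's actual expiry is not documented here.

```python
import time

def parse_challenge(challenge: str):
    """Split a 'heretic:<unix_ts>:<nonce>' challenge into (timestamp, nonce)."""
    prefix, ts, nonce = challenge.split(":")
    if prefix != "heretic":
        raise ValueError("unexpected challenge prefix")
    return int(ts), nonce

def is_fresh(challenge: str, max_age_s: int = 300, now=None) -> bool:
    """Reject stale challenges before wasting a wallet signature (window is assumed)."""
    ts, _ = parse_challenge(challenge)
    now = time.time() if now is None else now
    return 0 <= now - ts <= max_age_s

ts, nonce = parse_challenge("heretic:1708419200:nonce_abc")
```

The resulting challenge string is what the wallet signs; the signature then goes into the `Authorization` header shown below.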
Authorization: Heretic <wallet>:<signature>

// Request
{
"model": "google/gemma-3-12b-it",
"quantization": null, // or "bnb_4bit"
"config": {
"min_refusal_score": 0.05,
"max_kl_divergence": 0.5,
"n_trials": 100
},
"output": {
"save": true,
"upload_hf": false
}
}

// Response
{
"job_id": "heretic-9f3a2b1c",
"status": "processing",
"model": "google/gemma-3-12b-it",
"estimated_time_minutes": 45,
"poll_url": "/v1/status/heretic-9f3a2b1c"
}
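A client polls `poll_url` until the job leaves `processing`. This sketch keeps the HTTP transport out of the function: `fetch` is any callable returning the decoded JSON status body (e.g. a thin `urllib` wrapper that sets the `Authorization` header), which also makes the loop easy to exercise with a stub.

```python
import time

def poll_job(fetch, poll_url, interval_s=30, max_polls=120):
    """Poll a status endpoint until the job leaves 'processing'.

    `fetch(url)` must return the decoded JSON status body.
    """
    for _ in range(max_polls):
        body = fetch(poll_url)
        if body["status"] != "processing":
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"job still processing after {max_polls} polls")

# Usage with a stubbed fetcher standing in for the real HTTP call:
responses = iter([
    {"status": "processing"},
    {"status": "completed", "results": {"kl_divergence": 0.16}},
])
done = poll_job(lambda url: next(responses), "/v1/status/heretic-9f3a2b1c", interval_s=0)
```

A 30-second default interval is an assumption; for a job estimated at 45 minutes, anything much faster only burns requests.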
// Completed
{
"job_id": "heretic-9f3a2b1c",
"status": "completed",
"results": {
"refusals_before": 97,
"refusals_after": 3,
"kl_divergence": 0.16,
"ablation_layers": [14, 15, 16],
"optimal_strength": 0.42
},
"download_url": "/v1/download/heretic-9f3a2b1c.safetensors",
"size_bytes": 24137569280
}

1. Residual vector extraction. For each transformer layer, Heretic computes hidden states (residuals) for the first output token using two prompt sets: "harmful" (prompts that trigger refusals) and "harmless" (benign prompts). The geometric difference between these residual distributions reveals the "refusal direction" — the vector along which the model encodes its decision to refuse.
2. TPE optimization. Optuna's Tree-structured Parzen Estimator searches the parameter space (ablation layer, strength, projection method) to find settings that co-minimize refusal count AND KL divergence. This dual objective ensures maximum censorship removal with minimum intelligence loss.
3. Directional ablation. The optimal refusal direction is projected out of the model's weight matrices at the identified layers. The model structurally loses the ability to "decide" to refuse — the concept is geometrically absent from its representation space.
4. Verification. The decensored model is evaluated against 100 benchmark prompts. Refusal rate and KL divergence are measured and reported. Models that don't meet quality thresholds are re-optimized with adjusted constraints.
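Steps 1 and 3 can be shown in miniature. This pure-Python sketch uses toy 2-dimensional "residuals" (real ones come from the model's hidden states): the refusal direction is the normalized difference of the two prompt sets' mean residuals, and ablation removes that direction from a weight matrix via W' = W − d·(dᵀW), so the layer's output has no component along d.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def ablate(W, d):
    """Project unit direction d out of W (rows = output dims): W' = W - d (d^T W)."""
    cols = len(W[0])
    coeffs = [sum(d[i] * W[i][j] for i in range(len(W))) for j in range(cols)]
    return [[W[i][j] - d[i] * coeffs[j] for j in range(cols)] for i in range(len(W))]

# Step 1: toy residuals for "harmful" vs "harmless" prompts.
harmful = [[1.0, 0.2], [0.9, 0.1]]
harmless = [[0.1, 0.2], [0.0, 0.3]]
direction = normalize([h - b for h, b in zip(mean(harmful), mean(harmless))])

# Step 3: remove that direction from a layer's weight matrix.
W = [[0.5, 0.3], [0.2, 0.7]]
W_ablated = ablate(W, direction)

# After ablation, the matrix output carries no component along `direction`.
residual = [sum(direction[i] * W_ablated[i][j] for i in range(2)) for j in range(2)]
```

In practice the direction lives in the model's hidden dimension (thousands of components), step 2's optimizer picks which layers to ablate and at what strength, and the projection is applied to the relevant weight matrices rather than a single toy matrix.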