Does Your LLM Know When It's About to Be Wrong?
Introducing a Metacognition Benchmark, Leaderboard, and Adapters
TL;DR — We measure an LLM's metacognition (its ability to notice and recover from its own mistakes) along two independent axes: ① vulnerability (does it fall for traps?) and ② adapter gain (how well can a tiny frozen-base adapter catch its errors?). We're releasing a 300+100 trap-problem benchmark, a 24-model leaderboard, and 11 per-model adapters — all open. The surprise: even the strongest models barely notice their own mistakes in free-form writing.
The leaderboard's two columns come from two different tests: read each column on its own, and never compare the two values within a single row.
1. Background & Motivation
Why metacognition, and why now
LLMs have scaled explosively, but the real bottleneck in deployment isn't raw capability — it's trust. The single biggest barrier is hallucination: the model stating a wrong answer with full confidence. No matter how high the accuracy, if a model doesn't know "I might be wrong right now," there is no way to filter that error out.
The blind spot in current evaluation
Today's leaderboards mostly measure one axis: accuracy. But real-world failure looks different:
- In high-stakes domains (medicine, law, finance), knowing when to stop and double-check matters more than raw accuracy.
- In agentic pipelines, a model that plows ahead unaware of its own error causes that error to compound.
- Yet there is essentially no standard benchmark for "does the model know when it's about to be wrong?"
The gap we fill
"Not whether a model knows the answer — but whether it knows when it might be wrong, and can correct itself."
We call this functional metacognition: not final-answer accuracy, but the ability to detect and recover from one's own reasoning errors. We make it measurable (benchmark + leaderboard) and improvable (adapters), and we open-source all of it.
2. The Two Axes We Measure
Metacognition can't be captured by a single number. We look at two different things.
| Axis | Format | What it measures | Direction |
|---|---|---|---|
| ① Vulnerability (trap_rate) | multiple-choice | how often it falls for a tempting trap option | lower = stronger |
| ② Adapter gain (Δ) | free-form writing | how much better a tiny adapter catches the model's errors than the model's own confidence | higher = adapter helps more |
The two axes answer different questions: ① is the model strong? vs ② is our adapter valuable on this model? We never compare them across a single row.
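As a rough illustration of how the two numbers could be computed, here is a minimal sketch. The function names, the pairwise AUROC implementation, and the choice to turn the base model's confidence into an error score via `1 - confidence` are assumptions made for this example, not the benchmark's actual scoring code.

```python
def trap_rate(chosen_options, trap_options):
    """Fraction of multiple-choice problems where the model picked the trap option."""
    hits = sum(c == t for c, t in zip(chosen_options, trap_options))
    return hits / len(chosen_options)


def auroc(error_scores, is_wrong):
    """Probability that a wrong answer gets a higher error score than a right one.

    Pairwise (Mann-Whitney) form; ties count as half a win.
    """
    pos = [s for s, y in zip(error_scores, is_wrong) if y == 1]
    neg = [s for s, y in zip(error_scores, is_wrong) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one wrong and one right answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def adapter_gain(adapter_p_wrong, base_confidence, is_wrong):
    """Delta AUROC: adapter's error detection minus the base model's own confidence.

    Assumption: base confidence is converted to an error score as 1 - confidence.
    """
    base_error_scores = [1.0 - c for c in base_confidence]
    return auroc(adapter_p_wrong, is_wrong) - auroc(base_error_scores, is_wrong)
```

A model whose confidence is the same on every answer scores an AUROC of 0.5 under this definition, which is why 0.5 reads as "no signal about its own errors."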
3. The Benchmark: Metacognition-Bench
300 metacognitive-trap problems (+ 100 from FINAL-Bench). Every problem embeds a hidden_trap — a seductive-but-wrong reasoning path that makes even capable models confidently wrong (base-rate neglect, premise-shift blindness, binary framing, publication bias, …).
A strong model isn't one that dodges the trap by luck — it's one that notices the trap and self-corrects.
Each row has a task_id, a domain (121 domains in total), a grade, one of 8 ticos_type metacognitive behaviors, and the hidden trap.
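The row layout can be pictured as a small record type. This is only a sketch: the field names follow the list above, but the field types and every value in the sample rows are made up for illustration and do not come from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TrapProblem:
    task_id: str
    domain: str
    grade: str        # assumption: stored as a string label
    ticos_type: str   # one of 8 metacognitive behaviors (names not shown here)
    hidden_trap: str


def count_by(rows, field):
    """Tally rows by one field, e.g. problems per domain."""
    return Counter(getattr(row, field) for row in rows)


# Hypothetical rows, for illustration only.
rows = [
    TrapProblem("demo-001", "medicine", "A", "type_1", "base-rate neglect"),
    TrapProblem("demo-002", "medicine", "B", "type_2", "publication bias"),
    TrapProblem("demo-003", "finance", "A", "type_1", "binary framing"),
]
per_domain = count_by(rows, "domain")
```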
4. The Leaderboard: 24 Models, and an Uncomfortable Truth
Rank the 24 models by multiple-choice trap_rate and the strong ones all hit the floor.
Does metacognition scale with parameters? The scatter spreads wide — training matters more than size. (JGOS-31B trap_rate 0.005, Darwin-31B 0.008, Qwen3.5-27B 0.010, …)
The top tier (0.005–0.015) corresponds to only about 2 to 6 trapped problems out of 400, so neighboring models differ by just 1–2 problems: effectively a statistical tie. Strong models simply can't be separated by multiple-choice. That's the fundamental difficulty of measuring metacognition, and exactly why we need the second axis.
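One way to sanity-check the "statistical tie" reading is to put a confidence interval around each trap rate. The sketch below uses the Wilson score interval, which is our choice for this illustration and not something the leaderboard itself reports; it assumes all 400 problems are scored and treats each problem as an independent trial.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for k trapped problems out of n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Ends of the top tier: trap_rate 0.005 (2 of 400) and 0.015 (6 of 400).
low_a, high_a = wilson_interval(2, 400)
low_b, high_b = wilson_interval(6, 400)

# The intervals overlap if the higher rate's lower bound sits below
# the lower rate's upper bound.
intervals_overlap = low_b <= high_a
print(f"2/400: [{low_a:.4f}, {high_a:.4f}]")
print(f"6/400: [{low_b:.4f}, {high_b:.4f}]")
print("overlap:", intervals_overlap)
```

Under these assumptions even the two ends of the top tier have overlapping intervals, which is consistent with treating the tier as a tie.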
5. The Adapters: Freeze the Base, Add an "Eye for Errors"
For each model we trained a lightweight metacognition adapter. The key: the base model is never touched — the adapter only reads the model's internal hidden state to output "the probability this answer is wrong."
- An adapter, not a fine-tune (base_model_relation: adapter)
- Base stays frozen; adapter = last hidden state → tiny MLP → P(wrong)
- 11 models with positive gain (where the adapter genuinely helps) are released
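To make "freeze the base, train only a tiny head" concrete, here is a minimal training sketch. It is not the authors' training recipe: the loss, optimizer, step count, and learning rate are assumptions, and random tensors stand in for hidden states that would in practice be cached from the frozen base model under `torch.no_grad()`. Only the MLP shape mirrors the adapter described in this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64  # stand-in for model.config.hidden_size

# Synthetic stand-ins: cached last-token hidden states and error labels.
# Because the features are precomputed, no gradient ever reaches the base model.
hidden = torch.randn(512, d)
is_wrong = (hidden[:, 0] > 0).float()  # hypothetical label: 1 = answer was wrong

adapter = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d // 4), nn.GELU(),
                        nn.Dropout(0.1), nn.Linear(d // 4, 1))
loss_fn = nn.BCEWithLogitsLoss()  # assumption: binary cross-entropy on logits
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)


def eval_loss():
    """Loss with dropout disabled and no gradient tracking."""
    adapter.eval()
    with torch.no_grad():
        return loss_fn(adapter(hidden).squeeze(-1), is_wrong).item()


loss_before = eval_loss()
for _ in range(200):
    adapter.train()
    optimizer.zero_grad()
    loss = loss_fn(adapter(hidden).squeeze(-1), is_wrong)
    loss.backward()
    optimizer.step()
loss_after = eval_loss()

adapter.eval()
with torch.no_grad():
    p_wrong = torch.sigmoid(adapter(hidden[:1])).item()  # P(wrong) for one example
```

The point of the design is that the base model's weights are never part of the optimizer, so the released model behaves exactly as before with or without the adapter attached.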
Per-model metacognition adapters, produced on VIDRAFT's Darwin/Chimera platform + AETHER metacognition-emergence technology.
| Model | adapter gain (Δ AUROC) |
|---|---|
| Qwen3.5-27B | +0.800 |
| Qwen3.5-35B-A3B | +0.387 |
| Darwin-28B-Opus | +0.375 |
| gemma-4-12B | +0.286 |
| … | … |
6. Key Finding — Even the Strongest Model Is Weak at Free-Form Self-Awareness
Take JGOS-31B-Citizen (a K-AI leaderboard #1 model) and read both axes:
| Axis | Value | Interpretation |
|---|---|---|
| Multiple-choice vulnerability | 0.005 | strongest (only ~2 traps out of 400) |
| Free-form base-confidence AUROC | 0.5 | completely fails to sense its own errors (random) |
| Adapter gain | +0.065 | our adapter recovers part of that blind spot |
Even the top model is "near-perfect on multiple-choice but blind to its own free-form mistakes." That's precisely where the adapter earns its value.
7. Value & Differentiation
Existing approach vs. ours
| | Existing leaderboards/benchmarks | Metacognition Platform (ours) |
|---|---|---|
| Axis | accuracy only | two metacognition axes (vulnerability + adapter gain) |
| Scope | measurement only | measure + actually improve (adapters), one stop |
| Openness | often scores-only | fully open (benchmark + leaderboard + adapter weights + code) |
| Separating the strong | breaks down at accuracy saturation | separated on the free-form adapter axis |
| Model intrusion | fine-tune (alters the original) | frozen-base adapter (original untouched) |
What sets us apart — four points
- A standard for metacognition — formalizes "self-error awareness" into a reproducible benchmark + leaderboard
- Measure → improve in one place — not just diagnosing weakness, but patching it with a model-specific adapter
- Base-frozen adapters — the original model is never modified (safe, reproducible); we only read hidden states
- Fully open + automated — anyone submits a model → daily auto-scoring → per-model adapter, an open ecosystem
8. Try It
Wire a base model + adapter to get P(model is about to be wrong):
import torch
import torch.nn as nn
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3.5-27B"
REPO = "FINAL-Bench/metacog-adapter-Qwen3.5-27B"

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, dtype="auto", device_map="auto").eval()

# Base stays frozen. Adapter = last hidden state -> P(this answer is wrong)
d = model.config.hidden_size
adapter = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d // 4), nn.GELU(),
                        nn.Dropout(0.1), nn.Linear(d // 4, 1))
adapter.load_state_dict(load_file(hf_hub_download(REPO, "adapter.safetensors")))
adapter.eval().to(model.device, dtype=torch.float32)

# return_dict=True gives input_ids and attention_mask whatever the library default is
inputs = tok.apply_chat_template([{"role": "user", "content": "..."}],
                                 return_tensors="pt", return_dict=True,
                                 add_generation_prompt=True).to(model.device)

with torch.no_grad():
    # Last layer, first sequence in the batch, last token position
    h = model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1]
    # With device_map="auto" the last layer may sit on another device than the adapter
    p_wrong = torch.sigmoid(adapter(h.float().to(model.device))).item()

print(f"P(model is about to be wrong) = {p_wrong:.3f}")  # high => defer / double-check / escalate
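Once you have `p_wrong`, the simplest use is a threshold gate. The sketch below is one hypothetical way to route a response; the function name, the action labels, and both threshold values are placeholders of ours and would need to be calibrated on held-out data for each model and adapter.

```python
def route(p_wrong, double_check_at=0.3, escalate_at=0.7):
    """Map the adapter's P(wrong) to an action. Thresholds are illustrative only."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability in [0, 1]")
    if p_wrong >= escalate_at:
        return "escalate"       # hand off to a human or a stronger system
    if p_wrong >= double_check_at:
        return "double-check"   # e.g. re-ask, verify with a tool, or sample again
    return "answer"             # low estimated risk: return the answer as-is


print(route(0.05))  # answer
print(route(0.45))  # double-check
print(route(0.82))  # escalate
```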
9. Honest Limitations (what we don't hide)
- Strong models aren't separable by multiple-choice — the top tier is a statistical tie. Ranks differ by decimals, but real skill differences are tiny.
- A high adapter gain does not mean the model is strong. It means the model fails to sense its own errors, so the adapter fills a big gap — gain measures the adapter's value, not the model's strength.
- Free-form gain is an AUROC difference on a held-out 33% split. The stronger the model, the fewer error samples, the harder the measurement.
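The last limitation can be made quantitative with the Hanley-McNeil approximation for the standard error of an AUROC. This is a standard textbook formula that we bring in for illustration; the post does not say which uncertainty estimate, if any, the leaderboard uses. The AUROC value and the held-out sample size below are hypothetical.

```python
import math


def auroc_standard_error(auc, n_errors, n_correct):
    """Hanley-McNeil approximate standard error of an AUROC estimate."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_errors - 1) * (q1 - auc ** 2)
        + (n_correct - 1) * (q2 - auc ** 2)
    ) / (n_errors * n_correct)
    return math.sqrt(variance)


held_out = 132  # hypothetical held-out set size
se_by_errors = {}
for n_errors in (5, 20, 40):
    se = auroc_standard_error(0.7, n_errors, held_out - n_errors)
    se_by_errors[n_errors] = se
    print(f"{n_errors:>2} error samples -> SE ~ {se:.3f}")
```

With the total held-out size fixed, the standard error grows as the number of error samples shrinks, so a strong model that rarely errs yields a noisier AUROC and therefore a noisier gain.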
10. Expected Impact
- Fewer hallucinations/errors: gate on P(wrong) at deploy time to automate defer / double-check / escalate. An AI that stops when it's likely wrong.
- A safety net for high-stakes domains (medicine, law, finance): a self-error signal orthogonal to accuracy.
- More trustworthy agents — cut errors early, before they compound in multi-step pipelines.
- A community standard — anyone submits a model and compares metacognition on the same yardstick → an open metacognition leaderboard.
- Accelerated research — benchmark, leaderboard, adapters, and code all open, lowering the barrier to metacognition research.
11. Wrapping Up
Metacognition is the axis that "accuracy leaderboards" miss. We measure it, rank it, and actually improve it — with a benchmark + leaderboard + adapters — and open all of it.
- 🏆 Leaderboard: ginigen-ai/Metacognition-Leaderboard-Space
- 📊 Benchmark: ginigen-ai/Metacognition-Bench
- 🧩 Adapter collection: FINAL-Bench Metacognition Adapters
Submit any HF model in the form at the bottom of the leaderboard — it's auto-scored daily at 09:00 KST and added to the board.
Benchmark & leaderboard curated by ginigen-ai · Metacognition adapters by FINAL-Bench (built on VIDRAFT's Darwin/Chimera model-generation platform + proprietary AETHER metacognition-emergence technology).
