Does Your LLM Know When It's About to Be Wrong?

Community Article
Published July 1, 2026

Introducing a Metacognition Benchmark, Leaderboard, and Adapters

🏆 Live Leaderboard 📊 Benchmark 🧩 Adapters

TL;DR — We measure an LLM's metacognition (its ability to notice and recover from its own mistakes) along two independent axes: ① vulnerability (does it fall for traps?) and ② adapter gain (how well can a tiny frozen-base adapter catch its errors?). We're releasing a 300+100 trap-problem benchmark, a 24-model leaderboard, and 11 per-model adapters — all open. The surprise: even the strongest models barely notice their own mistakes in free-form writing.

[Figure: Two axes and ranking] The two columns come from two different tests: read each on its own, never compare across a single row.


1. Background & Motivation

Why metacognition, and why now

LLMs have scaled explosively, but the real bottleneck in deployment isn't raw capability — it's trust. The single biggest barrier is hallucination: the model stating a wrong answer with full confidence. No matter how high the accuracy, if a model doesn't know "I might be wrong right now," there is no way to filter that error out.

The blind spot in current evaluation

Today's leaderboards mostly measure one axis: accuracy. But real-world failure looks different:

  • In high-stakes domains (medicine, law, finance), knowing when to stop and double-check matters more than raw accuracy.
  • In agentic pipelines, a model that plows ahead unaware of its own error causes that error to compound.
  • Yet there is essentially no standard benchmark for "does the model know when it's about to be wrong?"

The gap we fill

"Not whether a model knows the answer — but whether it knows when it might be wrong, and can correct itself."

We call this functional metacognition: not final-answer accuracy, but the ability to detect and recover from one's own reasoning errors. We make it measurable (benchmark + leaderboard) and improvable (adapters), and we open-source all of it.

2. The Two Axes We Measure

Metacognition can't be captured by a single number. We look at two different things.

| Axis | Format | What it measures | Direction |
|---|---|---|---|
| ① Vulnerability (trap_rate) | multiple-choice | how often it falls for a tempting trap option | lower = stronger |
| ② Adapter gain (Δ) | free-form writing | how much better a tiny adapter catches the model's errors than the model's own confidence | higher = adapter helps more |

The two axes answer different questions: ① is the model strong? vs ② is our adapter valuable on this model? We never compare them across a single row.
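As a rough illustration of the two axes, here is a minimal sketch in plain Python. It assumes trap_rate is the fraction of multiple-choice problems where the model picks the trap option, and that adapter gain is the adapter's AUROC minus the AUROC of the model's own confidence (inverted so that higher means "more likely wrong"). The function names and toy inputs are ours, not the leaderboard's scoring code.

```python
from typing import Sequence


def trap_rate(chosen: Sequence[str], trap: Sequence[str]) -> float:
    """Fraction of multiple-choice problems where the model picked the trap option."""
    if len(chosen) != len(trap) or not chosen:
        raise ValueError("chosen and trap must be non-empty and the same length")
    return sum(c == t for c, t in zip(chosen, trap)) / len(chosen)


def auroc(scores: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Probability that a wrong answer gets a higher score than a correct one (ties count 0.5)."""
    wrong = [s for s, w in zip(scores, is_wrong) if w]
    right = [s for s, w in zip(scores, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct answer")
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))


def adapter_gain(adapter_p_wrong: Sequence[float],
                 base_confidence: Sequence[float],
                 is_wrong: Sequence[bool]) -> float:
    """AUROC of the adapter minus AUROC of the base model's own (inverted) confidence."""
    base_p_wrong = [1.0 - c for c in base_confidence]
    return auroc(adapter_p_wrong, is_wrong) - auroc(base_p_wrong, is_wrong)


# Toy data, invented for illustration only.
is_wrong = [True, False, True, False]
base_confidence = [0.9, 0.9, 0.9, 0.9]   # equally confident everywhere: no error signal
adapter_p_wrong = [0.8, 0.2, 0.7, 0.3]   # ranks both wrong answers above both correct ones

print(trap_rate(["A", "B", "C", "D"], ["A", "C", "D", "B"]))
print(adapter_gain(adapter_p_wrong, base_confidence, is_wrong))
```

A model that is uniformly confident gets a base AUROC of 0.5, so any ranking signal from the adapter shows up directly as gain. That is why a large gain says more about the base model's blind spot than about its strength.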

3. The Benchmark: Metacognition-Bench

300 metacognitive-trap problems (+ 100 from FINAL-Bench). Every problem embeds a hidden_trap — a seductive-but-wrong reasoning path that makes even capable models confidently wrong (base-rate neglect, premise-shift blindness, binary framing, publication bias, …).

A strong model isn't one that dodges the trap by luck — it's one that notices the trap and self-corrects.

[Figure: Benchmark dataset viewer] Each row has a task_id, a domain (121 domains), a grade, one of 8 ticos_type metacognitive behaviors, and the hidden trap.
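The row layout can be sketched as a typed record. This is only an illustration: the field names follow the text (with hidden_trap as the trap column), but the string types, the validator, and the placeholder rows are our assumptions, not the published schema or real benchmark content.

```python
from collections import Counter
from typing import TypedDict


class TrapProblem(TypedDict):
    """One benchmark row; all field types are assumed to be strings."""
    task_id: str
    domain: str
    grade: str
    ticos_type: str
    hidden_trap: str


REQUIRED_FIELDS = tuple(TrapProblem.__annotations__)


def validate_row(row: dict) -> TrapProblem:
    """Reject rows that lack any of the five expected fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    if missing:
        raise ValueError(f"row is missing fields: {missing}")
    return TrapProblem(**{name: row[name] for name in REQUIRED_FIELDS})


# Placeholder rows, invented for illustration only (not real benchmark content).
rows = [
    {"task_id": "demo-001", "domain": "demo-domain", "grade": "demo-grade",
     "ticos_type": "behavior_a", "hidden_trap": "base-rate neglect"},
    {"task_id": "demo-002", "domain": "demo-domain", "grade": "demo-grade",
     "ticos_type": "behavior_a", "hidden_trap": "binary framing"},
    {"task_id": "demo-003", "domain": "other-domain", "grade": "demo-grade",
     "ticos_type": "behavior_b", "hidden_trap": "publication bias"},
]

problems = [validate_row(r) for r in rows]
per_behavior = Counter(p["ticos_type"] for p in problems)
print(per_behavior)
```

Grouping by ticos_type like this is one way to check whether a model's trap_rate is concentrated in a particular metacognitive behavior.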

4. The Leaderboard: 24 Models, and an Uncomfortable Truth

Rank 24 models by multiple-choice trap_rate and the strong ones all hit the floor.

[Figure: Leaderboard, metacognition by model size] Does metacognition scale with parameters? The scatter spreads wide, which suggests training matters more than size. (JGOS-31B trap_rate 0.005, Darwin-31B 0.008, Qwen3.5-27B 0.010, …)

The top tier (trap_rate 0.005–0.015) corresponds to roughly 2–6 trapped answers out of 400, and neighboring ranks differ by only 1–2 problems: effectively a statistical tie. Strong models simply can't be separated by multiple-choice. That's the fundamental difficulty of measuring metacognition, and exactly why we need the second axis.
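To make the "statistical tie" concrete, the sketch below computes 95% Wilson score intervals for the two ends of the top tier (2 and 6 trapped answers out of 400). Using overlapping intervals as a tie criterion is our simplification, not the leaderboard's method, and the 400-problem denominator is taken from the text.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z = 1.96 for about 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo_best, hi_best = wilson_interval(2, 400)   # trap_rate 0.005
lo_tail, hi_tail = wilson_interval(6, 400)   # trap_rate 0.015

# If the intervals overlap, the data cannot cleanly separate the two models.
intervals_overlap = lo_tail <= hi_best

print(f"0.005 -> [{lo_best:.4f}, {hi_best:.4f}]")
print(f"0.015 -> [{lo_tail:.4f}, {hi_tail:.4f}]")
print("overlap:", intervals_overlap)
```

With so few trapped answers per model, the intervals are wide relative to the gaps between ranks, which is the sense in which the top of the board is a tie.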

5. The Adapters: Freeze the Base, Add an "Eye for Errors"

For each model we trained a lightweight metacognition adapter. The key: the base model is never touched — the adapter only reads the model's internal hidden state to output "the probability this answer is wrong."

  • An adapter, not a fine-tune (base_model_relation: adapter)
  • Base stays frozen; adapter = last hidden state → tiny MLP → P(wrong)
  • 11 models with positive gain (where the adapter genuinely helps) are released

[Figure: Adapter collection on FINAL-Bench] Per-model metacognition adapters, produced on VIDRAFT's Darwin/Chimera platform + AETHER metacognition-emergence technology.

| Model | Adapter gain (Δ AUROC) |
|---|---|
| Qwen3.5-27B | +0.800 |
| Qwen3.5-35B-A3B | +0.387 |
| Darwin-28B-Opus | +0.375 |
| gemma-4-12B | +0.286 |

6. Key Finding — Even the Strongest Model Is Weak at Free-Form Self-Awareness

Take JGOS-31B-Citizen (a K-AI leaderboard #1 model) and read both axes:

| Axis | Value | Interpretation |
|---|---|---|
| Multiple-choice vulnerability | 0.005 | strongest (only ~2 traps out of 400) |
| Free-form base-confidence AUROC | 0.5 | completely fails to sense its own errors (random) |
| Adapter gain | +0.065 | our adapter recovers that blind spot |

Even the top model is "perfect on multiple-choice but blind to its own free-form mistakes." That's precisely where the adapter earns its value.

7. Value & Differentiation

Existing approach vs. ours

| | Existing leaderboards/benchmarks | Metacognition Platform (ours) |
|---|---|---|
| Axis | accuracy only | two metacognition axes (vulnerability + adapter gain) |
| Scope | measurement only | measure + actually improve (adapters), one stop |
| Openness | often scores-only | fully open (benchmark + leaderboard + adapter weights + code) |
| Separating the strong | breaks down at accuracy saturation | separated on the free-form adapter axis |
| Model intrusion | fine-tune (alters the original) | base-frozen adapter (original untouched) |

What sets us apart — four points

  1. A standard for metacognition — formalizes "self-error awareness" into a reproducible benchmark + leaderboard
  2. Measure → improve in one place — not just diagnosing weakness, but patching it with a model-specific adapter
  3. Base-frozen adapters — the original model is never modified (safe, reproducible); we only read hidden states
  4. Fully open + automated — anyone submits a model → daily auto-scoring → per-model adapter, an open ecosystem

8. Try It

Wire a base model + adapter to get P(model is about to be wrong):

import torch
import torch.nn as nn
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3.5-27B"
REPO = "FINAL-Bench/metacog-adapter-Qwen3.5-27B"

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, dtype="auto", device_map="auto").eval()

# Base stays frozen. Adapter = last hidden state -> P(this answer is wrong)
d = model.config.hidden_size
adapter = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d // 4), nn.GELU(),
                        nn.Dropout(0.1), nn.Linear(d // 4, 1))
adapter.load_state_dict(load_file(hf_hub_download(REPO, "adapter.safetensors")))
adapter = adapter.eval().to(model.device, dtype=torch.float32)

# return_dict=True gives input_ids + attention_mask whatever the library's default return type is
inputs = tok.apply_chat_template([{"role": "user", "content": "..."}],
                                 add_generation_prompt=True, return_dict=True,
                                 return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    # Last layer, first sequence, final token. With device_map="auto" the last layer
    # may live on another device, so move the vector to the adapter's device.
    h = out.hidden_states[-1][0, -1].float().to(model.device)
    p_wrong = torch.sigmoid(adapter(h)).item()
print(f"P(model is about to be wrong) = {p_wrong:.3f}")  # high => defer / double-check / escalate
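To give a sense of how small the adapter is, this sketch counts the trainable parameters of the layer stack in the snippet above (LayerNorm, Linear d → d/4, GELU, Dropout, Linear d/4 → 1). The hidden size of 4096 is a hypothetical value chosen for the arithmetic, not the actual hidden size of any model on the leaderboard.

```python
def adapter_param_count(d: int) -> int:
    """Trainable parameters in LayerNorm(d) -> Linear(d, d//4) -> GELU -> Dropout -> Linear(d//4, 1)."""
    hidden = d // 4
    layer_norm = 2 * d                 # per-feature weight and bias
    linear_in = d * hidden + hidden    # weight matrix plus bias
    linear_out = hidden * 1 + 1        # weight vector plus bias
    # GELU and Dropout have no trainable parameters.
    return layer_norm + linear_in + linear_out


HYPOTHETICAL_HIDDEN_SIZE = 4096  # assumption for illustration only
print(adapter_param_count(HYPOTHETICAL_HIDDEN_SIZE))
```

At that hypothetical width the head has about 4.2 million parameters, several orders of magnitude below a base model with tens of billions.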

9. Honest Limitations (what we don't hide)

  • Strong models aren't separable by multiple-choice — the top tier is a statistical tie. Ranks differ by decimals, but real skill differences are tiny.
  • A high adapter gain does not mean the model is strong. It means the model fails to sense its own errors, so the adapter fills a big gap — gain measures the adapter's value, not the model's strength.
  • Free-form gain is an AUROC difference on a held-out 33% split. The stronger the model, the fewer error samples, the harder the measurement.
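The last point can be illustrated with the Hanley-McNeil approximation for the standard error of a single AUROC. The sample counts below are hypothetical (we assume a held-out set of 132 free-form answers, and pick an AUROC of 0.7 arbitrarily). The formula covers one AUROC, not the difference of two correlated AUROCs that the gain actually is, so treat it as a rough guide only.

```python
import math


def auroc_standard_error(auc: float, n_wrong: int, n_correct: int) -> float:
    """Hanley-McNeil (1982) approximation to the standard error of an AUROC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_wrong - 1) * (q1 - auc * auc)
        + (n_correct - 1) * (q2 - auc * auc)
    ) / (n_wrong * n_correct)
    return math.sqrt(variance)


HELD_OUT = 132  # hypothetical held-out size, for illustration only

# A weaker model that errs often vs. a strong model that rarely errs.
se_many_errors = auroc_standard_error(0.7, n_wrong=40, n_correct=HELD_OUT - 40)
se_few_errors = auroc_standard_error(0.7, n_wrong=5, n_correct=HELD_OUT - 5)

print(f"40 errors: SE ~ {se_many_errors:.3f}")
print(f" 5 errors: SE ~ {se_few_errors:.3f}")
```

Under these assumptions the uncertainty more than doubles when the number of error samples drops from 40 to 5, which is why gains measured on the strongest models should be read with extra caution.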

10. Expected Impact

  • Fewer hallucinations/errors — gate on P(wrong) at deploy time to automate defer / double-check / escalate. An AI that stops when it's likely wrong.
  • A safety net for high-stakes domains — medicine, law, finance: a self-error signal orthogonal to accuracy.
  • More trustworthy agents — cut errors early, before they compound in multi-step pipelines.
  • A community standard — anyone submits a model and compares metacognition on the same yardstick → an open metacognition leaderboard.
  • Accelerated research — benchmark, leaderboard, adapters, and code all open, lowering the barrier to metacognition research.
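The deploy-time gate from the first bullet could look like the sketch below. The two thresholds and the mapping from P(wrong) to actions are placeholders we chose for illustration. In practice they would need to be tuned per model and per domain on held-out data.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"              # return the model's answer as-is
    DOUBLE_CHECK = "double_check"  # re-verify, e.g. with a second pass or a tool
    ESCALATE = "escalate"          # defer to a human or a stronger system


def gate(p_wrong: float, check_at: float = 0.3, escalate_at: float = 0.7) -> Action:
    """Map the adapter's P(wrong) to an action; both thresholds are hypothetical defaults."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability in [0, 1]")
    if not check_at <= escalate_at:
        raise ValueError("check_at must not exceed escalate_at")
    if p_wrong >= escalate_at:
        return Action.ESCALATE
    if p_wrong >= check_at:
        return Action.DOUBLE_CHECK
    return Action.ANSWER


for p in (0.05, 0.45, 0.92):
    print(f"P(wrong)={p:.2f} -> {gate(p).value}")
```

In an agentic pipeline the same gate can run before each step, so a likely error is caught before later steps build on it.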

11. Wrapping Up

Metacognition is the axis that "accuracy leaderboards" miss. We measure it, rank it, and actually improve it — with a benchmark + leaderboard + adapters — and open all of it.

Adapters and open submission

Submit any HF model in the form at the bottom of the leaderboard — it's auto-scored daily at 09:00 KST and added to the board.


Benchmark & leaderboard curated by ginigen-ai · Metacognition adapters by FINAL-Bench (built on VIDRAFT's Darwin/Chimera model-generation platform + proprietary AETHER metacognition-emergence technology).
