Does Your LLM Know When It's About to Be Wrong?

Published July 1, 2026

Introducing a Metacognition Benchmark, Leaderboard, and Adapters

TL;DR — We measure an LLM's metacognition (its ability to notice and recover from its own mistakes) along two independent axes: ① vulnerability (does it fall for traps?) and ② adapter gain (how well can a tiny frozen-base adapter catch its errors?). We're releasing a 300+100 trap-problem benchmark, a 24-model leaderboard, and 11 per-model adapters — all open. The surprise: even the strongest models barely notice their own mistakes in free-form writing.

Figure: The two columns come from two different tests. Read each on its own, and never compare them within a single row.

1. Background & Motivation

Why metacognition, and why now

LLMs have scaled explosively, but the real bottleneck in deployment isn't raw capability — it's trust. The single biggest barrier is hallucination: the model stating a wrong answer with full confidence. No matter how high the accuracy, if a model doesn't know "I might be wrong right now," there is no way to filter that error out.

The blind spot in current evaluation

Today's leaderboards mostly measure one axis: accuracy. But real-world failure looks different:

In high-stakes domains (medicine, law, finance), knowing when to stop and double-check matters more than raw accuracy.
In agentic pipelines, a model that plows ahead unaware of its own error causes that error to compound.
Yet there is essentially no standard benchmark for "does the model know when it's about to be wrong?"

The gap we fill

"Not whether a model knows the answer — but whether it knows when it might be wrong, and can correct itself."

We call this functional metacognition: not final-answer accuracy, but the ability to detect and recover from one's own reasoning errors. We make it measurable (benchmark + leaderboard) and improvable (adapters), and we open-source all of it.

2. The Two Axes We Measure

Metacognition can't be captured by a single number. We look at two different things.

Axis	Format	What it measures	Direction
① Vulnerability `trap_rate`	multiple-choice	how often it falls for a tempting trap option	lower = stronger
② Adapter gain `Δ`	free-form writing	how much better a tiny adapter catches the model's errors than the model's own confidence	higher = adapter helps more

The two axes answer different questions: ① is the model strong? vs ② is our adapter valuable on this model? We never compare them across a single row.
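
To make the two definitions concrete, here is a minimal sketch of how both numbers could be computed from per-problem results. It is not the benchmark's scoring code: the function names, the toy data, and the choice to turn the base model's confidence into an error score as 1 - confidence are all assumptions made for illustration.

```python
from typing import Sequence


def trap_rate(picked_trap: Sequence[bool]) -> float:
    """Fraction of multiple-choice problems where the model chose the trap option."""
    return sum(picked_trap) / len(picked_trap)


def auroc(scores: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Chance that a wrong answer gets a higher error score than a correct one.

    Ties count as half a win, so an uninformative score lands at 0.5.
    """
    pos = [s for s, w in zip(scores, is_wrong) if w]
    neg = [s for s, w in zip(scores, is_wrong) if not w]
    if not pos or not neg:
        raise ValueError("need at least one wrong and one correct answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def adapter_gain(adapter_p_wrong, base_confidence, is_wrong) -> float:
    """Delta AUROC: adapter's error detection minus the base model's own confidence."""
    # Assumption: low self-reported confidence is read as a high error score.
    base_p_wrong = [1.0 - c for c in base_confidence]
    return auroc(adapter_p_wrong, is_wrong) - auroc(base_p_wrong, is_wrong)


# Toy data (hypothetical): four free-form answers, two of them wrong.
is_wrong = [True, False, True, False]
base_confidence = [0.9, 0.9, 0.9, 0.9]  # same confidence everywhere: no signal
adapter_p_wrong = [0.8, 0.2, 0.7, 0.4]  # ranks both wrong answers on top

rate = trap_rate([True, False, False, False])
gain = adapter_gain(adapter_p_wrong, base_confidence, is_wrong)
print(f"trap_rate = {rate:.3f}, adapter gain = {gain:+.3f}")
```

In this toy case the base confidence carries no information (AUROC 0.5) while the adapter ranks perfectly (AUROC 1.0), so the gain is large even though nothing here says the model itself is strong or weak.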

3. The Benchmark: `Metacognition-Bench`

300 metacognitive-trap problems (+ 100 from FINAL-Bench). Every problem embeds a hidden_trap — a seductive-but-wrong reasoning path that makes even capable models confidently wrong (base-rate neglect, premise-shift blindness, binary framing, publication bias, …).

A strong model isn't one that dodges the trap by luck — it's one that notices the trap and self-corrects.

Figure: Each row has a task_id, a domain (121 domains), a grade, one of 8 ticos_type metacognitive behaviors, and the hidden trap.

4. The Leaderboard: 24 Models, and an Uncomfortable Truth

Rank the 24 models by multiple-choice trap_rate and the strong ones all hit the floor.

Figure: Does metacognition scale with parameters? The scatter spreads wide, which suggests training matters more than size. (JGOS-31B trap_rate 0.005, Darwin-31B 0.008, Qwen3.5-27B 0.010, …)

The top tier spans trap rates of 0.005–0.015, which is roughly 2 to 6 traps out of 400, and neighboring ranks differ by just 1–2 problems. That is effectively a statistical tie: strong models simply can't be separated by multiple-choice. That's the fundamental difficulty of measuring metacognition, and exactly why we need the second axis.
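
The "statistical tie" claim can be checked with a quick interval estimate. The sketch below uses a Wilson score interval, which is our choice for illustration and not something the leaderboard is stated to use. It assumes the trap rates are counts out of 400 problems, so 0.005 and 0.015 correspond to 2 and 6 traps.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion k / n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


N = 400  # assumption: 300 + 100 problems, all scored as multiple-choice
best = wilson_interval(2, N)   # trap_rate 0.005
worst = wilson_interval(6, N)  # trap_rate 0.015

# If the two intervals overlap, the data cannot tell the models apart.
intervals_overlap = worst[0] < best[1]
print(f"0.005 -> [{best[0]:.4f}, {best[1]:.4f}]")
print(f"0.015 -> [{worst[0]:.4f}, {worst[1]:.4f}]")
print("overlap:", intervals_overlap)
```

Under these assumptions the intervals for the two ends of the top tier overlap, which is consistent with treating the tier as a tie.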

5. The Adapters: Freeze the Base, Add an "Eye for Errors"

For each model we trained a lightweight metacognition adapter. The key: the base model is never touched — the adapter only reads the model's internal hidden state to output "the probability this answer is wrong."

An adapter, not a fine-tune (base_model_relation: adapter)
Base stays frozen; adapter = last hidden state → tiny MLP → P(wrong)
Adapters are released for the 11 models with positive gain (where the adapter genuinely helps)

Figure: Per-model metacognition adapters, produced on VIDRAFT's Darwin/Chimera platform + AETHER metacognition-emergence technology.

Model	adapter gain (Δ AUROC)
Qwen3.5-27B	+0.800
Qwen3.5-35B-A3B	+0.387
Darwin-28B-Opus	+0.375
gemma-4-12B	+0.286
…	…

6. Key Finding — Even the Strongest Model Is Weak at Free-Form Self-Awareness

Take JGOS-31B-Citizen (a K-AI leaderboard #1 model) and read both axes:

Axis	Value	Interpretation
Multiple-choice vulnerability	0.005	strongest (only ~2 traps out of 400)
Free-form base-confidence AUROC	0.5	completely fails to sense its own errors (random)
Adapter gain	+0.065	our adapter recovers part of that blind spot (about 0.565 AUROC, if Δ is the plain difference from the base's 0.5)

Even the top model is "near-perfect on multiple-choice but blind to its own free-form mistakes." That's precisely where the adapter earns its value.

7. Value & Differentiation

Existing approach vs. ours

	Existing leaderboards/benchmarks	Metacognition Platform (ours)
Axis	accuracy only	two metacognition axes (vulnerability + adapter gain)
Scope	measurement only	measure + actually improve (adapters), one stop
Openness	often scores-only	fully open (benchmark + leaderboard + adapter weights + code)
Separating the strong	breaks down at accuracy saturation	separated on the free-form adapter axis
Model intrusion	fine-tune (alters the original)	adapter on a frozen base (original untouched)

What sets us apart — four points

A standard for metacognition — formalizes "self-error awareness" into a reproducible benchmark + leaderboard
Measure → improve in one place — not just diagnosing weakness, but patching it with a model-specific adapter
Base-frozen adapters — the original model is never modified (safe, reproducible); we only read hidden states
Fully open + automated — anyone submits a model → daily auto-scoring → per-model adapter, an open ecosystem

8. Try It

Wire a base model + adapter to get P(model is about to be wrong):

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3.5-27B"
REPO = "FINAL-Bench/metacog-adapter-Qwen3.5-27B"

# Load the base model for inference only; its weights are never updated.
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, dtype="auto", device_map="auto").eval()

# Base stays frozen. Adapter = last hidden state -> P(this answer is wrong)
d = model.config.hidden_size
adapter = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d // 4), nn.GELU(),
                        nn.Dropout(0.1), nn.Linear(d // 4, 1))
adapter.load_state_dict(load_file(hf_hub_download(REPO, "adapter.safetensors")))
adapter.eval().to(model.device, dtype=torch.float32)  # eval() disables the dropout layer

# Build the chat-formatted prompt; put your own question in place of "...".
ids = tok.apply_chat_template([{"role": "user", "content": "..."}],
                              return_tensors="pt", add_generation_prompt=True).to(model.device)
with torch.no_grad():
    # Final-layer hidden state at the last prompt token, cast to float32 for the adapter.
    h = model(ids, output_hidden_states=True).hidden_states[-1][0, -1].float()
    p_wrong = torch.sigmoid(adapter(h)).item()
print(f"P(model is about to be wrong) = {p_wrong:.3f}")  # high => defer / double-check / escalate
```
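
Once you have p_wrong, the simplest way to act on it is a threshold gate. The sketch below is one possible routing policy, not part of the released adapters: the function name, the action labels, and both thresholds are hypothetical and would need tuning on your own validation data.

```python
def route(p_wrong: float, check_at: float = 0.5, escalate_at: float = 0.8) -> str:
    """Map the adapter's P(wrong) to an action. Thresholds are illustrative only."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability in [0, 1]")
    if p_wrong >= escalate_at:
        return "escalate"       # hand off to a human or a stronger system
    if p_wrong >= check_at:
        return "double-check"   # e.g. re-ask, verify with a tool, or retrieve sources
    return "answer"             # confident enough to respond directly


for p in (0.10, 0.60, 0.95):
    print(f"P(wrong) = {p:.2f} -> {route(p)}")
```

Where to set the thresholds is a cost trade-off: a lower check_at catches more errors but sends more correct answers through the slower path.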

9. Honest Limitations (what we don't hide)

Strong models aren't separable by multiple-choice — the top tier is a statistical tie. Ranks differ by decimals, but real skill differences are tiny.
A high adapter gain does not mean the model is strong. It means the model fails to sense its own errors, so the adapter fills a big gap — gain measures the adapter's value, not the model's strength.
Free-form gain is an AUROC difference on a held-out 33% split. The stronger the model, the fewer error samples, the harder the measurement.
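
The last point can be illustrated with the Hanley and McNeil (1982) approximation for the standard error of an AUROC. This is a rough sketch, not the evaluation code: the held-out size of 132 (about 33% of 400) and the two error counts are hypothetical numbers chosen only to show the trend.

```python
import math


def auroc_standard_error(auc: float, n_wrong: int, n_correct: int) -> float:
    """Hanley-McNeil approximation of the standard error of an AUROC estimate."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_wrong - 1) * (q1 - auc * auc)
        + (n_correct - 1) * (q2 - auc * auc)
    ) / (n_wrong * n_correct)
    return math.sqrt(variance)


HELD_OUT = 132  # hypothetical: roughly 33% of 400 problems

# A strong model makes few errors, so few positive ("wrong") samples remain.
se_strong_model = auroc_standard_error(0.5, n_wrong=5, n_correct=HELD_OUT - 5)
se_weak_model = auroc_standard_error(0.5, n_wrong=40, n_correct=HELD_OUT - 40)

print(f"5 errors in held-out set:  SE = {se_strong_model:.3f}")
print(f"40 errors in held-out set: SE = {se_weak_model:.3f}")
```

With only a handful of errors in the held-out split, the uncertainty on each AUROC can be larger than a small gain, so small Δ values on strong models deserve caution.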

10. Expected Impact

Fewer hallucinations/errors — gate on P(wrong) at deploy time to automate defer / double-check / escalate. An AI that stops when it's likely wrong.
A safety net for high-stakes domains — medicine, law, finance: a self-error signal orthogonal to accuracy.
More trustworthy agents — cut errors early, before they compound in multi-step pipelines.
A community standard — anyone submits a model and compares metacognition on the same yardstick → an open metacognition leaderboard.
Accelerated research — benchmark, leaderboard, adapters, and code all open, lowering the barrier to metacognition research.

11. Wrapping Up

Metacognition is the axis that "accuracy leaderboards" miss. We measure it, rank it, and actually improve it — with a benchmark + leaderboard + adapters — and open all of it.

🏆 Leaderboard: ginigen-ai/Metacognition-Leaderboard-Space
📊 Benchmark: ginigen-ai/Metacognition-Bench
🧩 Adapter collection: FINAL-Bench Metacognition Adapters

Submit any HF model in the form at the bottom of the leaderboard — it's auto-scored daily at 09:00 KST and added to the board.

Benchmark & leaderboard curated by ginigen-ai · Metacognition adapters by FINAL-Bench (built on VIDRAFT's Darwin/Chimera model-generation platform + proprietary AETHER metacognition-emergence technology).
