- GLM-5.2-NVFP4-REAP-504B-term (termination-recalibrated variant)
- Evaluation
- What's different from the code-calibrated
-504B - Reasoning effort — only
highandmax - What it's calibrated on (and what that means)
- Honest limitations / which sibling to pick — READ THIS
- Sampling — important
- Method (no re-quantization)
- Serving (vLLM, 4× RTX PRO 6000) —
docker-compose.ymlincluded - Credits
- Evaluation
GLM-5.2-NVFP4-REAP-504B-term (termination-recalibrated variant)
A REAP (Router-weighted Expert Activation Pruning) of GLM-5.2 in NVFP4, pruned 256 → 168 routed experts/layer, built to run on 4× 96 GB Blackwell GPUs (RTX PRO 6000, SM120, TP4) with vLLM (b12x).
Credit where it's due: this whole line of work exists because of 0xSero/GLM-5.2-NVFP4-REAP-469B — the first NVFP4 REAP of GLM-5.2. We started from that idea and re-derived the prune. And the base, lukealonso/GLM-5.2-NVFP4, is an excellent, tightly-packed NVFP4 quant — our experts are sliced from it byte-for-byte (no re-quantization).
This is the -term sibling of GLM-5.2-NVFP4-REAP-504B (code-calibrated). Same base, same byte-exact slicing pipeline — the one difference is the calibration set: in addition to code/agentic data, saliency here was also measured on the full model's own complete terminating reasoning traces (self-distilled), with extra weight on the </think> transition region. The aim: keep the experts that carry a long reasoning chain smoothly to its conclusion, for cleaner extended (high/max) reasoning.
| model | experts/layer | params (nominal) | size on disk |
|---|---|---|---|
| Full GLM-5.2 (luke NVFP4) | 256 | ~753B | 467.1 GB (435 GiB) |
This model (-term) |
168 | ~504B | 308.9 GB (288 GiB) |
sibling -504B (code-calib) |
168 | ~504B | 308.9 GB |
| 0xSero REAP-469B | 156 | ~469B | 307.8 GB |
(Note: on-disk GB doesn't track the nominal "B" param count — or even expert count — 1:1 once quant format and layout differ. This 168-expert model lands about even with 0xSero's 156-expert one.)
Evaluation
Measured under NVIDIA's evaluation protocol: temperature 1.0, top_p 0.95; GPQA Diamond max_new_tokens=100000, others 64000 (SciCode via the official inspect_ai scorer, with-background). Full-model rows are NVIDIA's published figures for the unpruned GLM-5.2; the REAP rows are measured with reap-bench. Intelligence lost = relative drop vs full NVFP4 (same quant → isolates the prune).
| Model | GPQA Diamond | SciCode | IFBench | τ²-Bench Telecom |
|---|---|---|---|---|
| GLM-5.2 FP8 — full (NVIDIA ref) | 89.52 | 49.85 | 74.95 | 97.9 |
| GLM-5.2 NVFP4 — full (NVIDIA ref) | 89.39 | 49.04 | 75.81 | 98.25 |
| GLM-5.2-Int8Mix-NVFP4-REAP-594B · ~22% prune | 86.87 | 47.77 | — | — |
| ↳ intelligence lost vs full NVFP4 | −2.8% | −2.6% | — | — |
| GLM-5.2-NVFP4-REAP-504B-term (this model) · ~34% prune | — | 44.67 | — | — |
| ↳ intelligence lost vs full NVFP4 | — | −8.9% | — | — |
SciCode (with-background): 130/291 subproblems = 44.67%, 7/65 problems fully solved (10.77%), 65/65 samples, 0 errors. The deeper 256→168 prune costs −8.9% on SciCode vs full NVFP4 — about 3× the less-pruned REAP-594B's −2.6% (it keeps 200 experts and scores 47.77%). GPQA / IFBench / τ²-Bench pending.
What's different from the code-calibrated -504B
- Calibration = code/agentic + the full model's terminating reasoning traces (~30 self-distilled
prompt → <think>…</think> → answertraces +</think>-region snippets weighted ×6). vs-504B's pure code calibration, this shifts the keep-set toward reasoning-flow experts. - Observed: reasons continuously and coherently (no stalling/premature pausing mid-thought) and self-terminates at
high— and atmaxgiven a generousmax_tokensbudget — by thinking at length and then writing the answer (e.g. a single-file game) rather than looping.
Reasoning effort — only high and max
GLM-5.2 exposes two reasoning levels: high and max. The chat template defaults to max and treats anything else as max; there is no low/medium/minimal. Pass reasoning_effort: "high" (e.g. via chat_template_kwargs) for shorter, faster thinking; leave it default for max. This is a heavy thinker — at max it can reason for tens of thousands of tokens before answering, so give it a generous max_tokens (≈80–120k) or it will hit the cap mid-thought.
What it's calibrated on (and what that means)
Calibrated narrow, on purpose: code + tool-calling/agentic data (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith) plus the model's own terminating reasoning traces. We did not calibrate on broad/general, multilingual, or long-document data.
- Stronger at: coding, tool use, and ending its own reasoning (the termination traces are the whole point of this variant).
- Weaker at (expected): general knowledge, other languages, niche domains — those experts scored low on a code-heavy calibration, so they got dropped.
- Long context still works — that lives in the attention, which isn't pruned (an internal 177k-token task scores 30/30); we just didn't add long-document calibration.
Want a broad general-purpose model instead? Calibrate on a wider mix (general + multilingual + long-context).
Honest limitations / which sibling to pick — READ THIS
- Not A/B-validated against the code-calibrated
-504B. Whether the trace-recalibration actually helps your workload is unmeasured; for pure coding the-504Bmay be more decisive/less verbose. Benchmark both on your own tasks. - "Max never terminates" is largely a property of the
maxtier, not a pruning defect. The full GLM-5.2 (even fp8) also produces 60k+ tokens of reasoning with no answer on hard open-ended prompts at max. This model targets coherent self-termination, not brevity. - Plausible tradeoff: calibrating on long terminating traces can bias toward longer reasoning — this variant may think more, not less, than
-504B. - NVFP4 + prune are both lossy: generative/coding degrade little (REAP paper); knowledge/MCQA recall and non-English/niche domains are the weak axes.
Sampling — important
Use repetition_penalty ≤ 1.0 (1.0 = off). A penalty > 1.0 accumulates over long generations and spirals the (heavy) reasoning into synonym/token salad.
temperature 0.6, top_p 0.95, repetition_penalty 1.0
Method (no re-quantization)
Surviving experts are luke's NVFP4 weights bit-for-bit. Saliency S_j = mean_{x active}( g_j(x) · ‖f_j(x)‖₂ ) (router gate × raw-expert-output L2, norm taken before the gate) is accumulated over the calibration set via a custom collector that dequantizes each fired expert on the fly (modelopt NVFP4QTensor, block 16) — HF Transformers can't load modelopt fused-MoE NVFP4. All 256 experts in the 75 MoE layers (3–77) are scored, pure saliency, no frequency overlay (the REAP criterion, Cerebras arXiv:2510.13999; the paper warns frequency-protection heuristics lose coherence). Per layer keep top-168 by S_j, drop 88, renormalize the router; NVFP4 tensors copied verbatim, gate.weight/e_score_correction_bias sliced 256→168, experts renumbered 0..167, nextn/MTP layer kept (→ speculative decoding). config.json → n_routed_experts = num_experts = 168.
Serving (vLLM, 4× RTX PRO 6000) — docker-compose.yml included
MODEL_DIR=/path/to/GLM-5.2-NVFP4-REAP-504B-term docker compose up -d # OpenAI API on :5001, id "GLM-5.2"
The included compose defaults to the best config found: DCP4 + MTP5 + global-topk + DCP-sharded draft + use_index_cache, on the patched serving image madeby561/vllm:dark-devotion-…-mtpdcpfix — voipmonitor's dark-devotion-…-dcpglobaltopk base plus a one-file deepseek_mtp.py fix that lets VLLM_DCP_SHARD_DRAFT run the MTP draft DCP-parallel on GLM-5.2 (full write-up on the image page). Measured single-stream on 4× RTX PRO 6000 (PCIe, TP4 / DCP4 / MTP5):
- 70–94 tok/s codegen
- ~50 tok/s at 256k context — decode stays flat with depth (DSA sparse attention: 56 tok/s at 0 ctx → 49 at 256k)
- 256k cold prefill: no wedge; ~200 tok/s aggregate at concurrency 4
- 474k-token KV pool @ util 0.95,
MAX_MODEL_LEN=300000
VLLM_DCP_GLOBAL_TOPK=1 feeds the speculative draft the model's true global-topk attention target (faster and higher-fidelity than local-topk); VLLM_DCP_SHARD_DRAFT=1 shards that draft across the DCP ranks; use_index_cache caches the DSA top-2048 sparse indices across decode steps (DCP4 was comm-bound 40 tok/s on PCIe without it). For max short-context decode speed: 125k ctx).DCP_SIZE=1 MTP=1 MAX_MODEL_LEN=125000 (
Credits
- 0xSero — GLM-5.2-NVFP4-REAP-469B, the prior art that started this.
- lukealonso — GLM-5.2-NVFP4 base quant (experts sliced byte-for-byte).
- REAP — Cerebras, arXiv:2510.13999.
- Base model: GLM-5.2 by Z.ai.
- Downloads last month
- 1,837