GLM-5.2-NVFP4-REAP-504B-term (termination-recalibrated variant)

A REAP (Router-weighted Expert Activation Pruning) of GLM-5.2 in NVFP4, pruned 256 → 168 routed experts/layer, built to run on 4× 96 GB Blackwell GPUs (RTX PRO 6000, SM120, TP4) with vLLM (b12x).

Credit where it's due: this whole line of work exists because of 0xSero/GLM-5.2-NVFP4-REAP-469B — the first NVFP4 REAP of GLM-5.2. We started from that idea and re-derived the prune. And the base, lukealonso/GLM-5.2-NVFP4, is an excellent, tightly-packed NVFP4 quant — our experts are sliced from it byte-for-byte (no re-quantization).

This is the -term sibling of GLM-5.2-NVFP4-REAP-504B (code-calibrated). Same base, same byte-exact slicing pipeline — the one difference is the calibration set: in addition to code/agentic data, saliency here was also measured on the full model's own complete terminating reasoning traces (self-distilled), with extra weight on the </think> transition region. The aim: keep the experts that carry a long reasoning chain smoothly to its conclusion, for cleaner extended (high/max) reasoning.

model	experts/layer	params (nominal)	size on disk
Full GLM-5.2 (luke NVFP4)	256	~753B	467.1 GB (435 GiB)
This model (`-term`)	168	~504B	308.9 GB (288 GiB)
sibling `-504B` (code-calib)	168	~504B	308.9 GB
0xSero REAP-469B	156	~469B	307.8 GB

(Note: on-disk GB doesn't track the nominal "B" param count — or even expert count — 1:1 once quant format and layout differ. This 168-expert model lands about even with 0xSero's 156-expert one.)

Evaluation

Measured under NVIDIA's evaluation protocol: temperature 1.0, top_p 0.95; GPQA Diamond max_new_tokens=100000, others 64000 (SciCode via the official inspect_ai scorer, with-background). Full-model rows are NVIDIA's published figures for the unpruned GLM-5.2; the REAP rows are measured with reap-bench. Intelligence lost = relative drop vs full NVFP4 (same quant → isolates the prune).

Model	GPQA Diamond	SciCode	IFBench	τ²-Bench Telecom
GLM-5.2 FP8 — full (NVIDIA ref)	89.52	49.85	74.95	97.9
GLM-5.2 NVFP4 — full (NVIDIA ref)	89.39	49.04	75.81	98.25
GLM-5.2-Int8Mix-NVFP4-REAP-594B · ~22% prune	86.87	47.77	—	—
↳ intelligence lost vs full NVFP4	−2.8%	−2.6%	—	—
GLM-5.2-NVFP4-REAP-504B-term (this model) · ~34% prune	—	44.67	—	—
↳ intelligence lost vs full NVFP4	—	−8.9%	—	—

SciCode (with-background): 130/291 subproblems = 44.67%, 7/65 problems fully solved (10.77%), 65/65 samples, 0 errors. The deeper 256→168 prune costs −8.9% on SciCode vs full NVFP4 — about 3× the less-pruned REAP-594B's −2.6% (it keeps 200 experts and scores 47.77%). GPQA / IFBench / τ²-Bench pending.

What's different from the code-calibrated `-504B`

Calibration = code/agentic + the full model's terminating reasoning traces (~30 self-distilled prompt → <think>…</think> → answer traces + </think>-region snippets weighted ×6). vs -504B's pure code calibration, this shifts the keep-set toward reasoning-flow experts.
Observed: reasons continuously and coherently (no stalling/premature pausing mid-thought) and self-terminates at high — and at max given a generous max_tokens budget — by thinking at length and then writing the answer (e.g. a single-file game) rather than looping.

Reasoning effort — only `high` and `max`

GLM-5.2 exposes two reasoning levels: high and max. The chat template defaults to max and treats anything else as max; there is no low/medium/minimal. Pass reasoning_effort: "high" (e.g. via chat_template_kwargs) for shorter, faster thinking; leave it default for max. This is a heavy thinker — at max it can reason for tens of thousands of tokens before answering, so give it a generous max_tokens (≈80–120k) or it will hit the cap mid-thought.

What it's calibrated on (and what that means)

Calibrated narrow, on purpose: code + tool-calling/agentic data (evol-codealpaca, Magicoder, xLAM function-calling, SWE-smith) plus the model's own terminating reasoning traces. We did not calibrate on broad/general, multilingual, or long-document data.

Stronger at: coding, tool use, and ending its own reasoning (the termination traces are the whole point of this variant).
Weaker at (expected): general knowledge, other languages, niche domains — those experts scored low on a code-heavy calibration, so they got dropped.
Long context still works — that lives in the attention, which isn't pruned (an internal 177k-token task scores 30/30); we just didn't add long-document calibration.

Want a broad general-purpose model instead? Calibrate on a wider mix (general + multilingual + long-context).

Honest limitations / which sibling to pick — READ THIS

Not A/B-validated against the code-calibrated -504B. Whether the trace-recalibration actually helps your workload is unmeasured; for pure coding the -504B may be more decisive/less verbose. Benchmark both on your own tasks.
"Max never terminates" is largely a property of the max tier, not a pruning defect. The full GLM-5.2 (even fp8) also produces 60k+ tokens of reasoning with no answer on hard open-ended prompts at max. This model targets coherent self-termination, not brevity.
Plausible tradeoff: calibrating on long terminating traces can bias toward longer reasoning — this variant may think more, not less, than -504B.
NVFP4 + prune are both lossy: generative/coding degrade little (REAP paper); knowledge/MCQA recall and non-English/niche domains are the weak axes.

Sampling — important

Use repetition_penalty ≤ 1.0 (1.0 = off). A penalty > 1.0 accumulates over long generations and spirals the (heavy) reasoning into synonym/token salad.

temperature 0.6, top_p 0.95, repetition_penalty 1.0

Method (no re-quantization)

Surviving experts are luke's NVFP4 weights bit-for-bit. Saliency S_j = mean_{x active}( g_j(x) · ‖f_j(x)‖₂ ) (router gate × raw-expert-output L2, norm taken before the gate) is accumulated over the calibration set via a custom collector that dequantizes each fired expert on the fly (modelopt NVFP4QTensor, block 16) — HF Transformers can't load modelopt fused-MoE NVFP4. All 256 experts in the 75 MoE layers (3–77) are scored, pure saliency, no frequency overlay (the REAP criterion, Cerebras arXiv:2510.13999; the paper warns frequency-protection heuristics lose coherence). Per layer keep top-168 by S_j, drop 88, renormalize the router; NVFP4 tensors copied verbatim, gate.weight/e_score_correction_bias sliced 256→168, experts renumbered 0..167, nextn/MTP layer kept (→ speculative decoding). config.json → n_routed_experts = num_experts = 168.

Serving (vLLM, 4× RTX PRO 6000) — `docker-compose.yml` included

MODEL_DIR=/path/to/GLM-5.2-NVFP4-REAP-504B-term docker compose up -d   # OpenAI API on :5001, id "GLM-5.2"

The included compose defaults to the best config found: DCP4 + MTP5 + global-topk + DCP-sharded draft + use_index_cache, on the patched serving image madeby561/vllm:dark-devotion-…-mtpdcpfix — voipmonitor's dark-devotion-…-dcpglobaltopk base plus a one-file deepseek_mtp.py fix that lets VLLM_DCP_SHARD_DRAFT run the MTP draft DCP-parallel on GLM-5.2 (full write-up on the image page). Measured single-stream on 4× RTX PRO 6000 (PCIe, TP4 / DCP4 / MTP5):

70–94 tok/s codegen
~50 tok/s at 256k context — decode stays flat with depth (DSA sparse attention: 56 tok/s at 0 ctx → 49 at 256k)
256k cold prefill: no wedge; ~200 tok/s aggregate at concurrency 4
474k-token KV pool @ util 0.95, MAX_MODEL_LEN=300000

VLLM_DCP_GLOBAL_TOPK=1 feeds the speculative draft the model's true global-topk attention target (faster and higher-fidelity than local-topk); VLLM_DCP_SHARD_DRAFT=1 shards that draft across the DCP ranks; use_index_cache caches the DSA top-2048 sparse indices across decode steps (DCP4 was comm-bound ~~40 tok/s on PCIe without it). For max short-context decode speed: DCP_SIZE=1 MTP=1 MAX_MODEL_LEN=125000 (~~125k ctx).

Credits

0xSero — GLM-5.2-NVFP4-REAP-469B, the prior art that started this.
lukealonso — GLM-5.2-NVFP4 base quant (experts sliced byte-for-byte).
REAP — Cerebras, arXiv:2510.13999.
Base model: GLM-5.2 by Z.ai.

Downloads last month: 1,837

Safetensors

Model size

290B params

Tensor type

BF16

F8_E4M3

F32

Model tree for madeby561/GLM-5.2-NVFP4-REAP-504B-term

Base model

zai-org/GLM-5.2

Quantized

lukealonso/GLM-5.2-NVFP4

Quantized

(2)

this model

Paper for madeby561/GLM-5.2-NVFP4-REAP-504B-term

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Paper • 2510.13999 • Published Oct 15, 2025 • 20