GLM-5.2-REAP50-Q3_K_M-GGUF

A GGUF build of GLM-5.2, REAP expert-pruned (50%) and quantized to Q3_K_M (~169 GB), sized to run on 2× 96 GB GPUs (e.g. RTX PRO 6000; ~192 GB VRAM total) with room for context.
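
As a rough check of the "room for context" claim, the sketch below subtracts the stated model size from the stated total VRAM. It uses only the nominal figures quoted above; the function name is illustrative, and the result ignores GB/GiB rounding, driver overhead, and how evenly the weights split across the two GPUs.

```python
def vram_headroom_gb(model_gb: float, gpu_gb: float, num_gpus: int) -> float:
    """Return nominal VRAM left after loading the weights (same units as the inputs)."""
    return gpu_gb * num_gpus - model_gb


# Figures from the description above: ~169 GB of weights on 2x 96 GB GPUs.
headroom = vram_headroom_gb(model_gb=169, gpu_gb=96, num_gpus=2)
print(f"~{headroom:.0f} GB nominally left for KV cache and compute buffers")
```

How much context that headroom buys depends on the KV-cache settings you choose, so treat it as an upper bound rather than a usable budget.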

What this is

  • Base: zai-org/GLM-5.2 (glm_moe_dsa, ~753B MoE).
  • REAP-50: keeps the 128 most salient experts per layer (of 256), ranked by Cerebras REAP saliency (gate × ‖expert_output‖), and drops the MTP layer, leaving ~394B params.
  • Quantized to Q3_K_M, split into 5 shards (~45 GB each).
  • Runs as full MLA attention (the DSA lightning-indexer is not used at inference, the same simplification as the upstream conversion).
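
To make the REAP-50 bullet concrete, here is a minimal toy sketch of saliency-based expert selection. It is not the pruning code used for this model. It assumes that saliency is the mean of gate × ‖expert_output‖ over the calibration tokens routed to each expert, and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def expert_saliency(records):
    """Mean of gate_weight * L2 norm of the expert output, per expert.

    records: iterable of (expert_id, gate_weight, expert_output_vector),
    one entry per (token, routed expert) pair from a calibration run.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate, output in records:
        norm = math.sqrt(sum(v * v for v in output))
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_keep(saliency, keep):
    """Return the ids of the `keep` highest-saliency experts, sorted by id."""
    ranked = sorted(saliency, key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Toy data: 4 experts, keep 2 (the model card keeps 128 of 256 per layer).
records = [
    (0, 0.9, [3.0, 4.0]),
    (1, 0.1, [3.0, 4.0]),
    (2, 0.5, [6.0, 8.0]),
    (3, 0.5, [0.3, 0.4]),
    (0, 0.7, [3.0, 4.0]),
]
kept = experts_to_keep(expert_saliency(records), keep=2)
print(kept)
```

In this toy, an expert that is never routed to has no saliency entry and is therefore never kept; a real pipeline would need an explicit policy for that case.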

โš ๏ธ Requires a patched llama.cpp (for now)

Stock llama.cpp can't load any GLM-5.2 GGUF yet: its GLM-DSA loader requires the DSA indexer tensors on every layer, but GLM-5.2 ships them only on a subset of layers (the "full" layers), so loading fails with missing tensor 'blk.N.indexer.k_norm.weight'. The indexer is loaded but unused (the graph is DeepSeek-V2 MLA), so the fix is simply to make those tensors optional.
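
The difference between a required and an optional tensor lookup can be illustrated with a small Python toy. This is not llama.cpp code and not the contents of the patch; the helper names are hypothetical, and the tensor names are modeled only on the error message quoted above.

```python
class MissingTensorError(KeyError):
    """Raised when a required tensor is absent from the file."""


def get_tensor(tensors, name, required=True):
    """Look up a tensor by name; an absent optional tensor yields None."""
    if name in tensors:
        return tensors[name]
    if required:
        raise MissingTensorError(f"missing tensor '{name}'")
    return None


def load_layer(tensors, layer, indexer_required):
    """Load one layer; the indexer tensor may be required or optional."""
    attn_norm = get_tensor(tensors, f"blk.{layer}.attn_norm.weight")
    indexer = get_tensor(
        tensors,
        f"blk.{layer}.indexer.k_norm.weight",
        required=indexer_required,
    )
    return {"attn_norm": attn_norm, "indexer_k_norm": indexer}


# Toy file: layer 0 ships the indexer tensor, layer 1 does not.
toy_file = {
    "blk.0.attn_norm.weight": "w0",
    "blk.0.indexer.k_norm.weight": "k0",
    "blk.1.attn_norm.weight": "w1",
}

try:
    load_layer(toy_file, 1, indexer_required=True)
    strict_load_failed = False
except MissingTensorError:
    strict_load_failed = True

relaxed = load_layer(toy_file, 1, indexer_required=False)
print(strict_load_failed, relaxed["indexer_k_norm"])
```

Because the indexer is never used in the inference graph, a missing value is harmless here; the same relaxation would be unsafe for a tensor the graph actually reads.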

Apply the included llama.cpp-glm-dsa-indexer-optional.patch (it modifies src/models/glm-dsa.cpp) and rebuild, or wait for the upstream GLM-DSA runtime PR. After patching, the model loads and runs normally.

# in a recent llama.cpp checkout:
git apply llama.cpp-glm-dsa-indexer-optional.patch
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
./build/bin/llama-cli -m GLM-5.2-REAP50-Q3_K_M-00001-of-00005.gguf --jinja -ngl 99 -p "..."
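
The command above points at the first of the five shards; llama.cpp is expected to pick up the remaining shards from the same directory. A small pre-flight check such as the sketch below can confirm that all shards are present before a long load. The helper names are illustrative, and the file-name pattern is inferred from the shard name used in the command.

```python
from pathlib import Path


def shard_names(prefix: str, total: int) -> list[str]:
    """Build shard file names of the form <prefix>-00001-of-00005.gguf."""
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


def missing_shards(directory: str, prefix: str, total: int) -> list[str]:
    """Return the expected shard names that are not files in `directory`."""
    base = Path(directory)
    return [name for name in shard_names(prefix, total) if not (base / name).is_file()]


names = shard_names("GLM-5.2-REAP50-Q3_K_M", 5)
print(names[0])

# Check the current directory; an empty list means every shard was found.
print(missing_shards(".", "GLM-5.2-REAP50-Q3_K_M", 5))
```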

Quality caveat

This is the most aggressive variant: REAP-50 (roughly +37.5% perplexity vs full GLM-5.2) compounded with Q3_K 3-bit quantization. It generates coherently (chain-of-thought intact, correct simple code) but is not a quality champion; it's the "fits 192 GB and runs fast" option. For higher quality at a larger footprint, see the MLX REAP-25 (+2.3% PPL) or the full GLM-5.2 ladder under pipenetwork.

Smoke-tested on Apple Metal (~17 tok/s); not tested on CUDA/RTX 6000.
