A 9B running at 306 tok/s on a single GPU.
We shipped calibrated NVFP4 + MTP builds of Ornith-1.0-9B, and found something along the way: NVFP4 and speculative decoding are multiplicative on Blackwell. MTP's verify step batches compute straight into the FP4 tensor cores: MTP gives a +52% lift on NVFP4 vs +17% on Q4_K_M. The comparison is acceptance-controlled, so the gap comes from the kernels, not the draft head.
- GGUF (6.6 GB, MTP baked in): 306 tok/s on RTX PRO 6000 — faster than our 4B record
- vLLM (10.4 GB, W4A4 + MTP sidecar): ~1.5× bf16+MTP at 55% of the VRAM
- Full release gate vs bf16 on the card: FC 96% (beats base), claw −0.028, coherence verified to 60K
- On Ampere? Honestly: use Q4_K_M. The FP4 win is Blackwell's tensor cores.
Every number traces to a row in protoLabsAI/lab-benchmarks (CC-BY-4.0).
🔗 huggingface.co/protoLabsAI/Ornith-1.0-9B-NVFP4
🔗 huggingface.co/protoLabsAI/Ornith-1.0-9B-MTP-GGUF
Want a different size/format? Open a discussion — we usually ship within 48h.
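For anyone checking the arithmetic behind the numbers above, here is a minimal Python sketch. It uses only the figures quoted in this post (+52%, +17%, 10.4 GB, 55%). The helper name `lift_multiplier` is ours for illustration, and reading "55%" as a fraction of the bf16+MTP footprint is an assumption, not something taken from the benchmark rows.

```python
def lift_multiplier(percent_lift: float) -> float:
    """Convert a '+52%'-style lift into a throughput multiplier (e.g. 1.52)."""
    return 1.0 + percent_lift / 100.0


# MTP lifts quoted in the post, per quantization format.
nvfp4_mtp = lift_multiplier(52)  # MTP on NVFP4
q4km_mtp = lift_multiplier(17)   # MTP on Q4_K_M

# How much more MTP helps on NVFP4 than on Q4_K_M, as a ratio of multipliers.
relative_gain = nvfp4_mtp / q4km_mtp

# Assumption: "55% of the VRAM" is relative to the bf16+MTP footprint,
# so the implied bf16+MTP size is the vLLM build size divided by 0.55.
implied_bf16_gb = 10.4 / 0.55

print(f"MTP helps {relative_gain:.2f}x more on NVFP4 than on Q4_K_M")
print(f"Implied bf16+MTP footprint: {implied_bf16_gb:.1f} GB")
```

The ratio is a derived figure, not a measured one; the measured rows live in protoLabsAI/lab-benchmarks.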