Mabry PRO

artificial-citizen

https://www.joshmabry.dev

Mabry1985

Recent Activity

liked a model about 9 hours ago

Qwen/Qwen3-Coder-Next

new activity 1 day ago

protoLabsAI/Ornith-1.0-9B-MTP-GGUF:Unable to load

new activity 1 day ago

protoLabsAI/Ornith-1.0-9B-MTP-GGUF:missing template

Posts 2

Post

A 9B running 306 tok/s on a single GPU.

We shipped calibrated NVFP4 + MTP builds of Ornith-1.0-9B, and found something along the way: NVFP4 and speculative decoding are multiplicative on Blackwell. MTP's verify step batches compute straight into the FP4 tensor cores: a +52% lift on NVFP4 vs +17% on Q4_K_M. The comparison is acceptance-controlled, so the gain comes from the kernels, not the draft head.

- GGUF (6.6 GB, MTP baked in): 306 tok/s on RTX PRO 6000 — faster than our 4B record
- vLLM (10.4 GB, W4A4 + MTP sidecar): ~1.5× bf16+MTP at 55% of the VRAM
- Full release gate vs bf16 on the card: FC 96% (beats base), claw −0.028, coherence verified to 60K
- On Ampere? Honestly: use Q4_K_M. The FP4 win is Blackwell's tensor cores.
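
To see why an acceptance-controlled comparison isolates the kernels, here is a rough cost model in Python. It is a sketch, not our benchmark code: it assumes each drafted token is accepted independently with probability `alpha` (the standard speculative-decoding simplification), and `verify_cost` and `draft_cost` are hypothetical costs measured in units of one plain decode step. None of the numbers in it are measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass with k drafted tokens.

    Assumes i.i.d. per-token acceptance probability alpha; the verify pass
    always yields at least one token, so the result is between 1 and k + 1.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def mtp_lift(alpha: float, k: int, verify_cost: float, draft_cost: float = 0.0) -> float:
    """Fractional throughput lift over plain decoding (0.5 means +50%).

    verify_cost: cost of one batched verify pass, relative to one decode step.
    draft_cost:  cost of drafting one token, relative to one decode step.
    Both are illustrative parameters, not values from the release.
    """
    tokens = expected_tokens_per_step(alpha, k)
    step_cost = verify_cost + k * draft_cost
    return tokens / step_cost - 1.0


if __name__ == "__main__":
    alpha, k = 0.7, 2  # made-up acceptance rate and draft length
    for label, verify_cost in [("cheaper verify", 1.1), ("costlier verify", 1.5)]:
        print(f"{label}: lift = {mtp_lift(alpha, k, verify_cost):+.0%}")
```

With `alpha` and `k` held fixed, the tokens-per-step term is identical across formats, so any difference in lift has to come from how cheaply the hardware runs the batched verify pass.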

Every number traces to a row in protoLabsAI/lab-benchmarks (CC-BY-4.0).

🔗 huggingface.co/protoLabsAI/Ornith-1.0-9B-NVFP4
🔗 huggingface.co/protoLabsAI/Ornith-1.0-9B-MTP-GGUF

Want a different size/format? Open a discussion — we usually ship within 48h.

Post

Built OpenRouter's Fusion on our own LiteLLM gateway, then benchmarked whether it earned its cost.

The detail that decides the design: in OpenRouter's own numbers, fusing a model with itself still gained ~6.7 points. So the engine is the judge synthesizing over diverse samples, not the mix of models. Self-MoA ("Rethinking Mixture-of-Agents", arXiv 2502.00674) backs it — aggregating samples from one strong model beats mixing in weaker ones, which usually dilutes quality.

That maps cleanly onto local inference. A multi-model panel means holding N models resident, a non-starter on one shared card. Judged self-consistency needs only one, and ours already runs as two load-balanced replicas, so the samples spread across both GPUs for free.

The result: a ~360-line CustomLLM provider with every sub-call looped back through the gateway, so it keeps routing, fallbacks, and cost tracking, plus a 29-prompt blind-ranked benchmark with an explicit ship rule. All MIT.
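
The core loop of judged self-consistency can be sketched in a few lines of Python. This is a minimal illustration, not the code in the repo: `fuse`, `sample`, and `judge` are hypothetical names, the two callables stand in for whatever client sends a prompt through the gateway, and the judge prompt wording is made up.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical interface: any async function mapping a prompt to a completion.
Complete = Callable[[str], Awaitable[str]]


async def fuse(prompt: str, sample: Complete, judge: Complete, n: int = 4) -> str:
    """Draw n samples from one model, then have a judge synthesize one answer."""
    # Fan out concurrently; a load balancer behind `sample` can spread
    # these calls across replicas.
    candidates = await asyncio.gather(*(sample(prompt) for _ in range(n)))

    # Number the candidates so the judge can refer to them.
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    judge_prompt = (
        f"Question:\n{prompt}\n\n"
        f"Candidate answers:\n{numbered}\n\n"
        "Synthesize the single best answer from the candidates."
    )
    return await judge(judge_prompt)
```

Passing `sample` and `judge` in as plain callables keeps the sketch independent of any client library; in practice both would point at the same gateway so sub-calls keep its routing and cost tracking.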

Breakdown: https://protolabs.studio/blog/fusion-on-your-own-litellm-gateway
Code: https://github.com/protoLabsAI/fusion-gateway
