How to use from
llama.cpp
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
# Run inference directly in the terminal:
./llama-cli -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
Use Docker
docker model run hf.co/maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF
Quick Links

Qwythos-9B-Claude-Mythos-5-1M-MTP ROCmFP4 COHERENT — GGUF

ROCmFP4 COHERENT quant of Qwythos-9B-Claude-Mythos-5-1M-MTP (Qwen3.5-9B dense, 1M YaRN, vision). Q6_K embeddings preserve the shared-embedding MTP head quality — 4/4 mesh_eval, 5/5 hermes_loop, +37.5% MTP throughput on a single 16 GB RDNA4 card. Built with charlie12345/ROCmFPX 11d76c2 for AMD ROCm (gfx1200).

File Size Quant BPW
Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT.gguf 5.1 GiB Q4_0_ROCMFP4_COHERENT 4.70

⚠ This is NOT a stock llama.cpp quant. ROCmFP4 weight formats (Q4_0_ROCMFP4_COHERENT, Q4_0_ROCMFP4_STRIX_LEAN, Q8_0_ROCMFPX_AGENT) are unique to the charlie12345/ROCmFPX fork. Stock llama.cpp will exit with unknown quantization at load time. Use the ROCmFPX fork's llama-server / llama-cli.

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough ROCmFP4 evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

  • Harness scope is bounded. The numbers below come from the mesh's mesh_eval (4 deterministic tests + throughput) + hermes_loop_eval (5 agent scenarios). That's a regression suite, not a quality benchmark — it answers "does this quant still serve the mesh's agent stack correctly," not "is this the best possible ROCmFP4 quant of this model."
  • Sample sizes are small. Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
  • No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory. If you need a quality signal, charlie12345's own validation ladder or an lm-eval-harness run is the right tool.
  • Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
  • No human eval. "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, the parent repo empero-ai/Qwythos-9B-Claude-Mythos-5-1M and the upstream Qwen3.5-9B are the place to look.

What we measured

ROCmFP4 COHERENT vs STRIX_LEAN vs AGENT (Node B, 16 GB RDNA4)

Quant Size bpw gen t/s (MTP-OFF) gen t/s (MTP-ON) MTP Δ mesh_eval hermes_loop
COHERENT 5.1 GiB 4.70 44.4 61.1 +37.5% 4/4 5/5
STRIX_LEAN 4.8 GiB 4.38 45.7 60.3 +31.9% 2/4 not tested
AGENT 9.1 GiB 8.41 29.3 CANNOT RUN 2/4 not tested

COHERENT is the winner. STRIX_LEAN's Q5_K embeddings cause thinking-leak on this shared-embedding MTP model. AGENT (Q8_0_ROCMFPX) is 9.1 GiB — too large for MTP compute buffers on 16 GB at any context size.

Cross-GPU: ROCm vs CUDA

Metric Node D CUDA (Q6_K) Node B ROCm (COHERENT) Notes
Size 7.62 GiB 5.1 GiB ROCmFP4 is 33% smaller
MTP-OFF gen t/s 56.2 44.4 CUDA faster (different hardware class)
MTP-ON gen t/s 89.1 61.1 MTP acceptance +58.5% vs +37.5%
mesh_eval 4/4 4/4 Both clean
hermes_loop 5/5 5/5 Both clean

CUDA Q6_K is faster per-token, but ROCm COHERENT delivers the same quality scores at 33% smaller footprint — critical for multi-model GPU slots.

mesh_eval — COHERENT MTP-OFF

Test Result Notes
Gibberish ✅ OK Clean output
Thinking leak ✅ CLEAN No <think> in content (Q6_K embeddings fix this)
Tool calling ✅ PASS get_weather(location=Tokyo)
Coding ✅ PASS Merge code executes correctly
Uncensored ⚠️ equivocal Framework pass; 4/4 overall
Throughput ✅ 44.4 t/s 44.4 gen t/s mean, 3 reps, stddev 0.6
Vision ❌ FAIL ROCmFPX may not support this mmproj path fully

Overall: 4/4 (vision excluded from core count)

hermes_loop — COHERENT MTP-ON (all 5/5)

Scenario Result tps
single (get_weather) ✅ PASS 41.0
chained (calculate) ✅ PASS 45.2
multi_step (compare) ✅ PASS 51.8
search (Eiffel Tower) ✅ PASS 42.8
error_recovery ✅ PASS 42.9

MTP speedup context — first real MTP win on the mesh

This is the first model on the Sovereign Machina mesh where MTP speculative decoding delivers a large, reliable speedup (+37.5% on ROCm, +81.2% on CUDA Q6_K per the companion bench). Prior models with native single-file MTP (Ornstein-9B-v2) showed only +0-2% within noise. The difference: Qwythos is a dense Qwen3.5-9B with a more capable MTP head, where draft acceptance rates are high enough to recover the overhead.

Quick start

# ROCmFPX fork (required — stock llama.cpp won't load this quant)
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
mkdir build && cd build
cmake -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1200 ..
make -j$(nproc)

# Serve with all production flags
HSA_OVERRIDE_GFX_VERSION=12.0.0 \
LD_LIBRARY_PATH=$(pwd)/bin \
./bin/llama-server \
  -m Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT.gguf \
  --mmproj mmproj-Qwythos-9B-Claude-Mythos-5-1M-f16.gguf \
  --host 0.0.0.0 --port 8081 \
  -ngl 99 -c 131072 -np 1 -ub 2048 \
  --cache-type-k turbo4 --cache-type-v turbo4 --cache-ram 32768 --kv-unified \
  -fa on --metrics --fit off \
  --spec-type draft-mtp

Flags explained: -ub 2048 is required for Qwen3.5 hybrid SSM architecture. --fit off is required at 64K+ context on RDNA4 (Pattern 16 per ROCmFPX docs). turbo4 KV cache is free performance for head_dim=128 models (Qwen family). --spec-type draft-mtp enables native single-file MTP speculative decoding.

Reproduce the quant

~/ROCmFPX/build-rdna4/bin/llama-quantize \
  --allow-requantize \
  /path/to/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
  Q4_0_ROCMFP4_COHERENT \
  Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT.gguf

Source quant: empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUFBF16 source → COHERENT re-quant.

Files in this repo

File Size Description
Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT.gguf 5.1 GiB ROCmFP4 COHERENT quant
README.md This file
raw-mesh-eval-coherent.json 2.9 KB mesh_eval.py output (4/4 PASS)
raw-hermes-loop-coherent.json 7.5 KB hermes_loop_eval.py output (5/5 PASS)

What's NOT in this repo (caveats)

  • Stock llama.cpp will not load this file. The Q4_0_ROCMFP4_COHERENT weight format is unique to charlie12345/ROCmFPX. Use that fork's llama-server/llama-cli/llama-quantize.
  • No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200). Vulkan path on RDNA4 has a known upstream regression — we did not test it.
  • 131K ctx is HTTP 400 if --fit off is omitted. Pattern 16 of the ROCmFPX docs: RDNA4 with HSA requires --fit off at 64K+ context to avoid HSA SEGV in hsaKmtWaitOnMultipleEvents_ExtCtx.
  • Vision test failed on the ROCmFPX build. The ROCmFPX fork's mmproj path may not fully support vision encoding for this model on RDNA4. The parent model supports vision on CUDA/stock llama.cpp.
  • No MTP bench on the AGENT quant. AGENT (9.1 GiB) cannot run MTP on 16 GB — compute buffers OOM regardless of context size.
  • No quality benchmark (perplexity, MMLU, GSM8K). See "Scope of these benchmarks" above — this quant is validated for agent stack regression, not academic reproduction.
  • Uncensored test result is equivocal. The eval returned empty output on a sensitive prompt — not a refusal, but not a clear pass either. No impact on agent tool-calling or coding benchmarks.
  • The source is from empero-ai/Qwythos-9B-Claude-Mythos-5-1M safetensors, quantized via empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF BF16 intermediate. The chain is: Qwen/Qwen3.5-9B (Apache 2.0) → empero-ai (fine-tune) → BF16 GGUF → our COHERENT quant.
  • 4.5+ GiB minimum VRAM. Doesn't fit on smaller cards. The mesh's 16 GB card runs it with ~11 GB headroom at 4K context, ~1 GB headroom at 131K.

Provenance

  • Date: 2026-06-29/30
  • Source model: empero-ai/Qwythos-9B-Claude-Mythos-5-1M (Apache 2.0, Qwen3.5-9B fine-tune with 1M YaRN RoPE + MTP)
  • Intermediate GGUF: empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF (BF16 source)
  • Quantizer: charlie12345/ROCmFPX 11d76c2 via llama-quantize --allow-requantize
  • Build hardware: Node B — AMD Ryzen 9 5900XT 16C/24T, 64GB DDR4, AMD RX 9060 XT 16GB (gfx1200, RDNA4, ROCm 7.8)
  • Build tooling: ROCmFPX fork with -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1200
  • Bench harnesses: mesh_eval.py (6 tests), hermes_loop_eval.py (5 agent scenarios)
  • Original bench report: raw/benchmarks/2026-06-30-qwythos-9b-rocm-bench/

License

  • Parent model (empero-ai/Qwythos-9B-Claude-Mythos-5-1M): Apache 2.0
  • Upstream base (Qwen/Qwen3.5-9B): Apache 2.0
  • Quantizer (charlie12345/ROCmFPX): MIT
  • This quant: Apache 2.0 (derived from Apache 2.0 parent)

This is a derivative quantized file. The license terms of the parent model apply to the use of this file. Verify commercial use terms with the parent model card before deployment.

Downloads last month
230
GGUF
Model size
9B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF

Finetuned
Qwen/Qwen3.5-9B
Quantized
(69)
this model

Collection including maczzzzzz/Qwythos-9B-Claude-Mythos-5-1M-MTP-ROCmFP4-COHERENT-GGUF