Instructions to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with libraries, inference providers, notebooks, and local apps.

Libraries

How to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below include an image, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
model = AutoModelForMultimodalLM.from_pretrained("AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP

SGLang

How to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with Docker Model Runner:
```
docker model run hf.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP
```

Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP

Deployment, operations & benchmarks → github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured benchmarks, and AGENTS.md — an operator's manual that pre-empts common stale-documentation traps.

🙏 Reference recipe credit: The modelopt + MTP graft pipeline used to build this variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, the per-projection quantization choices, and the MTP-head graft technique on the un-abliterated base; we adapted the same recipe to AEON-Ultimate's abliterated weights. The reference benchmark numbers cited below are theirs. Full credit for the recipe → sakamakismile.

🆕 AEON vLLM Ultimate container (2026-06-04)

ghcr.io/aeon-7/aeon-vllm-ultimate:latest is vLLM 0.24.0 (the same image as :2026-07-01-v0.24.0) plus PR #44389 NVFP4 KV cache (~3× capacity), DFlash, TurboQuant K8V4, and the AEON sm_121a patches. It belongs to the same recipe family as the -Multimodal-NVFP4-MTP-XS sibling, which has been benchmarked end-to-end in a production-style setup (greedy decoding, n_spec=15, results by category): math/code peak ~45 tok/s, overall mean 34.7 tok/s, and ~84 tok/s steady aggregate at concurrency ×4. This variant uses the same modelopt NVFP4 format, the same native qwen3_5_mtp head, and the same hybrid GDN+attention stack, so it should serve identically with --quantization modelopt and either --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' (native MTP) or a DFlash drafter (recommended on Spark; see container README Recipe A).

On the current v0.24.0 image, DFlash pairs with --kv-cache-dtype fp8_e4m3 (FP8 KV). Full setup + recipe matrix: container README.

🚀 Quickstart — DGX Spark / GB10 (recommended daily-driver)

On DGX Spark, run this body with a DFlash drafter (not native MTP): it lands at parity speed with the smaller -MTP-XS sibling while scoring higher on quality-eval benchmarks — the recommended daily-driver body when you have the VRAM. (Native qwen3_5_mtp decoding stays a dedicated-VRAM-Blackwell path — see the routing table below.)

docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# body (this repo) + z-lab DFlash drafter
GIT_LFS_SKIP_SMUDGE=1 git clone \
  https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP /models/mm-mtp
( cd /models/mm-mtp && git lfs pull )
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/z-lab/Qwen3.6-27B-DFlash /models/dflash
( cd /models/dflash && git lfs pull )

# ENTRYPOINT is /bin/bash — pass --entrypoint vllm
docker run -d --name aeon-vllm --gpus all --ipc=host --shm-size=16g --net=host \
    -e VLLM_USE_FLASHINFER_SAMPLER=1 \
    -v /models/mm-mtp:/model:ro -v /models/dflash:/drafter:ro \
    --entrypoint vllm ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
    serve /model --served-model-name aeon \
        --quantization modelopt --kv-cache-dtype fp8_e4m3 \
        --attention-backend TRITON_ATTN \
        --max-model-len 229376 --max-num-seqs 16 --max-num-batched-tokens 32768 \
        --gpu-memory-utilization 0.60 \
        --enable-chunked-prefill --enable-prefix-caching \
        --generation-config vllm \
        --reasoning-parser qwen3 --tool-call-parser qwen3_coder --enable-auto-tool-choice \
        --mm-encoder-tp-mode data \
        --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12,"attention_backend":"TRITON_ATTN"}' \
        --trust-remote-code

Keep --gpu-memory-utilization ≤ 0.88 on GB10 (unified memory). Use 0.60 when ASR/TTS/embedding sidecars share the Spark; raise toward 0.75–0.85 only when the LLM is the dominant GPU workload. The recommended Spark sidecar profile uses --max-model-len 229376, --max-num-seqs 16, and --max-num-batched-tokens 32768: one near-full-context session can run, while smaller agent sessions still share the pooled FP8 KV budget dynamically.

vLLM 0.24.0 DFlash note: set the attention backend in both places. --attention-backend TRITON_ATTN selects the target-model backend, but vLLM does not inherit that into the speculative drafter; the DFlash JSON must also include "attention_backend":"TRITON_ATTN". Leave --mamba-block-size unset and let vLLM derive the page/block geometry for the hybrid GDN stack. Full recipe matrix (NVFP4-KV capacity, TurboQuant, dedicated-VRAM MTP): container README.
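One way to avoid setting the backend in only one of the two places is to generate both flags from a single value. The sketch below is a hypothetical helper (`build_dflash_serve_args` is not part of vLLM or the container). It only formats the flags already shown in the Quickstart and assumes the DFlash JSON keys are exactly the ones used there.

```python
import json
import shlex


def build_dflash_serve_args(drafter_path, backend="TRITON_ATTN", num_speculative_tokens=12):
    """Assemble the backend-related flags so target and drafter always agree.

    Hypothetical helper: it only formats flags already shown in the Quickstart.
    """
    speculative_config = {
        "method": "dflash",
        "model": drafter_path,
        "num_speculative_tokens": num_speculative_tokens,
        # The target backend is not inherited by the drafter, so the same
        # value is repeated here on purpose.
        "attention_backend": backend,
    }
    return [
        "--attention-backend", backend,
        "--speculative-config", json.dumps(speculative_config, separators=(",", ":")),
    ]


args = build_dflash_serve_args("/drafter")
# shlex.join quotes the JSON so the line can be pasted into a shell command.
print(shlex.join(args))
```

Because both flags come from the same `backend` argument, changing the backend later cannot leave the drafter on a different one.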

Variants

| Format | Size | Use case |
| --- | --- | --- |
| BF16 | 51 GB | Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning) |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark / GB10: production validated with DFlash speculative decoding. Patched `vllm-aeon-ultimate-dflash` container. |
| Multimodal-NVFP4-MTP (this repo) | 27 GB | High-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native `mtp.*` head. modelopt format, `--quantization modelopt`. Vision tower preserved. |
| Text-NVFP4-MTP | 20 GB | Same as this repo but with vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM. |

What this is

This is the modelopt-format NVFP4 variant of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16, with MTP speculative decoding and the multimodal stack preserved. The BF16 source is the near-lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

- Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. This is the modelopt export format that vLLM serves through --quantization modelopt, a different code path from the -NVFP4 sibling release, which uses --quantization compressed-tensors.
- Linear-attn / GatedDeltaNet layers preserved in BF16 (432 keys across 48 GDN layers). NVFP4 quantization of the Mamba/SSM state collapses the recurrence; modelopt's *linear_attn.conv1d* ignore plus our explicit *linear_attn* exclude keeps these layers intact.
- Vision tower preserved in BF16 (333 keys). Multimodal inference is fully functional.
- MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16). The base contains MTP heads, but Qwen3_5ForConditionalGeneration.from_pretrained drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for --speculative-config '{"method":"qwen3_5_mtp",...}'.
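The graft step described in the last item can be pictured as a dictionary merge. This is a minimal sketch under stated assumptions, not the lna-lab pipeline itself: `graft_mtp_tensors` is a hypothetical name, the `mtp.` prefix is inferred from the `mtp.*` naming used in this card, the tensor names are made up, and strings stand in for real tensors that an actual pipeline would read from safetensors shards.

```python
def graft_mtp_tensors(quantized_state, base_state, prefix="mtp."):
    """Return a copy of quantized_state with the base checkpoint's MTP tensors added.

    Both arguments are plain {tensor_name: tensor} mappings. The "mtp." prefix
    is an assumption, not a confirmed key layout.
    """
    mtp_tensors = {name: t for name, t in base_state.items() if name.startswith(prefix)}
    if not mtp_tensors:
        raise ValueError(f"no tensors with prefix {prefix!r} found in the base checkpoint")
    overlap = mtp_tensors.keys() & quantized_state.keys()
    if overlap:
        # Refuse to overwrite silently; this is a choice made by this sketch.
        raise ValueError(f"quantized export already contains MTP tensors: {sorted(overlap)}")
    merged = dict(quantized_state)
    merged.update(mtp_tensors)
    return merged


# Toy stand-ins with hypothetical tensor names.
quantized = {"model.layers.0.mlp.down_proj.weight": "nvfp4-packed"}
base = {
    "model.layers.0.mlp.down_proj.weight": "bf16",
    "mtp.fc.weight": "bf16",
    "mtp.norm.weight": "bf16",
}
merged = graft_mtp_tensors(quantized, base)
print(sorted(merged))
```

Only keys under the prefix are taken from the base checkpoint, so the quantized body weights are left untouched.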

Why MTP — and where it actually wins

Multi-Token Prediction (MTP) lets the model predict several future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. Acceptance is high because the drafter is a head of the model itself rather than a separately trained network, so its proposals follow the target distribution closely.
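The draft-then-verify idea behind speculative decoding can be shown with a toy greedy loop. This is an illustrative sketch, not vLLM's implementation: `speculative_step`, `target_next`, and `draft_next` are hypothetical names, the "models" are plain functions over integer tokens, and a real engine verifies all drafts in one batched forward pass rather than one call per token.

```python
def speculative_step(target_next, draft_next, context, k):
    """One greedy draft-then-verify step.

    target_next / draft_next map a token sequence to its next token.
    Returns the tokens emitted this step: the accepted draft prefix plus
    one token chosen by the target.
    """
    # 1. Draft k tokens autoregressively with the cheap predictor.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2. Verify: keep drafted tokens while they match the target's choice.
    emitted = []
    for token in drafted:
        expected = target_next(context + emitted)
        if token != expected:
            emitted.append(expected)  # the target's token replaces the miss
            return emitted
        emitted.append(token)

    # 3. All k drafts accepted: the target contributes one extra token.
    emitted.append(target_next(context + emitted))
    return emitted


def target_next(seq):
    """Toy target: always counts up by one."""
    return seq[-1] + 1


def draft_next(seq):
    """Toy drafter: agrees with the target except after tokens where t % 4 == 3."""
    return seq[-1] + 2 if seq[-1] % 4 == 3 else seq[-1] + 1


print(speculative_step(target_next, draft_next, [0], k=3))  # all drafts accepted
print(speculative_step(target_next, draft_next, [2], k=3))  # second draft rejected
```

In both cases the emitted tokens are exactly what the target alone would have produced; the drafter only changes how many tokens come out of each verification step.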

Measured numbers on AEON-Ultimate (this variant and the XS sibling)

| Hardware | Median tok/s | Peak tok/s | Spec-decode acceptance |
| --- | --- | --- | --- |
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | ~92 (this variant) / 111.4 (XS sibling) | 124.7 (XS sibling) | 67.7 % regular / 69.2 % XS |
| DGX Spark / GB10 (unified memory), MTP method | 24.1 (XS sibling) | 27.5 | 66.3 % |
| DGX Spark / GB10, DFlash method on this body 🏆 | 38.5 thinking-on / 38.1 thinking-off | 71.3 thinking-on / 68.4 thinking-off | DFlash v2 |
| RTX 5090, B100 / B200 | not yet measured by us; community measurements welcome | | |
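A rough way to relate an acceptance rate to tokens emitted per verification step is the geometric-sum estimate below. It rests on two assumptions that this card does not confirm: that the "Spec-decode acceptance" column is a per-token rate, and that each drafted token is accepted independently with that probability. Under those assumptions a step with k drafts emits 1 + α + α² + … + α^k tokens on average. Measured acceptance lengths can be defined differently, so treat the result as an order-of-magnitude guide only. `expected_tokens_per_step` is a hypothetical helper name.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verify step with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha. The i-th draft survives with probability alpha**i, and the step
    always emits one extra token from the target model (the i = 0 term).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return sum(alpha ** i for i in range(k + 1))


# 0.677 is the acceptance figure listed for this variant on RTX PRO 6000.
for k in (1, 3, 5):
    print(k, round(expected_tokens_per_step(0.677, k), 2))
```

The sum flattens as k grows, since each extra draft is only useful when every draft before it was accepted.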

Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

- Single-stream short prompts at n=3: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
- Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)

The hardware-routing punchline

On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

| Hardware tier | Recommended variant | Why |
| --- | --- | --- |
| DGX Spark / GB10 (sm_121a, unified memory) | This body with a DFlash drafter ✅ (recommended daily-driver) | Run this body + z-lab DFlash drafter (see Quickstart above): parity speed with the XS sibling, higher quality-eval scores. Use DFlash, not native `qwen3_5_mtp`, on Spark: DFlash beats the MTP method by +26 % median / +52 % peak here (unified-memory bandwidth doesn't reward MTP's high acceptance). |
| RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM) | This variant (Multimodal-NVFP4-MTP) ✅ if you need vision; Text if text-only | MTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | Multimodal-XS if you use vision; Text-XS if text-only | XS variants fit comfortably in 32 GB. 111.4 tok/s median measured on RTX PRO 6000; RTX 5090 should land near or above that. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper, so there is no benefit. |
| B100 / B200 (sm_100, dedicated FP4) | This variant (Multimodal) or Text variant | Native FP4 + dedicated VRAM = MTP territory. |

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve

# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp

# Serve
export VLLM_USE_FLASHINFER_SAMPLER=1

# v0.24.0 removed VLLM_NVFP4_GEMM_BACKEND / VLLM_USE_FLASHINFER_MOE_* — use the KernelConfig flags
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
  --quantization modelopt \
  --linear-backend flashinfer_cutlass --moe-backend cutlass \
  --mamba-cache-dtype float32 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. With higher values, the later draft tokens drift further from the target distribution and the acceptance rate falls.
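Once the server is up, it can be queried from Python through the OpenAI-compatible endpoint. The sketch below uses only the standard library and builds the request without sending it. The helper name `build_chat_request` is hypothetical. It assumes the default vLLM port 8000, and that the served model name equals the path passed to `vllm serve` because no `--served-model-name` is set in the command above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url=None):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # OpenAI-style image part; relies on the preserved vision tower.
        content.insert(0, {"type": "image_url", "image_url": {"url": image_url}})
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",  # assumed default port
    "./aeon-ultimate-multimodal-nvfp4-mtp",  # assumed served model name
    "What animal is on the candy?",
    image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

Omitting `image_url` produces a text-only request with the same structure.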

Configuration notes

- --quantization modelopt is required (not compressed-tensors, which is a different format).
- --speculative-config '{"method":"qwen3_5_mtp", ...}' activates the grafted MTP head as the spec-decode drafter. No external drafter download is needed: the head is in the safetensors of this repo.
- --gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; 0.95 causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16):
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG)
- *linear_attn* (added — full GDN preservation)
- *visual* (added — vision tower preservation)
- *mtp* (added — MTP head preservation)
- *output_layer*, output.*
MTP graft: 15 tensors copied in BF16 from Qwen/Qwen3.6-27B after modelopt export (Qwen3_5ForConditionalGeneration.from_pretrained drops them at load time; the explicit graft restores them)
Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
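To see what the exclusion list does in practice, the patterns can be tried against module names with shell-style wildcards. This sketch is an approximation: it uses Python's `fnmatch` rather than modelopt's own matching code, and the module names are hypothetical examples, not keys read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the exclusion list in this section. Matching them with
# fnmatch approximates modelopt's wildcard handling; it is not its actual code.
KEEP_BF16_PATTERNS = [
    "lm_head", "proj_out.*", "*router*", "*mlp.gate.*",
    "*linear_attn.conv1d*", "*mixer.conv1d*",
    "*linear_attn*", "*visual*", "*mtp*",
    "*output_layer*", "output.*",
]


def stays_bf16(module_name):
    """True if any exclusion pattern matches the module name."""
    return any(fnmatchcase(module_name, pattern) for pattern in KEEP_BF16_PATTERNS)


# Hypothetical module names, for illustration only.
for name in [
    "model.language_model.layers.0.linear_attn.in_proj_qkvz",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.gate_proj",
    "model.visual.blocks.0.attn.qkv",
    "mtp.layers.0.self_attn.q_proj",
    "lm_head",
]:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'NVFP4'}")
```

Note that `*mlp.gate.*` requires a dot after `gate`, so a dense `mlp.gate_proj` projection is still quantized under this reading; the pattern targets router-style gate modules.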

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.