Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
SeaWolf-AIΒ 
posted an update about 12 hours ago
Post
1111
πŸš€ Adding a GPU without building one

AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere β€” how efficiently you use the GPUs you already have.

Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU β€” the effect of plugging in one more "virtual GPU."

VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):

Qwen3.5-35B-A3B (MoE): 25.7 β†’ 601 tok/s (23.4Γ—)
Darwin-36B-Opus (in-house MoE): 25.0 β†’ 280.8 (11.2Γ—)
10,000+ tok/s peak aggregate under concurrency
The key: it's reproducible β€” model + serving shipped as one container.

docker pull vidraft/qwen35-vkae:601
Don't take our word for it β€” run it yourself. The mechanism will be released as a paper.

πŸ† Leaderboard & demo πŸ‘‰ VIDraft/vkae
Articles πŸ‘‰ https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard
In this post