Adding a GPU Without Building One
Why inference acceleration is quietly becoming essential AI infrastructure
In AI, people talk about two things: how smart the model is, and how many GPUs you managed to secure. The bottleneck the industry quietly struggles with sits somewhere else entirely — how efficiently you actually use the GPUs you already have.
TL;DR
- Training happens once; inference happens the entire time users use the product — so a service's economics come down to cost per token.
- Inference acceleration doesn't manufacture a GPU. In software, it turns the GPU you have into the equivalent of several — same hardware, same power, multiplied throughput.
- As a concrete example, VKAE reached up to 23.4× the baseline single-stream throughput on the same GPU, with no observed loss of output quality, and ships a container so you can reproduce it yourself.
- The mechanism itself is planned for release as a paper; the results and the means to verify them are published first.
1. The real contest isn't building a model — it's running it
Every time someone asks an AI a question, a GPU does work. Each line of output costs electricity and server time, and as the user base grows this "cost of running" snowballs.
One fact is central: training happens once, but inference happens the entire time your users are using the product. So the profit and loss of an AI service ultimately comes down to cost per token. The real battleground of AI infrastructure today is not the moment a model is created, but the moment it keeps running — inference.
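The cost-per-token arithmetic can be sketched in a few lines. This is a minimal illustration, not a pricing model: the hourly GPU price and both throughput figures are hypothetical placeholders, and the GPU is assumed to be kept fully busy.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost of one million generated tokens on a GPU that is kept fully busy."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs, for illustration only.
GPU_COST_PER_HOUR = 4.00  # placeholder hourly price for one GPU

slow = cost_per_million_tokens(GPU_COST_PER_HOUR, tokens_per_second=1_000)
fast = cost_per_million_tokens(GPU_COST_PER_HOUR, tokens_per_second=2_000)

print(f"1,000 tok/s: {slow:.3f} per million tokens")
print(f"2,000 tok/s: {fast:.3f} per million tokens")
```

Because the hourly cost of the GPU is fixed, doubling throughput halves the cost of every token it produces.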
2. Inference acceleration doesn't manufacture a GPU — it adds one
The cleanest way to understand this technology is an analogy:
It doesn't build GPUs. It pushes the performance of the GPU you already have several times higher — the same effect as plugging in one more "virtual GPU."
Where NVIDIA carves a physical chip to raise performance, inference acceleration uses software — kernel-level optimization — to extract more work from the same chip. Same hardware, same electricity, same conditions; several times the throughput.
3. VKAE measured data — with the conditions attached
Single-stream throughput (tok/s) on a single NVIDIA B200, same measurement harness, baseline serving vs. optimized serving. Across every row, no degradation in output quality (accuracy) was observed.
| Model | Type | Baseline (tok/s) | Optimized, VKAE (tok/s) | Speedup |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | MoE (~3B active) | 25.7 | 601 | 23.4× |
| Darwin-36B-Opus | in-house MoE | 25.0 | 280.8 | 11.2× |
| JGOS-398B | in-house large MoE | 88 | 382 | 4.33× |
| Gemma 4 E4B | small (challenge profile) | 95.4 | 506.9 | 5.3× |
| Qwen3.6-27B | dense | 36.9 | 91.0 | 2.47× |
Qwen3.5-35B-A3B also recorded over 10,000 tok/s of peak aggregate throughput under high concurrency. The single-stream figures are best-case numbers for repetitive content; on varied, real-world content they measure lower (about 455 tok/s for this model). Stating how a number moves with its conditions is what makes a claim like this credible.
The point isn't the number itself, but its implication: a 23× speedup means one GPU does the work of several. Instead of buying a new GPU, you expand the one you have in software.
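The speedup column is simply optimized throughput divided by baseline throughput, and it is easy to recompute from the published rows. The sketch below uses only the figures from the table; the one assumption is in the last lines, which reuse the 25.7 tok/s baseline for the varied-content case, since the article does not give a separate baseline for it.

```python
# (model, baseline tok/s, optimized tok/s), copied from the table above.
ROWS = [
    ("Qwen3.5-35B-A3B", 25.7, 601.0),
    ("Darwin-36B-Opus", 25.0, 280.8),
    ("JGOS-398B", 88.0, 382.0),
    ("Gemma 4 E4B", 95.4, 506.9),
    ("Qwen3.6-27B", 36.9, 91.0),
]


def speedup(baseline_tok_s: float, optimized_tok_s: float) -> float:
    """Ratio of optimized to baseline throughput."""
    return optimized_tok_s / baseline_tok_s


speedups = {name: speedup(base, opt) for name, base, opt in ROWS}
for name, ratio in speedups.items():
    print(f"{name:18s} {ratio:5.2f}x")

# Assumption: the same 25.7 tok/s baseline applies on varied content.
varied_content_speedup = speedup(25.7, 455.0)
print(f"Qwen3.5-35B-A3B on varied content: {varied_content_speedup:.1f}x")
```

Recomputed from the rounded figures shown, the JGOS-398B row comes out at about 4.34× rather than 4.33×, a difference small enough to be explained by rounding of the inputs.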
4. Why some models speed up a lot and others barely
There's a clear pattern in the table, stated plainly:
- MoE models (only a fraction of parameters active per token, e.g., 3B of 35B) are memory-bandwidth-bound in single-stream decoding. The dominant per-step cost is reading weights, so serving-layer optimization pays off dramatically: 11–23× for the two 35B-class MoE models above, and still 4.33× for the much larger JGOS-398B.
- Large dense models are compute-bound, so the same optimization yields a smaller gain: roughly 1–2.5× (2.47× for Qwen3.6-27B in the table).
It is not "tens of times faster, always." The speedup depends on model structure, and the reason is the nature of the hardware bottleneck. Publishing that limit, too, is part of an honest benchmark.
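A rough way to put numbers on this pattern is a bandwidth ceiling: if every active weight has to be read from GPU memory once per generated token, single-stream tok/s cannot exceed memory bandwidth divided by the bytes of active weights. The sketch below is a back-of-envelope bound, not the VKAE method. The 8 TB/s bandwidth figure and the 16-bit weight size are assumptions, and the estimate ignores KV-cache reads, expert routing, and compute time.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         bytes_per_param: float,
                         bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream tok/s if all active weights are read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token


ASSUMED_BANDWIDTH_TB_S = 8.0   # assumption: nominal memory bandwidth of a B200-class GPU
ASSUMED_BYTES_PER_PARAM = 2.0  # assumption: 16-bit weights

# ~3B active parameters (Qwen3.5-35B-A3B) vs. all 27B parameters of a dense model.
moe_ceiling = decode_ceiling_tok_s(3, ASSUMED_BYTES_PER_PARAM, ASSUMED_BANDWIDTH_TB_S)
dense_ceiling = decode_ceiling_tok_s(27, ASSUMED_BYTES_PER_PARAM, ASSUMED_BANDWIDTH_TB_S)

# Headroom = how far the measured baseline sits below the ceiling.
moe_headroom = moe_ceiling / 25.7      # baseline tok/s from the table
dense_headroom = dense_ceiling / 36.9  # baseline tok/s from the table

print(f"MoE, 3B active: ceiling {moe_ceiling:.0f} tok/s, headroom {moe_headroom:.0f}x")
print(f"Dense, 27B:     ceiling {dense_ceiling:.0f} tok/s, headroom {dense_headroom:.1f}x")
```

Under these assumptions the ceiling for a 3B-active MoE sits far above its measured baseline, while the 27B dense model has only about 4× between baseline and ceiling, which is consistent with the smaller gain reported for it.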
5. Why this is "lose money without it," not "nice to have"
For anyone operating GPUs at scale, inference acceleration is less an option than a requirement, for three reasons:
- GPUs are expensive, scarce, and power-hungry. H100/B200-class parts are hard to get, and data-center power is finite. Buying your way out indefinitely isn't possible.
- So "efficiency of the resources you secured" becomes the competitive edge. Doubling throughput on the same GPU is the same as building twice the infrastructure; without optimization, you use expensive hardware at half capacity.
- Inference cost grows as your service succeeds. Training is once; inference runs the whole time users use it. Margins hinge on cost per token — and acceleration lowers it directly.
For everyone running AI as a service — cloud providers, AI startups, enterprises running their own models, and nation-scale (sovereign) AI — this is foundational. The tighter GPU supply gets and the higher power prices climb, the more it is worth.
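The claim that doubling throughput equals building twice the infrastructure is a capacity calculation: the number of GPUs a service needs is its target aggregate throughput divided by what one GPU delivers, rounded up. A minimal sketch, with a hypothetical target load and hypothetical per-GPU throughput figures:

```python
import math


def gpus_needed(target_tok_s: float, per_gpu_tok_s: float) -> int:
    """Smallest number of GPUs whose combined throughput meets the target."""
    if per_gpu_tok_s <= 0:
        raise ValueError("per_gpu_tok_s must be positive")
    return math.ceil(target_tok_s / per_gpu_tok_s)


# Hypothetical service load and per-GPU aggregate throughput, for illustration only.
TARGET_TOK_S = 100_000

before = gpus_needed(TARGET_TOK_S, per_gpu_tok_s=2_500)
after = gpus_needed(TARGET_TOK_S, per_gpu_tok_s=5_000)

print(f"GPUs needed before optimization: {before}")
print(f"GPUs needed after doubling per-GPU throughput: {after}")
```

The same division applies to power and rack space: every GPU that does not have to be deployed is also one that does not have to be powered or cooled.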
6. This is already a global trend
This is not one company's claim but the direction of the whole industry. The rise of inference-serving frameworks such as vLLM and TensorRT-LLM as de facto standards is part of the same story, as is the attention on inference-dedicated silicon (Groq, Cerebras): all of it pushes toward making inference cheap.
7. What stands out here — reproducibility
Speed bragging is common; "we're the fastest" is everywhere. The problem is that there's usually no way to verify it.
What's notable in the VKAE case is that, alongside the numbers, it ships a single container bundling the model weights and the optimized serving runtime — so you can reproduce the result on your own GPU. Because it's OpenAI-compatible, it drops into an existing service directly.
```bash
docker pull vidraft/qwen35-vkae:601
docker run --gpus all -p 8000:8000 vidraft/qwen35-vkae:601
# OpenAI-compatible API on http://localhost:8000/v1
```
This reproducibility matters more than it seems. Benchmarks can be made favorable by choosing conditions, so "I ran it myself and it held up" becomes the standard of trust. To be rigorous, such a claim is only complete when it also states what baseline the multiple is measured against, and how much quality was lost. The more transparent those conditions, the more the technical community trusts it.
8. What about the "method"? — to be published in a paper
Right now, the specific acceleration mechanism behind VKAE is not disclosed — not to hide it, but because it is planned for release as a formal paper.
So the order was flipped: rather than arguing the method first, the results and the means to reproduce them are published first. You don't need to know the method yet — pull the container and measure tok/s yourself. If the numbers reproduce, that's enough for now; the "why" will be answered by the paper to come.
9. The big picture
AI competition is usually drawn as a duel of "model intelligence." But what makes that intelligence run in the real world is the economics of infrastructure. However smart a model is, it can't become a service if the cost of running it is unbearable.
Inference acceleration fills exactly that gap — between "smart" and "runnable." Building a new GPU is something only a handful of chipmakers can do; making better use of the GPUs that exist is the domain of software, and there is still a lot of room here.
Winning by buying more GPUs with more capital is natural. But getting the same result out of the same GPU by using it better is a different kind of competitiveness — and the scarcer GPUs become, the more valuable it is.
Adding one more GPU without building one: that is why inference acceleration is quietly becoming essential to AI infrastructure.
Verify for yourself
| Resource | Link |
|---|---|
| 🏆 Leaderboard & demo (with Docker) | https://huggingface.co/spaces/VIDraft/vkae |
| 🐳 Docker Hub | https://hub.docker.com/r/vidraft/qwen35-vkae |
| 🤗 Hugging Face | https://huggingface.co/FINAL-Bench/Qwen3.5-35B-A3B-VKAE |
| 🌐 Official site | https://www.vidraft.net |
VIDRAFT builds Korean-focused large language models and AI infrastructure — an in-house model family (Darwin, JGOS), inference optimization, a metacognition benchmark (FINAL Bench), and a hallucination-reduction middleware (MARL). The detailed mechanism behind VKAE is planned for release as a paper.
