On May 13, 2026, llama.cpp renamed the --spec-type value that enables Multi-Token Prediction (MTP) speculative decoding. The old --spec-type mtp is gone; the new form is --spec-type draft-mtp. If you set up a local Qwen3.6 install in the last two weeks by following any of the guides that were online, your speculative decoding may now be silently disabled, and your throughput may have dropped from ~140 tok/s back to the unaccelerated ~70 tok/s on a 4090.
This is a small change with a large practical impact, so it is worth a short post.
What Changed
The PR that landed the rename (llama + spec: MTP Support) unified MTP under the same “draft” speculative decoding family as EAGLE and standard draft-model speculation. The mechanism is the same; the argument shape is now consistent with the rest of the speculative-decoding API.
The Unsloth Qwen3.6 documentation has been updated to reflect the new flag. Their recommended invocation now reads:
--spec-type draft-mtp --spec-draft-n-max 2
--spec-draft-n-max 2 is the value the docs recommend in practice. Two tokens of speculation per step appears to be the sweet spot for Qwen3.6: drafting more tokens per step wastes more work whenever a drafted token is rejected.
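A back-of-the-envelope model helps show why a short draft length can be the better trade. The Python sketch below assumes every drafted token is accepted independently with the same probability, which is a simplification and not how llama.cpp accounts for acceptance. The function name and the 0.7 acceptance rate are illustrative only, not measured values for Qwen3.6.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Simplified model (an assumption): each drafted token is accepted
    independently with probability accept_prob, acceptance stops at the
    first rejection, and the verifying model always contributes one token.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    # 1 + p + p^2 + ... + p^n_draft
    return sum(accept_prob ** i for i in range(n_draft + 1))


if __name__ == "__main__":
    p = 0.7  # hypothetical acceptance rate, for illustration only
    for n in (1, 2, 4, 8):
        tokens = expected_tokens_per_step(p, n)
        print(f"n_draft={n}: {tokens:.2f} expected tokens per step")
```

Under this model each extra drafted token adds less than the one before it, while every drafted token still has to be computed. That is consistent with a small value such as 2 being a sensible default, though the real optimum depends on the model and hardware.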
Why You Might Not Have Noticed
The old flag does not error. It is silently ignored as an unknown spec-type. The model loads, generation works, and you only notice if you happen to be watching tok/s. On a Qwen3.6-27B Q4 setup, the difference is:
| Configuration | Tokens/sec on RTX 4090 |
|---|---|
| MTP enabled (correct flag) | ~140 |
| MTP flag ignored | ~70 |
For the 35B-A3B MoE variant on the same hardware the gap is wider — Unsloth reports 220 tok/s with the new flag enabled. If your setup is running at half that, the flag is the first thing to check.
What To Do
If you have a llama.cpp build from before May 13 and your existing scripts use --spec-type mtp, update both the build and the scripts:
- Pull the latest llama.cpp and rebuild.
- Replace --spec-type mtp with --spec-type draft-mtp in any wrapper scripts.
- Add --spec-draft-n-max 2 if you do not already have it set.
- Run a generation against a known prompt and confirm tok/s is back in the expected range.
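The flag replacement step can be scripted. The following is a minimal Python sketch, assuming the flag and its value sit on the same line separated by spaces or tabs. The helper names are hypothetical, and the example command line (including model.gguf) is a placeholder.

```python
import re

# Matches the old value only as a whole word directly after --spec-type,
# so an already-migrated "--spec-type draft-mtp" is left untouched.
_OLD_FLAG = re.compile(r"(--spec-type[ \t]+)mtp(?![\w-])")


def uses_old_flag(script_text: str) -> bool:
    """Return True if the text still contains '--spec-type mtp'."""
    return _OLD_FLAG.search(script_text) is not None


def migrate_spec_type(script_text: str) -> str:
    """Rewrite '--spec-type mtp' to '--spec-type draft-mtp'."""
    return _OLD_FLAG.sub(r"\1draft-mtp", script_text)


if __name__ == "__main__":
    # Placeholder command line, for illustration only.
    old = "./llama-server -m model.gguf --spec-type mtp"
    print(uses_old_flag(old))
    print(migrate_spec_type(old))
```

The function is idempotent, so running it twice over the same script is harmless. It does not handle a value moved to the next line with a backslash continuation; scripts written that way need a manual check.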
If you are on Unsloth Studio or another wrapper, the wrapper has likely already pulled the change. The risk is mostly for people running llama.cpp directly with hand-rolled scripts.
The Larger Pattern
This is the second flag-shape change to llama.cpp’s speculative-decoding API in six months. The pattern is clear: speculative decoding is no longer an experimental side feature; it is the default path for fast local inference, and the API is being tidied as it matures. Expect more small breaking changes through the rest of 2026.
The takeaway for anyone running local AI in production is the same as for anyone running self-hosted anything: pin your runner version, watch the changelog, and have a smoke test that catches throughput regressions. The model file is stable; the runner around it is moving fast.
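A throughput smoke test can be very small. The Python sketch below times a generation call and compares the result to a baseline. Here generate is a hypothetical callable that wraps however you invoke your runner and returns the number of tokens produced, and the 0.75 threshold is an arbitrary starting point, not a recommended value.

```python
import time
from typing import Callable


def measure_tok_per_s(
    generate: Callable[[], int],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one call to generate() and return tokens per second.

    generate is a hypothetical wrapper around your runner that returns
    the number of tokens it produced.
    """
    start = clock()
    n_tokens = generate()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed


def throughput_ok(measured: float, expected: float, min_fraction: float = 0.75) -> bool:
    """Pass if measured throughput is at least min_fraction of the baseline."""
    return measured >= expected * min_fraction


if __name__ == "__main__":
    # Stand-in for a real generation call; replace with your own wrapper.
    def fake_generate() -> int:
        time.sleep(0.01)
        return 2

    rate = measure_tok_per_s(fake_generate)
    # 140 tok/s is the article's 4090 figure, used here as the baseline.
    print(f"{rate:.1f} tok/s, ok={throughput_ok(rate, expected=140.0)}")
```

With the figures from this post, a baseline of 140 tok/s and a measurement near 70 tok/s would fail the check, which is the regression an ignored flag produces. Wall-clock timing like this includes prompt processing, so use a fixed prompt and compare like with like.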
For everyone else — if you set up a Qwen3.6 install in late April or early May and have been quietly running at half-speed without realizing it, today is a good day to update.
Part of the Local AI series
- A 27B Model on a Single GPU Is 10 Points Off Claude Opus 4.7 — the benchmark deep dive
- Running Qwen3.6-27B Locally — hardware, quantization, runners
- The Local AI Inflection Point: May 2026 — the wider story
Sources: