Your vLLM Thinking Budget Was Doing Nothing With MTP On

vLLM 0.21.0 landed on May 15, 2026. Buried in the release notes between a C++20 compiler requirement and the transformers v5 migration is a one-line fix that matters for anyone running a reasoning model through vLLM with speculative decoding:

Speculative decoding now respects reasoning/thinking budgets, enabling correct spec decode for reasoning models (#34668)

Translation: every thinking_token_budget you set in production with MTP enabled before this release did nothing. The model kept thinking until it decided to stop on its own.

What Was Actually Broken

The bug (#39573, reported April 11, 2026) is reproducible with a single config. Send a request to vLLM 0.19.0 against Qwen3.5-35B-A3B-FP8 with thinking_token_budget=50 and MTP speculative decoding off — the reasoning trace terminates at the budget, as expected. Turn MTP on with the same request. The model produces a multi-paragraph reasoning trace running into the hundreds of tokens. No error. No warning. The budget field is accepted and silently ignored.

Internally, the thinking-budget enforcement runs on the scheduler side and looks at the accepted token sequence. With MTP, tokens are accepted in batches of 2–3 per forward pass. The budget check was running per-step but operating on stale state, so the stop condition consistently fired late or not at all.

Why You Might Not Have Noticed

Three reasons this stayed under the radar for over a month:

It does not error. The request completes. Output is well-formed.
It does not always blow the budget by a huge factor. On simple prompts the model would have stopped soon anyway, so a budget of 50 might produce 80 tokens of thinking — annoying but not alarming.
The cost shows up in your bill, not your logs. Hosted vLLM behind an OpenAI-compatible endpoint just bills the extra thinking tokens at whatever your output rate is. No SLO alarm goes off when a reasoning trace runs 4× the budget on a hard prompt.

On a self-hosted setup serving Qwen3.6, DeepSeek-V4-Flash, or any of the recent reasoning models with MTP on, the impact compounds with traffic. If your average request was supposed to think for 200 tokens and was actually thinking for 600, you have been paying for three times the output you asked for on the reasoning portion of every call.

What To Do

Upgrade to vLLM 0.21.0.
Re-read the C++20 build requirement before you upgrade — your existing build toolchain may not work.
Migrate off transformers v4 in the same pass. Deprecated in this release, gone in the next.
Add a guardrail test. Send a known-hard reasoning prompt with thinking_token_budget=50 and assert the actual thinking-token count in the response is ≤ 50. If it is not, you did not pick up the fix.

That last step is the one most people will skip. The whole reason this bug survived a month in the wild is that nobody had a regression test for "the budget I set is the budget I get."

The Other Two Things in 0.21.0

Worth noting, since you are upgrading anyway:

C++20 compiler now required. GCC 10+ or Clang 13+. This is to track PyTorch 2.11's own bump. If you build vLLM in CI off a pinned older base image, it will fail.
NVFP4 KV cache. On Blackwell hardware (B100/B200), vLLM 0.21.0 ships an NVFP4 KV cache backend with Triton dequant kernels. For DeepSeek-V4-Pro at 1M-token context this is the path to actually fitting the cache. Not relevant if you are on Hopper or consumer cards.

The Larger Pattern

This is the third speculative-decoding correctness fix I have seen across runners in six months — the llama.cpp MTP flag rename, the EAGLE-2 KV state issue in TGI, and now this. Spec decode is fast enough and standard enough that every runner ships it on by default, but the interaction with other features (stop sequences, budgets, sampling tweaks) is still a minefield.

If you run reasoning models with speculative decoding, the operational lesson is the same one we keep relearning with self-hosted infrastructure: pin your runner, version-control your invocation, and have a smoke test that checks the things you actually depend on. The model is stable. The stack around it is not.

Part of the Local AI series

Speculative Decoding: Why Your Local Model Got 2× Faster in 2026 — the mechanism
Your Local Qwen3.6 Throughput Probably Just Halved — the llama.cpp flag rename
The Local AI Inflection Point: May 2026 — the wider story

Sources: