Google’s Gemma 4 AI nearly doubles speed by predicting future tokens

▼ Summary
– Google released Multi-Token Prediction (MTP) drafters for Gemma 4, using speculative decoding to speed up token generation.
– Gemma 4 open models are built on Gemini technology but optimized to run locally on a single AI accelerator or consumer GPU.
– The Apache 2.0 license for Gemma 4 is more permissive than previous Gemma licenses, allowing broader use on personal hardware.
– MTP drafters (74 million parameters) run during the main model’s idle compute cycles to generate speculative tokens, nearly halving wait time on an NVIDIA RTX PRO 6000.
– Drafters share the main model’s key-value (KV) cache and use sparse decoding to predict likely tokens without recalculating context.
Google’s spring launch of the Gemma 4 open models already set a new benchmark for local AI performance. Now, the company is pushing the envelope even further with experimental Multi-Token Prediction (MTP) drafters. These tools employ a form of speculative decoding: a lightweight draft model guesses future tokens, which the main model then verifies, potentially slashing generation times compared to standard autoregressive decoding.
The Gemma 4 lineup shares the foundational architecture of Google’s frontier Gemini AI, but is optimized for local deployment. Gemini thrives on Google’s custom TPU chips, running in massive clusters with ultrafast interconnects and memory. In contrast, a single high-power AI accelerator can handle the largest Gemma 4 model at full precision, and quantization makes it feasible on a consumer GPU.
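For readers who want to try a large open model on a consumer GPU, a hypothetical 4-bit loading sketch with Hugging Face transformers and bitsandbytes might look like the following. The model id is a placeholder, not a confirmed repository name, and actual memory needs depend on the quantization scheme.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weights with bf16 compute: a common consumer-GPU configuration.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "google/gemma-4-26b"  # placeholder id; check the actual repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",  # place layers on GPU/CPU as memory allows
)
```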
This local focus gives users control over their data, sidestepping the need to share everything with a cloud system. Google also switched the Gemma 4 license to Apache 2.0, a far more permissive option than the custom license used for earlier versions. But local hardware has inherent limits: most consumer systems lack the blazing-fast memory found in enterprise gear. That’s where MTP steps in.
Large language models like Gemma generate tokens autoregressively, producing one token at a time, each conditioned on everything generated so far. Every token demands the same computational effort, whether it’s a simple filler word or a critical step in a complex logical chain. The bottleneck? Memory bandwidth. The VRAM in typical consumer hardware is much slower than the high-bandwidth memory (HBM) used in enterprise setups. As a result, the processor spends most of each step streaming parameters from VRAM to the compute units, leaving compute cycles idle.
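To make that loop concrete, here is a minimal Python sketch of greedy autoregressive decoding. The toy model is a stand-in, not Gemma’s actual API; the point is the structure, where every generated token costs one full forward pass and one full sweep of the weights through memory.

```python
import numpy as np

VOCAB = 256  # toy vocabulary size

def toy_model(tokens):
    """Stand-in for a causal LLM: returns fake next-token logits."""
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(VOCAB)

def generate(model, prompt_ids, max_new_tokens=8):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        # One full forward pass per token: all model parameters are
        # read from memory just to produce logits for one next token.
        logits = model(tokens)
        tokens.append(int(np.argmax(logits)))  # greedy pick
    return tokens

print(generate(toy_model, [1, 2, 3]))
```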
MTP exploits that idle time. Instead of waiting for the heavy model to process each token, a lightweight drafter generates speculative tokens. These draft models are tiny: just 74 million parameters in the Gemma 4 E2B, but they’re optimized for speed. For instance, the drafter shares the key-value (KV) cache, the LLM’s active memory, so it doesn’t need to recalculate context the main model has already resolved. The E2B and E4B drafters also use a sparse decoding technique to narrow down clusters of likely tokens.
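A stripped-down sketch of the draft-and-verify loop shows why output quality is preserved: every accepted token is one the main model would have produced anyway. Both models below are toy stand-ins, and this greedy variant omits the shared KV cache, sparse decoding, and the single batched verification pass that real implementations depend on.

```python
import numpy as np

VOCAB = 256

def fake_logits(tokens, salt):
    rng = np.random.default_rng(sum(tokens) + salt)
    return rng.standard_normal(VOCAB)

def drafter(tokens):
    """Tiny, cheap model: proposes the next tokens quickly."""
    return fake_logits(tokens, salt=1)

def main_model(tokens):
    """Large, slow model: the source of truth for accepted tokens."""
    return fake_logits(tokens, salt=0)

def speculative_step(tokens, k=4):
    # 1. Draft: the small model speculates k tokens ahead.
    draft = list(tokens)
    for _ in range(k):
        draft.append(int(np.argmax(drafter(draft))))
    proposed = draft[len(tokens):]

    # 2. Verify: the main model checks each proposal. A real system
    #    scores all k positions in one batched forward pass, reusing
    #    the shared KV cache; that batching is where the speedup lives.
    accepted, context = [], list(tokens)
    for tok in proposed:
        verified = int(np.argmax(main_model(context)))
        accepted.append(verified)  # the main model’s pick is always valid
        if verified != tok:
            break                  # drafter diverged; discard the rest
        context.append(tok)
    return tokens + accepted

print(speculative_step([1, 2, 3]))
```

In the best case, one expensive forward pass validates several cheap drafted tokens at once, which is how otherwise idle compute turns into extra throughput.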
Early benchmarks show impressive gains. Running Gemma 4 26B on an NVIDIA RTX PRO 6000, standard inference sets the baseline, while the MTP drafter cuts the wait time nearly in half, with the same output quality. That’s a meaningful leap for anyone running AI locally, where every millisecond counts.
(Source: Ars Technica)