
Google’s Gemma 4 Gets 3x Faster With Multi-Token Prediction

  • Writer: Covertly AI
  • 13 hours ago
  • 3 min read

Google’s Gemma 4 models are already being positioned as some of the most capable open AI models Google has released, and now they are receiving a major speed upgrade. Just weeks after launching Gemma 4, Google introduced Multi-Token Prediction (MTP) drafters designed to make the models run much faster without lowering the quality of their responses. According to Google, these new drafters can deliver up to a 3x speed boost, making Gemma 4 more practical for developers, personal computers, mobile devices, and local AI applications.


The main challenge MTP tries to solve is the slow, step-by-step way large language models usually generate text. Models like Gemma typically create responses one token at a time, meaning each word or piece of a word depends on what came before it. This process works well, but it can be inefficient because the model uses the same amount of effort for simple predictions as it does for more complex reasoning tasks. On consumer hardware, this becomes even more noticeable because regular system memory and graphics memory are often much slower than the high-bandwidth memory used in large enterprise AI systems.
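To make the bottleneck concrete, the one-token-at-a-time loop can be sketched in a few lines of Python. Here `target_model` is a hypothetical toy stand-in for the full Gemma model (a real model would run a full forward pass at each step):

```python
def target_model(tokens):
    """Toy stand-in for the full model: deterministically maps a
    context to a next-token id (a real model runs a forward pass here)."""
    return (sum(tokens) + len(tokens)) % 50

def generate(prompt, n_tokens):
    """Standard autoregressive decoding: one model call per token,
    so latency grows linearly with output length."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(target_model(tokens))  # each step waits on the previous one
    return tokens

out = generate([1, 2, 3], 5)  # 5 sequential model calls for 5 tokens
```

Every new token requires a complete pass through the model, whether the prediction is trivial or hard, which is exactly the inefficiency MTP targets.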


MTP improves this process through a technique called speculative decoding. Instead of forcing the larger Gemma model to generate every token by itself, a smaller and faster draft model predicts several possible future tokens ahead of time. The main Gemma model then checks those suggested tokens in parallel. If the predictions are correct, the model accepts the full sequence in one step and also generates an additional token of its own. This means the system can produce more text in the same amount of time it would normally take to generate just one token.
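The accept-or-correct logic described above can be illustrated with a toy simulation. This is a hedged sketch of the general speculative decoding scheme, not Google's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main and draft models, and the "parallel" verification is simulated with a loop.

```python
def target_next(tokens):
    """Toy target model: the 'correct' next token for a context."""
    return (sum(tokens) * 7 + 3) % 100

def draft_next(tokens):
    """Toy draft model: cheap and usually, but not always, right."""
    guess = (sum(tokens) * 7 + 3) % 100
    return guess if sum(tokens) % 5 else (guess + 1) % 100  # occasionally wrong

def speculative_step(tokens, k=4):
    """One round of speculative decoding: draft k tokens, then verify.
    Returns the tokens emitted this round."""
    # 1. The draft model proposes k tokens ahead of time (cheap).
    drafted, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2. The target model checks every drafted position
    #    (done in one parallel pass on real hardware).
    emitted, ctx = [], list(tokens)
    for t in drafted:
        correct = target_next(ctx)
        if t == correct:
            emitted.append(t)        # draft accepted
            ctx.append(t)
        else:
            emitted.append(correct)  # mismatch: the target's own token,
            return emitted           # so the step is still not wasted
    emitted.append(target_next(ctx)) # all accepted: one bonus target token
    return emitted
```

Note that the emitted tokens are always exactly what the target model would have produced on its own; the draft model only changes how many of them arrive per target-model pass.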



Google says this approach does not reduce output quality because the larger Gemma model still performs the final verification. The draft model is only helping predict possible continuations faster, while the main model decides what is actually accepted. If a drafted token is rejected, the target model still produces the correct token for that position, so the step is not wasted. Under the hood, the draft model shares the target model’s input embeddings and uses activations from the target model’s last layer. It also shares the key-value cache, which acts like the model’s active memory, so it does not need to recalculate context from scratch. For the smaller E2B and E4B edge models, Google also added an efficient clustering method that narrows token predictions to likely groups instead of checking the entire vocabulary each time.


The speed improvements vary depending on the hardware and model size. Google reported that smaller E2B and E4B Gemma models running on Pixel phones can see around 2.8x to 3.1x speedups, while the larger Gemma 4 31B model on Apple’s M4 silicon can reach about a 2.5x improvement. The upgrade could make local AI more useful for real-time chat, coding assistants, voice applications, autonomous agents, and mobile apps that need faster responses. It may also help improve battery life on devices running smaller Gemma models directly on-device. For mixture-of-experts models like Gemma 4 26B A4B, performance can depend more heavily on batch size and hardware because each token may activate different expert weights.
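As a rough back-of-the-envelope model (an illustrative assumption drawn from the general speculative decoding literature, not Google's published methodology), the speedup depends on two things: how often drafted tokens are accepted, and how cheap a draft pass is relative to a full target pass.

```python
def expected_speedup(k, accept_rate, draft_cost):
    """Rough model of speculative-decoding speedup.
    k: tokens drafted per round; accept_rate: per-token acceptance
    probability (< 1); draft_cost: draft pass cost relative to one
    full target pass."""
    p = accept_rate
    # Expected tokens emitted per round: accepted prefix plus one
    # target token = sum of p**i for i in 0..k.
    expected_tokens = (1 - p ** (k + 1)) / (1 - p)
    # Cost per round: k cheap draft passes plus one full target pass.
    cost = 1 + k * draft_cost
    return expected_tokens / cost

# With 4 drafted tokens, 80% acceptance, and a draft model costing 5%
# of the target, this model lands near the ~2.8x figure reported above.
print(round(expected_speedup(4, 0.8, 0.05), 2))
```

The parameter values here are hypothetical; the point is only that high acceptance rates and a very small draft model are both needed for speedups in the reported 2.5x to 3.1x range.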


Google has made the MTP drafters available under the same Apache 2.0 license as Gemma 4, making them easier for developers to use and experiment with. The model weights are available through platforms such as Hugging Face and Kaggle, and support is available through tools including MLX, Hugging Face Transformers, vLLM, SGLang, and Ollama. Developers can also try them through Google AI Edge Gallery for Android and iOS. For developers building local or low-latency AI tools, Gemma 4’s MTP upgrade represents an important step toward making open AI models faster, more responsive, and more practical outside of massive cloud data centers.


Works Cited


Whitwam, Ryan. “Google’s Gemma 4 AI Models Get 3x Speed Boost by Predicting Future Tokens.” Ars Technica, 6 May 2026, https://arstechnica.com/ai/2026/05/googles-gemma-4-open-ai-models-use-speculative-decoding-to-get-up-to-3x-faster/.


Lacombe, Olivier, and Maarten Grootendorst. “Accelerating Gemma 4: Faster Inference with Multi-Token Prediction Drafters.” Google Blog, 5 May 2026, https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/.


Google. “Speed-up Gemma 4 with Multi-Token Prediction.” Google AI for Developers, 5 May 2026, https://ai.google.dev/gemma/docs/mtp/overview.


Binder, Matt. “Google Launches Gemma 4, a New Open-Source Model: How to Try It.” Mashable, 3 Apr. 2026, https://mashable.com/article/google-releases-gemma-4-open-ai-model-now-open-source-how-to-try-it.


Monge, Jim Clyde. “Google Releases Gemma: A Lightweight and Open Source Model.” Generative AI, 21 Feb. 2024, https://generativeai.pub/google-releases-gemma-a-lightweight-and-open-source-model-b6411d67ecca.

