
Google’s DiffusionGemma open AI model gets 4x speed boost

Summary

– DiffusionGemma generates entire blocks of text in parallel rather than one token at a time, making it faster and more efficient on local hardware.
– It works like an image generation model: it starts from a field of placeholder tokens, the text equivalent of static, and iteratively denoises them into a final text canvas.
– The model uses a Mixture of Experts (MoE) architecture with 26 billion total parameters, of which only 3.8 billion are activated during inference, and it fits in 18GB of GPU memory.
– On an RTX 5090, DiffusionGemma outputs around 700 tokens per second; on a single Nvidia H100, it produces over 1,000 tokens per second, about four times the output of similarly sized autoregressive Gemma models.
– This parallel generation approach boosts performance in non-linear tasks such as in-line editing, molecular sequencing, and mathematical graphing, and it helps with puzzles like Sudoku.

Google DeepMind has unveiled a new addition to its Gemma 4 open model family, and it breaks the mold in a significant way. DiffusionGemma does not generate text one token at a time like most AI models. Instead, it produces entire blocks of text in parallel, a shift that Google claims makes it faster and more efficient when running on local hardware, from an Nvidia DGX system to a standard gaming GPU.

The majority of AI models are autoregressive, meaning they generate text sequentially from left to right. DiffusionGemma, however, operates more like an image generation model. It starts with a field of placeholder tokens and iteratively “denoises” them across multiple passes, using the tokens it is most confident about to improve its estimates of the others. At the end of this process, the model finalizes its output in one large, coherent block, essentially a denoised text canvas.
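To make the denoising idea concrete, here is a toy sketch in Python. It is not DiffusionGemma's algorithm: the function name `toy_denoise`, the `_` placeholder, and the random "confidence" scores are all illustrative assumptions, and the answer is known in advance instead of being predicted by a neural network. It also only fills in placeholders and never revises a committed token, which a real diffusion model can do. It shows just the shape of the process: every open position is scored in the same pass, and the canvas fills in over a few rounds instead of strictly left to right.

```python
import random

MASK = "_"  # placeholder token for a position that is still "noise"

def toy_denoise(target, steps=4, seed=0):
    """Fill a fully masked canvas over several parallel passes.

    Hypothetical stand-in: a real diffusion model would predict tokens with
    a neural network. Here the "prediction" is just the known target and the
    confidence scores are random numbers.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]  # snapshot of the canvas after each pass
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # One pass scores every masked position at once, not left to right.
        confidence = {i: rng.random() for i in masked}
        # Commit an even share of the remaining positions, most confident
        # first, so the last pass fills whatever is left.
        share = -(-len(masked) // (steps - step))  # ceiling division
        for i in sorted(masked, key=confidence.get, reverse=True)[:share]:
            canvas[i] = target[i]
        history.append(list(canvas))
    return history

if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    for snapshot in toy_denoise(words, steps=3):
        print(" ".join(snapshot))
```

Each printed line is one pass over the canvas. The order in which words appear depends on the random scores, not on their position in the sentence.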

In terms of scale, DiffusionGemma is substantial for an open model. It is a Mixture of Experts (MoE) architecture with 26 billion total parameters, but only 3.8 billion are activated during inference. It fits in 18GB of GPU memory, comfortably within reach of a high-end consumer GPU. In benchmark testing with an RTX 5090, DiffusionGemma achieved around 700 tokens per second. On a single Nvidia H100 AI accelerator, that rate jumps to over 1,000 tokens per second, roughly four times the output of similarly sized autoregressive Gemma models.
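The figures above invite some back-of-the-envelope arithmetic, sketched here in Python. The inputs come from the article; the derived numbers are rough estimates only. In particular, the bytes-per-parameter figure assumes the full 18GB is spent on weights and that a gigabyte is 10^9 bytes, neither of which the source confirms, and the baseline speed simply divides the rounded H100 figure by the rounded speedup.

```python
# Figures quoted in the article
TOTAL_PARAMS = 26e9        # total parameters
ACTIVE_PARAMS = 3.8e9      # parameters activated during inference
MEMORY_GB = 18             # reported GPU memory footprint
H100_TOKENS_PER_S = 1000   # "over 1,000", so treat this as a lower bound
SPEEDUP = 4                # "roughly four times"

# Share of the model that does work for any given token (about 15%)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumption: all 18GB holds weights and 1GB = 1e9 bytes.
# Under that assumption each parameter takes about 0.69 bytes (~5.5 bits).
bytes_per_param = MEMORY_GB * 1e9 / TOTAL_PARAMS

# Implied speed of a comparable autoregressive model on the same H100
implied_baseline_tokens_per_s = H100_TOKENS_PER_S / SPEEDUP

print(f"active fraction:  {active_fraction:.1%}")
print(f"bytes per param:  {bytes_per_param:.2f}")
print(f"implied baseline: {implied_baseline_tokens_per_s:.0f} tokens/s")
```

If the weights-only assumption is close to the truth, well under one byte per parameter would point to quantized weights, though the article does not say which format is used.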

This parallel approach shifts the computational bottleneck from memory bandwidth to raw compute, enabling the generation of up to 256 tokens simultaneously. Google highlights measurable gains in non-linear tasks such as in-line editing, molecular sequencing, and mathematical graphing. The model’s ability to continuously self-correct large sets of tokens makes it particularly effective for challenges like solving Sudoku puzzles, a task that stumps standard autoregressive models because the right value for each cell depends on cells that come later in the sequence.
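One way to see why the bottleneck moves is to count passes through the model, as in the sketch below. An autoregressive model needs one pass per token, and each pass has to read the model weights from memory. A block-based model needs a handful of denoising passes per block. The 256-token block size is taken from the article, read here as the size of one parallel block; the number of denoising passes per block is not given in the source, so `steps_per_block` is a purely hypothetical parameter, as are the function names.

```python
import math

BLOCK_SIZE = 256  # "up to 256 tokens simultaneously", per the article

def autoregressive_passes(n_tokens):
    """One forward pass per generated token."""
    return n_tokens

def block_diffusion_passes(n_tokens, steps_per_block, block_size=BLOCK_SIZE):
    """Denoising passes needed when tokens are produced a block at a time.

    steps_per_block is a hypothetical value; the source does not state
    how many denoising passes DiffusionGemma runs per block.
    """
    return math.ceil(n_tokens / block_size) * steps_per_block

if __name__ == "__main__":
    n = 1024
    for steps in (8, 16, 32):  # illustrative values only
        print(steps, autoregressive_passes(n), block_diffusion_passes(n, steps))
```

Fewer passes does not translate directly into an equal speedup, because each denoising pass processes a whole block and therefore does far more arithmetic. That trade, fewer trips through memory in exchange for more computation per trip, is what shifting the bottleneck from bandwidth to compute means in practice.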

(Source: Ars Technica)
