Google’s TurboQuant AI Memory Compression Shakes Chip Stocks

Summary
– Google introduced TurboQuant, a new algorithm that compresses the key-value cache in AI models to 3 bits per value, reducing memory use by at least sixfold without accuracy loss.
– Its two-stage design (PolarQuant followed by QJL) eliminates compression overhead by avoiding the extra normalization constants that other methods must store.
– Its announcement caused immediate drops in memory company stock prices as investors reassessed future industry demand for physical memory hardware.
– Testing showed TurboQuant matches or outperforms existing methods on benchmarks and can speed up attention computation on GPUs by up to eight times at 4-bit precision.
– The technology could lower inference costs and improve vector search for services like Google Search, but its long-term impact on total hardware demand remains uncertain.

A new research breakthrough from Google sent immediate shockwaves through the semiconductor market this week. The announcement of the TurboQuant AI memory compression algorithm triggered a sharp sell-off in memory stocks, with shares of Micron, Western Digital, and SanDisk falling between 3% and nearly 6%. Investors reacted to the prospect that a fundamental component of AI infrastructure, physical memory, might soon be needed in far smaller quantities.
The innovation tackles a critical and costly bottleneck in deploying large language models: the key-value cache. This high-speed data store holds conversational context, preventing the model from recalculating information for every new word it generates. As AI processes longer conversations and documents, this cache expands dramatically, consuming precious GPU memory. Google’s solution compresses this cache to a mere 3 bits per value, down from the standard 16. This represents a reduction in memory footprint by at least sixfold, all while maintaining model accuracy according to the company’s benchmarks.
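To make the arithmetic concrete, the sketch below sizes a key-value cache at 16-bit and 3-bit precision. The model shape and context length are assumptions chosen purely for illustration, not figures from the paper or from any particular Google model.

```python
# Rough KV-cache sizing. Every figure below (layers, heads, head_dim,
# context length) is an illustrative assumption, not from the paper.
layers = 32          # transformer layers (assumed)
kv_heads = 8         # key/value heads per layer (assumed)
head_dim = 128       # dimension of each key/value head (assumed)
seq_len = 128_000    # tokens of context held in the cache
batch = 1

# one key vector and one value vector per layer, per token
values_per_token = 2 * layers * kv_heads * head_dim

def cache_gib(bits_per_value: float) -> float:
    """Total cache size in GiB for a given storage width."""
    total_bits = batch * seq_len * values_per_token * bits_per_value
    return total_bits / 8 / 2**30

print(f"16-bit cache: {cache_gib(16):5.1f} GiB")
print(f" 3-bit cache: {cache_gib(3):5.1f} GiB")
```

Because the total scales linearly with context length, a 128,000-token conversation reaches roughly 16 GiB per request at 16-bit precision under these assumptions, versus under 3 GiB at 3 bits.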
Authored by Google research scientist Amir Zandieh and Google Fellow Vahab Mirrokni, with collaborators from Google DeepMind, KAIST, and New York University, the paper will be presented at the ICLR 2026 conference. It builds upon the team’s prior work, including the QJL and PolarQuant techniques.
The core achievement of TurboQuant lies in solving a persistent inefficiency of traditional compression. Standard quantization methods shrink data vectors but must store extra normalization constants to decompress the information later. These constants add bits back into the system, undermining the headline compression ratio. TurboQuant employs a novel two-stage process to eliminate this overhead entirely.
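A back-of-the-envelope calculation shows how quickly those constants erode a headline ratio. The block size and constant widths below are assumptions chosen for illustration, not the actual settings of TurboQuant or any competing method.

```python
# Effective bits per value once per-block constants are counted.
# Block size and constant widths are illustrative assumptions only.
def effective_bits(code_bits: int, block: int, constant_bits: int, n_constants: int) -> float:
    """Bits stored per value: the code itself plus a share of the block constants."""
    return code_bits + n_constants * constant_bits / block

# 3-bit codes with a 16-bit scale and 16-bit zero-point per block of 64 values
print(effective_bits(3, block=64, constant_bits=16, n_constants=2))  # 3.5
# the same 3-bit codes with no stored constants keep the full headline ratio
print(effective_bits(3, block=64, constant_bits=16, n_constants=0))  # 3.0
```

Half a bit per value sounds small, but on a multi-gigabyte cache it is the gap between the advertised compression ratio and the one actually realized.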
First, the PolarQuant stage converts data vectors from Cartesian to polar coordinates, separating them into magnitude and angular components. The predictable patterns in the angular data allow the system to bypass the costly per-block normalization step. Next, the QJL technique applies a Johnson-Lindenstrauss transform, reducing any minor residual error to a single sign bit per dimension. The combined result dedicates almost the entire compression budget to representing the original data’s meaning, with minimal resources spent on error correction and no waste on normalization constants.
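The numpy sketch below is a loose, simplified rendering of that two-stage idea, written for this article rather than taken from the paper: the pairwise polar conversion, the fixed-grid angle quantizer, and the sign-bit projection stand in for PolarQuant and QJL, and the magnitudes are left uncompressed for brevity.

```python
import numpy as np

def pairs_to_polar(x: np.ndarray):
    """Split a vector into 2-D pairs and convert each to (radius, angle).
    Every angle lies in the fixed range [0, 2*pi), so one shared uniform grid
    quantizes them all with no per-vector or per-block constants."""
    pairs = x.reshape(-1, 2)
    radii = np.linalg.norm(pairs, axis=1)
    angles = np.mod(np.arctan2(pairs[:, 1], pairs[:, 0]), 2 * np.pi)
    return radii, angles

def polar_to_pairs(radii: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Inverse of pairs_to_polar."""
    return np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1).ravel()

def quantize_angles(angles: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round each angle to a uniform grid over [0, 2*pi); the grid is shared by
    every vector, so nothing extra is stored alongside the codes."""
    levels = 2 ** bits
    codes = np.round(angles / (2 * np.pi) * levels) % levels
    return codes * (2 * np.pi) / levels

def sign_bits(residual: np.ndarray, rng: np.random.Generator, m: int = 64) -> np.ndarray:
    """One sign bit per output dimension of a random Gaussian projection,
    standing in for the residual-correction role the article ascribes to QJL."""
    return np.signbit(rng.standard_normal((m, residual.size)) @ residual)

rng = np.random.default_rng(0)
key = rng.standard_normal(128)                             # mock attention-key vector
radii, angles = pairs_to_polar(key)
key_hat = polar_to_pairs(radii, quantize_angles(angles))   # radii kept at full precision here
residual_code = sign_bits(key - key_hat, rng)              # 1 bit per projected dimension
print("relative error from 4-bit angles:",
      np.linalg.norm(key - key_hat) / np.linalg.norm(key))
```

The point of the sketch is the absence of stored scale factors: the quantization grid is fixed in advance by the geometry of the angles, so the entire bit budget goes to the codes themselves.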
In testing across five standard benchmarks for long-context AI models, including LongBench and Needle in a Haystack, TurboQuant at 3-bit precision matched or exceeded the performance of KIVI, the current leading method for cache quantization. On retrieval tasks that require finding specific information in lengthy texts, it achieved perfect scores while delivering that sixfold compression. At a slightly higher 4-bit precision, the algorithm produced an eightfold speedup in computing attention on Nvidia H100 GPUs compared to an uncompressed baseline.
The swift market reaction, while pronounced, was viewed by some analysts as an overcorrection. Wells Fargo’s Andrew Rocha acknowledged that TurboQuant directly pressures the cost structure for AI memory, forcing a reevaluation of future capacity needs. However, he and others noted that robust demand for AI memory is unlikely to vanish, and compression techniques have coexisted with growing hardware procurement for years.
The concern has a logical basis, given the staggering scale of current AI infrastructure investment. Companies like Meta, Google, Microsoft, and Amazon are collectively planning capital expenditures in the hundreds of billions for data centers through 2026. A technology that slashes memory requirements sixfold does not cut total spending by the same factor, as memory is just one data center cost. Yet it alters the fundamental cost ratio, and at this investment scale, even marginal efficiency gains yield massive compounded savings.
This development arrives as the industry grapples with the economics of AI inference. The one-time cost of training a model, while huge, is dwarfed by the recurring expense of serving millions of daily queries with low latency and high accuracy. The key-value cache sits at the heart of this challenge, governing how many users a single GPU can handle and how long a model’s context window can be. Techniques like TurboQuant are part of a concerted push to make inference cheaper, alongside next-generation hardware from Nvidia and Google.
The pivotal question is whether such efficiency gains will reduce total hardware purchases or simply enable more powerful AI applications at a similar cost. The history of technology strongly suggests the latter. When storage gets cheaper, we store more data; when bandwidth increases, applications emerge to consume it.
For Google, the implications extend beyond academic research. The algorithm also enhances vector search, the technology behind semantic similarity matching used in everything from core search results to YouTube recommendations and ad targeting. In tests on the GloVe benchmark, TurboQuant delivered superior recall without the large codebooks or dataset-specific tuning required by other methods, pointing to direct applications across Google’s revenue-generating services.
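The sketch below shows, in outline, how recall for compressed vector search is usually measured: compare the exact nearest neighbors against those retrieved from a compressed representation. Synthetic random vectors stand in for real embeddings such as GloVe, and a generic one-bit sign sketch stands in for TurboQuant; every parameter is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_db, n_queries, k, m = 64, 2000, 50, 10, 256   # all illustrative choices

db = rng.standard_normal((n_db, d))                # stand-in for an embedding corpus
queries = rng.standard_normal((n_queries, d))

# Ground truth: exact top-k neighbors by cosine similarity.
db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
q_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
exact = np.argsort(-(q_n @ db_n.T), axis=1)[:, :k]

# Compressed index: one bit per projected dimension (sign of a shared projection).
proj = rng.standard_normal((d, m))
db_bits = (db @ proj) > 0
q_bits = (queries @ proj) > 0

# Approximate top-k: rank by Hamming distance between bit sketches.
hamming = (q_bits[:, None, :] != db_bits[None, :, :]).sum(axis=2)
approx = np.argsort(hamming, axis=1)[:, :k]

recall = np.mean([len(set(e) & set(a)) / k for e, a in zip(exact, approx)])
print(f"recall@{k} with {m}-bit sketches: {recall:.2f}")
```

A stronger compressed representation pushes that recall figure toward 1.0 at the same bit budget, which is the metric the article cites for the GloVe comparison.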
The technical contribution is significant: a training-free compression method that sets a new state-of-the-art, backed by solid theory and proven on production hardware. Whether it fundamentally reshapes AI infrastructure spending or becomes another optimization absorbed by the industry’s relentless demand for compute is a question the market will resolve in the coming months, not in a single day of trading.
(Source: The Next Web)