Google’s TurboQuant compresses AI models with zero accuracy loss

Compressing a large language model by 6x without losing any accuracy sounds like a trade-off that doesn’t exist. On March 24, 2026, Google Research published evidence that it does.

The paper introduces TurboQuant, a vector quantization algorithm that targets one of the most expensive problems in running LLMs at scale: the key-value (KV) cache. This is the memory structure that stores the attention keys and values of every token the model has already processed, so they don't have to be recomputed at each inference step. As context windows grow longer, the KV cache grows with them, and that drives up hardware costs fast.
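To get a feel for the scale, here is a back-of-envelope sketch of per-sequence KV cache size. The model configuration (32 layers, 8 KV heads, head dimension 128) is a hypothetical 7B-class setup for illustration, not a figure from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value):
    """Per-sequence KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical 7B-class config at a 128k-token context, 32-bit baseline.
fp32 = kv_cache_bytes(32, 8, 128, 128_000, 4) / 2**30
print(f"{fp32:.1f} GiB baseline, ~{fp32 / 6:.1f} GiB at the reported 6x reduction")
```

Even under these rough assumptions, a single long-context sequence eats tens of gigabytes of cache before quantization, which is why the 6x figure matters.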

[Diagram: TurboQuant's two-stage pipeline, from FP32 input through PolarQuant and QJL to a 3-bit compressed output]
TurboQuant compresses FP32 key-value states in two stages. PolarQuant handles bulk compression via polar coordinate mapping. QJL corrects residual error with a 1-bit sign representation. End result: 3-bit output, 6x less memory, up to 8x faster attention on H100 GPUs. (Data: Zandieh & Mirrokni, Google Research, ICLR 2026)

How the three algorithms fit together

TurboQuant is not a single trick. It is a two-stage system built on two sub-algorithms: PolarQuant and QJL.

PolarQuant converts vector data from Cartesian coordinates into polar coordinates. Think of it as replacing “go 3 blocks east, 4 blocks north” with “go 5 blocks at a 53-degree angle.” Because the angles of high-dimensional vectors are predictable and tightly concentrated, the model can skip expensive normalization steps entirely. That eliminates the memory overhead that traditional quantization methods carry.
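In two dimensions, the mapping the analogy describes looks like this. It is a generic Cartesian-to-polar conversion, not code from the paper:

```python
import math

def to_polar(x: float, y: float) -> tuple[float, float]:
    """Convert a 2-D Cartesian vector to (magnitude, angle in degrees)."""
    magnitude = math.hypot(x, y)            # sqrt(x^2 + y^2)
    angle = math.degrees(math.atan2(y, x))  # direction measured from the x-axis
    return magnitude, angle

# "3 blocks east, 4 blocks north" becomes one length and one direction:
r, theta = to_polar(3.0, 4.0)
print(f"{r:.1f} blocks at {theta:.0f} degrees")  # 5.0 blocks at 53 degrees
```

The quantizer's job then reduces to coding one magnitude and some angles, and it is the tight concentration of those angles that makes them cheap to encode.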

QJL (Quantized Johnson-Lindenstrauss) handles the residual error left after PolarQuant. It applies a Johnson-Lindenstrauss transform, which projects high-dimensional data into fewer dimensions while approximately preserving the distances between points, then reduces each projected value to a single sign bit (+1 or -1). The overhead: zero. The cost per number: 1 bit.
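A minimal sketch of the sign-bit idea, using a generic random Gaussian projection (the function name, dimensions, and test vectors are illustrative, not from the paper): vectors that are close before projection agree on most sign bits afterward, so the bits alone still carry similarity information.

```python
import random

def sign_sketch(vec, dim_out, seed=0):
    """Project vec through a random Gaussian matrix, then keep only
    the sign of each output coordinate: 1 bit per dimension."""
    rng = random.Random(seed)  # shared seed = shared projection matrix
    bits = []
    for _ in range(dim_out):
        proj = sum(rng.gauss(0.0, 1.0) * v for v in vec)
        bits.append(1 if proj >= 0 else -1)
    return bits

a = [1.0, 2.0, 3.0, 4.0]
b = [1.1, 1.9, 3.2, 3.8]    # nearly identical to a
c = [-4.0, 3.0, -2.0, 1.0]  # roughly orthogonal to a
agree = lambda s, t: sum(x == y for x, y in zip(s, t))
sa, sb, sc = (sign_sketch(v, 64) for v in (a, b, c))
print(agree(sa, sb) > agree(sa, sc))  # the close pair agrees on more bits
```

Note that every party must use the same projection (here, the same seed) for the bits to be comparable.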

TurboQuant chains these together. PolarQuant does the heavy compression using most of the available bits. QJL cleans up what’s left with just 1 bit. The combined result is a model that quantizes the KV cache down to 3 bits with no training or fine-tuning required.
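As a toy scalar analogue of this two-stage split (my own illustration, not the paper's scheme): a coarse quantizer does the bulk of the work, and one extra sign bit on the residual pulls the decoded value back toward the original.

```python
def two_stage_quantize(x, step=0.25):
    """Toy two-stage scheme: coarse rounding (stage 1) plus a 1-bit
    sign correction of the residual (stage 2)."""
    coarse = round(x / step) * step        # stage 1: bulk compression
    residual = x - coarse
    sign_bit = 1 if residual >= 0 else -1  # stage 2: 1 bit for what's left
    # Decode: nudge the coarse value by step/4, the average residual magnitude.
    return coarse + sign_bit * step / 4

x = 0.61
coarse_only = round(x / 0.25) * 0.25
print(abs(two_stage_quantize(x) - x) < abs(coarse_only - x))  # True
```

The same division of labor appears in TurboQuant: most bits go to the first stage, and a single sign bit per value cleans up the residual.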

The benchmark numbers

The team tested all three algorithms on open-source LLMs, specifically Gemma and Mistral, across five standard benchmarks: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval.


TurboQuant held full model accuracy across every task, including question answering, code generation, and summarization, while cutting KV cache memory by at least 6x. For “needle in a haystack” tests, which check whether a model can locate one specific fact buried inside a massive document, it scored perfectly.

Speed improved too. Running 4-bit TurboQuant on H100 GPUs, the team recorded up to 8x faster computation of attention logits compared to the 32-bit unquantized baseline. That’s a meaningful reduction in inference costs for anyone running large-scale deployments.

Vector search gets faster too

The KV cache story is only half of it.

Modern search engines match meaning, not just keywords. That requires vector similarity lookups across billions of entries. The TurboQuant team evaluated it against state-of-the-art vector quantization methods, product quantization (PQ) and RaBitQ, using the GloVe dataset (d=200). TurboQuant hit the best recall ratios without needing large codebooks or dataset-specific tuning.

That matters directly for Google’s search infrastructure. Semantic search at scale means querying massive vector indices constantly. TurboQuant reduces the memory and preprocessing cost of that without giving up accuracy. It’s a practical infrastructure win, not just a research result.

The research was led by Amir Zandieh and Vahab Mirrokni (VP and Google Fellow at Google Research), with contributions from researchers at Google DeepMind, KAIST, and NYU. TurboQuant is slated for presentation at ICLR 2026, and PolarQuant at AISTATS 2026.

If your team runs inference on long-context models and KV cache memory is already a cost line item, read the full papers linked on the Google Research blog before your next infrastructure planning session. The 3-bit, no-fine-tuning angle alone is worth the hour.
