Shrink AI Models: How Distillation Cuts Costs & Size

Summary
– DeepSeek’s R1 chatbot drew attention for rivaling top AI models with far less computing power and cost, causing significant stock drops for Western tech firms like Nvidia.
– Accusations arose that DeepSeek used distillation to gain knowledge from OpenAI’s proprietary model, though distillation is a common and established AI technique.
– Distillation originated from a 2015 Google paper and involves transferring “dark knowledge” from a large teacher model to a smaller student model using probability-based soft targets.
– The technique became widely adopted to reduce model size and cost without sacrificing accuracy, as seen in examples like DistilBERT, and is now offered by major tech companies.
– Distillation typically requires access to a model’s internals, making it difficult to apply covertly to a closed-source model, but it remains a fundamental and effective AI method with new applications still emerging.
Earlier this year, the Chinese AI firm DeepSeek introduced a chatbot named R1 that quickly captured global attention. The buzz centered on the company’s claim that it had developed a system rivaling top-tier models from industry giants while using far less computational power at a dramatically lower cost. This announcement sent shockwaves through the market, contributing to a historic single-day drop in the stock value of chipmaker Nvidia and other Western tech firms.
Some observers raised questions about how such rapid progress was achieved. Speculation emerged that DeepSeek might have used a technique called knowledge distillation to extract insights from proprietary models like OpenAI’s o1 without authorization. While this possibility was framed as a scandal in some reports, the reality is that distillation is neither new nor secret. In fact, it’s a well-established and widely used method for making AI systems more efficient.
Enric Boix-Adsera, a researcher at the University of Pennsylvania’s Wharton School who studies model efficiency, emphasized that “distillation is one of the most important tools that companies have today to make models more efficient.” The technique has been part of the AI toolkit for nearly a decade and is routinely employed by major tech companies to streamline their own systems.
The concept of distillation traces back to a 2015 paper authored by three Google researchers, including AI pioneer Geoffrey Hinton. At the time, it was common to combine multiple models, or “ensembles”, to improve performance, but doing so was computationally expensive and slow. Oriol Vinyals, a principal scientist at Google DeepMind and co-author of the paper, recalled, “We were intrigued with the idea of distilling that onto a single model.”
The team realized that machine learning models treated all errors equally: whether an image of a dog was misclassified as a fox or as a pizza, the penalty was the same. They suspected that larger “teacher” models contained nuanced information about which wrong answers were closer to being right. Training a smaller “student” model on these soft probability outputs, rather than on hard classifications, would let the student learn more efficiently. Hinton referred to this hidden information as “dark knowledge.”
Instead of just telling the student whether an image was a dog or not, the teacher might indicate there was a 30% chance it was a dog, 20% a cat, 5% a cow, and a tiny chance it was a car. These probability scores helped the student understand relationships between categories, accelerating learning and enabling the creation of compact models with nearly identical accuracy to their larger counterparts.
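To make the mechanism concrete, here is a minimal sketch of soft-target distillation in PyTorch. The tiny teacher and student networks, the temperature of 4, and the 0.7 weighting between the soft and hard losses are illustrative assumptions, not the original paper’s exact setup.

```python
# Minimal sketch of soft-target distillation (Hinton et al., 2015), using PyTorch.
# The tiny teacher/student networks, temperature, and loss weighting here are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

temperature = 4.0   # softens the teacher's probabilities to expose "dark knowledge"
alpha = 0.7         # weight on the soft-target loss vs. the hard-label loss

teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))  # larger model
student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))    # compact model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_step(x, hard_labels):
    with torch.no_grad():                      # the teacher is frozen
        teacher_logits = teacher(x)
    student_logits = student(x)

    # Soft targets: match the student's softened distribution to the teacher's.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                       # rescale gradients, as in the original paper

    # Hard targets: standard cross-entropy on the true labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data standing in for a real dataset.
x = torch.randn(64, 32)
y = torch.randint(0, 10, (64,))
print(distillation_step(x, y))
```

Dividing the logits by a temperature above 1 flattens the teacher’s distribution, so the small probabilities assigned to wrong-but-related classes (the dark knowledge) carry real weight in the student’s loss.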
Despite its promise, the idea didn’t gain immediate traction. The paper was initially rejected from a conference, and Vinyals moved on to other projects. But as AI models grew larger and more expensive to train and run, researchers returned to distillation as a practical solution. In 2018, Google introduced BERT, a powerful language model used to interpret billions of searches. The following year, a distilled version called DistilBERT was released, offering similar performance at a fraction of the cost.
Today, distillation is ubiquitous. Tech leaders like Google, OpenAI, and Amazon offer it as a service, and the original paper, still only published on arXiv, has been cited more than 25,000 times.
It’s worth noting that distilling a model typically requires access to its internal workings, making it difficult to secretly extract knowledge from a closed-source system like OpenAI’s o1. However, a student model can still learn from a teacher through strategic prompting, using the teacher’s responses to train itself in a process resembling Socratic dialogue.
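The sketch below illustrates that response-based approach under stated assumptions: query_teacher is a hypothetical stand-in for calls to a hosted model, and the prompts and JSONL fine-tuning format are illustrative, since only the teacher’s text outputs, not its internal probabilities, are available.

```python
# Minimal sketch of response-based ("black-box") distillation: the student only
# sees the teacher's text outputs, not its internal probabilities. query_teacher
# is a hypothetical stand-in for a call to a hosted model; the prompts and the
# fine-tuning format are illustrative assumptions.
import json

def query_teacher(prompt: str) -> str:
    # Placeholder: in practice this would send the prompt to the teacher model
    # and return its generated answer as plain text.
    return f"[teacher's answer to: {prompt}]"

prompts = [
    "Explain why the sky is blue in two sentences.",
    "Summarize the plot of Hamlet in one paragraph.",
]

# Each prompt/response pair becomes a supervised training example for the student.
training_examples = [{"prompt": p, "completion": query_teacher(p)} for p in prompts]

# Write the pairs in a simple JSONL format that fine-tuning pipelines commonly accept.
with open("distillation_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```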
Researchers continue to find new applications for distillation. In January, the NovaSky lab at UC Berkeley demonstrated that the technique works effectively for training chain-of-thought reasoning models, which use multi-step logic to solve complex problems. Their fully open-source Sky-T1 model was trained for less than $450 and performed on par with much larger systems.
Dacheng Li, a doctoral student and co-lead of the NovaSky team, remarked, “We were genuinely surprised by how well distillation worked in this setting. Distillation is a fundamental technique in AI.” As models grow ever larger and more costly, methods like these will play an increasingly vital role in making artificial intelligence more accessible and sustainable.
(Source: Wired)