How Tech Companies Are Adapting to Cheaper AI Models

Summary
– The AI industry’s core assumption that bigger models are always better is being challenged by rising costs, prompting a shift toward smaller, cheaper models.
– Coinbase co-founder Brian Armstrong predicts that within 12-18 months, 80% of AI workloads will run on 99% cheaper models, with only 20% using the most advanced ones.
– A shift to cheaper models could financially impact major labs like OpenAI and Anthropic, as savings would come from their revenue just before planned IPOs.
– A test by legal AI tool Harvey cut inference costs threefold without quality loss by using a cheaper model for most tasks and a premium model only for the most demanding ones.
– The real industry divide is between large and small models, not proprietary versus open-source, and users face new cost pressure as token prices rise and subsidies slow.
The foundational belief driving the AI boom has long been straightforward: bigger models produce better results, and the best results win. But the industry is now confronting a potential disruption if that assumption begins to unravel.
Rising operational costs are forcing users to reconsider smaller, more affordable AI models that were previously overlooked. This emerging trend of cost-driven model selection is still in its early stages, but its implications could reshape the entire sector.
Coinbase co-founder Brian Armstrong has offered a stark forecast: the majority of AI workloads will migrate to cheaper models. “Demand for intelligence is near infinite, but 80% of workloads will be running on 99% cheaper models within 12-18 months,” Armstrong wrote on X. “20% of workloads will still run on latest gen models where IQ maxing is important.”
The scale of this shift, if realized, would be transformative. Historically, AI companies have competed on quality, defaulting to the most advanced models available. If those tasks can be performed just as effectively by less expensive alternatives, the economics of AI would fundamentally change. Crucially, much of the cost savings would come at the expense of major labs like OpenAI and Anthropic, potentially undermining their financial positions as they approach IPOs.
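To put rough numbers on Armstrong's forecast, here is a minimal back-of-the-envelope sketch in Python. It assumes something the forecast does not state: that every workload costs the same at frontier prices, so a share of workloads translates directly into a share of spending. The function name `blended_cost` is illustrative only.

```python
def blended_cost(cheap_share: float, cheap_discount: float) -> float:
    """Return total spend relative to an all-frontier baseline of 1.0.

    Assumption (not from the forecast): all workloads cost the same at
    frontier prices, so workload shares map directly onto spend.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if not 0.0 <= cheap_discount <= 1.0:
        raise ValueError("cheap_discount must be between 0 and 1")
    frontier_share = 1.0 - cheap_share        # workloads staying on frontier models
    cheap_unit_cost = 1.0 - cheap_discount    # price of a cheap call vs. a frontier call
    return frontier_share * 1.0 + cheap_share * cheap_unit_cost


# Armstrong's figures: 80% of workloads on models that are 99% cheaper.
relative_spend = blended_cost(cheap_share=0.80, cheap_discount=0.99)
print(f"Relative spend: {relative_spend:.3f}")   # 0.208
print(f"Reduction: {1 - relative_spend:.1%}")    # 79.2%
```

Under that simplifying assumption, total spending would fall by roughly four-fifths, and nearly all of what remains would come from the 20% of workloads still running on frontier models.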
This potential upheaval hinges on one critical question: Are enterprises ready to embrace smaller models?
Early evidence suggests they might be. A recent test by legal AI startup Harvey, conducted with inference platform Fireworks AI, demonstrated a threefold reduction in inference costs without any decline in quality. The system combined Claude Opus with Fireworks’ GLM 5.1, reserving Opus for the most demanding tasks. The result was a significantly lighter load on servers and lower overall expenses.
“Quality comes first, and in legal it always will,” said Harvey co-founder Gabe Pereyra. “However, the definition of quality is evolving from simply using the most powerful model for everything, to using the best model that gets the right answer most efficiently.”
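The general pattern described here, sending most tasks to a cheaper model and reserving the premium one for the hardest work, can be sketched as a simple router. This is a toy illustration, not Harvey's or Fireworks AI's actual system: the difficulty scores, the 0.7 threshold, and the relative prices are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical relative prices per task, not actual vendor pricing.
PREMIUM_COST = 1.0
SMALL_COST = 0.1


@dataclass
class Task:
    name: str
    difficulty: float  # hypothetical score in [0, 1], e.g. from a classifier


def route(task: Task, threshold: float = 0.7) -> str:
    """Send only tasks at or above the difficulty threshold to the premium model."""
    return "premium" if task.difficulty >= threshold else "small"


def total_cost(tasks: list[Task], threshold: float = 0.7) -> float:
    """Sum the relative cost of running each task on its routed model."""
    costs = {"premium": PREMIUM_COST, "small": SMALL_COST}
    return sum(costs[route(task, threshold)] for task in tasks)


# Made-up workload: three routine tasks and one demanding one.
tasks = [
    Task("summarize filing", 0.2),
    Task("extract dates", 0.1),
    Task("classify clause", 0.3),
    Task("draft merger memo", 0.9),
]

routed = total_cost(tasks)
baseline = PREMIUM_COST * len(tasks)  # everything on the premium model
print(f"Routed cost: {routed:.2f} vs premium-only: {baseline:.2f}")
```

The savings in a sketch like this depend entirely on the assumed price gap and on how many tasks fall below the threshold. The harder part in practice is scoring difficulty reliably enough that quality does not slip.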
This trend is often framed as a battle between major labs and Chinese or open-weight models, but that perspective misses the larger picture. The real divide is between large models and small models, not proprietary versus open-source. Savings can come from switching from GPT-5.5 to DeepSeek’s V4 Flash, but moving to GPT-5.4-mini works just as well.
A price war is currently underway between in-house inference from big labs and independently served open-weight models. For the broader question of small versus large, it matters little which specific small model prevails.
This logic may seem obvious: why use more compute than necessary? Yet it contradicts the scaling-first approach that has defined the industry. Inspired by the "bitter lesson," Rich Sutton's argument that general methods which scale with compute tend to win out, labs have focused on training the most compute-intensive models possible, pushing the boundaries of capability. With investors heavily subsidizing prices, clients had no incentive to choose anything less than the most advanced option.
Now, as token prices rise and subsidies decline, users face cost pressure for the first time. It remains unclear whether this will drive enterprise users to smaller models. They might instead economize by making fewer calls, using less context, or abandoning the least promising deployments.
But if the majority of deployments can run just as effectively on smaller models, it could dampen the growing demand for inference and raise difficult questions about how to justify the expense of training frontier models.
(Source: TechCrunch)




