Probably Raises $9M for More Reliable AI Development

Summary
– Probably raised $9 million from Andreessen Horowitz to build a system that catches LLM hallucinations and factual errors before they reach users.
– The company aims for 99.99% accuracy, a standard common in deterministic systems but rare in AI, which requires rethinking basic AI engineering assumptions.
– Its first product is a data science tool that provides answers from complex datasets, each with a citation and audit trail.
– The tool uses a “harness system” where an LLM’s answers are checked against a deterministic validator, rejecting results that don’t match the dataset.
– This approach allows the system to run on smaller, cheaper local AI models, reducing token costs and extending to precision-sensitive fields like accounting or medicine.

Andreessen Horowitz has placed a $9 million seed bet on a startup called Probably, which aims to build a more dependable framework for catching AI errors. Even as large language models grow more capable, hallucinations remain a persistent problem, and the industry has yet to settle on an effective solution.
Probably’s approach, according to founder Peter Elias, is to prevent hallucinations and factual mistakes from ever reaching the end user. The goal is to achieve the 99.99% accuracy standard common in deterministic systems, a level that has proven far more elusive in AI development. Achieving this requires a fundamental rethinking of many core AI engineering assumptions.
The company’s first product is a data science tool designed to generate quick, reliable answers from complex datasets. Every result is accompanied by a citation and a full audit trail showing how the answer was derived, a practice that is becoming standard among responsible AI tools.
To keep errors out of these answers, Probably developed an elaborate harness system that Elias describes as a “data science mech suit.” In this system, the LLM’s initial responses are checked against a deterministic validator. Any result that doesn’t align with the dataset is rejected and sent back for correction. Critically, the LLM has been trained specifically against this validator, and the entire system is optimized for both speed and accuracy.
“What we learned building this was that the better your harness engineering is, the weaker the model can be,” Elias explains. “If you can refine the context enough, the model does not have to work very hard to do the right thing. Basically, it’s an exercise in reducing ambiguity.”
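Probably has not published its implementation, so the following is only a minimal sketch of the general pattern described above: a model proposes an answer, a deterministic validator recomputes it from the dataset, and mismatches are sent back with feedback. Everything here is a hypothetical stand-in for illustration, including the `SALES` data, the `Answer` type, the `validate_total` and `run_harness` functions, and the `stub_model` that plays the role of the LLM. The sketch also leaves out the step of training the model against the validator.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Toy dataset; the values are invented for illustration only.
SALES = [
    {"region": "north", "revenue": 120},
    {"region": "south", "revenue": 80},
    {"region": "north", "revenue": 100},
]


@dataclass
class Answer:
    value: float                 # the figure the model claims
    cited_rows: list[int]        # citation: row indices the figure rests on
    audit_trail: list[str] = field(default_factory=list)


def validate_total(answer: Answer, region: str) -> Optional[str]:
    """Deterministic check: recompute the total and compare.

    Returns None when the answer matches the dataset, else an error message.
    """
    expected_rows = [i for i, row in enumerate(SALES) if row["region"] == region]
    expected_value = sum(SALES[i]["revenue"] for i in expected_rows)
    if sorted(answer.cited_rows) != expected_rows:
        return f"cited rows {answer.cited_rows} do not match rows for region {region!r}"
    if answer.value != expected_value:
        return f"value {answer.value} does not equal the sum of the cited rows"
    return None


def run_harness(
    model: Callable[[str, Optional[str]], Answer],
    region: str,
    max_attempts: int = 3,
) -> Optional[Answer]:
    """Ask the model, validate, and retry with feedback until accepted."""
    feedback: Optional[str] = None
    trail: list[str] = []
    for attempt in range(1, max_attempts + 1):
        answer = model(region, feedback)
        error = validate_total(answer, region)
        if error is None:
            trail.append(f"attempt {attempt}: accepted")
            answer.audit_trail = trail
            return answer
        trail.append(f"attempt {attempt}: rejected ({error})")
        feedback = error  # the rejection reason goes back to the model
    return None  # nothing unverified ever reaches the user


def stub_model(region: str, feedback: Optional[str]) -> Answer:
    """Hypothetical stand-in for an LLM: wrong first, right after feedback."""
    if feedback is None:
        return Answer(value=200, cited_rows=[0])  # misses one northern row
    return Answer(value=220, cited_rows=[0, 2])


result = run_harness(stub_model, "north")
if result is not None:
    print(result.value, result.cited_rows)
    for line in result.audit_trail:
        print(line)
```

In this toy version, the first attempt is rejected because its citation misses a row, and the second is accepted. The design point is that the harness returns nothing at all if every attempt fails, so an unvalidated answer never reaches the user.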
This design allows Probably’s tool to run on much smaller AI models. Elias notes the current version operates on a model “four classes weaker than the frontier models,” enabling it to run on local hardware like a desktop computer instead of a data center. This significantly reduces the token costs associated with AI use.
This cost efficiency arrives at a welcome time, as token prices are climbing and many businesses are rethinking their AI spending. Elias sees the potential extending far beyond data science, envisioning the same engine applied to precision-sensitive use cases like accounting or medical services.
“I think it’s really interesting that the big AI labs have not even attempted to do this,” Elias says. “They’re incentivized not to, because they make money the more times you have to correct the model.”
(Source: TechCrunch)




