Theorem wants to stop AI-written bugs before they ship – and just raised $6 million to do it



As artificial intelligence reshapes software development, a small startup is betting that the industry’s next big bottleneck won’t be writing code, but trusting it.

Theorem, a San Francisco-based company from Y Combinator's Spring 2025 batch, announced Tuesday that it has raised $6 million in seed funding to build automated tools that verify the correctness of AI-generated software. Khosla Ventures led the round, with participation from Y Combinator, e14, SAIF, Halcyon, and angel investors including Blake Borgesson, co-founder of Recursion Pharmaceuticals, and Arthur Breitman, co-founder of the blockchain platform Tezos.

The investment comes at a pivotal time. AI coding assistants from companies like GitHub, Amazon, and Google now generate billions of lines of code each year, and enterprise adoption is accelerating. But the ability to verify that software written by AI actually works as intended has not kept pace, creating what Theorem's founders describe as a widening "oversight gap" that threatens critical infrastructure, from financial systems to power grids.

"We are already there," said Jason Gross, co-founder of Theorem, when we asked if AI-generated code was beyond the capacity of human review. "If you asked me to review 60,000 lines of code, I wouldn’t know how to do it."

Why AI writes code faster than humans can verify it

Theorem’s core technology combines formal verification (a mathematical technique that proves that software behaves exactly as specified) with trained AI models to automatically generate and verify proofs. This approach transforms a process that historically required years of PhD-level engineering into something the company says can be done in weeks or even days.

Formal verification has existed for decades but has remained limited to the most critical applications: avionics systems, nuclear reactor controls, and cryptographic protocols. The technique's prohibitive cost – often eight lines of mathematical proof for every line of code – made it impractical for mainstream software development.
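To make the idea concrete, here is a toy illustration (not Theorem's code) of what formal verification looks like in the Lean proof assistant, one of the tools the article mentions: a function is shipped together with a machine-checked proof of a property it must satisfy. If the proof does not hold, the compiler rejects the file.

```lean
-- A simple absolute-value function on integers.
def absVal (n : Int) : Int :=
  if n < 0 then -n else n

-- A machine-checked proof that the result is never negative.
-- There is no way to "ship" a violation: an incorrect proof
-- simply fails to compile.
theorem absVal_nonneg (n : Int) : 0 ≤ absVal n := by
  unfold absVal
  split
  · omega  -- case n < 0: goal is 0 ≤ -n
  · omega  -- case ¬ n < 0: goal is 0 ≤ n
```

The "eight lines of proof per line of code" cost quoted above comes from scaling this kind of reasoning up to real systems, where the properties and case splits are far less tidy.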

Gross knows this firsthand. Before founding Theorem, he obtained his doctorate at MIT working on cryptography code that now powers the HTTPS security protocol protecting billions of Internet connections every day. This project, according to his estimates, required fifteen person-years of work.

"No one prefers to have incorrect code," Gross said. "Software verification simply wasn’t economical before. In the past, the tests were written by doctorate-level engineers. Now AI writes it all."

How formal verification finds bugs that traditional testing misses

Theorem's system works on a principle Gross calls "fractional proof decomposition." Rather than exhaustively testing all possible behaviors – which is computationally infeasible for complex software – the technology allocates verification resources in proportion to the importance of each component of the code.
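The core resource-allocation idea can be sketched in a few lines. This is a deliberately simplified illustration of proportional budgeting, not Theorem's actual algorithm; the component names and weights are invented for the example.

```python
def allocate_verification_budget(
    components: dict[str, float], total_hours: float
) -> dict[str, float]:
    """Split a fixed verification budget across components by importance.

    `components` maps each component name to an importance weight;
    higher-weight components receive proportionally more effort.
    """
    total_weight = sum(components.values())
    return {
        name: total_hours * weight / total_weight
        for name, weight in components.items()
    }


# Hypothetical example: spend most of a 100-hour budget on the
# security-critical parser, less on peripheral logging code.
budget = allocate_verification_budget(
    {"parser": 6.0, "scheduler": 3.0, "logging": 1.0}, 100.0
)
```

The point of the decomposition is economic: exhaustive proof effort everywhere is infeasible, but concentrating it where a bug would be most damaging keeps total cost bounded.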

The approach recently identified a bug that had escaped testing at Anthropic, the AI safety company behind the Claude chatbot. Gross said the technique helps developers "detect their bugs now without spending a lot of compute."

In a recent tech demo called SFBench, Theorem used AI to translate 1,276 problems from Rocq (a formal proof assistant) to Lean (another verification language), then automatically proved that each translation was equivalent to the original. The company estimates that a human team would have required approximately 2.7 person-years to accomplish the same work.

"Anyone can run agents in parallel, but we are also able to run them sequentially," Gross explained, noting that Theorem’s architecture handles interdependent code — where solutions build on each other across dozens of files — that trips up conventional AI coding agents limited by pop-ups.

How one company turned a 1,500-page specification into 16,000 lines of reliable code

The startup already works with customers in AI research labs, electronic design automation, and GPU-accelerated computing. One case study illustrates the technology's practical value.

A customer came to Theorem with a 1,500-page PDF specification and an existing software implementation plagued by memory leaks, crashes, and other elusive bugs. Their most pressing problem: improving performance from 10 megabits per second to 1 gigabit per second – a 100-fold increase – without introducing additional errors.

Theorem's system generated 16,000 lines of production code, which the customer deployed without ever manually reviewing it. The confidence came from a compact executable specification – a few hundred lines that distilled the huge PDF document – coupled with an equivalence-checking harness that verified the new implementation matched the intended behavior.
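The pattern of "small executable spec plus equivalence harness" can be sketched as follows. This is a hypothetical miniature, not the customer's system: the record format, function names, and the use of random sampling (rather than a formal proof over all inputs) are all illustrative assumptions.

```python
import random

# Executable specification: a tiny reference parser that *defines*
# the intended behavior (here, decoding length-prefixed records
# from a byte stream). Clarity matters more than speed.
def spec_parse(data: bytes) -> list[bytes]:
    records, i = [], 0
    while i < len(data):
        n = data[i]
        records.append(data[i + 1 : i + 1 + n])
        i += 1 + n
    return records

# Candidate implementation: a faster version whose correctness is
# established only by agreement with the spec, never by eyeballing.
def fast_parse(data: bytes) -> list[bytes]:
    records, view, i = [], memoryview(data), 0
    while i < len(view):
        n = view[i]
        records.append(bytes(view[i + 1 : i + 1 + n]))
        i += 1 + n
    return records

# Equivalence harness: the candidate must agree with the spec on
# every input. Random testing samples the input space; a formal
# equivalence check, as in Theorem's approach, covers all of it.
def check_equivalence(trials: int = 1000) -> bool:
    rng = random.Random(0)
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        if fast_parse(data) != spec_parse(data):
            return False
    return True
```

The design insight is that trust attaches to the few hundred auditable lines of spec, not to the 16,000 generated lines: once the harness ties the two together, the large implementation inherits the spec's credibility.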

"They now have a production-grade analyzer running at 1 Gbps that they can deploy with confidence that no information is lost during analysis." Gross said.

Security risks lurking in AI-generated software for critical infrastructure

The funding announcement comes as policymakers and technologists increasingly scrutinize the reliability of AI systems embedded in critical infrastructure. Software already controls financial markets, medical devices, transportation networks, and power grids. AI is accelerating both the pace at which that software evolves and the ease with which subtle bugs can propagate.

Gross frames the challenge in terms of security. As AI makes finding and exploiting vulnerabilities cheaper, defenders need what he calls "asymmetric defense" — scalable protection without a proportional increase in resources.

"Software security is a delicate balance between attack and defense," he said. "With AI hacking, the cost of hacking a system drops sharply. The only viable solution is asymmetric defense. If we want a software security solution that can last beyond a few generations of model improvements, it will be through verification."

When asked whether regulators should mandate formal verification of AI-generated code in critical systems, Gross offered a pointed answer: "Now that formal verification is cheap enough, failure to use it to secure critical systems could be considered gross negligence."

What sets Theorem apart from other AI code verification startups

Theorem is entering a market where many startups and research labs are exploring the intersection of AI and formal verification. According to Gross, the company's differentiation lies in its focus on oversight of evolving software rather than on applying verification to mathematics or other fields.

"Our tools are useful to systems engineering teams, working close to the metal, who need guarantees of accuracy before merging changes," he said.

The founding team reflects this technical orientation. Gross brings deep expertise in programming-language theory and experience deploying verified code to production at scale. Co-founder Rajashree Agrawal, a machine-learning research engineer, focuses on training the AI models that power the verification pipeline.

"We are working on formal program reasoning so that everyone can not only supervise the work of an average software engineer-level AI, but also actually harness the capabilities of a Linus Torvalds-level AI," Agrawal said, referring to the legendary creator of Linux.

The race to check AI code before it controls everything

Theorem plans to use the funding to expand its team, increase computational resources for training verification models, and expand into new industries including robotics, renewable energy, cryptocurrency, and drug synthesis. The company currently employs four people.

The startup’s emergence signals a shift in how enterprise technology leaders may need to evaluate AI coding tools. The first wave of AI-assisted development promised productivity gains: more code, faster. Theorem is betting that the next wave will require something different: mathematical proof that speed doesn’t come at the expense of safety.

Gross presents the stakes in stark terms. AI systems are improving exponentially. If this trajectory continues, he believes superhuman software engineering is inevitable: AI capable of designing systems more complex than anything humans have ever built.

"And without a radically different surveillance economy," he said, "we will end up deploying systems that we do not control."

Machines write the code. Now someone needs to check their work.
