Your AI may need AI to oversee its work. Gallic AI biz Mistral is leaning into making AI code generation more reliable with Leanstral, a coding agent for writing proofs in the open source Lean programming language.
Formal code verification, Mistral argues, reduces the need for human code review, a potentially time-consuming process. Proofs, tests, linting, and specifications can help ground AI code agents in reality so that they produce better output.
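To illustrate the idea (this is a toy sketch, not Leanstral output): in Lean 4, a definition can be paired with a theorem that pins down its behavior, so any code change that violates the spec simply fails to compile. The names here are invented for the example; the `omega` tactic for linear arithmetic ships with recent Lean 4 releases.

```lean
-- A function the agent might generate...
def double (n : Nat) : Nat := n + n

-- ...and a specification that locks in its meaning.
-- If `double` is later edited to something that isn't 2 * n,
-- this theorem no longer compiles — no human review required.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

That compile-time guarantee is the "grounding" Mistral is pointing at: the proof obligation plays the role a reviewer otherwise would.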
Leanstral has been released with open weights (Apache 2.0) as an agent mode within Mistral Vibe, and via a free API endpoint. It is accompanied by results from an as-yet-unreleased benchmark test called FLTEval, designed to evaluate how AI models handle engineering proofs.
According to Mistral, Leanstral-120B-A6B outperforms larger (more parameters) open source rivals like GLM5-744B-A40B, Kimi-K2.5-1T-32B, and Qwen3.5-397B-A17B on FLTEval.
But perhaps more noteworthy is Leanstral's effect on one's bank account.
"Leanstral serves as a high-value alternative to the Claude suite, offering competitive performance at a fraction of the price: Leanstral pass@2 reaches a score of 26.3, beating Sonnet by 2.6 points, while costing only $36 to run, compared to Sonnet's $549," the AI biz claims. "At pass@16, Leanstral reaches a score of 31.9, comfortably beating Sonnet by 8 points."
As for Claude Opus 4.6, Anthropic's premium model at the moment, it scores higher on FLTEval than Leanstral (39.6 compared to 31.9 for pass@16). But Opus will cost $1,650, compared to $290 for Leanstral at 16 passes, or $18 for a single pass at a score of 21.9.
As proof of Leanstral's adept handling of test-driven development, Mistral had the coding agent tackle an actual question from the Proof Assistants Stack Exchange about a bug in some Lean 4 code.
The company reports that Leanstral successfully built the test code to reproduce the failure and then correctly spotted and fixed the flaw.
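The workflow reads roughly like classic test-driven development, sketched below with a made-up bug (the actual Stack Exchange code isn't reproduced in the article). The function names and the bug itself are hypothetical.

```lean
-- Step 1: the buggy code under investigation.
-- Intended to return the last element of a list, but returns the head.
def lastOrNoneBad : List Nat → Option Nat
  | []     => none
  | x :: _ => some x   -- bug: ignores the tail entirely

-- Step 2: a test that reproduces the failure.
-- This `example` does NOT compile against the buggy version,
-- confirming the reported behavior:
--   example : lastOrNoneBad [1, 2, 3] = some 3 := rfl

-- Step 3: the fix — recurse until the final element.
def lastOrNone : List Nat → Option Nat
  | []      => none
  | [x]     => some x
  | _ :: xs => lastOrNone xs

-- The same test now passes, checked by the compiler via `rfl`.
example : lastOrNone [1, 2, 3] = some 3 := rfl
```

In Lean the "test" is itself a proposition checked at compile time, which is what makes the proof-assistant setting a natural fit for this reproduce-then-fix loop.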
Mistral also released Mistral Small 4, designed as an all-in-one model that can handle reasoning, coding, and instruct/chat tasks, so you don't have to switch between specialized models. ®
Source: The Register