
Water company wasted $200k on bad answers from AI models – so it built its own slop-filtering system

Tech companies have in recent years developed a reputation as rapacious rent-seekers, but they can also be unwittingly generous: their penchant for prioritizing popularity over quality leaves room for others to sell improvements or repairs.

Waterline Development, a water desalination startup, is the beneficiary of this legacy of commercial haste. Having tried AI models and found them wanting, it came up with a fix.

Derek Bednarski, founder and CEO, told The Register in an email that when his company tried to use large language models for materials science research "they were confidently wrong in ways that cost us months."

Bednarski said his company was trying to build a desalination product that was essentially a water battery – charging the cell would remove ions like salt from the water.

"We were debating between carbon cloth and cast carbon electrodes," he explained. "Not being PhDs in the space, we read relevant academic papers and used LLMs like Grok and ChatGPT to validate our findings. We chose carbon cloth, which is heavily used in academic papers like the Stanford dissertation we based our initial prototypes on, due to commercial availability."

That material, he said, turned out to have issues that didn't exist for cast carbon electrodes, including poor conductivity, water retention that affected ion removal, and poor durability.

"While we were not solely relying on LLMs, they did influence our research meaningfully," said Bednarski. "LLMs chose statistics from various papers and fields (such as citing the lifespan of a carbon electrode in a capacitor) and put them together in ways that were plausible enough. Ultimately, we spent four months and $200,000 validating this material would not in fact work past pilot scale; cast carbon electrodes would be superior."

The problem Waterline Development encountered is that commercial AI models are ill-suited to multidisciplinary research, which requires synthesizing expertise from a variety of fields.

"No single AI model does this reliably," the company explains in a white paper [PDF]. "Frontier language models hallucinate under extended multi-step reasoning. They produce plausible answers that silently break when a problem crosses domain boundaries. At best this wastes time; at worst, it poisons critical decision making."

Rather than trying to integrate domain-specific tools or to make the work of human expert teams more efficient, Waterline created Rozum, a multi-model reasoning system that operates various AI models in parallel and synthesizes their answers through a verification layer.

Rozum – which takes its name from the Slavic word for "reason" and has since been spun out as an AI startup under Bednarski – is a model orchestration system that operates at inference time. It relies on an ensemble of commercial models, open weight models, and domain-specialized models. Each model processes incoming queries using tools that perform verifiable operations and return deterministic results, which serve to ground its answers.

The tool passes answers through a verification layer designed to detect and correct hallucinations: errant claims, miscalculations, and phony citations.

Rozum uses a deterministic verification process to advance a final answer based on the evidence and reasoning from the ensemble of models it employs. According to the white paper, the system can come up with correct answers from a set of partial truths, even if no individual model has the complete, correct answer.

Bednarski said the goal of Rozum is not to correct LLMs to the point where they can be trusted with, say, critical engineering work like bridge construction. Rather, it is to empower researchers, engineers, and scientists to do their jobs better.

"We are focused on deterministic tool implementation (ex. RDKit for Chemistry), allowing engineers, scientists, and analysts a direct path to verify outputs in a format familiar to them by domain," he explained.

"Our system orchestration method is heavily focused on deterministic validation (code execution replicated, etc.) of outputs, which roots out hallucinations that plague all models at various times. We see further improvements to this in verifying the methods used in sources we cite as well."
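The idea of grounding a model's claim in a deterministic domain tool – Bednarski's RDKit-for-chemistry example – can be illustrated with a toy stand-in. The periodic-table lookup below is a hypothetical miniature of what a real cheminformatics library would provide; the point is only that the tool's output is reproducible, so a model's claim either agrees with it or gets flagged.

```python
# Toy stand-in for a domain tool such as RDKit: a deterministic molar-mass
# calculator used to check a model's chemistry claim against ground truth.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def molar_mass(formula: dict[str, int]) -> float:
    """Sum atomic masses deterministically; same input, same output, every run."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())

def check_claim(formula: dict[str, int], claimed_mass: float, tol: float = 0.01) -> bool:
    # A model's claim is accepted only if the tool's answer agrees within tolerance.
    return abs(molar_mass(formula) - claimed_mass) <= tol

# Ethanol, C2H6O: a model claiming ~46.07 g/mol passes verification.
print(check_claim({"C": 2, "H": 6, "O": 1}, 46.07))  # True
```

Because the check is exact rather than generative, it can't hallucinate – a property Bednarski's "direct path to verify outputs" depends on.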

Rozum can spend minutes or even hours working on its responses, much more time than commercial AI models like Gemini 3.1 Pro or GPT 5.4 require with native tools. So it's not well-suited for real-time conversations, high-volume commodity queries, or tasks where current frontier models perform adequately.

We are prepared to further increase costs if it drives a meaningful gain in outcomes

As such it costs more, but the cost probably isn't consequential for the kind of projects to which Rozum is best suited.

"It does cost more than running a single frontier model," said Bednarski. "However, Rozum is being used by early customers for high-stakes questions and decision-making, like a $3M solar investment or allocating months of engineering time towards one R&D priority or another. In these cases our customers prioritize intelligence over all else. We are prepared to further increase costs if it drives a meaningful gain in outcomes for customers who are making expensive decisions regularly."

But he claims it gets much better results. Rozum outscored GPT-4, Grok 4, and Gemini 3.1 Pro on the Humanity's Last Exam benchmark by several percentage points or more in every category but one.

[Chart: Rozum performance on Humanity's Last Exam]

"When we ran 1,000 PhD-level benchmark questions through the pipeline, the verification layer flagged unsupported claims in 76.2 percent of frontier model responses and couldn't confirm cited sources in 21.3 percent," he said. "Only 5.5 percent of questions produced clean consensus across all models."

That consensus rate – 5.5 percent – underscores how variable AI model responses can be and why AI alone is not enough.

Rozum debuted last week and is currently offered through a wait list. ®

Source: The Register
