Silencing Agents Beats Letting Them Talk: DarkForest Tops Baselines by Up to 30.7%

DarkForest keeps multi-agent LLMs from seeing each other's reasoning, then aggregates structured beliefs, improving on the strongest baseline by up to 30.7% and cutting token use by up to 6.5x.

The default design for multi-agent reasoning systems assumes more communication is better. If one agent shows its work, the others can correct it. If a consensus forms across several chains of thought, that consensus should be more reliable. DarkForest tested that assumption on six reasoning benchmarks and found the opposite: agents that never see each other's intermediate reasoning produce more accurate answers, not less accurate ones.

The failure mode in communication-heavy ensembles is specific and worth naming precisely. When agents share raw reasoning traces, a plausible-sounding but incorrect intermediate step gets picked up by downstream agents, incorporated into their own chains, and then reinforced at the aggregation stage. The error doesn't get filtered out. It gets amplified into a confident wrong answer. This is not a theoretical concern: it is the mechanism that explains why verbose multi-agent designs underperform on hard reasoning tasks despite their higher token budgets.

DarkForest bypasses this by enforcing silence at the reasoning stage. Each agent produces an answer independently, with no visibility into what any other agent is doing. The system then parses those raw responses into structured candidate records, groups semantically equivalent answers into clusters, and builds a calibrated belief distribution over those clusters. That distribution incorporates five factors: agent reliability, confidence, parse quality, support-pattern reliability, and an independence correction that adjusts for the fact that agents trained on similar data are not truly independent sources of evidence. A coordinator receives only what the belief state permits, not the full reasoning trace. Think of it as the difference between a jury that deliberates openly from the start, where early speakers anchor the room, and one where every juror writes a sealed verdict first and the foreman counts patterns across sealed envelopes before any discussion begins.
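The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DarkForest's actual implementation: the `Candidate` fields, the `normalize` function (a crude stand-in for semantic clustering), the multiplicative weighting, and the per-family independence discount are all hypothetical choices made for the example. Support-pattern reliability is left out because the summary does not say how it is computed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    """One parsed answer from one agent (all field names are hypothetical)."""
    agent: str            # agent identifier
    answer: str           # final answer, produced with no view of other agents
    confidence: float     # self-reported confidence in [0, 1]
    reliability: float    # assumed prior reliability of the agent in [0, 1]
    parse_quality: float  # how cleanly the raw response parsed, in [0, 1]
    family: str           # model family, used for the independence discount


def normalize(answer: str) -> str:
    """Stand-in for semantic clustering: ignore case, spacing, trailing period."""
    return " ".join(answer.lower().split()).rstrip(".")


def belief_distribution(candidates):
    """Return a normalized belief over answer clusters."""
    # Assumption: agents from the same family split a single "vote" evenly.
    family_sizes = defaultdict(int)
    for c in candidates:
        family_sizes[c.family] += 1

    scores = defaultdict(float)
    for c in candidates:
        independence = 1.0 / family_sizes[c.family]
        weight = c.reliability * c.confidence * c.parse_quality * independence
        scores[normalize(c.answer)] += weight

    total = sum(scores.values())
    if total == 0:
        return {}
    return {cluster: score / total for cluster, score in scores.items()}


# Three sealed answers: two agreeing agents from one family, one dissenter.
candidates = [
    Candidate("a1", "42", 0.9, 0.8, 1.0, "family_x"),
    Candidate("a2", " 42. ", 0.8, 0.8, 1.0, "family_x"),
    Candidate("a3", "41", 0.7, 0.9, 1.0, "family_y"),
]
beliefs = belief_distribution(candidates)
for cluster, probability in sorted(beliefs.items()):
    print(f"{cluster}: {probability:.3f}")
```

In this toy case a plain majority vote would favor "42" two to one, while the discounted belief state is close to even, because the two agreeing agents are treated as partially redundant evidence. A coordinator that sees only this distribution never reads any agent's reasoning trace.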

Across six reasoning benchmarks, DarkForest improves the strongest baseline by up to 30.7% on benchmark metrics. Token consumption drops by up to 6.5x compared with communication-heavy baselines. For teams running multi-agent inference pipelines at scale, the takeaway is direct: the token overhead of verbose agent communication is not buying accuracy, and a structured belief aggregation layer can recover more signal from independent completions than round-trip reasoning exchanges can.

We're thinking: We find the independence correction the most underappreciated piece here. Multi-agent systems built on models from the same family, fine-tuned on overlapping data, are not producing independent evidence; they are producing correlated noise with extra steps. DarkForest's explicit correction for this correlation is the part that separates it from naive majority voting. The broader implication is uncomfortable for teams that have invested in elaborate agent communication protocols: the "show your work" design pattern, borrowed from chain-of-thought prompting for single models, may be actively harmful when applied to ensembles. Silence is not a limitation to work around. It is the feature.
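To see why correlation matters so much, consider a back-of-the-envelope calculation. The sketch below uses the standard design-effect formula for equally correlated observations, n / (1 + (n - 1) * rho), to estimate how many independent votes a group of agents is really worth. It is an illustration of the general statistical point, not the correction DarkForest uses; the function name and the single shared correlation `rho` are assumptions made for the example.

```python
def effective_votes(n_agents: int, rho: float) -> float:
    """Effective number of independent votes among equally correlated agents.

    Standard design-effect formula: n / (1 + (n - 1) * rho), where rho is an
    assumed pairwise correlation between agents' errors, in [0, 1].
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    return n_agents / (1.0 + (n_agents - 1) * rho)


# Five agents at increasing levels of assumed error correlation.
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}: {effective_votes(5, rho):.2f} effective votes")
```

Under this simplified model, five agents whose errors are correlated at 0.5 carry less evidence than two truly independent agents, which is why counting them as five separate votes overstates the consensus.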

Key takeaways:

DarkForest enforces agent independence at reasoning time, then aggregates structured belief distributions that account for reliability, confidence, and inter-agent correlation, rather than passing raw reasoning traces between agents.
On six reasoning benchmarks, it improves on the strongest baseline by up to 30.7% and reduces token consumption by up to 6.5x relative to communication-heavy baselines; the benchmarks are reasoning-focused, so generalization to open-ended generation tasks remains an open question.
Teams running multi-agent LLM pipelines for reasoning-intensive tasks should audit whether inter-agent reasoning sharing is actually improving accuracy or just increasing latency and cost, and treat DarkForest's silence-first aggregation as a reference architecture for replacement.

Source: DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs