Your LangGraph Orchestrator Is Failing 24% of Travel Conversations

A controlled comparison shows in-context self-orchestration beats LangGraph on procedural tasks, cutting failure rates by at least half in each of three domains.

The assumption behind every agent framework is that an external orchestrator makes complex multi-turn workflows more reliable. Put a graph above the model, track state externally, inject routing instructions at each turn. That architecture has become the default. It is also, for procedural tasks, the worse option.

The comparison is direct and controlled: the same model, the same tasks, two architectures. One uses LangGraph to manage state and routing across a defined procedure. The other puts the entire procedure in the system prompt and lets the model handle sequencing itself. Three domains were tested: travel booking (14 nodes), Zoom technical support (14 nodes), and insurance claims processing (55 nodes). Each condition ran 200 conversations, scored by an LLM-as-judge across five quality criteria on a 5-point scale.

The in-context approach scores between 4.53 and 5.00 across all three domains. LangGraph, using the identical underlying model, scores between 4.17 and 4.84. The gap is consistent across domains and shows up most clearly in outright failures. The orchestrated system fails on 24% of travel conversations, 9% of Zoom conversations, and 17% of insurance conversations. The in-context baseline fails on 11.5%, 0.5%, and 5%, respectively. In every domain, the simpler architecture wins on both quality score and failure rate.
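To make the size of the gap concrete, here is a minimal sketch of the arithmetic behind those failure rates. The percentages come from the comparison above; the function name `summarize` is illustrative, and the failure counts assume that each of the 200-conversation conditions was run per domain, which the source does not state explicitly.

```python
# Failure rates (percent) as reported in the article.
# ASSUMPTION: 200 conversations per condition, per domain.
CONVERSATIONS = 200

failure_pct = {
    "travel": {"langgraph": 24.0, "in_context": 11.5},
    "zoom": {"langgraph": 9.0, "in_context": 0.5},
    "insurance": {"langgraph": 17.0, "in_context": 5.0},
}


def summarize(rates):
    """Return implied failure counts, percentage-point gap, and ratio per domain."""
    rows = {}
    for domain, r in rates.items():
        rows[domain] = {
            "langgraph_failures": round(r["langgraph"] / 100 * CONVERSATIONS),
            "in_context_failures": round(r["in_context"] / 100 * CONVERSATIONS),
            "point_gap": r["langgraph"] - r["in_context"],
            "ratio": r["langgraph"] / r["in_context"],
        }
    return rows


if __name__ == "__main__":
    for domain, row in summarize(failure_pct).items():
        print(domain, row)
```

Under that assumption, the LangGraph failure rate is roughly 2.1 times the in-context rate for travel, 18 times for Zoom, and 3.4 times for insurance.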

Why does external orchestration underperform? The orchestrator-above-model pattern was designed for an era when frontier models could not reliably track multi-step procedural state across a conversation. The orchestrator compensated for that limitation by holding state externally and injecting it as instructions. That injection creates its own failure surface: routing decisions can misfire, injected context can conflict with conversational history, and the model receives fragmented instructions rather than a coherent procedural scaffold. A system prompt containing the full procedure gives the model everything it needs at inference start. No external state transitions, no injection timing errors, no mismatch between the orchestrator's routing logic and the model's current context window.
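As an illustration of the in-context side of that contrast, the sketch below flattens a procedure graph into a single system prompt. It is not the setup used in the comparison: the `Step` class, the `render_system_prompt` function, and the prompt wording are all hypothetical, chosen only to show what handing the model the full procedure up front could look like.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node of a procedure (hypothetical structure, for illustration only)."""
    node_id: str
    instruction: str
    # Maps a plain-language condition to the node_id of the next step.
    next_steps: dict = field(default_factory=dict)


def render_system_prompt(title, steps):
    """Flatten the whole procedure into one prompt the model sees from turn one."""
    known = {s.node_id for s in steps}
    lines = [
        f"You are handling: {title}.",
        "Follow the procedure below and keep track of the current step yourself.",
        "",
    ]
    for step in steps:
        lines.append(f"[{step.node_id}] {step.instruction}")
        for condition, target in step.next_steps.items():
            if target not in known:
                raise ValueError(f"{step.node_id} points to unknown step {target}")
            lines.append(f"  - if {condition}: go to [{target}]")
    return "\n".join(lines)


if __name__ == "__main__":
    procedure = [
        Step("greet", "Greet the traveler and ask for a destination.",
             {"destination given": "dates"}),
        Step("dates", "Ask for travel dates and confirm them."),
    ]
    print(render_system_prompt("travel booking", procedure))
```

The design point is that routing lives in the prompt text rather than in code that runs between turns, so there is no per-turn injection step that can fall out of sync with the conversation.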

The 55-node insurance workflow is the most telling data point. At that complexity level, the argument for external orchestration is strongest, and yet the gap persists: LangGraph fails on 17% of conversations versus 5% for in-context, a more than threefold difference. For teams building procedural agents, the takeaway is direct: if your workflow follows a defined procedure, adding an orchestration layer above a frontier model is adding failure modes, not removing them.

We're thinking: The insurance result is the sharpest signal here. At 55 nodes, conventional wisdom says you need external state management, and yet the in-context approach cuts failure rates to less than a third of LangGraph's. That reframes the entire value proposition of orchestration frameworks for procedural work: they were compensating for model limitations that frontier models have largely outgrown. The practical implication is uncomfortable for teams that have invested in LangGraph or CrewAI pipelines. Orchestration layers add latency and engineering overhead, and in this comparison LangGraph also produced higher failure rates, with no quality return on procedural tasks. Only LangGraph was tested, so the extension to similar frameworks is an inference. The caveat worth holding: this finding applies to defined procedures, not open-ended agentic tasks where the action space is unbounded and external state management may still earn its keep.

Key takeaways:

Putting the full procedure in the system prompt lets the model self-orchestrate, eliminating the injection timing errors and routing mismatches that external orchestrators introduce at every state transition.
In-context prompting scores 4.53-5.00 vs. LangGraph's 4.17-4.84 on a 5-point scale, with failure rates roughly two to eighteen times lower across the three domains tested. The caveat: all tasks follow a defined procedural structure, so the result does not cover open-ended agentic workflows.
Teams running LangGraph, CrewAI, or similar frameworks on procedural workflows should run a direct comparison against a system-prompt-only baseline before the next architecture review.
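
A direct comparison of this kind can be small. The sketch below assumes you already have two callables that each run one conversation for a scenario (one per architecture) and a judge that flags a transcript as a failure; `failure_rate`, `compare`, and all parameter names are hypothetical and not taken from the source.

```python
from typing import Callable, Dict, Sequence


def failure_rate(
    run_agent: Callable[[str], str],
    judge_failed: Callable[[str, str], bool],
    scenarios: Sequence[str],
) -> float:
    """Fraction of scenarios whose transcript the judge marks as a failure."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    failures = sum(1 for s in scenarios if judge_failed(s, run_agent(s)))
    return failures / len(scenarios)


def compare(
    agents: Dict[str, Callable[[str], str]],
    judge_failed: Callable[[str, str], bool],
    scenarios: Sequence[str],
) -> Dict[str, float]:
    """Run every architecture on the same scenarios with the same judge."""
    return {
        name: failure_rate(agent, judge_failed, scenarios)
        for name, agent in agents.items()
    }


if __name__ == "__main__":
    # Stub agents and judge so the sketch runs without any model or framework.
    scenarios = [f"scenario-{i}" for i in range(10)]
    prompt_only = lambda s: "resolved"
    orchestrated = lambda s: "stalled" if s.endswith(("0", "5")) else "resolved"
    judge = lambda scenario, transcript: transcript != "resolved"
    print(compare({"prompt_only": prompt_only, "orchestrated": orchestrated},
                  judge, scenarios))
```

Holding the model, the scenarios, and the judge fixed while swapping only the architecture mirrors the controlled setup described in the article.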

Source: In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks