Strip the Leakage, and the LLM Forecasting Edge Mostly Disappears
A 36-month leakage-controlled test shows a 7B RAG forecaster's median IC of +0.154 is largely explained by macro-analog retrieval, not LLM capability.
Most published LLM forecasting results carry a hidden flaw: the features fed to the model at prediction time include data that would not have existed when the decision actually had to be made. Strip that flaw out, and the picture changes.
A 36-month live test of a retrieval-augmented 7B open-source LLM, run from April 2023 through March 2026, applied strict decision-time constraints to equity style factor ranking. At each month-end, the model saw only what a real portfolio manager could have seen: lag-shifted FRED macro variables, recent macro-event summaries, and the Cleveland Fed's archived daily CPI nowcast for the not-yet-released current month. No future-dated features. No convenience labels. The pipeline works in three stages: a macro-analog retrieval module selects historical macro states that resemble the current one, a critic LLM compresses those analogs into a single tactical rule, and an actor LLM maps the current state plus recent rules into scores for seven U.S. equity style factors.
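The first stage, macro-analog retrieval, can be pictured as a nearest-neighbor lookup over past macro states. The sketch below is an illustration under stated assumptions, not the study's implementation: the function names, the z-score scaling, the Euclidean distance, and the toy feature values are all hypothetical.

```python
import math

def column_stats(rows):
    """Per-column mean and standard deviation across the given rows."""
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / len(rows)
        stds.append(math.sqrt(var) or 1.0)  # constant column: avoid dividing by zero
    return means, stds

def retrieve_analogs(history, current, k=3):
    """Return labels of the k past months closest to the current macro state.

    history: list of (month_label, feature_vector) for past months only.
    current: feature vector observed at the decision date.
    Assumption: distance is Euclidean on features z-scored with past data.
    """
    means, stds = column_stats([vec for _, vec in history])

    def zscore(vec):
        return [(v - m) / s for v, m, s in zip(vec, means, stds)]

    target = zscore(current)
    ranked = sorted((math.dist(zscore(vec), target), label) for label, vec in history)
    return [label for _, label in ranked[:k]]

# Toy, made-up macro states: [inflation %, unemployment %]
history = [
    ("2019-01", [2.0, 4.0]),
    ("2020-04", [0.3, 14.7]),
    ("2021-06", [5.4, 5.9]),
    ("2022-06", [9.1, 3.6]),
]
analogs = retrieve_analogs(history, current=[8.5, 3.7], k=2)
print(analogs)
```

Scaling statistics are computed from the historical rows only, so the lookup itself uses nothing dated after the decision month.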
The headline result is a median monthly Spearman rank IC of +0.154 across the full 36-month window, with positive mean IC in each of three non-overlapping 12-month subwindows. That looks like a real signal, but the bootstrap 95% confidence interval for the mean IC includes zero, so the result is statistically underpowered as a standalone claim. More telling: a non-LLM kNN macro-analog model, running under the identical decision-time constraint, recovers a comparable median IC. Real-time inflation data and macro-similar retrieval, not the language model, explain most of the median signal. The LLM pipeline does retain a higher mean IC and a stronger long-short allocation sanity check, which suggests that any marginal LLM contribution concentrates in the extreme rankings that actually drive long-short portfolio returns. For quant teams building or evaluating LLM forecasting systems, the takeaway is direct: if your benchmark does not enforce strict decision-time information constraints, the performance you measure does not reflect model capability.
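For readers who want to reproduce this kind of check, the two statistics can be sketched as follows. This is a minimal illustration, assuming a percentile bootstrap that resamples months independently; the summary does not state which bootstrap variant the study used, and all function names here are hypothetical.

```python
import random

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_ic(scores, realized):
    """Monthly rank IC: Spearman correlation of factor scores vs. realized returns."""
    return pearson(average_ranks(scores), average_ranks(realized))

def bootstrap_mean_ci(monthly_ics, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean IC (assumes months are exchangeable)."""
    rng = random.Random(seed)
    n = len(monthly_ics)
    means = sorted(sum(rng.choices(monthly_ics, k=n)) / n for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# One month, seven factors: made-up scores and made-up realized returns.
scores = [0.9, 0.1, 0.4, 0.7, 0.2, 0.5, 0.3]
realized = [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, 0.005]
print(spearman_ic(scores, realized))
```

With only seven factors per month, a single month's IC is noisy, which is why the interval on the mean across months matters: if `lo <= 0 <= hi`, the sample cannot rule out zero skill.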
We're thinking: We read this paper as a methodological indictment of most published LLM forecasting research. When you enforce honest decision-time constraints, a 7B open-source model's apparent edge collapses toward what a non-LLM kNN macro-analog baseline already delivers. That is not a minor caveat; it reframes the entire category. The implication for practitioners is uncomfortable: the LLM forecasting papers your team has been tracking as evidence of capability may be measuring leakage more than signal. The one place where the LLM shows a residual edge, extreme rank ordering for long-short formation, is also the hardest place to validate statistically with 36 months of data. That is worth naming clearly before anyone allocates capital to it.
Key takeaways:
- A three-stage RAG pipeline (macro-analog retrieval, critic LLM rule compression, actor LLM scoring) produces a median Spearman rank IC of +0.154 over 36 months under strict decision-time constraints, but a non-LLM kNN baseline recovers a comparable median, indicating that retrieval design drives the bulk of the measured signal.
- Mean IC across the full window is statistically underpowered with a 95% bootstrap CI that includes zero; the LLM's marginal advantage appears concentrated in extreme factor rankings, a claim that 36 months of data cannot yet confirm with confidence.
- Teams evaluating or building LLM-based forecasting pipelines should audit every feature in their benchmark for decision-time validity before reporting results: if features are aligned to the target's timestamp rather than to what was actually available at prediction time, the performance number is measuring leakage, not the model.
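The audit in the last takeaway can be sketched as a check on release dates. The example below is a hedged illustration: the `Observation` record, the series names, and every date and value are hypothetical and not taken from any actual release calendar.

```python
from datetime import date
from typing import NamedTuple

class Observation(NamedTuple):
    series: str
    reference_period: date  # the month the value describes
    release_date: date      # the day the value became public
    value: float

def leaked(observations, decision_date):
    """Observations that were not yet public on the decision date."""
    return [o for o in observations if o.release_date > decision_date]

def as_of(observations, series, decision_date):
    """Latest observation of `series` that was public on the decision date."""
    visible = [
        o for o in observations
        if o.series == series and o.release_date <= decision_date
    ]
    if not visible:
        return None
    return max(visible, key=lambda o: (o.reference_period, o.release_date))

# Illustrative records only.
observations = [
    Observation("cpi_yoy", date(2024, 2, 1), date(2024, 3, 12), 3.2),
    Observation("cpi_yoy", date(2024, 3, 1), date(2024, 4, 10), 3.5),
    Observation("cpi_nowcast", date(2024, 3, 1), date(2024, 3, 28), 3.4),
]
decision = date(2024, 3, 31)

print([o.series for o in leaked(observations, decision)])  # March CPI is not out yet
print(as_of(observations, "cpi_yoy", decision).value)      # falls back to February
```

A feature table keyed by `reference_period` alone would hand the model the March figure at the March month-end. Keying by `release_date` forces the fallback to the prior month, with the nowcast standing in for the unreleased value.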