Did the Bot Beat the S&P?

Yes — by 40 percentage points across three years of historical data. Here's how, and why I'm a little suspicious.

This is a follow-up to Designing a Bot to Beat the S&P.

Before you let a bot trade real money, you need to know if it works. That means backtesting — running the full pipeline against historical earnings data and seeing what would have happened.

I did that, and there were some interesting learnings along the way. But the headline finding was that on the trades the bot actually placed, it correctly predicted earnings beats 77% of the time — leading to a roughly 40 percentage point outperformance versus the S&P 500 over three years.

Let me take you through the journey of getting there.


The setup

Backtesting is hard. There are the usual things to control for — survivorship bias, point-in-time data, that kind of stuff. But the one that didn't immediately jump to mind: LLMs are trained on data. They might already know that NVIDIA had a smashing 2024. Asking the model to predict NVIDIA's 2024 earnings isn't exactly a fair test if it can recall what happened.

As academic research recommends, you should anonymise everything. The LLM never sees a ticker or company name during the backtest. Instead of "NVDA," it sees "COMPANY_4827" — a deterministic hash. The point is that it has to reason from the fundamentals alone.
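
For the curious, here's a minimal sketch of how that mapping could work. The COMPANY_4827-style label comes from above; the salt and hash scheme are my assumptions, and a real run would want a collision check across the universe:

```python
import hashlib

def anonymise(ticker: str, salt: str = "backtest-v1") -> str:
    # Deterministic: the same ticker always maps to the same pseudonym,
    # so the LLM sees a consistent label without ever seeing the name.
    digest = hashlib.sha256(f"{salt}:{ticker}".encode()).hexdigest()
    return f"COMPANY_{int(digest, 16) % 10_000:04d}"

print(anonymise("NVDA"))  # e.g. a COMPANY_4827-style label, stable across runs
```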

Interestingly, a Columbia study found that anonymising company names actually improved performance on data within the training window, suggesting that brand recognition distracts the model from the fundamentals more than memorisation helps it. The authors argue anonymisation is potentially useful even on out-of-sample data, since the distraction effect doesn't go away after the knowledge cutoff.

With that in place, here's what the setup looked like:

  • Period: Jan 2023 to Dec 2025 (3 years)
  • Universe: S&P 500 + NASDAQ 100 constituents at the time (not today's)
  • Simulation: walk-forward, day by day — no future information leaks in
  • Data the LLM sees for each candidate (all anonymised, all from external APIs; a sample payload is sketched after this list):
    • Consensus EPS and revenue estimates
    • Last 4 quarters of actual vs estimated EPS
    • Key financials: revenue, margins, FCF, debt/equity
    • Valuation: P/E ratio, Piotroski F-score, EPS growth YoY
    • Technical context: price, RSI, moving averages, relative strength vs sector
    • Current VIX
  • Data dropped for the backtest (hard to get point-in-time): news headlines, insider activity
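
To make that concrete, a per-candidate payload might look roughly like this. Field names and every number are illustrative; the list above specifies what data is included, not the actual schema:

```python
candidate = {
    "company": "COMPANY_4827",  # anonymised label, never the real ticker
    "consensus": {"eps_est": 0.82, "revenue_est_bn": 28.5},
    "history": [  # last 4 quarters, actual vs estimated EPS
        {"eps_actual": 0.80, "eps_est": 0.75},
        {"eps_actual": 0.66, "eps_est": 0.64},
        {"eps_actual": 0.52, "eps_est": 0.51},
        {"eps_actual": 0.41, "eps_est": 0.43},
    ],
    "financials": {"revenue_bn": 26.0, "gross_margin": 0.71,
                   "fcf_bn": 7.0, "debt_to_equity": 0.4},
    "valuation": {"pe": 38.0, "piotroski_f": 7, "eps_growth_yoy": 0.95},
    "technicals": {"price": 112.0, "rsi_14": 61.0,
                   "above_50dma": True, "rel_strength_vs_sector": 1.3},
    "vix": 14.2,
}
```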

The results

Of the 289 trades the bot actually executed, 77% correctly predicted a beat (223 out of 289).

Looking at the funnel: 8,034 earnings events made it through basic eligibility filters (market cap, volume, analyst coverage) and were sent to the LLM for analysis. The LLM made 2,477 buy recommendations across them, but only 289 actually got traded. The reason for that gap isn't that the risk gates were filtering for quality — it's that the bot was designed to hold only 8 positions at a time, and earnings tend to cluster (a few big weeks per quarter). Once those slots fill up, it's somewhat random which subsequent candidates make it in. There are smarter designs for handling that lumpiness; mine was rudimentary.
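
In pseudocode, that rudimentary slot-filling looks roughly like this (the 8-position cap is from the design above; the rest is my reconstruction, not the bot's actual code):

```python
MAX_POSITIONS = 8  # the bot's hard cap on concurrent positions

def fill_slots(todays_buy_recs: list, open_positions: list) -> list:
    # First-come-first-served: once the book is full, later candidates are
    # dropped regardless of quality, which is why the 289 executed trades
    # are an almost arbitrary subset of the 2,477 recommendations.
    free = max(MAX_POSITIONS - len(open_positions), 0)
    return todays_buy_recs[:free]
```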

So the 77% on the 289 trades is real but partly an artefact of the strategy's tight filtering. The truer number — what you'd see at scale with looser caps — is the 68% accuracy across all 2,477 buy recommendations. Either way, that's better than the median Wall Street analyst, in line with the Chicago Booth research that inspired the project.

Returns vs the S&P

A quick aside before the numbers: the Sharpe ratio measures return per unit of risk — higher is better, above 1.5 is good. Two strategies can produce the same total return but feel completely different — one might compound smoothly while another swings wildly. The first has a much higher Sharpe.
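
For reference, the standard computation looks like this. The 1.74 below may have been computed with different conventions (e.g. a non-zero risk-free rate); this is just the textbook definition:

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    # Annualised mean excess return divided by annualised volatility.
    excess = np.asarray(daily_returns) - risk_free_daily
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)
```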

Looking at just the pre-earnings trades — buying before the announcement, exiting on the day:

Metric         earnings_bot    SPY Buy & Hold
Total Return   +118.83%        +79.07%
Sharpe Ratio   1.74            ~0.95
Max Drawdown   12.29%          ~24%
Hit Rate       54.5%           n/a

So the bot beat the S&P by about 40 percentage points on raw returns, with a Sharpe nearly double — meaning the ride was much less bumpy.

Right call, lost money

You'll notice the hit rate (54.5%) is well below the directional accuracy (77%). Predicting the direction correctly and making money on the trade aren't the same thing. Of the 223 correct beat calls, 97 (43%) still lost money. The dominant story: 58 of those were correct beat predictions where the bot entered seven trading days before the announcement, the stock drifted down before the print and tripped the 5% stop loss, then the company beat — but we'd already been kicked out.
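
The mechanics, roughly (the seven-day entry and the 5% stop are from the trades described above; the function itself is an illustrative reconstruction):

```python
ENTRY_DAYS_BEFORE = 7  # trading days before the announcement
STOP_LOSS = 0.05       # exit if the position drops 5% from entry

def stopped_out_before_print(entry_price: float, daily_closes: list) -> bool:
    # True if any close between entry and the print breaches the stop,
    # i.e. the bot gets kicked out before the thesis can play out.
    return any(c <= entry_price * (1 - STOP_LOSS) for c in daily_closes)
```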

The 5% stop is doing its risk job — there were no catastrophic losses in the backtest — but it's surrendering alpha on trades where the thesis turned out to be right. The flip side: the stop is also why the average winner (+9.80%) is much larger than the average loser (-5.91%). Losses are capped tight, winners can run. That asymmetry is part of what drives the returns.
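
A quick back-of-envelope using the backtest's own numbers shows why that asymmetry works (ignoring costs, and assuming the averages apply uniformly):

```python
hit_rate, avg_win, avg_loss = 0.545, 0.0980, -0.0591
ev = hit_rate * avg_win + (1 - hit_rate) * avg_loss
print(f"{ev:+.2%}")  # ≈ +2.65% expected value per trade
```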

Unlocking more with drift

That's just the pre-earnings strategy. The bot also has a second tactic: post-earnings drift. The idea is that when a company actually beats by a meaningful amount, stocks tend to keep drifting in the direction of the surprise for weeks afterwards. This is one of the most studied anomalies in academic finance.

When the bot held positions for 15 days after a confirmed beat (surprise > 5%), the returns jumped:

Metric         Pre-Earnings Only   With 15-day Drift
Total Return   +118.83%            +130.59%
Sharpe Ratio   1.74                1.89
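
The rule itself is simple; something like this (the surprise threshold and hold period are from above, the names and day-index representation are mine):

```python
SURPRISE_THRESHOLD = 0.05  # only ride confirmed beats with surprise > 5%
DRIFT_HOLD_DAYS = 15

def planned_exit(announce_day: int, eps_actual: float, eps_estimate: float) -> int:
    # Days are trading-day indices into the simulation.
    surprise = (eps_actual - eps_estimate) / abs(eps_estimate)
    if surprise > SURPRISE_THRESHOLD:
        return announce_day + DRIFT_HOLD_DAYS  # hold through the drift
    return announce_day                        # default: exit on the day
```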

That said, I'm a bit suspicious of how much can be read into the 15-day window specifically. When I tested different hold periods, the curve wasn't smooth — 5-day and 10-day holds actually underperformed the no-drift case, and only the 15-day window peaked. With just 114 drift trades, that spikiness could easily be noise rather than a genuine 15-day sweet spot. The directional finding (drift exists, holding through it adds returns) is well-supported in academic literature. The exact 15-day optimum is much less robust. Any real conclusion would need a much larger sample.


Two small learnings

A couple of incidental findings worth noting.

The first is on confidence. As part of the prompt, the LLM had to spit out a confidence score alongside its prediction. I found this surprisingly useful — it gave me a knob to tune. In my case, 0.65 was the sweet spot for the threshold, but I wouldn't read too much into the specific number. The meta point is that forcing a model to declare confidence in its output is probably a good thing, not just for filtering, but for understanding when to trust it.
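
In practice the gate is a one-liner (the 0.65 is my tuned value from above; the output schema here is an assumption, since the actual prompt isn't shown):

```python
CONFIDENCE_THRESHOLD = 0.65

def should_trade(prediction: dict) -> bool:
    # Only act when the model's self-reported confidence clears the bar.
    return (prediction["call"] == "beat"
            and prediction["confidence"] >= CONFIDENCE_THRESHOLD)
```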

The second is on model choice. I ran the full backtest on both Claude Opus and Claude Sonnet, honestly expecting the results to be quite similar — Sonnet is roughly 5x cheaper per call, and this didn't feel like a task that needed the bigger model. I was wrong. Opus hit 68% directional accuracy, Sonnet hit 51%. That's a 17 percentage point gap. One caveat: the prompt was iterated on with Opus's help, so there might be an inherent bias toward what Opus likes to read. But still — the size of the difference surprised me.


Conclusion

This backtest validated the core hypothesis from the previous post: that LLMs aren't smarter than humans, but when fed up-to-date information and a robust framework for analysing it, they can predict earnings surprises pretty well — especially the larger frontier models.

The reality, though, is that this outperformance looks suspiciously good to me. My hypothesis is that even though companies are anonymised, the LLM might still recognise them from their sector, market cap, financial profile, and so on. There aren't that many companies in the world that earn as much money as NVIDIA. The only way to really prove the strategy works is to run it forward, on data the model genuinely hasn't seen. Which is what I'm currently doing.

Nevertheless, this was something of an adventure, and I learned a few things along the way. I hope you did too.

See you in the next one ✌️