AI Investing Test

ChatGPT Stock Picks vs. the S&P 500

Can you type "pick me 10 stocks" into an AI chatbot and beat the market? Dozens of experiments have tried it. Academic researchers have tested it. Someone even launched an autonomous GPT-4 trading agent with real money. This essay collects every credible test and tells you what actually happened.

Every Major Test, Ranked by Rigor

Not all AI stock-picking experiments are created equal. A TikToker asking ChatGPT for five stocks and checking back in a month is entertainment. A University of Chicago study testing GPT-4 across 150,000 firm-year observations with anonymized data is research. The results depend heavily on the methodology.

LLM Stock-Picking Experiments: Results Summary
This table presents six major academic and real-world tests of ChatGPT and other large language models for stock picking. The University of Chicago study achieved 10% annualized alpha using GPT-4 with chain-of-thought reasoning on 150,000 firm-year observations. MarketSenseAI returned 125.9% versus 73.5% for the S&P 100. Other experiments, including real trading benchmarks, showed mixed results, with most failing to beat buy-and-hold strategies once costs are included. The Motley Fool UK experiment returned 17.4% versus 13.0% for the S&P 500 over 9 months.
Selected experiments, ranked by methodology quality
Experiment | Model | Return | Benchmark | Alpha
UChicago (150K obs.) | GPT-4 CoT | +10% ann. | N/A* | +10% ann.
MarketSenseAI (multi-agent) | GPT-4o | +125.9% | +73.5% | +52.4%
ScienceDirect (anonymized) | Gemini 1.5 | Below index | S&P 500 | Negative
StockBench (real trading) | Multiple | Below B&H | Buy & hold | Negative
Finder.com (8 weeks) | GPT-3.5 | +4.9% | +3.0% | +1.9%
Motley Fool UK (9 months) | GPT-4 | +17.4% | +13.0% | +4.4%

The pattern is inconsistent. Some experiments show strong outperformance. Others show underperformance. The most rigorous studies tend to find that LLMs struggle to beat simple buy-and-hold strategies in real-world trading conditions, even when they show promise in backtests.

The University of Chicago study found GPT-4 generated 10% annualized alpha with a Sharpe ratio (a measure of return per unit of risk; above 1.0 is good, above 2.0 is excellent) of 3.36 using chain-of-thought reasoning on anonymized financial statements. That's a remarkable finding. It also hasn't been replicated in live trading with real money at scale.
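For reference, the Sharpe ratio divides average excess return by its volatility, annualized. A minimal sketch using made-up monthly returns (illustrative numbers only, not data from the study):

```python
import math

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from a list of periodic returns."""
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    mean = sum(excess) / len(excess)
    # Sample variance, then annualize mean/std by sqrt(periods)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)

# Illustrative monthly returns, invented for this example
monthly = [0.02, 0.01, 0.03, -0.01, 0.02, 0.015,
           0.025, 0.00, 0.02, 0.01, 0.03, 0.005]
sr = sharpe_ratio(monthly)
print(round(sr, 2))
```

A 3.36 Sharpe ratio means the strategy's average excess return was more than three times its volatility, which is why the unreplicated-in-live-trading caveat matters.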

AI Stock Pickers Love the Obvious Winners

When you ask ChatGPT, Claude, or Gemini to pick stocks, they tend to recommend the same companies: NVIDIA, Microsoft, Apple, Amazon, Alphabet. These are the most discussed companies in their training data. They're the companies with the most analyst coverage, the most earnings call transcripts, and the most bullish commentary.

In 2025, a Motley Fool writer compared ChatGPT's three picks (Microsoft, NVIDIA, Visa) against his own three picks (Amazon, Axon Enterprise, Uber). ChatGPT returned 17.4%. The human returned 23.1%. Both beat the S&P 500's 13%.

But two of ChatGPT's three picks, Microsoft and NVIDIA, were among the largest companies in the index by market cap. You don't need artificial intelligence to suggest buying the biggest, most successful companies on Earth. That's the equivalent of asking a language model for restaurant recommendations and getting back "Try the most popular restaurant in the city."

70%+
Estimated overlap between LLM "top picks" and the S&P 500's top 10 holdings
Most AI chatbot stock recommendations are mega-cap companies that already dominate the S&P 500. If your AI-selected portfolio is 70% the same as the index, you're paying attention costs for marginal differentiation. You could get most of the same exposure by buying VOO.
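That overlap figure is just set arithmetic. A minimal sketch, with both ticker lists invented for illustration (not an actual chatbot response or the index's actual current top 10):

```python
def overlap_pct(ai_picks, index_top_holdings):
    """Percentage of AI picks that are already top index holdings."""
    shared = set(ai_picks) & set(index_top_holdings)
    return 100 * len(shared) / len(ai_picks)

# Hypothetical lists for illustration only
ai_picks = ["NVDA", "MSFT", "AAPL", "AMZN", "GOOGL",
            "V", "UNH", "TSLA", "META", "JPM"]
index_top10 = ["AAPL", "MSFT", "NVDA", "AMZN", "META",
               "GOOGL", "BRK.B", "AVGO", "TSLA", "LLY"]
pct = overlap_pct(ai_picks, index_top10)
print(pct)  # 70.0
```

If the number that comes out of this calculation is high, the marginal value of the AI picks over an index fund is correspondingly low.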

The Tasks LLMs Are Actually Good At

01
Processing Earnings Calls at Scale
GPT-4 can read and summarize thousands of earnings call transcripts faster than any human analyst team. It can detect changes in management tone, spot hedging language, and flag guidance revisions across an entire sector in minutes. This is the University of Chicago study's insight: when you strip away company names and give GPT-4 raw financial data, it can identify patterns that predict future earnings changes.
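The anonymization step is the crux of that design: strip identifiers so the model can't lean on memorized facts about the firm. A regex-based sketch of the idea (this is an illustration of the control, not the study's actual preprocessing code):

```python
import re

def anonymize(statement_text, company_names):
    """Mask company names and calendar years so a model must reason
    from the numbers rather than recall memorized facts. Illustrative
    sketch of the anonymization control, not the study's pipeline."""
    out = statement_text
    for name in company_names:
        out = re.sub(re.escape(name), "[COMPANY]", out, flags=re.IGNORECASE)
    out = re.sub(r"\b(19|20)\d{2}\b", "[YEAR]", out)  # mask 1900-2099
    return out

text = "In 2023, Acme Corp revenue grew 12% while Acme's margins expanded."
masked = anonymize(text, ["Acme Corp", "Acme"])
print(masked)
```

Only after masking like this does a strong result say something about pattern recognition rather than training-data recall.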
02
Screening and Filtering
LLMs are excellent at reducing a universe of 4,000 stocks to a shortlist of 20 candidates based on specific criteria. Revenue growth above 25%, gross margins above 60%, forward P/E (price-to-earnings ratio on next year's expected earnings; lower means cheaper) below 30, analyst consensus rating above 4.0. The screening itself isn't the edge. The edge is that LLMs can explain why each stock passed the screen and what the risks are, saving hours of manual research.
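The criteria above translate directly into a mechanical screen. A minimal sketch, with the fundamentals data invented for illustration:

```python
# Screening rules mirroring the criteria in the text
CRITERIA = {
    "revenue_growth": lambda s: s["revenue_growth"] > 0.25,
    "gross_margin":   lambda s: s["gross_margin"] > 0.60,
    "forward_pe":     lambda s: s["forward_pe"] < 30,
    "analyst_rating": lambda s: s["analyst_rating"] > 4.0,
}

def screen(universe):
    """Split a universe into tickers that pass every rule and
    tickers that fail, with the names of the rules they missed."""
    passed, failed = [], {}
    for ticker, stats in universe.items():
        misses = [name for name, rule in CRITERIA.items() if not rule(stats)]
        if misses:
            failed[ticker] = misses
        else:
            passed.append(ticker)
    return passed, failed

# Hypothetical fundamentals, invented for this example
universe = {
    "AAA": {"revenue_growth": 0.30, "gross_margin": 0.72,
            "forward_pe": 24, "analyst_rating": 4.3},
    "BBB": {"revenue_growth": 0.10, "gross_margin": 0.65,
            "forward_pe": 18, "analyst_rating": 4.1},
}
passed, failed = screen(universe)
print(passed)        # ['AAA']
print(failed["BBB"]) # ['revenue_growth']
```

The filter itself is trivial; the LLM's contribution is the narrative layer on top of the `failed` dict, explaining what each miss means for the thesis.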
03
Multi-Agent Research Systems
The MarketSenseAI experiment used five specialized AI agents (news, fundamentals, momentum, macro, and signal generation) working together. This multi-agent approach returned 125.9% versus 73.5% for the S&P 100 over 2023-2024. The key was that each agent handled a different dimension of analysis. No single chatbot prompt could replicate this. It required a system, not a conversation.
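The orchestration pattern can be sketched as a weighted vote over per-agent signals. This is a toy illustration of the multi-agent idea, with stubbed scores and invented weights, not MarketSenseAI's actual architecture:

```python
# Hypothetical weights; each agent would be its own LLM pipeline in practice
AGENT_WEIGHTS = {"news": 0.2, "fundamentals": 0.3, "momentum": 0.2,
                 "macro": 0.1, "signal": 0.2}

def combine_signals(agent_scores, weights=AGENT_WEIGHTS):
    """Weighted vote over per-agent scores in [-1, 1].
    Returns a trade decision and the combined score."""
    score = sum(weights[a] * agent_scores[a] for a in weights)
    if score > 0.25:
        return "buy", score
    if score < -0.25:
        return "sell", score
    return "hold", score

# Stubbed agent outputs for one ticker (invented values)
scores = {"news": 0.6, "fundamentals": 0.8, "momentum": 0.4,
          "macro": -0.2, "signal": 0.5}
decision, combined = combine_signals(scores)
print(decision)  # buy
```

The point of the pattern is that a single disagreeing agent (here, macro at -0.2) tempers the combined score instead of being drowned out inside one long prompt.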

The Tasks LLMs Get Wrong

01
Regime Changes
LLMs are trained on historical data. When the market regime shifts, their predictions break. The 2022 rate-hiking cycle caught every model off guard because the training data was dominated by a decade of low rates. An LLM can't predict the next pandemic, the next banking crisis, or the next Fed pivot. It can only extrapolate from what it's already seen.
02
Live Trading Execution
The StockBench real-world trading benchmark tested GPT-5, Claude 4, Gemini, and other frontier models in live market conditions. Most failed to beat a simple buy-and-hold strategy. The gap between backtest performance and live performance was significant. Slippage, transaction costs, and market impact all eroded returns that looked strong on paper. Knowing what to buy is only half the problem. Knowing when to buy and sell, in real time, with real money at stake, is where models break down.
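The cost drag is easy to quantify. A minimal sketch of how trading frequency erodes a gross edge, with all cost assumptions invented for illustration (not figures from StockBench):

```python
def net_return(gross_return, trades_per_year,
               cost_per_trade=0.001, slippage=0.0005):
    """Gross annual return minus a flat per-trade cost and slippage.
    The 0.10%/0.05% cost assumptions are illustrative, not measured."""
    drag = trades_per_year * (cost_per_trade + slippage)
    return gross_return - drag

# A strategy that beats the market gross can lose to buy-and-hold net:
# 12% gross, trading weekly, vs. roughly 10% for doing nothing
net = net_return(0.12, trades_per_year=52)
print(round(net, 3))  # about 0.042
```

Under these assumptions, a 2-point gross edge becomes a 6-point deficit after 52 round trips, which is the backtest-to-live gap in miniature.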
03
The Consensus Trap
If millions of investors all use the same AI models for stock picks, the models' recommendations become consensus. And consensus trades have zero alpha by definition. A University of Florida finance professor predicted that within five years of widespread AI adoption, the return predictability from LLM-based strategies would drop to zero. The edge disappears as more people find it.

How to Use AI for Stock Research Without Fooling Yourself

Use AI For

Screening stocks by quantitative criteria. Summarizing earnings calls and 10-K filings. Comparing companies within a sector on specific financial metrics. Generating investment theses you then verify with primary sources. Building multi-factor scoring models. Identifying data patterns in large datasets. AI works like a research analyst who never sleeps and reads everything but has no original insight about the future.
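The "multi-factor scoring models" item above is a good fit for code assistance. A minimal sketch of a z-score composite, with all metric values invented (the weights and factor names are assumptions, not a recommended model):

```python
import statistics

def zscores(values):
    """Standardize a list of values to mean 0, stdev 1."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def composite_scores(metrics, weights):
    """Weighted sum of per-metric z-scores; higher = more attractive.
    Metrics where lower is better (e.g. valuation) should be negated
    before they are passed in."""
    names = list(metrics)
    z = {m: zscores(metrics[m]) for m in names}
    n = len(next(iter(metrics.values())))
    return [sum(weights[m] * z[m][i] for m in names) for i in range(n)]

# Three hypothetical stocks scored on growth and (negated) forward P/E
metrics = {"growth": [0.30, 0.10, 0.20], "neg_pe": [-22, -35, -18]}
scores = composite_scores(metrics, {"growth": 0.6, "neg_pe": 0.4})
```

This is exactly the kind of mechanical scaffolding an LLM can generate and explain quickly; the judgment about which factors and weights to trust stays with the investor.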

Don't Use AI For

Final buy/sell decisions. Timing entries and exits. Predicting macro events or regime changes. Replacing your own judgment on position sizing and risk management. Following "top 10 stock picks" from any chatbot without doing your own analysis. The most common mistake: treating an LLM's confident tone as evidence of accuracy. These models sound certain even when they're guessing.

The best use of AI in investing isn't picking stocks. It's processing information faster. The decision still needs to be yours. Any system that removes you from the decision-making loop will eventually fail at the worst possible time.
Research, Yes. Trade, No.
LLMs can read every earnings call in the S&P 500 before breakfast. They can screen thousands of stocks in seconds. They can summarize a 200-page annual report in a paragraph. Use them for that. Don't hand them your portfolio.

How I Built This

Analysis based on published academic research, public experiment trackers, financial media reporting, and ETF performance data as of early 2026.

University of Chicago Study
Kim et al. (2024), GPT-4 chain-of-thought on 150,000 firm-year observations
The 10% annualized alpha and 3.36 Sharpe ratio come from the University of Chicago working paper by Kim, Muhn, and Nikolaev. They tested GPT-4 with anonymized financial statement data, removing company names and dates to prevent look-ahead bias. The model used chain-of-thought prompting to predict earnings direction. This is a backtest, not live trading.
MarketSenseAI
Multi-agent GPT-4o system, S&P 100, 2023-2024
The 125.9% cumulative return versus 73.5% for the S&P 100 comes from the MarketSenseAI paper. The system used five specialized agents (news, fundamentals, dynamics, macro, signal). This was a simulated portfolio, not a live fund. Transaction costs and slippage were partially but not fully modeled.
StockBench
Real-world trading benchmark with frontier models
The finding that most LLM agents fail to beat buy-and-hold comes from the StockBench paper (2025), which tested GPT-5, Claude 4, Qwen3, and other models in live market conditions. The key finding was that performance on financial QA benchmarks did not predict trading success.
Finder.com and Motley Fool UK Experiments
Public experiments with published results
The Finder.com experiment (GPT-3.5, 8 weeks, +4.9% vs. +3.0% S&P) and the Motley Fool UK experiment (GPT-4, 9 months, +17.4% vs. +13.0% S&P) are anecdotal experiments reported by financial media. They used small portfolios over short time periods. These results are interesting but not statistically significant and should not be extrapolated to predict long-term performance.
Jesse Walker
Jesse Walker has been an individual investor for 30 years. Before that, he was a poker professional, which is where he learned that the best decision and the best outcome aren't always the same thing. He writes about investing through the uncertainty of AI.

Nothing on this site constitutes investment advice. All content is for informational purposes only.