AI Investing Test

ChatGPT Stock Picks vs. the S&P 500

Can you type "pick me 10 stocks" into an AI chatbot and beat the market? Dozens of experiments have tried it. Academic researchers have tested it. Someone even launched an autonomous GPT-4 trading agent with real money. This essay collects every credible test and tells you what actually happened.

Every Major Test, Ranked by Rigor

Not all AI stock-picking experiments are created equal. A TikToker asking ChatGPT for five stocks and checking back in a month is entertainment. A University of Chicago study testing GPT-4 across 150,000 firm-year observations with anonymized data is research. The results depend heavily on the methodology.

LLM Stock-Picking Experiments: Results Summary
This table presents six major academic and real-world tests of ChatGPT and other large language models for stock picking. The University of Chicago study achieved 10% annualized alpha using GPT-4 with chain-of-thought reasoning on 150,000 firm-year observations. MarketSenseAI returned 125.9% versus 73.5% for the S&P 100. Other experiments, including real trading benchmarks, showed mixed results, with most failing to beat buy-and-hold strategies once costs are included. The Motley Fool UK experiment returned 17.4% versus 13.0% for the S&P 500 over 9 months.
Selected experiments, ranked by methodology quality
Experiment | Model | Return | Benchmark | Alpha
UChicago (150K obs.) | GPT-4 CoT | +10% ann. | N/A* | +10% ann.
MarketSenseAI (multi-agent) | GPT-4o | +125.9% | +73.5% | +52.4%
ScienceDirect (anonymized) | Gemini 1.5 | Below index | S&P 500 | Negative
StockBench (real trading) | Multiple | Below B&H | Buy & hold | Negative
Finder.com (8 weeks) | GPT-3.5 | +4.9% | +3.0% | +1.9%
Motley Fool UK (9 months) | GPT-4 | +17.4% | +13.0% | +4.4%

The pattern is inconsistent. Some experiments show strong outperformance. Others show underperformance. The most rigorous studies tend to find that LLMs struggle to beat simple buy-and-hold strategies in real-world trading conditions, even when they show promise in backtests.

The University of Chicago study found GPT-4 generated 10% annualized alpha with a Sharpe ratio (a measure of return per unit of risk; above 1.0 is good, above 2.0 is excellent) of 3.36 using chain-of-thought reasoning on anonymized financial statements. That's a remarkable finding. It also hasn't been replicated in live trading with real money at scale.
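For reference, the Sharpe ratio divides average excess return by its volatility, annualized. A minimal sketch using made-up monthly returns (illustrative numbers only, not data from the study):

```python
import math

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from a list of periodic returns."""
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    mean = sum(excess) / len(excess)
    # Sample variance, then annualize mean/std by sqrt(periods)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)

# Illustrative monthly returns, invented for this example
monthly = [0.02, 0.01, 0.03, -0.01, 0.02, 0.015,
           0.025, 0.00, 0.02, 0.01, 0.03, 0.005]
sr = sharpe_ratio(monthly)
print(round(sr, 2))
```

A 3.36 Sharpe ratio means the strategy's average excess return was more than three times its volatility, which is why the unreplicated-in-live-trading caveat matters.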

AI Stock Pickers Love the Obvious Winners

When you ask ChatGPT, Claude, or Gemini to pick stocks, they tend to recommend the same companies: NVIDIA, Microsoft, Apple, Amazon, Alphabet. These are the most discussed companies in their training data. They're the companies with the most analyst coverage, the most earnings call transcripts, and the most bullish commentary.

In 2025, a Motley Fool writer compared ChatGPT's three picks (Microsoft, NVIDIA, Visa) against his own three picks (Amazon, Axon Enterprise, Uber). ChatGPT returned 17.4%. The human returned 23.1%. Both beat the S&P 500's 13%.

But two of ChatGPT's three picks, Microsoft and NVIDIA, were among the largest companies in the index by market cap. You don't need artificial intelligence to suggest buying the biggest, most successful companies on Earth. That's the equivalent of asking a language model for restaurant recommendations and getting back "Try the most popular restaurant in the city."

70%+
Estimated overlap between LLM "top picks" and the S&P 500's top 10 holdings
Most AI chatbot stock recommendations are mega-cap companies that already dominate the S&P 500. If your AI-selected portfolio is 70% the same as the index, you're paying attention costs for marginal differentiation. You could get most of the same exposure by buying VOO.
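That overlap figure is just set arithmetic. A minimal sketch, with both ticker lists invented for illustration (not an actual chatbot response or the index's actual current top 10):

```python
def overlap_pct(ai_picks, index_top_holdings):
    """Percentage of AI picks that are already top index holdings."""
    shared = set(ai_picks) & set(index_top_holdings)
    return 100 * len(shared) / len(ai_picks)

# Hypothetical lists for illustration only
ai_picks = ["NVDA", "MSFT", "AAPL", "AMZN", "GOOGL",
            "V", "UNH", "TSLA", "META", "JPM"]
index_top10 = ["AAPL", "MSFT", "NVDA", "AMZN", "META",
               "GOOGL", "BRK.B", "AVGO", "TSLA", "LLY"]
pct = overlap_pct(ai_picks, index_top10)
print(pct)  # 70.0
```

If the number that comes out of this calculation is high, the marginal value of the AI picks over an index fund is correspondingly low.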

The Tasks LLMs Are Actually Good At

01
Processing Earnings Calls at Scale
GPT-4 can read and summarize thousands of earnings call transcripts faster than any human analyst team. It can detect changes in management tone, spot hedging language, and flag guidance revisions across an entire sector in minutes. This is the University of Chicago study's insight: when you strip away company names and give GPT-4 raw financial data, it can identify patterns that predict future earnings changes.
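The anonymization step is the crux of that design: strip identifiers so the model can't lean on memorized facts about the firm. A regex-based sketch of the idea (this is an illustration of the control, not the study's actual preprocessing code):

```python
import re

def anonymize(statement_text, company_names):
    """Mask company names and calendar years so a model must reason
    from the numbers rather than recall memorized facts. Illustrative
    sketch of the anonymization control, not the study's pipeline."""
    out = statement_text
    for name in company_names:
        out = re.sub(re.escape(name), "[COMPANY]", out, flags=re.IGNORECASE)
    out = re.sub(r"\b(19|20)\d{2}\b", "[YEAR]", out)  # mask 1900-2099
    return out

text = "In 2023, Acme Corp revenue grew 12% while Acme's margins expanded."
masked = anonymize(text, ["Acme Corp", "Acme"])
print(masked)
```

Only after masking like this does a strong result say something about pattern recognition rather than training-data recall.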
02
Screening and Filtering
LLMs are excellent at reducing a universe of 4,000 stocks to a shortlist of 20 candidates based on specific criteria. Revenue growth above 25%, gross margins above 60%, forward P/E (price-to-earnings ratio on next year's expected earnings; lower means cheaper) below 30, analyst consensus rating above 4.0. The screening itself isn't the edge. The edge is that LLMs can explain why each stock passed the screen and what the risks are, saving hours of manual research.
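The criteria above translate directly into a mechanical screen. A minimal sketch, with the fundamentals data invented for illustration:

```python
# Screening rules mirroring the criteria in the text
CRITERIA = {
    "revenue_growth": lambda s: s["revenue_growth"] > 0.25,
    "gross_margin":   lambda s: s["gross_margin"] > 0.60,
    "forward_pe":     lambda s: s["forward_pe"] < 30,
    "analyst_rating": lambda s: s["analyst_rating"] > 4.0,
}

def screen(universe):
    """Split a universe into tickers that pass every rule and
    tickers that fail, with the names of the rules they missed."""
    passed, failed = [], {}
    for ticker, stats in universe.items():
        misses = [name for name, rule in CRITERIA.items() if not rule(stats)]
        if misses:
            failed[ticker] = misses
        else:
            passed.append(ticker)
    return passed, failed

# Hypothetical fundamentals, invented for this example
universe = {
    "AAA": {"revenue_growth": 0.30, "gross_margin": 0.72,
            "forward_pe": 24, "analyst_rating": 4.3},
    "BBB": {"revenue_growth": 0.10, "gross_margin": 0.65,
            "forward_pe": 18, "analyst_rating": 4.1},
}
passed, failed = screen(universe)
print(passed)        # ['AAA']
print(failed["BBB"]) # ['revenue_growth']
```

The filter itself is trivial; the LLM's contribution is the narrative layer on top of the `failed` dict, explaining what each miss means for the thesis.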
03
Multi-Agent Research Systems
The MarketSenseAI experiment used five specialized AI agents (news, fundamentals, momentum, macro, and signal generation) working together. This multi-agent approach returned 125.9% versus 73.5% for the S&P 100 over 2023-2024. The key was that each agent handled a different dimension of analysis. No single chatbot prompt could replicate this. It required a system, not a conversation.
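The orchestration pattern can be sketched as a weighted vote over per-agent signals. This is a toy illustration of the multi-agent idea, with stubbed scores and invented weights, not MarketSenseAI's actual architecture:

```python
# Hypothetical weights; each agent would be its own LLM pipeline in practice
AGENT_WEIGHTS = {"news": 0.2, "fundamentals": 0.3, "momentum": 0.2,
                 "macro": 0.1, "signal": 0.2}

def combine_signals(agent_scores, weights=AGENT_WEIGHTS):
    """Weighted vote over per-agent scores in [-1, 1].
    Returns a trade decision and the combined score."""
    score = sum(weights[a] * agent_scores[a] for a in weights)
    if score > 0.25:
        return "buy", score
    if score < -0.25:
        return "sell", score
    return "hold", score

# Stubbed agent outputs for one ticker (invented values)
scores = {"news": 0.6, "fundamentals": 0.8, "momentum": 0.4,
          "macro": -0.2, "signal": 0.5}
decision, combined = combine_signals(scores)
print(decision)  # buy
```

The point of the pattern is that a single disagreeing agent (here, macro at -0.2) tempers the combined score instead of being drowned out inside one long prompt.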

The Tasks LLMs Get Wrong

01
Regime Changes
LLMs are trained on historical data. When the market regime shifts, their predictions break. The 2022 rate-hiking cycle caught every model off guard because the training data was dominated by a decade of low rates. An LLM can't predict the next pandemic, the next banking crisis, or the next Fed pivot. It can only extrapolate from what it's already seen.
02
Live Trading Execution
The StockBench real-world trading benchmark tested GPT-5, Claude 4, Gemini, and other frontier models in live market conditions. Most failed to beat a simple buy-and-hold strategy. The gap between backtest performance and live performance was significant. Slippage, transaction costs, and market impact all eroded returns that looked strong on paper. Knowing what to buy is only half the problem. Knowing when to buy and sell, in real time, with real money at stake, is where models break down.
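The cost drag is easy to quantify. A minimal sketch of how trading frequency erodes a gross edge, with all cost assumptions invented for illustration (not figures from StockBench):

```python
def net_return(gross_return, trades_per_year,
               cost_per_trade=0.001, slippage=0.0005):
    """Gross annual return minus a flat per-trade cost and slippage.
    The 0.10%/0.05% cost assumptions are illustrative, not measured."""
    drag = trades_per_year * (cost_per_trade + slippage)
    return gross_return - drag

# A strategy that beats the market gross can lose to buy-and-hold net:
# 12% gross, trading weekly, vs. roughly 10% for doing nothing
net = net_return(0.12, trades_per_year=52)
print(round(net, 3))  # about 0.042
```

Under these assumptions, a 2-point gross edge becomes a 6-point deficit after 52 round trips, which is the backtest-to-live gap in miniature.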
03
The Consensus Trap
If millions of investors all use the same AI models for stock picks, the models' recommendations become consensus. And consensus trades have zero alpha by definition. A University of Florida finance professor predicted that within five years of widespread AI adoption, the return predictability from LLM-based strategies would drop to zero. The edge disappears as more people find it.

How to Use AI for Stock Research Without Fooling Yourself

Use AI For

Screening stocks by quantitative criteria. Summarizing earnings calls and 10-K filings. Comparing companies within a sector on specific financial metrics. Generating investment theses you then verify with primary sources. Building multi-factor scoring models. Identifying data patterns in large datasets. AI works like a research analyst who never sleeps and reads everything but has no original insight about the future.
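The "multi-factor scoring models" item above is a good fit for code assistance. A minimal sketch of a z-score composite, with all metric values invented (the weights and factor names are assumptions, not a recommended model):

```python
import statistics

def zscores(values):
    """Standardize a list of values to mean 0, stdev 1."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def composite_scores(metrics, weights):
    """Weighted sum of per-metric z-scores; higher = more attractive.
    Metrics where lower is better (e.g. valuation) should be negated
    before they are passed in."""
    names = list(metrics)
    z = {m: zscores(metrics[m]) for m in names}
    n = len(next(iter(metrics.values())))
    return [sum(weights[m] * z[m][i] for m in names) for i in range(n)]

# Three hypothetical stocks scored on growth and (negated) forward P/E
metrics = {"growth": [0.30, 0.10, 0.20], "neg_pe": [-22, -35, -18]}
scores = composite_scores(metrics, {"growth": 0.6, "neg_pe": 0.4})
```

This is exactly the kind of mechanical scaffolding an LLM can generate and explain quickly; the judgment about which factors and weights to trust stays with the investor.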

Don't Use AI For

Final buy/sell decisions. Timing entries and exits. Predicting macro events or regime changes. Replacing your own judgment on position sizing and risk management. Following "top 10 stock picks" from any chatbot without doing your own analysis. The most common mistake: treating an LLM's confident tone as evidence of accuracy. These models sound certain even when they're guessing.

The best use of AI in investing isn't picking stocks. It's processing information faster. The decision still needs to be yours. Any system that removes you from the decision-making loop will eventually fail at the worst possible time.
Research, Yes. Trade, No.
LLMs can read every earnings call in the S&P 500 before breakfast. They can screen thousands of stocks in seconds. They can summarize a 200-page annual report in a paragraph. Use them for that. Don't hand them your portfolio.

How I Built This

Analysis based on published academic research, public experiment trackers, financial media reporting, and ETF performance data as of early 2026.

University of Chicago Study
Kim et al. (2024), GPT-4 chain-of-thought on 150,000 firm-year observations
The 10% annualized alpha and 3.36 Sharpe ratio come from the University of Chicago working paper by Kim, Muhn, and Nikolaev. They tested GPT-4 with anonymized financial statement data, removing company names and dates to prevent look-ahead bias. The model used chain-of-thought prompting to predict earnings direction. This is a backtest, not live trading.
MarketSenseAI
Multi-agent GPT-4o system, S&P 100, 2023-2024
The 125.9% cumulative return versus 73.5% for the S&P 100 comes from the MarketSenseAI paper. The system used five specialized agents (news, fundamentals, dynamics, macro, signal). This was a simulated portfolio, not a live fund. Transaction costs and slippage were partially but not fully modeled.
StockBench
Real-world trading benchmark with frontier models
The finding that most LLM agents fail to beat buy-and-hold comes from the StockBench paper (2025), which tested GPT-5, Claude 4, Qwen3, and other models in live market conditions. The key finding was that performance on financial QA benchmarks did not predict trading success.
Finder.com and Motley Fool UK Experiments
Public experiments with published results
The Finder.com experiment (GPT-3.5, 8 weeks, +4.9% vs. +3.0% S&P) and the Motley Fool UK experiment (GPT-4, 9 months, +17.4% vs. +13.0% S&P) are anecdotal experiments reported by financial media. They used small portfolios over short time periods. These results are interesting but not statistically significant and should not be extrapolated to predict long-term performance.
Jesse Walker
Jesse Walker has been an individual investor for 30 years. Before that, he was a poker professional, which is where he learned that the best decision and the best outcome aren't always the same thing. He writes about investing through the uncertainty of AI.

Nothing on this site constitutes investment advice. All content is for informational purposes only.