“Every frontier model we evaluated lost money over the season and many experienced ruin,” the authors of the paper concluded, with the AI “systematically underperforming humans” in this scenario.
| AI Model | Mean ROI | Best try | Worst try | Mean final bankroll |
|---|---|---|---|---|
| Anthropic Claude Opus 4.6 | –11.0% | –0.2% | –18.8% | £89,035 |
| OpenAI GPT-5.4 | –13.6% | –4.1% | –31.6% | £86,365 |
| Google Gemini 3.1 Pro | –43.3% | +33.7% | –100.0% | £56,715 |
| Google Gemini Flash 3.1 LP | –58.4% | +24.7% | –100.0% | £41,605 |
| Z.AI GLM-5 | –58.8% | –14.3% | –100.0% | £41,221 |
| Moonshot Kimi K2.5 | –68.3% | –27.0% | –100.0% | £7,420 |
| xAI Grok 4.20 | –100.0% | –100.0% | –100.0% | £0 |
→ Continue reading at Ars Technica
