Measured sycophancy rates on the BrokenMath benchmark. Lower is better. Credit: Petrov et al.
GPT-5 also showed the best “utility” of the tested models, solving 58 percent of the original problems despite the errors introduced into the modified theorems. Across the board, however, the researchers found that LLMs became more sycophantic when the original problem was harder to solve.
While hallucinating proofs for false theorems is obviously a big problem, the researchers…
→ Continue reading at Ars Technica
