The incident, which unfolded in mid-October, began with an effusive post by Sébastien Bubeck, a research scientist at OpenAI. Bubeck announced that two mathematicians had purportedly used GPT-5 to find solutions to 10 unsolved problems in mathematics, triumphantly declaring, "Science acceleration via AI has officially begun." This bold proclamation immediately drew the ire of Thomas Bloom, a mathematician at the University of Manchester and the creator of erdosproblems.com, a website meticulously tracking the status of over 1,100 problems posed by the prolific 20th-century mathematician Paul Erdős. Bloom, who maintains that around 430 of these problems have documented solutions, swiftly countered Bubeck’s claim on X, stating, "This is a dramatic misrepresentation."

Bloom’s critique illuminated a critical flaw in Bubeck’s celebratory announcement. He explained that his website’s lack of a listed solution did not definitively mean a problem was unsolved; it simply indicated his own lack of awareness of any published answer. Given the sheer volume of mathematical literature, it is a Herculean task for any one person to keep track of every published proof. An AI like GPT-5, however, with its capacity to scour the vast expanse of the internet, might well turn up solutions that a single human curator has missed. As it transpired, GPT-5 had not independently discovered new solutions to 10 previously unsolved Erdős problems. Instead, it had, in essence, performed an advanced literature search, unearthing 10 existing solutions that Bloom had not yet encountered. The AI’s actual capability, while still impressive in its own right, was far less revolutionary than initially portrayed.

This "Erdős gotcha," as it’s been termed, offers two significant takeaways regarding the current AI landscape. Firstly, it underscores the detrimental impact of social media on the responsible dissemination of scientific news. The urge for immediate gratification and the pursuit of viral attention often lead to premature and hyperbolic announcements, bypassing the crucial stages of verification and sober analysis. A more measured approach, involving less knee-jerk reaction and more critical gut-checking, is sorely needed. Secondly, and perhaps more subtly, the incident highlights the genuinely remarkable, yet often overshadowed, capabilities of AI like GPT-5. Its ability to efficiently locate obscure or overlooked research papers is a powerful tool that could revolutionize how mathematicians conduct their literature reviews and identify potential avenues for further research. However, this more prosaic but still valuable application was lost in the cacophony of hype.

François Charton, a research scientist specializing in the application of large language models (LLMs) to mathematics at the AI startup Axiom Math, corroborated this sentiment. He explained that mathematicians are indeed keenly interested in leveraging LLMs to sift through immense bodies of existing research. However, the allure of groundbreaking discovery, the kind that ignites the imaginations of AI’s fervent social media boosters, often eclipses the less glamorous but equally important task of literature synthesis. Bubeck’s misstep, unfortunately, is not an isolated incident.

Another instance occurred in August, when a pair of mathematicians demonstrated that no LLM at the time could solve a specific mathematical puzzle known as Yu Tsumura’s 554th Problem. Merely two months later, social media buzzed with claims that GPT-5 had conquered this challenge. One observer, drawing parallels to the historic match between Go champion Lee Sedol and DeepMind’s AlphaGo, commented, "Lee Sedol moment is coming for many." However, Charton, offering a more grounded perspective, pointed out that solving Yu Tsumura’s 554th Problem is hardly a monumental feat for the mathematical community. "It’s a question you would give an undergrad," he remarked, lamenting the prevailing "tendency to overdo everything."

Concurrently, more sober and empirical assessments of LLMs’ capabilities are emerging. While mathematicians were engaged in online debates about GPT-5, two significant studies were published examining the use of LLMs in medicine and law – fields where AI developers have frequently claimed their technology excels. In medicine, researchers found that while LLMs could assist with certain diagnoses, they were demonstrably flawed in recommending treatments. Similarly, in the legal domain, studies revealed that LLMs often provided inconsistent and incorrect advice. The authors of one such study concluded, "Evidence thus far spectacularly fails to meet the burden of proof."

However, such cautious and evidence-based findings struggle to gain traction on platforms like X. "You’ve got that excitement because everybody is communicating like crazy—nobody wants to be left behind," Charton explained, encapsulating the FOMO (fear of missing out) driving the rapid-fire exchange of information. X has become the primary conduit for much of the AI news, the launchpad for trumpeting new results, and the public arena where influential figures like Sam Altman, Yann LeCun, and Gary Marcus engage in spirited debates. The sheer velocity of information makes it challenging to keep pace, and even more difficult to disengage.

Bubeck’s post was only embarrassing because his error was identified and corrected. The unfortunate reality is that not all inaccuracies are so readily exposed. Unless a significant shift occurs in how AI advancements are communicated and verified, researchers, investors, and casual enthusiasts will continue to fuel each other’s exaggerated optimism. As Charton observed, many of these individuals, whether scientists or not, are united by a shared passion for technology. "Huge claims work very well on these networks," he stated, underscoring the effectiveness of sensationalism in capturing attention.

There’s a coda to this unfolding narrative. The content you’re reading was originally prepared for the "Algorithm" column in the January/February 2026 issue of MIT Technology Review magazine. Mere days after the magazine went to print, Axiom announced that its own mathematical model, AxiomProver, had achieved a significant milestone: it had solved two previously open Erdős problems (specifically, #124 and #481). This accomplishment is particularly noteworthy for a startup founded only a few months prior, demonstrating the rapid pace of AI development.

The news didn’t stop there. Five days later, Axiom reported that AxiomProver had successfully solved nine out of twelve problems from this year’s Putnam competition, a prestigious undergraduate mathematics challenge often considered more difficult than the International Mathematical Olympiad. Notably, LLMs from both Google DeepMind and OpenAI had reportedly aced the latter competition a few months prior. The Putnam achievement was met with widespread acclaim on X, garnering endorsements from prominent figures such as Jeff Dean, chief scientist at Google DeepMind, and Thomas Wolf, co-founder of the AI firm Hugging Face.

However, the familiar debates quickly resurfaced in the replies. Some researchers pointed out a crucial distinction: while the International Mathematical Olympiad emphasizes creative problem-solving, the Putnam competition primarily tests mathematical knowledge. This distinction, they argued, makes the Putnam notoriously challenging for human undergraduates but theoretically more accessible for LLMs that have been trained on vast datasets of internet-sourced information.

Ultimately, the true measure of Axiom’s achievements, and indeed any AI’s performance in complex domains like mathematics, should not be determined on social media. The eye-catching competition wins represent a starting point, not a definitive conclusion. A deeper examination into the precise mechanisms by which these LLMs solve challenging mathematical problems is essential to accurately gauge their true capabilities and limitations. The conversation must move beyond superficial pronouncements and delve into the intricate workings of the technology itself.