By Edd Gent
Since OpenAI’s launch of ChatGPT in 2022, AI companies have been locked in a race to build increasingly gigantic models, pouring huge sums into building data centers along the way. But toward the end of last year, there were rumblings that the benefits of model scaling were hitting a wall. The underwhelming performance of OpenAI’s largest-ever model, GPT-4.5, gave further weight to the idea.
This has prompted a shift in focus toward making machines “think” more like humans: rather than building ever-larger models, researchers are now giving them more time to work through problems. In 2022, a team at Google introduced the chain of thought (CoT) technique, in which large language models (LLMs) work through a problem step by step.
This approach underpins the impressive capabilities of a new generation of reasoning models like OpenAI’s o3, Google’s Gemini 2.5, Anthropic’s Claude 3.7, and DeepSeek’s R1. And AI papers are now awash with references to “thought,” “thinking,” and “reasoning,” as cognitively inspired techniques proliferate.
“Since about the spring of last year, it has been clear to anybody who is serious about AI research that the next revolution will not be about scale,” says Igor Grossmann, a professor of psychology at the University of Waterloo, Canada. “The next revolution will be about better cognition.”
At their core, LLMs use statistical probabilities to predict the next token—the technical name for the chunks of text that models work with—in a string of text. But the CoT technique showed that simply prompting the models to respond with a series of intermediate “reasoning” steps before arriving at an answer significantly boosted performance on math and logic problems.
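In practice, the change is almost entirely in the prompt. The sketch below shows the difference between asking for an answer directly and asking for step-by-step reasoning first; the call_llm helper is a hypothetical stand-in rather than any particular vendor’s API.

```python
# Minimal chain-of-thought prompting sketch. `call_llm` is a hypothetical
# stand-in for whatever completion API is being used, not a real library call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in a real model or API client here.")

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Standard prompt: the model is asked for the answer directly.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompt: the model is asked to lay out intermediate
# "reasoning" steps before committing to a final answer.
cot_prompt = (
    f"{question}\n"
    "Think through the problem step by step, showing each intermediate "
    "calculation, then give the final answer on its own line prefixed with 'Answer:'."
)

if __name__ == "__main__":
    print(direct_prompt)
    print("---")
    print(cot_prompt)
```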
“It was a surprise that it worked so incredibly well,” says Kanishk Gandhi, a computer-science graduate student at Stanford University. Since then, researchers have devised a host of extensions of the technique, including “tree of thought,” “diagram of thought,” “logic of thought,” and “iteration of thought,” among others.
Leading model developers have also used reinforcement learning to bake the technique into their models, by getting a base model to produce CoT responses and then rewarding those that lead to the best final answers. In the process, models have developed a variety of cognitive strategies that mirror how humans solve complex problems, says Gandhi, such as breaking them down into simpler tasks and backtracking to correct mistakes in earlier reasoning steps.
But the way these models are trained can lead to problems, says Michael Saxon, a graduate student at the University of California, Santa Barbara. Reinforcement learning requires a way to verify whether a response is correct to determine whether to give a reward. This means reasoning models have primarily been trained on tasks where this verification is easy, such as math, coding, or logical puzzles. As a result, they tend to tackle all questions as if they were complicated reasoning problems, which can lead to overthinking, says Saxon.
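To make the verification point concrete, here is a toy sketch of the kind of verifiable reward Saxon describes, assuming a math task with a known final answer; the actual reinforcement-learning pipelines behind commercial reasoning models are, of course, far more elaborate.

```python
import re

# Toy reward signal for training on verifiable tasks: the chain of thought
# itself is not graded, only the final answer, which can be checked
# mechanically. The names and format here are illustrative assumptions,
# not any lab's actual training code.
def extract_final_answer(response: str) -> str | None:
    """Return whatever follows the last 'Answer:' marker, if present."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None

def reward(response: str, ground_truth: str) -> float:
    """1.0 if the final answer matches the known solution, else 0.0."""
    answer = extract_final_answer(response)
    return 1.0 if answer == ground_truth else 0.0

# Two sampled CoT responses to "12 * 7 = ?": only the one whose final answer
# verifies is rewarded, which over many updates nudges the model toward
# reasoning traces that reach correct answers.
print(reward("12 times 7 is 84.\nAnswer: 84", "84"))       # 1.0
print(reward("12 times 7 is roughly 80.\nAnswer: 80", "84"))  # 0.0
```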
In a recent experiment described in a preprint paper, he and colleagues gave various AI models a series of deliberately easy tasks, and showed that reasoning models use far more tokens to get to a correct answer than conventional LLMs. In some cases this overthinking even led to worse performance. Interestingly, Saxon says that dealing with the models the same way you’d deal with an overthinking human proved highly effective. The researchers got the model to estimate how many tokens it would take to solve the problem, and then gave it regular updates during the reasoning process on how many it had left before it needed to give an answer.
“That’s been a recurring lesson,” says Saxon. “Even though the models don’t really act like humans in a lot of important ways, approaches that are inspired by our own cognition can be surprisingly effective.”
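The paper doesn’t ship a reference implementation with this description, but the budget-reminder idea can be sketched as a simple loop. The call_llm stub, prompt wording, and chunk size below are all assumptions for illustration, not the researchers’ experimental code.

```python
# Sketch of the token-budget reminder described above: the model first
# estimates how many tokens it needs, then gets periodic updates on how much
# of that budget remains. `call_llm` is a placeholder for a real model.
def call_llm(prompt: str) -> str:
    return "...model output would go here..."

def solve_with_budget(question: str, chunk_size: int = 128) -> str:
    # Step 1: ask the model to estimate its own reasoning budget.
    estimate = call_llm(
        f"{question}\nBefore solving, estimate how many tokens of reasoning "
        "you will need. Reply with a single integer."
    )
    try:
        budget = int(estimate)
    except ValueError:
        budget = 512  # fall back to a default if the estimate isn't parseable

    transcript, remaining = "", budget
    # Step 2: let the model reason in chunks, reminding it after each chunk
    # how much of its self-declared budget is left.
    while remaining > 0:
        step = call_llm(
            f"{question}\nReasoning so far:\n{transcript}\n"
            f"You have roughly {remaining} tokens left before you must answer. "
            "Continue reasoning, or give the final answer prefixed with 'Answer:'."
        )
        transcript += step + "\n"
        if "Answer:" in step:
            return step
        remaining -= chunk_size
    # Budget exhausted: force an answer.
    return call_llm(f"{question}\nReasoning:\n{transcript}\nGive the final answer now.")

print(solve_with_budget("What is 17 * 23?"))
```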
There are still important gaps in these models’ reasoning capabilities. Martha Lewis, an assistant professor of neurosymbolic AI at the University of Amsterdam, recently compared the ability of LLMs and humans to reason through the use of analogies, which is believed to form the basis of much creative thinking.
When tested on standard versions of analogical reasoning tests, both models and humans performed well. But when they were given new variants of the tests, model performance nose-dived compared to that of humans. The likely explanation, says Lewis, is that problems similar to the standard versions of these tests were in the models’ training data and they were simply using shallow pattern matching to find the solutions rather than reasoning. The tests were conducted on OpenAI’s older GPT-3, GPT-3.5, and GPT-4 models, and Lewis says it’s possible that newer reasoning models would perform better. But the experiments demonstrate the need for caution when talking about AI’s cognitive capabilities.
“Because the models do generate very fluent output, it’s very easy to feel as if they’re doing something more than they actually can,” says Lewis. “I don’t think we should say that these models are reasoning without really testing what we mean by reasoning within a specific context.”
Another important area where AI’s reasoning capabilities may be deficient is the ability to think about the mental states of others, something known as theory of mind. Several papers have demonstrated that LLMs can solve classical psychological tests of this capability, but researchers at the Allen Institute for AI (AI2) suspected this exemplary performance may be due to the tests’ inclusion in training datasets.
So the researchers created a new set of theory-of-mind tests grounded in real-world situations, which separately measured a model’s ability to deduce someone’s mental state, predict how that state influences their behavior, and judge whether their actions were reasonable. For instance, the model might be told that someone picks up a closed packet of chips in the supermarket, but the contents are moldy. It is then asked whether the person knows that the chips are moldy, whether they would still buy the chips, and whether that would be reasonable.
The team found that while the models were good at predicting mental states, they were bad at predicting behavior and judging reasonableness. AI2 research scientist Ronan Le Bras suspects this is because the models calculate the probability of actions based on all of the data available to them—and they know, for instance, that it’s highly unlikely that someone would buy moldy chips. Even though the models can deduce someone’s mental state, they don’t appear to take this state into account when predicting their behavior.
However, the researchers found that reminding the models of their mental-state prediction, or giving them a specific CoT prompt telling them to consider the character’s awareness, significantly improved performance. Yuling Gu, a predoctoral young investigator at AI2, says it’s important that models use the correct pattern of reasoning for specific problems. “We’re hoping that in the future, such reasoning will be baked deeper into these models,” she says.
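A rough sketch of that kind of staged prompting is below: the model’s own mental-state judgment is fed back to it before it is asked about behavior. The scenario wording and the call_llm stub are illustrative assumptions, not the AI2 benchmark’s actual prompts.

```python
# Staged theory-of-mind prompting: elicit the mental-state judgment first,
# then explicitly carry it into the behavior question so the prediction is
# conditioned on what the character believes rather than on what is true.
# `call_llm` is a placeholder; the scenario text is invented for illustration.
def call_llm(prompt: str) -> str:
    return "...model output would go here..."

scenario = (
    "At the supermarket, Dana picks up a sealed bag of chips. "
    "Unknown to anyone, the chips inside are moldy."
)

# Stage 1: ask only about the character's mental state.
mental_state = call_llm(
    f"{scenario}\nDoes Dana know that the chips are moldy? Answer briefly."
)

# Stage 2: remind the model of its own mental-state answer before asking
# about behavior and reasonableness.
behavior = call_llm(
    f"{scenario}\nYou previously concluded: {mental_state}\n"
    "Given what Dana believes (not what is actually true), will Dana buy the "
    "chips, and would that be reasonable?"
)

print(behavior)
```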
Getting models to reason flexibly across a wide range of tasks may require a more fundamental shift, says the University of Waterloo’s Grossmann. Last November, he coauthored a paper with leading AI researchers highlighting the need to imbue models with metacognition, which they describe as “the ability to reflect on and regulate one’s thought processes.”
Today’s models are “professional bullshit generators,” says Grossmann, coming up with a best guess for any question without the capacity to recognize or communicate their uncertainty. They are also bad at adapting responses to specific contexts or considering diverse perspectives, things humans do naturally. Providing models with these kinds of metacognitive capabilities will not only improve performance but will also make it easier to follow their reasoning processes, says Grossmann.
Doing so will be tricky, he adds, because it will either involve a mammoth effort to label training data for things like certainty or relevance, or the addition of new modules to the models that do things like evaluate the confidence of reasoning steps. Reasoning models already use far more computational resources and energy than standard LLMs, and adding these extra training requirements or processing loops is likely to worsen the situation. “It could put a lot of the small companies out of business,” says Grossmann. “And there is an environmental cost associated with that as well.”
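At its simplest, such a module might look something like the sketch below, which asks the model to score its own confidence in each reasoning step and flags the shaky ones for a second pass. This is purely a hypothetical illustration of the idea Grossmann describes; call_llm, the prompt wording, and the threshold are all assumptions.

```python
# Hypothetical "metacognitive" wrapper: score each reasoning step for
# confidence in a second pass and flag low-confidence steps for revision
# or an explicit caveat. `call_llm` is a stand-in for a real model call.
def call_llm(prompt: str) -> str:
    return "0.4"  # stubbed self-reported confidence so the sketch runs end to end

def confidence_of(step: str, context: str) -> float:
    score = call_llm(
        f"Context so far:\n{context}\n\nReasoning step:\n{step}\n"
        "On a scale from 0 to 1, how confident are you that this step is "
        "correct and relevant? Reply with a single number."
    )
    try:
        return max(0.0, min(1.0, float(score)))
    except ValueError:
        return 0.0  # treat unparseable self-reports as no confidence

def flag_uncertain_steps(steps: list[str], threshold: float = 0.6) -> list[str]:
    context, flagged = "", []
    for step in steps:
        if confidence_of(step, context) < threshold:
            flagged.append(step)  # candidates for re-derivation or a caveat
        context += step + "\n"
    return flagged

print(flag_uncertain_steps(["Assume x = 3.", "Then 2x = 6, so the answer is 6."]))
```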
Nonetheless, he remains convinced that attempting to mimic the cognitive processes behind human intelligence is the most obvious path forward, even if most efforts today are highly simplistic. “We don’t know an alternative way to think,” he says. “We can only invent things that we have some kind of conceptual understanding of.”