A few months ago, I asked ChatGPT to recommend books by and about Hermann Joseph Muller, the Nobel Prize-winning geneticist who showed how X-rays can cause mutations. It dutifully gave me three titles. None existed. I asked again. Three more. Still wrong. By the third attempt, I had an epiphany: The system wasn’t just mistaken, it was making things up.
I am hardly alone. In June 2023, two New York lawyers were sanctioned after they filed a legal brief that cited six fictitious court cases—each generated by ChatGPT. Earlier this year, a public health report from Robert F. Kennedy Jr.'s "Make America Healthy Again" commission was found to contain fabricated studies, apparently produced with AI. And just last month, OpenAI was sued by the parents of a 16-year-old boy who had confided suicidal thoughts to ChatGPT and, according to court filings, received little pushback. The boy later took his life. If machines are this unreliable—even dangerous—why do they "cheat"?
The answer begins with how these systems are trained. Like people, AI learns through a kind of reward and punishment. Every time an AI model produces a response, it is scored—digitally—on how useful or pleasing that answer appears. Over millions of iterations, it learns what earns the highest reward. This process, known as reinforcement learning, is roughly akin to a rat pressing a lever for food pellets or a child getting a gold star for good behavior.
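For the technically curious, that loop can be sketched in a few lines of code. What follows is a toy "bandit" learner with invented reward numbers, not anything resembling a real training pipeline; the only point is that the learner drifts toward whichever response earns the most points, true or not.

```python
import random

# Toy illustration (invented rewards, nothing like a production training
# pipeline): a learner tries canned responses and updates its preferences
# from a scalar score for how pleasing each answer seems.
responses = ["confident made-up citation", "honest 'I don't know'"]
reward = {
    "confident made-up citation": 1.0,  # raters find it helpful-sounding
    "honest 'I don't know'": 0.3,       # raters find it less satisfying
}

values = {r: 0.0 for r in responses}  # the learner's value estimates
alpha, epsilon = 0.1, 0.1             # learning rate, exploration rate

for step in range(5000):
    # epsilon-greedy: mostly repeat whatever has scored best so far
    if random.random() < epsilon:
        choice = random.choice(responses)
    else:
        choice = max(values, key=values.get)
    # nudge the value estimate toward the reward just received
    values[choice] += alpha * (reward[choice] - values[choice])

print(values)  # the confident fabrication ends up with the higher value
```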
When there is a single, clear goal, the system can excel. A chess program, for instance, knows exactly what winning means: checkmate. But when the goal is fuzzier—say, answering an open-ended question or writing in a style that satisfies a reader—the model faces many possible paths. That ambiguity can make it unstable.
Xingcheng Xu, a researcher at the Shanghai Artificial Intelligence Laboratory, calls this fragility the “policy cliff.” In his analysis, the problem arises when there is no single “best” answer but several nearly equivalent ones. Under those conditions, even a tiny change in the reward signal can cause the system’s behavior to swing dramatically—producing what look like arbitrary or even misleading results. Xu shows that this instability explains common failures of large language models: spurious reasoning (giving the right answer with a wrong justification), “deceptive alignment” (responses that sound cooperative but conceal shortcuts), and instruction-breaking (ignoring requested formats or constraints).
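A toy calculation makes the cliff concrete. The sketch below is not Xu's model; it simply assumes a near-greedy policy choosing between two answers with invented, nearly tied rewards, and shows how a one percent nudge in the reward flips the policy's choice almost completely.

```python
import math

def near_greedy_policy(rewards, temperature=0.002):
    # A near-greedy softmax: a low temperature sharply prefers the top reward.
    m = max(rewards)  # subtract the max before exponentiating, for stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

# Invented rewards for two nearly equivalent answers:
# index 0 = a carefully reasoned answer, index 1 = a shortcut that merely sounds right.
print(near_greedy_policy([1.00, 0.99]))  # -> [0.993, 0.007]: the reasoned answer dominates
print(near_greedy_policy([0.99, 1.00]))  # -> [0.007, 0.993]: a 1% nudge flips the policy
```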
These breakdowns aren’t accidents so much as rational shortcuts. If a model is rewarded only for producing a convincing final answer, it has little incentive to develop a sound reasoning process. If politeness or flattery earns higher marks from human evaluators, the model may shift toward sycophancy. In other words, the machine is not aiming for truth; it’s aiming for points.
OpenAI has acknowledged this tension. The company recently reorganized its Model Behavior team—a small group of researchers charged with shaping the “personality” of its systems and reducing sycophancy—into its larger Post Training division. The group, led until recently by Joanne Jang, has worked on every OpenAI model since GPT-4 and became central to debates about political bias, warmth, and how much an AI should push back on a user’s beliefs. When GPT-5 rolled out with fewer signs of sycophancy but a cooler tone, many users objected, forcing OpenAI to adjust. The episode illustrates just how fine the line is between making AI feel friendly and making it too agreeable.
Personalization deepens the problem. As AI tools adapt to individual histories, they tailor responses to suit us, not necessarily to reflect the truth. Over time, this creates an echo chamber: the system feeds our biases back to us, reinforcing them with the sheen of machine authority.
The parallels with nature are striking. In my 2023 book The Liars of Nature and the Nature of Liars, I show how complex societies—from birds to primates—are fertile ground for cheating strategies. Crows sometimes cry wolf to scare off competitors. Certain fish make themselves look more attractive by sidling up to less appealing males. Monkeys sneak mating opportunities when the alpha isn’t looking. Why? Because in a world with many strategies, dishonesty can pay.
AI is now enmeshed in our equally complex human world. It thrives on our attention and approval, and in chasing those rewards, it sometimes cheats. Xu’s work suggests there are ways to stabilize these systems—for instance, by adding “entropy regularization,” which makes their choices less brittle—but such fixes often blunt creativity. The trade-off is unavoidable: the tighter the leash, the less lively the model.
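To see that trade-off in miniature, the earlier toy example can be rerun with an entropy bonus, which in this simple setting amounts to raising the softmax temperature (again, invented numbers, not a claim about any real system): the policy no longer lurches off the cliff when the reward is nudged, but it also stops committing firmly to either answer.

```python
import math

def entropy_regularized_policy(rewards, tau):
    # Maximizing expected reward plus tau times the policy's entropy yields a
    # softmax with temperature tau, so a larger tau means a smoother policy.
    m = max(rewards)
    exps = [math.exp((r - m) / tau) for r in rewards]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

nearly_tied = [1.00, 0.99]   # the reasoned answer barely ahead
nudged      = [0.99, 1.00]   # the same answers, with the reward nudged by 1%

for tau in (0.002, 0.05):    # weak vs. strong entropy regularization
    print(tau,
          entropy_regularized_policy(nearly_tied, tau),
          entropy_regularized_policy(nudged, tau))
# tau=0.002: the policy flips from ~[0.993, 0.007] to ~[0.007, 0.993]
# tau=0.05:  it merely drifts from ~[0.55, 0.45] to ~[0.45, 0.55]
```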
Which leaves us with a paradox: The flaws of AI are not solely mechanical; they are psychological, mirroring our own weaknesses. Machines pander to our preferences because we have trained them to. In the end, the failures of AI are inseparable from the frailties of human judgment.
References
Xu, X. (2025). The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models. arXiv preprint arXiv:2507.20150.