When people searched Google for “cheese not sticking to pizza” in May 2024, the search engine’s newly launched “AI Overviews” feature replied, “you can … add about ⅛ cup of non-toxic glue to the sauce to give it more tackiness.”
In a series of strange answers, the artificial intelligence (AI) tool also recommended that people eat one small rock a day and drink urine in order to pass kidney stones.
These bizarre answers are popularly called hallucinations: when AI models face questions they weren’t trained to answer, they make up responses that can sound convincing but are often inaccurate.
Like Google’s “AI Overviews”, ChatGPT has also been prone to hallucinations. In a 2023 Scientific Reports study, researchers from Manhattan College and the City University of New York compared how often two ChatGPT models, GPT-3.5 and GPT-4, hallucinated when compiling information on certain topics. They found that 55% of GPT-3.5’s references were fabricated; GPT-4 fared better, at 18%.
“Although GPT-4 is a major improvement over GPT-3.5, problems remain,” the researchers concluded.
Hallucinations make AI models unreliable and limit their applications. Experts told this reporter they were sceptical of how reliable AI tools are today and how reliable they will become. And hallucinations were not the only reason fuelling their doubts.
Defining reliability
To evaluate how reliable an AI model is, researchers usually refer to two criteria: consistency and factuality. Consistency is the ability of an AI model to produce similar outputs for similar inputs. For example, say an email service uses an AI algorithm to filter out spam, and an inbox receives two spam emails with similar features: generic greetings, poorly written content, and so on. If the algorithm identifies both emails as spam, it can be said to be making consistent predictions.
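To make the idea concrete, here is a minimal sketch of how such a consistency check could look, assuming a made-up handful of training emails and scikit-learn’s off-the-shelf text classifier; it is an illustration, not how any real email service works.

```python
# Illustrative sketch: a toy spam filter checked for consistency.
# The training emails and labels below are made up for demonstration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_emails = [
    "Dear user, you have won a prize, click here",   # spam
    "Dear customer, claim your free reward now",     # spam
    "Minutes from yesterday's project meeting",      # not spam
    "Can we reschedule our call to Friday?",         # not spam
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_emails, train_labels)

# Two unseen spam emails with similar features (generic greeting, prize bait).
similar_spam = [
    "Dear user, claim your prize now",
    "Dear customer, you have won a free reward",
]
# A consistent model assigns the same label, "spam", to both.
print(model.predict(similar_spam))
```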
Factuality refers to how correctly an AI model is able to respond to a question. This includes “stating ‘I don’t know’ when it does not know the answer,” Sunita Sarawagi, professor of computer science and engineering at IIT-Bombay, said. Sarawagi received the Infosys Prize in 2019 for her work on, among other things, machine learning and natural language processing, the backbones of modern-day AI.
When an AI model hallucinates, it compromises on factuality. Instead of stating that it doesn’t have an answer to a particular question, it generates an incorrect response and presents it as correct, “with high confidence,” according to Niladri Chatterjee, the Soumitra Dutta Chair professor of AI at IIT-Delhi.
Why hallucinate?
Last month, several ChatGPT users were amused when it couldn’t generate images of a room with no elephants in it. To test whether this problem still persisted, this reporter asked OpenAI’s DALL-E, an AI model that can generate images based on text prompts, to generate “a picture of a room with no elephants in it.” See the image above for what it made.
When prompted further with the query, “The room should have no pictures or statues of elephants. No elephants of any kind at all”, the model created two more images. One contained a large picture of an elephant while the other contained both a picture and a small elephant statue. “Here are two images of rooms completely free of elephants — no statues, no pictures, nothing elephant-related at all,” the accompanying text from DALL-E read.
Such inaccurate but confident responses indicate that the model fails to “understand negation,” Chatterjee said.
Why negation? Nora Kassner, a natural language processing researcher at Google DeepMind, told Quanta magazine in May 2023 that this stems from a dearth of sentences using negation in the data used to train generative AI models.
Researchers develop contemporary AI models in two phases: training and testing. In the training phase, the model is provided with a set of annotated inputs. For example, the model can be fed a set of elephant pictures labelled “elephant”. The model learns to associate a set of features (say, the size, shape, and parts of an elephant) with the word “elephant”.
In the testing phase, the model is provided with inputs that were not part of its training dataset. For example, the researchers can input an image of an elephant that the model didn’t encounter in its training phase. If the algorithm can accurately recognise this picture as an elephant and distinguish it from another picture, say of a cat, it is said to be successful.
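A minimal sketch of these two phases, assuming scikit-learn’s built-in handwritten-digits dataset as a stand-in for labelled elephant and cat pictures, might look like this:

```python
# Minimal sketch of the training and testing phases, using scikit-learn's
# digits dataset in place of labelled animal pictures.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

images, labels = load_digits(return_X_y=True)

# Hold back a quarter of the annotated data for the testing phase.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.25, random_state=0
)

# Training phase: the model sees (input, label) pairs and learns associations.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Testing phase: the model is scored on inputs it never saw during training.
print("accuracy on unseen images:", model.score(X_test, y_test))
```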
Simply speaking, AI models don’t understand language the way humans do. Instead, their outputs are driven by statistical associations, learnt during the training phase, between a given combination of inputs and an output. As a result, when they encounter queries that are uncommon or absent in their training dataset, they fill the gap with other associations that are present in the dataset. In the example above, it was “elephant in the room”. This leads to factually incorrect outputs.
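A toy illustration of such learnt associations, assuming a made-up two-sentence corpus: counting which word most often follows a pair of words reproduces the “elephant in the room” association, whether or not it fits a new request.

```python
# Toy sketch of learning by association: count which word follows each pair
# of words in a tiny made-up corpus, then report the most common continuation.
from collections import Counter, defaultdict

corpus = (
    "the elephant in the room was ignored . "
    "nobody mentioned the elephant in the room ."
).split()

follows = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    follows[(a, b)][c] += 1

# Asked to continue "... in the", the statistics point to "room", because that
# is the association present in the training data.
print(follows[("in", "the")].most_common(1))  # [('room', 2)]
```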
Hallucinations typically occur when AI models are prompted with queries that require “ingrained thinking, connecting concepts and then responding,” said Arpan Kar, professor of information systems and AI at IIT-Delhi.
More or less reliable?
Even as the development and use of AI are both in the throes of explosive growth, the question of their reliability looms large. And hallucinations are just one reason.
Another reason is that AI developers typically report the performance of their models using benchmarks, or standardised tests, that “are not foolproof and can be gamed,” IIT-Delhi’s Chatterjee said.
One way to ‘game’ benchmarks is by including testing data from the benchmark in the AI model’s training dataset.
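One crude, hypothetical way to check for such contamination is to look for benchmark test items that appear verbatim in a model’s training corpus. The sketch below uses placeholder lists, not any real ChatGPT or HumanEval data.

```python
# Illustrative sketch: a crude contamination check that looks for benchmark
# test items appearing verbatim (after normalisation) in a training corpus.
# Both lists are placeholders invented for this example.
def normalise(text: str) -> str:
    return " ".join(text.lower().split())

training_corpus = [
    "Write a function that returns the sum of two numbers.",
    "Explain photosynthesis in one paragraph.",
]
benchmark_items = [
    "Write a function that returns the sum of two numbers.",  # leaked item
    "Implement a function that reverses a linked list.",
]

train_set = {normalise(t) for t in training_corpus}
leaked = [item for item in benchmark_items if normalise(item) in train_set]
print(f"{len(leaked)} of {len(benchmark_items)} benchmark items found in training data")
```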
In 2023, Horace He, a machine learning researcher at Meta, alleged that the training data of GPT-4 might have been “contaminated” by the testing data from a benchmark. That is, the model was trained, at least partly, on the same data that was used to test its capabilities.
Computer scientists from Peking University, China, investigated this allegation using a different benchmark, the HumanEval dataset, and concluded that there was a good chance it was true. HumanEval was created by researchers at OpenAI, the company that owns and builds ChatGPT.
According to Chatterjee, this means while the model might perform “well on benchmarks” because it has been trained on the testing data, its performance might drop “in real-world applications”.
A model without hallucinations
But all this said, the “frequency of hallucination [in popular AI models] is reducing for common queries,” Sarawagi said. She added that this is because newer versions of these AI models are being “trained with more data on the queries where the earlier version was reported to have been hallucinating”.
This approach is like “spotting weaknesses and applying band-aids,” as Sarawagi put it.
However, Kar of IIT-Delhi said that despite there being more training data, popular AI models like ChatGPT won’t be able to reach a stage where they won’t hallucinate. That will require an AI model to be “updated with all the possible knowledge all across the globe on a real-time basis,” he said. “If that happens, that algorithm will become all-powerful.”
Chatterjee and Sarawagi instead suggested shifting how AI models are built and trained. One such approach is to develop models for specialised tasks. For example, unlike large language models such as ChatGPT, small language models (SLMs) have fewer parameters and are trained only on the data required to solve a few specific problems. Microsoft’s Orca 2, for instance, is an SLM built for “tasks such as reasoning, reading comprehension, math problem solving, and text summarisation”.
Another approach is to implement a technique called retrieval-augmented generation (RAG). Here, an AI model produces its output by retrieving information from a specific database relevant to a particular query. For example, when asked to respond to the question “What is artificial intelligence?”, the AI model can be provided with the link to the Wikipedia article on artificial intelligence. By asking the model to refer to only this source when crafting its response, the chances of it hallucinating can be substantially reduced.
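The following is a minimal sketch of the RAG flow, assuming a toy in-memory “database” of two articles and a placeholder generate() step standing in for a real language-model call; real systems use far larger document stores and more sophisticated retrieval.

```python
# Minimal sketch of retrieval-augmented generation (RAG) with a toy database.
documents = {
    "artificial intelligence": (
        "Artificial intelligence is the capability of computational systems "
        "to perform tasks associated with human intelligence."
    ),
    "photosynthesis": (
        "Photosynthesis is the process by which plants convert light into "
        "chemical energy."
    ),
}

def retrieve(query: str) -> str:
    # Pick the document whose title shares the most words with the query.
    query_words = set(query.lower().rstrip("?").split())
    best_title = max(documents, key=lambda t: len(query_words & set(t.split())))
    return documents[best_title]

def generate(prompt: str) -> str:
    # Placeholder: a real system would send this prompt to a language model,
    # asking it to answer using only the retrieved source.
    return prompt

query = "What is artificial intelligence?"
context = retrieve(query)
prompt = f"Answer using only this source:\n{context}\n\nQuestion: {query}"
print(generate(prompt))
```

Because the model is steered towards a specific, trusted source for each query, it has less room to fall back on spurious associations from its training data.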
Finally, Sarawagi suggested that AI models could be trained in a process called curriculum learning. In traditional training processes, data is presented to AI models at random. In curriculum learning, however, the model is trained successively on datasets with problems of increasing difficulty. For example, an AI model can be trained first on shorter sentences, then on longer, more complex sentences. Curriculum learning imitates human learning, and researchers have found that ‘teaching’ models this way can improve their eventual performance in the real world.
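A small sketch of the idea, assuming sentence length as a crude proxy for difficulty and a placeholder train_on() step in place of a real training loop:

```python
# Illustrative sketch of curriculum learning: order a made-up set of training
# sentences by a difficulty measure and feed them to the model in stages.
sentences = [
    "The cat sat.",
    "AI models learn statistical associations from data.",
    "Dogs bark.",
    "Curriculum learning presents easier examples before harder ones, "
    "mimicking the way humans are usually taught.",
]

def difficulty(sentence: str) -> int:
    return len(sentence.split())  # crude proxy: longer sentence = harder

def train_on(batch: list[str]) -> None:
    # Placeholder for a real training step on this batch of sentences.
    print(f"training on {len(batch)} sentences "
          f"(up to {max(difficulty(s) for s in batch)} words)")

# Sort by difficulty, then train in successive stages; each stage keeps the
# easier examples and adds progressively harder ones.
ordered = sorted(sentences, key=difficulty)
stage_size = 2
for start in range(0, len(ordered), stage_size):
    train_on(ordered[: start + stage_size])
```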
But in the final analysis, none of these techniques guarantees that it will rid AI models of hallucinations altogether. According to Chatterjee, “there will remain a need for systems that can verify AI-generated outputs, including human oversight.”
Sayantan Datta is a science journalist and a faculty member at Krea University.