Artificial intelligence (AI) is becoming increasingly integral to modern society, transforming how individuals and businesses operate. The rise of generative AI (GenAI) has captivated audiences with its potential to revolutionize industries, drive productivity, and fuel creative breakthroughs. Yet, despite these impressive capabilities, there are concerns that GenAI lacks a true understanding of the world and the underlying principles that govern it.
A team of scientists from MIT, Harvard, and Cornell has found that large language models (LLMs), such as OpenAI’s GPT-4 and Anthropic’s Claude 3 Opus, fail to form an accurate representation of the real world. An LLM that seems to perform well in one context might break down if the environment changes slightly.
To test the LLMs, the researchers developed two new metrics. The first, sequence compression, checks whether the model understands that different inputs leading to the same situation should behave the same way going forward. The second, sequence distinction, checks whether the model knows that inputs leading to different situations should behave differently.
Together, the two metrics provide a framework for assessing whether an LLM captures the underlying patterns in how inputs relate to outcomes, and for evaluating its ability to respond consistently across contexts.
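To make the two checks concrete, here is a minimal sketch in Python of how they could be computed for a toy deterministic world. The toy world, the probe sequences, and the `model_valid_moves` stand-in are illustrative assumptions, not the authors’ actual evaluation code; the idea is simply that sequences landing in the same state should get identical predicted continuations, while sequences landing in different states should not. A model with a coherent world model would score near 1.0 on both checks.

```python
# Illustrative sketch (not the authors' code): score a model against a toy
# deterministic world. A "state" is wherever a sequence of moves lands; the
# model is any function that, given the move history, predicts which next
# moves are valid.

from itertools import combinations

# Hypothetical toy world: deterministic transitions (state, move) -> state.
TRANSITIONS = {
    ("A", "left"): "B", ("A", "right"): "C",
    ("B", "left"): "C",
    ("C", "right"): "A",
}
MOVES = ["left", "right"]

def true_state(seq, start="A"):
    """Ground-truth state reached by following a move sequence."""
    state = start
    for move in seq:
        state = TRANSITIONS[(state, move)]
    return state

def model_valid_moves(seq):
    """Stand-in for the LLM: the set of next moves it believes are valid.
    Here it simply reads the true world (so it scores perfectly); a real
    evaluation would query the model instead."""
    return {m for m in MOVES if (true_state(seq), m) in TRANSITIONS}

def sequence_compression(sequences):
    """Fraction of sequence pairs reaching the SAME true state for which
    the model predicts the SAME set of valid continuations."""
    pairs = [(a, b) for a, b in combinations(sequences, 2)
             if true_state(a) == true_state(b)]
    if not pairs:
        return None
    return sum(model_valid_moves(a) == model_valid_moves(b) for a, b in pairs) / len(pairs)

def sequence_distinction(sequences):
    """Fraction of sequence pairs reaching DIFFERENT true states for which
    the model predicts DIFFERENT sets of valid continuations."""
    pairs = [(a, b) for a, b in combinations(sequences, 2)
             if true_state(a) != true_state(b)]
    if not pairs:
        return None
    return sum(model_valid_moves(a) != model_valid_moves(b) for a, b in pairs) / len(pairs)

probe_sequences = [("left",), ("right",), ("left", "left"), ("right", "right")]
print(sequence_compression(probe_sequences))  # 1.0 for a coherent world model
print(sequence_distinction(probe_sequences))  # 1.0 for a coherent world model
```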
To demonstrate their findings, the researchers presented an experiment involving a popular LLM tasked with providing driving directions in New York City. While the LLM delivered near-100% accuracy, the researchers found that the map it had implicitly formed of the city was filled with streets and routes that don’t exist.
The problem worsened when the researchers introduced unexpected changes such as road closures and detours. The LLM struggled to adjust, leading to a significant drop in accuracy. In some cases, it failed entirely to handle these real-world disruptions. Closing just 1% of streets led to a drop in the AI’s directional accuracy from nearly 100% to 67%.
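As a rough illustration of this kind of stress test (an assumed setup, not the study’s actual evaluation harness), the sketch below builds a toy street graph with networkx, closes 1% of its edges, and measures how many routes proposed by a planner that only knows the original map remain valid. The grid city, the trip list, and the `stale_planner` stand-in for the LLM are all hypothetical.

```python
# Illustrative sketch: stress-test a route-giving model by closing a small
# fraction of streets in a ground-truth graph and checking whether its
# proposed routes are still valid.

import random
import networkx as nx

def close_streets(graph, fraction=0.01, seed=0):
    """Return a copy of the street graph with `fraction` of edges removed."""
    g = graph.copy()
    rng = random.Random(seed)
    to_close = rng.sample(list(g.edges()), max(1, int(fraction * g.number_of_edges())))
    g.remove_edges_from(to_close)
    return g

def route_is_valid(graph, route):
    """A route is valid only if every consecutive step uses a real, open street."""
    return all(graph.has_edge(a, b) for a, b in zip(route, route[1:]))

def accuracy(model_route_fn, graph, trips):
    """Fraction of trips for which the model's route is valid and reaches the destination."""
    ok = 0
    for origin, dest in trips:
        route = model_route_fn(origin, dest)
        if route and route[0] == origin and route[-1] == dest and route_is_valid(graph, route):
            ok += 1
    return ok / len(trips)

# Toy ground-truth "city": a 20x20 grid of intersections.
city = nx.grid_2d_graph(20, 20)
perturbed = close_streets(city, fraction=0.01)

# Stand-in for the LLM: a planner that only knows the ORIGINAL map, so its
# routes may use streets that are now closed -- mimicking a stale world model.
def stale_planner(origin, dest):
    return nx.shortest_path(city, origin, dest)

trips = [((0, 0), (19, 19)), ((5, 3), (12, 17)), ((2, 18), (18, 1))]
print(accuracy(stale_planner, city, trips))       # 1.0 on the unperturbed map
print(accuracy(stale_planner, perturbed, trips))  # drops when closures invalidate routes
```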
“One hope is that, because LLMs can accomplish all these amazing things in language, maybe we could use these same tools in other parts of science, as well. But the question of whether LLMs are learning coherent world models is very important if we want to use these techniques to make new discoveries,” says senior author Ashesh Rambachan, assistant professor of economics and a principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).
The scientists detailed their research in a study published in the arXiv preprint database. Rambachan co-authored the paper with lead author Keyon Vafa, a postdoc at Harvard; Justin Y. Chen, an electrical engineering and computer science (EECS) graduate student at MIT; Jon Kleinberg, Tisch University Professor of Computer Science and Information Science at Cornell University; and Sendhil Mullainathan, an MIT professor in the departments of EECS and of Economics, and a member of LIDS.
The findings will be presented at the Conference on Neural Information Processing Systems (NeurIPS) at the Vancouver Convention Center in December this year.
Several other studies and incidents have highlighted the unpredictable nature of LLMs. Just last week, CBS reported an incident in which a student in Michigan received a threatening response during a chat with Google’s AI chatbot Gemini.
Google responded to the incident by stating that “Large language models can sometimes respond with non-sensical responses, and this is an example of that. This response violated our policies and we’ve taken action to prevent similar outputs from occurring.”
However, this is not an isolated event; other AI models have also been shown to return concerning outputs. Last month, the mother of a Florida teen filed a lawsuit against an AI company, claiming its chatbot encouraged her son to take his own life.
Unless AI models truly understand the systems they interact with, their impressive results can be deceptive: they may do well in familiar situations but fail when conditions change. To be truly reliable, an AI model must go beyond merely performing well; it must demonstrate a deeper understanding of the contexts in which it operates.