This post was originally published on here
Even with the tools and methods available today, unexpected behaviors continue to emerge that do not match human notions of truth and safety.
Artificial intelligence models are now everywhere, from hospitals to churches. Paradoxically, however, even leading experts in the field still do not fully understand what is going on inside these so-called “black boxes,” even though they are used in extremely high-risk environments.
Scientists at Anthropic have developed tools that allow us to track what is happening inside the models as they perform a given task. This method, known as “mechanistic interpretability,” is reminiscent of the use of nuclear magnetic resonance to study brain activity—another form of “black box” technology. This method, known as “mechanistic interpretability,” is reminiscent of the use of nuclear magnetic resonance to study brain activity—another form of intelligence that is not yet fully understood.
“It’s an analysis that is essentially biological. It’s not like math or physics,” Anthropic researcher Josh Buttson told the publication.
In another experiment, comparable to the use of organoids—miniature versions of human organs—Anthropic scientists created a special neural network called a “sparse autoencoder.” Its internal structure is easier to understand and analyze than standard large language models.
Another technique is called “thought chain observation,” in which models explain the reasoning behind their actions—similar to an internal monologue in humans. This helped scientists discover behavior that did not match the set goals. This has helped scientists detect behavior that does not correspond to the set goals.
“It has been extremely successful in detecting situations where the model is doing something wrong,” says OpenAI researcher Bowen Baker.
One of the most serious risks is that future models could become so complex—especially if designed by artificial intelligence itself—that humans would lose all understanding of how they function. Even with the tools and methods available today, unexpected behaviors continue to emerge that do not match human notions of truth and safety.
This already has real consequences—the media reports cases in which people have harmed themselves after receiving such advice from AI systems. Even more worrying is that such incidents are occurring at a time when scientists still do not fully understand how these technologies work. | BGNES







