This post was originally published on here
AI models are now everywhere, from hospitals to churches.
The astonishing thing is that even AI experts still don’t know exactly what’s happening inside these black box models, even as they’re being deployed in the highest-stakes settings imaginable.The latest strategy to figure it out: studying them like biological systems.
For example, MIT Tech Review reports, scientists at Anthropic have developed tools that let them trace what’s happening inside models as they perform a task, a type of study called mechanistic interpretability — which resembles how doctors use MRIs to study brain activity, another type of intelligence we don’t quite understand yet.
“This is very much a biological type of analysis,” Josh Batson, a research scientist at Anthropic, told Tech Review. “It’s not like math or physics.”
In another experiment that resembles how biologists use organoids, which are miniature versions of human organs, the magazine reports that Anthropic developed a special neural network called a sparse autoencoder whose inner workings are easier to understand and analyze than regular large language models (LLMs).
Another technique is chain-of-thought monitoring, in which models explain their reasoning behind their behavior and actions — much like listening to the inner monologue of an actual person. This has helped scientists spot misaligned behavior.
“It’s been pretty wildly successful in terms of actually being able to find the model doing bad things,” said Bowen Baker, OpenAI research scientist, to MIT.
A looming danger is that future models will become so complex — especially if they’re themselves designed by AI — that we’ll have effectively no idea how they work. Even now, with the current tools and techniques at our disposal, unexpected behaviors still pop up that don’t align with human objectives of truthfulness and safety.
We see hard evidence of this in the news, which is littered with reports of people harming themselves because AI told them to — which makes it even more disturbing that we still don’t quite understand how they function.
More on AI: Indie Developer Deleting Entire Game From Steam Due to Shame From Having Used AI







