When you interact with a large language model like Claude, what exactly is happening? This fundamental question drives Anthropic's interpretability team. As the researchers explain, we're not dealing with a simple database of responses or a glorified autocomplete - something far more complex and fascinating is occurring inside these models.
The core mystery: Nobody fully understands what's happening inside these AI systems, not even the people who built them. That's because LLMs aren't programmed with explicit rules like "if someone says hello, respond with hello." Instead, they're trained through a process closer to evolution than to engineering: their internal components are adjusted millions of times until the model becomes excellent at predicting what comes next in text.
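To make that concrete, here is a minimal, hypothetical sketch of what "training to predict the next token" looks like in practice: a toy character-level model in PyTorch, not anything resembling Anthropic's actual systems. The point is that nothing in the code spells out how to respond to "hello"; the parameters are simply nudged, step after step, toward better predictions.

```python
# Toy sketch: no hand-written rules, just repeated small adjustments
# to a model's parameters so it gets better at predicting the next character.
import torch
import torch.nn as nn

text = "hello world hello there hello again "
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])  # text as a sequence of token ids

class TinyNextTokenModel(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        # Map each token to a distribution (logits) over what comes next.
        return self.proj(self.embed(idx))

model = TinyNextTokenModel(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

inputs, targets = data[:-1], data[1:]  # each token's job: predict the one after it
for step in range(500):                # each step tweaks every parameter a little
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
```

A real LLM runs essentially this same loop at vastly larger scale, which is why its behavior can't simply be read off from the training code.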
The researchers - Jack (a former neuroscientist), Emanuel (a machine learning engineer), and Josh (with a background in mathematics and viral evolution) - approach studying LLMs like biologists studying unknown organisms. This isn't just a cute metaphor; it reflects a profound truth about how these systems develop.
Key insight: Evolution shaped humans to survive and reproduce, yet we don't consciously think about reproduction all day. In the same way, LLMs are shaped to predict the next word, yet internally they've developed complex intermediate goals and abstractions to achieve it.
The models develop through a process remarkably similar to evolution.
While technically LLMs are "just" predicting the next word, this description vastly undersells what's happening. To predict words accurately in complex contexts, the models must: