How Large Language Models Think: Understanding the Inner Workings of AI

A Deep Dive into Anthropic's Interpretability Research

The Big Picture: What Are We Really Talking To?

When you interact with a large language model like Claude, what exactly is happening? This fundamental question drives Anthropic's interpretability team. As the researchers explain, we're not dealing with a simple database of responses or a glorified autocomplete; something far more complex and fascinating is occurring inside these models.

The core mystery: Nobody fully understands what's happening inside these AI systems, not even the people who built them. That's because LLMs aren't programmed with explicit rules like "if someone says hello, respond with hello." Instead, they're trained through an evolution-like process in which their internal parameters are adjusted millions of times until they become excellent at predicting what comes next in text.
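
To make that loop concrete, here is a minimal, hypothetical sketch of next-token-prediction training in Python. It is not Anthropic's training code: the model, the toy text, and the tiny recurrent network (standing in for a real transformer) are all illustrative assumptions. The point is only the shape of the process the researchers describe: the model guesses the next token, and its parameters are nudged to make better guesses next time.

```python
# Hypothetical sketch of next-token-prediction training (not Anthropic's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "training data": a short string treated as a sequence of character tokens.
text = "hello world, hello model"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

class TinyNextTokenModel(nn.Module):
    """A tiny stand-in for an LLM: embed tokens, contextualize, score next token."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # real LLMs use transformers
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)   # tokens -> vectors
        h, _ = self.rnn(x)       # contextualize each position
        return self.head(h)      # scores for what token comes next

model = TinyNextTokenModel(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Each step: predict token t+1 from tokens up to t, then tweak the parameters
# slightly so the correct next token becomes more likely.
for step in range(200):
    inputs, targets = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, len(vocab)), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Scaled up to billions of parameters and vastly more text, this same basic loop, repeated over and over, is what produces the internal structures the interpretability team studies; nobody hand-writes the rules the model ends up using.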

The Biology of AI: Why Researchers Compare LLMs to Living Organisms

The researchers, Jack (a former neuroscientist), Emanuel (a machine learning engineer), and Josh (whose background spans mathematics and viral evolution), approach studying LLMs the way biologists study unfamiliar organisms. This isn't just a cute metaphor; it reflects a profound truth about how these systems develop.

Key insight: Just as evolution shaped humans to survive and reproduce, but we don't consciously think about reproduction all day, LLMs are shaped to predict the next word, but internally they've developed complex intermediate goals and abstractions to achieve this.

The models develop through a process remarkably similar to evolution: their useful internal structures emerge from repeated adjustment and selection during training, not from deliberate design.

Beyond Word Prediction: The Emergence of Actual Thinking

While technically LLMs are "just" predicting the next word, this description vastly undersells what's happening. To predict words accurately in complex contexts, the models must: