When you interact with a large language model like Claude, what exactly is happening? This fundamental question drives Anthropic's interpretability team. As the researchers explain, we're not dealing with a simple database of responses or a glorified autocomplete - something far more complex and fascinating is occurring inside these models.
The core mystery: Nobody fully understands what's happening inside these AI systems, not even the people who built them. That's because LLMs aren't programmed with explicit rules like "if someone says hello, respond with hello." Instead, they're trained through a process closer to evolution than to engineering: their internal components are adjusted millions of times until the model becomes excellent at predicting what comes next in text.
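To make that concrete, here is a minimal, hypothetical sketch of what "training to predict the next token" looks like in practice: a toy character-level model in PyTorch, not anything resembling Anthropic's actual systems. The point is that nothing in the code spells out how to respond to "hello"; the parameters are simply nudged, step after step, toward better predictions.

```python
# Toy sketch: no hand-written rules, just repeated small adjustments
# to a model's parameters so it gets better at predicting the next character.
import torch
import torch.nn as nn

text = "hello world hello there hello again "
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])  # text as a sequence of token ids

class TinyNextTokenModel(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        # Map each token to a distribution (logits) over what comes next.
        return self.proj(self.embed(idx))

model = TinyNextTokenModel(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

inputs, targets = data[:-1], data[1:]  # each token's job: predict the one after it
for step in range(500):                # each step tweaks every parameter a little
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
```

A real LLM runs essentially this same loop at vastly larger scale, which is why its behavior can't simply be read off from the training code.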
The researchers - Jack (a former neuroscientist), Emanuel (a machine learning engineer), and Josh (with a background in mathematics and viral evolution) - approach studying LLMs like biologists studying unknown organisms. This isn't just a cute metaphor; it reflects a profound truth about how these systems develop.
Key insight: Evolution shaped humans to survive and reproduce, yet we don't consciously think about reproduction all day. In the same way, LLMs are shaped to predict the next word, yet internally they've developed complex intermediate goals and abstractions to achieve it.
The models develop through a process remarkably similar to evolution.
While technically LLMs are "just" predicting the next word, this description vastly undersells what's happening. To predict words accurately in complex contexts, the models must: