On Tuesday, OpenAI published a new research paper detailing a technique that uses its GPT-4 language model to write explanations for the behavior of neurons in its older GPT-2 model, albeit imperfectly. It’s a step forward for “interpretability,” which is a field of AI that seeks to explain why neural networks create the outputs they do.
While large language models (LLMs) are conquering the tech world, AI researchers still don’t know a lot about their functionality and capabilities under the hood. In the first sentence of OpenAI’s paper, the authors write, “Language models have become more capable and more widely deployed, but we do not understand how they work.”
For outsiders, that likely sounds like a stunning admission from a company that not only depends on revenue from LLMs but also hopes to accelerate them to beyond-human levels of reasoning ability.
But this property of “not knowing” exactly how a neural network’s individual neurons work together to produce its outputs has a well-known name: the black box. You feed the network inputs (like a question), and you get outputs (like an answer), but whatever happens in between (inside the “black box”) is a mystery.
In an attempt to peek inside the black box, researchers at OpenAI utilized its GPT-4 language model to generate and evaluate natural language explanations for the behavior of neurons in a vastly less complex language model, such as GPT-2. Ideally, having an interpretable AI model would help contribute to the broader goal of what some people call “AI alignment,” ensuring that AI systems behave as intended and reflect human values. And by automating the interpretation process, OpenAI seeks to overcome the limitations of traditional manual human inspection, which is not scalable for larger neural networks with billions of parameters.
OpenAI’s technique “seeks to explain what patterns in text cause a neuron to activate.” Its methodology consists of three steps:
- Explain the neuron’s activations using GPT-4
- Simulate neuron activation behavior using GPT-4
- Compare the simulated activations with real activations.
To understand how OpenAI’s method works, you need to know a few terms: neuron, circuit, and attention head. In a neural network, a neuron is like a tiny decision-making unit that takes in information, processes it, and produces an output, just like a tiny brain cell making a decision based on the signals it receives. A circuit in a neural network is like a network of interconnected neurons that work together, passing information and making decisions collectively, similar to a group of people collaborating and communicating to solve a problem. And an attention head is like a spotlight that helps a language model pay closer attention to specific words or parts of a sentence, allowing it to better understand and capture important information while processing text.
By identifying specific neurons and attention heads within the model that need to be interpreted, GPT-4 creates human-readable explanations for the function or role of these components. It also generates an explanation score, which OpenAI calls “a measure of a language model’s ability to compress and reconstruct neuron activations using natural language.” The researchers hope that the quantifiable nature of the scoring system will allow measurable progress toward making neural network computations understandable to humans.
interactive website that contains example breakdowns of each step, showing highlighted portions of the text and how they correspond to certain neurons. Additionally. OpenAI has provided “Automated interpretability” code and its GPT-2 XL neurons and explanations datasets on GitHub.
If they ever figure out exactly why ChatGPT makes things up, all of the effort will be well worth it.