
What is Universality in LLMs? How to Find Universal Neurons

What is universality?

We human beings are all “initialized” differently — we’re born with different genetics. We then grow up in different families with different backgrounds, experiencing different events. Yet it is fascinating that our brains ultimately converge on similar structures and functions. We can think of this phenomenon as universality.

Image by Author: Universality in brains

In 2020, Olah et al. proposed three speculative claims regarding interpreting artificial neural networks:

  1. Features are the fundamental unit of neural networks.
  2. Features are connected by weights, forming circuits.
  3. Analogous features and circuits form across models and tasks.

The third claim is perhaps the most interesting. It concerns universality and suggests that different neural networks — even when trained on independent datasets — might converge to the same underlying mechanisms.

There is a well-known example: the first layer of almost any convolutional network trained on images learns Gabor filters, which identify edges and orientations.

With the rapid development of large language models (LLMs), researchers are asking a natural question: Can we observe universality in LLMs as well? If so, how can we find universal neurons?

Image by Olah et al: curve detector circuits found in 4 different vision models

In this blog post, we will focus on a simple experiment to identify universal neurons. More precisely, we will train two different tiny transformers and check whether we can find any universal neurons shared between them.

Please refer to the notebook for the complete Python implementation.

Quick Recap on Transformers

Recall that transformers, and in particular their attention mechanism, are the key breakthrough behind the success of modern large language models. Before their arrival, researchers had struggled for years with models like RNNs without achieving comparable performance. Transformers changed everything.

A basic transformer block consists of two key components:

  1. Multi-Head Self-Attention: Each token attends to all previous tokens (causal masking), learning which tokens matter most for prediction.
  2. Feedforward MLP: After attention, each token representation is passed through a small MLP.

The two components above are wrapped with residual connections (skip connections) and layer normalization.

Here, the most interesting part for us is the MLP inside each block, because it contains the “neurons” we’ll analyze to look for universality.

Experiment Setup

We designed an experiment using two tiny transformers.

Image by Author: Experiment steps

Please note that our goal is not to achieve state-of-the-art performance, but to build a toy setting in which we can check whether universal neurons exist.

We define a transformer structure (a minimal code sketch follows the list) that contains:

  • Embedding + positional encoding
  • Multi-head self-attention
  • MLP block with ReLU activation
  • Output layer projecting to vocabulary size.
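
To make this concrete, here is a minimal PyTorch sketch of such a tiny transformer. The class name, hyperparameters (d_model, n_heads, mlp_dim, max_len), and field names are illustrative assumptions, not the notebook's exact code:

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Illustrative tiny transformer; hyperparameters are assumptions, not the notebook's values."""
    def __init__(self, vocab_size, d_model=64, n_heads=4, mlp_dim=256, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)      # token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)         # learned positional encoding
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, mlp_dim)                # MLP: fc1's outputs are the "neurons" we analyze
        self.fc2 = nn.Linear(mlp_dim, d_model)
        self.out = nn.Linear(d_model, vocab_size)             # projection to vocabulary size

    def forward(self, x):                                     # x: LongTensor [batch, seq_len]
        B, T = x.shape
        h = self.tok_emb(x) + self.pos_emb(torch.arange(T, device=x.device))
        # causal mask: True entries mark positions a token is NOT allowed to attend to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        h = self.ln1(h + attn_out)                            # residual connection + layer norm
        h = self.ln2(h + self.fc2(torch.relu(self.fc1(h))))   # MLP with ReLU, residual + layer norm
        return self.out(h)                                    # logits over the vocabulary
```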

We now create two independently initialized instances of the tiny transformer architecture, model_a and model_b. Even though they share the same architecture, the models can be considered distinct because they start from different initial weights and are trained separately, each on its own set of 10,000 random samples. Both models are trained in a self-supervised way, learning to predict the next token given the previous tokens.
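
A hedged sketch of how the two models might be created and trained on next-token prediction. The vocabulary size, optimizer, learning rate, and data batches are assumptions for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model_a = TinyTransformer(vocab_size=100)
torch.manual_seed(1)
model_b = TinyTransformer(vocab_size=100)   # same architecture, different initial weights

def train(model, batches, epochs=3, lr=1e-3):
    """Next-token prediction: inputs are tokens[:, :-1], targets are tokens shifted by one."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens in batches:                              # tokens: LongTensor [batch, seq_len]
            logits = model(tokens[:, :-1])
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

# train(model_a, batches_a)   # each model sees its own 10,000 random samples
# train(model_b, batches_b)
```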

Find Universality with Correlation

Once both model_a and model_b are trained, we run them on a test dataset and extract all MLP activations, i.e. the hidden values immediately after the first linear layer in the MLP block. For each model, we thus get a tensor of shape [num_samples, sequence_length, mlp_dim].
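
One simple way to collect these activations is a forward hook on the first MLP layer. The sketch below assumes the TinyTransformer field names from the earlier snippet and a hypothetical test_tokens tensor:

```python
import torch

def get_mlp_activations(model, test_tokens):
    """Return the hidden values after fc1, with shape [num_samples, seq_len, mlp_dim]."""
    acts = []
    # the hook fires right after fc1; the post-ReLU values could be used just as well
    handle = model.fc1.register_forward_hook(
        lambda module, inp, out: acts.append(out.detach())
    )
    with torch.no_grad():
        model(test_tokens)          # test_tokens: LongTensor [num_samples, seq_len]
    handle.remove()
    return acts[0]

acts_a = get_mlp_activations(model_a, test_tokens)
acts_b = get_mlp_activations(model_b, test_tokens)
```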

Here is the interesting part: we will now compute the Pearson correlation between corresponding neurons in model_a and model_b using the formula

$$\rho_i = \frac{\sum_t (a_{t,i} - \bar{a}_i)\,(b_{t,i} - \bar{b}_i)}{\sqrt{\sum_t (a_{t,i} - \bar{a}_i)^2}\;\sqrt{\sum_t (b_{t,i} - \bar{b}_i)^2}}$$

where $a_{t,i}$ and $b_{t,i}$ are the activations of neuron $i$ at token position $t$ in model_a and model_b, and $\bar{a}_i$, $\bar{b}_i$ are their means over all positions.
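
A small PyTorch sketch of this per-neuron computation, flattening the sample and sequence dimensions into a single axis (names are illustrative):

```python
import torch

def neuron_correlations(acts_a, acts_b):
    """Pearson correlation between neuron i of model_a and neuron i of model_b."""
    # flatten [num_samples, seq_len, mlp_dim] -> [num_samples * seq_len, mlp_dim]
    A = acts_a.reshape(-1, acts_a.shape[-1]).float()
    B = acts_b.reshape(-1, acts_b.shape[-1]).float()
    A = A - A.mean(dim=0)                              # center each neuron's activations
    B = B - B.mean(dim=0)
    cov = (A * B).sum(dim=0)
    denom = A.norm(dim=0) * B.norm(dim=0) + 1e-8       # avoid division by zero for dead neurons
    return cov / denom                                 # shape: [mlp_dim]

actual_corr = neuron_correlations(acts_a, acts_b)
```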

We claim that if a neuron shows a high correlation, it might suggest that the two models have learned a similar feature, or, in other words, this neuron may be universal.

However, not all high correlations indicate universality; some may appear purely by chance. We therefore compare the correlations against a baseline obtained by applying a random rotation to the neurons of model_b, i.e. we replace the second set of activations with randomly rotated ones.

This random rotation will destroy any alignment between the two models but will still preserve the distribution of activations.
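
A sketch of one way to build this baseline, assuming a random orthogonal matrix applied across the neuron dimension of model_b's activations:

```python
import torch

def rotation_baseline(acts_a, acts_b, seed=0):
    """Rotate model_b's neurons by a random orthogonal matrix, then recompute correlations."""
    torch.manual_seed(seed)
    mlp_dim = acts_b.shape[-1]
    # QR decomposition of a random Gaussian matrix yields a random orthogonal matrix
    Q, _ = torch.linalg.qr(torch.randn(mlp_dim, mlp_dim))
    rotated_b = (acts_b.reshape(-1, mlp_dim) @ Q).reshape(acts_b.shape)
    return neuron_correlations(acts_a, rotated_b)

baseline_corr = rotation_baseline(acts_a, acts_b)
```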

Finally, we compute the so-called excess correlation by subtracting the baseline from the actual correlation.

We flag the neurons with high excess correlation (above 0.5) as universal neurons between the two models.
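
Putting the pieces together, the excess correlation and the flagging step might look like this (the 0.5 threshold matches the text; the rest is illustrative):

```python
# excess correlation: how much the true alignment exceeds the rotated baseline
excess_corr = actual_corr - baseline_corr
universal_neurons = (excess_corr > 0.5).nonzero().flatten()   # indices above the 0.5 threshold
print(f"{len(universal_neurons)} candidate universal neurons out of {len(excess_corr)}")
```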

Please refer to the notebook for a detailed Python implementation.

Results

We will now take a look at the results.

First, we have a plot comparing baseline vs actual correlations. We see that the baseline correlations are near zero, while the actual correlations of several neurons are much higher, showing that the observed alignment is not due to chance alone.

Image by Author: baseline vs actual correlation

We now plot the excess correlation distribution. As you can see, most neurons have very low excess correlation. However, a subset stands well above the 0.5 threshold. These neurons (the green dots on the histogram) are identified as universal neurons.

Image by Author: correlation distribution

The results of our analysis give clear evidence of universal neurons in the two independently trained transformers.

Conclusion

In this blog post, we introduced the concept of universality in LLMs. We trained and analyzed two independently initialized tiny transformers and were able to identify some universal neurons shared by both models, i.e. neurons that might capture similar features.

These findings suggest that neural networks, and LLMs in particular, can converge on similar internal mechanisms. Of course, our study focused on small models and a limited dataset, and the results are far from state-of-the-art. But the same method could, in principle, be applied to search for universality in larger models.
