
Mechanistic View of Transformers: Patterns, Messages, Residual Stream… and LSTMs

In my previous article, I talked about how mechanistic interpretability reimagines attention in a transformer as additive, without any concatenation. Here, I will dive deeper into this perspective, show how it resonates with ideas from LSTMs, and how this reinterpretation opens new doors for understanding.

To ground ourselves: the attention mechanism in transformers relies on a series of matrix multiplications involving the Query (Q), Key (K), Value (V), and an output projection matrix (O). Traditionally, each head computes attention independently, the results are concatenated, and then projected via O. From a mechanistic perspective, however, the final projection by the weight matrix O is better seen as applied per head (rather than concatenating the heads and then projecting once). This subtle shift implies that the heads are independent and separable all the way to the end.
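To make this concrete, here is a minimal numpy sketch (random data, made-up shapes) checking that the two views agree: concatenating the heads and projecting once with O gives exactly the same result as projecting each head with its own slice of O and summing.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 2
d_head = d_model // n_heads

# Per-head attention outputs (already weighted by attention scores),
# each of shape (seq_len, d_head). Names and shapes here are illustrative.
head_outputs = [rng.normal(size=(seq_len, d_head)) for _ in range(n_heads)]

# Full output projection, shape (n_heads * d_head, d_model).
W_O = rng.normal(size=(n_heads * d_head, d_model))

# Traditional view: concatenate the heads, then project once.
concat_then_project = np.concatenate(head_outputs, axis=-1) @ W_O

# Mechanistic view: project each head with its own slice of W_O, then sum.
per_head_then_sum = sum(
    h @ W_O[i * d_head:(i + 1) * d_head] for i, h in enumerate(head_outputs)
)

print(np.allclose(concat_then_project, per_head_then_sum))  # True
```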

Image by Author

Patterns and Messages

A brief analogy for Q, K and V: each matrix is a linear projection of the embedding E. The tokens in Q can be thought of as asking K the question “which other tokens are relevant to me?”, where K acts as a key (as in a hashmap) to the actual information contained in the tokens, which is stored in V. In this way, the input tokens in the sequence know which tokens to attend to, and how much.

In essence, Q and K determine relevance, and V holds the content. Let us now see how treating the heads as independent leads to the view that the per-head Query-Key and Value-Output matrices belong to two independent processes, namely patterns and messages.

Unpacking the steps of attention:

  1. Multiply the embedding matrix E with Wq to get the query matrix Q. Similarly, obtain the key matrix K and the value matrix V by multiplying E with Wk and Wv.
  2. Multiply Q with Kᵀ. In the traditional view of attention, this operation determines which other tokens in the sequence are most relevant to the current token under consideration.
  3. Apply softmax. This normalizes the relevance (similarity) scores from the previous step so they sum to 1, giving a weighting of the importance of the other tokens in context relative to the current one.
  4. Multiply with V. This completes the attention calculation: we have now extracted information from (that is, attended to) the sequence based on the computed scores. The result is a contextually enriched representation of the current token that encodes how other tokens in the sequence relate to it.
  5. Finally, project the result back into model space using O.

The final attention calculation, per head, is then: QKᵀVO (with the softmax applied to QKᵀ omitted here for brevity).
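Below is a minimal numpy sketch of these five steps for a single head, using random data and illustrative shapes; I also include the usual scaling by √d_head inside the softmax, which the prose above glosses over.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8

E = rng.normal(size=(seq_len, d_model))        # token embeddings
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
W_o = rng.normal(size=(d_head, d_model))       # per-head output projection O

# Step 1: project the embeddings to queries, keys, and values.
Q, K, V = E @ W_q, E @ W_k, E @ W_v

# Steps 2 and 3: similarity scores, normalized with softmax.
A = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)

# Step 4: attend to the values.
attended = A @ V

# Step 5: project back into model space.
out = attended @ W_o
print(out.shape)  # (seq_len, d_model)
```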

Now, instead of seeing this as ((QKᵀ)V)O, mechanistic interpretation sees this as the rearranged (QKᵀ)(VO), where QKᵀ forms the pattern and VO forms the message. Why does this matter? Because it lets us cleanly separate two conceptual processes:

Messages (VO): figuring out what to transmit (content).

Patterns (QKᵀ): figuring out where to look (relevance).
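Since this rearrangement is nothing more than the associativity of matrix multiplication, it is easy to verify numerically. A small sketch with random stand-ins for the pattern, the values, and the output projection:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_head, d_model = 5, 8, 16

A = rng.normal(size=(seq_len, seq_len))   # pattern: the (softmaxed) QKᵀ scores
V = rng.normal(size=(seq_len, d_head))    # per-head values
W_o = rng.normal(size=(d_head, d_model))  # per-head output projection O

# Traditional grouping: attend first, then project.
traditional = (A @ V) @ W_o

# Mechanistic grouping: form the messages VO first, then weight them by the pattern.
mechanistic = A @ (V @ W_o)

print(np.allclose(traditional, mechanistic))  # True, by associativity of matmul
```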

Diving deeper, remember that Q and K themselves are derived from the embedding matrix E. So, we can also write the equation as:

(EWq)(EWk)ᵀ = EWqWkᵀEᵀ

Mechanistic interpretation refers to WqWkᵀ as Wp, the pattern weight matrix. Here, EWp can be intuited as producing a pattern that is then matched against the other copy of the embeddings E, yielding a score that can be used to weight the messages. Essentially, this reformulates the similarity calculation in attention as “pattern matching” and gives us a direct relationship between the similarity calculation and the embeddings.
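A quick numerical check of this claim, assuming random embeddings and projection weights: forming Wp = WqWkᵀ and matching it directly against the embeddings reproduces the usual QKᵀ scores.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, d_head = 5, 16, 8

E = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))

# Pattern weight matrix: W_p = W_q W_k^T, a map from model space to model space.
W_p = W_q @ W_k.T                          # (d_model, d_model)

scores_via_qk = (E @ W_q) @ (E @ W_k).T    # traditional QKᵀ
scores_via_wp = E @ W_p @ E.T              # pattern matching directly on embeddings

print(np.allclose(scores_via_qk, scores_via_wp))  # True
```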

Similarly, VO can be written as EWvO: the per-head value vectors, derived from the embeddings and projected into model space. Again, this reformulation gives us a direct relationship between the embeddings and the final output, instead of seeing attention as a sequence of steps. Another difference: while the traditional view of attention implies that the information contained in V is extracted using the queries in Q, the mechanistic view lets us think of the information packed into the messages as chosen by the embeddings themselves, and merely weighted by the patterns.
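The same check works for the message side: folding Wv and O into a single matrix (called Wvo below, purely for illustration) maps the embeddings straight to the per-head messages.

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len, d_model, d_head = 5, 16, 8

E = rng.normal(size=(seq_len, d_model))
W_v = rng.normal(size=(d_model, d_head))
W_o = rng.normal(size=(d_head, d_model))

# Message weight matrix: combines the value and output projections into one
# map from model space back to model space.
W_vo = W_v @ W_o                      # (d_model, d_model)

messages_stepwise = (E @ W_v) @ W_o   # V then O, as two separate steps
messages_direct = E @ W_vo            # straight from the embeddings

print(np.allclose(messages_stepwise, messages_direct))  # True
```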

Finally, attention in the pattern-message terminology is this: each token uses the obtained patterns to determine how much of each message to convey toward predicting the next token.

Image by Author

What this makes possible: Residual Stream

From my previous article, where we saw the additive reformulation of multi-head attention, and this one, where we reformulated the attention calculation directly in terms of embeddings, we can view each operation as additive to the initial embedding rather than a transformation of it. The residual connections in transformers, traditionally interpreted as skip connections, can then be reinterpreted as a residual stream that carries the embeddings; components like multi-head attention and the MLP read from it, compute something, and add their result back. This makes each operation an update to a persistent memory rather than a link in a transformation chain. The view is conceptually simpler, and it still preserves full mathematical equivalence. More on this here.
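Here is a rough sketch of that read-compute-add loop, with random linear maps standing in for attention and the MLP (layer norm and the actual attention machinery are omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
seq_len, d_model, n_blocks = 5, 16, 3

# Stand-in per-block "read, compute, write" maps; real blocks would use
# multi-head attention and an MLP, omitted here for brevity.
attn_maps = [rng.normal(size=(d_model, d_model)) * 0.01 for _ in range(n_blocks)]
mlp_maps = [rng.normal(size=(d_model, d_model)) * 0.01 for _ in range(n_blocks)]

stream = rng.normal(size=(seq_len, d_model))         # the stream starts as the embeddings E

for attn_w, mlp_w in zip(attn_maps, mlp_maps):
    stream = stream + stream @ attn_w                # attention reads the stream, adds its update
    stream = stream + np.maximum(0, stream @ mlp_w)  # the MLP does the same
```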

Image by Author

How does this relate to LSTM?

LSTM by Jonte Decker

To recap: LSTMs (Long Short-Term Memory networks) are a type of RNN designed to handle the vanishing gradient problem common in RNNs by storing information in a “cell”, allowing them to learn long-range dependencies in data. The LSTM cell (seen above) has two states: the cell state c for long-term memory and the hidden state h for short-term memory.

It also has three gates (forget, input and output) that control the flow of information into and out of the cell. Intuitively, the forget gate acts as a lever for how much of the long-term information to discard; the input gate acts as a lever for how much of the current input (combined with the hidden state) to add to long-term memory; and the output gate acts as a lever for how much of the updated long-term memory to pass on to the hidden state of the next time step.
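For reference, here is a minimal sketch of one LSTM cell update using the standard gate equations; the parameter shapes and initialization are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One standard LSTM cell update. W, U, b hold the stacked parameters
    for the forget, input, and output gates plus the candidate cell update."""
    z = x @ W + h_prev @ U + b                     # shape (4 * d_hidden,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates squashed to [0, 1]
    g = np.tanh(g)                                 # candidate content
    c = f * c_prev + i * g                         # forget old memory, add new input
    h = o * np.tanh(c)                             # output gate decides what to emit
    return h, c

# Walk a short random sequence token by token.
rng = np.random.default_rng(5)
d_in, d_hidden = 8, 16
W = rng.normal(size=(d_in, 4 * d_hidden)) * 0.1
U = rng.normal(size=(d_hidden, 4 * d_hidden)) * 0.1
b = np.zeros(4 * d_hidden)
h, c = np.zeros(d_hidden), np.zeros(d_hidden)
for x in rng.normal(size=(10, d_in)):
    h, c = lstm_step(x, h, c, W, U, b)
```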

The core difference between an LSTM and a transformer is that the LSTM is sequential and local, working on one token at a time, whereas a transformer works in parallel on the whole sequence. But they are similar in that both are fundamentally state-updating mechanisms, especially when the transformer is viewed through the mechanistic lens. So, the analogy is this:

  1. The cell state is similar to the residual stream, acting as long-term memory throughout.
  2. The input gate does the same job as pattern matching (similarity scoring) in determining which information is relevant for the current token under consideration; the only difference is that the transformer does this in parallel for all tokens in the sequence.
  3. The output gate is similar to the messages, determining which information to emit and how strongly.

By reframing attention as patterns (QKᵀ) and messages (VO), and reformulating residual connections as a persistent residual stream, mechanistic interpretation offers a powerful way to conceptualize transformers. Not only does this enhance interpretability, but it also aligns attention with broader paradigms of information processing—bringing it a step closer to the kind of conceptual clarity seen in systems like LSTMs.
