In language, the order of words is fundamental to meaning. The phrases "the dog chased the cat" and "the cat chased the dog" use the exact same words, but their sequence conveys entirely different events. However, the Transformer architecture, which powers most modern language models, processes all input tokens in parallel. This parallel processing makes it incredibly efficient but also inherently blind to the order of the tokens. Without a mechanism to understand sequence, the model would treat a sentence like an unordered bag of words.
To solve this, positional embeddings were introduced. These are vectors that provide the model with explicit information about the position of each token in the sequence. By combining token embeddings with positional embeddings, the model can learn to leverage word order and understand the contextual relationships that depend on it.
This article provides a mathematical deep dive into three pivotal positional embedding techniques, complete with code examples to solidify your understanding. We will explore:
- Absolute Positional Embeddings (APE): The original sinusoidal method proposed in the “Attention Is All You Need” paper, which assigns a unique positional vector to each absolute position. [2]
- Rotary Position Embedding (RoPE): An elegant approach that incorporates relative positional information by rotating the query and key vectors in the attention mechanism. [4]
- Attention with Linear Biases (ALiBi): A simple yet effective technique that avoids adding embeddings altogether, instead biasing the attention scores based on the distance between tokens. [5]
To build a solid foundation, let’s first begin with the building block for the original positional encoding: the sinusoidal wave.
Sinusoidal Wave
Let's start by understanding the sinusoidal wave (figure 1), described by equation 1 below. Here \(\omega\) is the angular frequency, \(u\) is the unit against which the wave is tracked, and \(\phi_0\) is the initial phase constant (often set to 0 in embeddings). The unit \(u\) can be distance (measured in metres), time (measured in seconds) or, for embeddings, the position. The angular frequency measures the amount by which the phase (the angle) changes per unit \(u\).
\[
y = \sin(\omega u + \phi_0) \tag{1}
\]
The wavelength \(\lambda\) is the distance between any two consecutive peaks (or any two consecutive troughs, the lowest points of the wave). In other words, once the wave has travelled a distance of one wavelength, it has completed one full rotation of \(2\pi\).
The following block derives the relationship between the wavelength and the angular frequency, showing that they are inversely proportional. From this inverse relation we can infer that when the angular frequency is high, the rate of oscillation is high: the wavelength is small and the wave repeats within a short distance. Conversely, when the angular frequency is low, the rate of oscillation is low and the wave's value barely changes from one position to the next.
\[
\begin{aligned}
y(u+\lambda) &= y(u) \\
\sin\!\big(\omega(u+\lambda)+\phi_0\big) &= \sin(\omega u+\phi_0) \\
\sin(\omega u+\phi_0+\omega\lambda) &= \sin(\omega u+\phi_0)
\end{aligned}
\]
\[
\text{LHS = RHS only if } \omega\lambda = 2\pi n,\quad n\in\mathbb{Z}.
\]
\[
\text{Since }\lambda\text{ is the distance between consecutive peaks/troughs, set }n=1
\;\Rightarrow\;
\lambda = \frac{2\pi}{\omega}. \tag{2}
\]
With this understanding of the sine wave, we are now in a position to understand Absolute Positional Embeddings.
Absolute Positional Embeddings
Absolute Positional Embeddings (APE) were first introduced in the "Attention Is All You Need" [2] paper, and they form the basis, directly or indirectly, for many of the positional embedding methods used today. In this section I will present a detailed mathematical background for the formulation presented in the paper. I'll start with the actual formulation used in the paper (refer to figure 2) and then break it down. The equation might seem simple, but it has many useful properties.

The variables used in the equation (reproduced below) are:
- \(\mathrm{pos}\) – the position of a token in the given sequence
- \(d_{\mathrm{model}}\) – the dimensionality of the position vector
- \(i\) – the \(i^{\text{th}}\) dimension (sin/cos pair index) of the position vector
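For reference, here is the sinusoidal formulation from the paper (the one shown in figure 2):
\[
\begin{aligned}
PE_{(\mathrm{pos},\,2i)} &= \sin\!\left(\frac{\mathrm{pos}}{10000^{2i/d_{\mathrm{model}}}}\right),\\
PE_{(\mathrm{pos},\,2i+1)} &= \cos\!\left(\frac{\mathrm{pos}}{10000^{2i/d_{\mathrm{model}}}}\right).
\end{aligned}
\]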
Let's see how this equation compares with the traditional sinusoidal wave.
For an even dimension \(2i\), the sine component is
\[
y_i(\mathrm{pos}) = \sin\!\left(\frac{\mathrm{pos}}{10000^{2i/d_{\mathrm{model}}}}\right)
= \sin\!\left(\frac{\mathrm{pos}}{B_i}\right). \tag{3}
\]
Comparing equation (3) with equation (1), we get:
\[
\omega = \frac{1}{B_i}. \tag{4}
\]
From equation 4, we can infer that as \(i\) increases, the angular frequency of the wave decreases, meaning that at higher dimensions the values barely change between the positional vectors of two different positions, while at lower dimensions the values change rapidly. This is a highly useful property because within a single vector we can encode both high frequency information (lower dimensions) and low frequency information (higher dimensions). We can demonstrate this by comparing the cosine similarity of the lower-dimension sub-vectors of adjacent positions with that of the higher-dimension sub-vectors. Readers can execute the code below to verify it for themselves:
import numpy as np

def get_pos_emb(theta, n_embd, ctx_size):
    thetas = theta ** ((-2 * np.arange(0, n_embd // 2)) / n_embd)  # n_embd // 2
    pos = np.arange(ctx_size)                                      # ctx_size
    freqs = np.outer(pos, thetas)                                  # ctx_size, n_embd // 2
    even_pos = np.sin(freqs)
    odd_pos = np.cos(freqs)
    pos_emb = np.stack((even_pos, odd_pos), axis=2)
    pos_emb = pos_emb.reshape(ctx_size, -1)                        # ctx_size, n_embd
    return pos_emb

def cosine_similarity(vec1, vec2):
    assert vec1.shape == vec2.shape
    dot = np.dot(vec1, vec2)
    mag1, mag2 = np.linalg.norm(vec1), np.linalg.norm(vec2)
    return dot / (mag1 * mag2)

theta = 10000
n_embd = 32
ctx_size = 32
pos_emb = get_pos_emb(theta, n_embd, ctx_size)
l = 5
u = 27
print(f'lower dims: {cosine_similarity(pos_emb[0, :l], pos_emb[1, :l])}')
print(f'higher dims: {cosine_similarity(pos_emb[0, u:], pos_emb[1, u:])}')
# output
# lower dims: 0.6769
# higher dims: 0.9999
At this point you might be asking, "Okay, but how is this useful?" I can explain its usefulness in two ways. The first explanation is that the model's weights can learn to associate the differences in the lower dimensions with the difference in token positions, and the similarity in the higher dimensions with long-range dependencies.
The second explanation involves breaking down the attention mechanism, which is what I’ll be doing below 🙂
Consider two tokens at positions \(m\) and \(n\) respectively. Their embeddings \(x_m\) and \(x_n\) are made up of the token embeddings \(w_m, w_n\) and the position embeddings \(p_m, p_n\). The attention mechanism computes the dot product between the query vector of \(m\) and the key vector of \(n\), which is broken down in equation 5.
\[
\begin{aligned}
q^\top k
&= (W_q x_m)^\top (W_k x_n) \\
&= \big(W_q(w_m+p_m)\big)^\top \big(W_k(w_n+p_n)\big) \\
&= (w_m^\top W_q^\top + p_m^\top W_q^\top)(W_k w_n + W_k p_n) \\
&= \underbrace{w_m^\top W_q^\top W_k w_n}_{\text{content–content}}
 + \underbrace{\big(w_m^\top W_q^\top W_k p_n + p_m^\top W_q^\top W_k w_n\big)}_{\text{content–position}}
 + \underbrace{p_m^\top W_q^\top W_k p_n}_{\text{position–position}} .
\end{aligned}
\tag{5}
\]
Since this is a blog post on the impact of position, the last term in equation 5, where the position embeddings interact, is our main point of interest. From figure 2, we know that a position embedding is made up of pairs of sin/cos waves, which form \(d/2\) sub-vectors of dimension 2. In the following example, for simplicity, we consider a single sin/cos pair and assume that the weight matrices \(W_q\) and \(W_k\) are identity matrices, which reduces the term to a plain dot product of \(p_m\) and \(p_n\). Equation 6 shows the result of this dot product for any single sin/cos pair: it depends only on the difference between the two positions. Generalizing to all \(d/2\) pairs gives equation 7.
\[
\begin{aligned}
\mathbf{p}_m &= \begin{bmatrix}
\sin\!\left(\frac{m}{B_i}\right)\\[2pt]
\cos\!\left(\frac{m}{B_i}\right)
\end{bmatrix},\quad
\mathbf{p}_n = \begin{bmatrix}
\sin\!\left(\frac{n}{B_i}\right)\\[2pt]
\cos\!\left(\frac{n}{B_i}\right)
\end{bmatrix},\\
\mathbf{p}_m^\top \mathbf{p}_n
&= \sin\!\left(\frac{m}{B_i}\right)\sin\!\left(\frac{n}{B_i}\right)
 + \cos\!\left(\frac{m}{B_i}\right)\cos\!\left(\frac{n}{B_i}\right) \\
&= \cos\!\left(\frac{m-n}{B_i}\right).
\end{aligned}
\tag{6}
\]
\[
\mathbf{p}_m^\top \mathbf{p}_n
= \sum_{i=0}^{\frac{d}{2}-1} \cos\!\left(\frac{m-n}{B_i}\right).
\tag{7}
\]
And from equation 7, we can infer that high frequency pairs (small \(B_i\), lower dimensions) are highly sensitive to changes in the relative position, while low frequency pairs (large \(B_i\), higher dimensions) change slowly with relative position, which helps preserve long-range dependencies. This is how having high frequency and low frequency components within the same vector helps the model learn about position and relative position.
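To make equation 7 concrete, here is a small check (reusing the get_pos_emb function from the earlier snippet; the specific positions are arbitrary) that the dot product of two position vectors depends only on their relative distance:
import numpy as np

pos_emb = get_pos_emb(theta=10000, n_embd=32, ctx_size=64)  # from the APE snippet above

# three pairs of positions, all with relative distance m - n = 3
print(np.dot(pos_emb[3], pos_emb[0]))
print(np.dot(pos_emb[10], pos_emb[7]))
print(np.dot(pos_emb[50], pos_emb[47]))  # all three print the same value

# a different relative distance gives a different value
print(np.dot(pos_emb[20], pos_emb[0]))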
I'd like to discuss one more property: the position vector \(p_{m+t}\) is just the pairwise clockwise rotation of the position vector \(p_m\), with each pair rotated by an angle \(t/B_i\). The proof is worked out in equation 8. We can easily extend this to the whole vector, where the rotation matrix becomes a sparse matrix of dimensions \(d_{\mathrm{model}} \times d_{\mathrm{model}}\).
\[
\begin{aligned}
\mathbf{p}_m &=
\begin{bmatrix}
\sin\left(\frac{m}{B_i}\right)\\[2pt]
\cos\left(\frac{m}{B_i}\right)
\end{bmatrix},\quad
\mathbf{p}_{m+t} =
\begin{bmatrix}
\sin\left(\frac{m+t}{B_i}\right)\\[2pt]
\cos\left(\frac{m+t}{B_i}\right)
\end{bmatrix},\\[4pt]
\mathbf{p}_{m+t}
&=
\begin{bmatrix}
\sin\left(\frac{m}{B_i}\right)\cos\left(\frac{t}{B_i}\right)
+\cos\left(\frac{m}{B_i}\right)\sin\left(\frac{t}{B_i}\right)\\
\cos\left(\frac{m}{B_i}\right)\cos\left(\frac{t}{B_i}\right)
-\sin\left(\frac{m}{B_i}\right)\sin\left(\frac{t}{B_i}\right)
\end{bmatrix}\\
&=
\underbrace{
\begin{bmatrix}
\cos\left(\frac{t}{B_i}\right) & \sin\left(\frac{t}{B_i}\right)\\
-\sin\left(\frac{t}{B_i}\right) & \cos\left(\frac{t}{B_i}\right)
\end{bmatrix}
}_{R\left(\frac{t}{B_i}\right)}
\begin{bmatrix}
\sin\left(\frac{m}{B_i}\right)\\
\cos\left(\frac{m}{B_i}\right)
\end{bmatrix}
= R\left(\frac{t}{B_i}\right)\mathbf{p}_m .
\end{aligned}
\tag{8}
\]
If you take a pair of dimensions (even, odd) and plot them in a 2-dimensional plane, you will get a nice circle because of the rotation property.
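Here is a quick numerical check of equation 8, again reusing get_pos_emb from the earlier snippet (the positions and pair index are arbitrary choices of mine): rotating the \(i^{\text{th}}\) (sin, cos) pair of \(p_m\) by \(t/B_i\) reproduces the corresponding pair of \(p_{m+t}\).
import numpy as np

theta, n_embd, ctx_size = 10000, 32, 64
pos_emb = get_pos_emb(theta, n_embd, ctx_size)    # from the APE snippet above

m, t, i = 5, 7, 3
B_i = theta ** (2 * i / n_embd)                   # wavelength scale of the i-th pair
angle = t / B_i

R = np.array([[ np.cos(angle), np.sin(angle)],    # R(t / B_i) from equation 8
              [-np.sin(angle), np.cos(angle)]])

pair_m = pos_emb[m, 2 * i: 2 * i + 2]             # (sin, cos) pair of p_m
pair_m_plus_t = pos_emb[m + t, 2 * i: 2 * i + 2]  # (sin, cos) pair of p_{m+t}
print(np.allclose(R @ pair_m, pair_m_plus_t))     # True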

In my second explanation of the usefulness of having high frequency and low frequency pairs, we made one small but important simplification: we assumed the weight matrices were identity matrices. If we remove this simplification, we still get the advantages discussed in this section, but we also get content–position and position–position terms that depend on the absolute positions rather than only on the relative distance between tokens. Ideally, we would like equation 5 to look something like equation 9, which contains just the content–content information plus a term that depends only on the relative position. This is where RoPE [4] comes in, and it is discussed in detail in the next section.
\[
q^\top k = w_m^\top W_q^\top W_k\, w_n + g(n-m).
\tag{9}
\]
Rotary Position Embedding (RoPE)
Okay, we now know how APE works, why it is good, and where it falls short. From the concluding paragraph of the previous section, we know that we need a formulation where only the relative position between tokens matters, not their absolute positions. To get such a formulation, the authors of RoPE [4] proposed the following set of equations as the solution:
\[
\begin{aligned}
q_m &= R(\theta, m)\, W_q\, x_m,\\
k_n &= R(\theta, n)\, W_k\, x_n,\\
q_m^\top k_n &= (W_q x_m)^\top\, R(\theta, n-m)\, (W_k x_n).
\end{aligned}
\tag{10}
\]
As you can see in equation 10, the query-key dot product depends only on the relative position of the tokens (through the rotation matrix \(R\), explained below) and their semantic content (the token embeddings). This removes the absolute-position leakage and the interaction between absolute position and semantic content that was present in APE (figure 3).
Note: I will only discuss the properties of the RoPE embedding and won't derive it here, because representing all the equations here became quite hard. For interested readers, I wrote a LaTeX doc that contains the derivation — here.
So what is \(R\) in equation 10? \(R\) is a rotation matrix: it rotates a vector, and the angle by which it rotates depends on the position of the vector in the sequence. For a two-dimensional vector at position \(n\), \(R\) is shown in equation 11. So how do we extend this to more than two dimensions, i.e., how do we rotate a \(d\)-dimensional vector? The authors take inspiration from the APE paper, where each pair of dimensions is rotated, leading to the sparse block-diagonal matrix shown in equation 12.
\[
R(\theta,n)=
\begin{bmatrix}
\cos(n\theta) & -\sin(n\theta)\\
\sin(n\theta) & \cos(n\theta)
\end{bmatrix}.
\tag{11}
\]
\[
R(\theta,n)=\mathrm{blkdiag}\!\big(R(n\theta_1),R(n\theta_2),\ldots,R(n\theta_L)\big)\in\mathbb{R}^{2L\times 2L},
\quad\text{where }
R(n\theta_i)=
\begin{bmatrix}
\cos(n\theta_i) & -\sin(n\theta_i)\\
\sin(n\theta_i) & \cos(n\theta_i)
\end{bmatrix}.
\tag{12}
\]
The rotation matrix is also an orthogonal matrix, which helps training stability because it preserves vector norms and therefore neither shrinks nor enlarges the gradients during backpropagation. The sparsity of the matrix also makes applying the rotation computationally cheap.
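To verify these properties numerically, here is a minimal sketch (build_rope_matrix is my own helper, not from the paper or any library) that constructs the block-diagonal matrix of equation 12 and checks both the orthogonality of \(R\) and the relative-position identity behind equation 10:
import numpy as np

def build_rope_matrix(n: int, d: int, theta: float = 10000.0) -> np.ndarray:
    # block-diagonal R(theta, n) from equation 12; d must be even
    angles = n * theta ** (-2 * np.arange(d // 2) / d)
    R = np.zeros((d, d))
    for i, a in enumerate(angles):
        R[2 * i: 2 * i + 2, 2 * i: 2 * i + 2] = [[np.cos(a), -np.sin(a)],
                                                 [np.sin(a),  np.cos(a)]]
    return R

d, m, n = 8, 3, 10
R_m, R_n, R_rel = build_rope_matrix(m, d), build_rope_matrix(n, d), build_rope_matrix(n - m, d)

print(np.allclose(R_m.T @ R_m, np.eye(d)))  # orthogonality: R^T R = I
print(np.allclose(R_m.T @ R_n, R_rel))      # R(theta, m)^T R(theta, n) = R(theta, n - m)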
Given below is a self-contained implementation of RoPE for readers who have gone through my derivation and are wondering how to implement it. The intuition behind this code and the way it is written (I implemented mine by referring to LLaMA's code base) is beautiful enough to deserve a separate blog post in itself; explaining it here would make this post too long, so I'll resist :)
import torch
from torch import Tensor

def get_freqs_cis(theta, n_embd, n_heads, ctx_size) -> Tensor:
    head_dim = n_embd // n_heads
    i = torch.arange(head_dim // 2)
    thetas = theta ** (-2 * i / head_dim)  # head_dim // 2
    pos = torch.arange(ctx_size)           # ctx_size
    freqs = torch.outer(pos, thetas)       # ctx_size, head_dim // 2
    # complex numbers cos(pos * theta_i) + i * sin(pos * theta_i)
    real = torch.cos(freqs)
    imag = torch.sin(freqs)
    return torch.complex(real, imag)

def apply_rot_emb(x: Tensor, freqs: Tensor) -> Tensor:
    # x -> bsz, n_heads, seq_len, head_dim; freqs -> ctx_size, head_dim // 2
    bsz, n_heads, seq_len, head_dim = x.shape
    half = head_dim // 2
    f = freqs[:seq_len]
    # view adjacent dimension pairs as complex numbers and rotate them
    x = x.reshape(bsz, n_heads, seq_len, half, 2)
    x_rot = torch.view_as_complex(x) * f.view(1, 1, seq_len, half)  # bsz, n_heads, seq_len, head_dim // 2
    x_real = torch.view_as_real(x_rot)                              # bsz, n_heads, seq_len, head_dim // 2, 2
    return x_real.reshape(bsz, n_heads, seq_len, head_dim)
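Here is a short usage sketch of the two functions above (the hyperparameters are arbitrary). By placing the same content vector at every position, it also checks that the RoPE query-key scores depend only on the relative offset between positions:
import torch

torch.manual_seed(0)
n_embd, n_heads, ctx_size = 64, 4, 32
head_dim = n_embd // n_heads
freqs = get_freqs_cis(theta=10000, n_embd=n_embd, n_heads=n_heads, ctx_size=ctx_size)

# the same content vector at every position, so any difference in the scores
# can only come from the positional rotation
q = torch.randn(head_dim).expand(1, n_heads, ctx_size, head_dim)
k = torch.randn(head_dim).expand(1, n_heads, ctx_size, head_dim)

q_rot, k_rot = apply_rot_emb(q, freqs), apply_rot_emb(k, freqs)
scores = q_rot @ k_rot.transpose(-2, -1)  # 1, n_heads, ctx_size, ctx_size

# positions (10, 7) and (20, 17) have the same offset of 3, so the scores match
print(torch.allclose(scores[0, 0, 10, 7], scores[0, 0, 20, 17], atol=1e-5))  # True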
Attention with Linear Biases (ALiBi)
If you have diligently gone through the code snippets for both APE and RoPE, a question that will naturally have arisen is: what happens if we go beyond the context window during inference? Since both APE and RoPE are computed by fixed formulas rather than learned per position, the positional embeddings can be extended beyond the context window, but do they perform well there? This is what the authors of ALiBi [5] explore; they propose ALiBi, which extrapolates better than APE and RoPE beyond the training context window and also trains faster than RoPE.
After the math-heavy positional embedding types, ALiBi should be easier to understand, and after you go through the experiments section, you will also be able to appreciate ALiBi and Occam's Razor.
ALiBi does not add positional embeddings to the token embeddings like APE, nor does it rotate the token embeddings to inject positional information like RoPE. Instead, it introduces a recency bias to certain heads while maintaining long-range dependency in others, by adding a static matrix to the attention matrix of each head. This is shown visually in figure 4, where \(m\) is called the 'slope', whose value changes based on the attention head. As you can see from the figure, as the distance between two tokens increases, the penalty on the dot product increases. RoPE has a related property called long-term decay, where the dot product between distant vectors is smaller than that between nearby vectors; so long-term decay is an important property of both RoPE and ALiBi.

To decide the value of \(m\) for a particular head, the authors use a simple geometric progression whose first term and ratio are both \(2^{-8/n_{\text{heads}}}\), where \(n_{\text{heads}}\) is the number of attention heads. The authors also experimented with making \(m\) learnable, but it did not give them good extrapolation results. Given below is the code implementation for ALiBi; it simply replicates figure 4 in PyTorch.
import torch
from torch import Tensor

def get_alibi_slopes(n_heads) -> Tensor:
    # geometric progression with first term and ratio both 2^(-8 / n_heads)
    start = 2 ** (-8 / n_heads)
    ratio = start
    return torch.tensor([start * (ratio ** i) for i in range(n_heads)])

def get_linear_bias(n_heads, ctx_size) -> Tensor:
    slopes = get_alibi_slopes(n_heads).view(n_heads, 1, 1)
    pos = torch.arange(ctx_size)
    distances = pos[None, :] - pos[:, None]
    distances = torch.where(distances > 0, 0, distances)  # ctx_size, ctx_size; keep only non-positive (past) offsets
    distances.unsqueeze_(0)                                # 1, ctx_size, ctx_size
    linear_bias = distances * slopes                       # n_heads, ctx_size, ctx_size
    return linear_bias.unsqueeze(0)                        # 1, n_heads, ctx_size, ctx_size
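To see where this bias plugs in, here is a minimal usage sketch (the shapes are arbitrary and the causal masking is simplified): the linear bias is simply added to the raw attention scores before the softmax.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bsz, n_heads, ctx_size, head_dim = 2, 4, 16, 8

q = torch.randn(bsz, n_heads, ctx_size, head_dim)
k = torch.randn(bsz, n_heads, ctx_size, head_dim)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # bsz, n_heads, ctx_size, ctx_size
scores = scores + get_linear_bias(n_heads, ctx_size)  # ALiBi bias broadcasts over the batch

# the usual causal mask, then softmax
mask = torch.triu(torch.ones(ctx_size, ctx_size, dtype=torch.bool), diagonal=1)
attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(attn.shape)  # torch.Size([2, 4, 16, 16])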
Experiments
After going through all the theory, it is always best to implement these different types of embeddings, train language models that use them, and compare them on a few metrics. I have provided the complete code for the experiments in this repo - https://github.com/SkAndMl/llama and the results of my experiments on the TinyStories dataset [6] (CDLA-Sharing-1.0 license) are given below. I urge readers to download the code and train the different models locally, as it will also give you a feel for training language models.
Figures 5 and 6 show the training loss, validation loss, perplexity and time taken to train a 28M-parameter LLaMA-inspired language model with 4 heads, 4 layers, a context size of 256 and an embedding dimension of 256 on an M2 MacBook Pro. All models were trained for 1000 steps with a batch size of 32. As you can see, ALiBi and RoPE perform similarly and are much better than APE, which has to deal with the "extra terms" in the attention mechanism. One distinct advantage ALiBi has over RoPE is that it is faster to train, as can be seen from figure 6. To be honest, I can offer no theoretical explanation or intuition for why one performed better than the other, since most deep learning ideas can currently only be validated through experiments.


Conclusion
Oh god, I had so much fun writing this article, reading about positional embeddings, coding them and understanding the math behind them. This took me about a month to write and it helped me understand a lot about positional embeddings. I hope you guys enjoy it too. Cheers!
References:
1. https://medium.com/autonomous-agents/math-behind-positional-embeddings-in-transformer-models-921db18b0c28
2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 6000–6010.
3. Sine wave — https://simple.wikipedia.org/wiki/Sine_wave
4. Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
5. Press, O., Smith, N. A., & Lewis, M. (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv:2108.12409.
6. TinyStories dataset: https://huggingface.co/datasets/roneneldan/TinyStories