This is the third chapter of the in-progress book on linear algebra, “A bird’s eye view of linear algebra”. The table of contents so far:
Stay tuned for future chapters.
Here, we will describe the operations we can do with two matrices, keeping in mind that they are just representations of linear maps.
I) Why care about matrix multiplication?
Almost any information can be embedded in a vector space. Images, video, language, speech, biometric information and whatever else you can imagine. And all the applications of machine learning and artificial intelligence (like the recent chat-bots, text to image, etc.) work on top of these vector embeddings. Since linear algebra is the science of dealing with high dimensional vector spaces, it is an indispensable building block.
A lot of the techniques involve taking some input vectors from one space and mapping them to other vectors from some other space.
But why the focus on “linear” when most interesting functions are non-linear? It’s because the problem of making our models high dimensional and that of making them non-linear (general enough to capture all kinds of complex relationships) turn out to be orthogonal to each other. Many neural network architectures work by using linear layers with simple one-dimensional non-linearities in between them. And the universal approximation theorem says that this kind of architecture can approximate essentially any function.
Since the way we manipulate high-dimensional vectors is primarily matrix multiplication, it isn’t a stretch to say it is the bedrock of the modern AI revolution.

II) Algebra on maps
In chapter 2, we learnt how to quantify linear maps with determinants. Now, let’s do some algebra with them. We’ll need two linear maps and a basis.

II-A) Addition
If we can add matrices, we can add linear maps, since matrices are representations of linear maps. And matrix addition is not very interesting if you know scalar addition. Just as with vectors, it’s only defined when the two matrices have the same size (same number of rows and columns) and involves lining them up and adding element by element.

So, we’re just doing a bunch of scalar additions. Which means that the properties of scalar addition logically extend.
Commutative: if you switch, the result won’t twitch
A+B = B+A
But commuting to work might not be commutative since going from A to B might take longer than B to A.
Associative: in a chain, don’t refrain, take any 2 and continue
A+(B+C) = (A+B)+C
Identity: And here I am where I began! That’s no way to treat a man!
The presence of a special element that when added to anything results in the same thing. In the case of scalars, it is the number 0. In the case of matrices, it is a matrix full of zeros.
A + 0 = A or 0 + A = A
Also, with addition it is possible to start at any element and end up at any other. So it must be possible to start at A and end up at the additive identity, 0. The thing that must be added to A to achieve this is called the additive inverse of A, written -A.
A + (-A) = 0
For matrices, you get the additive inverse by replacing each scalar element with its own additive inverse (switching its sign if the scalars are numbers).
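As a quick sanity check, here is a minimal numpy sketch (the matrices are made up for the example) of element-wise addition, the zero matrix as the additive identity, and the additive inverse:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A + B)                          # element-by-element addition
Z = np.zeros_like(A)                  # the additive identity: a matrix full of zeros
print(np.array_equal(A + Z, A))       # True
print(np.array_equal(A + (-A), Z))    # True: adding the additive inverse gives the zero matrix
```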
II-B) Subtraction
Subtraction is just addition with the additive inverse of the second matrix instead.
A-B = A+(-B)
II-C) Multiplication
We could have defined matrix multiplication just as we defined matrix addition: take two matrices of the same size (rows and columns) and multiply the scalars element by element. There is a name for that kind of operation: the Hadamard product.
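As a small aside, here is what that element-wise product looks like in numpy, where `*` on arrays is element-wise and `@` is the standard matrix product we’ll get to in section III (the matrices are arbitrary):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20], [30, 40]])

print(A * B)   # Hadamard product: multiply corresponding entries
print(A @ B)   # standard matrix multiplication, defined in section III
```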
But no, we defined matrix multiplication as a far more convoluted operation, more “exotic” than addition. And it isn’t complex just for the sake of it. It is the most important operation in linear algebra by far.
It enjoys this special status because it’s the means by which linear maps are applied to vectors, building on top of dot products.
The way it actually works requires a dedicated section, so we’ll cover that in section III. Here, let’s list some of its properties.
Commutative
Unlike addition, matrix multiplication is not always commutative. Which means that the order in which you apply linear maps to your input vector matters.
A.B != B.A
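A tiny numpy example (with arbitrarily chosen matrices) makes the point:

```python
import numpy as np

A = np.array([[0, 1], [0, 0]])
B = np.array([[0, 0], [1, 0]])

print(A @ B)   # [[1, 0], [0, 0]]
print(B @ A)   # [[0, 0], [0, 1]] -- a different matrix, so A.B != B.A here
```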
Associative
It is still associative
A.B.C = A.(B.C) = (A.B).C
And there is a lot of depth to this property, as we will see in section IV.
Identity
Just like addition, matrix multiplication also has an identity element, I: an element that, when multiplied with any matrix, gives back that same matrix. The big caveat is that this element is itself square, and a single two-sided identity (with A.I = I.A = A) exists only when A is square.
Now, because of the importance of matrix multiplication, “the identity matrix” in general is defined as the identity element of matrix multiplication (not that of addition or the Hadamard product for example).
The identity element for addition is a matrix composed of 0’s and that of the Hadamard product is a matrix composed of 1’s. The identity element of matrix multiplication is:
I =
[1 0 … 0]
[0 1 … 0]
[⋮ ⋮ ⋱ ⋮]
[0 0 … 1]
So, 1’s on the main diagonal and 0’s everywhere else. What kind of definition for matrix multiplication would lead to an identity element like this? We’ll need to describe how it works to see, but first let’s go to the final operation.
II-D) Division
Just as with addition, the presence of an identity matrix suggests that a matrix, A, can be multiplied with another matrix, A^-1, and taken to the identity. This A^-1 is called the inverse of A (it exists only when A is square and, recalling chapter 2, its determinant is non-zero). Since matrix multiplication isn’t commutative, there are two ways to do this. Thankfully, both lead to the identity matrix.
A.(A^-1) = (A^-1).A = I
So, “dividing” a matrix by another is simply multiplying by the second one’s inverse, A.B^-1. If matrix multiplication is very important, then so is this operation, since it undoes it. It is also related to how we historically developed (or maybe stumbled upon) linear algebra. But more on that in the next chapter (4).
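A short numpy sketch of the inverse and of “division” (the matrices are arbitrary but invertible):

```python
import numpy as np

A = np.array([[3., 1.], [2., 1.]])
A_inv = np.linalg.inv(A)

print(np.allclose(A @ A_inv, np.eye(2)))   # True: A.(A^-1) = I
print(np.allclose(A_inv @ A, np.eye(2)))   # True: (A^-1).A = I

B = np.array([[2., 0.], [1., 1.]])
print(A @ np.linalg.inv(B))                # "dividing" A by B: A.B^-1
```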
Another property we’ll be using, one that combines addition and multiplication, is the distributive property. It applies to all kinds of matrix multiplication, from the traditional one to the Hadamard product:
A.(B+C) = A.B + A.C
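And a quick numerical check of the distributive property (with random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.random((3, 3)), rng.random((3, 3)), rng.random((3, 3))

print(np.allclose(A @ (B + C), A @ B + A @ C))   # True for the traditional product
print(np.allclose(A * (B + C), A * B + A * C))   # True for the Hadamard product too
```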
III) Why is matrix multiplication defined this way?
We have arrived at last at the section where we will answer the question in the title, the meat of this chapter.
Matrix multiplication is the way linear maps act on vectors. So, we get to motivate it that way.
III-A) How are linear maps applied in practice?
Consider a linear map that takes m dimensional vectors (from R^m) as input and maps them to n dimensional vectors (in R^n). Let’s call the m dimensional input vector, v.
At this point, it might be helpful to think of yourself actually coding up this linear map in some programming language. It should be a function that takes the m-dimensional vector, v as input and returns the n dimensional vector, u.
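A first, deliberately naive sketch of such a function might look like this (the function name and the choice of n are just for illustration):

```python
import numpy as np

def f_naive(v, n=3):
    """Supposedly maps an m-dimensional vector v to an n-dimensional vector u."""
    return np.random.random(n)   # just generate some n-dimensional vector at random
```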
The linear map has to take this vector and turn it into an n dimensional vector somehow. In the function above, you’ll notice we just generated some vector at random. But this completely ignores the input vector, v. That’s unreasonable; v should have some say. Now, v is just an ordered list of m scalars, v = [v1, v2, v3, …, vm]. What do scalars do? They scale vectors. And the output vector we need should be n dimensional. How about we take m fixed vectors (pulled out of thin air, each n dimensional): w1, w2, …, wm. Then, scale w1 by v1, w2 by v2 and so on, and add them all up. This leads to an equation for our linear map (with the output on the left).
u = f(v) = v1.w1 + v2.w2 + … + vm.wm    (1)
Make note of the equation (1) above since we’ll be using it again.
Since the w1, w2,… are all n dimensional, so is u. And all the elements of v=[v1, v2, …, vm] have an influence on the output, u. The idea in equation (1) is implemented below. We take some randomly generated vectors for the w’s but with fixed seeds (ensuring that the vectors are the same across every call of the function).
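Here is a minimal implementation of that idea (the seed value and the default output dimension n are arbitrary choices for the example):

```python
import numpy as np

def f(v, n=3):
    """Linear map from R^m to R^n following equation (1): u = v1.w1 + v2.w2 + ... + vm.wm."""
    m = len(v)
    rng = np.random.default_rng(seed=42)       # fixed seed: the same w's on every call
    w = [rng.random(n) for _ in range(m)]      # m fixed n-dimensional vectors w1, ..., wm
    return sum(v[i] * w[i] for i in range(m))  # scale each wi by vi and add them all up

u = f(np.array([2., 3., 1.]))                  # a 3-dimensional v mapped to a 3-dimensional u
print(u)
```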
We have a way now to “map” m dimensional vectors (v) to n dimensional vectors (u). But does this “map” satisfy the properties of a linear map? Recall from chapter-1, section II the properties of a linear map, f (here, a and b are vectors and c is a scalar):
f(a+b) = f(a) + f(b)
f(c.a) = c.f(a)
It’s clear that the map specified by equation (1) satisfies both properties: plugging a+b into equation (1) gives (a1+b1).w1 + … + (am+bm).wm, which regroups into f(a) + f(b), and plugging in c.a pulls the scalar c out of every term, giving c.f(a).


The m vectors, w1, w2, …, wm are arbitrary, and no matter what we choose for them, the function f defined in equation (1) is a linear map. So, different choices for those w vectors result in different linear maps. Moreover, for any linear map you can imagine, there is some choice of vectors w1, w2, … that, plugged into equation (1), represents it.
Now, for a given linear map, we can collect the vectors w1, w2,… into the columns of a matrix. Such a matrix will have n rows and m columns. This matrix represents the linear map, f and its multiplication with an input vector, v represents the application of the linear map, f to v. And this application is where the definition of matrix multiplication comes from.

We can now see why the identity element for matrix multiplication is the way it is:
I.v = v1.[1,0,…,0] + v2.[0,1,…,0] + … + vn.[0,0,…,1] = [v1, v2, …, vn] = v
We start with a column vector, v and end with a column vector, u (so just one column for each of them). And since the elements of v must align with the column vectors of the matrix representing the linear map, the number of columns of the matrix must equal the number of elements in v. More on this in section III-C.
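A quick numpy check that the matrix-vector product really is this linear combination of columns (matrix and vector chosen arbitrarily):

```python
import numpy as np

M = np.array([[1., 4.],
              [2., 5.],
              [3., 6.]])   # n=3 rows, m=2 columns; the columns play the role of w1 and w2
v = np.array([10., 100.])  # an m=2 dimensional input vector

u1 = M @ v                            # matrix-vector multiplication
u2 = v[0] * M[:, 0] + v[1] * M[:, 1]  # equation (1): v1.w1 + v2.w2
print(np.allclose(u1, u2))            # True
```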
III-B) Matrix multiplication as a composition of linear maps
Now that we have described how a matrix is multiplied by a vector, we can move on to multiplying a matrix by another matrix.
The definition of matrix multiplication is much more natural when we consider the matrices as representations of linear maps.
Linear maps are functions that take a vector as input and produce a vector as output. Let’s say the linear maps corresponding to two matrices are f and g. How would you think of adding these maps (f+g)?
(f+g)(v) = f(v)+g(v)
This is reminiscent of the distributive property: the argument goes inside the bracket to both functions, and we add the results. If we fix a basis, this corresponds to applying both linear maps to the input vector and adding the results. By the distributive property of matrix-vector multiplication, this is the same as adding the matrices corresponding to the linear maps and applying the resulting matrix to the vector.
Now, let’s think of multiplication (f.g).
(f.g)(v) = f(g(v))
Since linear maps are functions, the most natural interpretation of multiplication is to compose them (apply them one at a time, in sequence to the input vector).
When two matrices are multiplied, the resulting matrix represents the composition of the corresponding linear maps. Consider matrices A and B; the product AB embodies the transformation achieved by applying the linear map represented by B to the input vector first and then applying the linear map represented by A.
So we have a linear map corresponding to the matrix A and a linear map corresponding to the matrix B. We’d like to know the matrix, C, corresponding to the composition of the two linear maps. So, applying B to any vector first and then applying A to the result should be equivalent to just applying C.
A.(B.v) = C.v = (A.B).v
In the last section, we learnt how to multiply a matrix and a vector. Let’s do that twice for A.(B.v). Say the columns of B are the column vectors, b1, b2, …, bm. From equation (1) in the previous section,
A.(B.v) = A.(v1.b1 + v2.b2 + … + vm.bm) = v1.(A.b1) + v2.(A.b2) + … + vm.(A.bm)    (2)
And what if we applied the linear map corresponding to C=A.B directly to the vector, v? Say the column vectors of the matrix C are c1, c2, …, cm.
C.v = v1.c1 + v2.c2 + … + vm.cm
Comparing the two equations above we get,
c1 = A.b1,  c2 = A.b2,  …,  cm = A.bm    (3)
So, the columns of the product matrix, C=AB are obtained by applying the linear map corresponding to matrix A to each of the columns of the matrix B. And collecting those resulting vectors into a matrix gives us C.
We have just extended our matrix-vector multiplication result from the previous section to the multiplication of two matrices: break the second matrix into a collection of column vectors, multiply the first matrix by each of them, and collect the resulting vectors into the columns of the result matrix.
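In code, that column-by-column recipe might look like this (a numpy sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 3))   # 4 x 3
B = rng.random((3, 2))   # 3 x 2

# Column j of C is the matrix A applied to column j of B.
C = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])
print(np.allclose(C, A @ B))   # True: same as the built-in matrix product
```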

So the entry in the first row and first column of the result matrix, C, is the dot product of the first row of A and the first column of B. And in general, the entry in the i-th row and j-th column of C is the dot product of the i-th row of A and the j-th column of B. This is the definition of matrix multiplication most of us first learn.

Associative proof
We can also show that matrix multiplication is associative now. Instead of the single vector, v, let’s apply the product C=AB individually to a group of vectors, w1, w2, …, wl. Let’s say the matrix that has these as column vectors is W. We can use the exact same trick as above to show:
(A.B).W = A.(B.W)
It’s because (A.B).w1 = A.(B.w1) and the same for all the other w vectors.
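A quick numerical sanity check of associativity (random matrices of compatible shapes):

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, W = rng.random((2, 3)), rng.random((3, 4)), rng.random((4, 5))

print(np.allclose((A @ B) @ W, A @ (B @ W)))   # True
```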
Sum of outer products
Say we’re multiplying two matrices A and B:
C = A.B, where A has n rows and k columns and B has k rows and m columns (so C is n x m).
Equation (3) can be generalized to show that the i,j element of the resulting matrix, C is:
c_{i,j} = a_{i,1}.b_{1,j} + a_{i,2}.b_{2,j} + … + a_{i,k}.b_{k,j}    (4)
We have a sum over k terms. What if we took each of those terms and created k individual matrices out of them? For example, the first matrix will have a_{i,1}.b_{1,j} as its i,j-th entry. The k matrices and their relationship to C:
C = C_1 + C_2 + … + C_k, where the l-th matrix, C_l, has a_{i,l}.b_{l,j} as its i,j-th entry.
This process of summing over k matrices can be visualized as follows (reminiscent of the animation in section III-A that visualized a matrix multiplied to a vector):

We see here the sum over k matrices all of the same size (nxm) which is the same size as the result matrix, C. Notice in equation (4) how for the first matrix, A, the column index stays the same while for the second matrix, B, the row index stays the same. So the k matrices we’re getting are the matrix products of the i-th column of A and the i-th row of B.
Matrix multiplication as a sum of outer products. Image by author.
Inside the summation, a column vector is multiplied by a row vector to produce a matrix. This is a special case of matrix multiplication applied to vectors (which are themselves special cases of matrices), and it’s called the “outer product”. Here is yet another animation to show this sum-of-outer-products process:

This tells us why the number of row vectors in B should be the same as the number of column vectors in A: they have to be paired up to form the individual outer-product matrices.
We’ve seen a lot of visualizations and some math, now let’s see the same thing via code for the special case where A and B are square matrices. This is based on section 4.2 of the book “Introduction to Algorithms”, [2].
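Here is a sketch of that in Python, along the lines of the SQUARE-MATRIX-MULTIPLY procedure in [2] (the triple loop), together with the outer-product view from above; the implementation details are my own:

```python
import numpy as np

def square_matrix_multiply(A, B):
    """Triple-loop multiplication of two n x n matrices: c_ij = sum over l of a_il * b_lj."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):             # for each row of A
        for j in range(n):         # for each column of B
            for l in range(n):     # dot product of row i of A with column j of B
                C[i, j] += A[i, l] * B[l, j]
    return C

def multiply_by_outer_products(A, B):
    """The same product, computed as a sum of outer products of columns of A with rows of B."""
    n = A.shape[0]
    return sum(np.outer(A[:, l], B[l, :]) for l in range(n))

rng = np.random.default_rng(3)
A, B = rng.random((4, 4)), rng.random((4, 4))
print(np.allclose(square_matrix_multiply(A, B), A @ B))        # True
print(np.allclose(multiply_by_outer_products(A, B), A @ B))    # True
```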
III-C) Matrix multiplication: the structural choices

Matrix multiplication seems to be structured in a weird way. It’s clear that we need to take a bunch of dot products, so one of the dimensions has to match. But why make the number of columns of the first matrix equal the number of rows of the second one?
Won’t it make things more straightforward if we redefine it in a way that the number of rows of the two matrices should be the same (or the number of columns)? This would make it much easier to identify when two matrices can be multiplied.
The traditional definition, where we require the rows of the first matrix to align with the columns of the second one, has more than one advantage. Let’s go first to matrix-vector multiplication. Animation (1) in section III-A showed us how the traditional version works. Let’s visualize what happens if we instead required the rows of the matrix to align with the elements of the vector. Now, the n rows of the matrix will need to align with the n elements of the vector.

We see that we’d have to start with a column vector, v with n rows and one column and end up with a row vector, u with 1 row and m columns. This is awkward and makes defining an identity element for matrix multiplication challenging since the input and output vectors can never have the same shape. With the traditional definition, this isn’t an issue since the input is a column vector and the output is also a column vector (see animation (1)).
Another consideration is multiplying a chain of matrices. In the traditional method, it is so easy to see first of all that the chain of matrices below can be multiplied together based on their dimensionalities.

Further, we can tell that the output matrix will have l rows and p columns.
In the framework where the rows of the two matrices should line up, this quickly becomes a mess. For the first two matrices, we can tell that the rows should align and that the result will have n rows and l columns. But visualizing how many rows and columns the result will have, and then reasoning about whether it’ll be compatible with C, etc., becomes a nightmare.

And that is why we require the rows of the first matrix to align with the columns of the second matrix. But maybe I missed something. Maybe there is an alternate definition that is “cleaner” and manages to side-step these two challenges. Would love to hear ideas in the comments 🙂
III-D) Matrix multiplication as a change of basis
So far, we’ve thought of matrix multiplication with vectors as a linear map that takes a vector as input and returns some other vector as output. But there is another way to think of matrix multiplication — as a way to change perspective.
Let’s consider two-dimensional space, R². We represent any vector in this space with two numbers. What do those numbers represent? The coordinates along the x-axis and y-axis. A unit vector that points just along the x-axis is [1,0] and one that points along the y-axis is [0,1]. These are our basis for the space. Every vector now has an address. For example, the vector [2,3] means we scale the first basis vector by 2 and the second one by 3.
But this isn’t the only basis for the space. Someone else (say, he who shall not be named) might want to use two other vectors as their basis. For example, the vectors e1=[3,2] and e2=[1,1]. Any vector in the space R² can also be expressed in their basis. The same vector would have different representations in our basis and their basis. Like different addresses for the same house (perhaps based on different postal systems).
When we’re in the basis of he who shall not be named, the vector e1 = [1,0] and the vector e2 = [0,1] (which are the basis vectors from his perspective, by definition of basis vectors). The functions that translate vectors from our basis system to his and vice versa are linear maps, and so the translations can be represented as matrix multiplications. Let’s call the matrix that takes vectors from our basis to his M1, and the matrix that does the opposite M2. How do we find these matrices?

We know that the vectors we call e1=[3,2] and e2=[1,1] are the ones he who shall not be named calls [1,0] and [0,1]. Let’s collect our versions of these vectors into the columns of a matrix.
[3 1]
[2 1]
And also collect the vectors, e1 and e2 of he who shall not be named into the columns of another matrix. This is just the identity matrix.
[1 0]
[0 1]
Since matrix multiplication operates independently on the columns of the second matrix,
M1 . [3 1]  =  [1 0]
     [2 1]     [0 1]
Multiplying both sides on the right by the inverse of our matrix gives us M1:
M1 = [ 1 -1]
     [-2  3]
Doing the same thing in reverse gives us M2:
M2 = [3 1]
     [2 1]
This can all be generalized into the following statement: a matrix with column vectors w1, w2, …, wn translates vectors expressed in a basis where w1, w2, …, wn are the basis vectors to our basis.
And the inverse of that matrix translates vectors from our basis to the one where w1, w2, …, wn are the basis.
All invertible square matrices (those whose columns form a basis) can hence be thought of as “basis changers”.
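A numpy sketch of the running example (using the basis vectors e1=[3,2] and e2=[1,1] from above; the variable names are mine):

```python
import numpy as np

# Columns are his basis vectors expressed in our coordinates.
M2 = np.array([[3., 1.],
               [2., 1.]])      # translates vectors from his basis to ours
M1 = np.linalg.inv(M2)         # translates vectors from our basis to his

v_his = np.array([2., 3.])                 # "scale e1 by 2 and e2 by 3", in his basis
v_ours = M2 @ v_his                        # the same vector in our basis: [9., 7.]
print(v_ours)
print(np.allclose(M1 @ v_ours, v_his))     # True: translating back recovers [2., 3.]
```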
Note: In the special case of an orthogonal matrix (where every column is a unit vector and orthogonal to every other column), the inverse is the same as the transpose. So, changing to the basis of the columns of such a matrix becomes equivalent to taking the dot product of the vector with each of those columns (the rows of the transpose).
For more on this, see the 3B1B video, [1].
Conclusion
Matrix multiplication is arguably one of the most important operations in modern computing and in almost every data science field. Understanding deeply how it works is important for any data scientist. Most linear algebra textbooks describe the “what” but not why it’s structured the way it is. Hopefully this blog post filled that gap.
[1] 3B1B (3Blue1Brown) video on change of basis: https://www.youtube.com/watch?v=P2LTAUO1TdA&t=2s
[2] Introduction to Algorithms by Cormen et al., third edition
[3] Matrix multiplication as a sum of outer products: https://math.stackexchange.com/questions/2335457/matrix-at-a-as-sum-of-outer-products
[4] Catalan numbers, Wikipedia: https://en.wikipedia.org/wiki/Catalan_number