How Important is Linear Algebra for Deep Learning? (Part 2 of 3)

Eman
10 min read · Sep 8, 2024


Algebra you didn’t know you needed!

Hey there! Welcome back :-)

If you’re here, I’m guessing you’ve already tackled Part 1 of this series. If not, hit pause and check it out; it’ll make everything here a lot smoother.

In math, concepts are like building blocks: miss one, and the whole thing might feel shaky.

This is Part 2 of a 3-part series on Linear Algebra, where I’ll cover the essential concepts you’ll need to grasp the underlying principles of deep learning.

Ready? Let’s go!

Linear Algebra

Part 2:

In this section, we’re picking up right where we left off:

  1. Vector Spaces
  2. Linear Transformations
  3. Linear Systems
  4. Eigenvectors and Eigenvalues

1. Vector Spaces

A vector space (or linear space) is a collection of all possible vectors with defined addition and scalar multiplication operations that satisfy certain properties.

You might be thinking, “All possible vectors?? Isn’t a vector just something with magnitude and direction (or an ordered list of numbers)?”

Well, yes — but there’s more to the story...

Originally, vectors were used in geometry and physics to describe things like forces and velocities, quantities with both magnitude and direction.

As these ideas developed, mathematicians realized that vectors could be more than just arrows in space. They started to see that vectors could represent any object that followed similar rules of combination and scaling. So instead of talking only about individual vectors, they introduced a broader concept: vector spaces.

For the sake of brevity, I am not explaining the rules here; you can check out the 👉 Properties of Vector Space in detail.

⋙ So, does that mean anything following these rules is a vector??

Yep! Anything, whether it’s a point in space, a function, or even a more abstract object. These rules ensure vectors behave consistently, making them useful across various fields, including deep learning.

Image from 3Blue1Brown
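If you’d like to see a couple of those rules in action, here’s a minimal NumPy sketch (the vectors and the scalar are arbitrary values picked just for illustration) checking that ordinary numeric vectors behave the way a vector space demands:

```python
import numpy as np

# Two arbitrary vectors in the same 3D vector space, plus an arbitrary scalar
u = np.array([1.0, 2.0, 3.0])
v = np.array([-4.0, 0.5, 2.0])
c = 2.5

# Addition and scalar multiplication keep us inside the space:
# the results are still 3D vectors of real numbers.
print(u + v)   # [-3.   2.5  5. ]
print(c * u)   # [ 2.5  5.   7.5]

# Two of the vector-space properties, checked numerically:
print(np.allclose(u + v, v + u))            # addition is commutative
print(np.allclose(c * (u + v), c*u + c*v))  # scalar multiplication distributes
```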

But wait, can there be different vector spaces? Isn’t it just one big collection of all possible vectors?

Not exactly. Different vector spaces exist because they can vary in dimension, the types of vectors they contain, and how operations are defined on them.

For instance, a 2D vector space is different from a 3D vector space because they have different dimensions. Similarly, a vector space of polynomials is different from a vector space of geometric vectors, even though both follow the same basic rules.

The vector space that represents images in a neural network is different from the vector space that represents text in a language model. Each vector space is tailored to the specific type of data or operations it’s designed to handle.

Now that we’ve established what vector spaces are, the next question is: “How can we move between or transform vectors in these spaces?”

This brings us to linear transformations, which help us modify vectors without breaking the rules of the vector space.

2. Linear Transformations

A linear transformation is a function that maps vectors from one vector space to another (or within the same space) while preserving the structure of the vector space.

Linear transformations are deeply connected to vector spaces because they operate within these spaces. They take vectors and transform them in a way that respects the space’s rules.

In other words, it’s a way to transform vectors like rotating, scaling, or projecting them without breaking the rules that define their space.

Why Linear?

The term linear refers to the fact that the transformation involves linear operations — specifically, addition and scalar multiplication. This means that if you apply the transformation to a combination of vectors, it behaves in a predictable way:

  1. Additivity: If you have two input vectors x1 and x2, the transformation of their sum equals the sum of their individual transformations: T(x1 + x2) = T(x1) + T(x2).
  2. Homogeneity: If you scale an input vector by a scalar c, the transformation of the scaled vector equals the scaled transformation of the original vector: T(c·x) = c·T(x).
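Here’s a tiny NumPy check of both properties for a 2×2 matrix transformation T(x) = Ax; the matrix, vectors, and scalar are arbitrary values chosen just for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])  # an arbitrary transformation matrix

def T(x):
    """The linear transformation T(x) = A @ x."""
    return A @ x

x1 = np.array([1.0, 2.0])
x2 = np.array([-3.0, 0.5])
c = 4.0

# Additivity: T(x1 + x2) == T(x1) + T(x2)
print(np.allclose(T(x1 + x2), T(x1) + T(x2)))  # True

# Homogeneity: T(c * x1) == c * T(x1)
print(np.allclose(T(c * x1), c * T(x1)))       # True
```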

3Blue1Brown

Matrix Transformation

To implement a linear transformation in practice, we use a matrix. This is known as a matrix transformation. It could scale, rotate, shift, or shear the vector depending on the properties of the matrix.

For example,

y = Ax

Applying matrix A to an input vector x produces a transformed output vector y.
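As a concrete, made-up example, here are a scaling matrix and a 90° rotation matrix applied to the same 2D vector:

```python
import numpy as np

x = np.array([1.0, 1.0])             # an arbitrary input vector

scale = np.array([[2.0, 0.0],        # stretches the x-axis by 2, shrinks y by 0.5
                  [0.0, 0.5]])
rotate_90 = np.array([[0.0, -1.0],   # rotates vectors 90° counter-clockwise
                      [1.0,  0.0]])

print(scale @ x)      # [2.  0.5]  -> scaled
print(rotate_90 @ x)  # [-1.  1.]  -> rotated
```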

But wait a sec… Did you notice I said “depending on the properties of the matrix”?

You’re probably thinking, “Huh? You didn’t mention that there are different types of matrices in Part 1!” And you’d be absolutely right! 😅

That’s because I thought this would be the perfect moment to introduce the idea. You see, matrices have specific properties that define the kind of transformations they can perform. Some matrices are really good at scaling vectors… some are used for rotating them, and others can shear or shift them.

Think of it like this: once you know the type of matrix, you can predict how it will transform any vector it’s multiplied by.

Don’t worry, I’m not going to bombard you with all the different types of matrices right now. We’ll keep it simple, and if you’re curious, check here!

Matrix Multiplication in Deep Learning

The most common way to represent a linear transformation in deep learning is through matrix multiplication, often expressed as y = Wx + b, where:

  • 𝑥 is the input vector (representing your data).
  • 𝑊 is the weight matrix (this represents the transformation itself).
  • 𝑏 is the bias vector (which shifts the result, but doesn’t affect the linearity).
  • 𝑦 is the output vector (the transformed data).
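In code, that equation is just a matrix-vector product plus a bias. A minimal NumPy sketch, with arbitrary sizes and random values standing in for real data and learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=4)       # input vector with 4 features
W = rng.normal(size=(3, 4))  # weight matrix: maps 4 features to 3 outputs
b = rng.normal(size=3)       # bias vector

y = W @ x + b                # the linear transformation y = Wx + b
print(y.shape)               # (3,)
```

In PyTorch, this is essentially what a torch.nn.Linear(4, 3) layer does, with W and b stored as learnable parameters.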

The interesting part …

When a neural network is first initialized, the weight matrices start off as general matrices filled with random values. The network doesn’t know anything about the data yet, so its transformation of input vectors is essentially random. But as the network trains, these matrices acquire specific properties over time, because they are optimized through training processes like backpropagation.

Linear Transformation in Neural Networks

  • Neural Network Layers: Each layer in a neural network can be thought of as applying a linear transformation to the data. The data, represented as vectors, passes through these layers, and each layer applies a transformation using a matrix that modifies its features in some way.
  • Feature Extraction: Linear transformations help emphasize important features of the data while suppressing irrelevant ones. This is crucial in tasks like image recognition, where certain features (like edges or colors) are more important than others.
  • Dimensionality Reduction: Sometimes the data you’re working with is too high-dimensional to process efficiently. Linear transformations allow us to reduce the dimensionality of the data while preserving as much of the relevant information as possible. Techniques like Principal Component Analysis (PCA) are based on this concept (we’ll discuss it in Part 3). 🤫 There’s a small sketch of this idea right after this list.
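Projecting data down with a single matrix multiplication is itself a linear transformation. Here’s an illustrative sketch using a random projection (not PCA itself; the dimensions and data are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

X = rng.normal(size=(100, 512))  # 100 made-up samples with 512 features each
P = rng.normal(size=(512, 32))   # projection matrix: 512 dimensions -> 32

X_reduced = X @ P                # every sample is linearly transformed
print(X_reduced.shape)           # (100, 32)
```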

⋙ But doesn’t deep learning also involve non-linearities?

Oh yeah! While linear transformations form the foundation, neural networks also rely on non-linear activation functions (e.g., ReLU) to introduce complexity and allow the model to learn from data that isn’t linearly separable. However, even in the presence of non-linearities, the underlying linear transformations are crucial for structuring the data and guiding the learning process.
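As a tiny illustration of “linear transformation plus non-linearity”, here’s one layer with a ReLU on top (all values are arbitrary):

```python
import numpy as np

def relu(z):
    """ReLU activation: keeps positive values, zeroes out the rest."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)

linear_part = W @ x + b     # the linear transformation
output = relu(linear_part)  # the non-linearity applied on top
print(linear_part)
print(output)
```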

phew…😮‍💨

3. Linear Systems

A linear system is a collection of linear equations involving multiple variables, where the goal is to find the values of variables that satisfy all the equations.

What is a Linear Equation?

A mathematical equation that expresses a straight-line relationship between variables. It takes the form

a1x1 + a2x2 + … + anxn = b

where:

  • x1, x2, …, xn are the variables (or features),
  • a1, a2, …, an are the coefficients (weights) that scale the variables,
  • b is a constant.

The key feature of linear equations is that they describe proportional relationships: a change in one variable produces a proportional, consistent change in the outcome.

These variables could represent pixel intensities in an image, or feature values in any dataset, and the coefficients represent how much each feature contributes to the outcome.

For Example:

A simple linear system with two variables x and y could look like this:

Linear Equations

In this system, x and y represent the unknowns we want to solve for. The solution to this system would be the values of x and y that make both equations true simultaneously.

3Blue1Brown
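The specific equations shown in the image aren’t reproduced here, but as an illustrative stand-in, here’s how you could solve a small two-variable system numerically with NumPy (the coefficients are made up):

```python
import numpy as np

# Illustrative system (made-up coefficients):
#   2x + 1y = 5
#   1x - 1y = 1
A = np.array([[2.0,  1.0],
              [1.0, -1.0]])
b = np.array([5.0, 1.0])

solution = np.linalg.solve(A, b)  # the values of x and y satisfying both equations
print(solution)                   # [2. 1.]  -> x = 2, y = 1
```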

Solving a Linear System in Deep Learning

In deep learning, solving a linear system means adjusting the weights and biases of the network so that the model can accurately predict outputs from given inputs. This process is done iteratively during training using optimization algorithms like gradient descent.

A key aspect of neural networks is that they are composed of multiple layers, as you know, and each layer represents its own linear system. The output of one layer becomes the input of the next layer, forming a chain of transformations.

For example:

Let’s say our neural network has three layers. The process looks like this:

  1. Layer 1: The input vector x is transformed by weight matrix W1 and bias vector b1, producing the output y1 = W1x + b1.
  2. Layer 2: The output of Layer 1, y1, is passed to Layer 2, where another linear transformation is applied using W2 and b2, giving y2 = W2y1 + b2.
  3. Layer 3: The same process happens at Layer 3, giving y3 = W3y2 + b3. The final output y3 is the result of solving a series of linear systems across multiple layers (see the sketch right after this list).
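Here’s a minimal sketch of that three-layer chain. The layer sizes are arbitrary, and activation functions are left out to keep the focus on the linear part:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_layer(in_dim, out_dim):
    """Create an arbitrary weight matrix and bias vector for one layer."""
    return rng.normal(size=(out_dim, in_dim)), rng.normal(size=out_dim)

W1, b1 = make_layer(8, 16)
W2, b2 = make_layer(16, 16)
W3, b3 = make_layer(16, 4)

x  = rng.normal(size=8)  # input vector
y1 = W1 @ x  + b1        # Layer 1
y2 = W2 @ y1 + b2        # Layer 2: the output of Layer 1 becomes the input here
y3 = W3 @ y2 + b3        # Layer 3: the final output
print(y3.shape)          # (4,)
```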

Each layer represents its own linear system, and the network’s ability to learn complex patterns comes from the combination of these transformations. Solving each system correctly is what allows the network to process input data effectively.

4. Eigenvectors and Eigenvalues

We’ve covered a lot of ground, so let’s bring everything together with “eigen-stuff”. Don’t worry, they aren’t as scary as they sound!

Eigenvectors are special vectors that don’t change direction under a specific linear transformation — they only get stretched or compressed. The amount of this stretching or compressing is given by the eigenvalue.

Remember how we said linear transformations can rotate, scale, or shear vectors?

You can think of an eigenvector as a vector that, when transformed by a matrix (linear transformation), only changes in magnitude but not direction. The amount by which its magnitude is scaled is determined by its corresponding eigenvalue.

Mathematically,

Av = λv

where:

  • A is our transformation matrix
  • v is the eigenvector
  • λ (lambda) is the eigenvalue

To put it simply, multiplying the matrix A by the eigenvector v keeps the direction the same; the vector v is only stretched (or shrunk) by the eigenvalue λ.

If

  • λ > 1, the vector gets stretched (it becomes longer).
  • 0 < λ < 1, the vector gets compressed (it becomes shorter).
  • λ = 1, the vector stays the same length.
  • λ = 0, the vector collapses to the zero vector.
  • λ < 0, the vector is flipped (negative eigenvalues).
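If you’re curious, you can check the relation Av = λv numerically with NumPy; the matrix below is just an arbitrary example:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])  # an arbitrary transformation matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)          # the eigenvalues of A (here 3 and 2)

# Check A v = λ v for the first eigenvector/eigenvalue pair
v = eigenvectors[:, 0]
lam = eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True
```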

Note:

Direction vs. Orientation: When I say an eigenvector “doesn’t change direction” under a transformation, I’m referring to the line along which it lies, not its orientation (which way it points along that line).

Eigenvectors and eigenvalues might seem abstract, but they help us understand how data behaves under certain transformations. They help identify important patterns and stabilize model training.

Now, you might be thinking,

“This all sounds great in theory, but in real-world deep learning we don’t actually work directly with eigenvectors and eigenvalues. We use frameworks like TensorFlow or PyTorch, so we’re not calculating or manipulating these values ourselves… right?”

And you’d be absolutely right!

Let me show you the answer with a practical example…

Let’s suppose …

you’re fine-tuning a pre-trained model for a classification task using TensorFlow or PyTorch.

If you notice that the training loss isn’t decreasing, or the network weights start showing NaN values (I’ve been there: what’s that NaN?! I know, I’m dumb af 🥲), you might be facing vanishing or exploding gradients.

Roles of Eigenvectors and Eigenvalues:

  • Exploding Gradients: If the eigenvalues of the weight matrices are too large (|λ| > 1), they can cause the gradients to grow exponentially as they are passed back through the network, leading to very large updates and numerical instability.
  • Vanishing Gradients: If the eigenvalues are too small (|λ| < 1), the gradients may become very small, effectively halting progress because they are too tiny to update the weights significantly.
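Here’s a tiny NumPy illustration of why eigenvalue magnitude matters: repeatedly multiplying a vector by a matrix whose eigenvalues sit above (or below) 1 makes it blow up (or shrink toward zero), which is roughly what happens to gradients flowing back through many layers. The matrices are made up just to show the effect:

```python
import numpy as np

v = np.array([1.0, 1.0])

big   = np.diag([1.5, 1.5])  # eigenvalues of 1.5 (> 1)
small = np.diag([0.5, 0.5])  # eigenvalues of 0.5 (< 1)

v_big, v_small = v.copy(), v.copy()
for _ in range(20):          # pretend the signal passes back through 20 layers
    v_big   = big   @ v_big
    v_small = small @ v_small

print(np.linalg.norm(v_big))    # grows exponentially -> "exploding"
print(np.linalg.norm(v_small))  # shrinks toward zero -> "vanishing"
```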

How Eigenvalues Help Troubleshoot Issues

Understanding how eigenvalues affect gradient flow through the network allows you to troubleshoot and adjust your network’s training process strategically.

Here are a few things you can do:

  1. Gradient Clipping (a quick sketch follows this list)
  2. Adjust Weight Initialization
  3. Revise Network Architecture
  4. Adjust Activation Functions
  5. Batch Normalization
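As an example of the first item, here’s roughly what gradient clipping looks like inside a PyTorch training step. The model, data, and max_norm value below are placeholders, not a recommendation:

```python
import torch
import torch.nn as nn

# Placeholder model, optimizer, and data, just to show where clipping goes
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 10)               # a made-up batch of 32 samples
targets = torch.randint(0, 2, (32,))  # made-up class labels

optimizer.zero_grad()
loss = loss_fn(model(x), targets)
loss.backward()

# Cap the overall gradient norm before the update to tame exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```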

Understanding these concepts can make troubleshooting much simpler. When you know the cause, finding the solution becomes so much easier, right? 😶

Enough for Today!

I hope you got something from this! If you did, don’t forget to 👏 50 times…
