How Important is Linear Algebra for Deep Learning? (Part 1 of 3)

Eman
12 min read · Sep 4, 2024


The algebra you didn’t know you needed!

If you’ve been working with machine learning or deep learning models, you’re likely familiar with the basics: neural networks, backpropagation, and perhaps even some optimization techniques.

When it comes to the mathematical foundations that underpin these concepts, such as Linear Algebra, Calculus, and Probability & Statistics, you may feel a bit overwhelmed or even intimidated, especially if it’s been a while since you last studied them.

This brings up two important questions:

  1. Why should I even bother with the math if I already know how to work with these models?

When you understand the underlying principles, you gain deeper insights into why certain models behave the way they do, how to troubleshoot issues more effectively, and how to innovate beyond pre-built frameworks. This will help you make better decisions, and who knows, maybe you’ll even develop a new algorithm one day (okay, that might be thinking big… 😶). It takes you from being a user of these models to a true expert. Isn’t that cool?

2. Do I really need to spend years mastering all of this?

In my opinion, not at all. While these fields are indeed important, the key is to build a strong foundation in the concepts that are most relevant to understanding deep learning models and their inner workings. This is essential for truly grasping what’s happening “under the hood.”

Sometimes, smart work can complement hard work.

I had the opportunity to study these topics during my college days, and I’ve continued to explore them since then. However, I understand that not everyone has had this background; that’s precisely why I’m here to share what I’ve learned and help others on their journey.

Article Structure

I’ve written a series of articles specifically designed for those who are familiar with deep learning models but want to dive deeper into the underlying mathematics to really connect the dots.

This is Part 1 of a 3-part series on Linear Algebra, where I’ll cover the essential concepts you’ll need to grasp the underlying principles of deep learning.

Now, let’s get started!

Linear Algebra

Linear algebra is the branch of mathematics that focuses on linear equations. It is widely applied in science and engineering, and in machine learning in particular. Linear algebra is also central to almost all areas of mathematics, such as geometry and functional analysis.

In linear algebra, data is represented in the form of linear equations. These linear equations are in turn represented in the form of matrices and vectors.

Part 1:

In this section, I’ll cover these essential concepts of Linear Algebra in detail:

  1. Scalar
  2. Vectors
  3. Dot Product and Projections
  4. Matrices and Matrix Operations
  5. Norms and Distance Metrics

Why Linear Algebra Matters in Deep Learning

Before we explore the fundamental concepts, let’s take a moment to understand why Linear Algebra is so important in deep learning.

Deep learning models are built on data—lots of it. To make sense of this data, we need to organize and manipulate it efficiently, and this is where Linear Algebra comes into play. It provides the mathematical framework that allows us to:

  • Represent Data
  • Perform Transformations
  • Understand Model Mechanics

1. Scalar

A scalar is simply a single number.

For example: 24, 5, -1, or 3, etc.

In linear algebra, scalars often appear in operations involving matrices and vectors. Instead of referring to “multiplying a number with a matrix,” we use the term “scalar-matrix multiplication.” This operation scales the matrix or vector by the scalar, and the outcome depends on the specific context — it could be another matrix or a transformed vector.

It may seem abstract now, but as you delve deeper, it will become clearer. I just wanted to introduce it here, so yeah…
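To make it a little more concrete, here’s a minimal NumPy sketch of scalar-vector and scalar-matrix multiplication (the values are arbitrary, chosen only for illustration):

```python
import numpy as np

s = 2.5                           # a scalar: just a single number
v = np.array([1.0, -2.0, 3.0])    # a vector
M = np.array([[1, 2],
              [3, 4]])            # a 2 x 2 matrix

print(s * v)   # scalar-vector multiplication: [ 2.5 -5.   7.5]
print(s * M)   # scalar-matrix multiplication: every entry is scaled by 2.5
```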

2. Vectors

A vector is a mathematical entity that has both magnitude (how much) and direction (where to).

But there’s another way to think about vectors: as an ordered list of numbers that tells you how to move in different directions.

Why an Ordered list?

Because the sequence of numbers in a vector is important. Changing the order changes the vector itself.

I’m assuming you’re familiar with basic dimension concepts!

Mathematically, a vector v in 2D, 3D, and n-dimensional space is written as v = [v₁, v₂], v = [v₁, v₂, v₃], and v = [v₁, v₂, …, vₙ] respectively.

The first number tells you how far to move horizontally, and the second number tells you how far to move vertically. If we add a third number, it would tell you how far to move in depth, adding another dimension to the space. Similarly, in n-dimensional space, each additional number tells you how far to move in that corresponding dimension.

For example:

The vector v in 2D space with two components [3, 4] essentially instructs you to move 3 units in the x-direction and 4 units in the y-direction.

Magnitude of a Vector

The length of an arrow (vector) is called its magnitude. It can be calculated using the Pythagorean Theorem.

For our example, the magnitude is ∥v∥ = √(3² + 4²) = √25 = 5.

  • This tells us that the vector moves a total distance of 5 units from the origin.
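Here’s a small NumPy sketch of the same vector and its magnitude (np.linalg.norm computes the length for us):

```python
import numpy as np

v = np.array([3.0, 4.0])          # move 3 units along x, 4 units along y

magnitude = np.linalg.norm(v)     # sqrt(3**2 + 4**2)
print(magnitude)                  # 5.0
```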

Why Vectors Matter in Deep Learning

When you input data into a deep learning model, that data is often converted into vectors that the model can process.

For instance:

  • Text Processing: Words in a sentence can be represented as vectors using techniques like word embeddings.
  • Image Processing: Images are represented as vectors by flattening the pixel values into a long list (see the small sketch below).
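As a rough illustration (the pixel values here are made up), a tiny grayscale “image” can be flattened into a single vector that a model can take as input:

```python
import numpy as np

image = np.array([[0.0, 0.5],
                  [1.0, 0.25]])   # a tiny 2 x 2 grayscale "image"

x = image.flatten()               # flatten into a 4-dimensional input vector
print(x)                          # [0.   0.5  1.   0.25]
print(x.shape)                    # (4,)
```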

3. Dot Product and Projection

In linear algebra, dot product and projection are two fundamental concepts that, while distinct, are deeply interconnected and play crucial roles in deep learning.

The dot product is a mathematical operation that takes two vectors and returns a single number, or scalar.

Given two vectors a and b, the dot product a · b is calculated component-wise (the algebraic form):

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ

Geometrically, the dot product can be interpreted as a measure of how much one vector aligns with another.

Geometric form: a · b = ∥a∥ ∥b∥ cos(θ), where θ is the angle between the two vectors.
  • When two vectors point in the same direction, their dot product is positive and large because the angle between them is zero, and cos(0°) = 1. This indicates strong alignment between the vectors.
  • If the vectors are perpendicular, their dot product is zero because the angle between them is 90 degrees, and cos(90°) = 0. This shows that there is no alignment between the vectors.
  • When the vectors point in opposite directions, their dot product is negative because the angle between them is 180 degrees, and cos(180°) = −1. This negative value reflects that the vectors are aligned in opposite directions, indicating disagreement.

The dot product is at the HEART of how neural networks process information. Each neuron computes the dot product between its input vector (such as pixel values of an image) and a weight vector (the learned parameters).

This operation helps determine how much the input aligns with the learned patterns, guiding the network’s decision-making process.
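A minimal NumPy sketch of that idea, with made-up inputs, weights, and bias (real networks also apply a nonlinear activation afterwards):

```python
import numpy as np

x = np.array([0.2, 0.8, 0.5])     # input vector (e.g. pixel values)
w = np.array([0.7, -0.1, 0.4])    # weight vector (learned parameters)
b = 0.1                           # bias term

pre_activation = np.dot(x, w) + b # the neuron's dot product plus bias
print(pre_activation)             # 0.2*0.7 + 0.8*(-0.1) + 0.5*0.4 + 0.1 = 0.36
```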

Projection

Projection builds on the concept of dot product, translating the abstract measure of alignment into a concrete vector.

The projection of vector a onto another vector b represents the component of a that lies in the direction of b.

In simpler terms, the dot product tells us “how much” two vectors agree, while projection shows you “what part” of one vector aligns with the other.

This is calculated using the formula:

proj_b(a) = ((a · b) / ∥b∥²) b

where the dot product a · b measures the alignment, and the resulting vector shows how much of a is “projected” along the direction of b.

Projection is key to understanding how data is transformed as it moves through the layers of a network. As data is projected onto different directions (learned features), the network extracts and focuses on the most relevant aspects of the input data.
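Here’s a small sketch computing the projection of a onto b with NumPy (the vectors are chosen arbitrarily):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])          # project onto the x-axis direction

proj_a_on_b = (np.dot(a, b) / np.dot(b, b)) * b
print(proj_a_on_b)                # [3. 0.] -> the part of a that lies along b
```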

4. Matrices and Matrix Operations

A matrix is a rectangular array of numbers arranged in rows and columns. It can be thought of as a collection of vectors.

Note: The term “matrices” is the plural form of “matrix”.

A matrix looks like this:

A = [ a₁₁  a₁₂  a₁₃ ]
    [ a₂₁  a₂₂  a₂₃ ]
    [ a₃₁  a₃₂  a₃₃ ]

Here, matrix A has 3 rows and 3 columns, making it a 3 x 3 matrix.

In the context of deep learning, matrices are fundamental because they provide a structured way to represent and process data efficiently, and this structured representation is what allows neural networks to perform complex calculations on large datasets.
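For instance, here’s a 3 x 3 matrix in NumPy (the numbers are arbitrary):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(A.shape)   # (3, 3) -> 3 rows, 3 columns
print(A[0])      # the first row, itself a vector: [1 2 3]
```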

(Figure: a matrix visualized as a plane of rows and columns.)

Basic Matrix Operations

Operations on matrices are used to transform data, combine features, and perform calculations that are central to training and running neural networks.

Here are some of the most common operations on matrices that are crucial in deep learning. If you understand how these operations work, you’ll have a strong foundation for understanding how deep learning models work.

  • Matrix Addition
  • Matrix Multiplication
  • Element-wise Multiplication
  • Matrix Transpose

1 — Matrix Addition:

Matrix addition is straightforward — you can add two matrices only if they have the same dimensions. This is because addition happens element-wise, meaning each element in one matrix is added to the corresponding element in the other matrix.

Matrix addition is often used to combine outputs from multiple neurons or layers within a neural network.
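A quick NumPy sketch of element-wise matrix addition, with arbitrary values (both matrices must share the same shape):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

print(A + B)   # [[11 22]
               #  [33 44]] -> each element added to its counterpart
```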

2 — Matrix Multiplication:

This is where things get a bit more interesting. For two matrices to be multiplied, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting matrix takes its number of rows from the first matrix and its number of columns from the second: an m x n matrix times an n x p matrix gives an m x p matrix.

Mathematically, if C = AB, then each entry of C is the dot product of a row of A with a column of B:

cᵢⱼ = aᵢ₁b₁ⱼ + aᵢ₂b₂ⱼ + … + aᵢₙbₙⱼ

Let me show you how it works with a small example:

[1 2]   [5 6]   [1·5 + 2·7   1·6 + 2·8]   [19 22]
[3 4] × [7 8] = [3·5 + 4·7   3·6 + 4·8] = [43 50]

You might be thinking,

But what if the condition n = m (number of columns of the first = number of rows of the second) isn’t met?

In such cases, standard matrix multiplication isn’t POSSIBLE (undefined).

If possible, you might be able to reshape one or both matrices to make their dimensions compatible for multiplication. This is common in scenarios where data needs to be restructured or flattened.

And if reshaping isn’t suitable, consider using element-wise multiplication (the Hadamard product), provided the two matrices have the same shape.

Matrix multiplication also drives the transformation of data as it moves through the layers of a neural network: each layer’s weight matrix is multiplied by the input matrix to produce the next layer’s output.

Note: Dot product is a fundamental operation used within individual neurons, while matrix multiplication comes into play when we consider the entire layer of neurons at once.
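To illustrate the note above, here’s a small NumPy sketch: a single neuron is a dot product, while a whole layer of neurons is one matrix multiplication. The weights and inputs are made-up values, and the bias and activation are omitted for brevity:

```python
import numpy as np

x = np.array([0.2, 0.8, 0.5])        # one input vector with 3 features

w = np.array([0.7, -0.1, 0.4])       # weights of a single neuron
print(np.dot(x, w))                  # single neuron: a dot product -> 0.26

W = np.array([[0.7, -0.1, 0.4],      # weights of a layer with 2 neurons
              [0.3,  0.5, -0.2]])    # shape (2, 3)
print(W @ x)                         # whole layer: matrix multiplication -> [0.26 0.36]
```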

3 — Element-Wise Multiplication (Hadamard Product)

It involves multiplying the corresponding elements of two matrices. This operation requires that the matrices have the same dimensions.

For example:

Let’s say we have two matrices, A and B, both with dimensions 2 x 3 (2 rows, 3 columns). Since the number of columns of the first doesn’t match the number of rows of the second, standard matrix multiplication between them isn’t defined, but element-wise multiplication works just fine.

Here’s how the multiplication works: each element of the result is the product of the corresponding elements, (A ∘ B)ᵢⱼ = aᵢⱼ · bᵢⱼ.

Unlike matrix multiplication, the Hadamard product is much simpler because it doesn’t involve the complex row-by-column multiplication but rather a direct element-by-element multiplication.

This operation is particularly useful in certain scenarios, such as applying masks in neural networks or element-wise operations in convolutional layers.
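A small NumPy sketch with arbitrary 2 x 3 matrices:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[10, 20, 30],
              [40, 50, 60]])

print(A * B)   # element-wise (Hadamard) product:
               # [[ 10  40  90]
               #  [160 250 360]]
```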

4 — Matrix Transpose

The transpose of a matrix is an operation that flips the matrix over its diagonal, effectively switching its rows and columns.

For example:

Given a matrix A of dimension m x n, its transpose Aᵀ has dimension n x m, with the element in row i and column j of Aᵀ equal to the element in row j and column i of A.

Matrix transposition is often used to adjust matrix dimensions, particularly when performing operations such as backpropagation, where the dimensions of gradients must align correctly with the weight matrices in a neural network.
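A quick NumPy sketch with an arbitrary 2 x 3 matrix:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)

print(A.T)                   # shape (3, 2):
                             # [[1 4]
                             #  [2 5]
                             #  [3 6]]
```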

5. Norms and Distance Metrics

A norm is a function that measures the size or length of a vector or matrix.

Example:

  • For a vector: If you have a vector [3, 4], the norm will tell you how long this vector is. Think of it as the length of a line from (0, 0) to the point (3, 4) on a graph.
  • For a matrix: Imagine a matrix as a table of numbers. The norm of this matrix measures how “large” or “spread out” the numbers are across the table.

Types of Norms:

  1. L1 Norm (Manhattan Norm): The sum of the absolute values of the components (the absolute value of a number is its non-negative value, ignoring sign). It measures the “Manhattan” distance, which is why it’s called the “Manhattan Norm”.

For a vector v = [1, -2, 3], the L1 norm is calculated as:

∥v∥₁ = |1| + |−2| + |3| = 1 + 2 + 3 = 6

2. L2 Norm (Euclidean Norm): It’s the square root of the sum of the squares of a vector’s components, which measures the (Euclidean) distance of the vector from the origin. It is the most commonly used norm and often referred to as the “standard” norm in Euclidean space.

For the same vector v, the L2 norm is:

∥v∥₂ = √(1² + (−2)² + 3²) = √14 ≈ 3.74

3. Frobenius Norm: The Frobenius norm is the square root of the sum of the squares of all elements in a matrix. It’s the analogue of the L2 norm, applied to every element of the matrix:

∥A∥F = √(Σᵢⱼ aᵢⱼ²)

where ∥A∥F denotes the Frobenius norm of matrix A. (A small sketch of all three norms follows below.)
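Here’s a minimal NumPy sketch computing all three norms (np.linalg.norm handles each case; the matrix values are arbitrary):

```python
import numpy as np

v = np.array([1.0, -2.0, 3.0])

print(np.linalg.norm(v, 1))       # L1 norm: |1| + |-2| + |3| = 6.0
print(np.linalg.norm(v, 2))       # L2 norm: sqrt(1 + 4 + 9) ≈ 3.7417

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))   # Frobenius norm: sqrt(1 + 4 + 9 + 16) ≈ 5.477
```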

Application In Deep Learning:

  1. Regularization: Norms are directly used in regularization techniques like L1 regularization (Lasso) and L2 regularization (Ridge). In neural networks, this means adding a term to the loss function that penalizes large weights based on their L1 or L2 norm, which helps the model generalize better.
  2. Optimization: During training, norms help control the size of weight updates. In gradient descent, for example, if the gradient is very large, techniques such as gradient clipping use its norm to keep the update steps from becoming too big, which could otherwise destabilize training (see the sketch after this list).
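A rough sketch of both ideas, assuming a hypothetical weight vector w, gradient g, and loss value data_loss (all names and values are made up for illustration):

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])        # hypothetical model weights
g = np.array([4.0, -3.0, 12.0])       # hypothetical gradient of the loss
data_loss = 0.8                       # hypothetical data term of the loss

# L2 regularization: add a penalty on large weights to the loss
lam = 0.01
total_loss = data_loss + lam * np.sum(w ** 2)

# Gradient clipping: rescale the gradient if its L2 norm exceeds a threshold
max_norm = 5.0
g_norm = np.linalg.norm(g)
if g_norm > max_norm:
    g = g * (max_norm / g_norm)       # gradient now has norm exactly max_norm

print(total_loss, np.linalg.norm(g))
```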

Distance Metric

After understanding norms, it’s important to grasp how they relate to “distance metrics”.

A distance metric measures how far apart two vectors, points, or objects are from each other.

Norms can be used to define distance metrics, which are essential in measuring how “far apart” points (or vectors) are in a space.

Types of Distance Metric:

  • Euclidean Distance: Derived from the L2 norm, it measures the straight-line distance between two points in space.

This metric is commonly used in many machine learning algorithms, such as K-nearest neighbors (KNN) and clustering algorithms like K-means.

Note: For two vectors A and B, the distance is measured between their endpoints: d(A, B) = ∥A − B∥₂ = √(Σᵢ (Aᵢ − Bᵢ)²).

  • Manhattan Distance: Based on the L1 norm, it measures the distance between two points along the axes at right angles, d(A, B) = ∥A − B∥₁ = Σᵢ |Aᵢ − Bᵢ|, which is akin to navigating a grid-like path.

This metric is useful in scenarios where movement is restricted to horizontal and vertical directions, such as in certain optimization problems and grid-based games.
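A minimal NumPy sketch of both distances for two arbitrary points:

```python
import numpy as np

A = np.array([1.0, 2.0])
B = np.array([4.0, 6.0])

euclidean = np.linalg.norm(A - B)     # straight-line distance: sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(A - B))     # grid distance: 3 + 4 = 7.0

print(euclidean, manhattan)
```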

Take a Break, But Stay Tuned!

You’ve really put in the work — your brain deserves a break! Take some time to digest these concepts.

If you have any questions or need clarification on any topic covered, feel free to drop a comment below.

If you found this article helpful in clarifying some concepts, show some support by giving it a 👏

See ya’!
