Mathematics for Machine Learning
Course Notes
Suraj Singh Khurana
Spring 2026
Suppose a real-estate website wants to predict house prices from features like size and number of bedrooms. To do anything, we first need some data:
| Size (\(x_1\)) | Bedrooms (\(x_2\)) | Price (Rs. lakhs) |
|---|---|---|
| 10 | 2 | 70 |
| 20 | 3 | 130 |
Here \[\begin{aligned} x_1 &= \text{house size (in 100 sq ft)}, \\ x_2 &= \text{number of bedrooms}, \\ w_1 &= \text{price contribution per unit size (Rs. lakhs per 100 sq ft)}, \\ w_2 &= \text{price contribution per bedroom (Rs. lakhs)}. \end{aligned}\]
Now we want to build a model that predicts the price of a house based on its size and number of bedrooms. We assume that the price contributions from size and number of bedrooms are linear and additive.
Thus the linear model for the price (in Rs. lakhs) is \[\hat y = w_1 x_1 + w_2 x_2,\] where \(\hat y\) denotes the predicted price. Here \(\hat y\) is a linear combination of the input features \(x_1\) and \(x_2\) with the weights \(w_1\) and \(w_2\) representing the model parameters. The word feature refers to an individual measurable property or characteristic of the data being analyzed. In this case, the features are the size of the house and the number of bedrooms. To find the best-fitting weights \(w_1\) and \(w_2\), we can set up a system of equations based on the provided data points. From the data, we have two equations: \[\begin{aligned} 70 &= w_1 \cdot 10 + w_2 \cdot 2, \\ 130 &= w_1 \cdot 20 + w_2 \cdot 3. \end{aligned}\] We can solve this system of equations to find the values of \(w_1\) and \(w_2\). Rearranging the equations, we get: \[\begin{aligned} 10 w_1 + 2 w_2 &= 70, \\ 20 w_1 + 3 w_2 &= 130. \end{aligned}\] We can solve this system using methods such as substitution, elimination, or matrix methods. Substitution or elimination would be straightforward here, but for larger systems, matrix methods are more efficient. Let us therefore express the system in matrix form: \[\begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}70 \\ 130\end{bmatrix}.\] Let \(A = \begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix}\), \(\mathbf{w} = \begin{bmatrix}w_1 \\ w_2\end{bmatrix}\), and \(\mathbf{b} = \begin{bmatrix}70 \\ 130\end{bmatrix}\). Then we have \(A \mathbf{w} = \mathbf{b}\). To solve for \(\mathbf{w}\), we can use the inverse of matrix \(A\) (if it exists) or use other methods like Gaussian elimination. \[\mathbf{w} = A^{-1} \mathbf{b}.\]
Calculating the inverse of \(A\): \[\begin{aligned} A^{-1} &= \frac{1}{(10)(3) - (2)(20)} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \frac{1}{30 - 40} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= -\frac{1}{10} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \begin{bmatrix}-\dfrac{3}{10} & \dfrac{2}{10} \\ \dfrac{20}{10} & -\dfrac{10}{10}\end{bmatrix} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix}. \end{aligned}\]
Now, multiplying \(A^{-1}\) with \(\mathbf{b}\): \[\begin{aligned} \mathbf{w} &= A^{-1} \mathbf{b} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix} \begin{bmatrix}70 \\ 130\end{bmatrix} \\ &= \begin{bmatrix}-0.3 \cdot 70 + 0.2 \cdot 130 \\ 2 \cdot 70 - 1 \cdot 130\end{bmatrix} \\ &= \begin{bmatrix}-21 + 26 \\ 140 - 130\end{bmatrix} \\ &= \begin{bmatrix}5 \\ 10\end{bmatrix}. \end{aligned}\]
Thus, we find that \(w_1 = 5\) and \(w_2 = 10\). Therefore, the linear model for predicting house prices is: \[\hat y = 5 x_1 + 10 x_2.\] This means that for every additional 100 sq ft of house size, the price increases by Rs.5 lakhs, and for every additional bedroom, the price increases by Rs.10 lakhs.
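As a sanity check, this 2×2 system can be solved in a couple of lines. The sketch below is illustrative (it assumes NumPy is available; these notes do not prescribe a language) and uses `np.linalg.solve`, which factorizes \(A\) rather than forming \(A^{-1}\) explicitly:

```python
import numpy as np

# Coefficient matrix and price vector from the two houses above
A = np.array([[10.0, 2.0],
              [20.0, 3.0]])
b = np.array([70.0, 130.0])

# np.linalg.solve uses an LU factorization; preferable to computing A^{-1}
w = np.linalg.solve(A, b)
print(w)  # → [ 5. 10.]
```

The result matches the hand calculation: \(w_1 = 5\), \(w_2 = 10\).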
We should see the formulation below \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 70 \\ 130 \end{bmatrix}\] as \[X\mathbf{w}=\mathbf{y}.\] This abstract form is all a machine learning system ever sees: a matrix of features and a vector of targets. Now suppose we get a third house with size 15 (i.e., 1500 sq ft) and 2 bedrooms. In other words, our data table now looks like
| Size (\(x_1\)) | Bedrooms (\(x_2\)) | Price (Rs. lakhs) |
|---|---|---|
| 10 | 2 | 70 |
| 20 | 3 | 130 |
| 15 | 2 | 100 |
We can no longer solve for \(w_1\) and \(w_2\) exactly: we now have three equations but only two unknowns. Even if we try, we will find that no exact solution satisfies all three equations simultaneously. If you do not believe me, try Gaussian elimination. We get the following augmented matrix: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{array}\right]\] Applying row operations (\(R_2 \leftarrow R_2 - 2R_1\), \(R_3 \leftarrow R_3 - \tfrac{3}{2}R_1\), then \(R_3 \leftarrow R_3 - R_2\)), we obtain the row echelon form: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 0 & -1 & -10 \\ 0 & 0 & 5 \end{array}\right]\] The last row corresponds to the equation \(0 = 5\), a contradiction, which shows the system is inconsistent: no choice of \(w_1\) and \(w_2\) satisfies all three equations at once. Or, if you want to be more mathematical about it, the rank of the coefficient matrix is 2 while the rank of the augmented matrix is 3; since the two ranks differ, the system has no solution. (Indeed, the unique solution of the first two equations, \(w_1 = 5\), \(w_2 = 10\), predicts \(15 \cdot 5 + 2 \cdot 10 = 95\) for the third house, not 100.)
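The rank argument can be checked numerically. A short NumPy sketch (NumPy assumed) comparing the rank of the coefficient matrix with the rank of the augmented matrix:

```python
import numpy as np

# Three houses: the system Xw = y is now overdetermined
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

rank_X = np.linalg.matrix_rank(X)                          # rank of coefficient matrix
rank_aug = np.linalg.matrix_rank(np.column_stack([X, y]))  # rank of augmented matrix

print(rank_X, rank_aug)  # 2 3 — the ranks differ, so no exact solution exists
```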
But does this mean we cannot find a good model? Not at all. We can instead look for weights \(w_1\) and \(w_2\) that minimize the error between the predicted prices and the actual prices. Since there is no exact solution of \(X\mathbf{w}=\mathbf{y}\), we look for an approximate one, that is, \(X\mathbf{w} \approx \mathbf{y}\). Old question (math): “Find \(\mathbf w\) such that \(X\mathbf w=\mathbf y\).”
New question (ML): “Find \(\mathbf w\) such that \(X\mathbf w\approx\mathbf y\) as closely as possible.”
Define the error for the \(i\)-th house as \[e_i=\hat y_i-y_i, \qquad \hat y_i = w_1 x_{i1}+w_2 x_{i2}.\] One common approach is to consider the following optimization problem: \[\min_{w_1, w_2} \sum_{i=1}^{3} (y_i - \hat{y}_i)^2,\] where \(y_i\) is the actual price of the \(i\)-th house, and \(\hat{y}_i\) is the predicted price from our linear model. Since \((y_i - \hat{y}_i)^2 = e_i^2\), this minimizes the sum of squared errors; the approach is known as the least squares method.
More formally,
In many real-world problems, such as house price prediction, we attempt to model the relationship between several input variables (features) and an observed output using a linear rule. While simple linear systems may admit exact solutions, real data is noisy, inconsistent, and overdetermined. Linear regression provides a principled mathematical framework to handle such situations.
Let \(x_i \in \mathbb{R}^d\) denote the feature vector associated with the \(i\)-th data point (for example, house size, number of bedrooms, age, etc.), and let \(y_i \in \mathbb{R}\) denote the corresponding observed output (house price). We assume a linear relationship between inputs and output of the form \[\hat{y}_i = w^\top x_i,\] where \(w \in \mathbb{R}^d\) is an unknown parameter vector to be learned from data. This assumption means that the output is linear in the parameters \(w\), which is the defining feature of linear regression.
Given \(n\) data points, we stack the feature vectors row-wise into a design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n.\] The model predictions for all data points can then be written compactly as \[\hat{y} = Xw.\] In idealized situations, one might hope for an exact solution to the equation \(Xw = y\). However, when \(n > d\) and the data is noisy, such an equation is typically inconsistent, as demonstrated earlier using rank and Gaussian elimination.
The discrepancy between the model predictions and the observed outputs is captured by the residual vector \[r = Xw - y.\] Each component \(r_i = w^\top x_i - y_i\) represents the prediction error on a single data point. Since exact satisfaction of all equations is generally impossible, the goal is to find a parameter vector \(w\) that makes these residuals collectively as small as possible.
To quantify the total error, we introduce a scalar loss function defined as the squared Euclidean norm of the residual vector: \[\mathcal{L}(w) = \|Xw - y\|^2 = \sum_{i=1}^n (w^\top x_i - y_i)^2.\] This choice, known as the least squares loss, is motivated by several factors: it penalizes large errors more strongly, is differentiable, leads to a convex optimization problem, and admits a clear geometric interpretation.
Linear regression is thus formulated as the following optimization problem: \[\min_{w \in \mathbb{R}^d} \; \|Xw - y\|^2.\] This problem seeks the parameter vector whose predictions are closest to the observed data in the least squares sense. Importantly, this is no longer a problem of solving linear equations exactly, but of minimizing an error function.
It is important to clarify how the vectors involved in the linear model are represented and why the expression \(w^\top x_i\) is mathematically valid.
Each feature vector \(x_i\) is represented as a column vector in \(\mathbb{R}^d\): \[x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \in \mathbb{R}^{d \times 1},\] where each component corresponds to a measurable attribute of the data point (for example, house size, number of bedrooms, etc.).
Similarly, the parameter vector \(w\) is also represented as a column vector: \[w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d \times 1}.\]
Since both \(w\) and \(x_i\) are column vectors, their direct product \(wx_i\) is not defined. To obtain a scalar prediction, we take the transpose of \(w\), yielding a row vector: \[w^\top = \begin{bmatrix} w_1 & w_2 & \cdots & w_d \end{bmatrix} \in \mathbb{R}^{1 \times d}.\]
The prediction for the \(i\)-th data point is then given by the matrix product \[\hat{y}_i = w^\top x_i,\] where \[(1 \times d)(d \times 1) = 1 \times 1.\] Thus, \(w^\top x_i\) is a scalar, as required to model a real-valued output such as house price.
Explicitly, this dot product can be written as \[w^\top x_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id},\] which shows that the linear model is a weighted sum of the input features.
When stacking all feature vectors into the design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d},\] each row of \(X\) corresponds to a single data point written as a transposed feature vector. The matrix–vector product \[Xw\] then yields an \(n \times 1\) vector, where the \(i\)-th entry is precisely \(w^\top x_i\). This ensures dimensional consistency between the individual prediction formula and the global matrix formulation.
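The dimensional consistency claimed above is easy to confirm concretely. A small NumPy sketch (NumPy assumed; the numbers are the house data used earlier) checking that the \(i\)-th entry of \(Xw\) equals the dot product \(w^\top x_i\):

```python
import numpy as np

# n = 3 data points, d = 2 features (the house data from earlier)
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
w = np.array([5.0, 10.0])

# The i-th entry of the matrix-vector product X @ w ...
preds = X @ w
# ... equals the dot product of w with the i-th row (feature vector) of X
row_wise = np.array([w @ X[i] for i in range(X.shape[0])])

print(np.allclose(preds, row_wise))  # True
```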
This careful handling of vector orientation is not merely a notational choice; it is essential for extending linear regression to high-dimensional settings and for implementing learning algorithms correctly in numerical software.
To find the minimizer, we differentiate the loss function with respect to \(w\). We achieve this by applying the chain rule and properties of matrix calculus. Let us derive it from scratch in detail: \[\begin{aligned} \frac{\partial}{\partial w} \|Xw - y\|^2 &= \frac{\partial}{\partial w} (Xw - y)^\top (Xw - y) \\ &= \frac{\partial}{\partial w} \left( w^\top X^\top X w - 2 y^\top X w + y^\top y \right) \\ &= 2 X^\top X w - 2 X^\top y. \end{aligned}\] Why do the steps above hold? Because we use the following rules of matrix calculus:
For a vector \(a\), \(\frac{\partial}{\partial w} (w^\top a) = a\). For the derivative \(\frac{\partial}{\partial w} (w^\top a)\), we use the linearity of differentiation. Since \(w^\top a = \sum_j w_j a_j\), we have \(\frac{\partial}{\partial w_i} (w^\top a) = a_i\), which gives \(\frac{\partial}{\partial w} (w^\top a) = a\).
For a matrix \(A\), \(\frac{\partial}{\partial w} (w^\top A w) = (A + A^\top) w\). If \(A\) is symmetric, this simplifies to \(2 A w\).
For a symmetric matrix \(A\), we have \(\frac{\partial}{\partial w} (w^\top A w) = 2 A w\). To see this, expand \(w^\top A w = \sum_i \sum_j w_i A_{ij} w_j\). Then \(\frac{\partial}{\partial w_k}(w^\top A w) = \sum_j A_{kj} w_j + \sum_i w_i A_{ik} = \sum_j (A_{kj} + A_{jk}) w_j\). Since \(A\) is symmetric, \(A_{jk} = A_{kj}\), so this becomes \(2 \sum_j A_{kj} w_j = 2(Aw)_k\), giving \(\frac{\partial}{\partial w}(w^\top A w) = 2Aw\).
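These rules, and the resulting gradient \(2X^\top X w - 2X^\top y\), can be sanity-checked numerically. A short NumPy sketch (NumPy assumed) comparing the analytic gradient against central finite differences at an arbitrarily chosen point:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

def loss(w):
    r = X @ w - y
    return r @ r  # ||Xw - y||^2

def grad(w):
    return 2 * X.T @ (X @ w - y)  # the analytic gradient derived above

# Central finite differences as an independent check of the derivation
w0 = np.array([1.0, -2.0])  # an arbitrary test point
eps = 1e-6
num_grad = np.array([
    (loss(w0 + eps * e) - loss(w0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(grad(w0), num_grad, rtol=1e-4))  # True
```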
Setting the gradient to zero yields the normal equations: \[X^\top X w = X^\top y.\]
These equations form a system of \(d\) linear equations in \(d\) unknowns. Unlike the original system \(Xw = y\), the normal equations are always consistent: \(X^\top y\) lies in the column space of \(X^\top X\), since that column space equals the column space of \(X^\top\).
If the matrix \(X^\top X\) is invertible, the solution to the normal equations is given by \[w = (X^\top X)^{-1} X^\top y.\] This expression provides a closed-form formula for the linear regression estimator. It characterizes the theoretical solution but is not always used directly in practice due to computational and numerical stability considerations.
From a geometric perspective, the columns of \(X\) span a subspace of \(\mathbb{R}^n\), known as the column space of \(X\). The vector \(Xw\) represents a point in this subspace. Linear regression finds the vector \(Xw\) that is closest to \(y\) in Euclidean distance. Consequently, the residual vector \(r = Xw - y\) is orthogonal to the column space of \(X\), which explains the condition \[X^\top r = 0.\] Thus, linear regression can be interpreted as the orthogonal projection of \(y\) onto the column space of \(X\).
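Both the closed-form solution and the orthogonality condition can be checked in a few lines. This NumPy sketch (NumPy assumed; three-house data from earlier) solves the normal equations, compares against `np.linalg.lstsq` (which solves the same least squares problem via the SVD, the numerically preferred route), and verifies \(X^\top r \approx 0\):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Solve the normal equations X^T X w = X^T y (better than forming the inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # ≈ [5.778 5.556]

# lstsq solves the same least squares problem via the SVD
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual is orthogonal to the column space of X: X^T r = 0
r = X @ w - y
print(np.allclose(w, w_lstsq), np.allclose(X.T @ r, 0, atol=1e-6))  # True True
```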
Linear regression qualifies as a machine learning model because its parameters are inferred from data rather than specified explicitly, the data typically contains noise and inconsistencies, and the learned model is used to make predictions on unseen data. The process involves empirical risk minimization rather than exact algebraic solution.
In modern machine learning applications, the number of data points \(n\) may be in the millions and the number of features \(d\) may be in the thousands. In such settings, forming and inverting \(X^\top X\) is computationally expensive and numerically unstable. Consequently, linear regression is commonly solved using numerical optimization methods such as gradient descent, stochastic gradient descent, or iterative linear solvers. These methods approximate the solution to the normal equations efficiently at scale.
Linear regression transforms an inconsistent, overdetermined system of linear equations into a well-posed optimization problem. It replaces the impossible requirement of satisfying all equations exactly with the achievable goal of minimizing overall error. In this sense, linear regression is best understood as numerical linear algebra applied to real-world data.
Now let us return to our housing price example and compute the optimal weights using the least squares approach. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] The normal equations are \[X^\top X w = X^\top y.\] Calculating \(X^\top X\) and \(X^\top y\): \[X^\top X = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}10 & 2 \\ 20 & 3 \\ 15 & 2\end{bmatrix} = \begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix},\] and \[X^\top y = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}70 \\ 130 \\ 100\end{bmatrix} = \begin{bmatrix} 10(70) + 20(130) + 15(100) \\ 2(70) + 3(130) + 2(100) \end{bmatrix}\] \[= \begin{bmatrix} 700 + 2600 + 1500 \\ 140 + 390 + 200 \end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\] Thus, the normal equations become \[\begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\] Solving via the determinant \(\Delta = 725 \cdot 17 - 110^2 = 12325 - 12100 = 225\): \[\begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \frac{1}{225}\begin{bmatrix}17 & -110 \\ -110 & 725\end{bmatrix} \begin{bmatrix}4800 \\ 730\end{bmatrix}\] \[= \frac{1}{225}\begin{bmatrix} 81600 - 80300 \\ -528000 + 529250 \end{bmatrix} = \frac{1}{225}\begin{bmatrix}1300 \\ 1250\end{bmatrix} = \begin{bmatrix}5.78 \\ 5.56\end{bmatrix}.\] Thus \(w_1 \approx 5.78\) and \(w_2 \approx 5.56\), representing the best-fit weights that minimize the squared error across all three data points. How come we got different weights than before? Because now we are minimizing the squared error across all three data points, rather than trying to fit exactly two points. Earlier weights were 5 and 10, which perfectly fit the first two points but did not account for the third point. 
With the third point included, the weights become approximately 5.78 and 5.56 (exactly \(w_1 = 52/9\) and \(w_2 = 50/9\)), which balance the errors across all three points. Let us check the predictions: \[\hat{y}_1 = 5.78 \cdot 10 + 5.56 \cdot 2 = 57.8 + 11.12 = 68.92,\] \[\hat{y}_2 = 5.78 \cdot 20 + 5.56 \cdot 3 = 115.6 + 16.68 = 132.28,\] \[\hat{y}_3 = 5.78 \cdot 15 + 5.56 \cdot 2 = 86.7 + 11.12 = 97.82.\] None of the predictions matches its observed price exactly; the residuals \((-1.08,\ 2.28,\ -2.18)\) are the cost of fitting all three points at once, and no other choice of weights achieves a smaller sum of squared errors.
In practice, computing matrix inverses is computationally expensive and numerically unstable. Instead, we use gradient descent to minimize \(\mathcal{L}(w) = \|Xw - y\|^2\) iteratively, starting from an initial guess \(w^{(0)} = 0\): \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}),\] where \(\eta\) is the learning rate. The gradient \(\nabla \mathcal{L}(w)\) is given by \[\nabla \mathcal{L}(w) = 2 X^\top (Xw - y).\] By repeatedly applying this update rule, we can converge to the optimal weights without explicitly computing the matrix inverse. Why does this work? Because gradient descent is an optimization algorithm that iteratively moves towards the minimum of a function by taking steps proportional to the negative of the gradient. In the case of linear regression, the loss function is convex, ensuring that gradient descent will converge to the global minimum. Here is a pictorial representation of gradient descent:
First recall what convexity is. A function \(f: \mathbb{R}^d \to \mathbb{R}\) is convex if for any two points \(x, y \in \mathbb{R}^d\) and any \(\theta \in [0, 1]\), \[f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).\] Geometrically, this means that the line segment connecting any two points on the graph of the function lies above or on the graph itself. For a convex function, every local minimum is a global minimum, which makes optimization problems easier to solve. Why? Because if \(w\) were a local minimum and some other point \(v\) had a strictly smaller value, the convexity inequality along the segment from \(w\) to \(v\) would give points arbitrarily close to \(w\) with strictly smaller values, contradicting local minimality. (If \(f\) is strictly convex, the minimizer is moreover unique.)
Why is the loss function convex, ensuring that gradient descent will converge to the global minimum? To prove that the loss function \(\mathcal{L}(w) = \|Xw - y\|^2\) is convex, we can show that its Hessian matrix is positive semidefinite. The Hessian matrix of \(\mathcal{L}(w)\) is given by \[H = \nabla^2 \mathcal{L}(w) = 2 X^\top X.\] To show that \(H\) is positive semidefinite, we need to show that for any vector \(z \in \mathbb{R}^d\), \[z^\top H z \geq 0.\] Calculating \(z^\top H z\): \[z^\top H z = z^\top (2 X^\top X) z = 2 (X z)^\top (X z) = 2 \|X z\|^2 \geq 0.\] Since \(\|X z\|^2\) is always non-negative, it follows that \(H\) is positive semidefinite. Therefore, the loss function \(\mathcal{L}(w)\) is convex.
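For our concrete \(X\), positive semidefiniteness can also be verified numerically by checking the eigenvalues of the Hessian. A small sketch (NumPy assumed):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])

H = 2 * X.T @ X  # Hessian of the least squares loss

# eigvalsh is specialized for symmetric matrices and returns real eigenvalues
eigvals = np.linalg.eigvalsh(H)
print(eigvals)  # both eigenvalues are positive (about 0.61 and 1483.4)
```

Note the huge spread between the two eigenvalues: this matrix is ill-conditioned, which will matter for the speed of gradient descent below.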
We still need to justify why gradient descent is a good idea. There are several reasons:
Scalability: Gradient descent can handle large datasets and high-dimensional feature spaces efficiently, as it does not require storing or inverting large matrices.
Flexibility: It can be easily adapted to different loss functions and regularization techniques.
Convergence Guarantees: For convex loss functions, gradient descent is guaranteed to converge to the global minimum, provided an appropriate learning rate is chosen.
Simplicity: The algorithm is straightforward to implement and understand, making it accessible for practitioners.
Why is computing the inverse of a matrix computationally expensive and numerically unstable? Computing the inverse of a matrix, especially a large one, is computationally expensive because the time complexity of matrix inversion algorithms is typically \(O(d^3)\) for a \(d \times d\) matrix, which becomes prohibitive for high-dimensional data. Additionally, matrix inversion can be numerically unstable if the matrix is ill-conditioned (i.e., has a high condition number), leading to significant errors in the computed inverse: small perturbations in the input data can produce large changes in the inverse, making it unreliable for practical applications.
Gradient descent, in contrast, avoids these issues by iteratively refining the solution without requiring matrix inversion, making it more robust and efficient for large-scale problems. So let us now apply gradient descent to find the optimal weights for our housing price prediction problem, updating the weights by hand until convergence. Starting with an initial guess \(w^{(0)} = \begin{bmatrix}0 \\ 0\end{bmatrix}\) and a learning rate \(\eta = 10^{-5}\), we compute the gradient at each step: \[\nabla \mathcal{L}(w^{(k)}) = 2 X^\top (X w^{(k)} - y).\] We then update the weights: \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}).\] Repeating this process, we get the following sequence of weight updates: \[\begin{aligned} w^{(0)} &= \begin{bmatrix}0 \\ 0\end{bmatrix}, \\ w^{(1)} &\approx \begin{bmatrix}0.096 \\ 0.015\end{bmatrix}, \\ w^{(2)} &\approx \begin{bmatrix}0.191 \\ 0.029\end{bmatrix}, \\ w^{(3)} &\approx \begin{bmatrix}0.284 \\ 0.043\end{bmatrix}, \\ &\vdots \end{aligned}\] After sufficiently many iterations, the weights converge to approximately \(\begin{bmatrix}5.78 \\ 5.56\end{bmatrix}\), matching the closed-form solution we computed earlier. (Convergence here is slow: \(X^\top X\) has eigenvalues of roughly \(741.7\) and \(0.30\), and the small eigenvalue means the error shrinks very slowly along one direction, so this tiny learning rate needs on the order of a million iterations.) This demonstrates the effectiveness of gradient descent in finding the optimal weights for linear regression without any matrix inversion. There are many small things to consider here, like how to choose the learning rate and when to stop the iterations (convergence criteria). Below is a pictorial representation of gradient descent converging to the optimal weights:
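The hand iteration above can be automated in a few lines. Below is a minimal NumPy sketch (NumPy assumed); it uses a larger learning rate than the hand calculation, chosen just below the stability threshold \(1/\lambda_{\max}(X^\top X)\), so it converges in far fewer iterations:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

def gradient_descent(X, y, eta, num_iters):
    w = np.zeros(X.shape[1])                  # start from w = 0, as in the text
    for _ in range(num_iters):
        grad = 2 * X.T @ (X @ w - y)          # gradient of ||Xw - y||^2
        w = w - eta * grad
    return w

# X^T X is ill-conditioned here (eigenvalues ~0.30 and ~741.7), so the error
# along the small-eigenvalue direction shrinks slowly; many iterations needed.
w = gradient_descent(X, y, eta=1e-3, num_iters=50_000)
print(w)  # ≈ [5.778 5.556], matching the closed-form solution
```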
These small things, such as how to choose the learning rate and when to stop the iterations (convergence criteria), are important practical aspects of implementing gradient descent effectively. See the summary below:
Learning Rate: A small learning rate may lead to slow convergence, while a large learning rate may cause divergence. Techniques like learning rate schedules or adaptive learning rates (e.g., Adam, RMSprop) can help.
Convergence Criteria: Common criteria include a maximum number of iterations, a threshold on the change in loss between iterations, or a threshold on the norm of the gradient.
Batch Size: In practice, stochastic gradient descent (SGD) or mini-batch gradient descent is often used, where gradients are computed on subsets of the data to improve efficiency and convergence.
Regularization: Techniques like L2 regularization (Ridge regression) or L1 regularization (Lasso) can be incorporated to prevent overfitting and improve generalization.
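As an illustration of the batch-size point, here is a minimal stochastic gradient descent sketch (NumPy assumed; the seed and step count are arbitrary choices) that updates the weights using one randomly chosen data point per step. With a constant learning rate the iterates never settle exactly; they hover near the least squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

w = np.zeros(2)
eta = 1e-3  # constant step size; a decaying schedule would allow exact convergence
for step in range(20_000):
    i = rng.integers(len(y))        # sample one data point uniformly at random
    residual = X[i] @ w - y[i]
    grad = 2 * residual * X[i]      # gradient of the single-point squared error
    w = w - eta * grad

print(w)  # a noisy estimate near the least squares solution
```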
We will learn these small things in detail in later lectures. But for now, let us completely change our perspective and derive linear regression from a probabilistic viewpoint. For this, let us recall our housing price prediction problem. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] We will now assume a probabilistic model for the data generation process. Specifically, we assume that the observed prices are generated according to the linear model plus Gaussian noise: \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) are independent and identically distributed Gaussian noise terms with mean zero and variance \(\sigma^2\). This assumption reflects the idea that the true relationship between features and prices is linear, but there are random fluctuations due to unobserved factors. Recall the probability basics below:
Before diving into the probabilistic interpretation of linear regression, it is essential to review foundational concepts in probability theory.
The sample space \(\Omega\) is the set of all possible outcomes of a random experiment. An event is a subset of the sample space. For example, when rolling a fair die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\), and the event “rolling an even number” is \(\{2, 4, 6\}\).
The probability of an event \(A\), denoted \(P(A)\), is a number in \([0, 1]\) satisfying:
\(P(\Omega) = 1\) (the sample space has probability 1).
\(P(\emptyset) = 0\) (the empty event has probability 0).
For mutually exclusive events \(A_1, A_2, \ldots\), we have \(P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots\).
A random variable \(X\) is a function that assigns a numerical value to each outcome in the sample space. Random variables can be discrete (taking values in a countable set) or continuous (taking values in an interval or uncountable set). For example, the outcome of a die roll is a discrete random variable, while the height of a person is continuous.
A probability distribution describes how probability is distributed over the possible values of a random variable.
For a discrete random variable \(X\) taking values \(x_1, x_2, \ldots\), the probability mass function (PMF) is \(p(x_i) = P(X = x_i)\). The probabilities satisfy \(\sum_i p(x_i) = 1\).
For a continuous random variable \(X\), the probability density function (PDF) is \(f(x)\), which satisfies \(\int_{-\infty}^{\infty} f(x) \, dx = 1\). The probability that \(X\) lies in an interval \([a, b]\) is \[P(a \leq X \leq b) = \int_a^b f(x) \, dx.\]
The cumulative distribution function (CDF) is defined as \[F(x) = P(X \leq x).\] For a continuous random variable with PDF \(f(x)\), we have \(F(x) = \int_{-\infty}^x f(t) \, dt\), and the PDF is the derivative of the CDF: \(f(x) = \frac{dF(x)}{dx}\).
The expected value (or mean) of a random variable \(X\) is a measure of its central tendency.
For a discrete random variable, \[E[X] = \sum_i x_i \cdot p(x_i).\]
For a continuous random variable, \[E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx.\]
The expected value is linear: \(E[aX + b] = aE[X] + b\) for constants \(a\) and \(b\).
The variance of a random variable \(X\) measures the spread of its distribution around the mean: \[\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.\]
The standard deviation is \(\sigma = \sqrt{\text{Var}(X)}\).
Properties of variance include:
\(\text{Var}(aX + b) = a^2 \text{Var}(X)\) for constants \(a\) and \(b\).
\(\text{Var}(X) \geq 0\), with equality if and only if \(X\) is constant almost surely.
The Gaussian or normal distribution is one of the most important distributions in probability and statistics. A continuous random variable \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), denoted \(X \sim \mathcal{N}(\mu, \sigma^2)\), if its PDF is \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).\]
Key properties of the normal distribution include:
The PDF is symmetric about \(\mu\) and bell-shaped.
The mean, median, and mode are all equal to \(\mu\).
Approximately 68% of the probability mass lies within one standard deviation of the mean: \(P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.68\).
Approximately 95% lies within two standard deviations: \(P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.95\).
Approximately 99.7% lies within three standard deviations.
The standard normal distribution has mean \(\mu = 0\) and variance \(\sigma^2 = 1\), often denoted \(\mathcal{N}(0, 1)\).
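The 68-95-99.7 rule can be confirmed empirically by sampling. A short Monte Carlo sketch (NumPy assumed; sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0  # the standard normal distribution

samples = rng.normal(mu, sigma, size=1_000_000)

# Fraction of samples within 1, 2, and 3 standard deviations of the mean
within_1 = np.mean(np.abs(samples - mu) <= 1 * sigma)
within_2 = np.mean(np.abs(samples - mu) <= 2 * sigma)
within_3 = np.mean(np.abs(samples - mu) <= 3 * sigma)

print(within_1, within_2, within_3)  # ≈ 0.683, 0.954, 0.997
```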
When dealing with multiple random variables, we use the multivariate normal distribution. A random vector \(\mathbf{x} \in \mathbb{R}^d\) follows a multivariate normal distribution with mean vector \(\boldsymbol{\mu} \in \mathbb{R}^d\) and covariance matrix \(\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}\), denoted \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), if its PDF is \[f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right),\] where \(|\boldsymbol{\Sigma}|\) denotes the determinant of \(\boldsymbol{\Sigma}\).
Two random variables \(X\) and \(Y\) are independent if knowledge of one provides no information about the other. Formally, they are independent if \[P(X = x \text{ and } Y = y) = P(X = x) \cdot P(Y = y)\] for all values \(x\) and \(y\) (in the discrete case), or if their joint PDF factors as \(f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)\) (in the continuous case).
The conditional probability of event \(A\) given event \(B\) is \[P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)},\] provided \(P(B) > 0\).
Bayes’ theorem relates conditional probabilities: \[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.\] This theorem is fundamental in machine learning and statistical inference, allowing us to update beliefs (posterior) based on new evidence (likelihood) and prior knowledge (prior).
In the context of parameter estimation, the likelihood is the probability of observing the data given a particular value of the parameter. If we have data \(\mathbf{y} = (y_1, \ldots, y_n)\) and parameter \(w\), the likelihood is \[L(w \mid \mathbf{y}) = P(\mathbf{y} \mid w).\] The log-likelihood is \(\ell(w) = \log L(w \mid \mathbf{y})\), which is often easier to work with computationally.
Maximum likelihood estimation (MLE) is a method for estimating parameters by choosing the parameter value that maximizes the likelihood of the observed data: \[\hat{w} = \arg\max_w L(w \mid \mathbf{y}) = \arg\max_w \ell(w \mid \mathbf{y}).\]
These concepts will be essential for understanding the probabilistic formulation of linear regression, where we model noise using the normal distribution and estimate parameters via maximum likelihood.

Do all models with noise follow this approach? No. While the Gaussian noise assumption is common due to its mathematical convenience and the Central Limit Theorem, other noise distributions can be used depending on the nature of the data and the specific application. For example, if the noise is believed to be heavy-tailed, a Laplace distribution might be more appropriate. Similarly, for count data, a Poisson distribution could be used. The choice of noise model affects the likelihood function and consequently the parameter estimation method.

Recall that the Central Limit Theorem states that the sum of a large number of independent random variables, regardless of their individual distributions, tends to follow a normal distribution, provided certain conditions are met. This theorem justifies the common use of Gaussian noise in modeling real-world phenomena, as many types of measurement errors and natural variations can be viewed as the cumulative effect of numerous small, independent factors.

In the house price example, is the noise really independent and identically distributed Gaussian? In practice, the assumption of independent and identically distributed (i.i.d.) Gaussian noise may not always hold. While it is a convenient and widely used assumption, several factors can lead to violations of this assumption:
Heteroscedasticity: The variance of the noise may change with the level of the input features. For example, more expensive houses might have larger price fluctuations.
Autocorrelation: In time series data or spatial data, errors may be correlated with each other, violating the independence assumption.
Non-Gaussian Noise: The noise distribution may not be Gaussian; it could be skewed or have heavier tails than a normal distribution.
Outliers: Real-world data often contains outliers that can significantly affect the model fit and violate the i.i.d. assumption.
Despite these potential violations, the Gaussian noise assumption is often used as a first approximation due to its mathematical tractability and the Central Limit Theorem. However, it is important to assess the validity of this assumption in practice and consider alternative models or robust estimation techniques if necessary. Now let us derive the likelihood function for our linear regression model under the Gaussian noise assumption.
Given the model \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\), the conditional probability of observing \(y_i\) given \(x_i\) and parameters \(w\) is \[P(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right).\] Assuming the data points are independent, the likelihood of observing the entire dataset \((X, y)\) is the product of the individual probabilities: \[L(w \mid X, y) = \prod_{i=1}^n P(y_i \mid x_i, w) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2\right).\] Taking the logarithm of the likelihood function gives the log-likelihood: \[\ell(w \mid X, y) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2.\] To find the maximum likelihood estimate of \(w\), we can ignore the constant term \(-\frac{n}{2} \log(2\pi\sigma^2)\); maximizing the remaining term is then equivalent to minimizing the sum of squared errors: \[\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2.\] This is precisely the same optimization problem we encountered in the least squares formulation of linear regression. Thus, the maximum likelihood estimate of \(w\) under the Gaussian noise assumption coincides with the least squares solution. This connection highlights the probabilistic interpretation of linear regression as fitting a model that maximizes the likelihood of the observed data under a Gaussian noise model.
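This equivalence is easy to check numerically. The sketch below (my own synthetic data and variable names, with `numpy` assumed available) fits \(w\) by least squares and then verifies that any perturbation of that solution can only lower the Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = w_true . x + Gaussian noise (illustrative values).
n, sigma = 200, 0.5
w_true = np.array([3.0, -2.0])
X = rng.normal(size=(n, 2))
y = X @ w_true + rng.normal(scale=sigma, size=n)

# Least-squares solution: minimizes the sum of squared errors.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian log-likelihood of the data as a function of w.
def log_likelihood(w):
    r = y - X @ w
    return -n / 2 * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

# Perturbing w_ls in any direction can only decrease the likelihood,
# consistent with w_ls being the maximum likelihood estimate.
for _ in range(5):
    w_alt = w_ls + rng.normal(scale=0.1, size=2)
    assert log_likelihood(w_alt) <= log_likelihood(w_ls)
```

Because the log-likelihood is strictly concave in \(w\) here, the least-squares solution is its unique maximizer.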
Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much variance as possible. It achieves this by identifying the directions (principal components) along which the data varies the most.
In linear regression, we study a supervised problem: \[y = Xw + \varepsilon,\] where the goal is to predict an output variable \(y\) (e.g. house price) from features \(X\).
In PCA, we deliberately remove the output variable \(y\) and ask a different question:
Given only the feature data \(X\), how can we describe or represent the data as faithfully as possible using fewer numbers?
Thus PCA is an unsupervised method whose goal is not prediction, but representation and compression of data.
—
Let \[x_1, x_2, \dots, x_n \in \mathbb{R}^d\] be data points (e.g. houses described by size, bedrooms, etc.).
Since each data point \(x_i\) is a \(d\)-dimensional vector, the mean is computed as a vector average: \[\mu = \frac{1}{n}\sum_{i=1}^n x_i = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} \in \mathbb{R}^d,\] where each component \(\mu_j\) is the average of the \(j\)-th feature across all data points. When we subtract \(\mu\) from \(x_i\), we perform component-wise subtraction: \[x_i - \mu = \begin{bmatrix} x_{i1} - \mu_1 \\ x_{i2} - \mu_2 \\ \vdots \\ x_{id} - \mu_d \end{bmatrix}.\] This compact notation \(x_i \leftarrow x_i - \mu\) applies uniformly regardless of dimensionality.
We first perform mean centering: \[x_i \leftarrow x_i - \mu, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^n x_i.\]
After centering, \[\frac{1}{n}\sum_{i=1}^n x_i = 0.\]
Mean centering ensures that we study variation around the average, which is essential for defining variance and covariance.
—
After mean centering, the centered data points are distributed around the origin. To understand the structure of this distribution, we need a single mathematical object that captures:
How much each feature varies (spread along each axis),
How features vary together (whether large values in one feature tend to occur with large or small values in another).
Recall from probability that the variance of a random variable \(X\) is \(\text{Var}(X) = E[X^2]\) (when centered at zero). For data, this becomes \(\frac{1}{n}\sum_i x_i^2\), measuring spread along a single dimension.
For centered data points \(x_1, \ldots, x_n\), the outer product \(x_i x_i^\top\) creates a \(d \times d\) matrix that captures all pairwise products of coordinates: \[x_i x_i^\top = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{bmatrix} = \begin{bmatrix} x_{i1}^2 & x_{i1}x_{i2} & \cdots \\ x_{i2}x_{i1} & x_{i2}^2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}.\]
Averaging these matrices across all data points accumulates information about:
Diagonal entries: \(\frac{1}{n}\sum_i x_{ij}^2\) (variance of feature \(j\)),
Off-diagonal entries: \(\frac{1}{n}\sum_i x_{ij}x_{ik}\) (covariance between features \(j\) and \(k\)).
This is the natural generalization of univariate variance to the multivariate setting.
The covariance matrix is defined as \[\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top.\]
Each entry \[\Sigma_{jk} = \frac{1}{n}\sum_{i=1}^n x_{ij}x_{ik}\] is the covariance between feature \(j\) and feature \(k\).
Thus, the covariance matrix compactly stores:
variances of individual features (diagonal entries),
pairwise relationships between features (off-diagonal entries).
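As a quick numerical illustration (a sketch on made-up data, not the housing example), the covariance matrix built as the average of outer products agrees with the vectorized form \(\frac{1}{n}X^\top X\) and with `numpy.cov` under the \(1/n\) convention:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated features

X = X - X.mean(axis=0)   # mean centering

# Covariance as the average of outer products x_i x_i^T.
Sigma = sum(np.outer(x, x) for x in X) / n

# Same matrix, vectorized; and numpy's estimator with 1/n normalization.
assert np.allclose(Sigma, X.T @ X / n)
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```

Note that `bias=True` selects the \(1/n\) normalization used in these notes; `numpy.cov` defaults to \(1/(n-1)\).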
—
Choose a vector of weights \[v \in \mathbb{R}^d, \quad \|v\| = 1.\]
For each data point \(x_i\), define a one-dimensional summary: \[z_i = x_i^\top v.\]
This compresses each \(d\)-dimensional data point into a single number.
Different choices of \(v\) preserve different aspects of the data.
—
The variance of the summary values \(\{z_i\}\) is \[\operatorname{Var}(z) = \frac{1}{n}\sum_{i=1}^n z_i^2 = v^\top \Sigma v.\]
Interpretation:
Large variance \(\Rightarrow\) the summary distinguishes data points well.
Zero variance \(\Rightarrow\) all data points collapse to the same value (no information).
Thus, variance measures how much of the differences between data points survive compression.
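The identity \(\operatorname{Var}(z) = v^\top \Sigma v\) can be verified directly for any unit direction (a sketch on synthetic centered data; all names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2)) * np.array([3.0, 0.5])  # anisotropic cloud
X = X - X.mean(axis=0)
Sigma = X.T @ X / len(X)

v = rng.normal(size=2)
v /= np.linalg.norm(v)   # random unit direction

z = X @ v                # one-number summary of every data point
# Variance of the summaries equals the quadratic form v^T Sigma v.
assert np.isclose(z @ z / len(z), v @ Sigma @ v)
```

The agreement is exact (up to floating point) because \(\frac{1}{n}\sum_i z_i^2 = \frac{1}{n} v^\top X^\top X v = v^\top \Sigma v\) by construction.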
—
To find the best one-dimensional representation, PCA solves: \[\max_{\|v\|=1} \; v^\top \Sigma v.\]
This chooses the linear combination of features whose values vary the most across the dataset.
The solution is called the first principal component \(v_1\).
—
Given the summary \[z_i = x_i^\top v,\] the natural reconstruction of the original data point is \[\hat{x}_i = z_i\, v.\]
All reconstructed points lie in the one-dimensional subspace spanned by \(v\).
The average reconstruction error is \[\frac{1}{n}\sum_{i=1}^n \|x_i - \hat{x}_i\|^2 = \frac{1}{n}\sum_{i=1}^n \|x_i - (x_i^\top v)v\|^2.\]
A fundamental result is:
The vector \(v\) that maximizes variance also minimizes reconstruction error.
Hence, PCA finds the best low-dimensional approximation of the data in the least-squares sense.
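This equivalence can be seen numerically: sweeping unit vectors around the circle, the direction that maximizes variance is the same one that minimizes reconstruction error (a sketch on synthetic 2-D data; all names are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.3], [0.3, 0.4]])
X = X - X.mean(axis=0)
Sigma = X.T @ X / len(X)

def variance(v):
    return v @ Sigma @ v

def recon_error(v):
    X_hat = np.outer(X @ v, v)   # rank-1 reconstruction z_i * v
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Sweep unit directions around the half-circle.
thetas = np.linspace(0, np.pi, 1000)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
var_best = max(range(len(dirs)), key=lambda i: variance(dirs[i]))
err_best = min(range(len(dirs)), key=lambda i: recon_error(dirs[i]))
assert var_best == err_best   # same optimal direction
```

The underlying reason is the Pythagorean split \(\|x\|^2 = (x^\top v)^2 + \|x - (x^\top v)v\|^2\): variance and reconstruction error sum to a constant, so maximizing one minimizes the other.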
—
Consider houses described by size and number of bedrooms (mean-centered):
\[x_1 = (-5,-0.3),\quad x_2 = (5,0.7),\quad x_3 = (0,-0.3).\]
Choosing \(v=(0,1)\) (only bedrooms) produces low variance summaries.
Choosing \(v=(1,0)\) (only size) produces high variance summaries.
Thus PCA automatically discovers that house size is the dominant pattern of variation.
—
PCA provides:
optimal linear compression of data,
noise reduction by discarding low-variance directions,
interpretable dominant patterns,
better conditioning for downstream tasks (regression, clustering, density estimation).
PCA does not discover meaning or causality. It guarantees only this:
Among all linear low-dimensional representations, PCA loses the least information under squared error.
—
Let \(\mathcal{X} = \{x_1, \ldots, x_N\}\) be a dataset where \(x_n \in \mathbb{R}^D\). We seek to find a lower-dimensional representation in \(\mathbb{R}^M\) where \(M < D\), while preserving as much information as possible.
PCA projects data onto a lower-dimensional linear subspace. The key is to find an orthonormal basis \(\{b_1, \ldots, b_M\}\) of this \(M\)-dimensional subspace such that the projected data retains maximum variance.
For a given orthonormal basis \(B = [b_1, \ldots, b_M]\) where \(b_i \in \mathbb{R}^D\) and \(b_i^\top b_j = \delta_{ij}\), the projection of a data point \(x_n\) onto this subspace is: \[\tilde{x}_n = \sum_{i=1}^M z_{ni} b_i = B z_n,\] where \(z_n = [z_{n1}, \ldots, z_{nM}]^\top \in \mathbb{R}^M\) are the coordinates (principal component scores) in the new basis, computed as \(z_{ni} = x_n^\top b_i\).
PCA finds the orthonormal basis that maximizes the variance of the projected data. For mean-centered data (where \(\sum_{n=1}^N x_n = 0\)), the variance of the projections onto a single direction \(b_1\) is: \[V_1 = \frac{1}{N} \sum_{n=1}^N (x_n^\top b_1)^2 = \frac{1}{N} \sum_{n=1}^N b_1^\top x_n x_n^\top b_1 = b_1^\top S b_1,\] where \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top\) is the data covariance matrix.
The first principal component is found by solving: \[\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \|b_1\|^2 = 1.\]
Using the method of Lagrange multipliers with Lagrangian \(\mathcal{L}(b_1, \lambda_1) = b_1^\top S b_1 - \lambda_1(b_1^\top b_1 - 1)\), we obtain: \[\frac{\partial \mathcal{L}}{\partial b_1} = 2S b_1 - 2\lambda_1 b_1 = 0 \quad \Rightarrow \quad S b_1 = \lambda_1 b_1.\]
Thus, \(b_1\) is an eigenvector of \(S\) with eigenvalue \(\lambda_1\). The variance along this direction is: \[V_1 = b_1^\top S b_1 = b_1^\top (\lambda_1 b_1) = \lambda_1 b_1^\top b_1 = \lambda_1.\]
To maximize variance, we choose \(b_1\) to be the eigenvector corresponding to the largest eigenvalue \(\lambda_1\) of \(S\).
For the \(m\)-th principal component, we solve: \[\max_{b_m} \; b_m^\top S b_m \quad \text{subject to} \quad \|b_m\|^2 = 1 \text{ and } b_m \perp b_i \text{ for } i = 1, \ldots, m-1.\]
The solution is the eigenvector corresponding to the \(m\)-th largest eigenvalue of \(S\).
PCA can equivalently be formulated as minimizing the average squared reconstruction error. The reconstruction of \(x_n\) from its projection is: \[\tilde{x}_n = \sum_{i=1}^M (x_n^\top b_i) b_i = BB^\top x_n,\] where \(B = [b_1, \ldots, b_M]\).
The reconstruction error for the entire dataset is: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^N \|x_n - BB^\top x_n\|^2.\]
Using the identity \(\|x_n\|^2 = \|BB^\top x_n\|^2 + \|(I - BB^\top)x_n\|^2\) for orthonormal \(B\), we have: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n\|^2 - \frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2.\]
Since the first term is constant, minimizing \(J\) is equivalent to maximizing \(\frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2\), which is the variance of the projected data. Thus, maximum variance and minimum reconstruction error are equivalent objectives.
Center the data: Compute mean \(\mu = \frac{1}{N}\sum_{n=1}^N x_n\) and set \(x_n \leftarrow x_n - \mu\).
Compute covariance matrix: \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top = \frac{1}{N} X^\top X\), where \(X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}\).
Eigendecomposition: Compute eigenvectors and eigenvalues of \(S\): \(S = \sum_{i=1}^D \lambda_i b_i b_i^\top\) where \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D\).
Select principal components: Choose the top \(M\) eigenvectors \(B = [b_1, \ldots, b_M]\).
Project data: For each \(x_n\), compute \(z_n = B^\top x_n \in \mathbb{R}^M\).
The proportion of variance captured by the first \(M\) principal components is: \[\frac{\sum_{i=1}^M \lambda_i}{\sum_{i=1}^D \lambda_i}.\]
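The five steps above can be sketched as a single function (a minimal illustration rather than an optimized implementation; the function name `pca` is my own):

```python
import numpy as np

def pca(X, M):
    """PCA via eigendecomposition of the covariance matrix.

    X : (N, D) data matrix, one sample per row.
    M : number of principal components to keep.
    Returns scores Z (N, M), basis B (D, M), explained-variance ratio.
    """
    mu = X.mean(axis=0)
    Xc = X - mu                             # 1. center the data
    S = Xc.T @ Xc / len(Xc)                 # 2. covariance matrix (1/N)
    eigvals, eigvecs = np.linalg.eigh(S)    # 3. eigendecomposition
    order = np.argsort(eigvals)[::-1]       #    sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    B = eigvecs[:, :M]                      # 4. top-M eigenvectors
    Z = Xc @ B                              # 5. project the data
    return Z, B, eigvals[:M].sum() / eigvals.sum()
```

`numpy.linalg.eigh` is used because \(S\) is symmetric; it returns eigenvalues in ascending order, hence the explicit descending sort.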
Orthogonality: Principal components are mutually orthogonal: \(b_i^\top b_j = \delta_{ij}\).
Uncorrelated Projections: The coordinates \(z_{ni}\) and \(z_{nj}\) for \(i \neq j\) are uncorrelated across the dataset.
Variance Ordering: Variance along principal components decreases: \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0\).
Linear Transformation: PCA is a linear method that cannot capture nonlinear structure in data.
Scale Sensitivity: PCA is sensitive to feature scales; standardization is often applied before PCA.
For high-dimensional data where \(N < D\), computing the \(D \times D\) covariance matrix \(S\) is expensive. Instead, we can use the singular value decomposition (SVD) of the centered data matrix \(X \in \mathbb{R}^{N \times D}\): \[X = U \Sigma V^\top,\] where \(U \in \mathbb{R}^{N \times N}\), \(\Sigma \in \mathbb{R}^{N \times D}\), \(V \in \mathbb{R}^{D \times D}\) are the SVD factors.
The relationship with PCA:
The columns of \(V\) are the principal components (eigenvectors of \(S = \frac{1}{N}X^\top X\)).
The eigenvalues of \(S\) are \(\lambda_i = \frac{\sigma_i^2}{N}\), where \(\sigma_i\) are the singular values.
The projections are given by \(Z = XV = U\Sigma \in \mathbb{R}^{N \times D}\).
This approach is numerically more stable and efficient, especially when \(N \ll D\).
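A small sketch contrasting the two routes on random data with \(N < D\) (all names are my own); it checks the relation \(\lambda_i = \sigma_i^2 / N\):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 20, 50                      # fewer samples than features
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)

# Route 1: eigendecomposition of the D x D covariance matrix.
S = X.T @ X / N
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]

# Route 2: SVD of the centered data matrix (never forms S).
U, svals, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of S and singular values of X agree: lambda_i = sigma_i^2 / N.
assert np.allclose(eigvals[:N], svals**2 / N, atol=1e-8)
# The rows of Vt are the principal components (up to sign).
```

With `full_matrices=False`, the SVD works with matrices of size at most \(N \times D\), avoiding the \(D \times D\) covariance matrix entirely.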
PCA identifies an orthogonal coordinate system aligned with the directions of maximum variance in the data cloud. Geometrically:
The first principal component \(b_1\) points in the direction of greatest spread.
The second principal component \(b_2\) is orthogonal to \(b_1\) and points in the direction of second-greatest spread.
Subsequent components follow similarly, forming a nested sequence of best-fit subspaces.
The \(M\)-dimensional subspace spanned by \(\{b_1, \ldots, b_M\}\) is the best-fitting \(M\)-dimensional linear subspace in the least-squares sense.
—
Let the dataset consist of \(N\) centered samples in \(\mathbb{R}^d\): \[X \in \mathbb{R}^{N \times d}, \qquad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nd} \end{bmatrix}.\] The \(n\)-th data point is the row vector \[x_n = \begin{bmatrix} x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} \in \mathbb{R}^d.\]
The empirical covariance matrix is \[S = \frac{1}{N} X^T X \in \mathbb{R}^{d \times d}.\]
Let \[B \in \mathbb{R}^{d \times m}, \qquad B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \cdots & b_{dm} \end{bmatrix} = \begin{bmatrix} b_1 & b_2 & \cdots & b_m \end{bmatrix},\] where the columns \(b_k \in \mathbb{R}^d\) are orthonormal: \[B^T B = I_m.\]
Each direction satisfies the eigenvalue problem \[S b_k = \lambda_k b_k.\]
The low-dimensional representation is \[Z = X B \in \mathbb{R}^{N \times m},\] \[Z = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1m} \\ z_{21} & z_{22} & \cdots & z_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ z_{N1} & z_{N2} & \cdots & z_{Nm} \end{bmatrix}.\]
Componentwise, \[z_{nk} = \sum_{j=1}^d x_{nj} \, b_{jk} = b_k^T x_n.\]
In particular, the first principal component is \[z_{n1} = b_{11}x_{n1} + b_{21}x_{n2} +\cdots+ b_{d1}x_{nd}.\] Thus the first principal component is a linear combination of all original features.
Let \[z_1 = \begin{bmatrix} z_{11} \\ z_{21} \\ \vdots \\ z_{N1} \end{bmatrix}.\] Then \[\mathrm{Var}(z_1) = \frac{1}{N} \sum_{n=1}^N z_{n1}^2 = \frac{1}{N} \sum_{n=1}^N (b_1^T x_n)^2 = b_1^T S b_1.\] PCA chooses \(b_1\) to maximize this variance: \[\max_{\|b_1\|=1} \; b_1^T S b_1.\]
The reconstructed data is \[\hat{X} = Z B^T = X B B^T.\]
The matrix \[P = B B^T \in \mathbb{R}^{d \times d}\] has components \[P_{ij} = \sum_{k=1}^m b_{ik} b_{jk}.\]
Thus reconstruction componentwise is \[\hat{x}_{ni} = \sum_{j=1}^d x_{nj} P_{ji} = \sum_{j=1}^d x_{nj} \left( \sum_{k=1}^m b_{jk} b_{ik} \right).\]
If \(m < d\), then \[B^T B = I_m, \qquad BB^T \neq I_d.\] Instead, \[(BB^T)^2 = BB^T,\] so \(BB^T\) is a projection matrix onto the \(m\)-dimensional subspace spanned by \(\{b_1,\dots,b_m\}\).
Therefore \[\hat{X} = XBB^T \neq X\] unless \(m=d\), in which case \(B\) is square orthogonal and \[BB^T = I_d.\]
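These properties of \(BB^T\) are easy to verify numerically (a sketch; the orthonormal basis is built with a QR factorization, and the dimensions are my own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 5, 2

# Build an orthonormal d x m basis B via QR of a random matrix.
B, _ = np.linalg.qr(rng.normal(size=(d, m)))
P = B @ B.T

assert np.allclose(B.T @ B, np.eye(m))   # B^T B = I_m
assert not np.allclose(P, np.eye(d))     # but B B^T != I_d when m < d
assert np.allclose(P @ P, P)             # idempotent: P is a projection
```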
PCA performs the transformation \[X_{N \times d} \;\longrightarrow\; Z = X B \; (N \times m),\] where each new coordinate is \[z_{nk} = b_k^T x_n,\] and reconstruction is the orthogonal projection \[\hat{X} = XBB^T.\]
Let us return to our housing price example, now given as a dataset with three columns: \[\begin{bmatrix} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{bmatrix}.\] Here the first two columns are features (size in units of 100 sq ft and number of bedrooms) and the last column is the target variable (price in Rs. lakhs). For PCA, we consider only the feature columns: \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix}.\] Why only the feature columns? In real life we may not know the target variable for every data point. That does not make the dataset useless: we can still use the feature columns to learn about the structure of the data. PCA finds a lower-dimensional representation that captures the most important information, which is useful for visualization, noise reduction, and as a preprocessing step for other machine learning algorithms, where it can reduce overfitting and computational cost. In this example, we will reduce the two features (size and number of bedrooms) to a single feature that captures the most variance in the data. The first step is to center the data by subtracting the mean of each feature from the corresponding feature values. This gives each feature a mean of zero, which matters for PCA because it studies directions of maximum variance rather than absolute values.
The mean of the first feature (size) is \((10 + 20 + 15) / 3 = 15\), and the mean of the second feature (number of bedrooms) is \((2 + 3 + 2) / 3 \approx 2.33\). Centering the data gives: \[\begin{bmatrix} 10 - 15 & 2 - 2.33 \\ 20 - 15 & 3 - 2.33 \\ 15 - 15 & 2 - 2.33 \end{bmatrix}.\] This results in the centered data matrix: \[\begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] Now we can compute the covariance matrix of the centered data, because PCA relies on the covariance structure of the data to identify the principal components: principal components are directions in the feature space along which the data varies the most, and the covariance matrix captures this variance together with the relationships between different features. Geometrically, the covariance matrix describes how the data is spread out in different directions, which is essential for determining the axes of maximum variance that PCA seeks. This is shown graphically below:
This graph shows the centered data points in a 2D space defined by size and number of bedrooms. The points are plotted relative to the origin, which represents the mean of the original data. The spread of these points indicates the variance in the data, which PCA will analyze to find the principal components.
The covariance matrix \(C\) is computed as: \[C = \frac{1}{n} Z^\top Z,\] where \(Z\) is the centered data matrix and \(n\) is the number of data points. In our case, \(n = 3\), so we have: \[C = \frac{1}{3} \begin{bmatrix} -5 & 5 & 0 \\ -0.33 & 0.67 & -0.33 \end{bmatrix} \begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] Calculating this gives: \[C = \frac{1}{3} \begin{bmatrix} (-5)(-5) + (5)(5) + (0)(0) & (-5)(-0.33) + (5)(0.67) + (0)(-0.33) \\ (-0.33)(-5) + (0.67)(5) + (-0.33)(0) & (-0.33)(-0.33) + (0.67)(0.67) + (-0.33)(-0.33) \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 25 + 25 + 0 & 1.65 + 3.35 + 0 \\ 1.65 + 3.35 + 0 & 0.1089 + 0.4489 + 0.1089 \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 50 & 5 \\ 5 & 0.6667 \end{bmatrix} = \begin{bmatrix}16.6667 & 1.6667 \\ 1.6667 & 0.2222\end{bmatrix}.\] Next step is to compute the eigenvalues and eigenvectors of the covariance matrix \(C\). The eigenvalues represent the amount of variance captured by each principal component, while the eigenvectors indicate the directions of these components in the feature space.
At the heart of PCA lies the optimization problem \[\max_{\|v\|=1} \; v^\top \Sigma v,\] where \(\Sigma\) is the covariance matrix of the mean-centered data.
This problem asks for a linear combination of features whose values have maximum variance. Without the unit-length constraint \(\|v\|=1\), the objective would increase arbitrarily by scaling \(v\), so the constraint is essential.
To solve this constrained maximization, we introduce a Lagrange multiplier: \[\mathcal{L}(v,\lambda) = v^\top \Sigma v - \lambda (v^\top v - 1).\]
Taking derivatives with respect to \(v\) and setting them to zero yields \[\nabla_v \mathcal{L} = 2\Sigma v - 2\lambda v = 0,\] which simplifies to \[\Sigma v = \lambda v.\]
This equation is precisely the definition of an eigenvector–eigenvalue pair. Thus, eigenvectors are not assumed in PCA; they emerge inevitably as the stationary points of the variance maximization problem.
For any unit vector \(v\), \[v^\top \Sigma v\] equals the variance of the one-dimensional projection \(z = x^\top v\).
When \(v\) is an eigenvector of \(\Sigma\) with eigenvalue \(\lambda\), this quantity becomes \[v^\top \Sigma v = \lambda.\]
Hence, the eigenvalue directly measures the variance captured by its corresponding eigenvector. Selecting the eigenvector with the largest eigenvalue therefore yields the direction of maximum variance, called the first principal component.
After the first component is chosen, PCA seeks additional components that:
capture as much remaining variance as possible,
do not duplicate information already captured.
This leads naturally to the requirement that subsequent vectors be orthogonal to earlier ones. The eigenvectors of a symmetric matrix such as \(\Sigma\) are automatically orthogonal, so they provide an ordered set of non-redundant directions.
Eigenvectors define weight patterns across features.
Eigenvalues quantify how much variance each pattern explains.
Keeping the top \(k\) eigenvectors yields the best \(k\)-dimensional linear approximation of the data.
Eigenvectors should not be interpreted as causes or predictors. They represent optimal coordinates in which the structure of the data is most economically expressed.
Eigenvectors and eigenvalues appear in PCA because maximizing variance under a normalization constraint mathematically forces the eigenvalue equation; they quantify optimal directions and the amount of variation captured along them.
To find the eigenvalues \(\lambda\) of the covariance matrix \(C\), we solve the characteristic equation: \[\det(C - \lambda I) = 0,\] where \(I\) is the identity matrix. Thus, we have: \[\det\begin{bmatrix}16.6667 - \lambda & 1.6667 \\ 1.6667 & 0.2222 - \lambda\end{bmatrix} = 0.\] Calculating the determinant gives: \[(16.6667 - \lambda)(0.2222 - \lambda) - (1.6667)(1.6667) = 0.\] Expanding this, we get: \[\lambda^2 - 16.8889\lambda + (16.6667 \cdot 0.2222 - 2.7778) = 0,\] \[\lambda^2 - 16.8889\lambda + 0.9259 = 0.\] Using the quadratic formula: \[\lambda = \frac{16.8889 \pm \sqrt{(16.8889)^2 - 4 \cdot 1 \cdot 0.9259}}{2}.\] Calculating the discriminant: \[(16.8889)^2 - 4 \cdot 0.9259 \approx 285.2349 - 3.7037 = 281.5312,\] whose square root is approximately \(16.7789\). Thus, \[\lambda \approx \frac{16.8889 \pm 16.7789}{2}.\] This gives us two eigenvalues: \[\lambda_1 \approx 16.8339, \quad \lambda_2 \approx 0.0550.\] As a check, \(\lambda_1 + \lambda_2 \approx 16.8889 = \operatorname{tr}(C)\) and \(\lambda_1 \lambda_2 \approx 0.9259 = \det(C)\). Next, we find the eigenvectors corresponding to each eigenvalue by solving the equation: \[(C - \lambda I)v = 0.\] For \(\lambda_1 \approx 16.8339\): \[\begin{bmatrix}16.6667 - 16.8339 & 1.6667 \\ 1.6667 & 0.2222 - 16.8339\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}-0.1672 & 1.6667 \\ 1.6667 & -16.6117\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] The first row gives \(v_2 \approx 0.1003\, v_1\); normalizing to unit length yields the eigenvector: \[v_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] For \(\lambda_2 \approx 0.0550\): \[\begin{bmatrix}16.6667 - 0.0550 & 1.6667 \\ 1.6667 & 0.2222 - 0.0550\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}16.6117 & 1.6667 \\ 1.6667 & 0.1672\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] The first row gives \(v_2 \approx -9.967\, v_1\); normalizing yields the eigenvector: \[v_2 \approx \begin{bmatrix}-0.0998 \\ 0.9950\end{bmatrix}.\] The eigenvector corresponding to the largest eigenvalue \(\lambda_1\) represents the direction of the first principal component, which captures the most variance in the data.
In this case, the first principal component is approximately: \[PC_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] To reduce the dimensionality of the data from two features to one, we project the centered data onto the first principal component. This is done by taking the dot product of each centered data point with the first principal component vector.
The projection of a centered data point \(z_i\) onto the first principal component \(PC_1\) is given by: \[p_i = z_i^\top PC_1.\] Calculating the projections for each centered data point: \[\begin{aligned} p_1 &= \begin{bmatrix}-5 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx -5 \cdot 0.9950 + (-0.33) \cdot 0.0998 \approx -4.9750 - 0.0329 \approx -5.0079, \\ p_2 &= \begin{bmatrix}5 & 0.67\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx 5 \cdot 0.9950 + 0.67 \cdot 0.0998 \approx 4.9750 + 0.0669 \approx 5.0419, \\ p_3 &= \begin{bmatrix}0 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx 0 \cdot 0.9950 + (-0.33) \cdot 0.0998 \approx -0.0329. \end{aligned}\] Thus, the one-dimensional representation of the data after PCA is approximately: \[\begin{bmatrix}-5.008 \\ 5.042 \\ -0.033\end{bmatrix}.\] This reduced representation captures the most significant variance in the original two-dimensional feature space, allowing us to analyze and visualize the data more effectively while reducing complexity. Let us find the variance of this one-dimensional representation: \[\mathrm{Var}(p) = \frac{1}{3} \sum_{i=1}^3 p_i^2 \approx \frac{1}{3}((-5.008)^2 + (5.042)^2 + (-0.033)^2) \approx \frac{1}{3}(25.080 + 25.422 + 0.0011) \approx 16.834.\] This variance agrees with the largest eigenvalue \(\lambda_1 \approx 16.8339\), confirming that the first principal component captures the majority of the variance in the data. The second principal component captures the remaining variance, which is much smaller in this case, indicating that the data is primarily spread along the direction of the first principal component.
See the calculation below, projecting each centered data point onto \(v_2\): \[\begin{aligned} q_1 &= (-5)(-0.0998) + (-0.33)(0.9950) \approx 0.4990 - 0.3284 \approx 0.1707, \\ q_2 &= (5)(-0.0998) + (0.67)(0.9950) \approx -0.4990 + 0.6667 \approx 0.1677, \\ q_3 &= (0)(-0.0998) + (-0.33)(0.9950) \approx -0.3284. \end{aligned}\] \[\mathrm{Var}(q) = \frac{1}{3} \sum_{i=1}^3 q_i^2 \approx \frac{1}{3}(0.0291 + 0.0281 + 0.1078) \approx 0.055.\] This variance agrees with the second eigenvalue \(\lambda_2 \approx 0.0550\), confirming that the second principal component captures the remaining variance in the data. The first principal component is the most informative direction in the feature space, while the second principal component captures a much smaller amount of variance, indicating that it may be less important for understanding the structure of the data. This example illustrates how PCA identifies the directions of maximum variance in the data and allows us to reduce the dimensionality while retaining as much information as possible. By projecting the data onto the first principal component, we can analyze the most significant patterns in the data with a simpler representation.
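As an independent check on the hand arithmetic in this example, the full pipeline can be rerun with `numpy` (a sketch; only the two feature columns of the housing data are used, and variable names are my own):

```python
import numpy as np

# Feature columns of the housing example: size (100 sq ft), bedrooms.
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])

# Step 1: mean-center the data.
Z = X - X.mean(axis=0)

# Step 2: covariance matrix (1/n convention, matching the notes).
C = Z.T @ Z / len(Z)

# Step 3: eigendecomposition (eigh is for symmetric matrices and
# returns eigenvalues in ascending order, so sort descending).
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the first principal component.
p = Z @ eigvecs[:, 0]

print(eigvals)   # eigenvalues of the covariance matrix
print(p.var())   # variance of the scores equals the top eigenvalue
```

Because floating-point arithmetic avoids the rounding done by hand above, the printed values are slightly more precise than the worked figures; the sign of the principal component (and hence of the scores) is arbitrary.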