Mathematics for Machine Learning
Course Notes
Suraj Singh Khurana
Spring 2026
Suppose a real-estate website wants to predict house prices from features like size and number of bedrooms. To do anything, we first need some data:
| Size (\(x_1\)) | Bedrooms (\(x_2\)) | Price (Rs. lakhs) |
|---|---|---|
| 10 | 2 | 70 |
| 20 | 3 | 130 |
Here \[\begin{aligned} x_1 &= \text{house size (in 100 sq ft)}, \\ x_2 &= \text{number of bedrooms}, \\ w_1 &= \text{price contribution per unit size (Rs. lakhs per 100 sq ft)}, \\ w_2 &= \text{price contribution per bedroom (Rs. lakhs)}. \end{aligned}\]
Now we want to build a model that predicts the price of a house based on its size and number of bedrooms. We assume that the price contributions from size and number of bedrooms are linear and additive.
Thus the linear model for the price (in Rs. lakhs) is \[\hat y = w_1 x_1 + w_2 x_2,\] where \(\hat y\) denotes the predicted price. Here \(\hat y\) is a linear combination of the input features \(x_1\) and \(x_2\) with the weights \(w_1\) and \(w_2\) representing the model parameters. The word feature refers to an individual measurable property or characteristic of the data being analyzed. In this case, the features are the size of the house and the number of bedrooms. To find the best-fitting weights \(w_1\) and \(w_2\), we can set up a system of equations based on the provided data points. From the data, we have two equations: \[\begin{aligned} 70 &= w_1 \cdot 10 + w_2 \cdot 2, \\ 130 &= w_1 \cdot 20 + w_2 \cdot 3. \end{aligned}\] We can solve this system of equations to find the values of \(w_1\) and \(w_2\). Rearranging the equations, we get: \[\begin{aligned} 10 w_1 + 2 w_2 &= 70, \\ 20 w_1 + 3 w_2 &= 130. \end{aligned}\] We can solve this system using methods such as substitution, elimination, or matrix methods. Substitution or elimination would be straightforward here, but for larger systems, matrix methods are more efficient. Let us therefore express the system in matrix form: \[\begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}70 \\ 130\end{bmatrix}.\] Let \(A = \begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix}\), \(\mathbf{w} = \begin{bmatrix}w_1 \\ w_2\end{bmatrix}\), and \(\mathbf{b} = \begin{bmatrix}70 \\ 130\end{bmatrix}\). Then we have \(A \mathbf{w} = \mathbf{b}\). To solve for \(\mathbf{w}\), we can use the inverse of matrix \(A\) (if it exists) or use other methods like Gaussian elimination. \[\mathbf{w} = A^{-1} \mathbf{b}.\]
Calculating the inverse of \(A\): \[\begin{aligned} A^{-1} &= \frac{1}{(10)(3) - (2)(20)} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \frac{1}{30 - 40} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= -\frac{1}{10} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \begin{bmatrix}-\dfrac{3}{10} & \dfrac{2}{10} \\ \dfrac{20}{10} & -\dfrac{10}{10}\end{bmatrix} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix}. \end{aligned}\]
Now, multiplying \(A^{-1}\) with \(\mathbf{b}\): \[\begin{aligned} \mathbf{w} &= A^{-1} \mathbf{b} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix} \begin{bmatrix}70 \\ 130\end{bmatrix} \\ &= \begin{bmatrix}-0.3 \cdot 70 + 0.2 \cdot 130 \\ 2 \cdot 70 - 1 \cdot 130\end{bmatrix} \\ &= \begin{bmatrix}-21 + 26 \\ 140 - 130\end{bmatrix} \\ &= \begin{bmatrix}5 \\ 10\end{bmatrix}. \end{aligned}\]
Thus, we find that \(w_1 = 5\) and \(w_2 = 10\). Therefore, the linear model for predicting house prices is: \[\hat y = 5 x_1 + 10 x_2.\] This means that for every additional 100 sq ft of house size, the price increases by Rs.5 lakhs, and for every additional bedroom, the price increases by Rs.10 lakhs.
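As a sanity check, this 2×2 system can be solved in a couple of lines. The sketch below is illustrative (it assumes NumPy is available; these notes do not prescribe a language) and uses `np.linalg.solve`, which factorizes \(A\) rather than forming \(A^{-1}\) explicitly:

```python
import numpy as np

# Coefficient matrix and price vector from the two houses above
A = np.array([[10.0, 2.0],
              [20.0, 3.0]])
b = np.array([70.0, 130.0])

# np.linalg.solve uses an LU factorization; preferable to computing A^{-1}
w = np.linalg.solve(A, b)
print(w)  # → [ 5. 10.]
```

The result matches the hand calculation: \(w_1 = 5\), \(w_2 = 10\).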
We should see the formulation below \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 70 \\ 130 \end{bmatrix}\] as \[X\mathbf{w}=\mathbf{y}.\] This abstract form is all a machine learning system ever sees: a matrix of features and a vector of targets. Now suppose we get a third house with size 15 (i.e., 1500 sq ft) and 2 bedrooms. In other words, our data table now looks like
| Size (\(x_1\)) | Bedrooms (\(x_2\)) | Price (Rs. lakhs) |
|---|---|---|
| 10 | 2 | 70 |
| 20 | 3 | 130 |
| 15 | 2 | 100 |
We can no longer solve for \(w_1\) and \(w_2\) exactly: we now have three equations but only two unknowns. Even if we try, we will find that no exact solution satisfies all three equations simultaneously. If you do not believe me, try Gaussian elimination. We get the following augmented matrix: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{array}\right]\] Applying row operations (\(R_2 \leftarrow R_2 - 2R_1\), \(R_3 \leftarrow R_3 - \tfrac{3}{2}R_1\), then \(R_3 \leftarrow R_3 - R_2\)), we obtain the row echelon form: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 0 & -1 & -10 \\ 0 & 0 & 5 \end{array}\right]\] The last row corresponds to the equation \(0 = 5\), a contradiction, which shows the system is inconsistent: no choice of \(w_1\) and \(w_2\) satisfies all three equations at once. Or, if you want to be more mathematical about it, the rank of the coefficient matrix is 2 while the rank of the augmented matrix is 3; since the two ranks differ, the system has no solution. (Indeed, the unique solution of the first two equations, \(w_1 = 5\), \(w_2 = 10\), predicts \(15 \cdot 5 + 2 \cdot 10 = 95\) for the third house, not 100.)
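The rank argument can be checked numerically. A short NumPy sketch (NumPy assumed) comparing the rank of the coefficient matrix with the rank of the augmented matrix:

```python
import numpy as np

# Three houses: the system Xw = y is now overdetermined
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

rank_X = np.linalg.matrix_rank(X)                          # rank of coefficient matrix
rank_aug = np.linalg.matrix_rank(np.column_stack([X, y]))  # rank of augmented matrix

print(rank_X, rank_aug)  # 2 3 — the ranks differ, so no exact solution exists
```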
But does this mean we cannot find a good model? Not at all. We can instead look for weights \(w_1\) and \(w_2\) that minimize the error between the predicted prices and the actual prices. Since there is no exact solution of \(X\mathbf{w}=\mathbf{y}\), we look for an approximate one, that is, \(X\mathbf{w} \approx \mathbf{y}\). Old question (math): “Find \(\mathbf w\) such that \(X\mathbf w=\mathbf y\).”
New question (ML): “Find \(\mathbf w\) such that \(X\mathbf w\approx\mathbf y\) as closely as possible.”
Define the error for the \(i\)-th house as \[e_i=\hat y_i-y_i, \qquad \hat y_i = w_1 x_{i1}+w_2 x_{i2}.\] One common approach is to consider the following optimization problem: \[\min_{w_1, w_2} \sum_{i=1}^{3} (y_i - \hat{y}_i)^2,\] where \(y_i\) is the actual price of the \(i\)-th house, and \(\hat{y}_i\) is the predicted price from our linear model. Since \((y_i - \hat{y}_i)^2 = e_i^2\), this minimizes the sum of squared errors; the approach is known as the least squares method.
More formally,
In many real-world problems, such as house price prediction, we attempt to model the relationship between several input variables (features) and an observed output using a linear rule. While simple linear systems may admit exact solutions, real data is noisy, inconsistent, and overdetermined. Linear regression provides a principled mathematical framework to handle such situations.
Let \(x_i \in \mathbb{R}^d\) denote the feature vector associated with the \(i\)-th data point (for example, house size, number of bedrooms, age, etc.), and let \(y_i \in \mathbb{R}\) denote the corresponding observed output (house price). We assume a linear relationship between inputs and output of the form \[\hat{y}_i = w^\top x_i,\] where \(w \in \mathbb{R}^d\) is an unknown parameter vector to be learned from data. This assumption means that the output is linear in the parameters \(w\), which is the defining feature of linear regression.
Given \(n\) data points, we stack the feature vectors row-wise into a design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n.\] The model predictions for all data points can then be written compactly as \[\hat{y} = Xw.\] In idealized situations, one might hope for an exact solution to the equation \(Xw = y\). However, when \(n > d\) and the data is noisy, such an equation is typically inconsistent, as demonstrated earlier using rank and Gaussian elimination.
The discrepancy between the model predictions and the observed outputs is captured by the residual vector \[r = Xw - y.\] Each component \(r_i = w^\top x_i - y_i\) represents the prediction error on a single data point. Since exact satisfaction of all equations is generally impossible, the goal is to find a parameter vector \(w\) that makes these residuals collectively as small as possible.
To quantify the total error, we introduce a scalar loss function defined as the squared Euclidean norm of the residual vector: \[\mathcal{L}(w) = \|Xw - y\|^2 = \sum_{i=1}^n (w^\top x_i - y_i)^2.\] This choice, known as the least squares loss, is motivated by several factors: it penalizes large errors more strongly, is differentiable, leads to a convex optimization problem, and admits a clear geometric interpretation.
Linear regression is thus formulated as the following optimization problem: \[\min_{w \in \mathbb{R}^d} \; \|Xw - y\|^2.\] This problem seeks the parameter vector whose predictions are closest to the observed data in the least squares sense. Importantly, this is no longer a problem of solving linear equations exactly, but of minimizing an error function.
It is important to clarify how the vectors involved in the linear model are represented and why the expression \(w^\top x_i\) is mathematically valid.
Each feature vector \(x_i\) is represented as a column vector in \(\mathbb{R}^d\): \[x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \in \mathbb{R}^{d \times 1},\] where each component corresponds to a measurable attribute of the data point (for example, house size, number of bedrooms, etc.).
Similarly, the parameter vector \(w\) is also represented as a column vector: \[w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d \times 1}.\]
Since both \(w\) and \(x_i\) are column vectors, their direct product \(wx_i\) is not defined. To obtain a scalar prediction, we take the transpose of \(w\), yielding a row vector: \[w^\top = \begin{bmatrix} w_1 & w_2 & \cdots & w_d \end{bmatrix} \in \mathbb{R}^{1 \times d}.\]
The prediction for the \(i\)-th data point is then given by the matrix product \[\hat{y}_i = w^\top x_i,\] where \[(1 \times d)(d \times 1) = 1 \times 1.\] Thus, \(w^\top x_i\) is a scalar, as required to model a real-valued output such as house price.
Explicitly, this dot product can be written as \[w^\top x_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id},\] which shows that the linear model is a weighted sum of the input features.
When stacking all feature vectors into the design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d},\] each row of \(X\) corresponds to a single data point written as a transposed feature vector. The matrix–vector product \[Xw\] then yields an \(n \times 1\) vector, where the \(i\)-th entry is precisely \(w^\top x_i\). This ensures dimensional consistency between the individual prediction formula and the global matrix formulation.
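The dimensional consistency claimed above is easy to confirm concretely. A small NumPy sketch (NumPy assumed; the numbers are the house data used earlier) checking that the \(i\)-th entry of \(Xw\) equals the dot product \(w^\top x_i\):

```python
import numpy as np

# n = 3 data points, d = 2 features (the house data from earlier)
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
w = np.array([5.0, 10.0])

# The i-th entry of the matrix-vector product X @ w ...
preds = X @ w
# ... equals the dot product of w with the i-th row (feature vector) of X
row_wise = np.array([w @ X[i] for i in range(X.shape[0])])

print(np.allclose(preds, row_wise))  # True
```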
This careful handling of vector orientation is not merely a notational choice; it is essential for extending linear regression to high-dimensional settings and for implementing learning algorithms correctly in numerical software.
To find the minimizer, we differentiate the loss function with respect to \(w\). We achieve this by applying the chain rule and properties of matrix calculus. Let us derive it from scratch in detail: \[\begin{aligned} \frac{\partial}{\partial w} \|Xw - y\|^2 &= \frac{\partial}{\partial w} (Xw - y)^\top (Xw - y) \\ &= \frac{\partial}{\partial w} \left( w^\top X^\top X w - 2 y^\top X w + y^\top y \right) \\ &= 2 X^\top X w - 2 X^\top y. \end{aligned}\] Why do the steps above hold? Because we use the following rules of matrix calculus:
For a vector \(a\), \(\frac{\partial}{\partial w} (w^\top a) = a\). For the derivative \(\frac{\partial}{\partial w} (w^\top a)\), we use the linearity of differentiation. Since \(w^\top a = \sum_j w_j a_j\), we have \(\frac{\partial}{\partial w_i} (w^\top a) = a_i\), which gives \(\frac{\partial}{\partial w} (w^\top a) = a\).
For a matrix \(A\), \(\frac{\partial}{\partial w} (w^\top A w) = (A + A^\top) w\). If \(A\) is symmetric, this simplifies to \(2 A w\).
For a symmetric matrix \(A\), we have \(\frac{\partial}{\partial w} (w^\top A w) = 2 A w\). To see this, expand \(w^\top A w = \sum_i \sum_j w_i A_{ij} w_j\). Then \(\frac{\partial}{\partial w_k}(w^\top A w) = \sum_j A_{kj} w_j + \sum_i w_i A_{ik} = \sum_j (A_{kj} + A_{jk}) w_j\). Since \(A\) is symmetric, \(A_{jk} = A_{kj}\), so this becomes \(2 \sum_j A_{kj} w_j = 2(Aw)_k\), giving \(\frac{\partial}{\partial w}(w^\top A w) = 2Aw\).
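These rules, and the resulting gradient \(2X^\top X w - 2X^\top y\), can be sanity-checked numerically. A short NumPy sketch (NumPy assumed) comparing the analytic gradient against central finite differences at an arbitrarily chosen point:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

def loss(w):
    r = X @ w - y
    return r @ r  # ||Xw - y||^2

def grad(w):
    return 2 * X.T @ (X @ w - y)  # the analytic gradient derived above

# Central finite differences as an independent check of the derivation
w0 = np.array([1.0, -2.0])  # an arbitrary test point
eps = 1e-6
num_grad = np.array([
    (loss(w0 + eps * e) - loss(w0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(grad(w0), num_grad, rtol=1e-4))  # True
```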
Setting the gradient to zero yields the normal equations: \[X^\top X w = X^\top y.\]
These equations form a system of \(d\) linear equations in \(d\) unknowns. Unlike the original system \(Xw = y\), the normal equations are always consistent: \(X^\top y\) lies in the column space of \(X^\top X\), since that column space equals the column space of \(X^\top\).
If the matrix \(X^\top X\) is invertible, the solution to the normal equations is given by \[w = (X^\top X)^{-1} X^\top y.\] This expression provides a closed-form formula for the linear regression estimator. It characterizes the theoretical solution but is not always used directly in practice due to computational and numerical stability considerations.
From a geometric perspective, the columns of \(X\) span a subspace of \(\mathbb{R}^n\), known as the column space of \(X\). The vector \(Xw\) represents a point in this subspace. Linear regression finds the vector \(Xw\) that is closest to \(y\) in Euclidean distance. Consequently, the residual vector \(r = Xw - y\) is orthogonal to the column space of \(X\), which explains the condition \[X^\top r = 0.\] Thus, linear regression can be interpreted as the orthogonal projection of \(y\) onto the column space of \(X\).
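Both the closed-form solution and the orthogonality condition can be checked in a few lines. This NumPy sketch (NumPy assumed; three-house data from earlier) solves the normal equations, compares against `np.linalg.lstsq` (which solves the same least squares problem via the SVD, the numerically preferred route), and verifies \(X^\top r \approx 0\):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Solve the normal equations X^T X w = X^T y (better than forming the inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # ≈ [5.778 5.556]

# lstsq solves the same least squares problem via the SVD
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual is orthogonal to the column space of X: X^T r = 0
r = X @ w - y
print(np.allclose(w, w_lstsq), np.allclose(X.T @ r, 0, atol=1e-6))  # True True
```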
Linear regression qualifies as a machine learning model because its parameters are inferred from data rather than specified explicitly, the data typically contains noise and inconsistencies, and the learned model is used to make predictions on unseen data. The process involves empirical risk minimization rather than exact algebraic solution.
In modern machine learning applications, the number of data points \(n\) may be in the millions and the number of features \(d\) may be in the thousands. In such settings, forming and inverting \(X^\top X\) is computationally expensive and numerically unstable. Consequently, linear regression is commonly solved using numerical optimization methods such as gradient descent, stochastic gradient descent, or iterative linear solvers. These methods approximate the solution to the normal equations efficiently at scale.
Linear regression transforms an inconsistent, overdetermined system of linear equations into a well-posed optimization problem. It replaces the impossible requirement of satisfying all equations exactly with the achievable goal of minimizing overall error. In this sense, linear regression is best understood as numerical linear algebra applied to real-world data.
Now let us return to our housing price example and compute the optimal weights using the least squares approach. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] The normal equations are \[X^\top X w = X^\top y.\] Calculating \(X^\top X\) and \(X^\top y\): \[X^\top X = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}10 & 2 \\ 20 & 3 \\ 15 & 2\end{bmatrix} = \begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix},\] and \[X^\top y = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}70 \\ 130 \\ 100\end{bmatrix} = \begin{bmatrix} 10(70) + 20(130) + 15(100) \\ 2(70) + 3(130) + 2(100) \end{bmatrix}\] \[= \begin{bmatrix} 700 + 2600 + 1500 \\ 140 + 390 + 200 \end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\] Thus, the normal equations become \[\begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\] Solving via the determinant \(\Delta = 725 \cdot 17 - 110^2 = 12325 - 12100 = 225\): \[\begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \frac{1}{225}\begin{bmatrix}17 & -110 \\ -110 & 725\end{bmatrix} \begin{bmatrix}4800 \\ 730\end{bmatrix}\] \[= \frac{1}{225}\begin{bmatrix} 81600 - 80300 \\ -528000 + 529250 \end{bmatrix} = \frac{1}{225}\begin{bmatrix}1300 \\ 1250\end{bmatrix} = \begin{bmatrix}5.78 \\ 5.56\end{bmatrix}.\] Thus \(w_1 \approx 5.78\) and \(w_2 \approx 5.56\), representing the best-fit weights that minimize the squared error across all three data points. How come we got different weights than before? Because now we are minimizing the squared error across all three data points, rather than trying to fit exactly two points. Earlier weights were 5 and 10, which perfectly fit the first two points but did not account for the third point. 
With the third point included, the weights become approximately 5.78 and 5.56 (exactly \(w_1 = 52/9\) and \(w_2 = 50/9\)), which balance the errors across all three points. Let us check the predictions: \[\hat{y}_1 = 5.78 \cdot 10 + 5.56 \cdot 2 = 57.8 + 11.12 = 68.92,\] \[\hat{y}_2 = 5.78 \cdot 20 + 5.56 \cdot 3 = 115.6 + 16.68 = 132.28,\] \[\hat{y}_3 = 5.78 \cdot 15 + 5.56 \cdot 2 = 86.7 + 11.12 = 97.82.\] None of the predictions matches its observed price exactly; the residuals \((-1.08,\ 2.28,\ -2.18)\) are the cost of fitting all three points at once, and no other choice of weights achieves a smaller sum of squared errors.
In practice, computing matrix inverses is computationally expensive and numerically unstable. Instead, we use gradient descent to minimize \(\mathcal{L}(w) = \|Xw - y\|^2\) iteratively, starting from an initial guess \(w^{(0)} = 0\): \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}),\] where \(\eta\) is the learning rate. The gradient \(\nabla \mathcal{L}(w)\) is given by \[\nabla \mathcal{L}(w) = 2 X^\top (Xw - y).\] By repeatedly applying this update rule, we can converge to the optimal weights without explicitly computing the matrix inverse. Why does this work? Because gradient descent is an optimization algorithm that iteratively moves towards the minimum of a function by taking steps proportional to the negative of the gradient. In the case of linear regression, the loss function is convex, ensuring that gradient descent will converge to the global minimum. Here is a pictorial representation of gradient descent:
First recall what convexity is. A function \(f: \mathbb{R}^d \to \mathbb{R}\) is convex if for any two points \(x, y \in \mathbb{R}^d\) and any \(\theta \in [0, 1]\), \[f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).\] Geometrically, this means that the line segment connecting any two points on the graph of the function lies above or on the graph itself. For a convex function, every local minimum is a global minimum, which makes optimization problems easier to solve. Why? Because if \(w\) were a local minimum and some other point \(v\) had a strictly smaller value, the convexity inequality along the segment from \(w\) to \(v\) would give points arbitrarily close to \(w\) with strictly smaller values, contradicting local minimality. (If \(f\) is strictly convex, the minimizer is moreover unique.)
Why is the loss function convex, ensuring that gradient descent will converge to the global minimum? To prove that the loss function \(\mathcal{L}(w) = \|Xw - y\|^2\) is convex, we can show that its Hessian matrix is positive semidefinite. The Hessian matrix of \(\mathcal{L}(w)\) is given by \[H = \nabla^2 \mathcal{L}(w) = 2 X^\top X.\] To show that \(H\) is positive semidefinite, we need to show that for any vector \(z \in \mathbb{R}^d\), \[z^\top H z \geq 0.\] Calculating \(z^\top H z\): \[z^\top H z = z^\top (2 X^\top X) z = 2 (X z)^\top (X z) = 2 \|X z\|^2 \geq 0.\] Since \(\|X z\|^2\) is always non-negative, it follows that \(H\) is positive semidefinite. Therefore, the loss function \(\mathcal{L}(w)\) is convex.
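For our concrete \(X\), positive semidefiniteness can also be verified numerically by checking the eigenvalues of the Hessian. A small sketch (NumPy assumed):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])

H = 2 * X.T @ X  # Hessian of the least squares loss

# eigvalsh is specialized for symmetric matrices and returns real eigenvalues
eigvals = np.linalg.eigvalsh(H)
print(eigvals)  # both eigenvalues are positive (about 0.61 and 1483.4)
```

Note the huge spread between the two eigenvalues: this matrix is ill-conditioned, which will matter for the speed of gradient descent below.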
We still need to justify why gradient descent is a good idea. There are several reasons:
Scalability: Gradient descent can handle large datasets and high-dimensional feature spaces efficiently, as it does not require storing or inverting large matrices.
Flexibility: It can be easily adapted to different loss functions and regularization techniques.
Convergence Guarantees: For convex loss functions, gradient descent is guaranteed to converge to the global minimum, provided an appropriate learning rate is chosen.
Simplicity: The algorithm is straightforward to implement and understand, making it accessible for practitioners.
Why is computing the inverse of a matrix computationally expensive and numerically unstable? Computing the inverse of a matrix, especially a large one, is computationally expensive because the time complexity of matrix inversion algorithms is typically \(O(d^3)\) for a \(d \times d\) matrix, which becomes prohibitive for high-dimensional data. Additionally, matrix inversion can be numerically unstable if the matrix is ill-conditioned (i.e., has a high condition number), leading to significant errors in the computed inverse: small perturbations in the input data can produce large changes in the inverse, making it unreliable for practical applications.
Gradient descent, in contrast, avoids these issues by iteratively refining the solution without requiring matrix inversion, making it more robust and efficient for large-scale problems. So let us now apply gradient descent to find the optimal weights for our housing price prediction problem, updating the weights by hand until convergence. Starting with an initial guess \(w^{(0)} = \begin{bmatrix}0 \\ 0\end{bmatrix}\) and a learning rate \(\eta = 10^{-5}\), we compute the gradient at each step: \[\nabla \mathcal{L}(w^{(k)}) = 2 X^\top (X w^{(k)} - y).\] We then update the weights: \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}).\] Repeating this process, we get the following sequence of weight updates: \[\begin{aligned} w^{(0)} &= \begin{bmatrix}0 \\ 0\end{bmatrix}, \\ w^{(1)} &\approx \begin{bmatrix}0.096 \\ 0.015\end{bmatrix}, \\ w^{(2)} &\approx \begin{bmatrix}0.191 \\ 0.029\end{bmatrix}, \\ w^{(3)} &\approx \begin{bmatrix}0.284 \\ 0.043\end{bmatrix}, \\ &\vdots \end{aligned}\] After sufficiently many iterations, the weights converge to approximately \(\begin{bmatrix}5.78 \\ 5.56\end{bmatrix}\), matching the closed-form solution we computed earlier. (Convergence here is slow: \(X^\top X\) has eigenvalues of roughly \(741.7\) and \(0.30\), and the small eigenvalue means the error shrinks very slowly along one direction, so this tiny learning rate needs on the order of a million iterations.) This demonstrates the effectiveness of gradient descent in finding the optimal weights for linear regression without any matrix inversion. There are many small things to consider here, like how to choose the learning rate and when to stop the iterations (convergence criteria). Below is a pictorial representation of gradient descent converging to the optimal weights:
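The hand iteration above can be automated in a few lines. Below is a minimal NumPy sketch (NumPy assumed); it uses a larger learning rate than the hand calculation, chosen just below the stability threshold \(1/\lambda_{\max}(X^\top X)\), so it converges in far fewer iterations:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

def gradient_descent(X, y, eta, num_iters):
    w = np.zeros(X.shape[1])                  # start from w = 0, as in the text
    for _ in range(num_iters):
        grad = 2 * X.T @ (X @ w - y)          # gradient of ||Xw - y||^2
        w = w - eta * grad
    return w

# X^T X is ill-conditioned here (eigenvalues ~0.30 and ~741.7), so the error
# along the small-eigenvalue direction shrinks slowly; many iterations needed.
w = gradient_descent(X, y, eta=1e-3, num_iters=50_000)
print(w)  # ≈ [5.778 5.556], matching the closed-form solution
```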
These small things, such as how to choose the learning rate and when to stop the iterations (convergence criteria), are important practical aspects of implementing gradient descent effectively. See the summary below:
Learning Rate: A small learning rate may lead to slow convergence, while a large learning rate may cause divergence. Techniques like learning rate schedules or adaptive learning rates (e.g., Adam, RMSprop) can help.
Convergence Criteria: Common criteria include a maximum number of iterations, a threshold on the change in loss between iterations, or a threshold on the norm of the gradient.
Batch Size: In practice, stochastic gradient descent (SGD) or mini-batch gradient descent is often used, where gradients are computed on subsets of the data to improve efficiency and convergence.
Regularization: Techniques like L2 regularization (Ridge regression) or L1 regularization (Lasso) can be incorporated to prevent overfitting and improve generalization.
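As an illustration of the batch-size point, here is a minimal stochastic gradient descent sketch (NumPy assumed; the seed and step count are arbitrary choices) that updates the weights using one randomly chosen data point per step. With a constant learning rate the iterates never settle exactly; they hover near the least squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

w = np.zeros(2)
eta = 1e-3  # constant step size; a decaying schedule would allow exact convergence
for step in range(20_000):
    i = rng.integers(len(y))        # sample one data point uniformly at random
    residual = X[i] @ w - y[i]
    grad = 2 * residual * X[i]      # gradient of the single-point squared error
    w = w - eta * grad

print(w)  # a noisy estimate near the least squares solution
```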
We will learn these small things in detail in later lectures. But for now, let us completely change our perspective and derive linear regression from a probabilistic viewpoint. For this, let us recall our housing price prediction problem. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] We will now assume a probabilistic model for the data generation process. Specifically, we assume that the observed prices are generated according to the linear model plus Gaussian noise: \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) are independent and identically distributed Gaussian noise terms with mean zero and variance \(\sigma^2\). This assumption reflects the idea that the true relationship between features and prices is linear, but there are random fluctuations due to unobserved factors. Recall the probability basics below:
Before diving into the probabilistic interpretation of linear regression, it is essential to review foundational concepts in probability theory.
The sample space \(\Omega\) is the set of all possible outcomes of a random experiment. An event is a subset of the sample space. For example, when rolling a fair die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\), and the event “rolling an even number” is \(\{2, 4, 6\}\).
The probability of an event \(A\), denoted \(P(A)\), is a number in \([0, 1]\) satisfying:
\(P(\Omega) = 1\) (the sample space has probability 1).
\(P(\emptyset) = 0\) (the empty event has probability 0).
For mutually exclusive events \(A_1, A_2, \ldots\), we have \(P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots\).
A random variable \(X\) is a function that assigns a numerical value to each outcome in the sample space. Random variables can be discrete (taking values in a countable set) or continuous (taking values in an interval or uncountable set). For example, the outcome of a die roll is a discrete random variable, while the height of a person is continuous.
A probability distribution describes how probability is distributed over the possible values of a random variable.
For a discrete random variable \(X\) taking values \(x_1, x_2, \ldots\), the probability mass function (PMF) is \(p(x_i) = P(X = x_i)\). The probabilities satisfy \(\sum_i p(x_i) = 1\).
For a continuous random variable \(X\), the probability density function (PDF) is \(f(x)\), which satisfies \(\int_{-\infty}^{\infty} f(x) \, dx = 1\). The probability that \(X\) lies in an interval \([a, b]\) is \[P(a \leq X \leq b) = \int_a^b f(x) \, dx.\]
The cumulative distribution function (CDF) is defined as \[F(x) = P(X \leq x).\] For a continuous random variable with PDF \(f(x)\), we have \(F(x) = \int_{-\infty}^x f(t) \, dt\), and the PDF is the derivative of the CDF: \(f(x) = \frac{dF(x)}{dx}\).
The expected value (or mean) of a random variable \(X\) is a measure of its central tendency.
For a discrete random variable, \[E[X] = \sum_i x_i \cdot p(x_i).\]
For a continuous random variable, \[E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx.\]
The expected value is linear: \(E[aX + b] = aE[X] + b\) for constants \(a\) and \(b\).
The variance of a random variable \(X\) measures the spread of its distribution around the mean: \[\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.\]
The standard deviation is \(\sigma = \sqrt{\text{Var}(X)}\).
Properties of variance include:
\(\text{Var}(aX + b) = a^2 \text{Var}(X)\) for constants \(a\) and \(b\).
\(\text{Var}(X) \geq 0\), with equality if and only if \(X\) is constant almost surely.
The Gaussian or normal distribution is one of the most important distributions in probability and statistics. A continuous random variable \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), denoted \(X \sim \mathcal{N}(\mu, \sigma^2)\), if its PDF is \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).\]
Key properties of the normal distribution include:
The PDF is symmetric about \(\mu\) and bell-shaped.
The mean, median, and mode are all equal to \(\mu\).
Approximately 68% of the probability mass lies within one standard deviation of the mean: \(P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.68\).
Approximately 95% lies within two standard deviations: \(P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.95\).
Approximately 99.7% lies within three standard deviations.
The standard normal distribution has mean \(\mu = 0\) and variance \(\sigma^2 = 1\), often denoted \(\mathcal{N}(0, 1)\).
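The 68-95-99.7 rule can be confirmed empirically by sampling. A short Monte Carlo sketch (NumPy assumed; sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0  # the standard normal distribution

samples = rng.normal(mu, sigma, size=1_000_000)

# Fraction of samples within 1, 2, and 3 standard deviations of the mean
within_1 = np.mean(np.abs(samples - mu) <= 1 * sigma)
within_2 = np.mean(np.abs(samples - mu) <= 2 * sigma)
within_3 = np.mean(np.abs(samples - mu) <= 3 * sigma)

print(within_1, within_2, within_3)  # ≈ 0.683, 0.954, 0.997
```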
When dealing with multiple random variables, we use the multivariate normal distribution. A random vector \(\mathbf{x} \in \mathbb{R}^d\) follows a multivariate normal distribution with mean vector \(\boldsymbol{\mu} \in \mathbb{R}^d\) and covariance matrix \(\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}\), denoted \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), if its PDF is \[f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right),\] where \(|\boldsymbol{\Sigma}|\) denotes the determinant of \(\boldsymbol{\Sigma}\).
Two random variables \(X\) and \(Y\) are independent if knowledge of one provides no information about the other. Formally, they are independent if \[P(X = x \text{ and } Y = y) = P(X = x) \cdot P(Y = y)\] for all values \(x\) and \(y\) (in the discrete case), or if their joint PDF factors as \(f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)\) (in the continuous case).
The conditional probability of event \(A\) given event \(B\) is \[P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)},\] provided \(P(B) > 0\).
Bayes’ theorem relates conditional probabilities: \[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.\] This theorem is fundamental in machine learning and statistical inference, allowing us to update beliefs (posterior) based on new evidence (likelihood) and prior knowledge (prior).
In the context of parameter estimation, the likelihood is the probability of observing the data given a particular value of the parameter. If we have data \(\mathbf{y} = (y_1, \ldots, y_n)\) and parameter \(w\), the likelihood is \[L(w \mid \mathbf{y}) = P(\mathbf{y} \mid w).\] The log-likelihood is \(\ell(w) = \log L(w \mid \mathbf{y})\), which is often easier to work with computationally.
Maximum likelihood estimation (MLE) is a method for estimating parameters by choosing the parameter value that maximizes the likelihood of the observed data: \[\hat{w} = \arg\max_w L(w \mid \mathbf{y}) = \arg\max_w \ell(w \mid \mathbf{y}).\]
These concepts will be essential for understanding the probabilistic formulation of linear regression, where we model noise using the normal distribution and estimate parameters via maximum likelihood.

Do all models with noise follow this approach? No. While the Gaussian noise assumption is common due to its mathematical convenience and the Central Limit Theorem, other noise distributions can be used depending on the nature of the data and the specific application. For example, if the noise is believed to be heavy-tailed, a Laplace distribution might be more appropriate. Similarly, for count data, a Poisson distribution could be used. The choice of noise model affects the likelihood function and consequently the parameter estimation method.

Recall that the Central Limit Theorem states that the sum of a large number of independent random variables, regardless of their individual distributions, tends to follow a normal distribution, provided certain conditions are met. This theorem justifies the common use of Gaussian noise in modeling real-world phenomena, as many types of measurement errors and natural variations can be viewed as the cumulative effect of numerous small, independent factors.

In the house price example, is the noise really independent and identically distributed Gaussian? In practice, the assumption of independent and identically distributed (i.i.d.) Gaussian noise may not always hold. While it is a convenient and widely used assumption, several factors can lead to violations of this assumption:
Heteroscedasticity: The variance of the noise may change with the level of the input features. For example, more expensive houses might have larger price fluctuations.
Autocorrelation: In time series data or spatial data, errors may be correlated with each other, violating the independence assumption.
Non-Gaussian Noise: The noise distribution may not be Gaussian; it could be skewed or have heavier tails than a normal distribution.
Outliers: Real-world data often contains outliers that can significantly affect the model fit and violate the i.i.d. assumption.
Despite these potential violations, the Gaussian noise assumption is often used as a first approximation due to its mathematical tractability and the Central Limit Theorem. However, it is important to assess the validity of this assumption in practice and consider alternative models or robust estimation techniques if necessary. Now let us derive the likelihood function for our linear regression model under the Gaussian noise assumption.
Given the model \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\), the conditional probability of observing \(y_i\) given \(x_i\) and parameters \(w\) is \[P(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right).\] Assuming the data points are independent, the likelihood of observing the entire dataset \((X, y)\) is the product of the individual probabilities: \[L(w \mid X, y) = \prod_{i=1}^n P(y_i \mid x_i, w) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2\right).\] Taking the logarithm of the likelihood function gives the log-likelihood: \[\ell(w \mid X, y) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2.\] To find the maximum likelihood estimate of \(w\), we can ignore the constant term \(-\frac{n}{2} \log(2\pi\sigma^2)\); maximizing the remaining term is then equivalent to minimizing the sum of squared errors: \[\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2.\] This is precisely the same optimization problem we encountered in the least squares formulation of linear regression. Thus, the maximum likelihood estimate of \(w\) under the Gaussian noise assumption coincides with the least squares solution. This connection highlights the probabilistic interpretation of linear regression as fitting a model that maximizes the likelihood of the observed data under a Gaussian noise model.
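This equivalence is easy to check numerically. The sketch below (my own synthetic data and variable names, with `numpy` assumed available) fits \(w\) by least squares and then verifies that any perturbation of that solution can only lower the Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = w_true . x + Gaussian noise (illustrative values).
n, sigma = 200, 0.5
w_true = np.array([3.0, -2.0])
X = rng.normal(size=(n, 2))
y = X @ w_true + rng.normal(scale=sigma, size=n)

# Least-squares solution: minimizes the sum of squared errors.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian log-likelihood of the data as a function of w.
def log_likelihood(w):
    r = y - X @ w
    return -n / 2 * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

# Perturbing w_ls in any direction can only decrease the likelihood,
# consistent with w_ls being the maximum likelihood estimate.
for _ in range(5):
    w_alt = w_ls + rng.normal(scale=0.1, size=2)
    assert log_likelihood(w_alt) <= log_likelihood(w_ls)
```

Because the log-likelihood is strictly concave in \(w\) here, the least-squares solution is its unique maximizer.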
Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much variance as possible. It achieves this by identifying the directions (principal components) along which the data varies the most.
In linear regression, we study a supervised problem: \[y = Xw + \varepsilon,\] where the goal is to predict an output variable \(y\) (e.g. house price) from features \(X\).
In PCA, we deliberately remove the output variable \(y\) and ask a different question:
Given only the feature data \(X\), how can we describe or represent the data as faithfully as possible using fewer numbers?
Thus PCA is an unsupervised method whose goal is not prediction, but representation and compression of data.
—
Let \[x_1, x_2, \dots, x_n \in \mathbb{R}^d\] be data points (e.g. houses described by size, bedrooms, etc.).
Since each data point \(x_i\) is a \(d\)-dimensional vector, the mean is computed as a vector average: \[\mu = \frac{1}{n}\sum_{i=1}^n x_i = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} \in \mathbb{R}^d,\] where each component \(\mu_j\) is the average of the \(j\)-th feature across all data points. When we subtract \(\mu\) from \(x_i\), we perform component-wise subtraction: \[x_i - \mu = \begin{bmatrix} x_{i1} - \mu_1 \\ x_{i2} - \mu_2 \\ \vdots \\ x_{id} - \mu_d \end{bmatrix}.\] This compact notation \(x_i \leftarrow x_i - \mu\) applies uniformly regardless of dimensionality.
We first perform mean centering: \[x_i \leftarrow x_i - \mu, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^n x_i.\]
After centering, \[\frac{1}{n}\sum_{i=1}^n x_i = 0.\]
Mean centering ensures that we study variation around the average, which is essential for defining variance and covariance.
—
After mean centering, the centered data points are distributed around the origin. To understand the structure of this distribution, we need a single mathematical object that captures:
How much each feature varies (spread along each axis),
How features vary together (whether large values in one feature tend to occur with large or small values in another).
Recall from probability that the variance of a random variable \(X\) is \(\text{Var}(X) = E[X^2]\) (when centered at zero). For data, this becomes \(\frac{1}{n}\sum_i x_i^2\), measuring spread along a single dimension.
For centered data points \(x_1, \ldots, x_n\), the outer product \(x_i x_i^\top\) creates a \(d \times d\) matrix that captures all pairwise products of coordinates: \[x_i x_i^\top = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{bmatrix} = \begin{bmatrix} x_{i1}^2 & x_{i1}x_{i2} & \cdots \\ x_{i2}x_{i1} & x_{i2}^2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}.\]
Averaging these matrices across all data points accumulates information about:
Diagonal entries: \(\frac{1}{n}\sum_i x_{ij}^2\) (variance of feature \(j\)),
Off-diagonal entries: \(\frac{1}{n}\sum_i x_{ij}x_{ik}\) (covariance between features \(j\) and \(k\)).
This is the natural generalization of univariate variance to the multivariate setting.
The covariance matrix is defined as \[\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top.\]
Each entry \[\Sigma_{jk} = \frac{1}{n}\sum_{i=1}^n x_{ij}x_{ik}\] is the covariance between feature \(j\) and feature \(k\).
Thus, the covariance matrix compactly stores:
variances of individual features (diagonal entries),
pairwise relationships between features (off-diagonal entries).
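As a quick numerical illustration (a sketch on made-up data, not the housing example), the covariance matrix built as the average of outer products agrees with the vectorized form \(\frac{1}{n}X^\top X\) and with `numpy.cov` under the \(1/n\) convention:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated features

X = X - X.mean(axis=0)   # mean centering

# Covariance as the average of outer products x_i x_i^T.
Sigma = sum(np.outer(x, x) for x in X) / n

# Same matrix, vectorized; and numpy's estimator with 1/n normalization.
assert np.allclose(Sigma, X.T @ X / n)
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```

Note that `bias=True` selects the \(1/n\) normalization used in these notes; `numpy.cov` defaults to \(1/(n-1)\).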
—
Choose a vector of weights \[v \in \mathbb{R}^d, \quad \|v\| = 1.\]
For each data point \(x_i\), define a one-dimensional summary: \[z_i = x_i^\top v.\]
This compresses each \(d\)-dimensional data point into a single number.
Different choices of \(v\) preserve different aspects of the data.
—
The variance of the summary values \(\{z_i\}\) is \[\operatorname{Var}(z) = \frac{1}{n}\sum_{i=1}^n z_i^2 = v^\top \Sigma v.\]
Interpretation:
Large variance \(\Rightarrow\) the summary distinguishes data points well.
Zero variance \(\Rightarrow\) all data points collapse to the same value (no information).
Thus, variance measures how much of the differences between data points survive compression.
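The identity \(\operatorname{Var}(z) = v^\top \Sigma v\) can be verified directly for any unit direction (a sketch on synthetic centered data; all names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2)) * np.array([3.0, 0.5])  # anisotropic cloud
X = X - X.mean(axis=0)
Sigma = X.T @ X / len(X)

v = rng.normal(size=2)
v /= np.linalg.norm(v)   # random unit direction

z = X @ v                # one-number summary of every data point
# Variance of the summaries equals the quadratic form v^T Sigma v.
assert np.isclose(z @ z / len(z), v @ Sigma @ v)
```

The agreement is exact (up to floating point) because \(\frac{1}{n}\sum_i z_i^2 = \frac{1}{n} v^\top X^\top X v = v^\top \Sigma v\) by construction.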
—
To find the best one-dimensional representation, PCA solves: \[\max_{\|v\|=1} \; v^\top \Sigma v.\]
This chooses the linear combination of features whose values vary the most across the dataset.
The solution is called the first principal component \(v_1\).
—
Given the summary \[z_i = x_i^\top v,\] the natural reconstruction of the original data point is \[\hat{x}_i = z_i\, v.\]
All reconstructed points lie in the one-dimensional subspace spanned by \(v\).
The average reconstruction error is \[\frac{1}{n}\sum_{i=1}^n \|x_i - \hat{x}_i\|^2 = \frac{1}{n}\sum_{i=1}^n \|x_i - (x_i^\top v)v\|^2.\]
A fundamental result is:
The vector \(v\) that maximizes variance also minimizes reconstruction error.
Hence, PCA finds the best low-dimensional approximation of the data in the least-squares sense.
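This equivalence can be seen numerically: sweeping unit vectors around the circle, the direction that maximizes variance is the same one that minimizes reconstruction error (a sketch on synthetic 2-D data; all names are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.3], [0.3, 0.4]])
X = X - X.mean(axis=0)
Sigma = X.T @ X / len(X)

def variance(v):
    return v @ Sigma @ v

def recon_error(v):
    X_hat = np.outer(X @ v, v)   # rank-1 reconstruction z_i * v
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Sweep unit directions around the half-circle.
thetas = np.linspace(0, np.pi, 1000)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
var_best = max(range(len(dirs)), key=lambda i: variance(dirs[i]))
err_best = min(range(len(dirs)), key=lambda i: recon_error(dirs[i]))
assert var_best == err_best   # same optimal direction
```

The underlying reason is the Pythagorean split \(\|x\|^2 = (x^\top v)^2 + \|x - (x^\top v)v\|^2\): variance and reconstruction error sum to a constant, so maximizing one minimizes the other.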
—
Consider houses described by size and number of bedrooms (mean-centered):
\[x_1 = (-5,-0.3),\quad x_2 = (5,0.7),\quad x_3 = (0,-0.3).\]
Choosing \(v=(0,1)\) (only bedrooms) produces low variance summaries.
Choosing \(v=(1,0)\) (only size) produces high variance summaries.
Thus PCA automatically discovers that house size is the dominant pattern of variation.
—
PCA provides:
optimal linear compression of data,
noise reduction by discarding low-variance directions,
interpretable dominant patterns,
better conditioning for downstream tasks (regression, clustering, density estimation).
PCA does not discover meaning or causality. It guarantees only this:
Among all linear low-dimensional representations, PCA loses the least information under squared error.
—
Let \(\mathcal{X} = \{x_1, \ldots, x_N\}\) be a dataset where \(x_n \in \mathbb{R}^D\). We seek to find a lower-dimensional representation in \(\mathbb{R}^M\) where \(M < D\), while preserving as much information as possible.
PCA projects data onto a lower-dimensional linear subspace. The key is to find an orthonormal basis \(\{b_1, \ldots, b_M\}\) of this \(M\)-dimensional subspace such that the projected data retains maximum variance.
For a given orthonormal basis \(B = [b_1, \ldots, b_M]\) where \(b_i \in \mathbb{R}^D\) and \(b_i^\top b_j = \delta_{ij}\), the projection of a data point \(x_n\) onto this subspace is: \[\tilde{x}_n = \sum_{i=1}^M z_{ni} b_i = B z_n,\] where \(z_n = [z_{n1}, \ldots, z_{nM}]^\top \in \mathbb{R}^M\) are the coordinates (principal component scores) in the new basis, computed as \(z_{ni} = x_n^\top b_i\).
PCA finds the orthonormal basis that maximizes the variance of the projected data. For mean-centered data (where \(\sum_{n=1}^N x_n = 0\)), the variance of the projections onto a single direction \(b_1\) is: \[V_1 = \frac{1}{N} \sum_{n=1}^N (x_n^\top b_1)^2 = \frac{1}{N} \sum_{n=1}^N b_1^\top x_n x_n^\top b_1 = b_1^\top S b_1,\] where \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top\) is the data covariance matrix.
The first principal component is found by solving: \[\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \|b_1\|^2 = 1.\]
Using the method of Lagrange multipliers with Lagrangian \(\mathcal{L}(b_1, \lambda_1) = b_1^\top S b_1 - \lambda_1(b_1^\top b_1 - 1)\), we obtain: \[\frac{\partial \mathcal{L}}{\partial b_1} = 2S b_1 - 2\lambda_1 b_1 = 0 \quad \Rightarrow \quad S b_1 = \lambda_1 b_1.\]
Thus, \(b_1\) is an eigenvector of \(S\) with eigenvalue \(\lambda_1\). The variance along this direction is: \[V_1 = b_1^\top S b_1 = b_1^\top (\lambda_1 b_1) = \lambda_1 b_1^\top b_1 = \lambda_1.\]
To maximize variance, we choose \(b_1\) to be the eigenvector corresponding to the largest eigenvalue \(\lambda_1\) of \(S\).
For the \(m\)-th principal component, we solve: \[\max_{b_m} \; b_m^\top S b_m \quad \text{subject to} \quad \|b_m\|^2 = 1 \text{ and } b_m \perp b_i \text{ for } i = 1, \ldots, m-1.\]
The solution is the eigenvector corresponding to the \(m\)-th largest eigenvalue of \(S\).
PCA can equivalently be formulated as minimizing the average squared reconstruction error. The reconstruction of \(x_n\) from its projection is: \[\tilde{x}_n = \sum_{i=1}^M (x_n^\top b_i) b_i = BB^\top x_n,\] where \(B = [b_1, \ldots, b_M]\).
The reconstruction error for the entire dataset is: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^N \|x_n - BB^\top x_n\|^2.\]
Using the identity \(\|x_n\|^2 = \|BB^\top x_n\|^2 + \|(I - BB^\top)x_n\|^2\) for orthonormal \(B\), we have: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n\|^2 - \frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2.\]
Since the first term is constant, minimizing \(J\) is equivalent to maximizing \(\frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2\), which is the variance of the projected data. Thus, maximum variance and minimum reconstruction error are equivalent objectives.
Center the data: Compute mean \(\mu = \frac{1}{N}\sum_{n=1}^N x_n\) and set \(x_n \leftarrow x_n - \mu\).
Compute covariance matrix: \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top = \frac{1}{N} X^\top X\), where \(X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}\).
Eigendecomposition: Compute eigenvectors and eigenvalues of \(S\): \(S = \sum_{i=1}^D \lambda_i b_i b_i^\top\) where \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D\).
Select principal components: Choose the top \(M\) eigenvectors \(B = [b_1, \ldots, b_M]\).
Project data: For each \(x_n\), compute \(z_n = B^\top x_n \in \mathbb{R}^M\).
The proportion of variance captured by the first \(M\) principal components is: \[\frac{\sum_{i=1}^M \lambda_i}{\sum_{i=1}^D \lambda_i}.\]
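The five steps above can be sketched as a single function (a minimal illustration rather than an optimized implementation; the function name `pca` is my own):

```python
import numpy as np

def pca(X, M):
    """PCA via eigendecomposition of the covariance matrix.

    X : (N, D) data matrix, one sample per row.
    M : number of principal components to keep.
    Returns scores Z (N, M), basis B (D, M), explained-variance ratio.
    """
    mu = X.mean(axis=0)
    Xc = X - mu                             # 1. center the data
    S = Xc.T @ Xc / len(Xc)                 # 2. covariance matrix (1/N)
    eigvals, eigvecs = np.linalg.eigh(S)    # 3. eigendecomposition
    order = np.argsort(eigvals)[::-1]       #    sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    B = eigvecs[:, :M]                      # 4. top-M eigenvectors
    Z = Xc @ B                              # 5. project the data
    return Z, B, eigvals[:M].sum() / eigvals.sum()
```

`numpy.linalg.eigh` is used because \(S\) is symmetric; it returns eigenvalues in ascending order, hence the explicit descending sort.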
Orthogonality: Principal components are mutually orthogonal: \(b_i^\top b_j = \delta_{ij}\).
Uncorrelated Projections: The coordinates \(z_{ni}\) and \(z_{nj}\) for \(i \neq j\) are uncorrelated across the dataset.
Variance Ordering: Variance along principal components decreases: \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0\).
Linear Transformation: PCA is a linear method that cannot capture nonlinear structure in data.
Scale Sensitivity: PCA is sensitive to feature scales; standardization is often applied before PCA.
For high-dimensional data where \(N < D\), computing the \(D \times D\) covariance matrix \(S\) is expensive. Instead, we can use the singular value decomposition (SVD) of the centered data matrix \(X \in \mathbb{R}^{N \times D}\): \[X = U \Sigma V^\top,\] where \(U \in \mathbb{R}^{N \times N}\), \(\Sigma \in \mathbb{R}^{N \times D}\), \(V \in \mathbb{R}^{D \times D}\) are the SVD factors.
The relationship with PCA:
The columns of \(V\) are the principal components (eigenvectors of \(S = \frac{1}{N}X^\top X\)).
The eigenvalues of \(S\) are \(\lambda_i = \frac{\sigma_i^2}{N}\), where \(\sigma_i\) are the singular values.
The projections are given by \(Z = XV = U\Sigma \in \mathbb{R}^{N \times D}\).
This approach is numerically more stable and efficient, especially when \(N \ll D\).
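A small sketch contrasting the two routes on random data with \(N < D\) (all names are my own); it checks the relation \(\lambda_i = \sigma_i^2 / N\):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 20, 50                      # fewer samples than features
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)

# Route 1: eigendecomposition of the D x D covariance matrix.
S = X.T @ X / N
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]

# Route 2: SVD of the centered data matrix (never forms S).
U, svals, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of S and singular values of X agree: lambda_i = sigma_i^2 / N.
assert np.allclose(eigvals[:N], svals**2 / N, atol=1e-8)
# The rows of Vt are the principal components (up to sign).
```

With `full_matrices=False`, the SVD works with matrices of size at most \(N \times D\), avoiding the \(D \times D\) covariance matrix entirely.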
PCA identifies an orthogonal coordinate system aligned with the directions of maximum variance in the data cloud. Geometrically:
The first principal component \(b_1\) points in the direction of greatest spread.
The second principal component \(b_2\) is orthogonal to \(b_1\) and points in the direction of second-greatest spread.
Subsequent components follow similarly, forming a nested sequence of best-fit subspaces.
The \(M\)-dimensional subspace spanned by \(\{b_1, \ldots, b_M\}\) is the best-fitting \(M\)-dimensional linear subspace in the least-squares sense.
—
Let the dataset consist of \(N\) centered samples in \(\mathbb{R}^d\): \[X \in \mathbb{R}^{N \times d}, \qquad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nd} \end{bmatrix}.\] The \(n\)-th data point is the row vector \[x_n = \begin{bmatrix} x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} \in \mathbb{R}^d.\]
The empirical covariance matrix is \[S = \frac{1}{N} X^T X \in \mathbb{R}^{d \times d}.\]
Let \[B \in \mathbb{R}^{d \times m}, \qquad B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \cdots & b_{dm} \end{bmatrix} = \begin{bmatrix} b_1 & b_2 & \cdots & b_m \end{bmatrix},\] where the columns \(b_k \in \mathbb{R}^d\) are orthonormal: \[B^T B = I_m.\]
Each direction satisfies the eigenvalue problem \[S b_k = \lambda_k b_k.\]
The low-dimensional representation is \[Z = X B \in \mathbb{R}^{N \times m},\] \[Z = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1m} \\ z_{21} & z_{22} & \cdots & z_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ z_{N1} & z_{N2} & \cdots & z_{Nm} \end{bmatrix}.\]
Componentwise, \[z_{nk} = \sum_{j=1}^d x_{nj} \, b_{jk} = b_k^T x_n.\]
In particular, the first principal component is \[z_{n1} = b_{11}x_{n1} + b_{21}x_{n2} +\cdots+ b_{d1}x_{nd}.\] Thus the first principal component is a linear combination of all original features.
Let \[z_1 = \begin{bmatrix} z_{11} \\ z_{21} \\ \vdots \\ z_{N1} \end{bmatrix}.\] Then \[\mathrm{Var}(z_1) = \frac{1}{N} \sum_{n=1}^N z_{n1}^2 = \frac{1}{N} \sum_{n=1}^N (b_1^T x_n)^2 = b_1^T S b_1.\] PCA chooses \(b_1\) to maximize this variance: \[\max_{\|b_1\|=1} \; b_1^T S b_1.\]
The reconstructed data is \[\hat{X} = Z B^T = X B B^T.\]
The matrix \[P = B B^T \in \mathbb{R}^{d \times d}\] has components \[P_{ij} = \sum_{k=1}^m b_{ik} b_{jk}.\]
Thus reconstruction componentwise is \[\hat{x}_{ni} = \sum_{j=1}^d x_{nj} P_{ji} = \sum_{j=1}^d x_{nj} \left( \sum_{k=1}^m b_{jk} b_{ik} \right).\]
If \(m < d\), then \[B^T B = I_m, \qquad BB^T \neq I_d.\] Instead, \[(BB^T)^2 = BB^T,\] so \(BB^T\) is a projection matrix onto the \(m\)-dimensional subspace spanned by \(\{b_1,\dots,b_m\}\).
Therefore \[\hat{X} = XBB^T \neq X\] unless \(m=d\), in which case \(B\) is square orthogonal and \[BB^T = I_d.\]
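These properties of \(BB^T\) are easy to verify numerically (a sketch; the orthonormal basis is built with a QR factorization, and the dimensions are my own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 5, 2

# Build an orthonormal d x m basis B via QR of a random matrix.
B, _ = np.linalg.qr(rng.normal(size=(d, m)))
P = B @ B.T

assert np.allclose(B.T @ B, np.eye(m))   # B^T B = I_m
assert not np.allclose(P, np.eye(d))     # but B B^T != I_d when m < d
assert np.allclose(P @ P, P)             # idempotent: P is a projection
```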
PCA performs the transformation \[X_{N \times d} \;\longrightarrow\; Z = X B \; (N \times m),\] where each new coordinate is \[z_{nk} = b_k^T x_n,\] and reconstruction is the orthogonal projection \[\hat{X} = XBB^T.\]
Let us return to our housing price example, now given as a dataset with three columns: \[\begin{bmatrix} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{bmatrix}.\] Here the first two columns are features (size in units of 100 sq ft and number of bedrooms) and the last column is the target variable (price in Rs. lakhs). For PCA, we consider only the feature columns: \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix}.\] Why only the feature columns? In real life we may not know the target variable for every data point. That does not make the dataset useless: we can still use the feature columns to learn about the structure of the data. PCA finds a lower-dimensional representation that captures the most important information, which is useful for visualization, noise reduction, and as a preprocessing step for other machine learning algorithms, where it can reduce overfitting and computational cost. In this example, we will reduce the two features (size and number of bedrooms) to a single feature that captures the most variance in the data. The first step is to center the data by subtracting the mean of each feature from the corresponding feature values. This gives each feature a mean of zero, which matters for PCA because it studies directions of maximum variance rather than absolute values.
The mean of the first feature (size) is \((10 + 20 + 15) / 3 = 15\), and the mean of the second feature (number of bedrooms) is \((2 + 3 + 2) / 3 \approx 2.33\). Centering the data gives: \[\begin{bmatrix} 10 - 15 & 2 - 2.33 \\ 20 - 15 & 3 - 2.33 \\ 15 - 15 & 2 - 2.33 \end{bmatrix}.\] This results in the centered data matrix: \[\begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] Now we can compute the covariance matrix of the centered data, because PCA relies on the covariance structure of the data to identify the principal components: principal components are directions in the feature space along which the data varies the most, and the covariance matrix captures this variance together with the relationships between different features. Geometrically, the covariance matrix describes how the data is spread out in different directions, which is essential for determining the axes of maximum variance that PCA seeks. This is shown graphically below:
This graph shows the centered data points in a 2D space defined by size and number of bedrooms. The points are plotted relative to the origin, which represents the mean of the original data. The spread of these points indicates the variance in the data, which PCA will analyze to find the principal components.
The covariance matrix \(C\) is computed as: \[C = \frac{1}{n} Z^\top Z,\] where \(Z\) is the centered data matrix and \(n\) is the number of data points. In our case, \(n = 3\), so we have: \[C = \frac{1}{3} \begin{bmatrix} -5 & 5 & 0 \\ -0.33 & 0.67 & -0.33 \end{bmatrix} \begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] Calculating this gives: \[C = \frac{1}{3} \begin{bmatrix} (-5)(-5) + (5)(5) + (0)(0) & (-5)(-0.33) + (5)(0.67) + (0)(-0.33) \\ (-0.33)(-5) + (0.67)(5) + (-0.33)(0) & (-0.33)(-0.33) + (0.67)(0.67) + (-0.33)(-0.33) \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 25 + 25 + 0 & 1.65 + 3.35 + 0 \\ 1.65 + 3.35 + 0 & 0.1089 + 0.4489 + 0.1089 \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 50 & 5 \\ 5 & 0.6667 \end{bmatrix} = \begin{bmatrix}16.6667 & 1.6667 \\ 1.6667 & 0.2222\end{bmatrix}.\] Next step is to compute the eigenvalues and eigenvectors of the covariance matrix \(C\). The eigenvalues represent the amount of variance captured by each principal component, while the eigenvectors indicate the directions of these components in the feature space.
At the heart of PCA lies the optimization problem \[\max_{\|v\|=1} \; v^\top \Sigma v,\] where \(\Sigma\) is the covariance matrix of the mean-centered data.
This problem asks for a linear combination of features whose values have maximum variance. Without the unit-length constraint \(\|v\|=1\), the objective would increase arbitrarily by scaling \(v\), so the constraint is essential.
To solve this constrained maximization, we introduce a Lagrange multiplier: \[\mathcal{L}(v,\lambda) = v^\top \Sigma v - \lambda (v^\top v - 1).\]
Taking derivatives with respect to \(v\) and setting them to zero yields \[\nabla_v \mathcal{L} = 2\Sigma v - 2\lambda v = 0,\] which simplifies to \[\Sigma v = \lambda v.\]
This equation is precisely the definition of an eigenvector–eigenvalue pair. Thus, eigenvectors are not assumed in PCA; they emerge inevitably as the stationary points of the variance maximization problem.
For any unit vector \(v\), \[v^\top \Sigma v\] equals the variance of the one-dimensional projection \(z = x^\top v\).
When \(v\) is an eigenvector of \(\Sigma\) with eigenvalue \(\lambda\), this quantity becomes \[v^\top \Sigma v = \lambda.\]
Hence, the eigenvalue directly measures the variance captured by its corresponding eigenvector. Selecting the eigenvector with the largest eigenvalue therefore yields the direction of maximum variance, called the first principal component.
After the first component is chosen, PCA seeks additional components that:
capture as much remaining variance as possible,
do not duplicate information already captured.
This leads naturally to the requirement that subsequent vectors be orthogonal to earlier ones. The eigenvectors of a symmetric matrix such as \(\Sigma\) are automatically orthogonal, so they provide an ordered set of non-redundant directions.
Eigenvectors define weight patterns across features.
Eigenvalues quantify how much variance each pattern explains.
Keeping the top \(k\) eigenvectors yields the best \(k\)-dimensional linear approximation of the data.
Eigenvectors should not be interpreted as causes or predictors. They represent optimal coordinates in which the structure of the data is most economically expressed.
Eigenvectors and eigenvalues appear in PCA because maximizing variance under a normalization constraint mathematically forces the eigenvalue equation; they quantify optimal directions and the amount of variation captured along them.
To find the eigenvalues \(\lambda\) of the covariance matrix \(C\), we solve the characteristic equation: \[\det(C - \lambda I) = 0,\] where \(I\) is the identity matrix. Thus, we have: \[\det\begin{bmatrix}16.6667 - \lambda & 1.6667 \\ 1.6667 & 0.2222 - \lambda\end{bmatrix} = 0.\] Calculating the determinant gives: \[(16.6667 - \lambda)(0.2222 - \lambda) - (1.6667)(1.6667) = 0.\] Expanding this, we get: \[\lambda^2 - 16.8889\lambda + (16.6667 \cdot 0.2222 - 2.7778) = 0,\] \[\lambda^2 - 16.8889\lambda + 0.9259 = 0.\] Using the quadratic formula: \[\lambda = \frac{16.8889 \pm \sqrt{(16.8889)^2 - 4 \cdot 1 \cdot 0.9259}}{2}.\] Calculating the discriminant: \[(16.8889)^2 - 4 \cdot 0.9259 \approx 285.2349 - 3.7037 = 281.5312,\] whose square root is approximately \(16.7789\). Thus, \[\lambda \approx \frac{16.8889 \pm 16.7789}{2}.\] This gives us two eigenvalues: \[\lambda_1 \approx 16.8339, \quad \lambda_2 \approx 0.0550.\] As a check, \(\lambda_1 + \lambda_2 \approx 16.8889 = \operatorname{tr}(C)\) and \(\lambda_1 \lambda_2 \approx 0.9259 = \det(C)\). Next, we find the eigenvectors corresponding to each eigenvalue by solving the equation: \[(C - \lambda I)v = 0.\] For \(\lambda_1 \approx 16.8339\): \[\begin{bmatrix}16.6667 - 16.8339 & 1.6667 \\ 1.6667 & 0.2222 - 16.8339\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}-0.1672 & 1.6667 \\ 1.6667 & -16.6117\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] The first row gives \(v_2 \approx 0.1003\, v_1\); normalizing to unit length yields the eigenvector: \[v_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] For \(\lambda_2 \approx 0.0550\): \[\begin{bmatrix}16.6667 - 0.0550 & 1.6667 \\ 1.6667 & 0.2222 - 0.0550\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}16.6117 & 1.6667 \\ 1.6667 & 0.1672\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] The first row gives \(v_2 \approx -9.967\, v_1\); normalizing yields the eigenvector: \[v_2 \approx \begin{bmatrix}-0.0998 \\ 0.9950\end{bmatrix}.\] The eigenvector corresponding to the largest eigenvalue \(\lambda_1\) represents the direction of the first principal component, which captures the most variance in the data.
In this case, the first principal component is approximately: \[PC_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] To reduce the dimensionality of the data from two features to one, we project the centered data onto the first principal component. This is done by taking the dot product of each centered data point with the first principal component vector.
The projection of a centered data point \(z_i\) onto the first principal component \(PC_1\) is given by: \[p_i = z_i^\top PC_1.\] Calculating the projections for each centered data point: \[\begin{aligned} p_1 &= \begin{bmatrix}-5 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx -5 \cdot 0.9950 + (-0.33) \cdot 0.0998 \approx -4.9750 - 0.0329 \approx -5.0079, \\ p_2 &= \begin{bmatrix}5 & 0.67\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx 5 \cdot 0.9950 + 0.67 \cdot 0.0998 \approx 4.9750 + 0.0669 \approx 5.0419, \\ p_3 &= \begin{bmatrix}0 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx 0 \cdot 0.9950 + (-0.33) \cdot 0.0998 \approx -0.0329. \end{aligned}\] Thus, the one-dimensional representation of the data after PCA is approximately: \[\begin{bmatrix}-5.008 \\ 5.042 \\ -0.033\end{bmatrix}.\] This reduced representation captures the most significant variance in the original two-dimensional feature space, allowing us to analyze and visualize the data more effectively while reducing complexity. Let us find the variance of this one-dimensional representation: \[\mathrm{Var}(p) = \frac{1}{3} \sum_{i=1}^3 p_i^2 \approx \frac{1}{3}((-5.008)^2 + (5.042)^2 + (-0.033)^2) \approx \frac{1}{3}(25.080 + 25.422 + 0.0011) \approx 16.834.\] This variance agrees with the largest eigenvalue \(\lambda_1 \approx 16.8339\), confirming that the first principal component captures the majority of the variance in the data. The second principal component captures the remaining variance, which is much smaller in this case, indicating that the data is primarily spread along the direction of the first principal component.
See the calculation below, projecting each centered data point onto \(v_2\): \[\begin{aligned} q_1 &= (-5)(-0.0998) + (-0.33)(0.9950) \approx 0.4990 - 0.3284 \approx 0.1707, \\ q_2 &= (5)(-0.0998) + (0.67)(0.9950) \approx -0.4990 + 0.6667 \approx 0.1677, \\ q_3 &= (0)(-0.0998) + (-0.33)(0.9950) \approx -0.3284. \end{aligned}\] \[\mathrm{Var}(q) = \frac{1}{3} \sum_{i=1}^3 q_i^2 \approx \frac{1}{3}(0.0291 + 0.0281 + 0.1078) \approx 0.055.\] This variance agrees with the second eigenvalue \(\lambda_2 \approx 0.0550\), confirming that the second principal component captures the remaining variance in the data. The first principal component is the most informative direction in the feature space, while the second principal component captures a much smaller amount of variance, indicating that it may be less important for understanding the structure of the data. This example illustrates how PCA identifies the directions of maximum variance in the data and allows us to reduce the dimensionality while retaining as much information as possible. By projecting the data onto the first principal component, we can analyze the most significant patterns in the data with a simpler representation.
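As an independent check on the hand arithmetic in this example, the full pipeline can be rerun with `numpy` (a sketch; only the two feature columns of the housing data are used, and variable names are my own):

```python
import numpy as np

# Feature columns of the housing example: size (100 sq ft), bedrooms.
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])

# Step 1: mean-center the data.
Z = X - X.mean(axis=0)

# Step 2: covariance matrix (1/n convention, matching the notes).
C = Z.T @ Z / len(Z)

# Step 3: eigendecomposition (eigh is for symmetric matrices and
# returns eigenvalues in ascending order, so sort descending).
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the first principal component.
p = Z @ eigvecs[:, 0]

print(eigvals)   # eigenvalues of the covariance matrix
print(p.var())   # variance of the scores equals the top eigenvalue
```

Because floating-point arithmetic avoids the rounding done by hand above, the printed values are slightly more precise than the worked figures; the sign of the principal component (and hence of the scores) is arbitrary.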