Mathematics for Machine Learning
Course Notes
Suraj Singh Khurana
Spring 2026
Suppose a real-estate website wants to predict house prices based on features like size and number of bedrooms. To do anything, we need some data:
| Size (\(x_1\)) | Bedrooms (\(x_2\)) | Price (Rs. lakhs) |
|---|---|---|
| 10 | 2 | 70 |
| 20 | 3 | 130 |
Here \[\begin{aligned} x_1 &= \text{house size (in 100 sq ft)}, \\ x_2 &= \text{number of bedrooms}, \\ w_1 &= \text{price contribution per unit size (Rs. lakhs per 100 sq ft)}, \\ w_2 &= \text{price contribution per bedroom (Rs. lakhs)}. \end{aligned}\]
Now we want to build a model that predicts the price of a house based on its size and number of bedrooms. We assume that the price contributions from size and number of bedrooms are linear and additive.
Thus the linear model for the price (in Rs. lakhs) is \[\hat y = w_1 x_1 + w_2 x_2,\] where \(\hat y\) denotes the predicted price. Here \(\hat y\) is a linear combination of the input features \(x_1\) and \(x_2\) with the weights \(w_1\) and \(w_2\) representing the model parameters. The word feature refers to an individual measurable property or characteristic of the data being analyzed. In this case, the features are the size of the house and the number of bedrooms. To find the best-fitting weights \(w_1\) and \(w_2\), we can set up a system of equations based on the provided data points. From the data, we have two equations: \[\begin{aligned} 70 &= w_1 \cdot 10 + w_2 \cdot 2, \\ 130 &= w_1 \cdot 20 + w_2 \cdot 3. \end{aligned}\] We can solve this system of equations to find the values of \(w_1\) and \(w_2\). Rearranging the equations, we get: \[\begin{aligned} 10 w_1 + 2 w_2 &= 70, \\ 20 w_1 + 3 w_2 &= 130. \end{aligned}\] We can solve this system using methods such as substitution, elimination, or matrix methods. Substitution or elimination would be straightforward here, but for larger systems, matrix methods are more efficient. Let us therefore express the system in matrix form: \[\begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}70 \\ 130\end{bmatrix}.\] Let \(A = \begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix}\), \(\mathbf{w} = \begin{bmatrix}w_1 \\ w_2\end{bmatrix}\), and \(\mathbf{b} = \begin{bmatrix}70 \\ 130\end{bmatrix}\). Then we have \(A \mathbf{w} = \mathbf{b}\). To solve for \(\mathbf{w}\), we can use the inverse of matrix \(A\) (if it exists) or use other methods like Gaussian elimination. \[\mathbf{w} = A^{-1} \mathbf{b}.\]
Calculating the inverse of \(A\): \[\begin{aligned} A^{-1} &= \frac{1}{(10)(3) - (2)(20)} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \frac{1}{30 - 40} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= -\frac{1}{10} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \begin{bmatrix}-\dfrac{3}{10} & \dfrac{2}{10} \\ \dfrac{20}{10} & -\dfrac{10}{10}\end{bmatrix} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix}. \end{aligned}\]
Now, multiplying \(A^{-1}\) with \(\mathbf{b}\): \[\begin{aligned} \mathbf{w} &= A^{-1} \mathbf{b} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix} \begin{bmatrix}70 \\ 130\end{bmatrix} \\ &= \begin{bmatrix}-0.3 \cdot 70 + 0.2 \cdot 130 \\ 2 \cdot 70 - 1 \cdot 130\end{bmatrix} \\ &= \begin{bmatrix}-21 + 26 \\ 140 - 130\end{bmatrix} \\ &= \begin{bmatrix}5 \\ 10\end{bmatrix}. \end{aligned}\]
Thus, we find that \(w_1 = 5\) and \(w_2 = 10\). Therefore, the linear model for predicting house prices is: \[\hat y = 5 x_1 + 10 x_2.\] This means that for every additional 100 sq ft of house size, the price increases by Rs.5 lakhs, and for every additional bedroom, the price increases by Rs.10 lakhs.
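The solution above can be double-checked numerically. Here is a minimal sketch using NumPy (not part of the original derivation); note that `np.linalg.solve` factorizes \(A\) rather than forming \(A^{-1}\) explicitly:

```python
import numpy as np

# Coefficient matrix and price vector from the two-house system
A = np.array([[10.0, 2.0],
              [20.0, 3.0]])
b = np.array([70.0, 130.0])

# Solve A w = b directly (no explicit inverse needed)
w = np.linalg.solve(A, b)
print(w)  # [ 5. 10.]
```

This reproduces \(w_1 = 5\) and \(w_2 = 10\).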
We should see the formulation below \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 70 \\ 130 \end{bmatrix}\] abstractly as \[X\mathbf{w}=\mathbf{y}.\] Machine learning systems only ever see this matrix form. Now suppose we get a third house with size 15 (i.e., 1500 sq ft) and 2 bedrooms. In other words, our data table now looks like
| Size (\(x_1\)) | Bedrooms (\(x_2\)) | Price (Rs. lakhs) |
|---|---|---|
| 10 | 2 | 70 |
| 20 | 3 | 130 |
| 15 | 2 | 100 |
We can no longer solve for \(w_1\) and \(w_2\) exactly, since we now have three equations but only two unknowns. In fact, no choice of weights satisfies all three equations simultaneously. If you do not believe me, try Gaussian elimination. We get the following augmented matrix: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{array}\right]\] Applying the row operations \(R_2 \to R_2 - 2R_1\) and \(R_3 \to R_3 - \tfrac{3}{2}R_1\), followed by \(R_3 \to R_3 - R_2\), we obtain the row echelon form: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 0 & -1 & -10 \\ 0 & 0 & 5 \end{array}\right]\] The last row corresponds to the equation \(0 = 5\), a contradiction, which shows the system is inconsistent: no pair \((w_1, w_2)\) satisfies all three equations. Or, if you want to be more mathematical about it, the rank of the coefficient matrix is 2 while the rank of the augmented matrix is 3; since the two ranks differ, the system has no solution.
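The rank argument can be confirmed numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Rank of the coefficient matrix vs. rank of the augmented matrix [X | y]
rank_X = np.linalg.matrix_rank(X)
rank_aug = np.linalg.matrix_rank(np.column_stack([X, y]))

print(rank_X, rank_aug)  # 2 3 -> the ranks differ, so X w = y has no exact solution
```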
But does it mean we cannot find a good model? Not really. We can try to find weights \(w_1\) and \(w_2\) that minimize the error between the predicted prices and the actual prices. Since we do not have an exact solution of \(X\mathbf{w}=\mathbf{y}\), we can try to find an approximate solution, that is, \(X\mathbf{w} \approx \mathbf{y}\). Old question (math): "Find \(\mathbf w\) such that \(X\mathbf w=\mathbf y\)."
New question (ML): "Find \(\mathbf w\) such that \(X\mathbf w\approx\mathbf y\) as closely as possible."
Define the error for the \(i\)-th house as \[e_i=\hat y_i-y_i, \qquad \hat y_i = w_1 x_{i1}+w_2 x_{i2}.\] One common approach is to consider the following optimization problem: \[\min_{w_1, w_2} \sum_{i=1}^{3} e_i^2 = \min_{w_1, w_2} \sum_{i=1}^{3} (y_i - \hat{y}_i)^2,\] where \(y_i\) is the actual price of the \(i\)-th house and \(\hat{y}_i\) is the predicted price from our linear model. This approach is known as the least squares method.
More formally,
In many real-world problems, such as house price prediction, we attempt to model the relationship between several input variables (features) and an observed output using a linear rule. While simple linear systems may admit exact solutions, real data is noisy, inconsistent, and overdetermined. Linear regression provides a principled mathematical framework to handle such situations.
Let \(x_i \in \mathbb{R}^d\) denote the feature vector associated with the \(i\)-th data point (for example, house size, number of bedrooms, age, etc.), and let \(y_i \in \mathbb{R}\) denote the corresponding observed output (house price). We assume a linear relationship between inputs and output of the form \[\hat{y}_i = w^\top x_i,\] where \(w \in \mathbb{R}^d\) is an unknown parameter vector to be learned from data. This assumption means that the output is linear in the parameters \(w\), which is the defining feature of linear regression.
Given \(n\) data points, we stack the feature vectors row-wise into a design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n.\] The model predictions for all data points can then be written compactly as \[\hat{y} = Xw.\] In idealized situations, one might hope for an exact solution to the equation \(Xw = y\). However, when \(n > d\) and the data is noisy, such an equation is typically inconsistent, as demonstrated earlier using rank and Gaussian elimination.
The discrepancy between the model predictions and the observed outputs is captured by the residual vector \[r = Xw - y.\] Each component \(r_i = w^\top x_i - y_i\) represents the prediction error on a single data point. Since exact satisfaction of all equations is generally impossible, the goal is to find a parameter vector \(w\) that makes these residuals collectively as small as possible.
To quantify the total error, we introduce a scalar loss function defined as the squared Euclidean norm of the residual vector: \[\mathcal{L}(w) = \|Xw - y\|^2 = \sum_{i=1}^n (w^\top x_i - y_i)^2.\] This choice, known as the least squares loss, is motivated by several factors: it penalizes large errors more strongly, is differentiable, leads to a convex optimization problem, and admits a clear geometric interpretation.
Linear regression is thus formulated as the following optimization problem: \[\min_{w \in \mathbb{R}^d} \; \|Xw - y\|^2.\] This problem seeks the parameter vector whose predictions are closest to the observed data in the least squares sense. Importantly, this is no longer a problem of solving linear equations exactly, but of minimizing an error function.
It is important to clarify how the vectors involved in the linear model are represented and why the expression \(w^\top x_i\) is mathematically valid.
Each feature vector \(x_i\) is represented as a column vector in \(\mathbb{R}^d\): \[x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \in \mathbb{R}^{d \times 1},\] where each component corresponds to a measurable attribute of the data point (for example, house size, number of bedrooms, etc.).
Similarly, the parameter vector \(w\) is also represented as a column vector: \[w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d \times 1}.\]
Since both \(w\) and \(x_i\) are column vectors, their direct product \(wx_i\) is not defined. To obtain a scalar prediction, we take the transpose of \(w\), yielding a row vector: \[w^\top = \begin{bmatrix} w_1 & w_2 & \cdots & w_d \end{bmatrix} \in \mathbb{R}^{1 \times d}.\]
The prediction for the \(i\)-th data point is then given by the matrix product \[\hat{y}_i = w^\top x_i,\] where \[(1 \times d)(d \times 1) = 1 \times 1.\] Thus, \(w^\top x_i\) is a scalar, as required to model a real-valued output such as house price.
Explicitly, this dot product can be written as \[w^\top x_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id},\] which shows that the linear model is a weighted sum of the input features.
When stacking all feature vectors into the design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d},\] each row of \(X\) corresponds to a single data point written as a transposed feature vector. The matrix-vector product \[Xw\] then yields an \(n \times 1\) vector, whose \(i\)-th entry is precisely \(w^\top x_i\). This ensures dimensional consistency between the individual prediction formula and the global matrix formulation.
This careful handling of vector orientation is not merely a notational choice; it is essential for extending linear regression to high-dimensional settings and for implementing learning algorithms correctly in numerical software.
To find the minimizer, we differentiate the loss function with respect to \(w\), applying the expansion of the squared norm and standard rules of matrix calculus. Let us derive this from scratch in detail: \[\begin{aligned} \frac{\partial}{\partial w} \|Xw - y\|^2 &= \frac{\partial}{\partial w} (Xw - y)^\top (Xw - y) \\ &= \frac{\partial}{\partial w} \left( w^\top X^\top X w - 2 y^\top X w + y^\top y \right) \\ &= 2 X^\top X w - 2 X^\top y. \end{aligned}\] Why do these steps hold? Because we use the following rules of matrix calculus:
For a vector \(a\), \(\frac{\partial}{\partial w} (w^\top a) = a\). To see this, write \(w^\top a = \sum_j w_j a_j\); then \(\frac{\partial}{\partial w_i} (w^\top a) = a_i\) for each component \(i\), which gives \(\frac{\partial}{\partial w} (w^\top a) = a\).
For a matrix \(A\), \(\frac{\partial}{\partial w} (w^\top A w) = (A + A^\top) w\). If \(A\) is symmetric, this simplifies to \(2 A w\).
For a symmetric matrix \(A\), we have \(\frac{\partial}{\partial w} (w^\top A w) = 2 A w\). To see this, expand \(w^\top A w = \sum_i \sum_j w_i A_{ij} w_j\). Then \(\frac{\partial}{\partial w_k}(w^\top A w) = \sum_j A_{kj} w_j + \sum_i w_i A_{ik} = \sum_j (A_{kj} + A_{jk}) w_j\). Since \(A\) is symmetric, \(A_{jk} = A_{kj}\), so this becomes \(2 \sum_j A_{kj} w_j = 2(Aw)_k\), giving \(\frac{\partial}{\partial w}(w^\top A w) = 2Aw\).
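These identities, and the resulting gradient \(2 X^\top X w - 2 X^\top y\), can be sanity-checked against a finite-difference approximation of the loss; a minimal sketch on random illustrative data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # random data, purely for illustration
y = rng.normal(size=5)
w = rng.normal(size=3)

def loss(v):
    r = X @ v - y
    return r @ r  # ||Xv - y||^2

# Closed-form gradient from the derivation above
grad = 2 * X.T @ X @ w - 2 * X.T @ y

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros_like(w)
for k in range(w.size):
    e = np.zeros_like(w)
    e[k] = eps
    grad_fd[k] = (loss(w + e) - loss(w - e)) / (2 * eps)

print(np.max(np.abs(grad - grad_fd)))  # tiny: the two gradients agree
```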
Setting the gradient to zero yields the normal equations: \[X^\top X w = X^\top y.\]
These equations form a system of \(d\) linear equations in \(d\) unknowns. Unlike the original system \(Xw = y\), the normal equations are always consistent: \(X^\top y\) lies in the column space of \(X^\top\), which coincides with the column space of \(X^\top X\).
If the matrix \(X^\top X\) is invertible, the solution to the normal equations is given by \[w = (X^\top X)^{-1} X^\top y.\] This expression provides a closed-form formula for the linear regression estimator. It characterizes the theoretical solution but is not always used directly in practice due to computational and numerical stability considerations.
From a geometric perspective, the columns of \(X\) span a subspace of \(\mathbb{R}^n\), known as the column space of \(X\). The vector \(Xw\) represents a point in this subspace. Linear regression finds the vector \(Xw\) that is closest to \(y\) in Euclidean distance. Consequently, the residual vector \(r = Xw - y\) is orthogonal to the column space of \(X\), which explains the condition \[X^\top r = 0.\] Thus, linear regression can be interpreted as the orthogonal projection of \(y\) onto the column space of \(X\).
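The orthogonality condition \(X^\top r = 0\) can be verified on the three-house data; a minimal sketch, assuming NumPy (`np.linalg.lstsq` computes the least squares solution):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Least squares solution: minimizes ||Xw - y||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)

r = X @ w - y
print(X.T @ r)  # numerically zero: the residual is orthogonal to col(X)
```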
Linear regression qualifies as a machine learning model because its parameters are inferred from data rather than specified explicitly, the data typically contains noise and inconsistencies, and the learned model is used to make predictions on unseen data. The process involves empirical risk minimization rather than exact algebraic solution.
In modern machine learning applications, the number of data points \(n\) may be in the millions and the number of features \(d\) may be in the thousands. In such settings, forming and inverting \(X^\top X\) is computationally expensive and numerically unstable. Consequently, linear regression is commonly solved using numerical optimization methods such as gradient descent, stochastic gradient descent, or iterative linear solvers. These methods approximate the solution to the normal equations efficiently at scale.
Linear regression transforms an inconsistent, overdetermined system of linear equations into a well-posed optimization problem. It replaces the impossible requirement of satisfying all equations exactly with the achievable goal of minimizing overall error. In this sense, linear regression is best understood as numerical linear algebra applied to real-world data.
Now let us return to our housing price example and compute the optimal weights using the least squares approach. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] The normal equations are \[X^\top X w = X^\top y.\] Calculating \(X^\top X\) and \(X^\top y\): \[X^\top X = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}10 & 2 \\ 20 & 3 \\ 15 & 2\end{bmatrix} = \begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix},\] and \[X^\top y = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}70 \\ 130 \\ 100\end{bmatrix} = \begin{bmatrix} 10(70) + 20(130) + 15(100) \\ 2(70) + 3(130) + 2(100) \end{bmatrix}\] \[= \begin{bmatrix} 700 + 2600 + 1500 \\ 140 + 390 + 200 \end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\] Thus, the normal equations become \[\begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\] Solving via the determinant \(\Delta = 725 \cdot 17 - 110^2 = 12325 - 12100 = 225\): \[\begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \frac{1}{225}\begin{bmatrix}17 & -110 \\ -110 & 725\end{bmatrix} \begin{bmatrix}4800 \\ 730\end{bmatrix}\] \[= \frac{1}{225}\begin{bmatrix} 81600 - 80300 \\ -528000 + 529250 \end{bmatrix} = \frac{1}{225}\begin{bmatrix}1300 \\ 1250\end{bmatrix} = \begin{bmatrix}5.78 \\ 5.56\end{bmatrix}.\] Thus \(w_1 \approx 5.78\) and \(w_2 \approx 5.56\), representing the best-fit weights that minimize the squared error across all three data points. How come we got different weights than before? Because now we are minimizing the squared error across all three data points, rather than trying to fit exactly two points. Earlier weights were 5 and 10, which perfectly fit the first two points but did not account for the third point. 
Now, with the third point, the weights became approximately 5.78 and 5.56, which balance the errors across all three points. Let us check the predictions: \[\hat{y}_1 = 5.78 \cdot 10 + 5.56 \cdot 2 = 57.8 + 11.12 = 68.92,\] \[\hat{y}_2 = 5.78 \cdot 20 + 5.56 \cdot 3 = 115.6 + 16.68 = 132.28,\] \[\hat{y}_3 = 5.78 \cdot 15 + 5.56 \cdot 2 = 86.7 + 11.12 = 97.82.\] No single price is matched exactly, but each prediction is close to the observed value: the squared errors are spread across the three data points rather than concentrated on one of them.
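The hand computation can be reproduced by solving the normal equations numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)      # approximately [5.778, 5.556]
print(X @ w)  # predictions: approximately [68.9, 132.2, 97.8]
```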
In practice, computing matrix inverses is computationally expensive and numerically unstable. Instead, we use gradient descent to minimize \(\mathcal{L}(w) = \|Xw - y\|^2\) iteratively, starting from an initial guess \(w^{(0)} = 0\): \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}),\] where \(\eta\) is the learning rate. The gradient \(\nabla \mathcal{L}(w)\) is given by \[\nabla \mathcal{L}(w) = 2 X^\top (Xw - y).\] By repeatedly applying this update rule, we can converge to the optimal weights without explicitly computing the matrix inverse. Why does this work? Gradient descent is an optimization algorithm that iteratively moves towards the minimum of a function by taking steps proportional to the negative of the gradient. In the case of linear regression, the loss function is convex, ensuring that gradient descent will converge to the global minimum.
First recall what convexity is. A function \(f: \mathbb{R}^d \to \mathbb{R}\) is convex if for any two points \(x, y \in \mathbb{R}^d\) and any \(\theta \in [0, 1]\), \[f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).\] Geometrically, this means that the line segment connecting any two points on the graph of the function lies above or on the graph itself. For a convex function, every local minimum is a global minimum, which makes optimization problems easier to solve. Why? Suppose \(x\) is a local minimum and some point \(y\) satisfies \(f(y) < f(x)\). By convexity, points on the segment from \(x\) to \(y\) arbitrarily close to \(x\) have value at most \(\theta f(x) + (1 - \theta) f(y) < f(x)\), contradicting the local minimality of \(x\).
Why is the loss function convex, ensuring that gradient descent will converge to the global minimum? To prove that the loss function \(\mathcal{L}(w) = \|Xw - y\|^2\) is convex, we can show that its Hessian matrix is positive semidefinite. The Hessian matrix of \(\mathcal{L}(w)\) is given by \[H = \nabla^2 \mathcal{L}(w) = 2 X^\top X.\] To show that \(H\) is positive semidefinite, we need to show that for any vector \(z \in \mathbb{R}^d\), \[z^\top H z \geq 0.\] Calculating \(z^\top H z\): \[z^\top H z = z^\top (2 X^\top X) z = 2 (X z)^\top (X z) = 2 \|X z\|^2 \geq 0.\] Since \(\|X z\|^2\) is always non-negative, \(H\) is positive semidefinite, and therefore the loss function \(\mathcal{L}(w)\) is convex.
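Positive semidefiniteness of \(H = 2X^\top X\) can also be checked numerically via its eigenvalues; a minimal sketch, assuming NumPy:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])

H = 2 * X.T @ X  # Hessian of the least squares loss

# H is symmetric, so eigvalsh applies and returns real eigenvalues
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)  # both non-negative, so H is positive semidefinite
```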
We need to justify why this gradient descent is a good idea. Gradient descent is a good idea for several reasons:
Scalability: Gradient descent can handle large datasets and high-dimensional feature spaces efficiently, as it does not require storing or inverting large matrices.
Flexibility: It can be easily adapted to different loss functions and regularization techniques.
Convergence Guarantees: For convex loss functions, gradient descent is guaranteed to converge to the global minimum, provided an appropriate learning rate is chosen.
Simplicity: The algorithm is straightforward to implement and understand, making it accessible for practitioners.
Why is computing the inverse of a matrix computationally expensive and numerically unstable? Matrix inversion algorithms typically have time complexity \(O(d^3)\) for a \(d \times d\) matrix, which becomes prohibitive for high-dimensional data. Additionally, matrix inversion can be numerically unstable if the matrix is ill-conditioned (i.e., has a high condition number): small perturbations in the input data can result in large changes in the computed inverse, making it unreliable for practical applications. In contrast, gradient descent avoids these issues by iteratively refining the solution without requiring matrix inversion, making it more robust and efficient for large-scale problems.

So let us now carry out gradient descent on our housing price prediction problem, updating the weights until convergence by calculating the gradient at each step by hand. Starting with an initial guess \(w^{(0)} = \begin{bmatrix}0 \\ 0\end{bmatrix}\) and a learning rate \(\eta = 0.00001\), we compute the gradient at each step: \[\nabla \mathcal{L}(w^{(k)}) = 2 X^\top (X w^{(k)} - y).\] We then update the weights: \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}).\] Repeating this process gives the following sequence of weight updates: \[\begin{aligned} w^{(0)} &= \begin{bmatrix}0 \\ 0\end{bmatrix}, \\ w^{(1)} &\approx \begin{bmatrix}0.096 \\ 0.015\end{bmatrix}, \\ w^{(2)} &\approx \begin{bmatrix}0.187 \\ 0.028\end{bmatrix}, \\ w^{(3)} &\approx \begin{bmatrix}0.273 \\ 0.040\end{bmatrix}, \\ &\vdots \end{aligned}\] After sufficiently many iterations, the weights approach the closed-form solution \(\begin{bmatrix}5.78 \\ 5.56\end{bmatrix}\) computed earlier. (Convergence is slow at this learning rate because \(X^\top X\) is ill-conditioned, which is one reason why choosing \(\eta\) well matters in practice.) This demonstrates the effectiveness of gradient descent in finding the optimal weights for linear regression.
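The iteration described above can be sketched in code. A minimal implementation, assuming NumPy; a larger learning rate than in the hand computation is used here (still within the stable range) so that the loop converges in fewer iterations:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

eta = 1e-3       # learning rate; must satisfy eta < 1 / lambda_max(X^T X) for stability
w = np.zeros(2)  # initial guess w^(0) = 0

for _ in range(50_000):
    grad = 2 * X.T @ (X @ w - y)  # gradient of ||Xw - y||^2
    w -= eta * grad

print(np.round(w, 3))  # converges to the closed-form solution [5.778 5.556]
```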
Small practical details, such as how to choose the learning rate and when to stop the iterations (convergence criteria), are important aspects of implementing gradient descent effectively. See the summary below:
Learning Rate: A small learning rate may lead to slow convergence, while a large learning rate may cause divergence. Techniques like learning rate schedules or adaptive learning rates (e.g., Adam, RMSprop) can help.
Convergence Criteria: Common criteria include a maximum number of iterations, a threshold on the change in loss between iterations, or a threshold on the norm of the gradient.
Batch Size: In practice, stochastic gradient descent (SGD) or mini-batch gradient descent is often used, where gradients are computed on subsets of the data to improve efficiency and convergence.
Regularization: Techniques like L2 regularization (Ridge regression) or L1 regularization (Lasso) can be incorporated to prevent overfitting and improve generalization.
We will learn these small things in detail in later lectures. But for now, let us completely change our perspective and derive linear regression from a probabilistic viewpoint. For this, let us recall our housing price prediction problem. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] We will now assume a probabilistic model for the data generation process. Specifically, we assume that the observed prices are generated according to the linear model plus Gaussian noise: \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) are independent and identically distributed Gaussian noise terms with mean zero and variance \(\sigma^2\). This assumption reflects the idea that the true relationship between features and prices is linear, but there are random fluctuations due to unobserved factors. Recall the probability basics below:
Before diving into the probabilistic interpretation of linear regression, it is essential to review foundational concepts in probability theory.
The sample space \(\Omega\) is the set of all possible outcomes of a random experiment. An event is a subset of the sample space. For example, when rolling a fair die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\), and the event "rolling an even number" is \(\{2, 4, 6\}\).
The probability of an event \(A\), denoted \(P(A)\), is a number in \([0, 1]\) satisfying:
\(P(\Omega) = 1\) (the sample space has probability 1).
\(P(\emptyset) = 0\) (the empty event has probability 0).
For mutually exclusive events \(A_1, A_2, \ldots\), we have \(P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots\).
A random variable \(X\) is a function that assigns a numerical value to each outcome in the sample space. Random variables can be discrete (taking values in a countable set) or continuous (taking values in an interval or uncountable set). For example, the outcome of a die roll is a discrete random variable, while the height of a person is continuous.
A probability distribution describes how probability is distributed over the possible values of a random variable.
For a discrete random variable \(X\) taking values \(x_1, x_2, \ldots\), the probability mass function (PMF) is \(p(x_i) = P(X = x_i)\). The probabilities satisfy \(\sum_i p(x_i) = 1\).
For a continuous random variable \(X\), the probability density function (PDF) is \(f(x)\), which satisfies \(\int_{-\infty}^{\infty} f(x) \, dx = 1\). The probability that \(X\) lies in an interval \([a, b]\) is \[P(a \leq X \leq b) = \int_a^b f(x) \, dx.\]
The cumulative distribution function (CDF) is defined as \[F(x) = P(X \leq x).\] For a continuous random variable with PDF \(f(x)\), we have \(F(x) = \int_{-\infty}^x f(t) \, dt\), and the PDF is the derivative of the CDF: \(f(x) = \frac{dF(x)}{dx}\).
The expected value (or mean) of a random variable \(X\) is a measure of its central tendency.
For a discrete random variable, \[E[X] = \sum_i x_i \cdot p(x_i).\]
For a continuous random variable, \[E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx.\]
The expected value is linear: \(E[aX + b] = aE[X] + b\) for constants \(a\) and \(b\).
The variance of a random variable \(X\) measures the spread of its distribution around the mean: \[\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.\]
The standard deviation is \(\sigma = \sqrt{\text{Var}(X)}\).
Properties of variance include:
\(\text{Var}(aX + b) = a^2 \text{Var}(X)\) for constants \(a\) and \(b\).
\(\text{Var}(X) \geq 0\), with equality if and only if \(X\) is constant almost surely.
The Gaussian or normal distribution is one of the most important distributions in probability and statistics. A continuous random variable \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), denoted \(X \sim \mathcal{N}(\mu, \sigma^2)\), if its PDF is \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).\]
Key properties of the normal distribution include:
The PDF is symmetric about \(\mu\) and bell-shaped.
The mean, median, and mode are all equal to \(\mu\).
Approximately 68% of the probability mass lies within one standard deviation of the mean: \(P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.68\).
Approximately 95% lies within two standard deviations: \(P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.95\).
Approximately 99.7% lies within three standard deviations.
The standard normal distribution has mean \(\mu = 0\) and variance \(\sigma^2 = 1\), often denoted \(\mathcal{N}(0, 1)\).
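The 68% rule can be recovered by numerically integrating the standard normal PDF over \([-1, 1]\); a minimal sketch, assuming NumPy:

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian density from the formula above
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Riemann-sum approximation of P(-sigma <= X <= sigma) for X ~ N(0, 1)
x = np.linspace(-1.0, 1.0, 200_001)
dx = x[1] - x[0]
mass = np.sum(normal_pdf(x)) * dx
print(round(mass, 3))  # 0.683
```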
When dealing with multiple random variables, we use the multivariate normal distribution. A random vector \(\mathbf{x} \in \mathbb{R}^d\) follows a multivariate normal distribution with mean vector \(\boldsymbol{\mu} \in \mathbb{R}^d\) and covariance matrix \(\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}\), denoted \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), if its PDF is \[f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right),\] where \(|\boldsymbol{\Sigma}|\) denotes the determinant of \(\boldsymbol{\Sigma}\).
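The multivariate density formula can be implemented directly. A minimal sketch, assuming NumPy, which also confirms that for a diagonal covariance matrix the joint density factors into a product of univariate normal densities:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.array([0.5, -1.0])
mu = np.zeros(2)
Sigma = np.diag([1.0, 4.0])  # diagonal covariance: components are independent

joint = mvn_pdf(x, mu, Sigma)
product = normal_pdf(0.5, 0.0, 1.0) * normal_pdf(-1.0, 0.0, 4.0)
print(np.isclose(joint, product))  # True
```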
Two random variables \(X\) and \(Y\) are independent if knowledge of one provides no information about the other. Formally, they are independent if \[P(X = x \text{ and } Y = y) = P(X = x) \cdot P(Y = y)\] for all values \(x\) and \(y\) (in the discrete case), or if their joint PDF factors as \(f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)\) (in the continuous case).
The conditional probability of event \(A\) given event \(B\) is \[P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)},\] provided \(P(B) > 0\).
Bayes' theorem relates conditional probabilities: \[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.\] This theorem is fundamental in machine learning and statistical inference, allowing us to update beliefs (posterior) based on new evidence (likelihood) and prior knowledge (prior).
In the context of parameter estimation, the likelihood is the probability of observing the data given a particular value of the parameter. If we have data \(\mathbf{y} = (y_1, \ldots, y_n)\) and parameter \(w\), the likelihood is \[L(w \mid \mathbf{y}) = P(\mathbf{y} \mid w).\] The log-likelihood is \(\ell(w) = \log L(w \mid \mathbf{y})\), which is often easier to work with computationally.
Maximum likelihood estimation (MLE) is a method for estimating parameters by choosing the parameter value that maximizes the likelihood of the observed data: \[\hat{w} = \arg\max_w L(w \mid \mathbf{y}) = \arg\max_w \ell(w \mid \mathbf{y}).\]
These concepts will be essential for understanding the probabilistic formulation of linear regression, where we model noise using the normal distribution and estimate parameters via maximum likelihood.

Do all models with noise follow this approach? No. While the Gaussian noise assumption is common due to its mathematical convenience and the Central Limit Theorem, other noise distributions can be used depending on the nature of the data and the specific application. For example, if the noise is believed to be heavy-tailed, a Laplace distribution might be more appropriate; for count data, a Poisson distribution could be used. The choice of noise model affects the likelihood function and consequently the parameter estimation method.

Recall that the Central Limit Theorem states that the sum of a large number of independent random variables, regardless of their individual distributions, tends to follow a normal distribution, provided certain conditions are met. This theorem justifies the common use of Gaussian noise in modeling real-world phenomena, as many types of measurement errors and natural variations can be viewed as the cumulative effect of numerous small, independent factors.

In the house price example, is the noise really independent and identically distributed Gaussian? In practice, the assumption of independent and identically distributed (i.i.d.) Gaussian noise may not always hold. While it is a convenient and widely used assumption, several factors can lead to violations:
Heteroscedasticity: The variance of the noise may change with the level of the input features. For example, more expensive houses might have larger price fluctuations.
Autocorrelation: In time series data or spatial data, errors may be correlated with each other, violating the independence assumption.
Non-Gaussian Noise: The noise distribution may not be Gaussian; it could be skewed or have heavier tails than a normal distribution.
Outliers: Real-world data often contains outliers that can significantly affect the model fit and violate the i.i.d. assumption.
Despite these potential violations, the Gaussian noise assumption is often used as a first approximation due to its mathematical tractability and the Central Limit Theorem. However, it is important to assess the validity of this assumption in practice and consider alternative models or robust estimation techniques if necessary. Now let us derive the likelihood function for our linear regression model under the Gaussian noise assumption.
Given the model \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\), the conditional probability of observing \(y_i\) given \(x_i\) and parameters \(w\) is \[P(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right).\] Assuming the data points are independent, the likelihood of observing the entire dataset \((X, y)\) is the product of the individual probabilities: \[L(w \mid X, y) = \prod_{i=1}^n P(y_i \mid x_i, w) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2\right).\] Taking the logarithm of the likelihood function gives the log-likelihood: \[\ell(w \mid X, y) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2.\] To find the maximum likelihood estimate of \(w\), we can drop the constant term \(-\frac{n}{2} \log(2\pi\sigma^2)\); maximizing the log-likelihood is then equivalent to minimizing the sum of squared errors: \[\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2.\] This is precisely the same optimization problem we encountered in the least squares formulation of linear regression. Thus, the maximum likelihood estimate of \(w\) under the Gaussian noise assumption coincides with the least squares solution. This connection highlights the probabilistic interpretation of linear regression as fitting a model that maximizes the likelihood of the observed data under a Gaussian noise model.
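This equivalence can be checked numerically. The sketch below (synthetic data; the helper name `log_likelihood` is our own) fits least squares with NumPy and verifies that no perturbed \(w\) attains a higher Gaussian log-likelihood than the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative): y = X w + Gaussian noise
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -1.0])
sigma = 0.5
y = X @ w_true + rng.normal(scale=sigma, size=50)

def log_likelihood(w):
    # Gaussian log-likelihood of the data for a candidate weight vector w
    r = y - X @ w
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

# Least-squares solution (minimizes the sum of squared errors)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Because the log-likelihood is a constant minus a scaled squared error,
# w_ls also maximizes the likelihood: every perturbation does worse.
for _ in range(100):
    w_other = w_ls + rng.normal(scale=0.1, size=2)
    assert log_likelihood(w_ls) >= log_likelihood(w_other)
print(w_ls)   # close to w_true = [2, -1]
```

The assertions pass exactly because the log-likelihood is monotone in the negative squared error, so both criteria share the same optimum.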
Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much variance as possible. It achieves this by identifying the directions (principal components) along which the data varies the most.
In linear regression, we study a supervised problem: \[y = Xw + \varepsilon,\] where the goal is to predict an output variable \(y\) (e.g. house price) from features \(X\).
In PCA, we deliberately remove the output variable \(y\) and ask a different question:
Given only the feature data \(X\), how can we describe or represent the data as faithfully as possible using fewer numbers?
Thus PCA is an unsupervised method whose goal is not prediction, but representation and compression of data.
Let \[x_1, x_2, \dots, x_n \in \mathbb{R}^d\] be data points (e.g. houses described by size, bedrooms, etc.).
Since each data point \(x_i\) is a \(d\)-dimensional vector, the mean is computed as a vector average: \[\mu = \frac{1}{n}\sum_{i=1}^n x_i = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} \in \mathbb{R}^d,\] where each component \(\mu_j\) is the average of the \(j\)-th feature across all data points. When we subtract \(\mu\) from \(x_i\), we perform component-wise subtraction: \[x_i - \mu = \begin{bmatrix} x_{i1} - \mu_1 \\ x_{i2} - \mu_2 \\ \vdots \\ x_{id} - \mu_d \end{bmatrix}.\] This compact notation \(x_i \leftarrow x_i - \mu\) applies uniformly regardless of dimensionality.
We first perform mean centering: \[x_i \leftarrow x_i - \mu, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^n x_i.\]
After centering, \[\frac{1}{n}\sum_{i=1}^n x_i = 0.\]
Mean centering ensures that we study variation around the average, which is essential for defining variance and covariance.
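In NumPy, mean centering is a single broadcast subtraction; here it is applied to the housing features of the running example:

```python
import numpy as np

# Mean centering: one broadcast subtraction in NumPy
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])          # housing features (running example)

mu = X.mean(axis=0)                  # feature-wise mean vector
X_centered = X - mu                  # x_i <- x_i - mu, row by row

print(mu)                            # mean size 15, mean bedrooms ~2.33
print(X_centered.mean(axis=0))       # ~[0, 0]: centered data averages to 0
```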
After mean centering, the centered data points are distributed around the origin. To understand the structure of this distribution, we need a single mathematical object that captures:
How much each feature varies (spread along each axis),
How features vary together (whether large values in one feature tend to occur with large or small values in another).
Recall from probability that the variance of a random variable \(X\) is \(\text{Var}(X) = E[X^2]\) (when centered at zero). For data, this becomes \(\frac{1}{n}\sum_i x_i^2\), measuring spread along a single dimension.
For centered data points \(x_1, \ldots, x_n\), the outer product \(x_i x_i^\top\) creates a \(d \times d\) matrix that captures all pairwise products of coordinates: \[x_i x_i^\top = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{bmatrix} = \begin{bmatrix} x_{i1}^2 & x_{i1}x_{i2} & \cdots \\ x_{i2}x_{i1} & x_{i2}^2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}.\]
Averaging these matrices across all data points accumulates information about:
Diagonal entries: \(\frac{1}{n}\sum_i x_{ij}^2\) (variance of feature \(j\)),
Off-diagonal entries: \(\frac{1}{n}\sum_i x_{ij}x_{ik}\) (covariance between features \(j\) and \(k\)).
This is the natural generalization of univariate variance to the multivariate setting.
The covariance matrix is defined as \[\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top.\]
Each entry \[\Sigma_{jk} = \frac{1}{n}\sum_{i=1}^n x_{ij}x_{ik}\] is the covariance between feature \(j\) and feature \(k\).
Thus, the covariance matrix compactly stores:
variances of individual features (diagonal entries),
pairwise relationships between features (off-diagonal entries).
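A small check, on the centered housing features, that the average of outer products and the matrix form \(\frac{1}{n}X^\top X\) give the same covariance matrix:

```python
import numpy as np

# Centered housing features (exact fractions for the second column)
Xc = np.array([[-5.0, -1/3],
               [ 5.0,  2/3],
               [ 0.0, -1/3]])
n = len(Xc)

# Covariance as the average of outer products x_i x_i^T ...
Sigma_outer = sum(np.outer(x, x) for x in Xc) / n
# ... and in the equivalent matrix form (1/n) X^T X
Sigma_matrix = Xc.T @ Xc / n

print(np.allclose(Sigma_outer, Sigma_matrix))  # True
print(Sigma_outer)   # diagonal: feature variances; off-diagonal: covariance
```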
Choose a vector of weights \[v \in \mathbb{R}^d, \quad \|v\| = 1.\]
For each data point \(x_i\), define a one-dimensional summary: \[z_i = x_i^\top v.\]
This compresses each \(d\)-dimensional data point into a single number.
Different choices of \(v\) preserve different aspects of the data.
The variance of the summary values \(\{z_i\}\) is \[\operatorname{Var}(z) = \frac{1}{n}\sum_{i=1}^n z_i^2 = v^\top \Sigma v.\]
Interpretation:
Large variance \(\Rightarrow\) the summary distinguishes data points well.
Zero variance \(\Rightarrow\) all data points collapse to the same value (no information).
Thus, variance measures how much of the differences between data points survive compression.
To find the best one-dimensional representation, PCA solves: \[\max_{\|v\|=1} \; v^\top \Sigma v.\]
This chooses the linear combination of features whose values vary the most across the dataset.
The solution is called the first principal component \(v_1\).
Given the summary \[z_i = x_i^\top v,\] the natural reconstruction of the original data point is \[\hat{x}_i = z_i\, v.\]
All reconstructed points lie in the one-dimensional subspace spanned by \(v\).
The average reconstruction error is \[\frac{1}{n}\sum_{i=1}^n \|x_i - \hat{x}_i\|^2 = \frac{1}{n}\sum_{i=1}^n \|x_i - (x_i^\top v)v\|^2.\]
A fundamental result is:
The vector \(v\) that maximizes variance also minimizes reconstruction error.
Hence, PCA finds the best low-dimensional approximation of the data in the least-squares sense.
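This equivalence is easy to verify numerically. The sketch below (synthetic data; helper names are our own) scans unit vectors \(v = (\cos\theta, \sin\theta)\) and confirms that the direction maximizing \(v^\top \Sigma v\) is the same one minimizing reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic centered 2-D data cloud (illustrative)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)
Sigma = X.T @ X / len(X)

def variance_along(v):
    # Variance of the 1-D summaries z_i = x_i^T v
    return v @ Sigma @ v

def reconstruction_error(v):
    # Mean squared error of reconstructing x_i as (x_i^T v) v
    z = X @ v
    return np.mean(np.sum((X - np.outer(z, v)) ** 2, axis=1))

# Scan unit vectors v = (cos t, sin t) over a half circle
angles = np.linspace(0.0, np.pi, 1000, endpoint=False)
vs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

best_var = vs[np.argmax([variance_along(v) for v in vs])]
best_err = vs[np.argmin([reconstruction_error(v) for v in vs])]

print(np.allclose(best_var, best_err))  # the same direction wins both
```

The agreement is exact (up to grid resolution) because, for unit \(v\), the error equals \(\frac{1}{n}\sum_i \|x_i\|^2 - v^\top \Sigma v\).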
Consider houses described by size and number of bedrooms (mean-centered):
\[x_1 = (-5,-0.33),\quad x_2 = (5,0.67),\quad x_3 = (0,-0.33).\]
Choosing \(v=(0,1)\) (only bedrooms) produces low variance summaries.
Choosing \(v=(1,0)\) (only size) produces high variance summaries.
Thus PCA automatically discovers that house size is the dominant pattern of variation.
PCA provides:
optimal linear compression of data,
noise reduction by discarding low-variance directions,
interpretable dominant patterns,
better conditioning for downstream tasks (regression, clustering, density estimation).
PCA does not discover meaning or causality. It guarantees only this:
Among all linear low-dimensional representations, PCA loses the least information under squared error.
Let \(\mathcal{X} = \{x_1, \ldots, x_N\}\) be a dataset where \(x_n \in \mathbb{R}^D\). We seek to find a lower-dimensional representation in \(\mathbb{R}^M\) where \(M < D\), while preserving as much information as possible.
PCA projects data onto a lower-dimensional linear subspace. The key is to find an orthonormal basis \(\{b_1, \ldots, b_M\}\) of this \(M\)-dimensional subspace such that the projected data retains maximum variance.
For a given orthonormal basis \(B = [b_1, \ldots, b_M]\) where \(b_i \in \mathbb{R}^D\) and \(b_i^\top b_j = \delta_{ij}\), the projection of a data point \(x_n\) onto this subspace is: \[\tilde{x}_n = \sum_{i=1}^M z_{ni} b_i = B z_n,\] where \(z_n = [z_{n1}, \ldots, z_{nM}]^\top \in \mathbb{R}^M\) are the coordinates (principal component scores) in the new basis, computed as \(z_{ni} = x_n^\top b_i\).
PCA finds the orthonormal basis that maximizes the variance of the projected data. For mean-centered data (where \(\sum_{n=1}^N x_n = 0\)), the variance of the projections onto a single direction \(b_1\) is: \[V_1 = \frac{1}{N} \sum_{n=1}^N (x_n^\top b_1)^2 = \frac{1}{N} \sum_{n=1}^N b_1^\top x_n x_n^\top b_1 = b_1^\top S b_1,\] where \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top\) is the data covariance matrix.
The first principal component is found by solving: \[\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \|b_1\|^2 = 1.\]
Using the method of Lagrange multipliers with Lagrangian \(\mathcal{L}(b_1, \lambda_1) = b_1^\top S b_1 - \lambda_1(b_1^\top b_1 - 1)\), we obtain: \[\frac{\partial \mathcal{L}}{\partial b_1} = 2S b_1 - 2\lambda_1 b_1 = 0 \quad \Rightarrow \quad S b_1 = \lambda_1 b_1.\]
Thus, \(b_1\) is an eigenvector of \(S\) with eigenvalue \(\lambda_1\). The variance along this direction is: \[V_1 = b_1^\top S b_1 = b_1^\top (\lambda_1 b_1) = \lambda_1 b_1^\top b_1 = \lambda_1.\]
To maximize variance, we choose \(b_1\) to be the eigenvector corresponding to the largest eigenvalue \(\lambda_1\) of \(S\).
For the \(m\)-th principal component, we solve: \[\max_{b_m} \; b_m^\top S b_m \quad \text{subject to} \quad \|b_m\|^2 = 1 \text{ and } b_m \perp b_i \text{ for } i = 1, \ldots, m-1.\]
The solution is the eigenvector corresponding to the \(m\)-th largest eigenvalue of \(S\).
PCA can equivalently be formulated as minimizing the average squared reconstruction error. The reconstruction of \(x_n\) from its projection is: \[\tilde{x}_n = \sum_{i=1}^M (x_n^\top b_i) b_i = BB^\top x_n,\] where \(B = [b_1, \ldots, b_M]\).
The reconstruction error for the entire dataset is: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^N \|x_n - BB^\top x_n\|^2.\]
Using the identity \(\|x_n\|^2 = \|BB^\top x_n\|^2 + \|(I - BB^\top)x_n\|^2\) for orthonormal \(B\), we have: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n\|^2 - \frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2.\]
Since the first term is constant, minimizing \(J\) is equivalent to maximizing \(\frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2\), which is the variance of the projected data. Thus, maximum variance and minimum reconstruction error are equivalent objectives.
Center the data: Compute mean \(\mu = \frac{1}{N}\sum_{n=1}^N x_n\) and set \(x_n \leftarrow x_n - \mu\).
Compute covariance matrix: \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top = \frac{1}{N} X^\top X\), where \(X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}\).
Eigendecomposition: Compute eigenvectors and eigenvalues of \(S\): \(S = \sum_{i=1}^D \lambda_i b_i b_i^\top\) where \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D\).
Select principal components: Choose the top \(M\) eigenvectors \(B = [b_1, \ldots, b_M]\).
Project data: For each \(x_n\), compute \(z_n = B^\top x_n \in \mathbb{R}^M\).
The proportion of variance captured by the first \(M\) principal components is: \[\frac{\sum_{i=1}^M \lambda_i}{\sum_{i=1}^D \lambda_i}.\]
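The algorithm above can be sketched in a few lines of NumPy (the function name `pca` is our own):

```python
import numpy as np

def pca(X, M):
    """PCA via the covariance eigendecomposition, following the steps above.
    X has shape (N, D); returns scores Z, basis B, and variance fraction."""
    Xc = X - X.mean(axis=0)                 # 1. center the data
    S = Xc.T @ Xc / len(X)                  # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)    # 3. eigendecomposition
    order = np.argsort(eigvals)[::-1]       #    sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    B = eigvecs[:, :M]                      # 4. keep the top M eigenvectors
    Z = Xc @ B                              # 5. project the data
    return Z, B, eigvals[:M].sum() / eigvals.sum()

# Housing features from the running example
X = np.array([[10.0, 2.0], [20.0, 3.0], [15.0, 2.0]])
Z, B, explained = pca(X, M=1)
print(explained)   # fraction of total variance captured by PC1 (~0.997)
```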
Orthogonality: Principal components are mutually orthogonal: \(b_i^\top b_j = \delta_{ij}\).
Uncorrelated Projections: The coordinates \(z_{ni}\) and \(z_{nj}\) for \(i \neq j\) are uncorrelated across the dataset.
Variance Ordering: Variance along principal components decreases: \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0\).
Linear Transformation: PCA is a linear method that cannot capture nonlinear structure in data.
Scale Sensitivity: PCA is sensitive to feature scales; standardization is often applied before PCA.
For high-dimensional data where \(N < D\), computing the \(D \times D\) covariance matrix \(S\) is expensive. Instead, we can use the singular value decomposition (SVD) of the centered data matrix \(X \in \mathbb{R}^{N \times D}\): \[X = U \Sigma V^\top,\] where \(U \in \mathbb{R}^{N \times N}\), \(\Sigma \in \mathbb{R}^{N \times D}\), \(V \in \mathbb{R}^{D \times D}\) are the SVD factors.
The relationship with PCA:
The columns of \(V\) are the principal components (eigenvectors of \(S = \frac{1}{N}X^\top X\)).
The eigenvalues of \(S\) are \(\lambda_i = \frac{\sigma_i^2}{N}\), where \(\sigma_i\) are the singular values.
The projections are given by \(Z = XV = U\Sigma \in \mathbb{R}^{N \times D}\).
This approach is numerically more stable and efficient, especially when \(N \ll D\).
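The SVD-PCA relationship can be verified directly on a small random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 6, 3
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)                  # centered data matrix

# Route 1: eigendecomposition of the covariance matrix
S = X.T @ X / N
eigvals = np.linalg.eigvalsh(S)[::-1]   # eigenvalues, descending

# Route 2: SVD of the data matrix itself
U, sing, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(eigvals, sing**2 / N))   # lambda_i = sigma_i^2 / N
print(np.allclose(X @ Vt.T, U * sing))     # projections Z = XV = U Sigma
```

Both prints are `True`: the squared singular values (divided by \(N\)) are the covariance eigenvalues, and \(XV = U\Sigma\).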
PCA identifies an orthogonal coordinate system aligned with the directions of maximum variance in the data cloud. Geometrically:
The first principal component \(b_1\) points in the direction of greatest spread.
The second principal component \(b_2\) is orthogonal to \(b_1\) and points in the direction of second-greatest spread.
Subsequent components follow similarly, forming a nested sequence of best-fit subspaces.
The \(M\)-dimensional subspace spanned by \(\{b_1, \ldots, b_M\}\) is the best-fitting \(M\)-dimensional linear subspace in the least-squares sense.
Let the dataset consist of \(N\) centered samples in \(\mathbb{R}^d\): \[X \in \mathbb{R}^{N \times d}, \qquad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nd} \end{bmatrix}.\] The \(n\)-th data point is the row vector \[x_n = \begin{bmatrix} x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} \in \mathbb{R}^d.\]
The empirical covariance matrix is \[S = \frac{1}{N} X^T X \in \mathbb{R}^{d \times d}.\]
Let \[B \in \mathbb{R}^{d \times m}, \qquad B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \cdots & b_{dm} \end{bmatrix} = \begin{bmatrix} b_1 & b_2 & \cdots & b_m \end{bmatrix},\] where the columns \(b_k \in \mathbb{R}^d\) are orthonormal: \[B^T B = I_m.\]
Each direction satisfies the eigenvalue problem \[S b_k = \lambda_k b_k.\]
The low-dimensional representation is \[Z = X B \in \mathbb{R}^{N \times m},\] \[Z = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1m} \\ z_{21} & z_{22} & \cdots & z_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ z_{N1} & z_{N2} & \cdots & z_{Nm} \end{bmatrix}.\]
Componentwise, \[z_{nk} = \sum_{j=1}^d x_{nj} \, b_{jk} = b_k^T x_n.\]
In particular, the first principal component is \[z_{n1} = b_{11}x_{n1} + b_{21}x_{n2} +\cdots+ b_{d1}x_{nd}.\] Thus the first principal component is a linear combination of all original features.
Let \[z_1 = \begin{bmatrix} z_{11} \\ z_{21} \\ \vdots \\ z_{N1} \end{bmatrix}.\] Then \[\mathrm{Var}(z_1) = \frac{1}{N} \sum_{n=1}^N z_{n1}^2 = \frac{1}{N} \sum_{n=1}^N (b_1^T x_n)^2 = b_1^T S b_1.\] PCA chooses \(b_1\) to maximize this variance: \[\max_{\|b_1\|=1} \; b_1^T S b_1.\]
The reconstructed data is \[\hat{X} = Z B^T = X B B^T.\]
The matrix \[P = B B^T \in \mathbb{R}^{d \times d}\] has components \[P_{ij} = \sum_{k=1}^m b_{ik} b_{jk}.\]
Thus reconstruction componentwise is \[\hat{x}_{ni} = \sum_{j=1}^d x_{nj} P_{ji} = \sum_{j=1}^d x_{nj} \left( \sum_{k=1}^m b_{jk} b_{ik} \right).\]
If \(m < d\), then \[B^T B = I_m, \qquad BB^T \neq I_d.\] Instead, \[(BB^T)^2 = BB^T,\] so \(BB^T\) is a projection matrix onto the \(m\)-dimensional subspace spanned by \(\{b_1,\dots,b_m\}\).
Therefore \[\hat{X} = XBB^T \neq X\] unless \(m=d\), in which case \(B\) is square orthogonal and \[BB^T = I_d.\]
PCA performs the transformation \[X_{N \times d} \;\longrightarrow\; Z = X B \; (N \times m),\] where each new coordinate is \[z_{nk} = b_k^T x_n,\] and reconstruction is the orthogonal projection \[\hat{X} = XBB^T.\]
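The projection properties \(B^\top B = I_m\) but \(BB^\top \neq I_d\) are easy to check numerically, here with a random orthonormal basis obtained from a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)

# Random orthonormal basis B in R^{3x2} via QR (d = 3, m = 2)
B, _ = np.linalg.qr(rng.normal(size=(3, 2)))

P = B @ B.T   # projection matrix onto span{b_1, b_2}

print(np.allclose(B.T @ B, np.eye(2)))   # B^T B = I_m        -> True
print(np.allclose(P @ P, P))             # (BB^T)^2 = BB^T    -> True
print(np.allclose(P, np.eye(3)))         # BB^T = I_d fails   -> False
```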
Let us return to our housing price example, now with three data points: \[\begin{bmatrix} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{bmatrix}.\] Here the first two columns are features (size in 100 sq ft and number of bedrooms) and the last column is the target variable (price in Rs. lakhs). For PCA, we consider only the feature columns: \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix}.\] Why only the feature columns? Because in real life we may not know the target variable for every data point. Does that make the dataset useless? No: we can still use the feature columns to learn about the structure of the data. PCA finds a lower-dimensional representation of the data that captures the most important information, which is useful for visualization, noise reduction, and as a preprocessing step for other machine learning algorithms, where it can reduce overfitting and computational cost. So in this example, we will reduce the two features (size and number of bedrooms) to a single feature that captures the most variance in the data. The first step is to center the data by subtracting the mean of each feature from the corresponding feature values. This ensures that each feature has mean zero, which matters for PCA because it studies directions of maximum variance rather than absolute values.
The mean of the first feature (size) is \((10 + 20 + 15) / 3 = 15\), and the mean of the second feature (number of bedrooms) is \((2 + 3 + 2) / 3 \approx 2.33\). Centering the data gives: \[\begin{bmatrix} 10 - 15 & 2 - 2.33 \\ 20 - 15 & 3 - 2.33 \\ 15 - 15 & 2 - 2.33 \end{bmatrix}.\] This results in the centered data matrix: \[\begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] We can now compute the covariance matrix of the centered data, because PCA relies on the covariance structure of the data to identify the principal components: the principal components are directions in the feature space along which the data varies the most, and the covariance matrix captures this variance together with the relationships between different features. Geometrically, the covariance matrix describes how the data is spread out in different directions, which is exactly what PCA needs in order to determine the axes of maximum variance. This is illustrated graphically below:
This graph shows the centered data points in a 2D space defined by size and number of bedrooms. The points are plotted relative to the origin, which represents the mean of the original data. The spread of these points indicates the variance in the data, which PCA will analyze to find the principal components.
The covariance matrix \(C\) is computed as: \[C = \frac{1}{n} Z^\top Z,\] where \(Z\) is the centered data matrix and \(n\) is the number of data points. In our case, \(n = 3\), so we have: \[C = \frac{1}{3} \begin{bmatrix} -5 & 5 & 0 \\ -0.33 & 0.67 & -0.33 \end{bmatrix} \begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] Calculating this gives: \[C = \frac{1}{3} \begin{bmatrix} (-5)(-5) + (5)(5) + (0)(0) & (-5)(-0.33) + (5)(0.67) + (0)(-0.33) \\ (-0.33)(-5) + (0.67)(5) + (-0.33)(0) & (-0.33)(-0.33) + (0.67)(0.67) + (-0.33)(-0.33) \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 25 + 25 + 0 & 1.65 + 3.35 + 0 \\ 1.65 + 3.35 + 0 & 0.1089 + 0.4489 + 0.1089 \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 50 & 5 \\ 5 & 0.6667 \end{bmatrix} = \begin{bmatrix}16.6667 & 1.6667 \\ 1.6667 & 0.2222\end{bmatrix}.\] Next step is to compute the eigenvalues and eigenvectors of the covariance matrix \(C\). The eigenvalues represent the amount of variance captured by each principal component, while the eigenvectors indicate the directions of these components in the feature space.
At the heart of PCA lies the optimization problem \[\max_{\|v\|=1} \; v^\top \Sigma v,\] where \(\Sigma\) is the covariance matrix of the mean-centered data.
This problem asks for a linear combination of features whose values have maximum variance. Without the unit-length constraint \(\|v\|=1\), the objective would increase arbitrarily by scaling \(v\), so the constraint is essential.
To solve this constrained maximization, we introduce a Lagrange multiplier: \[\mathcal{L}(v,\lambda) = v^\top \Sigma v - \lambda (v^\top v - 1).\]
Taking derivatives with respect to \(v\) and setting them to zero yields \[\nabla_v \mathcal{L} = 2\Sigma v - 2\lambda v = 0,\] which simplifies to \[\Sigma v = \lambda v.\]
This equation is precisely the definition of an eigenvector-eigenvalue pair. Thus, eigenvectors are not assumed in PCA; they emerge inevitably as the stationary points of the variance maximization problem.
For any unit vector \(v\), \[v^\top \Sigma v\] equals the variance of the one-dimensional projection \(z = x^\top v\).
When \(v\) is an eigenvector of \(\Sigma\) with eigenvalue \(\lambda\), this quantity becomes \[v^\top \Sigma v = \lambda.\]
Hence, the eigenvalue directly measures the variance captured by its corresponding eigenvector. Selecting the eigenvector with the largest eigenvalue therefore yields the direction of maximum variance, called the first principal component.
After the first component is chosen, PCA seeks additional components that:
capture as much remaining variance as possible,
do not duplicate information already captured.
This leads naturally to the requirement that subsequent vectors be orthogonal to earlier ones. The eigenvectors of a symmetric matrix such as \(\Sigma\) are automatically orthogonal, so they provide an ordered set of non-redundant directions.
Eigenvectors define weight patterns across features.
Eigenvalues quantify how much variance each pattern explains.
Keeping the top \(k\) eigenvectors yields the best \(k\)-dimensional linear approximation of the data.
Eigenvectors should not be interpreted as causes or predictors. They represent optimal coordinates in which the structure of the data is most economically expressed.
Eigenvectors and eigenvalues appear in PCA because maximizing variance under a normalization constraint mathematically forces the eigenvalue equation; they quantify optimal directions and the amount of variation captured along them.
To find the eigenvalues \(\lambda\) of the covariance matrix \(C\), we solve the characteristic equation: \[\det(C - \lambda I) = 0,\] where \(I\) is the identity matrix. Thus, we have: \[\det\begin{bmatrix}16.6667 - \lambda & 1.6667 \\ 1.6667 & 0.2222 - \lambda\end{bmatrix} = 0.\] Calculating the determinant gives: \[(16.6667 - \lambda)(0.2222 - \lambda) - (1.6667)(1.6667) = 0.\] Expanding this, we get: \[\lambda^2 - 16.8889\lambda + (16.6667 \cdot 0.2222 - 2.7778) = 0,\] \[\lambda^2 - 16.8889\lambda + 0.9259 = 0.\] Using the quadratic formula: \[\lambda = \frac{16.8889 \pm \sqrt{(16.8889)^2 - 4 \cdot 1 \cdot 0.9259}}{2}.\] Calculating the discriminant: \[(16.8889)^2 - 4 \cdot 0.9259 \approx 285.2346 - 3.7037 = 281.5309, \qquad \sqrt{281.5309} \approx 16.7789.\] Thus, \[\lambda \approx \frac{16.8889 \pm 16.7789}{2}.\] This gives us two eigenvalues: \[\lambda_1 \approx 16.8339, \quad \lambda_2 \approx 0.0550.\] Next, we find the eigenvectors corresponding to each eigenvalue by solving the equation: \[(C - \lambda I)v = 0.\] For \(\lambda_1 \approx 16.8339\): \[\begin{bmatrix}16.6667 - 16.8339 & 1.6667 \\ 1.6667 & 0.2222 - 16.8339\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}-0.1672 & 1.6667 \\ 1.6667 & -16.6117\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0,\] whose first row gives \(v_2 \approx 0.1003\, v_1\). Normalizing to unit length yields the eigenvector: \[v_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] For \(\lambda_2 \approx 0.0550\): \[\begin{bmatrix}16.6667 - 0.0550 & 1.6667 \\ 1.6667 & 0.2222 - 0.0550\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}16.6117 & 1.6667 \\ 1.6667 & 0.1672\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0,\] giving the eigenvector: \[v_2 \approx \begin{bmatrix}-0.0998 \\ 0.9950\end{bmatrix}.\] The eigenvector corresponding to the largest eigenvalue \(\lambda_1\) represents the direction of the first principal component, which captures the most variance in the data.
In this case, the first principal component is approximately: \[PC_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] To reduce the dimensionality of the data from two features to one, we project the centered data onto the first principal component. This is done by taking the dot product of each centered data point with the first principal component vector.
The projection of a centered data point \(z_i\) onto the first principal component \(PC_1\) is given by: \[p_i = z_i^\top PC_1.\] Calculating the projections for each centered data point: \[\begin{aligned} p_1 &= \begin{bmatrix}-5 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx -4.9750 - 0.0329 \approx -5.008, \\ p_2 &= \begin{bmatrix}5 & 0.67\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx 4.9750 + 0.0669 \approx 5.042, \\ p_3 &= \begin{bmatrix}0 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx -0.033. \end{aligned}\] Thus, the one-dimensional representation of the data after PCA is approximately: \[\begin{bmatrix}-5.008 \\ 5.042 \\ -0.033\end{bmatrix}.\] This reduced representation captures the most significant variance of the original two-dimensional feature space, allowing us to analyze and visualize the data more effectively with less complexity. Let us find the variance of this one-dimensional representation: \[\mathrm{Var}(p) = \frac{1}{3} \sum_{i=1}^3 p_i^2 \approx \frac{1}{3}\big((-5.008)^2 + (5.042)^2 + (-0.033)^2\big) \approx \frac{1}{3}(25.080 + 25.422 + 0.001) \approx 16.834.\] This variance matches the largest eigenvalue \(\lambda_1 \approx 16.8339\), confirming that the first principal component captures the majority of the variance in the data. The second principal component captures the remaining variance, which is much smaller here, indicating that the data is primarily spread along the direction of the first principal component.
See the calculation below. Projecting the centered data onto the second principal component \(v_2 \approx (-0.0998, 0.9950)\) gives \(q_1 \approx 0.171\), \(q_2 \approx 0.168\), \(q_3 \approx -0.328\), so \[\mathrm{Var}(q) = \frac{1}{3} \sum_{i=1}^3 q_i^2 \approx \frac{1}{3}(0.0291 + 0.0281 + 0.1078) \approx 0.055.\] This variance matches the second eigenvalue \(\lambda_2 \approx 0.0550\), confirming that the second principal component captures the remaining variance in the data. The first principal component is the most informative direction in the feature space, while the second captures a much smaller amount of variance, indicating that it is far less important for understanding the structure of the data. This example illustrates how PCA identifies the directions of maximum variance in the data and lets us reduce the dimensionality while retaining as much information as possible. By projecting the data onto the first principal component, we can analyze the most significant patterns in the data with a simpler representation.
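The hand computation can be double-checked by recomputing everything from the raw feature data with NumPy's symmetric eigensolver:

```python
import numpy as np

# Recompute the worked example directly from the raw feature data
X = np.array([[10.0, 2.0], [20.0, 3.0], [15.0, 2.0]])
Xc = X - X.mean(axis=0)                # mean centering
C = Xc.T @ Xc / 3                      # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
lam2, lam1 = eigvals                   # lam1 is the larger eigenvalue
pc1 = eigvecs[:, 1]                    # first principal component

p = Xc @ pc1                           # 1-D projections onto PC1
print(lam1, lam2)                      # the two eigenvalues of C
print(np.isclose(p.var(), lam1))       # projected variance equals lambda_1
```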
While PCA is a powerful tool for dimensionality reduction and data analysis, it has several limitations:
Linearity: PCA can only capture linear relationships between features. If the data has nonlinear structure, PCA may not effectively capture it. For example, if the data lies on a nonlinear manifold, PCA may fail to find meaningful components. Here is a small dataset with a full calculation to illustrate this point: \[\begin{bmatrix} 1 & 1 \\ 2 & 4 \\ 3 & 9 \\ 4 & 16 \\ 5 & 25 \end{bmatrix}.\] In this dataset, the second feature is the square of the first feature, a nonlinear relationship. When we apply PCA to this dataset, it can only fit a straight line through the parabola, so the curvature of the data cannot be represented. The analogous calculations are as follows:
Data Centering: The mean of the first feature is \((1 + 2 + 3 + 4 + 5) / 5 = 3\), and the mean of the second feature is \((1 + 4 + 9 + 16 + 25) / 5 = 11\). Centering the data gives: \[\begin{bmatrix}-2 & -10 \\ -1 & -7 \\ 0 & -2 \\ 1 & 5 \\ 2 & 14 \end{bmatrix}.\]
Covariance Matrix Calculation: The covariance matrix \(C\) is computed as: \[C = \frac{1}{5} \begin{bmatrix}-2 & -1 & 0 & 1 & 2 \\ -10 & -7 & -2 & 5 & 14\end{bmatrix} \begin{bmatrix}-2 & -10 \\ -1 & -7 \\ 0 & -2 \\ 1 & 5 \\ 2 & 14\end{bmatrix}.\] Calculating this gives: \[C = \frac{1}{5} \begin{bmatrix}10 & 60 \\ 60 & 374\end{bmatrix} = \begin{bmatrix}2 & 12 \\ 12 & 74.8\end{bmatrix}.\]
Eigenvalue and Eigenvector Calculation: To find the eigenvalues \(\lambda\) of the covariance matrix \(C\), we solve the characteristic equation \(\det(C - \lambda I) = 0\), where \(I\) is the identity matrix: \[\det\begin{bmatrix}2 - \lambda & 12 \\ 12 & 74.8 - \lambda\end{bmatrix} = 0.\] Calculating the determinant gives: \[(2 - \lambda)(74.8 - \lambda) - (12)(12) = 0.\] Expanding this, we get: \[\lambda^2 - 76.8\lambda + (149.6 - 144) = 0, \qquad \lambda^2 - 76.8\lambda + 5.6 = 0.\] The quadratic formula yields \[\lambda = \frac{76.8 \pm \sqrt{76.8^2 - 4 \cdot 5.6}}{2} = \frac{76.8 \pm \sqrt{5875.84}}{2} \approx \frac{76.8 \pm 76.654}{2},\] giving the two eigenvalues \[\lambda_1 \approx 76.727, \quad \lambda_2 \approx 0.073.\] The first principal component satisfies \((2 - 76.727)v_1 + 12 v_2 = 0\), i.e. \(v_2 \approx 6.227\, v_1\), so after normalization \[PC_1 \approx \begin{bmatrix}0.159 \\ 0.987\end{bmatrix}.\] The first component alone explains \(76.727 / 76.8 \approx 99.9\%\) of the variance, so PCA compresses this parabola onto a single straight line almost perfectly in the least-squares sense. However, the small residual variance \(\lambda_2 \approx 0.073\) is precisely the curvature of the parabola: no linear projection can represent it. Note also that \(PC_1\) is dominated by the second feature simply because \(x^2\) spans a much larger numeric range, a reminder of PCA's sensitivity to feature scales. This illustrates that PCA cannot capture the underlying nonlinear structure of the data; it only finds the best linear directions of variance.
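A numerical check of this example (a NumPy sketch): PCA explains almost all the variance with one line, yet the small residual is exactly the parabola's curvature.

```python
import numpy as np

# The parabola dataset: the second feature is the square of the first
X = np.array([[1, 1], [2, 4], [3, 9], [4, 16], [5, 25]], dtype=float)
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
lam2, lam1 = eigvals
pc1 = eigvecs[:, 1]

print(lam1 / (lam1 + lam2))            # ~0.999: one line explains ~99.9%

# Reconstructing from PC1 alone puts every point on a straight line;
# the curvature of the parabola survives only as the residual error.
X_hat = np.outer(Xc @ pc1, pc1)
residual = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(np.isclose(residual, lam2))      # residual variance equals lambda_2
```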
Sensitivity to Scaling: PCA is sensitive to the scale of the features. If one feature has a much larger range than others, PCA may primarily capture the variance of that feature, overshadowing the variance in other features. It is often necessary to standardize the data (e.g., by z-score normalization) before applying PCA to ensure that all features contribute equally to the analysis.
Interpretability: The principal components are linear combinations of the original features, which can make them difficult to interpret. The components may not correspond to meaningful real-world concepts, especially when there are many features. This can make it challenging to understand what the principal components represent in terms of the original data.
Assumption of Linear Relationships: PCA assumes that the directions of maximum variance are the most important for capturing the structure of the data. However, this may not always be the case, especially if the data has complex relationships that are not captured by variance alone. For example, in some cases, the most informative features may not be those with the highest variance, leading to suboptimal dimensionality reduction. A real world example of this is in image data, where the most important features for classification may not correspond to the directions of maximum variance, but rather to specific patterns or textures that are crucial for distinguishing between classes.
In this section we will discuss the third pillar of unsupervised learning: density estimation. The goal of density estimation is to model the underlying probability distribution of the data. One common approach to density estimation is to use a Gaussian Mixture Model (GMM), which assumes that the data is generated from a mixture of several Gaussian distributions. Each Gaussian component represents a cluster in the data, and the overall model can capture complex, multimodal distributions. We will follow the notation and example from the book Mathematics for Machine Learning by Deisenroth, Faisal, and Ong. In that example the dataset is only one-dimensional, but the concepts extend easily to higher dimensions. The dataset consists of the following points: \[-3,-2.5,-1,0,2,4,5.\] The first step in fitting a GMM to this data is to choose the number of Gaussian components, which we will denote as \(K\). For this example, we will choose \(K=3\). We initialize the mixture components with the following three distributions \[p_1(x) = \mathcal{N}(x; -4,1), \quad p_2(x) = \mathcal{N}(x; 0,0.2), \quad p_3(x) = \mathcal{N}(x; 8,3),\] and assign equal weights to each component: \[\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}.\]
The general GMM density is \[p(x|\theta) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \Sigma_k),\] where \(\theta\) collects the parameters of the model: the means \(\mu_k\), the covariances \(\Sigma_k\), and the mixture weights \(\pi_k\). That is, \[\theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \dots, K\}.\]
Now the parameters of the GMM are learned through Maximum Likelihood Estimation (MLE), which involves maximizing the likelihood of the observed data under the model. The likelihood function for a GMM is given by: \[L(\theta) = p( \mathcal{X}|\theta)= \prod_{n=1}^N p(x_n|\theta) = \prod_{n=1}^N \sum_{k=1}^K \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k).\] Maximizing this likelihood directly is challenging because of the summation inside the product. Instead, we typically use the Expectation-Maximization (EM) algorithm, which iteratively updates the parameters of the GMM to find a local maximum of the likelihood function. The EM algorithm consists of two main steps: the E-step (Expectation step) and the M-step (Maximization step). In the E-step, we compute the responsibilities, the probabilities that each data point belongs to each Gaussian component. In the M-step, we update the parameters of the GMM based on these responsibilities. This process is repeated until convergence, at which point we have a fitted GMM that models the underlying distribution of the data. Taking the logarithm of the likelihood function gives the log-likelihood, which is easier to work with: \[\mathcal{L}:=\log L(\theta) = \sum_{n=1}^N \log \left( \sum_{k=1}^K \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k) \right).\] The EM algorithm maximizes this quantity, which is difficult to handle directly because it contains a logarithm of a sum: unlike for a single Gaussian distribution or for PCA, no closed-form solution is available. The EM algorithm is guaranteed to converge to a local maximum of the log-likelihood function, but it may not find the global maximum.
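The E-step/M-step loop described above can be sketched in a few lines of numpy for the one-dimensional dataset and initialization given earlier. This is a minimal illustration, not a production implementation (no convergence test, no safeguards against variance collapse); the M-step updates used here are the standard closed-form GMM updates derived later in this section:

```python
import numpy as np

# Dataset and initialization from the text (K = 3, one-dimensional)
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
mu  = np.array([-4.0, 0.0, 8.0])   # component means
var = np.array([1.0, 0.2, 3.0])    # component variances sigma_k^2
pi  = np.array([1/3, 1/3, 1/3])    # mixture weights

def gauss(x, m, v):
    """Univariate Gaussian density N(x; m, v) with variance v."""
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def log_likelihood():
    return np.log((pi * gauss(X[:, None], mu, var)).sum(axis=1)).sum()

ll_old = log_likelihood()
for _ in range(20):
    # E-step: responsibilities r[n, k] = pi_k N_k / sum_j pi_j N_j
    dens = pi * gauss(X[:, None], mu, var)        # shape (N, K)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates weighted by the responsibilities
    Nk  = r.sum(axis=0)
    mu  = (r * X[:, None]).sum(axis=0) / Nk
    var = (r * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi  = Nk / len(X)
```

Each pass through the loop is one EM iteration; the log-likelihood never decreases from one iteration to the next, which is a useful sanity check on any implementation.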
Therefore, it is often recommended to run the EM algorithm multiple times with different initializations to increase the chances of finding a better solution. Additionally, the choice of \(K\) (the number of components) can significantly affect the performance of the GMM, and methods such as cross-validation or information criteria (e.g., AIC, BIC) can be used to select an appropriate value for \(K\). To find a maximum of the log-likelihood function, we set the three partial derivatives with respect to the parameters \(\mu_k\), \(\Sigma_k\), and \(\pi_k\) to zero: \[\frac{\partial \mathcal{L}}{\partial \mu_k} = 0, \quad \frac{\partial \mathcal{L}}{\partial \Sigma_k} = 0, \quad \frac{\partial \mathcal{L}}{\partial \pi_k} = 0,\] which (for \(\pi_k\), after enforcing the constraint \(\sum_{k=1}^K \pi_k = 1\) with a Lagrange multiplier, whose value turns out to be \(-N\)) is equivalent to the following equations: \[\sum_{n=1}^N \frac{\pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} (x_n - \mu_k) = 0,\] \[\sum_{n=1}^N \frac{\pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} \left( (x_n - \mu_k)(x_n - \mu_k)^T - \Sigma_k \right) = 0,\] \[\sum_{n=1}^N \frac{\mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} - N = 0.\] These equations come from setting the gradients of the log-likelihood to zero, a necessary condition for a maximum. The first equation updates the means \(\mu_k\), the second updates the covariances \(\Sigma_k\), and the third updates the mixture weights \(\pi_k\). Let us look at the detailed derivation of these equations, starting with the first one, for updating the means \(\mu_k\). The log-likelihood function is given by: \[\mathcal{L} = \sum_{n=1}^N \log \left( \sum_{k=1}^K \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k) \right).\] To find the gradient with respect to \(\mu_k\), we use the chain rule of differentiation.
Let us denote: \[r_{nk} = \frac{\pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)},\] which represents the responsibility of component \(k\) for data point \(n\). Note that \(\sum_{k=1}^K r_{nk} = 1\) for every \(n\), so the responsibilities form a probability distribution over components for each data point. Now we compute the gradient of the log-likelihood with respect to \(\mu_k\) using the chain rule: \[\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^N \frac{1}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} \cdot \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \frac{\partial \log \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\partial \mu_k} = \sum_{n=1}^N r_{nk}\, \frac{\partial \log \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\partial \mu_k}.\] The derivative of the log of a Gaussian distribution with respect to its mean is given by: \[\frac{\partial \log \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\partial \mu_k} = \Sigma_k^{-1} (x_n - \mu_k).\] A quick derivation of this is as follows: \[\log \mathcal{N}(x_n; \mu_k, \Sigma_k) = -\frac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \text{constant}.\] Taking the derivative with respect to \(\mu_k\) gives: \[\frac{\partial \log \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\partial \mu_k} = -\frac{1}{2} \left( -2 \Sigma_k^{-1} (x_n - \mu_k) \right) = \Sigma_k^{-1} (x_n - \mu_k).\]
Substituting this back into the gradient expression gives: \[\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^N r_{nk} \Sigma_k^{-1} (x_n - \mu_k).\] Setting this gradient to zero for maximization yields the update equation for the means: \[\sum_{n=1}^N r_{nk} (x_n - \mu_k) = 0 \implies \mu_k = \frac{\sum_{n=1}^N r_{nk} x_n}{\sum_{n=1}^N r_{nk}}.\] This equation indicates that the new mean \(\mu_k\) is a weighted average of the data points, where the weights are given by the responsibilities \(r_{nk}\). The more a data point belongs to component \(k\), the more it influences the new mean of that component.
The fourth pillar in Machine Learning is classification, and one of the most powerful algorithms for classification is the Support Vector Machine (SVM). SVM is geometrically intuitive and is based on the idea of finding a hyperplane that best separates the classes in the feature space. Recall that the equation of a straight line in two dimensions is given by: \[y = mx + b,\] but this notation is not very convenient for SVM, as we will work in higher dimensions. Instead, we can rewrite this equation in a more general form as: \[w^T x + b = 0,\] where \[w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}\] is a weight vector that is perpendicular to the hyperplane, \[x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix}\] is a point in the feature space of dimension \(D\), and \(b\) is a bias term that shifts the hyperplane away from the origin. Now let us recall the distance of a point from a line in two dimensions. The distance \(d\) from a point \((x_0, y_0)\) to the line \(y = mx + b\) is: \[d = \frac{|mx_0 - y_0 + b|}{\sqrt{m^2 + 1}}.\] This form does not generalize well to higher dimensions, so we need a more general formula for the distance from a point to a hyperplane in \(D\)-dimensional space. The distance from a point \(x_0\) to the hyperplane defined by \(w^T x + b = 0\) is: \[d = \frac{|w^T x_0 + b|}{\|w\|},\] where \(\|w\|\) is the Euclidean norm of the weight vector \(w\). This formula is crucial for SVM, which aims to maximize this distance for the closest points in the training data, known as support vectors. By maximizing the margin (the distance from the support vectors to the hyperplane), SVM seeks the optimal hyperplane that best separates the classes in the feature space, leading to better generalization performance on unseen data.
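The point-to-hyperplane formula is easy to sanity-check numerically. A tiny sketch with made-up toy numbers (the hyperplane \(3x_1 + 4x_2 - 5 = 0\) and the point \((2, 1)\) are illustrative only):

```python
import numpy as np

# Hyperplane 3*x1 + 4*x2 - 5 = 0, i.e. w = (3, 4), b = -5
w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([2.0, 1.0])

# d = |w.x0 + b| / ||w|| = |6 + 4 - 5| / 5 = 1.0
d = abs(w @ x0 + b) / np.linalg.norm(w)
```

The same two lines work unchanged in any dimension \(D\), which is exactly why the \(w^T x + b = 0\) form is preferred over \(y = mx + b\).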
As you can see, there are two goals hidden in “maximize this distance for the closest points in the training data”. The first goal is to find the closest point to the hyperplane, known as the support vector. The second goal is to maximize the distance from this support vector to the hyperplane, which is equivalent to maximizing the margin. The margin is the distance between the support vector and the hyperplane, and it measures how well the classes are separated: a larger margin indicates better separation and typically leads to better generalization performance on unseen data. Therefore, SVM finds the hyperplane that maximizes this margin, ensuring that the closest points (support vectors) are as far away from the hyperplane as possible. Mathematically this can be formulated as an optimization problem: \[\begin{aligned} &\text{maximize} \quad r \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq r, \quad \forall n \end{aligned}\] where \(r\) is the margin, \(y_n\) is the class label of the \(n\)-th training point, and \(x_n\) is the feature vector of the \(n\)-th training point. This optimization problem can be solved using techniques from convex optimization, and it leads to the formulation of the SVM algorithm. The solution gives us the optimal weight vector \(w\) and bias term \(b\) that define the maximum-margin hyperplane, allowing us to classify new data points based on which side of the hyperplane they fall on. The product \(y_n (w^T x_n + b)\) is always positive for points that are correctly classified, and the margin \(r\) is the minimum value of this product across all training points. Why is this?
Because the labels take values in \(\{-1, 1\}\), the product \(y_n (w^T x_n + b)\) is positive if the point is correctly classified (i.e., if \(y_n = 1\) and \(w^T x_n + b > 0\), or if \(y_n = -1\) and \(w^T x_n + b < 0\)). The margin \(r\) is defined as the minimum value of this product across all training points, which means that it represents the distance from the closest point to the hyperplane. By maximizing this margin \(r\), SVM aims to find the best possible separation between the classes, which is a key factor in achieving good generalization performance on unseen data. Equivalently, the problem can be formulated as maximizing the minimum of the distances from the training points to the hyperplane. So the margin is the distance of the closest point to the hyperplane \(w^T x + b = 0\), and is given by the formula: \[\text{Margin} = \min \left\{ \frac{|w^T x_n + b|}{\|w\|} \right\} \quad \text{for all } n.\] But this involves an absolute value, which is not convenient for optimization. Instead, we can rewrite the signed distance as: \[\text{Margin} = \min \left\{ \frac{y_n (w^T x_n + b)}{\|w\|} \right\} \quad \text{for all } n.\] Now suppose that the quantity \(y_n (w^T x_n + b)\) is greater than or equal to \(r\) for all \(n\); then the margin is at least \(\frac{r}{\|w\|}\). Therefore, maximizing the margin is equivalent to maximizing \(\frac{r}{\|w\|}\), which is equivalent to minimizing \(\|w\|\) subject to the constraint that \(y_n (w^T x_n + b) \geq r\) for all \(n\).
This leads us to the following optimization problem: \[\begin{aligned} &\text{minimize} \quad \|w\| \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq r, \quad \forall n \end{aligned}\] This is also written as max min combined optimization problem as follows: \[\max_{w, b} \min_n \frac{y_n (w^T x_n + b)}{\|w\|}.\] Now note that if we scale the parameters \(w\) and \(b\) by a positive constant \(\alpha\), the decision boundary defined by \(w^T x + b = 0\) remains unchanged. This is because the equation can be rewritten as: \[\alpha w^T x + \alpha b = 0 \implies w^T x + b = 0.\] This invariance property is crucial for the optimization problem, as it allows us to focus on finding the optimal \(w\) and \(b\) without worrying about their scale. Also if you substitute \[w' = \alpha w\] and \[b' = \alpha b\] into the margin expression, you will see that the margin is also invariant to this scaling: \[\begin{aligned} \text{Margin} &= \min_{n}\frac{y_n (w'^T x_n + b')}{\|w'\|} = \min_{n}\frac{y_n (\alpha w^T x_n + \alpha b)}{\|\alpha w\|} \\[4pt] &= \min_{n}\frac{\alpha\, y_n (w^T x_n + b)}{\alpha \|w\|} = \min_{n}\frac{y_n (w^T x_n + b)}{\|w\|}. \end{aligned}\] So we choose such an \(\alpha\) that the margin becomes \[\frac{1}{\|w'\|}\]. This is equivalent to setting \(r = 1\) in the optimization problem, which simplifies it to: \[\begin{aligned} &\text{minimize} \quad \|w\| \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] The above formulation of SVM is also called the hard-margin SVM, which assumes that the data is linearly separable. This is rewritten with coefficient \[1/2\] as follows: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] But before applying Lagrangian multipliers, let us realize that in real world data is often not linearly separable, which means that we need to allow for some misclassifications. 
This leads us to the soft-margin SVM formulation, which introduces slack variables \(\xi_n\) to account for these misclassifications: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{n} \xi_n \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1 - \xi_n, \quad \forall n \\ &\xi_n \geq 0, \quad \forall n \end{aligned}\] where \(C\) is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error. The slack variables \(\xi_n\) allow some points to violate the margin, and the optimization seeks the best balance between a large margin and few violations. This makes the formulation applicable to real-world datasets that are not perfectly linearly separable. The slack variables work as follows: if a data point is correctly classified and lies outside the margin, then \(\xi_n = 0\); if it is correctly classified but lies inside the margin, then \(0 < \xi_n < 1\); and if it is misclassified, then \(\xi_n > 1\). Minimizing the sum of the slack variables encourages a hyperplane that maximizes the margin while keeping violations small, and by adjusting \(C\) we control how heavily violations are penalized relative to the margin term. A natural question: a point with \(0 < \xi_n < 1\) is correctly classified, so is it a support vector? Yes: any point with \(\xi_n > 0\) has an active constraint \(y_n(w^T x_n + b) = 1 - \xi_n\), so it carries a positive Lagrange multiplier and is a support vector. In the soft-margin setting the support vectors are exactly the points with \(\alpha_n > 0\): those lying exactly on the margin boundary (\(\xi_n = 0\), with \(0 < \alpha_n \leq C\)) and those inside the margin or misclassified (\(\xi_n > 0\), which forces \(\alpha_n = C\)). Points strictly outside the margin have \(\alpha_n = 0\) and do not influence the solution. The support vectors are therefore the critical elements of the SVM model: they alone determine the position and orientation of the hyperplane, which is why they play a crucial role in defining the decision boundary.
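The three slack-variable regimes can be computed directly from \(\xi_n = \max(0,\, 1 - y_n(w^T x_n + b))\). A small numpy sketch with a hypothetical hyperplane \(x_1 + x_2 = 8\) and three made-up points chosen to land in each regime:

```python
import numpy as np

# Hypothetical hyperplane w.x + b = 0 with w = (1, 1), b = -8
w = np.array([1.0, 1.0])
b = -8.0

X = np.array([[6.0, 6.0],    # outside the margin, correct side
              [4.2, 4.2],    # inside the margin, correct side
              [3.0, 3.0]])   # wrong side of the hyperplane
y = np.array([1.0, 1.0, 1.0])

# Slack: xi_n = max(0, 1 - y_n (w.x_n + b))
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # approximately [0, 0.6, 3.0]
```

The first point has functional margin 4, so \(\xi = 0\); the second has margin 0.4, giving \(0 < \xi < 1\); the third has margin \(-2\), giving \(\xi > 1\), i.e. a misclassification.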
Before solving the optimization problem using Lagrangian multipliers, let us also note that SVM can be extended to handle non-linear decision boundaries by using kernel functions. The kernel trick allows us to implicitly map the input data into a higher-dimensional feature space where a linear separation is possible, without explicitly computing the coordinates of the data in that space. This is achieved by replacing the dot product in the optimization problem with a kernel function \(K(x_i, x_j)\), which computes the similarity between data points in the original feature space. Common kernel functions include the polynomial kernel, radial basis function (RBF) kernel, and sigmoid kernel. By using kernels, SVM can effectively capture complex relationships in the data and find non-linear decision boundaries, making it a powerful tool for classification tasks in various domains. But where does the dot product come from in the optimization problem? For this let us give detailed derivation of the dual form of the SVM optimization problem using Lagrangian multipliers. The primal optimization problem for the hard-margin SVM is given by: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] To derive the dual form of this optimization problem, we introduce Lagrange multipliers \(\alpha_n \geq 0\) for each constraint. 
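Since the dual (derived next) depends on the data only through dot products \(x_i^\top x_j\), the kernel trick amounts to replacing that Gram matrix with a kernel matrix. A minimal numpy sketch of the RBF kernel matrix, on made-up points (the `gamma` value is an arbitrary illustration):

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[1.0, 2.0], [2.0, 3.0], [5.0, 6.0]])
K = rbf_gram(X)
```

A valid kernel matrix is symmetric with ones on the diagonal (each point is maximally similar to itself) and is positive semidefinite, which is exactly what the dual optimization problem requires.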
The Lagrangian function is defined as: \[L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n [y_n (w^T x_n + b) - 1].\] Taking the partial derivatives of the Lagrangian with respect to \(w\) and \(b\) and setting them to zero gives us the following conditions: \[\frac{\partial L}{\partial w} = w - \sum_{n=1}^N \alpha_n y_n x_n = 0 \implies w = \sum_{n=1}^N \alpha_n y_n x_n,\] \[\frac{\partial L}{\partial b} = -\sum_{n=1}^N \alpha_n y_n = 0 \implies \sum_{n=1}^N \alpha_n y_n = 0.\] Substituting the expression for \(w\) back into the Lagrangian gives the full derivation: \[\begin{aligned} L(w, b, \alpha) &= \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n [y_n (w^T x_n + b) - 1] \\ &= \frac{1}{2} \left\| \sum_{n=1}^N \alpha_n y_n x_n \right\|^2 - \sum_{n=1}^N \alpha_n [y_n ((\sum_{m=1}^N \alpha_m y_m x_m)^T x_n + b) - 1] \\ &= \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m - \sum_{n=1}^N \alpha_n [y_n ((\sum_{m=1}^N \alpha_m y_m x_m)^T x_n + b) - 1] \\ &= \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m - \sum_{n=1}^N \alpha_n [y_n ((\sum_{m=1}^N \alpha_m y_m x_m)^T x_n) + y_n b - 1] \end{aligned}\] Now, since \(\sum_{n=1}^N \alpha_n y_n = 0\), the term involving \(b\) vanishes, and we are left with: \[\begin{aligned} &\text{maximize} \quad \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m \\ &\text{subject to} \quad \alpha_n \geq 0, \quad \forall n, \quad \text{and} \\ & \sum_{n=1}^N \alpha_n y_n = 0. \end{aligned}\] Why did the minimization suddenly become a maximization? Because we are now maximizing the dual function, which is derived from the Lagrangian. The dual function is defined as: \[g(\alpha) = \inf_{w, b} L(w, b, \alpha).\] Since we maximize the dual function with respect to \(\alpha\), the optimization problem is now a maximization problem. Now we use the results of the famous Karush-Kuhn-Tucker (KKT) conditions to find the optimal solution for \(\alpha\).
The KKT conditions state that at the optimal solution, the following four must hold: 1. Primal feasibility: \(y_n (w^T x_n + b) \geq 1\) for all \(n\). 2. Dual feasibility: \(\alpha_n \geq 0\) for all \(n\). 3. Complementary slackness: \(\alpha_n [y_n (w^T x_n + b) - 1] = 0\) for all \(n\). 4. Stationarity: \(\nabla_w L = 0\) and \(\nabla_b L = 0\).
Let us describe what each of these conditions means in the context of SVM: 1. Primal feasibility ensures that the constraints of the original optimization problem are satisfied, meaning that all data points are correctly classified with a margin of at least 1. 2. Dual feasibility ensures that the Lagrange multipliers are non-negative, which is a requirement for the optimization problem. 3. Complementary slackness means that for each data point, either the Lagrange multiplier is zero (indicating that the constraint is not active) or the constraint is satisfied with equality (indicating that the point lies on the margin). 4. Stationarity ensures that the gradients of the Lagrangian with respect to the primal variables are zero, which is necessary for optimality. By applying these KKT conditions, we can find the optimal values of \(\alpha\) that maximize the dual function, which in turn gives us the optimal weight vector \(w\) and bias term \(b\) for the SVM model. The support vectors correspond to the data points for which \(\alpha_n > 0\), as these are the points that lie on the margin and contribute to defining the decision boundary of the SVM. The final decision function for classifying new data points can be expressed in terms of the support vectors and their corresponding Lagrange multipliers, allowing us to make predictions based on the learned model. Below is a fully worked example of SVM on a small dataset. Consider the following dataset with two classes: \[\text{Class 1: } (1, 2), (2, 3), (3, 1)\] \[\text{Class 2: } (5, 6), (6, 7), (7, 5)\] We want to find the optimal hyperplane that separates these two classes using SVM. We can start by plotting the data points to visualize the separation. The points from Class 1 are clustered around the lower left corner, while the points from Class 2 are clustered around the upper right corner. We can see that there is a clear separation between the two classes, and we can find a hyperplane that separates them.
Next, we can set up the optimization problem for SVM. We will use the hard-margin SVM formulation, which assumes that the data is linearly separable. The optimization problem is: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] where \(y_n\) is the class label for each data point (1 for Class 1 and -1 for Class 2), \(x_n\) is the feature vector for each data point, and \(w\) and \(b\) are the parameters of the hyperplane we want to find. We can assign the class labels as follows: \[\text{Class 1: } (1, 2) \rightarrow 1, (2, 3) \rightarrow 1, (3, 1) \rightarrow 1\] \[\text{Class 2: } (5, 6) \rightarrow -1, (6, 7) \rightarrow -1, (7, 5) \rightarrow -1\] Now we can set up the Lagrangian for this optimization problem and derive the dual form using Lagrange multipliers, as we discussed earlier. After solving the dual optimization problem, we can find the optimal values of \(\alpha\), which will allow us to compute the optimal weight vector \(w\) and bias term \(b\). The support vectors will be the data points for which \(\alpha_n > 0\), and these points will define the decision boundary of the SVM. Finally, we can use the learned model to classify new data points based on which side of the hyperplane they fall on, allowing us to make predictions for unseen data. This example illustrates how SVM can be applied to a simple dataset to find the optimal hyperplane that separates two classes, and how the optimization problem is formulated and solved using Lagrangian multipliers and the KKT conditions. In practice, we would typically use a software library such as scikit-learn in Python to implement SVM, which provides efficient algorithms for solving the optimization problem and allows us to easily apply SVM to larger and more complex datasets. 
Additionally, we can experiment with different kernel functions to capture non-linear relationships in the data, further enhancing the flexibility and performance of the SVM model. But let us carry out the calculation by hand for this small dataset to find the optimal hyperplane. Below are all the calculations for this example in full detail, including the values of the Lagrange multipliers, the weight vector, and the bias term. We will also identify the support vectors and compute the decision function for classifying new data points based on the learned model. This detailed example will provide a comprehensive understanding of how SVM works in practice and how the optimization problem is solved to find the optimal hyperplane for classification. Step 1 details are as follows: 1. Assign class labels: \[\text{Class 1: } (1, 2) \rightarrow 1, (2, 3) \rightarrow 1, (3, 1) \rightarrow 1\] \[\text{Class 2: } (5, 6) \rightarrow -1, (6, 7) \rightarrow -1, (7, 5) \rightarrow -1\] 2. Set up the optimization problem: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] 3. Introduce Lagrange multipliers \(\alpha_n\) for each constraint and set up the Lagrangian: \[L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n [y_n (w^T x_n + b) - 1].\] 4. Take the partial derivatives of the Lagrangian with respect to \(w\) and \(b\) and set them to zero to find the conditions for optimality: \[\frac{\partial L}{\partial w} = w - \sum_{n=1}^N \alpha_n y_n x_n = 0 \implies w = \sum_{n=1}^N \alpha_n y_n x_n,\] \[\frac{\partial L}{\partial b} = -\sum_{n=1}^N \alpha_n y_n = 0 \implies \sum_{n=1}^N \alpha_n y_n = 0.\] 5.
Substitute the expression for \(w\) back into the Lagrangian to derive the dual optimization problem: \[\begin{aligned}&\text{maximize} \quad \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m \\ &\text{subject to} \quad \alpha_n \geq 0, \quad \forall n, \quad \text{and} \\ & \sum_{n=1}^N \alpha_n y_n = 0. \end{aligned}\] 6. Solve the dual optimization problem for this concrete dataset. Introduce unknowns \(\alpha_1,\alpha_2,\alpha_3,\alpha_4,\alpha_5,\alpha_6\) corresponding to the six data points (in the same order as listed). The dual QP is
\[\begin{aligned} \text{maximize}\quad & \alpha_1+\alpha_2+\alpha_3+\alpha_4+\alpha_5+\alpha_6 \\ &\quad -\tfrac{1}{2}\sum_{i=1}^6\sum_{j=1}^6 \alpha_i \alpha_j y_i y_j \, (x_i^\top x_j) ,\\[4pt] \text{subject to}\quad & \alpha_i \ge 0 \quad\text{for } i=1,\dots,6,\\ & \alpha_1+\alpha_2+\alpha_3 - (\alpha_4+\alpha_5+\alpha_6)=0, \end{aligned}\] where the labels are \(y_1=y_2=y_3=1\) and \(y_4=y_5=y_6=-1\), and the Gram (dot-product) matrix \(G\) with entries \(G_{ij}=x_i^\top x_j\) is
\[G=\begin{bmatrix} 5 & 8 & 5 & 17 & 20 & 17\\ 8 & 13 & 9 & 28 & 33 & 29\\ 5 & 9 & 10 & 21 & 25 & 26\\ 17 & 28 & 21 & 61 & 72 & 65\\ 20 & 33 & 25 & 72 & 85 & 77\\ 17 & 29 & 26 & 65 & 77 & 74 \end{bmatrix}.\]
Thus the quadratic term is fully specified by the numbers above: the coefficient multiplying \(\alpha_i\alpha_j\) is \(-\tfrac{1}{2}y_i y_j G_{ij}\).
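The Gram matrix above is a one-line computation, which makes a convenient check on the hand-computed entries:

```python
import numpy as np

# The six data points, Class 1 first, then Class 2
X = np.array([[1, 2], [2, 3], [3, 1],
              [5, 6], [6, 7], [7, 5]], dtype=float)

# Gram matrix of pairwise dot products G[i, j] = x_i . x_j
G = X @ X.T
print(G.astype(int))
```

Every entry matches the \(6 \times 6\) matrix written out above, e.g. \(G_{24} = 2\cdot 5 + 3\cdot 6 = 28\).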
Once the QP is solved numerically (e.g. with a quadratic-program solver), the primal parameters are recovered as
\[w \;=\; \sum_{i=1}^6 \alpha_i y_i x_i = \begin{bmatrix} \alpha_1\cdot 1 + \alpha_2\cdot 2 + \alpha_3\cdot 3 - \alpha_4\cdot 5 - \alpha_5\cdot 6 - \alpha_6\cdot 7 \\[4pt] \alpha_1\cdot 2 + \alpha_2\cdot 3 + \alpha_3\cdot 1 - \alpha_4\cdot 6 - \alpha_5\cdot 7 - \alpha_6\cdot 5 \end{bmatrix},\] and any index \(i\) with \(\alpha_i>0\) (a support vector) gives \(b\) via \[b \;=\; y_i - w^\top x_i.\]
All numbers in the dual objective and constraints above are concrete for this dataset; only the six unknowns \(\alpha_1,\dots,\alpha_6\) remain to be found numerically.
After obtaining the \(\alpha\)-solution, identify support vectors (\(\alpha_i>0\)), compute \(w\) from the formula above, compute \(b\) from any support vector, and form the decision function \(\operatorname{sign}(w^\top x + b)\).
By symmetry, the two classes are well separated and each class contributes exactly one support vector. Suppose the support vectors are \(x_3=(3,1)\) from Class 1 and \(x_4=(5,6)\) from Class 2, so \(\alpha_3=\alpha_4=\alpha>0\) and all other \(\alpha_i=0\). The constraint \(\sum \alpha_i y_i=0\) becomes \(\alpha - \alpha = 0\), which is satisfied for any \(\alpha\).
The dual objective reduces to \[W(\alpha) = \alpha_3 + \alpha_4 - \tfrac{1}{2}\bigl(\alpha_3^2 y_3^2 G_{33} + 2\alpha_3\alpha_4 y_3 y_4 G_{34} + \alpha_4^2 y_4^2 G_{44}\bigr).\] Reading off the Gram matrix entries: \[G_{33} = x_3^\top x_3 = 3^2+1^2 = 10,\quad G_{44} = x_4^\top x_4 = 5^2+6^2 = 61,\quad G_{34} = x_3^\top x_4 = 3\cdot5+1\cdot6 = 21.\] With \(y_3=1\), \(y_4=-1\): \[W(\alpha) = 2\alpha - \tfrac{1}{2}\bigl(\alpha^2\cdot 10 + 2\alpha^2\cdot(-1)\cdot 21 + \alpha^2\cdot 61\bigr) = 2\alpha - \tfrac{1}{2}\alpha^2(10 - 42 + 61) = 2\alpha - \tfrac{29}{2}\alpha^2.\]
\[\frac{dW}{d\alpha} = 2 - 29\alpha = 0 \implies \alpha = \frac{2}{29}.\]
\[w = \alpha_3 y_3 x_3 + \alpha_4 y_4 x_4 = \frac{2}{29}\Bigl((1)\begin{bmatrix}3\\1\end{bmatrix} + (-1)\begin{bmatrix}5\\6\end{bmatrix}\Bigr) = \frac{2}{29}\begin{bmatrix}-2\\-5\end{bmatrix} = \begin{bmatrix}-4/29\\-10/29\end{bmatrix}.\]
Using support vector \(x_3=(3,1)\) with \(y_3=1\): \[b = y_3 - w^\top x_3 = 1 - \Bigl(\tfrac{-4}{29}\cdot 3 + \tfrac{-10}{29}\cdot 1\Bigr) = 1 - \tfrac{-12-10}{29} = 1 + \tfrac{22}{29} = \tfrac{51}{29}.\] Verification with support vector \(x_4=(5,6)\), \(y_4=-1\): \[b = -1 - w^\top x_4 = -1 - \Bigl(\tfrac{-4}{29}\cdot 5 + \tfrac{-10}{29}\cdot 6\Bigr) = -1 - \tfrac{-20-60}{29} = -1 + \tfrac{80}{29} = \tfrac{51}{29}.\quad\checkmark\]
The learned hyperplane is \[w^\top x + b = 0 \quad\Longrightarrow\quad -\tfrac{4}{29}x_1 - \tfrac{10}{29}x_2 + \tfrac{51}{29} = 0 \quad\Longrightarrow\quad 4x_1 + 10x_2 = 51.\] Classification of a new point \(x\): \[\hat{y} = \operatorname{sign}(w^\top x + b) = \operatorname{sign}\!\Bigl(-\tfrac{4}{29}x_1 - \tfrac{10}{29}x_2 + \tfrac{51}{29}\Bigr).\]
The margin width is \(\dfrac{2}{\|w\|}\). We have \[\|w\| = \sqrt{\Bigl(\tfrac{4}{29}\Bigr)^2+\Bigl(\tfrac{10}{29}\Bigr)^2} = \frac{\sqrt{16+100}}{29} = \frac{\sqrt{116}}{29} = \frac{2\sqrt{29}}{29} = \frac{2}{\sqrt{29}}.\] So the margin is \(\dfrac{2}{2/\sqrt{29}} = \sqrt{29} \approx 5.39\).
Finally, check that all six points satisfy \(y_n(w^\top x_n+b)\ge 1\). For \(x_1=(1,2)\) with \(y=1\): \[\tfrac{1}{29}(-4-20+51)=\tfrac{27}{29}<1.\] This constraint is violated, so the working hypothesis was wrong: \(x_1\) must also be a support vector, and the pair \(\{x_3, x_4\}\) alone does not maximize the dual over the full dataset. This is precisely why, in practice, one lets a QP solver optimize over all \(\alpha_n\) simultaneously rather than guessing the active set; the derivation above should be read as an illustration of the mechanics under the stated assumption.
Under the assumed active set, the result is \[\boxed{ w = \begin{bmatrix}-4/29\\-10/29\end{bmatrix},\quad b = \frac{51}{29},\quad \text{Support vectors: }(3,1)\text{ and }(5,6),\quad \text{Margin} = \sqrt{29}. }\]
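The arithmetic in this worked example can be verified numerically. A minimal sketch, assuming NumPy is available; every quantity below comes from the formulas derived above:

```python
import numpy as np

# Data from the worked example (labels: +1 for Class 1, -1 for Class 2)
x3, y3 = np.array([3.0, 1.0]), 1.0
x4, y4 = np.array([5.0, 6.0]), -1.0

# Gram matrix entries
G33, G44, G34 = x3 @ x3, x4 @ x4, x3 @ x4
assert (G33, G44, G34) == (10.0, 61.0, 21.0)

# The one-variable dual W(a) = 2a - (29/2) a^2 is maximized at a = 2/29
alpha = 2.0 / 29.0

# Recover w and b from the stationarity conditions
w = alpha * (y3 * x3 + y4 * x4)   # w = sum_n alpha_n y_n x_n
b = y3 - w @ x3                   # b = y_s - w^T x_s at a support vector

margin = 2.0 / np.linalg.norm(w)  # margin width 2 / ||w||

# Margin constraint at x1 = (1, 2), y = +1
x1 = np.array([1.0, 2.0])
score1 = w @ x1 + b               # 27/29 < 1: constraint violated,
                                  # so x1 must also be a support vector
```

Running the check confirms the hand computation, including the violated constraint at \(x_1\).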
A quadratic program is an optimization problem of the form \[\begin{aligned} &\text{minimize} \quad \frac{1}{2} z^\top Q z + c^\top z \\ &\text{subject to} \quad Az \leq b, \quad Cz = d, \end{aligned}\] where \(z \in \mathbb{R}^n\) is the decision variable, \(Q \in \mathbb{R}^{n \times n}\) is a symmetric matrix, \(c \in \mathbb{R}^n\), and \(A\), \(C\), \(b\), \(d\) are matrices and vectors defining the inequality and equality constraints respectively.
The SVM dual problem is a QP: the objective is quadratic in the Lagrange multipliers \(\alpha\), the constraints are linear, and \(Q\) is positive semidefinite, making the problem convex. Convex QPs attain a global optimum and can be solved efficiently by dedicated solvers (e.g. CVXOPT, OSQP, LIBSVM).
A QP is convex if \(Q \succeq 0\) (positive semidefinite). In SVM, the matrix \(Q\) has entries \(Q_{ij} = y_i y_j x_i^\top x_j\), which is the Gram matrix (see below) scaled by the labels, and is therefore positive semidefinite. This guarantees that the SVM dual is convex, so any local optimum is global; the optimizer is moreover unique when \(Q\) is positive definite.
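The structure of \(Q\) can be made concrete in a few lines. A sketch assuming NumPy; the data points and labels here are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical toy data: four points with +/-1 labels
X = np.array([[1.0, 2.0], [3.0, 1.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = X @ X.T             # Gram matrix, G_ij = x_i^T x_j
Q = np.outer(y, y) * G  # Q_ij = y_i y_j x_i^T x_j

# Q is symmetric and positive semidefinite
assert np.allclose(Q, Q.T)
eigvals = np.linalg.eigvalsh(Q)
assert eigvals.min() >= -1e-9   # all eigenvalues nonnegative, up to roundoff
```

Since \(Q = D G D\) with \(D = \operatorname{diag}(y)\), positive semidefiniteness of \(G\) carries over to \(Q\), which is what the eigenvalue check confirms.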
Given a constrained optimization problem (the primal), one can construct a related unconstrained problem (the dual) by introducing Lagrange multipliers.
\[\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_n(w^\top x_n + b) \geq 1, \quad \forall\, n.\]
Attach a multiplier \(\alpha_n \geq 0\) to each constraint: \[L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n \bigl[y_n(w^\top x_n + b) - 1\bigr].\]
Minimize \(L\) over the primal variables \((w, b)\): \[g(\alpha) = \inf_{w,b} \; L(w, b, \alpha).\] Setting \(\nabla_w L = 0\) and \(\nabla_b L = 0\) yields \[w = \sum_{n=1}^N \alpha_n y_n x_n, \qquad \sum_{n=1}^N \alpha_n y_n = 0.\] Substituting back gives the dual problem: \[\max_{\alpha \geq 0} \; \sum_{n=1}^N \alpha_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N \alpha_n \alpha_m y_n y_m \, x_n^\top x_m \quad \text{subject to} \quad \sum_{n=1}^N \alpha_n y_n = 0.\]
For the SVM problem (convex primal, linear constraints, Slater's condition satisfied), strong duality holds: the optimal values of the primal and dual coincide. Therefore, solving the dual is equivalent to solving the primal.
The KKT conditions are necessary and sufficient for optimality in a convex constrained optimization problem satisfying a constraint qualification (such as Slater's condition, which holds for SVM). For the SVM primal, they are:
Stationarity: \[w = \sum_{n=1}^N \alpha_n y_n x_n, \qquad \sum_{n=1}^N \alpha_n y_n = 0.\]
Primal feasibility: \[y_n(w^\top x_n + b) \geq 1 \quad \forall\, n.\]
Dual feasibility: \[\alpha_n \geq 0 \quad \forall\, n.\]
Complementary slackness: \[\alpha_n \bigl[y_n(w^\top x_n + b) - 1\bigr] = 0 \quad \forall\, n.\]
For each data point \(x_n\), either:
\(\alpha_n = 0\): the point is not a support vector and lies strictly outside the margin, or
\(y_n(w^\top x_n + b) = 1\): the point lies exactly on the margin and is a support vector.
Only support vectors have \(\alpha_n > 0\) and contribute to defining \(w\) and \(b\). This is a key sparsity property of SVM.
From complementary slackness, for any support vector \(x_s\) (where \(\alpha_s > 0\)): \[b = y_s - w^\top x_s.\] In practice, \(b\) is averaged over all support vectors for numerical stability.
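This averaging step can be sketched in a few lines of NumPy. The function name and tolerance below are our own illustrative choices, not from any particular library:

```python
import numpy as np

def intercept_from_alphas(X, y, alpha, tol=1e-8):
    """Recover w, then average b = y_s - w^T x_s over all support vectors."""
    w = (alpha * y) @ X             # w = sum_n alpha_n y_n x_n
    sv = alpha > tol                # support-vector mask (alpha_n > 0)
    b = np.mean(y[sv] - X[sv] @ w)  # average b over the support vectors
    return w, b

# Values from the worked example: two support vectors, each with alpha = 2/29
X = np.array([[3.0, 1.0], [5.0, 6.0]])
y = np.array([1.0, -1.0])
alpha = np.array([2 / 29, 2 / 29])
w, b = intercept_from_alphas(X, y, alpha)   # b = 51/29
```

With only two support vectors the average is trivial, but with many support vectors and floating-point \(\alpha\) values from a solver, averaging reduces the effect of numerical error in any single constraint.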
Given data points \(x_1, \ldots, x_N \in \mathbb{R}^d\), the Gram matrix \(G \in \mathbb{R}^{N \times N}\) is defined by \[G_{ij} = x_i^\top x_j = \langle x_i, x_j \rangle.\] Each entry is the inner product between two data points, measuring their similarity.
\(G\) is symmetric: \(G_{ij} = G_{ji}\).
\(G\) is positive semidefinite: for any \(v \in \mathbb{R}^N\), \[v^\top G v = \sum_{i,j} v_i v_j x_i^\top x_j = \left\|\sum_i v_i x_i\right\|^2 \geq 0.\]
The diagonal entries \(G_{nn} = \|x_n\|^2\) are the squared norms of the data points.
The dual objective can be written compactly using \(G\): \[\max_\alpha \; \mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top (y y^\top \circ G)\, \alpha,\] where \(\circ\) denotes the elementwise (Hadamard) product and \(y = [y_1, \ldots, y_N]^\top\). Thus the entire dual problem depends on the data only through pairwise inner products \(x_i^\top x_j\).
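The equivalence between the compact matrix form and the explicit double sum can be checked directly. A sketch assuming NumPy, with arbitrary illustrative values for \(X\), \(y\), and \(\alpha\):

```python
import numpy as np

# Arbitrary illustrative values
X = np.array([[1.0, 2.0], [3.0, 1.0], [5.0, 6.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.1, 0.2, 0.3])

G = X @ X.T  # Gram matrix

# Compact form: 1^T alpha - (1/2) alpha^T (y y^T o G) alpha
W_compact = alpha.sum() - 0.5 * alpha @ ((np.outer(y, y) * G) @ alpha)

# Explicit double sum over n, m
W_loops = alpha.sum() - 0.5 * sum(
    alpha[n] * alpha[m] * y[n] * y[m] * (X[n] @ X[m])
    for n in range(3) for m in range(3)
)
assert np.isclose(W_compact, W_loops)
```

Both expressions touch the data only through \(G\), which is the observation the kernel trick below exploits.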
This observation is fundamental: since only inner products appear in the dual, we can replace \[x_i^\top x_j \;\longmapsto\; K(x_i, x_j),\] where \(K\) is any positive semidefinite kernel function. This implicitly maps data into a (possibly infinite-dimensional) feature space without ever computing the coordinates explicitly, enabling SVM to learn nonlinear decision boundaries at the cost of a linear algorithm in the dual.
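As a concrete instance, the Gaussian (RBF) kernel \(K(x,z)=\exp(-\gamma\|x-z\|^2)\) can stand in for the Gram matrix. A minimal sketch assuming NumPy; the choice \(\gamma=0.5\) and the sample points are arbitrary:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[1.0, 2.0], [3.0, 1.0], [5.0, 6.0]])
K = rbf_kernel(X, X)   # stands in for the Gram matrix G = X @ X.T in the dual

assert np.allclose(np.diag(K), 1.0)   # K(x, x) = exp(0) = 1 for the RBF kernel
assert np.allclose(K, K.T)            # kernel matrix is symmetric, like G
```

Substituting \(K\) for \(G\) in the dual objective leaves the optimization unchanged in form while the implicit feature space becomes infinite-dimensional.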
| Concept | Role in SVM |
|---|---|
| Quadratic Programming | The dual SVM problem is a convex QP; solved efficiently by standard solvers. |
| Lagrangian Duality | Converts the constrained primal into an unconstrained dual over multipliers \(\alpha\). |
| KKT Conditions | Characterize the optimal solution; complementary slackness identifies support vectors. |
| Gram Matrix | Encodes all pairwise similarities; the dual depends on data only through \(G\). |
| Kernel Trick | Replace \(x_i^\top x_j\) with \(K(x_i,x_j)\) to handle nonlinear boundaries. |
For more details on these mathematical foundations, see the following references:
Convex Optimization by Boyd and Vandenberghe (2004): Chapters 4 and 5 cover QP and duality in depth.
Pattern Recognition and Machine Learning by Bishop (2006): Chapter 7 provides a comprehensive treatment of SVM, including the dual formulation and KKT conditions.
The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (2009): Chapter 12 discusses SVM and kernel methods, with a focus on the mathematical underpinnings.