
    Mathematics for Machine Learning
    Course Notes
    Suraj Singh Khurana
    Spring 2026


    Introduction to Linear Regression

    Housing Price Problem

Suppose a real-estate website wants to predict house prices based on features like size and number of bedrooms. To build such a model, we shall need some data:

    Size (\(x_1\)) Bedrooms (\(x_2\)) Price (Rs. lakhs)
    10 2 70
    20 3 130

    Here \[\begin{aligned} x_1 &= \text{house size (in 100 sq ft)}, \\ x_2 &= \text{number of bedrooms}, \\ w_1 &= \text{price contribution per unit size (Rs. lakhs per 100 sq ft)}, \\ w_2 &= \text{price contribution per bedroom (Rs. lakhs)}. \end{aligned}\]

    Now we want to build a model that predicts the price of a house based on its size and number of bedrooms. We assume that the price contributions from size and number of bedrooms are linear and additive.

Thus the linear model for the price (in Rs. lakhs) is \[\hat y = w_1 x_1 + w_2 x_2,\] where \(\hat y\) denotes the predicted price. Here \(\hat y\) is a linear combination of the input features \(x_1\) and \(x_2\), with the weights \(w_1\) and \(w_2\) representing the model parameters. The word feature refers to an individual measurable property or characteristic of the data being analyzed. In this case, the features are the size of the house and the number of bedrooms.

    To find the best-fitting weights \(w_1\) and \(w_2\), we can set up a system of equations based on the provided data points. From the data, we have two equations: \[\begin{aligned} 70 &= w_1 \cdot 10 + w_2 \cdot 2, \\ 130 &= w_1 \cdot 20 + w_2 \cdot 3. \end{aligned}\] Rearranging the equations, we get: \[\begin{aligned} 10 w_1 + 2 w_2 &= 70, \\ 20 w_1 + 3 w_2 &= 130. \end{aligned}\] We can solve this system using methods such as substitution, elimination, or matrix methods. Substitution or elimination would be straightforward here, but for larger systems, matrix methods are more efficient.

    Let us therefore express the system in matrix form: \[\begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}70 \\ 130\end{bmatrix}.\] Let \(A = \begin{bmatrix}10 & 2 \\ 20 & 3\end{bmatrix}\), \(\mathbf{w} = \begin{bmatrix}w_1 \\ w_2\end{bmatrix}\), and \(\mathbf{b} = \begin{bmatrix}70 \\ 130\end{bmatrix}\). Then we have \(A \mathbf{w} = \mathbf{b}\). To solve for \(\mathbf{w}\), we can use the inverse of matrix \(A\) (if it exists) or use other methods like Gaussian elimination. \[\mathbf{w} = A^{-1} \mathbf{b}.\]

    Calculating the inverse of \(A\): \[\begin{aligned} A^{-1} &= \frac{1}{(10)(3) - (2)(20)} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \frac{1}{30 - 40} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= -\frac{1}{10} \begin{bmatrix}3 & -2 \\ -20 & 10\end{bmatrix} \\ &= \begin{bmatrix}-\dfrac{3}{10} & \dfrac{2}{10} \\ \dfrac{20}{10} & -\dfrac{10}{10}\end{bmatrix} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix}. \end{aligned}\]

    Now, multiplying \(A^{-1}\) with \(\mathbf{b}\): \[\begin{aligned} \mathbf{w} &= A^{-1} \mathbf{b} \\ &= \begin{bmatrix}-0.3 & 0.2 \\ 2 & -1\end{bmatrix} \begin{bmatrix}70 \\ 130\end{bmatrix} \\ &= \begin{bmatrix}-0.3 \cdot 70 + 0.2 \cdot 130 \\ 2 \cdot 70 - 1 \cdot 130\end{bmatrix} \\ &= \begin{bmatrix}-21 + 26 \\ 140 - 130\end{bmatrix} \\ &= \begin{bmatrix}5 \\ 10\end{bmatrix}. \end{aligned}\]

Thus, we find that \(w_1 = 5\) and \(w_2 = 10\). Therefore, the linear model for predicting house prices is: \[\hat y = 5 x_1 + 10 x_2.\] This means that for every additional 100 sq ft of house size, the price increases by Rs. 5 lakhs, and for every additional bedroom, the price increases by Rs. 10 lakhs.
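The same 2×2 system can be solved in one line of NumPy. A minimal sketch (the array names are illustrative):

```python
import numpy as np

# Coefficient matrix and price vector from the two-house system above.
A = np.array([[10.0, 2.0],
              [20.0, 3.0]])
b = np.array([70.0, 130.0])

# np.linalg.solve uses an LU factorization rather than forming A^{-1}
# explicitly, which is the numerically preferred way to solve A w = b.
w = np.linalg.solve(A, b)
print(w)  # approximately [5, 10]
```

This recovers the weights \(w_1 = 5\), \(w_2 = 10\) computed by hand above.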

We should read the formulation below \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 70 \\ 130 \end{bmatrix}\] as \[X\mathbf{w}=\mathbf{y}.\] Machine learning systems only ever see this matrix equation. Now suppose we get a third house with size 15 (i.e., 1500 sq ft) and 2 bedrooms. In other words, our data table becomes

    Size (\(x_1\)) Bedrooms (\(x_2\)) Price (Rs. lakhs)
    10 2 70
    20 3 130
    15 2 100

We can no longer expect to solve for \(w_1\) and \(w_2\) exactly, since we now have three equations but only two unknowns. Indeed, no choice of weights satisfies all three equations simultaneously. If you do not believe me, try Gaussian elimination. We get the following augmented matrix: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{array}\right]\] Applying the row operations \(R_2 \leftarrow R_2 - 2R_1\), \(R_3 \leftarrow R_3 - \tfrac{3}{2}R_1\), and then \(R_3 \leftarrow R_3 - R_2\), we obtain the row echelon form: \[\left[\begin{array}{cc|c} 10 & 2 & 70 \\ 0 & -1 & -10 \\ 0 & 0 & 5 \end{array}\right]\] The last row corresponds to the equation \(0 = 5\), a contradiction, so the system is inconsistent: no pair \((w_1, w_2)\) satisfies all three equations. Or, if you want to be more mathematical about it, the rank of the coefficient matrix is 2 while the rank of the augmented matrix is 3, so the system has no solution. Concretely, the first two equations force \(w_1 = 5\) and \(w_2 = 10\), but then the third equation gives \(15 \cdot 5 + 2 \cdot 10 = 95 \neq 100\).
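The inconsistency can be verified numerically. A minimal NumPy sketch comparing the rank of the coefficient matrix with the rank of the augmented matrix:

```python
import numpy as np

# Coefficient matrix for the three-house system and the augmented matrix [A | b].
A = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
b = np.array([70.0, 130.0, 100.0])
augmented = np.column_stack([A, b])

# rank(A) = 2 but rank([A | b]) = 3, so the system A w = b has no exact solution.
print(np.linalg.matrix_rank(A))          # 2
print(np.linalg.matrix_rank(augmented))  # 3
```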

But does that mean we cannot find a good model? Not really. We can look for weights \(w_1\) and \(w_2\) that minimize the error between the predicted prices and the actual prices. Since there is no exact solution of \(X\mathbf{w}=\mathbf{y}\), we seek an approximate solution, that is, \(X\mathbf{w} \approx \mathbf{y}\). Old question (math): “Find \(\mathbf w\) such that \(X\mathbf w=\mathbf y\).”

    New question (ML): “Find \(\mathbf w\) such that \(X\mathbf w\approx\mathbf y\) as closely as possible.”

Define the error for the \(i\)-th house as \[e_i=\hat y_i-y_i, \qquad \hat y_i = w_1 x_{i1}+w_2 x_{i2}.\] One common approach is to consider the following optimization problem: \[\min_{w_1, w_2} \sum_{i=1}^{3} (y_i - \hat{y}_i)^2,\] where \(y_i\) is the actual price of the \(i\)-th house and \(\hat{y}_i\) is the predicted price under our linear model. Since the errors are squared, this is the same as minimizing \(\sum_{i=1}^{3} e_i^2\). This approach is known as the least squares method.

    More formally,

    Linear Regression: Theory and Mathematical Foundations

    In many real-world problems, such as house price prediction, we attempt to model the relationship between several input variables (features) and an observed output using a linear rule. While simple linear systems may admit exact solutions, real data is noisy, inconsistent, and overdetermined. Linear regression provides a principled mathematical framework to handle such situations.

    Linear Model Assumption

    Let \(x_i \in \mathbb{R}^d\) denote the feature vector associated with the \(i\)-th data point (for example, house size, number of bedrooms, age, etc.), and let \(y_i \in \mathbb{R}\) denote the corresponding observed output (house price). We assume a linear relationship between inputs and output of the form \[\hat{y}_i = w^\top x_i,\] where \(w \in \mathbb{R}^d\) is an unknown parameter vector to be learned from data. This assumption means that the output is linear in the parameters \(w\), which is the defining feature of linear regression.

    Matrix Formulation

    Given \(n\) data points, we stack the feature vectors row-wise into a design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n.\] The model predictions for all data points can then be written compactly as \[\hat{y} = Xw.\] In idealized situations, one might hope for an exact solution to the equation \(Xw = y\). However, when \(n > d\) and the data is noisy, such an equation is typically inconsistent, as demonstrated earlier using rank and Gaussian elimination.

    Residuals and Error Vector

    The discrepancy between the model predictions and the observed outputs is captured by the residual vector \[r = Xw - y.\] Each component \(r_i = w^\top x_i - y_i\) represents the prediction error on a single data point. Since exact satisfaction of all equations is generally impossible, the goal is to find a parameter vector \(w\) that makes these residuals collectively as small as possible.

    Least Squares Criterion

    To quantify the total error, we introduce a scalar loss function defined as the squared Euclidean norm of the residual vector: \[\mathcal{L}(w) = \|Xw - y\|^2 = \sum_{i=1}^n (w^\top x_i - y_i)^2.\] This choice, known as the least squares loss, is motivated by several factors: it penalizes large errors more strongly, is differentiable, leads to a convex optimization problem, and admits a clear geometric interpretation.

    Optimization Problem

    Linear regression is thus formulated as the following optimization problem: \[\min_{w \in \mathbb{R}^d} \; \|Xw - y\|^2.\] This problem seeks the parameter vector whose predictions are closest to the observed data in the least squares sense. Importantly, this is no longer a problem of solving linear equations exactly, but of minimizing an error function.

    Vector Representation and Dimensional Consistency

    It is important to clarify how the vectors involved in the linear model are represented and why the expression \(w^\top x_i\) is mathematically valid.

    Each feature vector \(x_i\) is represented as a column vector in \(\mathbb{R}^d\): \[x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \in \mathbb{R}^{d \times 1},\] where each component corresponds to a measurable attribute of the data point (for example, house size, number of bedrooms, etc.).

    Similarly, the parameter vector \(w\) is also represented as a column vector: \[w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d \times 1}.\]

    Since both \(w\) and \(x_i\) are column vectors, their direct product \(wx_i\) is not defined. To obtain a scalar prediction, we take the transpose of \(w\), yielding a row vector: \[w^\top = \begin{bmatrix} w_1 & w_2 & \cdots & w_d \end{bmatrix} \in \mathbb{R}^{1 \times d}.\]

    The prediction for the \(i\)-th data point is then given by the matrix product \[\hat{y}_i = w^\top x_i,\] where \[(1 \times d)(d \times 1) = 1 \times 1.\] Thus, \(w^\top x_i\) is a scalar, as required to model a real-valued output such as house price.

    Explicitly, this dot product can be written as \[w^\top x_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id},\] which shows that the linear model is a weighted sum of the input features.
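This dimensional bookkeeping is mirrored exactly in code. A minimal NumPy sketch, using the first house from the table and the weights from the exact two-house fit:

```python
import numpy as np

# First house from the table: size 10 (hundreds of sq ft), 2 bedrooms.
x_i = np.array([10.0, 2.0])   # feature vector, shape (2,)
w = np.array([5.0, 10.0])     # weights from the exact two-house fit

# For 1-D arrays, w @ x_i computes the dot product w^T x_i and returns a scalar:
# w_1 * x_i1 + w_2 * x_i2 = 5*10 + 10*2 = 70.
y_hat = w @ x_i
print(y_hat)  # 70.0
```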

    Consistency with Matrix Formulation

    When stacking all feature vectors into the design matrix \[X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d},\] each row of \(X\) corresponds to a single data point written as a transposed feature vector. The matrix–vector product \[Xw\] then yields an \(n \times 1\) vector, where the \(i\)-th entry is precisely \(w^\top x_i\). This ensures dimensional consistency between the individual prediction formula and the global matrix formulation.

    This careful handling of vector orientation is not merely a notational choice; it is essential for extending linear regression to high-dimensional settings and for implementing learning algorithms correctly in numerical software.

    Normal Equations

To find the minimizer, we differentiate the loss function with respect to \(w\), applying the chain rule and properties of matrix calculus. Let us derive this from scratch in detail: \[\begin{aligned} \frac{\partial}{\partial w} \|Xw - y\|^2 &= \frac{\partial}{\partial w} (Xw - y)^\top (Xw - y) \\ &= \frac{\partial}{\partial w} \left( w^\top X^\top X w - 2 y^\top X w + y^\top y \right) \\ &= 2 X^\top X w - 2 X^\top y. \end{aligned}\] Why the last step? Because we use the following rules of matrix calculus: for a symmetric matrix \(A\), \(\frac{\partial}{\partial w}\left(w^\top A w\right) = 2 A w\); for a constant vector \(b\), \(\frac{\partial}{\partial w}\left(b^\top w\right) = b\); and the derivative of the constant \(y^\top y\) is zero. Here \(A = X^\top X\) is symmetric, and the linear term \(-2 y^\top X w = (-2 X^\top y)^\top w\) has derivative \(-2 X^\top y\).

    Setting the gradient to zero yields the normal equations: \[X^\top X w = X^\top y.\]

    These equations form a system of \(d\) linear equations in \(d\) unknowns. Unlike the original system \(Xw = y\), the normal equations are consistent under mild conditions.
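The gradient formula \(2 X^\top (Xw - y)\) can be sanity-checked against finite differences. A minimal NumPy sketch, using the three-house data and an arbitrary test point \(w_0\) (both chosen purely for illustration):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

def loss(w):
    r = X @ w - y
    return r @ r  # ||Xw - y||^2

def grad(w):
    return 2.0 * X.T @ (X @ w - y)  # analytic gradient from the derivation

# Central finite differences as an independent check of the derivation.
w0 = np.array([1.0, -2.0])
eps = 1e-5
numeric = np.array([
    (loss(w0 + eps * e) - loss(w0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad(w0), numeric))  # True
```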

    Closed-Form Solution

    If the matrix \(X^\top X\) is invertible, the solution to the normal equations is given by \[w = (X^\top X)^{-1} X^\top y.\] This expression provides a closed-form formula for the linear regression estimator. It characterizes the theoretical solution but is not always used directly in practice due to computational and numerical stability considerations.

    Geometric Interpretation

    From a geometric perspective, the columns of \(X\) span a subspace of \(\mathbb{R}^n\), known as the column space of \(X\). The vector \(Xw\) represents a point in this subspace. Linear regression finds the vector \(Xw\) that is closest to \(y\) in Euclidean distance. Consequently, the residual vector \(r = Xw - y\) is orthogonal to the column space of \(X\), which explains the condition \[X^\top r = 0.\] Thus, linear regression can be interpreted as the orthogonal projection of \(y\) onto the column space of \(X\).
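The orthogonality condition \(X^\top r = 0\) is easy to verify numerically. A minimal NumPy sketch using the three-house data:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Least-squares fit: Xw is the orthogonal projection of y onto col(X).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
r = X @ w - y

# The residual is orthogonal to every column of X, i.e. X^T r = 0.
print(X.T @ r)  # approximately [0, 0]
```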

    Why Linear Regression Is a Machine Learning Model

    Linear regression qualifies as a machine learning model because its parameters are inferred from data rather than specified explicitly, the data typically contains noise and inconsistencies, and the learned model is used to make predictions on unseen data. The process involves empirical risk minimization rather than exact algebraic solution.

    Scalability and Numerical Methods

    In modern machine learning applications, the number of data points \(n\) may be in the millions and the number of features \(d\) may be in the thousands. In such settings, forming and inverting \(X^\top X\) is computationally expensive and numerically unstable. Consequently, linear regression is commonly solved using numerical optimization methods such as gradient descent, stochastic gradient descent, or iterative linear solvers. These methods approximate the solution to the normal equations efficiently at scale.

    Conceptual Summary

    Linear regression transforms an inconsistent, overdetermined system of linear equations into a well-posed optimization problem. It replaces the impossible requirement of satisfying all equations exactly with the achievable goal of minimizing overall error. In this sense, linear regression is best understood as numerical linear algebra applied to real-world data.

Now let us return to our housing price example and compute the optimal weights using the least squares approach. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] The normal equations are \[X^\top X w = X^\top y.\]

    Calculating \(X^\top X\) and \(X^\top y\): \[X^\top X = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}10 & 2 \\ 20 & 3 \\ 15 & 2\end{bmatrix} = \begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix},\] and \[X^\top y = \begin{bmatrix}10 & 20 & 15 \\ 2 & 3 & 2\end{bmatrix} \begin{bmatrix}70 \\ 130 \\ 100\end{bmatrix} = \begin{bmatrix} 10(70) + 20(130) + 15(100) \\ 2(70) + 3(130) + 2(100) \end{bmatrix}\] \[= \begin{bmatrix} 700 + 2600 + 1500 \\ 140 + 390 + 200 \end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\]

    Thus, the normal equations become \[\begin{bmatrix}725 & 110 \\ 110 & 17\end{bmatrix} \begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \begin{bmatrix}4800 \\ 730\end{bmatrix}.\] Solving via the determinant \(\Delta = 725 \cdot 17 - 110^2 = 12325 - 12100 = 225\): \[\begin{bmatrix}w_1 \\ w_2\end{bmatrix} = \frac{1}{225}\begin{bmatrix}17 & -110 \\ -110 & 725\end{bmatrix} \begin{bmatrix}4800 \\ 730\end{bmatrix}\] \[= \frac{1}{225}\begin{bmatrix} 81600 - 80300 \\ -528000 + 529250 \end{bmatrix} = \frac{1}{225}\begin{bmatrix}1300 \\ 1250\end{bmatrix} = \begin{bmatrix}5.78 \\ 5.56\end{bmatrix}.\]

    Thus \(w_1 \approx 5.78\) and \(w_2 \approx 5.56\), the best-fit weights that minimize the squared error across all three data points. How come we got different weights than before? Because now we are minimizing the squared error across all three data points, rather than fitting exactly two points. The earlier weights were 5 and 10, which perfectly fit the first two points but did not account for the third point.
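The normal-equation computation above can be checked numerically. A minimal NumPy sketch (forming \(X^\top X\) explicitly, which is fine at this tiny scale):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Form and solve the normal equations X^T X w = X^T y.
XtX = X.T @ X   # [[725, 110], [110, 17]]
Xty = X.T @ y   # [4800, 730]
w = np.linalg.solve(XtX, Xty)
print(w)  # approximately [5.778, 5.556]
```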
Now, with the third point included, the weights become approximately 5.78 and 5.56, which balance the errors across all three points. Let us check the predictions: \[\hat{y}_1 = 5.78 \cdot 10 + 5.56 \cdot 2 = 57.8 + 11.12 = 68.92,\] \[\hat{y}_2 = 5.78 \cdot 20 + 5.56 \cdot 3 = 115.6 + 16.68 = 132.28,\] \[\hat{y}_3 = 5.78 \cdot 15 + 5.56 \cdot 2 = 86.7 + 11.12 = 97.82.\] The predictions are close to the observed prices 70, 130, and 100, but none is matched exactly: the least squares solution accepts a small error on each point in exchange for the smallest possible total squared error.

In practice, computing matrix inverses is computationally expensive and numerically unstable. Instead, we use gradient descent to minimize \(\mathcal{L}(w) = \|Xw - y\|^2\) iteratively, starting from an initial guess \(w^{(0)} = 0\): \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}),\] where \(\eta\) is the learning rate. The gradient \(\nabla \mathcal{L}(w)\) is given by \[\nabla \mathcal{L}(w) = 2 X^\top (Xw - y).\] By repeatedly applying this update rule, we can converge to the optimal weights without explicitly computing the matrix inverse. Why does this work? Gradient descent is an optimization algorithm that iteratively moves towards the minimum of a function by taking steps proportional to the negative of the gradient. In the case of linear regression, the loss function is convex, ensuring that gradient descent converges to the global minimum. Here is a pictorial representation of gradient descent:

First, recall what convexity is. A function \(f: \mathbb{R}^d \to \mathbb{R}\) is convex if for any two points \(x, y \in \mathbb{R}^d\) and any \(\theta \in [0, 1]\), \[f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).\] Geometrically, this means that the line segment connecting any two points on the graph of the function lies above or on the graph itself. Convex functions have a single global minimum value, which makes the optimization problem easier to solve. Why? Because every local minimum of a convex function is a global minimum: if a point were a local minimum while some other point had a strictly smaller value, the chord connecting the two would have to lie below the function near the local minimum, violating the convexity condition.

Why is the loss function convex, ensuring that gradient descent converges to the global minimum? To prove that \(\mathcal{L}(w) = \|Xw - y\|^2\) is convex, we can show that its Hessian matrix is positive semidefinite. The Hessian of \(\mathcal{L}(w)\) is \[H = \nabla^2 \mathcal{L}(w) = 2 X^\top X.\] To show that \(H\) is positive semidefinite, we need to show that for any vector \(z \in \mathbb{R}^d\), \[z^\top H z \geq 0.\] Calculating \(z^\top H z\): \[z^\top H z = z^\top (2 X^\top X) z = 2 (X z)^\top (X z) = 2 \|X z\|^2 \geq 0.\] Since \(\|X z\|^2\) is always non-negative, \(H\) is positive semidefinite, and therefore the loss function \(\mathcal{L}(w)\) is convex.
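The positive semidefiniteness of the Hessian can be confirmed numerically for our three-house design matrix. A minimal NumPy sketch:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])

# Hessian of the least-squares loss: H = 2 X^T X.
H = 2.0 * X.T @ X

# A symmetric matrix is positive semidefinite iff all eigenvalues are >= 0.
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)               # both non-negative
print(np.all(eigenvalues >= 0))  # True
```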

We also need to justify why gradient descent is a good idea in the first place.

Why is computing the inverse of a matrix computationally expensive and numerically unstable? Computing the inverse, especially for large matrices, is expensive because matrix inversion algorithms typically cost \(O(d^3)\) for a \(d \times d\) matrix, which becomes prohibitive for high-dimensional data. Inversion can also be numerically unstable when the matrix is ill-conditioned (i.e., has a high condition number): small perturbations in the input data can produce large changes in the computed inverse, making it unreliable in practice. Gradient descent avoids these issues by iteratively refining the solution without requiring any matrix inversion, making it more robust and efficient for large-scale problems.

    Let us now run gradient descent by hand to find the optimal weights for our housing price problem, updating the weights until convergence and computing the gradient at each step. Starting with an initial guess \(w^{(0)} = \begin{bmatrix}0 \\ 0\end{bmatrix}\) and a learning rate \(\eta = 0.00001\), we compute the gradient at each step: \[\nabla \mathcal{L}(w^{(k)}) = 2 X^\top (X w^{(k)} - y).\] We then update the weights: \[w^{(k+1)} = w^{(k)} - \eta \nabla \mathcal{L}(w^{(k)}).\] Repeating this process, we get the following sequence of weight updates: \[\begin{aligned} w^{(0)} &= \begin{bmatrix}0 \\ 0\end{bmatrix}, \\ w^{(1)} &\approx \begin{bmatrix}0.096 \\ 0.015\end{bmatrix}, \\ w^{(2)} &\approx \begin{bmatrix}0.187 \\ 0.028\end{bmatrix}, \\ w^{(3)} &\approx \begin{bmatrix}0.273 \\ 0.040\end{bmatrix}, \\ &\;\;\vdots \end{aligned}\] Continuing the iteration, the weights converge to the closed-form solution \(w_1 \approx 5.78\), \(w_2 \approx 5.56\) computed earlier, although with a learning rate this small convergence is slow. This demonstrates the effectiveness of gradient descent in finding the optimal weights for linear regression without ever forming a matrix inverse. There are many small things to consider here, such as how to choose the learning rate and when to stop the iterations (convergence criteria). Below is a pictorial representation of gradient descent converging to the optimal weights:
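The hand-worked iteration can be sketched in a few lines of NumPy. Note that a larger learning rate than the hand-worked \(\eta\) is used here (an illustrative choice, kept below the stability threshold \(2/\lambda_{\max}(2X^\top X)\)) so that convergence is reached in a modest number of steps:

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])

# Gradient descent on L(w) = ||Xw - y||^2, starting from w = 0.
# eta must stay below 2 / lambda_max(2 X^T X) for the iteration to be stable;
# for this data lambda_max(2 X^T X) is about 1483, so eta = 1e-3 is safe.
eta = 1e-3
w = np.zeros(2)
for _ in range(50_000):
    gradient = 2.0 * X.T @ (X @ w - y)
    w -= eta * gradient

print(w)  # converges to the normal-equation solution, approx [5.778, 5.556]
```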



We will learn these small things in detail in later lectures. But for now, let us completely change our perspective and derive linear regression from a probabilistic viewpoint. For this, let us recall our housing price prediction problem. We have the design matrix \[X = \begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix},\] and the output vector \[y = \begin{bmatrix} 70 \\ 130 \\ 100 \end{bmatrix}.\] We will now assume a probabilistic model for the data generation process. Specifically, we assume that the observed prices are generated according to the linear model plus Gaussian noise: \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) are independent and identically distributed Gaussian noise terms with mean zero and variance \(\sigma^2\). This assumption reflects the idea that the true relationship between features and prices is linear, but there are random fluctuations due to unobserved factors. Recall the probability basics below:

    Probability Basics

    Before diving into the probabilistic interpretation of linear regression, it is essential to review foundational concepts in probability theory.

    Sample Space and Events

    The sample space \(\Omega\) is the set of all possible outcomes of a random experiment. An event is a subset of the sample space. For example, when rolling a fair die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\), and the event “rolling an even number” is \(\{2, 4, 6\}\).

    Probability of an Event

The probability of an event \(A\), denoted \(P(A)\), is a number in \([0, 1]\) satisfying: \(P(A) \geq 0\) for every event \(A\), \(P(\Omega) = 1\), and countable additivity, \(P\left(\bigcup_i A_i\right) = \sum_i P(A_i)\) for any sequence of mutually exclusive events \(A_1, A_2, \ldots\)

    Random Variables

    A random variable \(X\) is a function that assigns a numerical value to each outcome in the sample space. Random variables can be discrete (taking values in a countable set) or continuous (taking values in an interval or uncountable set). For example, the outcome of a die roll is a discrete random variable, while the height of a person is continuous.

    Probability Distributions

    A probability distribution describes how probability is distributed over the possible values of a random variable.

    For a discrete random variable \(X\) taking values \(x_1, x_2, \ldots\), the probability mass function (PMF) is \(p(x_i) = P(X = x_i)\). The probabilities satisfy \(\sum_i p(x_i) = 1\).

    For a continuous random variable \(X\), the probability density function (PDF) is \(f(x)\), which satisfies \(\int_{-\infty}^{\infty} f(x) \, dx = 1\). The probability that \(X\) lies in an interval \([a, b]\) is \[P(a \leq X \leq b) = \int_a^b f(x) \, dx.\]

    Cumulative Distribution Function

    The cumulative distribution function (CDF) is defined as \[F(x) = P(X \leq x).\] For a continuous random variable with PDF \(f(x)\), we have \(F(x) = \int_{-\infty}^x f(t) \, dt\), and the PDF is the derivative of the CDF: \(f(x) = \frac{dF(x)}{dx}\).

    Expected Value

    The expected value (or mean) of a random variable \(X\) is a measure of its central tendency.

    For a discrete random variable, \[E[X] = \sum_i x_i \cdot p(x_i).\]

    For a continuous random variable, \[E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx.\]

    The expected value is linear: \(E[aX + b] = aE[X] + b\) for constants \(a\) and \(b\).

    Variance and Standard Deviation

    The variance of a random variable \(X\) measures the spread of its distribution around the mean: \[\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.\]

    The standard deviation is \(\sigma = \sqrt{\text{Var}(X)}\).
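The identity \(\text{Var}(X) = E[X^2] - (E[X])^2\) can be checked on the die-roll example. A minimal NumPy sketch:

```python
import numpy as np

# Fair six-sided die: outcomes 1..6, each with probability 1/6.
outcomes = np.arange(1, 7)
p = np.full(6, 1 / 6)

mean = np.sum(outcomes * p)              # E[X] = 3.5
second_moment = np.sum(outcomes**2 * p)  # E[X^2] = 91/6
variance = second_moment - mean**2       # E[X^2] - (E[X])^2 = 35/12

print(mean, variance)  # 3.5 and approximately 2.9167
```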

Properties of variance include: \(\text{Var}(aX + b) = a^2 \text{Var}(X)\) for constants \(a\) and \(b\), and \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) when \(X\) and \(Y\) are independent.

    Normal Distribution

    The Gaussian or normal distribution is one of the most important distributions in probability and statistics. A continuous random variable \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), denoted \(X \sim \mathcal{N}(\mu, \sigma^2)\), if its PDF is \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).\]

Key properties of the normal distribution include: it is symmetric about its mean \(\mu\); approximately 68%, 95%, and 99.7% of its probability mass lies within one, two, and three standard deviations of the mean, respectively; and any linear transformation \(aX + b\) of a normal random variable is again normal, with mean \(a\mu + b\) and variance \(a^2\sigma^2\).

    The standard normal distribution has mean \(\mu = 0\) and variance \(\sigma^2 = 1\), often denoted \(\mathcal{N}(0, 1)\).
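As a quick numerical check, the normal PDF integrates to (approximately) one over a wide grid. A minimal NumPy sketch; the function name `normal_pdf` is illustrative:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    # Density of N(mu, sigma2) evaluated at x.
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Numerically integrate the standard normal density over [-10, 10] with a
# simple Riemann sum; the tails beyond +/-10 are negligible.
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
total = np.sum(normal_pdf(x, 0.0, 1.0)) * dx
print(total)  # approximately 1.0
```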

    Multivariate Normal Distribution

    When dealing with multiple random variables, we use the multivariate normal distribution. A random vector \(\mathbf{x} \in \mathbb{R}^d\) follows a multivariate normal distribution with mean vector \(\boldsymbol{\mu} \in \mathbb{R}^d\) and covariance matrix \(\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}\), denoted \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), if its PDF is \[f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right),\] where \(|\boldsymbol{\Sigma}|\) denotes the determinant of \(\boldsymbol{\Sigma}\).

    Independence

    Two random variables \(X\) and \(Y\) are independent if knowledge of one provides no information about the other. Formally, they are independent if \[P(X = x \text{ and } Y = y) = P(X = x) \cdot P(Y = y)\] for all values \(x\) and \(y\) (in the discrete case), or if their joint PDF factors as \(f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)\) (in the continuous case).

    Conditional Probability

    The conditional probability of event \(A\) given event \(B\) is \[P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)},\] provided \(P(B) > 0\).

    Bayes’ Theorem

    Bayes’ theorem relates conditional probabilities: \[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.\] This theorem is fundamental in machine learning and statistical inference, allowing us to update beliefs (posterior) based on new evidence (likelihood) and prior knowledge (prior).

    Likelihood

    In the context of parameter estimation, the likelihood is the probability of observing the data given a particular value of the parameter. If we have data \(\mathbf{y} = (y_1, \ldots, y_n)\) and parameter \(w\), the likelihood is \[L(w \mid \mathbf{y}) = P(\mathbf{y} \mid w).\] The log-likelihood is \(\ell(w) = \log L(w \mid \mathbf{y})\), which is often easier to work with computationally.

    Maximum Likelihood Estimation

    Maximum likelihood estimation (MLE) is a method for estimating parameters by choosing the parameter value that maximizes the likelihood of the observed data: \[\hat{w} = \arg\max_w L(w \mid \mathbf{y}) = \arg\max_w \ell(w \mid \mathbf{y}).\]

These concepts will be essential for understanding the probabilistic formulation of linear regression, where we model noise using the normal distribution and estimate parameters via maximum likelihood.

    Do all models with noise follow this approach? No. While the Gaussian noise assumption is common due to its mathematical convenience and the Central Limit Theorem, other noise distributions can be used depending on the nature of the data and the application. For example, if the noise is believed to be heavy-tailed, a Laplace distribution might be more appropriate; for count data, a Poisson distribution could be used. The choice of noise model affects the likelihood function and consequently the parameter estimation method.

    Recall that the Central Limit Theorem states that the sum of a large number of independent random variables, regardless of their individual distributions, tends toward a normal distribution, provided certain conditions are met. This theorem justifies the common use of Gaussian noise in modeling real-world phenomena, as many types of measurement errors and natural variations can be viewed as the cumulative effect of numerous small, independent factors.

    In the house price example, is the noise really independent and identically distributed Gaussian? In practice, the i.i.d. Gaussian assumption may not hold. Several factors can lead to violations of this assumption: the noise variance may grow with house size (heteroscedasticity), prices of nearby or similar houses may be correlated rather than independent, and the error distribution may be skewed or heavy-tailed rather than Gaussian.

    Despite these potential violations, the Gaussian noise assumption is often used as a first approximation due to its mathematical tractability and the Central Limit Theorem. However, it is important to assess the validity of this assumption in practice and consider alternative models or robust estimation techniques if necessary. Now let us derive the likelihood function for our linear regression model under the Gaussian noise assumption.

    Likelihood Function

Given the model \[y_i = w^\top x_i + \epsilon_i,\] where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\), the conditional probability of observing \(y_i\) given \(x_i\) and parameters \(w\) is \[P(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right).\]

    Assuming the data points are independent, the likelihood of observing the entire dataset \((X, y)\) is the product of the individual probabilities: \[L(w \mid X, y) = \prod_{i=1}^n P(y_i \mid x_i, w) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2\right).\] Taking the logarithm of the likelihood function gives the log-likelihood: \[\ell(w \mid X, y) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2.\]

    To find the maximum likelihood estimate of \(w\), we can ignore the constant term \(-\frac{n}{2} \log(2\pi\sigma^2)\); maximizing the remaining term is then equivalent to minimizing the sum of squared errors: \[\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2.\] This is precisely the optimization problem we encountered in the least squares formulation of linear regression. Thus, the maximum likelihood estimate of \(w\) under the Gaussian noise assumption coincides with the least squares solution. This connection highlights the probabilistic interpretation of linear regression: fitting the model that maximizes the likelihood of the observed data under a Gaussian noise model.
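The coincidence of the maximum likelihood estimate and the least squares solution can be illustrated numerically: the least squares weights should achieve a higher Gaussian log-likelihood than any perturbation of them. A minimal NumPy sketch on the three-house data (the assumed noise variance \(\sigma^2 = 1\) is arbitrary and does not affect which \(w\) maximizes the likelihood):

```python
import numpy as np

X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])
y = np.array([70.0, 130.0, 100.0])
sigma2 = 1.0  # assumed noise variance (its value does not change the argmax)

def log_likelihood(w):
    # Gaussian log-likelihood of the data under y_i = w^T x_i + noise.
    r = X @ w - y
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (r @ r) / (2 * sigma2)

# The least squares solution should also maximize the log-likelihood.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
perturbed = w_ls + np.array([0.1, -0.1])
print(log_likelihood(w_ls) > log_likelihood(perturbed))  # True
```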

    Principal component analysis (PCA)

    Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much variance as possible. It achieves this by identifying the directions (principal components) along which the data varies the most.

    Principal Component Analysis (PCA): Philosophy and Foundations

    1. From Prediction to Representation

    In linear regression, we study a supervised problem: \[y = Xw + \varepsilon,\] where the goal is to predict an output variable \(y\) (e.g. house price) from features \(X\).

    In PCA, we deliberately remove the output variable \(y\) and ask a different question:

    Given only the feature data \(X\), how can we describe or represent the data as faithfully as possible using fewer numbers?

    Thus PCA is an unsupervised method whose goal is not prediction, but representation and compression of data.

    —

    2. Data Setup and Mean Centering

    Let \[x_1, x_2, \dots, x_n \in \mathbb{R}^d\] be data points (e.g. houses described by size, bedrooms, etc.).

    Compact Vector Notation

    Since each data point \(x_i\) is a \(d\)-dimensional vector, the mean is computed as a vector average: \[\mu = \frac{1}{n}\sum_{i=1}^n x_i = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} \in \mathbb{R}^d,\] where each component \(\mu_j\) is the average of the \(j\)-th feature across all data points. When we subtract \(\mu\) from \(x_i\), we perform component-wise subtraction: \[x_i - \mu = \begin{bmatrix} x_{i1} - \mu_1 \\ x_{i2} - \mu_2 \\ \vdots \\ x_{id} - \mu_d \end{bmatrix}.\] This compact notation \(x_i \leftarrow x_i - \mu\) applies uniformly regardless of dimensionality.

    We first perform mean centering: \[x_i \leftarrow x_i - \mu, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^n x_i.\]

    After centering, \[\frac{1}{n}\sum_{i=1}^n x_i = 0.\]

    Mean centering ensures that we study variation around the average, which is essential for defining variance and covariance.

    —

    3. Covariance Matrix as a Summary of Variation

    Motivation: Measuring Spread and Relationships

After mean centering, the centered data points are distributed around the origin. To understand the structure of this distribution, we need a single mathematical object that captures both how much each individual feature varies (its spread) and how pairs of features vary together (their relationships).

    Recall from probability that the variance of a random variable \(X\) is \(\text{Var}(X) = E[X^2]\) (when centered at zero). For data, this becomes \(\frac{1}{n}\sum_i x_i^2\), measuring spread along a single dimension.

    For centered data points \(x_1, \ldots, x_n\), the outer product \(x_i x_i^\top\) creates a \(d \times d\) matrix that captures all pairwise products of coordinates: \[x_i x_i^\top = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix} \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{bmatrix} = \begin{bmatrix} x_{i1}^2 & x_{i1}x_{i2} & \cdots \\ x_{i2}x_{i1} & x_{i2}^2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}.\]

Averaging these matrices across all data points accumulates information about the variance of each individual coordinate (the diagonal entries) and the covariance of each pair of coordinates (the off-diagonal entries).

    This is the natural generalization of univariate variance to the multivariate setting.

    Formal Definition

    The covariance matrix is defined as \[\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top.\]

    Each entry \[\Sigma_{jk} = \frac{1}{n}\sum_{i=1}^n x_{ij}x_{ik}\] is the covariance between feature \(j\) and feature \(k\).
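A minimal NumPy sketch of the centering and covariance computation, reusing the housing features from the earlier example (rows are data points):

```python
import numpy as np

# Mean centering and the covariance matrix Sigma = (1/n) sum_i x_i x_i^T
# for centered data, on the housing features from the notes.
X = np.array([[10.0, 2.0],
              [20.0, 3.0],
              [15.0, 2.0]])             # rows are data points x_i

X_centered = X - X.mean(axis=0)          # x_i <- x_i - mu
n = X_centered.shape[0]
Sigma = (X_centered.T @ X_centered) / n  # d x d covariance matrix

# Diagonal entries are feature variances, off-diagonal entries are
# covariances, and the matrix is symmetric: Sigma[j, k] == Sigma[k, j].
print(Sigma)
```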

Thus, the covariance matrix compactly stores the variance of every feature on its diagonal and the covariance of every pair of features off the diagonal.

    —

    4. Linear Summaries of Data

    Choose a vector of weights \[v \in \mathbb{R}^d, \quad \|v\| = 1.\]

    For each data point \(x_i\), define a one-dimensional summary: \[z_i = x_i^\top v.\]

    This compresses each \(d\)-dimensional data point into a single number.

    Different choices of \(v\) preserve different aspects of the data.

    —

    5. Variance as Information Preservation

    The variance of the summary values \(\{z_i\}\) is \[\operatorname{Var}(z) = \frac{1}{n}\sum_{i=1}^n z_i^2 = v^\top \Sigma v.\]

Interpretation: a large variance means the summaries \(z_i\) still distinguish the data points from one another, while a variance of zero means every data point is mapped to the same number and all differences are lost.

    Thus, variance measures how much of the differences between data points survive compression.

    —

    6. The Core PCA Optimization Problem

    To find the best one-dimensional representation, PCA solves: \[\max_{\|v\|=1} \; v^\top \Sigma v.\]

    This chooses the linear combination of features whose values vary the most across the dataset.

    The solution is called the first principal component \(v_1\).
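This can be sanity-checked numerically: the top eigenvector of \(\Sigma\) attains the maximum of \(v^\top \Sigma v\) over unit vectors. A short NumPy sketch, using the covariance matrix values from the housing worked example later in these notes:

```python
import numpy as np

# The maximizer of v^T Sigma v over unit vectors is the top eigenvector
# of Sigma; we check this against many random unit vectors.
Sigma = np.array([[16.6667, 1.6667],
                  [1.6667, 0.2222]])     # covariance from the housing example

eigvals, eigvecs = np.linalg.eigh(Sigma) # eigenvalues in ascending order
v1 = eigvecs[:, -1]                      # first principal component
best = eigvals[-1]                       # variance along v1

rng = np.random.default_rng(0)
for _ in range(1000):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)               # random unit vector
    assert v @ Sigma @ v <= best + 1e-9  # no unit vector does better
print(v1, best)
```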

    —

    7. Reconstruction Viewpoint (Key Insight)

    Given the summary \[z_i = x_i^\top v,\] the natural reconstruction of the original data point is \[\hat{x}_i = z_i\, v.\]

    All reconstructed points lie in the one-dimensional subspace spanned by \(v\).

    The average reconstruction error is \[\frac{1}{n}\sum_{i=1}^n \|x_i - \hat{x}_i\|^2 = \frac{1}{n}\sum_{i=1}^n \|x_i - (x_i^\top v)v\|^2.\]

    A fundamental result is:

    The vector \(v\) that maximizes variance also minimizes reconstruction error.

    Hence, PCA finds the best low-dimensional approximation of the data in the least-squares sense.
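The equivalence can be verified numerically via the identity \(\frac{1}{n}\sum_i \|x_i - (x_i^\top v)v\|^2 = \operatorname{tr}(\Sigma) - v^\top \Sigma v\): since \(\operatorname{tr}(\Sigma)\) is fixed, maximizing variance and minimizing reconstruction error are the same problem. A sketch with random centered data (the data itself is an illustrative assumption):

```python
import numpy as np

# Numeric check: for any unit v, average squared reconstruction error
# equals total variance minus the variance captured along v.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)                      # centre the data
n = len(X)
Sigma = X.T @ X / n
total = np.trace(Sigma)                  # (1/n) sum_i ||x_i||^2

v = rng.normal(size=3)
v /= np.linalg.norm(v)                   # arbitrary unit direction
Xhat = np.outer(X @ v, v)                # reconstructions (x_i^T v) v
err = np.mean(np.sum((X - Xhat) ** 2, axis=1))

assert np.isclose(err, total - v @ Sigma @ v)
```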

    —

    8. Example: Housing Data

    Consider houses described by size and number of bedrooms (mean-centered):

\[x_1 = (-5,-0.33),\quad x_2 = (5,0.67),\quad x_3 = (0,-0.33).\]

    Thus PCA automatically discovers that house size is the dominant pattern of variation.

    —

    9. Ultimate Benefit of PCA

PCA provides a compressed low-dimensional representation of the data, decorrelated coordinates, and an ordering of directions by how much variance each one explains.

    PCA does not discover meaning or causality. It guarantees only this:

    Among all linear low-dimensional representations, PCA loses the least information under squared error.

    —

    10. PCA: Mathematical Formulation (Deisenroth et al.)

    10.1 Problem Setup and Notation

    Let \(\mathcal{X} = \{x_1, \ldots, x_N\}\) be a dataset where \(x_n \in \mathbb{R}^D\). We seek to find a lower-dimensional representation in \(\mathbb{R}^M\) where \(M < D\), while preserving as much information as possible.

    PCA projects data onto a lower-dimensional linear subspace. The key is to find an orthonormal basis \(\{b_1, \ldots, b_M\}\) of this \(M\)-dimensional subspace such that the projected data retains maximum variance.

    Projection onto a Subspace

    For a given orthonormal basis \(B = [b_1, \ldots, b_M]\) where \(b_i \in \mathbb{R}^D\) and \(b_i^\top b_j = \delta_{ij}\), the projection of a data point \(x_n\) onto this subspace is: \[\tilde{x}_n = \sum_{i=1}^M z_{ni} b_i = B z_n,\] where \(z_n = [z_{n1}, \ldots, z_{nM}]^\top \in \mathbb{R}^M\) are the coordinates (principal component scores) in the new basis, computed as \(z_{ni} = x_n^\top b_i\).

    10.2 Maximum Variance Formulation

    Objective

    PCA finds the orthonormal basis that maximizes the variance of the projected data. For mean-centered data (where \(\sum_{n=1}^N x_n = 0\)), the variance of the projections onto a single direction \(b_1\) is: \[V_1 = \frac{1}{N} \sum_{n=1}^N (x_n^\top b_1)^2 = \frac{1}{N} \sum_{n=1}^N b_1^\top x_n x_n^\top b_1 = b_1^\top S b_1,\] where \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top\) is the data covariance matrix.

    Constrained Optimization

    The first principal component is found by solving: \[\max_{b_1} \; b_1^\top S b_1 \quad \text{subject to} \quad \|b_1\|^2 = 1.\]

    Using the method of Lagrange multipliers with Lagrangian \(\mathcal{L}(b_1, \lambda_1) = b_1^\top S b_1 - \lambda_1(b_1^\top b_1 - 1)\), we obtain: \[\frac{\partial \mathcal{L}}{\partial b_1} = 2S b_1 - 2\lambda_1 b_1 = 0 \quad \Rightarrow \quad S b_1 = \lambda_1 b_1.\]

    Thus, \(b_1\) is an eigenvector of \(S\) with eigenvalue \(\lambda_1\). The variance along this direction is: \[V_1 = b_1^\top S b_1 = b_1^\top (\lambda_1 b_1) = \lambda_1 b_1^\top b_1 = \lambda_1.\]

    To maximize variance, we choose \(b_1\) to be the eigenvector corresponding to the largest eigenvalue \(\lambda_1\) of \(S\).

    Higher Principal Components

    For the \(m\)-th principal component, we solve: \[\max_{b_m} \; b_m^\top S b_m \quad \text{subject to} \quad \|b_m\|^2 = 1 \text{ and } b_m \perp b_i \text{ for } i = 1, \ldots, m-1.\]

    The solution is the eigenvector corresponding to the \(m\)-th largest eigenvalue of \(S\).

    10.3 Minimum Reconstruction Error Formulation

    Alternative Characterization

    PCA can equivalently be formulated as minimizing the average squared reconstruction error. The reconstruction of \(x_n\) from its projection is: \[\tilde{x}_n = \sum_{i=1}^M (x_n^\top b_i) b_i = BB^\top x_n,\] where \(B = [b_1, \ldots, b_M]\).

    The reconstruction error for the entire dataset is: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^N \|x_n - BB^\top x_n\|^2.\]

    Equivalence to Maximum Variance

    Using the identity \(\|x_n\|^2 = \|BB^\top x_n\|^2 + \|(I - BB^\top)x_n\|^2\) for orthonormal \(B\), we have: \[J = \frac{1}{N} \sum_{n=1}^N \|x_n\|^2 - \frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2.\]

    Since the first term is constant, minimizing \(J\) is equivalent to maximizing \(\frac{1}{N} \sum_{n=1}^N \|BB^\top x_n\|^2\), which is the variance of the projected data. Thus, maximum variance and minimum reconstruction error are equivalent objectives.

    10.4 Practical Algorithm

    1. Center the data: Compute mean \(\mu = \frac{1}{N}\sum_{n=1}^N x_n\) and set \(x_n \leftarrow x_n - \mu\).

    2. Compute covariance matrix: \(S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top = \frac{1}{N} X^\top X\), where \(X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}\).

    3. Eigendecomposition: Compute eigenvectors and eigenvalues of \(S\): \(S = \sum_{i=1}^D \lambda_i b_i b_i^\top\) where \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D\).

    4. Select principal components: Choose the top \(M\) eigenvectors \(B = [b_1, \ldots, b_M]\).

    5. Project data: For each \(x_n\), compute \(z_n = B^\top x_n \in \mathbb{R}^M\).

    Explained Variance

    The proportion of variance captured by the first \(M\) principal components is: \[\frac{\sum_{i=1}^M \lambda_i}{\sum_{i=1}^D \lambda_i}.\]
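The five steps above, together with the explained-variance ratio, can be sketched directly in NumPy (a minimal sketch rather than a production implementation; the housing features are reused as toy input):

```python
import numpy as np

def pca(X, M):
    """Sketch of the five-step PCA algorithm from the notes."""
    mu = X.mean(axis=0)                    # 1. centre the data
    Xc = X - mu
    S = Xc.T @ Xc / len(X)                 # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # 3. eigendecomposition (ascending)
    order = np.argsort(eigvals)[::-1]      #    sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    B = eigvecs[:, :M]                     # 4. top-M eigenvectors
    Z = Xc @ B                             # 5. project: z_n = B^T x_n
    explained = eigvals[:M].sum() / eigvals.sum()
    return Z, B, explained

X = np.array([[10.0, 2.0], [20.0, 3.0], [15.0, 2.0]])
Z, B, explained = pca(X, M=1)
print(explained)   # fraction of variance captured by the first component
```

For this toy dataset the first component already captures well over 99% of the total variance.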

10.5 Key Properties

The basis \(B\) is orthonormal, and the projected coordinates \(z_n\) are uncorrelated across components. Each eigenvalue \(\lambda_i\) equals the variance of the data along \(b_i\), the total variance \(\sum_{i=1}^D \lambda_i\) equals the trace of \(S\), and the subspace spanned by the top \(M\) eigenvectors is the best \(M\)-dimensional linear approximation of the data in the least-squares sense.

    10.6 Computational Considerations

    For high-dimensional data where \(N < D\), computing the \(D \times D\) covariance matrix \(S\) is expensive. Instead, we can use the singular value decomposition (SVD) of the centered data matrix \(X \in \mathbb{R}^{N \times D}\): \[X = U \Sigma V^\top,\] where \(U \in \mathbb{R}^{N \times N}\), \(\Sigma \in \mathbb{R}^{N \times D}\), \(V \in \mathbb{R}^{D \times D}\) are the SVD factors.

The relationship with PCA: the columns of \(V\) are the eigenvectors of \(S\) (the principal directions \(b_i\)), the eigenvalues of \(S\) are \(\lambda_i = \sigma_i^2 / N\), where \(\sigma_i\) are the singular values, and the principal component scores are \(Z = X V_M\), with \(V_M\) the first \(M\) columns of \(V\).

    This approach is numerically more stable and efficient, especially when \(N \ll D\).
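A short NumPy sketch of this equivalence (the random data is an illustrative assumption): the eigenvalues of \(S\) recovered from the singular values match those from the explicit eigendecomposition.

```python
import numpy as np

# Eigenvalues of S coincide with sigma_i^2 / N, where sigma_i are the
# singular values of the centred data matrix X.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 50))       # N = 5 << D = 50
X -= X.mean(axis=0)
N = len(X)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam_svd = s ** 2 / N               # eigenvalues of S from singular values

S = X.T @ X / N                    # D x D covariance (costly when D is large)
lam_eig = np.sort(np.linalg.eigvalsh(S))[::-1][: len(lam_svd)]

assert np.allclose(lam_svd, lam_eig)
```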

    10.7 Geometric Interpretation

PCA identifies an orthogonal coordinate system aligned with the directions of maximum variance in the data cloud. Geometrically, \(b_1\) points along the direction in which the cloud is most spread out, each subsequent \(b_m\) is orthogonal to the earlier directions and captures the next-largest spread, and the eigenvalue \(\lambda_m\) measures the spread along \(b_m\).

    The \(M\)-dimensional subspace spanned by \(\{b_1, \ldots, b_M\}\) is the best-fitting \(M\)-dimensional linear subspace in the least-squares sense.

    —

11. One-Sentence Summary

Among all linear low-dimensional representations of the data, PCA is the one that loses the least information under squared error.

    Revision

    PCA in Full Matrix and Component Form

    Data matrix.

    Let the dataset consist of \(N\) centered samples in \(\mathbb{R}^d\): \[X \in \mathbb{R}^{N \times d}, \qquad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nd} \end{bmatrix}.\] The \(n\)-th data point is the row vector \[x_n = \begin{bmatrix} x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} \in \mathbb{R}^d.\]

    Covariance matrix.

    The empirical covariance matrix is \[S = \frac{1}{N} X^T X \in \mathbb{R}^{d \times d}.\]

    Principal directions (decoder matrix).

    Let \[B \in \mathbb{R}^{d \times m}, \qquad B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \cdots & b_{dm} \end{bmatrix} = \begin{bmatrix} b_1 & b_2 & \cdots & b_m \end{bmatrix},\] where the columns \(b_k \in \mathbb{R}^d\) are orthonormal: \[B^T B = I_m.\]

    Each direction satisfies the eigenvalue problem \[S b_k = \lambda_k b_k.\]

    Encoding (principal components).

    The low-dimensional representation is \[Z = X B \in \mathbb{R}^{N \times m},\] \[Z = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1m} \\ z_{21} & z_{22} & \cdots & z_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ z_{N1} & z_{N2} & \cdots & z_{Nm} \end{bmatrix}.\]

    Componentwise, \[z_{nk} = \sum_{j=1}^d x_{nj} \, b_{jk} = b_k^T x_n.\]

    In particular, the first principal component is \[z_{n1} = b_{11}x_{n1} + b_{21}x_{n2} +\cdots+ b_{d1}x_{nd}.\] Thus the first principal component is a linear combination of all original features.

    Variance of the first component.

    Let \[z_1 = \begin{bmatrix} z_{11} \\ z_{21} \\ \vdots \\ z_{N1} \end{bmatrix}.\] Then \[\mathrm{Var}(z_1) = \frac{1}{N} \sum_{n=1}^N z_{n1}^2 = \frac{1}{N} \sum_{n=1}^N (b_1^T x_n)^2 = b_1^T S b_1.\] PCA chooses \(b_1\) to maximize this variance: \[\max_{\|b_1\|=1} \; b_1^T S b_1.\]

    Reconstruction.

    The reconstructed data is \[\hat{X} = Z B^T = X B B^T.\]

    The matrix \[P = B B^T \in \mathbb{R}^{d \times d}\] has components \[P_{ij} = \sum_{k=1}^m b_{ik} b_{jk}.\]

    Thus reconstruction componentwise is \[\hat{x}_{ni} = \sum_{j=1}^d x_{nj} P_{ji} = \sum_{j=1}^d x_{nj} \left( \sum_{k=1}^m b_{jk} b_{ik} \right).\]

    Why \(BB^T \neq I_d\).

    If \(m < d\), then \[B^T B = I_m, \qquad BB^T \neq I_d.\] Instead, \[(BB^T)^2 = BB^T,\] so \(BB^T\) is a projection matrix onto the \(m\)-dimensional subspace spanned by \(\{b_1,\dots,b_m\}\).

    Therefore \[\hat{X} = XBB^T \neq X\] unless \(m=d\), in which case \(B\) is square orthogonal and \[BB^T = I_d.\]
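These projection facts are easy to verify numerically. The sketch below builds an arbitrary orthonormal \(B\) (via QR, an assumption for illustration) and checks \(B^\top B = I_m\), idempotence of \(BB^\top\), and \(BB^\top \neq I_d\) when \(m < d\):

```python
import numpy as np

# With d = 4 and m = 2: B^T B = I_m, while B B^T is only a projection
# matrix (idempotent), not the identity.
rng = np.random.default_rng(3)
A = rng.normal(size=(4, 2))              # random d x m matrix
B, _ = np.linalg.qr(A)                   # orthonormal columns

P = B @ B.T
assert np.allclose(B.T @ B, np.eye(2))   # B^T B = I_m
assert np.allclose(P @ P, P)             # (BB^T)^2 = BB^T
assert not np.allclose(P, np.eye(4))     # BB^T != I_d since m < d
```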

    Summary.

    PCA performs the transformation \[X_{N \times d} \;\longrightarrow\; Z = X B \; (N \times m),\] where each new coordinate is \[z_{nk} = b_k^T x_n,\] and reconstruction is the orthogonal projection \[\hat{X} = XBB^T.\]

    PCA: Example and Intuition

Let us return to our housing price example, which is the following dataset with three columns: \[\begin{bmatrix} 10 & 2 & 70 \\ 20 & 3 & 130 \\ 15 & 2 & 100 \end{bmatrix}.\] Here the first two columns are features (size in 100 sq ft and number of bedrooms) and the last column is the target variable (price in Rs. lakhs). For PCA, we will only consider the feature columns: \[\begin{bmatrix} 10 & 2 \\ 20 & 3 \\ 15 & 2 \end{bmatrix}.\] Why do we consider only the feature columns? Because in real life we may not know the target variable for every data point. Does that make the dataset useless? No: we can still use the feature columns to learn about the structure of the data. PCA helps us find a lower-dimensional representation of the data that captures the most important information. This is useful for visualization, noise reduction, and as a preprocessing step for other machine learning algorithms, where it can reduce overfitting and computational cost. So in this example, we will reduce the two features (size and number of bedrooms) to a single feature that captures the most variance in the data. The first step is to center the data by subtracting the mean of each feature from the corresponding feature values. This ensures that each feature has mean zero, which matters for PCA because it focuses on the directions of maximum variance rather than on absolute values.

    Data Centering

The mean of the first feature (size) is \((10 + 20 + 15) / 3 = 15\), and the mean of the second feature (number of bedrooms) is \((2 + 3 + 2) / 3 \approx 2.33\). Centering the data gives: \[\begin{bmatrix} 10 - 15 & 2 - 2.33 \\ 20 - 15 & 3 - 2.33 \\ 15 - 15 & 2 - 2.33 \end{bmatrix}.\] This results in the centered data matrix: \[\begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] Now we can proceed to compute the covariance matrix of the centered data, because PCA relies on the covariance structure of the data to identify the principal components. This is because principal components are directions in the feature space along which the data varies the most, and the covariance matrix captures this variance and the relationships between different features. Geometrically, the covariance matrix describes how the data is spread out in different directions, which is essential for determining the axes of maximum variance that PCA seeks to identify.

Plotted in the 2D space of size and bedrooms, the centered points lie around the origin (which represents the mean of the original data); their spread is the variance that PCA will analyze to find the principal components.

    Covariance Matrix Calculation

    The covariance matrix \(C\) is computed as: \[C = \frac{1}{n} Z^\top Z,\] where \(Z\) is the centered data matrix and \(n\) is the number of data points. In our case, \(n = 3\), so we have: \[C = \frac{1}{3} \begin{bmatrix} -5 & 5 & 0 \\ -0.33 & 0.67 & -0.33 \end{bmatrix} \begin{bmatrix} -5 & -0.33 \\ 5 & 0.67 \\ 0 & -0.33 \end{bmatrix}.\] Calculating this gives: \[C = \frac{1}{3} \begin{bmatrix} (-5)(-5) + (5)(5) + (0)(0) & (-5)(-0.33) + (5)(0.67) + (0)(-0.33) \\ (-0.33)(-5) + (0.67)(5) + (-0.33)(0) & (-0.33)(-0.33) + (0.67)(0.67) + (-0.33)(-0.33) \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 25 + 25 + 0 & 1.65 + 3.35 + 0 \\ 1.65 + 3.35 + 0 & 0.1089 + 0.4489 + 0.1089 \end{bmatrix}\] \[= \frac{1}{3} \begin{bmatrix} 50 & 5 \\ 5 & 0.6667 \end{bmatrix} = \begin{bmatrix}16.6667 & 1.6667 \\ 1.6667 & 0.2222\end{bmatrix}.\] Next step is to compute the eigenvalues and eigenvectors of the covariance matrix \(C\). The eigenvalues represent the amount of variance captured by each principal component, while the eigenvectors indicate the directions of these components in the feature space.
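The covariance computation above can be double-checked with NumPy; the sketch uses exact thirds for the centered bedroom column instead of the rounded \(-0.33/0.67\) values:

```python
import numpy as np

# Recompute the covariance matrix of the centred housing data.
Z = np.array([[-5.0, -1/3],
              [ 5.0,  2/3],
              [ 0.0, -1/3]])            # centred housing data (exact thirds)

C = Z.T @ Z / 3                         # C = (1/n) Z^T Z
print(C)                                # approx [[16.667, 1.667], [1.667, 0.222]]
```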

    Why Eigenvalues and Eigenvectors Appear in PCA

    At the heart of PCA lies the optimization problem \[\max_{\|v\|=1} \; v^\top \Sigma v,\] where \(\Sigma\) is the covariance matrix of the mean-centered data.

    This problem asks for a linear combination of features whose values have maximum variance. Without the unit-length constraint \(\|v\|=1\), the objective would increase arbitrarily by scaling \(v\), so the constraint is essential.

    Constrained Optimization and the Eigenvalue Equation

    To solve this constrained maximization, we introduce a Lagrange multiplier: \[\mathcal{L}(v,\lambda) = v^\top \Sigma v - \lambda (v^\top v - 1).\]

    Taking derivatives with respect to \(v\) and setting them to zero yields \[\nabla_v \mathcal{L} = 2\Sigma v - 2\lambda v = 0,\] which simplifies to \[\Sigma v = \lambda v.\]

    This equation is precisely the definition of an eigenvector–eigenvalue pair. Thus, eigenvectors are not assumed in PCA; they emerge inevitably as the stationary points of the variance maximization problem.

    Why the Largest Eigenvalue Matters

    For any unit vector \(v\), \[v^\top \Sigma v\] equals the variance of the one-dimensional projection \(z = x^\top v\).

    When \(v\) is an eigenvector of \(\Sigma\) with eigenvalue \(\lambda\), this quantity becomes \[v^\top \Sigma v = \lambda.\]

    Hence, the eigenvalue directly measures the variance captured by its corresponding eigenvector. Selecting the eigenvector with the largest eigenvalue therefore yields the direction of maximum variance, called the first principal component.

    Multiple Components and Orthogonality

After the first component is chosen, PCA seeks additional components that capture as much of the remaining variance as possible while being uncorrelated with the components already found.

    This leads naturally to the requirement that subsequent vectors be orthogonal to earlier ones. The eigenvectors of a symmetric matrix such as \(\Sigma\) are automatically orthogonal, so they provide an ordered set of non-redundant directions.

    Interpretation and Practical Handling

    Eigenvectors should not be interpreted as causes or predictors. They represent optimal coordinates in which the structure of the data is most economically expressed.

    Key Takeaway

    Eigenvectors and eigenvalues appear in PCA because maximizing variance under a normalization constraint mathematically forces the eigenvalue equation; they quantify optimal directions and the amount of variation captured along them.

    Eigenvalue and Eigenvector Calculation

To find the eigenvalues \(\lambda\) of the covariance matrix \(C\), we solve the characteristic equation: \[\det(C - \lambda I) = 0,\] where \(I\) is the identity matrix. Thus, we have: \[\det\begin{bmatrix}16.6667 - \lambda & 1.6667 \\ 1.6667 & 0.2222 - \lambda\end{bmatrix} = 0.\] Calculating the determinant gives: \[(16.6667 - \lambda)(0.2222 - \lambda) - (1.6667)(1.6667) = 0.\] Expanding this, we get: \[\lambda^2 - 16.8889\lambda + (16.6667 \cdot 0.2222 - 2.7778) = 0,\] \[\lambda^2 - 16.8889\lambda + 0.9259 = 0.\] Using the quadratic formula: \[\lambda = \frac{16.8889 \pm \sqrt{(16.8889)^2 - 4 \cdot 1 \cdot 0.9259}}{2}.\] Calculating the discriminant: \[(16.8889)^2 - 4 \cdot 0.9259 \approx 285.2349 - 3.7036 = 281.5313.\] Thus, \[\lambda \approx \frac{16.8889 \pm 16.7789}{2}.\] This gives us two eigenvalues: \[\lambda_1 \approx 16.8339, \quad \lambda_2 \approx 0.0550.\] As a quick check, \(\lambda_1 + \lambda_2 \approx 16.8889\), which equals the trace of \(C\). Next, we find the eigenvectors corresponding to each eigenvalue by solving the equation: \[(C - \lambda I)v = 0.\] For \(\lambda_1 \approx 16.8339\): \[\begin{bmatrix}16.6667 - 16.8339 & 1.6667 \\ 1.6667 & 0.2222 - 16.8339\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}-0.1672 & 1.6667 \\ 1.6667 & -16.6117\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0,\] whose first row gives \(v_2 \approx 0.1003\, v_1\). Normalizing to unit length yields the eigenvector: \[v_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] For \(\lambda_2 \approx 0.0550\): \[\begin{bmatrix}16.6667 - 0.0550 & 1.6667 \\ 1.6667 & 0.2222 - 0.0550\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] This simplifies to: \[\begin{bmatrix}16.6117 & 1.6667 \\ 1.6667 & 0.1672\end{bmatrix} \begin{bmatrix}v_1 \\ v_2\end{bmatrix} = 0.\] Solving this system gives the eigenvector: \[v_2 \approx \begin{bmatrix}-0.0998 \\ 0.9950\end{bmatrix}.\] The eigenvector corresponding to the largest eigenvalue \(\lambda_1\) represents the direction of the first principal component, which captures the most variance in the data.
In this case, the first principal component is approximately: \[PC_1 \approx \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix}.\] To reduce the dimensionality of the data from two features to one, we project the centered data onto the first principal component. This is done by taking the dot product of each centered data point with the first principal component vector.

    Data Projection

The projection of a centered data point \(z_i\) onto the first principal component \(PC_1\) is given by: \[p_i = z_i^\top PC_1.\] Calculating the projections for each centered data point: \[\begin{aligned} p_1 &= \begin{bmatrix}-5 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx -5 \cdot 0.9950 + (-0.33) \cdot 0.0998 \approx -4.9750 - 0.0329 \approx -5.0079, \\ p_2 &= \begin{bmatrix}5 & 0.67\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx 5 \cdot 0.9950 + 0.67 \cdot 0.0998 \approx 4.9750 + 0.0669 \approx 5.0419, \\ p_3 &= \begin{bmatrix}0 & -0.33\end{bmatrix} \cdot \begin{bmatrix}0.9950 \\ 0.0998\end{bmatrix} \approx 0 \cdot 0.9950 + (-0.33) \cdot 0.0998 \approx -0.0329. \end{aligned}\] Thus, the one-dimensional representation of the data after PCA is approximately: \[\begin{bmatrix}-5.0079 \\ 5.0419 \\ -0.0329\end{bmatrix}.\] This reduced representation captures the most significant variance of the original two-dimensional feature space, allowing us to analyze and visualize the data more effectively while reducing complexity. Let us find the variance of this one-dimensional representation: \[\mathrm{Var}(p) = \frac{1}{3} \sum_{i=1}^3 p_i^2 \approx \frac{1}{3}((-5.0079)^2 + (5.0419)^2 + (-0.0329)^2) \approx \frac{1}{3}(25.079 + 25.421 + 0.0011) \approx 16.834.\] This matches the largest eigenvalue \(\lambda_1 \approx 16.8339\) up to rounding, confirming that the first principal component captures the majority of the variance in the data. The second principal component captures the remaining variance, which is much smaller in this case, indicating that the data is primarily spread along the direction of the first principal component.
See the calculation below. The projections onto the second principal component are \[q_i = z_i^\top v_2,\] giving \(q_1 \approx (-5)(-0.0998) + (-0.33)(0.9950) \approx 0.1706\), \(q_2 \approx (5)(-0.0998) + (0.67)(0.9950) \approx 0.1677\), and \(q_3 \approx (-0.33)(0.9950) \approx -0.3284\). Hence \[\mathrm{Var}(q) = \frac{1}{3} \sum_{i=1}^3 q_i^2 \approx \frac{1}{3}(0.0291 + 0.0281 + 0.1078) \approx 0.0550,\] which matches the second eigenvalue \(\lambda_2 \approx 0.0550\), confirming that the second principal component captures the remaining variance in the data. The first principal component is the most informative direction in the feature space, while the second captures a much smaller share of the variance, indicating that it is less important for understanding the structure of the data. This example illustrates how PCA identifies the directions of maximum variance and lets us reduce dimensionality while retaining as much information as possible. By projecting the data onto the first principal component, we can analyze the most significant patterns in the data with a simpler representation.
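The whole worked example can be verified with NumPy; the sketch below uses exact thirds for the centered bedroom column instead of the rounded \(-0.33/0.67\) values, and checks that the variance of each set of projections equals the corresponding eigenvalue:

```python
import numpy as np

# End-to-end check of the housing PCA example.
Z = np.array([[-5.0, -1/3],
              [ 5.0,  2/3],
              [ 0.0, -1/3]])              # centred housing data

C = Z.T @ Z / 3                           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                      # first principal component
p = Z @ pc1                               # 1-D projections onto PC1

# Variance of the projections equals the corresponding eigenvalue.
assert np.isclose(p.var(), eigvals[-1])
q = Z @ eigvecs[:, 0]                     # projections onto PC2
assert np.isclose(q.var(), eigvals[0])
print(eigvals)                            # approx [0.0550, 16.8339]
```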

    Limitations of PCA

While PCA is a powerful tool for dimensionality reduction and data analysis, it has several limitations: it captures only linear structure in the data, it is sensitive to the relative scaling of the features, the directions of highest variance are not necessarily the most meaningful for a downstream task, and each principal component mixes all original features, which can make the components hard to interpret.

    Density Estimation: GMM

In this section we will discuss the third pillar of machine learning: density estimation. The goal of density estimation is to model the underlying probability distribution of the data. One common approach is to use a Gaussian Mixture Model (GMM), which assumes that the data is generated from a mixture of several Gaussian distributions. Each Gaussian component represents a cluster in the data, and the overall model can capture complex, multimodal distributions. We will follow the notation and example from the book Mathematics for Machine Learning by Deisenroth, Faisal, and Ong. In that example the dataset is one-dimensional, but the concepts extend readily to higher dimensions. The dataset consists of the following points: \[-3,-2.5,-1,0,2,4,5.\] The first step in fitting a GMM to this data is to choose the number of Gaussian components, which we will denote by \(K\). For this example, we will choose \(K=3\). We initialize the mixture components with the following three distributions \[p_1(x) = \mathcal{N}(x; -4,1), \quad p_2(x) = \mathcal{N}(x; 0,0.2), \quad p_3(x) = \mathcal{N}(x; 8,3),\] and assign equal weights to each component: \[\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}.\]

Now the general GMM density looks like this: \[p(x \mid \theta) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \Sigma_k),\] where \(\theta\) represents the parameters of the model, including the means \(\mu_k\), covariances \(\Sigma_k\), and mixture weights \(\pi_k\). That is, \[\theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \dots, K\}.\]

    Now the parameters of the GMM are learned through Maximum Likelihood Estimation (MLE), which involves maximizing the likelihood of the observed data under the model. The likelihood function for a GMM is given by: \[L(\theta) = p( \mathcal{X}|\theta)= \prod_{n=1}^N p(x_n|\theta) = \prod_{n=1}^N \sum_{k=1}^K \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k).\] Maximizing this likelihood directly can be challenging due to the presence of the summation inside the product. Instead, we typically use the Expectation-Maximization (EM) algorithm, which iteratively updates the parameters of the GMM to find a local maximum of the likelihood function. The EM algorithm consists of two main steps: the E-step (Expectation step) and the M-step (Maximization step). In the E-step, we compute the responsibilities, which are the probabilities that each data point belongs to each Gaussian component. In the M-step, we update the parameters of the GMM based on these responsibilities. This process is repeated until convergence, at which point we have a fitted GMM that models the underlying distribution of the data. Taking logarithm of the likelihood function gives us the log-likelihood, which is often easier to work with: \[\mathcal{L}:=\log L(\theta) = \sum_{n=1}^N \log \left( \sum_{k=1}^K \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k) \right).\] The EM algorithm maximizes this, which is not easy because it has a log of a sum. The closed form solution is not available, unlike in the case of single Gaussian distribution or PCA. The EM algorithm provides an efficient way to find a local maximum of the log-likelihood function, allowing us to fit a GMM to the data even when the likelihood function is complex and does not have a closed-form solution. The EM algorithm is guaranteed to converge to a local maximum of the log-likelihood function, but it may not find the global maximum. 
Therefore, it is often recommended to run the EM algorithm multiple times with different initializations to increase the chances of finding a better solution. Additionally, the choice of \(K\) (the number of components) can significantly affect the performance of the GMM, and methods such as cross-validation or information criteria (e.g., AIC, BIC) can be used to select an appropriate value for \(K\). To find a maximum of the log-likelihood function, we set the three partial derivatives with respect to the parameters \(\mu_k\), \(\Sigma_k\), and \(\pi_k\) to zero: \[\frac{\partial \mathcal{L}}{\partial \mu_k} = 0, \quad \frac{\partial \mathcal{L}}{\partial \Sigma_k} = 0, \quad \frac{\partial \mathcal{L}}{\partial \pi_k} = 0,\] where for \(\pi_k\) the constraint \(\sum_{k=1}^K \pi_k = 1\) is enforced with a Lagrange multiplier (which turns out to equal \(-N\)). These conditions are equivalent to the following equations: \[\sum_{n=1}^N \frac{\pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} (x_n - \mu_k) = 0,\] \[\sum_{n=1}^N \frac{\pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} \left( (x_n - \mu_k)(x_n - \mu_k)^T - \Sigma_k \right) = 0,\] \[\sum_{n=1}^N \frac{\mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} - N = 0.\] These equations arise from setting the gradients of the log-likelihood function with respect to the parameters to zero, a necessary condition for a maximum. The first equation updates the means \(\mu_k\), the second equation updates the covariances \(\Sigma_k\), and the third equation updates the mixture weights \(\pi_k\). Let us look at the detailed derivation, starting with the first equation for updating the means \(\mu_k\). The log-likelihood function is given by: \[\mathcal{L} = \sum_{n=1}^N \log \left( \sum_{k=1}^K \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k) \right).\] To find the gradient with respect to \(\mu_k\), we use the chain rule of differentiation.
Let us denote: \[r_{nk} = \frac{\pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)},\] which represents the responsibility of component \(k\) for data point \(n\). Note that the responsibilities are normalized: \(\sum_{k=1}^K r_{nk} = 1\) for every \(n\). Now, we can compute the gradient with respect to \(\mu_k\): \[\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^N \frac{1}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n; \mu_j, \Sigma_j)} \cdot \pi_k \mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \frac{\partial \log \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\partial \mu_k}.\] The derivative of the log of a Gaussian distribution with respect to its mean is given by: \[\frac{\partial \log \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\partial \mu_k} = \Sigma_k^{-1} (x_n - \mu_k).\] A quick derivation of this is as follows: \[\log \mathcal{N}(x_n; \mu_k, \Sigma_k) = -\frac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \text{constant}.\] Taking the derivative with respect to \(\mu_k\) gives: \[\frac{\partial \log \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\partial \mu_k} = -\frac{1}{2} \left( -2 \Sigma_k^{-1} (x_n - \mu_k) \right) = \Sigma_k^{-1} (x_n - \mu_k).\]

    Substituting this back into the gradient expression gives: \[\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^N r_{nk} \Sigma_k^{-1} (x_n - \mu_k).\] Setting this gradient to zero for maximization and multiplying through by \(\Sigma_k\) (which is invertible) yields the update equation for the means: \[\sum_{n=1}^N r_{nk} (x_n - \mu_k) = 0 \implies \mu_k = \frac{\sum_{n=1}^N r_{nk} x_n}{\sum_{n=1}^N r_{nk}}.\] This equation indicates that the new mean \(\mu_k\) is a weighted average of the data points, where the weights are given by the responsibilities \(r_{nk}\). The more a data point belongs to component \(k\), the more it influences the new mean of that component.
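The responsibility computation and the mean update above can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not part of the derivation: it uses isotropic unit-covariance Gaussians (\(\Sigma_k = I\)), plain lists instead of a linear-algebra library, and made-up toy data.

```python
import math

def gaussian_pdf(x, mu):
    # Isotropic Gaussian with identity covariance (a simplifying assumption).
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-0.5 * sq) / (2 * math.pi) ** (d / 2)

def responsibilities(X, pis, mus):
    # r[n][k] = pi_k N(x_n; mu_k) / sum_j pi_j N(x_n; mu_j)
    R = []
    for x in X:
        weighted = [p * gaussian_pdf(x, m) for p, m in zip(pis, mus)]
        total = sum(weighted)
        R.append([w / total for w in weighted])
    return R

def update_means(X, R):
    # mu_k = (sum_n r_nk x_n) / (sum_n r_nk), the update derived above.
    K, d = len(R[0]), len(X[0])
    mus = []
    for k in range(K):
        Nk = sum(R[n][k] for n in range(len(X)))
        mus.append([sum(R[n][k] * X[n][i] for n in range(len(X))) / Nk
                    for i in range(d)])
    return mus

# Toy data: two well-separated clusters.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
pis = [0.5, 0.5]
mus = [[0.0, 0.0], [5.0, 5.0]]
R = responsibilities(X, pis, mus)
new_mus = update_means(X, R)
```

With these well-separated clusters, each component's new mean lands essentially at the centroid of its own cluster, as the weighted-average formula predicts.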

    Support Vector Machines (SVM)

    The fourth pillar in Machine Learning is classification, and one of the most powerful algorithms for classification is the Support Vector Machine (SVM). SVM is geometrically very intuitive and is based on the idea of finding a hyperplane that best separates the classes in the feature space. Recall that the equation of a straight line in two dimensions is given by: \[y = mx + b,\] but this notation is not very convenient for SVM, as we will work in higher dimensions. Instead, we can rewrite this equation in a more general form as: \[w^T x + b = 0,\] where \[w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}\] is a weight vector that is perpendicular to the hyperplane, \[x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix}\] is a point in the feature space of dimension \(D\), and \(b\) is a bias term that shifts the hyperplane away from the origin. Now let us recall how to find the distance of a point from a line in two dimensions. The distance \(d\) from a point \((x_0, y_0)\) to the line \(y = mx + b\) can be calculated using the formula: \[d = \frac{|mx_0 - y_0 + b|}{\sqrt{m^2 + 1}}.\] This formula is specific to two dimensions, so we derive a more general formula for the distance from a point to a hyperplane in \(D\)-dimensional space. In higher dimensions, the distance from a point \(x_0\) to the hyperplane defined by \(w^T x + b = 0\) is: \[d = \frac{|w^T x_0 + b|}{\|w\|},\] where \(\|w\|\) is the Euclidean norm of the weight vector \(w\). This distance is crucial for SVM, which aims to maximize it for the closest points in the training data, known as support vectors. By maximizing the margin (the distance from the support vectors to the hyperplane), SVM seeks the hyperplane that best separates the classes in the feature space, leading to better generalization performance on unseen data.
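The distance formula can be checked numerically. A minimal sketch in plain Python (the function name and the example hyperplane are illustrative, not from the text):

```python
import math

def distance_to_hyperplane(w, b, x0):
    # d = |w^T x0 + b| / ||w||
    dot = sum(wi * xi for wi, xi in zip(w, x0))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# In 2-D, w = (3, 4), b = -25 describes the line 3x + 4y = 25;
# the origin is at distance |{-25}| / 5 = 5 from it.
d = distance_to_hyperplane([3.0, 4.0], -25.0, [0.0, 0.0])
```

The same function works unchanged for any dimension \(D\), which is exactly why the \(w^T x + b\) notation is preferred over \(y = mx + b\).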
As you can see, there are two goals hidden in "maximize this distance for the closest points in the training data". The first goal is to find the closest point to the hyperplane, which is a support vector. The second goal is to maximize the distance from this support vector to the hyperplane, i.e., to maximize the margin. The margin is the distance between the support vector and the hyperplane, and it measures how well the classes are separated: a larger margin indicates better separation and typically leads to better generalization performance on unseen data. Mathematically, this can be formulated as an optimization problem: \[\begin{aligned} &\text{maximize} \quad r \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq r, \quad \forall n \end{aligned}\] where \(y_n \in \{-1, 1\}\) is the class label of the \(n\)-th training point, \(x_n\) is its feature vector, and \(r > 0\) is the margin (more precisely, a lower bound on the quantity \(y_n (w^T x_n + b)\) across all training points). This optimization problem can be solved using techniques from convex optimization, and it leads to the formulation of the SVM algorithm. The solution gives us the optimal weight vector \(w\) and bias term \(b\) that define the maximum-margin hyperplane, and new data points are classified based on which side of the hyperplane they fall on. The product \(y_n (w^T x_n + b)\) is always positive for correctly classified points, and the margin \(r\) is the minimum value of this product across all training points. Why is this?
Because the labels take values in \(\{-1, 1\}\), the product \(y_n (w^T x_n + b)\) is positive exactly when the point is correctly classified (i.e., when \(y_n = 1\) and \(w^T x_n + b > 0\), or \(y_n = -1\) and \(w^T x_n + b < 0\)). The margin is then determined by the training point closest to the hyperplane, and by maximizing it, SVM seeks the best possible separation between the classes, which is a key factor in achieving good generalization performance on unseen data. Equivalently, the problem can be formulated as maximizing the minimum of the distances from the training points to the hyperplane. So the margin is the distance of the closest point to the hyperplane \(w^T x + b = 0\), given by the formula: \[\text{Margin} = \min_n \left\{ \frac{|w^T x_n + b|}{\|w\|} \right\}.\] But this involves a modulus, which is not convenient for optimization. Instead, we can rewrite it using the signed distance: \[\text{Margin} = \min_n \left\{ \frac{y_n (w^T x_n + b)}{\|w\|} \right\}.\] Now suppose that the quantity \(y_n (w^T x_n + b)\) is greater than or equal to \(r\) for all \(n\); then the margin is at least \(\frac{r}{\|w\|}\). Therefore, maximizing the margin is equivalent to maximizing \(\frac{r}{\|w\|}\), which (for a fixed \(r\)) is equivalent to minimizing \(\|w\|\) subject to the constraint that \(y_n (w^T x_n + b) \geq r\) for all \(n\).
This leads us to the following optimization problem: \[\begin{aligned} &\text{minimize} \quad \|w\| \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq r, \quad \forall n \end{aligned}\] This can also be written as a combined max-min optimization problem: \[\max_{w, b} \min_n \frac{y_n (w^T x_n + b)}{\|w\|}.\] Now note that if we scale the parameters \(w\) and \(b\) by a positive constant \(\alpha\), the decision boundary defined by \(w^T x + b = 0\) remains unchanged. This is because the equation can be rewritten as: \[\alpha w^T x + \alpha b = 0 \implies w^T x + b = 0.\] This invariance property is crucial for the optimization problem, as it allows us to focus on finding the optimal \(w\) and \(b\) without worrying about their scale. Also, if you substitute \(w' = \alpha w\) and \(b' = \alpha b\) into the margin expression, you will see that the margin is invariant to this scaling: \[\begin{aligned} \text{Margin} &= \min_{n}\frac{y_n (w'^T x_n + b')}{\|w'\|} = \min_{n}\frac{y_n (\alpha w^T x_n + \alpha b)}{\|\alpha w\|} \\[4pt] &= \min_{n}\frac{\alpha\, y_n (w^T x_n + b)}{\alpha \|w\|} = \min_{n}\frac{y_n (w^T x_n + b)}{\|w\|}. \end{aligned}\] So we choose \(\alpha\) such that the margin becomes \(\frac{1}{\|w'\|}\), i.e., such that the closest points satisfy \(y_n (w'^T x_n + b') = 1\). This is equivalent to setting \(r = 1\) in the optimization problem, which simplifies it to: \[\begin{aligned} &\text{minimize} \quad \|w\| \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] The above formulation is called the hard-margin SVM, which assumes that the data is linearly separable. It is usually rewritten with the squared norm and a coefficient \(\frac{1}{2}\), which has the same minimizer but is differentiable everywhere: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] But before applying Lagrange multipliers, let us acknowledge that real-world data is often not linearly separable, which means we need to allow for some misclassifications.
This leads us to the soft-margin SVM formulation, which introduces slack variables \(\xi_n\) to account for these misclassifications: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{n} \xi_n \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1 - \xi_n, \quad \forall n \\ &\xi_n \geq 0, \quad \forall n \end{aligned}\] where \(C\) is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error. The slack variables allow some margin constraints to be violated, and the optimization seeks the best balance between a large margin and few violations. This formulation is more flexible and can handle data that is not perfectly linearly separable, making it more applicable to real-world datasets. The slack variables work as follows: if a data point is correctly classified and lies outside the margin, then \(\xi_n = 0\); if it is correctly classified but lies within the margin, then \(0 < \xi_n < 1\); and if it is misclassified, then \(\xi_n > 1\). Minimizing the sum of the slack variables encourages the model to find a hyperplane that not only maximizes the margin but also keeps violations small, and by adjusting \(C\) we control how heavily violations are penalized relative to the margin term.

    A natural question: which points are the support vectors in the soft-margin case? The support vectors are the points whose margin constraints are active, i.e., those with \(y_n (w^T x_n + b) \leq 1\). They come in two kinds: points lying exactly on the margin with \(\xi_n = 0\), whose multipliers satisfy \(0 < \alpha_n < C\) in the soft-margin dual (where the multipliers obey the box constraint \(0 \leq \alpha_n \leq C\)), and points inside the margin or misclassified with \(\xi_n > 0\), whose multipliers sit at the bound \(\alpha_n = C\). In particular, every point with \(\xi_n > 0\) is a support vector. Points that are correctly classified and lie strictly outside the margin have \(\xi_n = 0\) and \(\alpha_n = 0\), and they do not influence the solution at all. The support vectors are the critical elements of the SVM model, as they are the data points that define the position and orientation of the hyperplane.
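The three regimes of the slack variable follow from \(\xi_n = \max(0,\, 1 - y_n(w^T x_n + b))\), the smallest slack satisfying the constraint. A small illustrative sketch (the hyperplane and points are made up for the example, not taken from the text):

```python
def slack(w, b, x, y):
    # Smallest xi satisfying y (w^T x + b) >= 1 - xi, with xi >= 0.
    f = sum(wi * ci for wi, ci in zip(w, x)) + b
    return max(0.0, 1.0 - y * f)

w, b = [1.0, 0.0], 0.0            # decision boundary: first coordinate = 0
cases = [
    ([2.0, 0.0], 1),   # correctly classified, outside the margin: xi = 0
    ([0.5, 0.0], 1),   # correctly classified, inside the margin: 0 < xi < 1
    ([-1.0, 0.0], 1),  # misclassified: xi > 1
]
xis = [slack(w, b, x, y) for x, y in cases]
```

The computed slacks are 0, 0.5, and 2 respectively, matching the three regimes described above.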

    Before solving the optimization problem using Lagrange multipliers, let us also note that SVM can be extended to handle non-linear decision boundaries by using kernel functions. The kernel trick allows us to implicitly map the input data into a higher-dimensional feature space where a linear separation is possible, without explicitly computing the coordinates of the data in that space. This is achieved by replacing the dot product in the optimization problem with a kernel function \(K(x_i, x_j)\), which computes the similarity between data points in the original feature space. Common kernel functions include the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. By using kernels, SVM can effectively capture complex relationships in the data and find non-linear decision boundaries, making it a powerful tool for classification tasks in various domains. But where does the dot product come from in the optimization problem? To answer this, let us give a detailed derivation of the dual form of the SVM optimization problem using Lagrange multipliers. The primal optimization problem for the hard-margin SVM is given by: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] To derive the dual form of this optimization problem, we introduce Lagrange multipliers \(\alpha_n \geq 0\) for each constraint.
The Lagrangian function is defined as: \[L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n [y_n (w^T x_n + b) - 1].\] Taking the partial derivatives of the Lagrangian with respect to \(w\) and \(b\) and setting them to zero gives us the following conditions: \[\frac{\partial L}{\partial w} = w - \sum_{n=1}^N \alpha_n y_n x_n = 0 \implies w = \sum_{n=1}^N \alpha_n y_n x_n,\] \[\frac{\partial L}{\partial b} = -\sum_{n=1}^N \alpha_n y_n = 0 \implies \sum_{n=1}^N \alpha_n y_n = 0.\] Substituting \(w = \sum_{n=1}^N \alpha_n y_n x_n\) back into the Lagrangian, the full derivation is: \[\begin{aligned} L(w, b, \alpha) &= \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n [y_n (w^T x_n + b) - 1] \\ &= \frac{1}{2} \left\| \sum_{n=1}^N \alpha_n y_n x_n \right\|^2 - \sum_{n=1}^N \alpha_n [y_n ((\sum_{m=1}^N \alpha_m y_m x_m)^T x_n + b) - 1] \\ &= \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m - \sum_{n=1}^N \alpha_n [y_n ((\sum_{m=1}^N \alpha_m y_m x_m)^T x_n + b) - 1] \\ &= \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m - \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_m^T x_n - b \sum_{n=1}^N \alpha_n y_n + \sum_{n=1}^N \alpha_n. \end{aligned}\] Now, since \(\sum_{n=1}^N \alpha_n y_n = 0\), the term involving \(b\) vanishes. The remaining double sum from the constraint term is exactly the expansion of \(\|w\|^2\), so it combines with the \(\frac{1}{2}\|w\|^2\) term to leave \(-\frac{1}{2}\sum_{n}\sum_{m} \alpha_n \alpha_m y_n y_m x_n^T x_m\). We are left with the dual problem: \[\begin{aligned} &\text{maximize} \quad \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m \\ &\text{subject to} \quad \alpha_n \geq 0, \quad \forall n, \quad \text{and} \\ & \sum_{n=1}^N \alpha_n y_n = 0. \end{aligned}\] Why did the minimization suddenly become a maximization? Because we are now maximizing the dual function, which is derived from the Lagrangian: \[g(\alpha) = \inf_{w, b} L(w, b, \alpha).\] Since we are maximizing the dual function with respect to \(\alpha\), the optimization problem is now a maximization problem. Now we use the famous Karush–Kuhn–Tucker (KKT) conditions to find the optimal solution for \(\alpha\).
The KKT conditions state that at the optimal solution, the following four must hold:

    1. Primal feasibility: \(y_n (w^T x_n + b) \geq 1\) for all \(n\).

    2. Dual feasibility: \(\alpha_n \geq 0\) for all \(n\).

    3. Complementary slackness: \(\alpha_n [y_n (w^T x_n + b) - 1] = 0\) for all \(n\).

    4. Stationarity: \(\nabla_w L = 0\) and \(\nabla_b L = 0\).

    Let us describe what each of these conditions means in the context of SVM:

    1. Primal feasibility ensures that the constraints of the original optimization problem are satisfied, meaning that all data points are correctly classified with a margin of at least 1.

    2. Dual feasibility ensures that the Lagrange multipliers are non-negative, which is a requirement of the dual problem.

    3. Complementary slackness means that for each data point, either the Lagrange multiplier is zero (the constraint is not active) or the constraint is satisfied with equality (the point lies on the margin).

    4. Stationarity ensures that the gradients of the Lagrangian with respect to the primal variables are zero, which is necessary for optimality.

    By applying these KKT conditions, we can find the optimal values of \(\alpha\) that maximize the dual function, which in turn gives us the optimal weight vector \(w\) and bias term \(b\) for the SVM model. The support vectors correspond to the data points for which \(\alpha_n > 0\): these are the points that lie on the margin and define the decision boundary. The final decision function for classifying new data points can be expressed in terms of the support vectors and their corresponding Lagrange multipliers. Below is a fully worked example of SVM on a small dataset. Consider the following dataset with two classes: \[\text{Class 1: } (1, 2), (2, 3), (3, 1)\] \[\text{Class 2: } (5, 6), (6, 7), (7, 5)\] We want to find the optimal hyperplane that separates these two classes using SVM. Plotting the data points, the points from Class 1 cluster around the lower left corner and the points from Class 2 around the upper right corner, so there is a clear separation between the two classes, and we can find a hyperplane that separates them.
Next, we can set up the optimization problem for SVM. We will use the hard-margin SVM formulation, which assumes that the data is linearly separable. The optimization problem is: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] where \(y_n\) is the class label for each data point (1 for Class 1 and -1 for Class 2), \(x_n\) is the feature vector for each data point, and \(w\) and \(b\) are the parameters of the hyperplane we want to find. We can assign the class labels as follows: \[\text{Class 1: } (1, 2) \rightarrow 1, (2, 3) \rightarrow 1, (3, 1) \rightarrow 1\] \[\text{Class 2: } (5, 6) \rightarrow -1, (6, 7) \rightarrow -1, (7, 5) \rightarrow -1\] Now we can set up the Lagrangian for this optimization problem and derive the dual form using Lagrange multipliers, as we discussed earlier. After solving the dual optimization problem, we can find the optimal values of \(\alpha\), which will allow us to compute the optimal weight vector \(w\) and bias term \(b\). The support vectors will be the data points for which \(\alpha_n > 0\), and these points will define the decision boundary of the SVM. Finally, we can use the learned model to classify new data points based on which side of the hyperplane they fall on, allowing us to make predictions for unseen data. This example illustrates how SVM can be applied to a simple dataset to find the optimal hyperplane that separates two classes, and how the optimization problem is formulated and solved using Lagrangian multipliers and the KKT conditions. In practice, we would typically use a software library such as scikit-learn in Python to implement SVM, which provides efficient algorithms for solving the optimization problem and allows us to easily apply SVM to larger and more complex datasets. 
Additionally, we can experiment with different kernel functions to capture non-linear relationships in the data, further enhancing the flexibility and performance of the SVM model. But let us do the calculation by hand for this small dataset to find the optimal hyperplane, working out every detail: the values of the Lagrange multipliers, the weight vector, the bias term, the support vectors, and the decision function for classifying new data points. The steps are as follows: 1. Assign class labels: \[\text{Class 1: } (1, 2) \rightarrow 1, (2, 3) \rightarrow 1, (3, 1) \rightarrow 1\] \[\text{Class 2: } (5, 6) \rightarrow -1, (6, 7) \rightarrow -1, (7, 5) \rightarrow -1\] 2. Set up the optimization problem: \[\begin{aligned} &\text{minimize} \quad \frac{1}{2}\|w\|^2 \\ &\text{subject to} \quad y_n (w^T x_n + b) \geq 1, \quad \forall n \end{aligned}\] 3. Introduce Lagrange multipliers \(\alpha_n\) for each constraint and set up the Lagrangian: \[L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n [y_n (w^T x_n + b) - 1].\] 4. Take the partial derivatives of the Lagrangian with respect to \(w\) and \(b\) and set them to zero to find the conditions for optimality: \[\frac{\partial L}{\partial w} = w - \sum_{n=1}^N \alpha_n y_n x_n = 0 \implies w = \sum_{n=1}^N \alpha_n y_n x_n,\] \[\frac{\partial L}{\partial b} = -\sum_{n=1}^N \alpha_n y_n = 0 \implies \sum_{n=1}^N \alpha_n y_n = 0.\] 5.
Substitute the expression for \(w\) back into the Lagrangian to derive the dual optimization problem: \[\begin{aligned}&\text{maximize} \quad \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m \\ &\text{subject to} \quad \alpha_n \geq 0, \quad \forall n, \quad \text{and} \\ & \sum_{n=1}^N \alpha_n y_n = 0. \end{aligned}\] 6. Solve the dual optimization problem for this concrete dataset. Introduce unknowns \(\alpha_1,\alpha_2,\alpha_3,\alpha_4,\alpha_5,\alpha_6\) corresponding to the six data points (in the same order as listed). The dual QP is

    \[\begin{aligned} \text{maximize}\quad & \alpha_1+\alpha_2+\alpha_3+\alpha_4+\alpha_5+\alpha_6 \\ &\quad -\tfrac{1}{2}\sum_{i=1}^6\sum_{j=1}^6 \alpha_i \alpha_j y_i y_j \, (x_i^\top x_j) ,\\[4pt] \text{subject to}\quad & \alpha_i \ge 0 \quad\text{for } i=1,\dots,6,\\ & \alpha_1+\alpha_2+\alpha_3 - (\alpha_4+\alpha_5+\alpha_6)=0, \end{aligned}\] where the labels are \(y_1=y_2=y_3=1\) and \(y_4=y_5=y_6=-1\), and the Gram (dot-product) matrix \(G\) with entries \(G_{ij}=x_i^\top x_j\) is

    \[G=\begin{bmatrix} 5 & 8 & 5 & 17 & 20 & 17\\ 8 & 13 & 9 & 28 & 33 & 29\\ 5 & 9 & 10 & 21 & 25 & 26\\ 17 & 28 & 21 & 61 & 72 & 65\\ 20 & 33 & 25 & 72 & 85 & 77\\ 17 & 29 & 26 & 65 & 77 & 74 \end{bmatrix}.\]

    Thus the quadratic term is fully specified by the numbers above: the coefficient multiplying \(\alpha_i\alpha_j\) is \(-\tfrac{1}{2}y_i y_j G_{ij}\).
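The Gram matrix above can be reproduced directly from the six data points. A quick sketch in plain Python:

```python
points = [(1, 2), (2, 3), (3, 1), (5, 6), (6, 7), (7, 5)]

def gram(xs):
    # G[i][j] = x_i . x_j, the pairwise dot products.
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in xs] for xi in xs]

G = gram(points)
```

Printing `G` reproduces the 6-by-6 matrix shown above; note that it is symmetric, as any Gram matrix must be.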

    Once the QP is solved numerically (e.g. with a quadratic-program solver), the primal parameters are recovered as

    \[w \;=\; \sum_{i=1}^6 \alpha_i y_i x_i = \begin{bmatrix} \alpha_1\cdot 1 + \alpha_2\cdot 2 + \alpha_3\cdot 3 - \alpha_4\cdot 5 - \alpha_5\cdot 6 - \alpha_6\cdot 7 \\[4pt] \alpha_1\cdot 2 + \alpha_2\cdot 3 + \alpha_3\cdot 1 - \alpha_4\cdot 6 - \alpha_5\cdot 7 - \alpha_6\cdot 5 \end{bmatrix},\] and any index \(i\) with \(\alpha_i>0\) (a support vector) gives \(b\) via \[b \;=\; y_i - w^\top x_i.\]
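The recovery of \(w\) and \(b\) from any multiplier vector \(\alpha\) is mechanical. A minimal sketch in plain Python (here \(\alpha\) is a placeholder input supplied by a solver, and the toy data at the bottom is made up to exercise the functions):

```python
def recover_w(alphas, ys, xs):
    # w = sum_i alpha_i y_i x_i
    d = len(xs[0])
    return [sum(a * y * x[i] for a, y, x in zip(alphas, ys, xs))
            for i in range(d)]

def recover_b(w, x_sv, y_sv):
    # b = y_s - w^T x_s, valid for any support vector (alpha_s > 0).
    return y_sv - sum(wi * xi for wi, xi in zip(w, x_sv))

# Toy check: two symmetric points, both support vectors with alpha = 0.5.
w = recover_w([0.5, 0.5], [1, -1], [(1.0, 0.0), (-1.0, 0.0)])
b = recover_b(w, (1.0, 0.0), 1)
```

In practice one averages `recover_b` over all support vectors for numerical stability, as noted later in the KKT discussion.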

    Numerical Solution of the Dual QP

    Why the dual problem has a clean solution here.

    By symmetry, the two classes are well separated, and it is natural to guess that each class contributes exactly one support vector. The closest pair of points across the two classes is \(x_2=(2,3)\) from Class 1 and \(x_4=(5,6)\) from Class 2 (at distance \(\sqrt{18}=3\sqrt{2}\)), so suppose these are the support vectors: \(\alpha_2=\alpha_4=\alpha>0\) and all other \(\alpha_i=0\). The constraint \(\sum \alpha_i y_i=0\) becomes \(\alpha - \alpha = 0\), which is satisfied for any \(\alpha\).

    Step 1: Compute the objective for this ansatz.

    The dual objective reduces to \[W(\alpha) = \alpha_2 + \alpha_4 - \tfrac{1}{2}\bigl(\alpha_2^2 y_2^2 G_{22} + 2\alpha_2\alpha_4 y_2 y_4 G_{24} + \alpha_4^2 y_4^2 G_{44}\bigr).\] Reading off the Gram matrix entries: \[G_{22} = x_2^\top x_2 = 2^2+3^2 = 13,\quad G_{44} = x_4^\top x_4 = 5^2+6^2 = 61,\quad G_{24} = x_2^\top x_4 = 2\cdot5+3\cdot6 = 28.\] With \(y_2=1\), \(y_4=-1\): \[W(\alpha) = 2\alpha - \tfrac{1}{2}\bigl(\alpha^2\cdot 13 + 2\alpha^2\cdot(-1)\cdot 28 + \alpha^2\cdot 61\bigr) = 2\alpha - \tfrac{1}{2}\alpha^2(13 - 56 + 61) = 2\alpha - 9\alpha^2.\]

    Step 2: Maximize over \(\alpha\).

    \[\frac{dW}{d\alpha} = 2 - 18\alpha = 0 \implies \alpha = \frac{1}{9}.\]

    Step 3: Recover \(w\).

    \[w = \alpha_2 y_2 x_2 + \alpha_4 y_4 x_4 = \frac{1}{9}\Bigl((1)\begin{bmatrix}2\\3\end{bmatrix} + (-1)\begin{bmatrix}5\\6\end{bmatrix}\Bigr) = \frac{1}{9}\begin{bmatrix}-3\\-3\end{bmatrix} = \begin{bmatrix}-1/3\\-1/3\end{bmatrix}.\]

    Step 4: Recover \(b\).

    Using support vector \(x_2=(2,3)\) with \(y_2=1\): \[b = y_2 - w^\top x_2 = 1 - \Bigl(\tfrac{-1}{3}\cdot 2 + \tfrac{-1}{3}\cdot 3\Bigr) = 1 + \tfrac{5}{3} = \tfrac{8}{3}.\] Verification with support vector \(x_4=(5,6)\), \(y_4=-1\): \[b = -1 - w^\top x_4 = -1 - \Bigl(\tfrac{-1}{3}\cdot 5 + \tfrac{-1}{3}\cdot 6\Bigr) = -1 + \tfrac{11}{3} = \tfrac{8}{3}.\quad\checkmark\]

    Step 5: Decision function.

    The learned hyperplane is \[w^\top x + b = 0 \quad\Longrightarrow\quad -\tfrac{1}{3}x_1 - \tfrac{1}{3}x_2 + \tfrac{8}{3} = 0 \quad\Longrightarrow\quad x_1 + x_2 = 8.\] Classification of a new point \(x\): \[\hat{y} = \operatorname{sign}(w^\top x + b) = \operatorname{sign}\!\Bigl(\tfrac{8 - x_1 - x_2}{3}\Bigr).\]

    Step 6: Verify the margin constraints.

    The margin width is \(\dfrac{2}{\|w\|}\). We have \[\|w\| = \sqrt{\Bigl(\tfrac{1}{3}\Bigr)^2+\Bigl(\tfrac{1}{3}\Bigr)^2} = \frac{\sqrt{2}}{3},\] so the margin is \(\dfrac{2}{\sqrt{2}/3} = 3\sqrt{2} \approx 4.24\), exactly the distance between the two support vectors, as expected.

    Check that all six points satisfy \(y_n(w^\top x_n+b)\ge 1\). Since \(w^\top x + b = \tfrac{1}{3}(8 - x_1 - x_2)\): \[\begin{aligned} x_1=(1,2),\;y=1:&\quad \tfrac{1}{3}(8-3)=\tfrac{5}{3}\ge 1, \\ x_2=(2,3),\;y=1:&\quad \tfrac{1}{3}(8-5)=1 \text{ (active: support vector)}, \\ x_3=(3,1),\;y=1:&\quad \tfrac{1}{3}(8-4)=\tfrac{4}{3}\ge 1, \\ x_4=(5,6),\;y=-1:&\quad -\tfrac{1}{3}(8-11)=1 \text{ (active: support vector)}, \\ x_5=(6,7),\;y=-1:&\quad -\tfrac{1}{3}(8-13)=\tfrac{5}{3}\ge 1, \\ x_6=(7,5),\;y=-1:&\quad -\tfrac{1}{3}(8-12)=\tfrac{4}{3}\ge 1. \end{aligned}\] All constraints hold, with equality exactly at \(x_2\) and \(x_4\), so the KKT conditions are satisfied and this solution is the global optimum. (Had we instead guessed the pair \(x_3=(3,1)\) and \(x_4=(5,6)\), the resulting hyperplane \(4x_1+10x_2=51\) would give \(y_1(w^\top x_1+b)=\tfrac{27}{29}<1\), violating primal feasibility, so that ansatz is infeasible and cannot be optimal.)

    Summary of the SVM solution for this dataset.

    \[\boxed{ w = \begin{bmatrix}-1/3\\-1/3\end{bmatrix},\quad b = \frac{8}{3},\quad \text{Support vectors: }(2,3)\text{ and }(5,6),\quad \text{Margin} = 3\sqrt{2}. }\]
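The two-support-vector ansatz can be validated numerically without a QP solver: enumerate candidate pairs (one point per class), construct the hyperplane that places both exactly on the margins, and keep the feasible candidate with the largest margin. A plain-Python sketch (this brute force is only valid when two support vectors suffice, which is an assumption checked by the feasibility test itself):

```python
import math

pos = [(1, 2), (2, 3), (3, 1)]   # class +1
neg = [(5, 6), (6, 7), (7, 5)]   # class -1

def two_sv_candidate(p, q):
    # Hyperplane with p and q exactly on the margins:
    # w = c (p - q) with c = 2 / ||p - q||^2, and b = 1 - w . p.
    diff = [a - b for a, b in zip(p, q)]
    c = 2.0 / sum(d * d for d in diff)
    w = [c * d for d in diff]
    b = 1.0 - sum(wi * pi for wi, pi in zip(w, p))
    return w, b

def feasible(w, b):
    f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
    return (all(f(x) >= 1 - 1e-9 for x in pos)
            and all(f(x) <= -1 + 1e-9 for x in neg))

best = None
for p in pos:
    for q in neg:
        w, b = two_sv_candidate(p, q)
        if feasible(w, b):
            margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
            if best is None or margin > best[0]:
                best = (margin, p, q, w, b)
```

Only the pair (2, 3), (5, 6) passes the feasibility check, confirming the hand calculation; every other pairing violates at least one margin constraint.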

    Mathematical Foundations: Quadratic Programming, KKT Conditions, and the Gram Matrix

    1. Quadratic Programming (QP)

    A quadratic program is an optimization problem of the form \[\begin{aligned} &\text{minimize} \quad \frac{1}{2} z^\top Q z + c^\top z \\ &\text{subject to} \quad Az \leq b, \quad Cz = d, \end{aligned}\] where \(z \in \mathbb{R}^n\) is the decision variable, \(Q \in \mathbb{R}^{n \times n}\) is a symmetric matrix, \(c \in \mathbb{R}^n\), and \(A\), \(C\), \(b\), \(d\) are matrices and vectors defining the inequality and equality constraints respectively.

    Why QP matters for SVM.

    The SVM dual problem is a QP: the objective is quadratic in the Lagrange multipliers \(\alpha\), the constraints are linear, and \(Q\) is positive semidefinite, making the problem convex. Convex QPs have a global optimum (unique when \(Q\) is positive definite) and can be solved efficiently by dedicated solvers (e.g. CVXOPT, OSQP, LIBSVM).

    Convexity.

    A QP is convex if \(Q \succeq 0\) (positive semidefinite). In SVM, the matrix \(Q\) has entries \(Q_{ij} = y_i y_j x_i^\top x_j\), which is a Gram matrix (see below) scaled by labels, and is therefore positive semidefinite. This guarantees that the SVM dual is a convex problem, so any local maximum is a global maximum.

    2. Lagrangian Duality

    Given a constrained optimization problem (the primal), one can construct a related unconstrained problem (the dual) by introducing Lagrange multipliers.

    Primal problem.

    \[\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_n(w^\top x_n + b) \geq 1, \quad \forall\, n.\]

    Lagrangian.

    Attach a multiplier \(\alpha_n \geq 0\) to each constraint: \[L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^N \alpha_n \bigl[y_n(w^\top x_n + b) - 1\bigr].\]

    Dual function.

    Minimize \(L\) over the primal variables \((w, b)\): \[g(\alpha) = \inf_{w,b} \; L(w, b, \alpha).\] Setting \(\nabla_w L = 0\) and \(\nabla_b L = 0\) yields \[w = \sum_{n=1}^N \alpha_n y_n x_n, \qquad \sum_{n=1}^N \alpha_n y_n = 0.\] Substituting back gives the dual problem: \[\max_{\alpha \geq 0} \; \sum_{n=1}^N \alpha_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N \alpha_n \alpha_m y_n y_m \, x_n^\top x_m \quad \text{subject to} \quad \sum_{n=1}^N \alpha_n y_n = 0.\]

    Strong duality.

    For the SVM problem (convex primal, linear constraints, Slater’s condition satisfied), the optimal values of the primal and dual coincide. Therefore, solving the dual is equivalent to solving the primal.

    3. Karush–Kuhn–Tucker (KKT) Conditions

    The KKT conditions are necessary and sufficient conditions for optimality in a convex constrained optimization problem. For the SVM primal, they are:

    1. Stationarity: \[w = \sum_{n=1}^N \alpha_n y_n x_n, \qquad \sum_{n=1}^N \alpha_n y_n = 0.\]

    2. Primal feasibility: \[y_n(w^\top x_n + b) \geq 1 \quad \forall\, n.\]

    3. Dual feasibility: \[\alpha_n \geq 0 \quad \forall\, n.\]

    4. Complementary slackness: \[\alpha_n \bigl[y_n(w^\top x_n + b) - 1\bigr] = 0 \quad \forall\, n.\]

    Interpretation of complementary slackness.

    For each data point \(x_n\), either:

    \(\alpha_n = 0\), in which case the constraint \(y_n(w^\top x_n + b) \geq 1\) is inactive (the point lies strictly outside the margin and does not influence the solution), or

    \(\alpha_n > 0\), in which case \(y_n(w^\top x_n + b) = 1\) (the point lies exactly on the margin).

    Only support vectors have \(\alpha_n > 0\) and contribute to defining \(w\) and \(b\). This is a key sparsity property of SVM.

    Computing \(b\) from KKT.

    From complementary slackness, for any support vector \(x_s\) (where \(\alpha_s > 0\)): \[b = y_s - w^\top x_s.\] In practice, \(b\) is averaged over all support vectors for numerical stability.

    4. The Gram Matrix

    Definition.

    Given data points \(x_1, \ldots, x_N \in \mathbb{R}^d\), the Gram matrix \(G \in \mathbb{R}^{N \times N}\) is defined by \[G_{ij} = x_i^\top x_j = \langle x_i, x_j \rangle.\] Each entry is the inner product between two data points, measuring their similarity.

    Properties.

    The Gram matrix is symmetric (\(G_{ij} = G_{ji}\)) and positive semidefinite: for any \(z \in \mathbb{R}^N\), \[z^\top G z = \Bigl\| \sum_{i=1}^N z_i x_i \Bigr\|^2 \geq 0.\] Its rank is at most \(d\), the dimension of the data points.

    Role in SVM.

    The dual objective can be written compactly using \(G\): \[\max_\alpha \; \mathbf{1}^\top \alpha - \frac{1}{2} \alpha^\top (y y^\top \circ G)\, \alpha,\] where \(\circ\) denotes the elementwise (Hadamard) product and \(y = [y_1, \ldots, y_N]^\top\). Thus the entire dual problem depends on the data only through pairwise inner products \(x_i^\top x_j\).
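The compact dual objective is straightforward to evaluate for any candidate \(\alpha\). A minimal sketch in plain Python (the two-point example at the bottom is made up to exercise the function):

```python
def dual_objective(alphas, ys, G):
    # 1^T alpha - (1/2) sum_ij alpha_i alpha_j y_i y_j G_ij
    n = len(alphas)
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * G[i][j]
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad

# Two orthogonal unit-norm points with opposite labels:
G = [[1.0, 0.0], [0.0, 1.0]]
val = dual_objective([1.0, 1.0], [1, -1], G)
```

Note that the function touches the data only through `G`, which is exactly the observation that enables the kernel trick discussed next.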

    The kernel trick.

    This observation is fundamental: since only inner products appear in the dual, we can replace \[x_i^\top x_j \;\longmapsto\; K(x_i, x_j),\] where \(K\) is any positive semidefinite kernel function. This implicitly maps data into a (possibly infinite-dimensional) feature space without ever computing the coordinates explicitly, enabling SVM to learn nonlinear decision boundaries at the cost of a linear algorithm in the dual.
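The common kernels mentioned earlier can be written out directly; substituting any of them for the dot product in the dual objective is the entire kernel trick. A short sketch (the `gamma`, `degree`, and `c` defaults are illustrative choices, not prescriptions):

```python
import math

def linear_kernel(x, z):
    # Plain dot product: recovers the linear SVM.
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, degree=2, c=1.0):
    # (x . z + c)^degree
    return (linear_kernel(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # exp(-gamma ||x - z||^2); equals 1 when x == z.
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)
```

Each of these is positive semidefinite, so replacing \(x_i^\top x_j\) by \(K(x_i, x_j)\) keeps the dual a convex QP.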

    5. Summary

    Concept Role in SVM
    Quadratic Programming The dual SVM problem is a convex QP; solved efficiently by standard solvers.
    Lagrangian Duality Converts the constrained primal into an unconstrained dual over multipliers \(\alpha\).
    KKT Conditions Characterize the optimal solution; complementary slackness identifies support vectors.
    Gram Matrix Encodes all pairwise similarities; the dual depends on data only through \(G\).
    Kernel Trick Replace \(x_i^\top x_j\) with \(K(x_i,x_j)\) to handle nonlinear boundaries.

    For more details on these mathematical foundations, see the following references: