To better understand the concepts, let’s first review a few terminologies.

1. Model

In Machine Learning, our goal is often to build good machine-learning models. By “good”, we mean that they should generalize well i.e, perform well on the unseen data(like the test data).

  • A model can be viewed as a probabilistic model or as a non-probabilistic model(function)
  • Every model has its own set of parameters(the weights and bias).
    • e.g. A Linear Regression model \(y = \theta_0 + \theta^T.x\) has the parameters {$\theta_0, \theta$} which correspond to bias and weights.
2. Learning - Training/Parameter Estimation

Learning can be understood as finding the hidden patterns and structure in data by optimizing the parameters(weights) of a model.

  • We call this process of learning:
    • Training when we view the model as a non-probabilistic model(function)
    • Parameter Estimation when we view the model as a probabilistic model
  • In this learning process, we adjust the model parameters based on the training data using some optimization techniques like Gradient Descent, Adam, RMSProp, etc.
  • For non-probabilistic models, we follow the principle of Empirical Risk Minimization which directly provides an optimization problem for finding good parameters.
  • For probabilistic models, we use the principles of Maximum Likelihood Estimation(MLE), Maximum a Posteriori(MAP) or Expectation Maximization(EM) to find a good set of parameters.
3. Parametric Distributions

Parametric Distribution: A parametric distribution is based on a mathematical function whose shape and range are determined by one or more distribution parameters.

  • Gaussian/Normal Distribution is an example of parametric distribution with $\mu$ and $\sigma$ as its parameters. Hence, it is often written as $N(\mu,\sigma^2)$.
  • Once, we know its parameters we can determine the probability of a data point $x$ by plugging it into the probability density function of the Normal Distribution which is

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}$

Empirical Risk Minimization

We follow these steps

  1. Choose a hypothesis function for the model
  2. Choose a loss function for training
  3. Choose a learning procedure using optimization

e.g: In the non-probabilistic view of Linear Regression,

  • Hypothesis function: $h_{\theta}(x) = \theta_0 + \theta^T.x$
  • Loss function: $L(y_n,\hat{y_n}) = L(y_n, h_{\theta}(x)) = \sum_{i=1}^{m} (y_n - h_{\theta}(x))^2$
  • Learning procedure: Closed-form solution under certain conditions or Gradient Descent

Maximum Likelihood Estimation (MLE)

Rather than guessing some function might make a good estimator, we would like to have some principle from which we can derive specific functions that are good estimators for different models. The most common such principle is the Maximum Likelihood principle.

The idea behind Maximum Likelihood Estimation(MLE) is to define a function of parameters called Likelihood function that enables us to find(estimate) a model(more precisely its parameters $\theta$) that fits the data well.

  • The estimation problem is focused on the Likelihood function or more precisely its negative log-likelihood.
  • We want to maximize the likelihood function(more precisely log-likelihood) or minimize the negative log-likelihood(like in the case of Logistic Regression) in order to obtain the optimal parameters for the model.

Consider a dataset of m examples \(\mathbb{D} = \{x^{(1)},...,x^{(m)} \}\) drawn independently from the true but unknown data-generating distribution $p_{data}(x)$

Let $x$ be a random variable representing the data and $p_{model}(x;\theta)$ be a parametric family of probability distributions parametrized by $\theta$. In other words, $p_{model}(x;\theta)$ maps any configuration(data point $x$) to a real number estimating the true probability $p_{data}(x)$

The Likelihood function, \(L_{D}(\theta)\) is given by \(p_{model}(\mathbb{D}\mid\theta)\)

  • Note that $\theta$ is fixed but its value is still unknown at this point and our goal is to find the set of values for $\theta$ which maximizes the likelihood function.
  • The notation \(L_{D}(\theta)\) emphasizes the fact that the dataset \(\mathbb{D}\) is fixed and the parameter $\theta$ is varying.
  • We often often drop the reference to $D$ and write it as $L(\theta)$ as it really is a function of $\theta$.
  • Since our data is fixed(as it has been observed), by varying the values of the paramers $\theta$, $L(\theta)$ tells us how likely a particular setting of $\theta$ is for the given dataset \(\mathbb{D}\).
  • Thus, the Maximum Likelihood Estimator gives us the most likely parameter set $\theta$ for the given dataset \(\mathbb{D}\).
\[\begin{aligned} \textrm{The Likelihood} & \textrm{ function is given by,}\\ L(\theta) & = p_{model}(\mathbb{D};\theta) \\ & = p_{model}(x^{(1)},...,x^{(m)};\theta) \\ & \textrm{If we assume all data points are i.i.d (independent and identically distributed)} \\ & \textrm{1. independent:} \quad p(x_1,x_2) = p(x_1)*p(x_2) \\ & \textrm{2. identically distributed:} \quad \textrm{Each } x_i \textrm{ is sampled from the same distribution i.e.,} \quad x_i \sim p_{model}(x; \theta) \\ \textrm{Now, }\\ L(\theta) & = p_{model}(x^{(1)};\theta)*p_{model}(x^{(2)};\theta)*...*p_{model}(x^{(m)};\theta) \quad \textrm{[ i.i.d ]}\\ & = \prod_{i=1}^{m} \space p_{model}(x^{(i)};\theta) \\ \textrm{Applying log, }\\ \log{L(\theta)} & = \sum_{i=1}^{m} \log{p_{model}(x^{(i)};\theta)} \\ \textrm{The Maximum} & \textrm{ Likelihood Estimator for $\theta$ is defined as,}\\ \hat{\theta}_{ML} & = \underset{\theta}{\arg \max} \quad L(\theta) \\ & = \underset{\theta}{\arg \max} \quad \log{L(\theta)} \\ & = \underset{\theta}{\arg \max} \quad \sum_{i=1}^{m} \log{p_{model}(x^{(i)};\theta)} \\ & \textrm{Note:}\\ & \textrm{1. By applying log, the product of probabilities is replaced by the sum of log-probabilities}\\ & \textrm{which is a more convenient but equivalent optimization problem.}\\ & \textrm{2. Working with logarithms is desirable because of numerical stability: on a large dataset, multiplying}\\ & \textrm{many probabilities can underflow to zero.}\\ & \textrm{3. Taking the logarithm of the likelihood does not change its arg max because the log function}\\ & \textrm{is monotonically increasing over positive arguments and so the same } \theta \textrm{ will maximize both}\\ & \textrm{probability and its logarithm.} \end{aligned}\]

MLE in Supervised Learning

Let’s now consider the supervised learning setting, where we have a dataset $\mathbb{D} = {(x^{(1)}, y^{(1)}),…,(x^{(m)}, y^{(m)})}$.

We are interested in constructing a predictor that takes a feature vector $x_n$ as input and produces a prediction $y_n$ (or something close to it), i.e., given a vector $x_n$, we want the probability distribution of the label $y_n$. In other words, we specify the conditional probability distribution of the labels given the examples for the particular parameter setting $\theta$.

Let’s assume the data points are i.i.d.

The Maximum Likelihood estimate is given by,

\[\begin{aligned} \hat{\theta}_{ML} & = \underset{\theta}{\arg \max} \quad L(\theta) \\ & = \underset{\theta}{\arg \max} \quad p(\mathcal{Y}|\mathcal{X}; \theta) \\ & = \underset{\theta}{\arg \max} \quad \prod_{i=1}^{m} \space p(y_n | x_n ; \theta) \\ & = \underset{\theta}{\arg \max} \quad \sum_{i=1}^{m} \space \log{p(y^{(i)} | x^{(i)} ; \theta)} \\ \end{aligned}\]

Thus, the goal is to find a good parameter vector $\theta$ that explains the data ${(x^{(1)}, y^{(1)}),…,(x^{(m)}, y^{(m)})}$ well i.e., maximizes the likelihood.

  • We typically use the Negative Log-Likelihood(NLL) function as the Loss/Error function (like in the case of Logistic Regression).

Maximum A Posteriori Estimation

If we have prior knowledge about the distribution of parameters $\theta$, we can multiply an additional term to the likelihood. The additional term is a prior probability distribution on parameters $p(\theta)$.

For a given prior, after observing some data $x$, we update the distribution of $\theta$ using Bayes Theorem which gives us a principled tool to update our probability distributions of random variables.

It allows us compute a posterior distribution, $ p(\theta | x) $ on the parameters $\theta$ using prior distribution, $p(\theta)$ and the likelihood function, $p(x | \theta)$

\[p(\theta | x) = \frac{p(x | \theta)*p(\theta)}{p(x)}\]

We are interested in finding the parameters $\theta$ that maximize the posterior. The distribution $p(x)$ in the denominator does not depend on $\theta$ and hence can be ignored.

\[p(\theta | x) \propto p(x | \theta)*p(\theta)\]

The Maximum A Posteriori estimate is given by,

\[\begin{aligned} \hat{\theta}_{MAP} & = \underset{\theta}{\arg \max} \quad p(\theta | x)\\ & = \underset{\theta}{\arg \max} \quad L(\theta)*p(\theta) \\ & = \underset{\theta}{\arg \max} \quad p(x | \theta)*p(\theta) \\ \end{aligned}\]
  • MAP Estimate with Gaussian Likelihood and Gaussian prior on parameters gives Ridge Regression.
    • \(y \mid x,\theta \sim N(\theta^T x, \sigma^2)\) and \(\theta \sim N(0, \frac{1}{\lambda}I )\)
  • MAP Estimate with Gaussian Likelihood and Laplace prior on parameters gives Lasso Regression.
    • \(y \mid x,\theta \sim N(\theta^T x, \sigma^2)\) and \(\theta \sim Laplace(0, \frac{1}{\lambda}I )\)