To better understand the concepts, let’s first review a few terminologies.

##### 1. Model

In Machine Learning, our **goal** is often to build **good machine-learning models**. By “good”, we mean that they should generalize well i.e, perform well on the unseen data(like the test data).

- A model can be viewed as a
**probabilistic model**or as a**non-probabilistic model(function)** - Every model has its own set of parameters(the weights and bias).
- e.g. A Linear Regression model \(y = \theta_0 + \theta^T.x\) has the parameters {$\theta_0, \theta$} which correspond to bias and weights.

##### 2. Learning - Training/Parameter Estimation

**Learning** can be understood as finding the hidden patterns and structure in data by optimizing the parameters(weights) of a model.

- We call this process of learning:
**Training**when we view the model as a non-probabilistic model(function)**Parameter Estimation**when we view the model as a probabilistic model

- In this learning process, we adjust the model parameters based on the training data using some optimization techniques like Gradient Descent, Adam, RMSProp, etc.
- For
**non-probabilistic models**, we follow the principle of**Empirical Risk Minimization**which directly provides an optimization problem for finding good parameters. - For
**probabilistic models**, we use the principles of**Maximum Likelihood Estimation(MLE)**,**Maximum a Posteriori(MAP)**or**Expectation Maximization(EM)**to find a good set of parameters.

##### 3. Parametric Distributions

**Parametric Distribution:** A parametric distribution is based on a mathematical function whose shape and range are determined by one or more distribution parameters.

- Gaussian/Normal Distribution is an example of parametric distribution with $\mu$ and $\sigma$ as its parameters. Hence, it is often written as $N(\mu,\sigma^2)$.
- Once, we know its parameters we can determine the probability of a data point $x$ by plugging it into the probability density function of the Normal Distribution which is

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}$

### Empirical Risk Minimization

We follow these steps

- Choose a hypothesis function for the model
- Choose a loss function for training
- Choose a learning procedure using optimization

e.g: In the non-probabilistic view of Linear Regression,

**Hypothesis function:**$h_{\theta}(x) = \theta_0 + \theta^T.x$**Loss function:**$L(y_n,\hat{y_n}) = L(y_n, h_{\theta}(x)) = \sum_{i=1}^{m} (y_n - h_{\theta}(x))^2$**Learning procedure:**Closed-form solution under certain conditions or Gradient Descent

### Maximum Likelihood Estimation (MLE)

Rather than guessing some function might make a good estimator, we would like to have some principle from which we can derive specific functions that are good estimators for different models. The most common such principle is the **Maximum Likelihood** principle.

The idea behind **Maximum Likelihood Estimation(MLE)** is to define a function of parameters called **Likelihood function** that enables us to find(estimate) a model(more precisely its parameters $\theta$) that fits the data well.

- The estimation problem is focused on the
**Likelihood function**or more precisely its negative log-likelihood. - We want to maximize the likelihood function(more precisely log-likelihood) or minimize the negative log-likelihood(like in the case of Logistic Regression) in order to obtain the optimal parameters for the model.

Consider a dataset of *m* examples \(\mathbb{D} = \{x^{(1)},...,x^{(m)} \}\) drawn independently from the true but unknown data-generating distribution $p_{data}(x)$

Let $x$ be a random variable representing the data and $p_{model}(x;\theta)$ be a parametric family of probability distributions parametrized by $\theta$. In other words, $p_{model}(x;\theta)$ maps any configuration(data point $x$) to a real number estimating the true probability $p_{data}(x)$

The Likelihood function, \(L_{D}(\theta)\) is given by \(p_{model}(\mathbb{D}\mid\theta)\)

- Note that $\theta$ is fixed but its value is still unknown at this point and our goal is to find the set of values for $\theta$ which maximizes the likelihood function.
- The notation \(L_{D}(\theta)\) emphasizes the fact that the dataset \(\mathbb{D}\) is fixed and the parameter $\theta$ is varying.
- We often often drop the reference to $D$ and write it as $L(\theta)$ as it really is a function of $\theta$.
- Since our data is fixed(as it has been observed), by varying the values of the paramers $\theta$, $L(\theta)$ tells us how
**likely**a particular setting of $\theta$ is for the given dataset \(\mathbb{D}\). - Thus, the Maximum Likelihood Estimator gives us the most likely parameter set $\theta$ for the given dataset \(\mathbb{D}\).

#### MLE in Supervised Learning

Let’s now consider the supervised learning setting, where we have a dataset $\mathbb{D} = {(x^{(1)}, y^{(1)}),…,(x^{(m)}, y^{(m)})}$.

We are interested in constructing a predictor that takes a feature vector $x_n$ as input and produces a prediction $y_n$ (or something close to it), i.e., given a vector $x_n$, we want the probability distribution of the label $y_n$. In other words, we specify the conditional probability distribution of the labels given the examples for the particular parameter setting $\theta$.

Let’s assume the data points are i.i.d.

The Maximum Likelihood estimate is given by,

\[\begin{aligned} \hat{\theta}_{ML} & = \underset{\theta}{\arg \max} \quad L(\theta) \\ & = \underset{\theta}{\arg \max} \quad p(\mathcal{Y}|\mathcal{X}; \theta) \\ & = \underset{\theta}{\arg \max} \quad \prod_{i=1}^{m} \space p(y_n | x_n ; \theta) \\ & = \underset{\theta}{\arg \max} \quad \sum_{i=1}^{m} \space \log{p(y^{(i)} | x^{(i)} ; \theta)} \\ \end{aligned}\]Thus, the goal is to find a good parameter vector $\theta$ that explains the data ${(x^{(1)}, y^{(1)}),…,(x^{(m)}, y^{(m)})}$ well i.e., maximizes the likelihood.

- We typically use the
**Negative Log-Likelihood(NLL)**function as the Loss/Error function (like in the case of Logistic Regression).

#### Maximum A Posteriori Estimation

If we have prior knowledge about the distribution of parameters $\theta$, we can multiply an additional term to the likelihood. The additional term is a prior probability distribution on parameters $p(\theta)$.

For a given prior, after observing some data $x$, we update the distribution of $\theta$ using **Bayes Theorem** which gives us a principled tool to update our probability distributions of random variables.

It allows us compute a **posterior distribution**, $ p(\theta | x) $ on the parameters $\theta$ using prior distribution, $p(\theta)$ and the likelihood function, $p(x | \theta)$

We are interested in finding the parameters $\theta$ that maximize the posterior. The distribution $p(x)$ in the denominator does not depend on $\theta$ and hence can be ignored.

\[p(\theta | x) \propto p(x | \theta)*p(\theta)\]The Maximum A Posteriori estimate is given by,

\[\begin{aligned} \hat{\theta}_{MAP} & = \underset{\theta}{\arg \max} \quad p(\theta | x)\\ & = \underset{\theta}{\arg \max} \quad L(\theta)*p(\theta) \\ & = \underset{\theta}{\arg \max} \quad p(x | \theta)*p(\theta) \\ \end{aligned}\]- MAP Estimate with Gaussian Likelihood and Gaussian prior on parameters gives Ridge Regression.
- \(y \mid x,\theta \sim N(\theta^T x, \sigma^2)\) and \(\theta \sim N(0, \frac{1}{\lambda}I )\)

- MAP Estimate with Gaussian Likelihood and Laplace prior on parameters gives Lasso Regression.
- \(y \mid x,\theta \sim N(\theta^T x, \sigma^2)\) and \(\theta \sim Laplace(0, \frac{1}{\lambda}I )\)