This page derives linear regression via maximum likelihood under some simple assumptions about the system.

Assumptions

  • Independence: the data points are independent.
  • No knowledge about the weights: we assume no prior belief about the weights (the training data will dictate).
  • Uniform uncertainty: each data point has the same uncertainty (i.e. a single constant $\sigma$); again, no prior belief about this.

Problem Definition

We wish to establish the parameters of a set of Gaussian distributions that straddle a linear function. These distributions should be centred on (have means equal to) the estimated function, with a constant variance independent of $x$. The linear parameters (weights) and the constant variance should maximise the likelihood of a given set of data points.

(Figure: Gaussian distributions straddling the linear function)

Mathematically, this can be stated as saying that for any $x$ the corresponding $y$ is a Gaussian-distributed variable around a linear function with weights $w$:

y \sim \mathcal{N}(x^T w, \sigma^2), \qquad \text{equivalently} \qquad y = x^T w + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

This is basically saying we are taking a discrete set of points and expecting the corresponding $y_i$ to vary around a mean given by $x_i^T w$.
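As a concrete illustration, here is a minimal sketch (assuming NumPy, with made-up weights and noise level) that draws points from this model:

import numpy as np

rng = np.random.default_rng(0)

# Made-up "true" parameters, for illustration only.
w_true = np.array([1.5, -0.7])   # weights; the first entry acts as an intercept
sigma_true = 0.3                 # constant noise standard deviation

# Design matrix X: a column of ones (intercept) plus one feature.
n = 50
x = np.linspace(0.0, 5.0, n)
X = np.column_stack([np.ones(n), x])

# y_i = x_i^T w + eps_i with eps_i ~ N(0, sigma^2), i.i.d. across points.
y = X @ w_true + rng.normal(0.0, sigma_true, size=n)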

Estimation

The likelihood is the probability of the observed data given the parameters. Since the data points are independent,

\begin{align*}
p(y|X,w,\sigma) &= \prod_{i=1}^{n} p(y_i|x_i,w,\sigma) \\
&= \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2} (y_i - x_i^Tw)^2\right) \\
&= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n}(y_i - x_i^Tw)^2\right) \\
&= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (y - Xw)^T(y-Xw)\right) \\
\end{align*}
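Continuing the sketch above, this expression can be evaluated directly (names follow the earlier snippet; for more than a handful of points it underflows, which motivates the log likelihood used below):

def likelihood(y, X, w, sigma):
    # p(y | X, w, sigma), exactly the last line of the derivation above.
    n = len(y)
    resid = y - X @ w
    return (2.0 * np.pi * sigma**2) ** (-n / 2.0) * np.exp(-(resid @ resid) / (2.0 * sigma**2))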

Given a set of observed $y$, the goal here is to find the linear weights $w$ and the $\sigma$ such that the likelihood (the probability above) is maximised. This is essentially the same as finding the set of Gaussian distributions whose means are located as close as possible to the observed $y_i$. To find the maximum, it is easier to work with the equivalent maximum of the log likelihood

\begin{align*}
L(w) &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y-Xw)^T(y-Xw) \\
\frac{dL(w)}{dw} &= 0 - \frac{1}{2\sigma^2} \left(-2X^Ty + 2X^TXw\right)
\end{align*}

which, by equating to zero, gives $X^TXw = X^Ty$ and hence the maximum at

\hat{w} = (X^TX)^{-1}X^Ty
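Continuing the sketch, the same estimate in code (solving the normal equations rather than forming the inverse explicitly, which is the usual numerically safer route):

# Maximum-likelihood weights: solve (X^T X) w = X^T y rather than inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalently: w_hat, *rest = np.linalg.lstsq(X, y, rcond=None)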

Similarly, we can find the best choice of $\sigma$ by taking the derivative with respect to $\sigma^2$ and equating it to zero, resulting in:

\hat{\sigma}^2 = \frac{1}{n}(y-X\hat{w})^T(y-X\hat{w}) = \frac{1}{n}\sum_{i=1}^n(y_i-x_i^T\hat{w})^2
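And the corresponding variance estimate in code (note the division by n, which makes this the maximum-likelihood rather than the unbiased estimate):

# Maximum-likelihood variance: mean squared residual around the fitted line.
resid = y - X @ w_hat
sigma2_hat = (resid @ resid) / len(y)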

Prediction

The Gaussian distribution of $y$ at some arbitrary point $x_*$ follows from the estimated weights and variance above and can be expressed as:

y \sim \mathcal{N}(x_*^T\hat{w}, \hat{\sigma}^2)

Implicit here is that the probability distribution for $y$ above is actually a conditional probability, assuming that all of the other characteristics are fixed, i.e. $p(y|x_*,X,\hat{w},\hat{\sigma})$.
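A minimal prediction sketch under these estimates, where x_star is a hypothetical query point (intercept term first, matching the design matrix above):

x_star = np.array([1.0, 2.5])     # hypothetical new input
y_star_mean = x_star @ w_hat      # centre of the predictive Gaussian
y_star_std = np.sqrt(sigma2_hat)  # spread; the same for every x_star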


Our Naivete

The assumptions here represent a naivete about the problem that can be addressed in several ways.

  • What if we have expert knowledge (a prior belief) about what the weights should be? We should then be looking at maximum a posteriori (MAP) estimation, not maximum likelihood (ML).
  • The same can be said of having a prior belief about the uncertainties.
  • The uncertainty itself may have a non-constant form, i.e. training points may be more certain on some sub-domain of $x$ and less certain elsewhere.
  • In more complicated scenarios, data points are not necessarily independent. For example, continuity can influence covariance so that points close to each other are highly correlated.
  • A traditional way of influencing the weights to keep them simple is to add a penalty term to the optimisation; a rough ridge sketch follows this list.
    • Refer to these lecture notes for how to do this with ridge regression or lasso techniques.
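As a rough sketch of the penalised approach mentioned in the last point, ridge regression adds an L2 penalty $\lambda\|w\|^2$ to the least-squares objective, giving the closed form $\hat{w}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty$; here lam is a user-chosen regularisation strength, not something fixed by the data:

# Ridge regression sketch: shrink the weights towards zero.
lam = 1.0                         # regularisation strength (assumed value)
p = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
# In practice the intercept column is often left unpenalised.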