linear regression via maximum likelihood and some simple assumptions about the system.


Assumptions

Problem Definition

We wish to establish the parameters for a set of Gaussian distributions that straddle a linear function. These distributions should be centred (have means) on the estimated function value, with a constant variance independent of $x$. The linear parameters (weights) and constant variance should maximise the likelihood of a given set of data points. Mathematically, this says that for any $x$ the corresponding $y$ is a Gaussian-distributed variable centred on a linear function with weights $w$:

y \sim \mathcal{N}(x^T w, \sigma^2), \qquad \text{equivalently} \qquad y = x^T w + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

This is basically saying we are taking a discrete set of points $x$ and expecting the corresponding $y$ to vary around a mean given by $x^T w$.
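This generative picture can be sketched in a few lines of NumPy; the weights, noise level, and sample size here are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, for illustration only
w_true = np.array([2.0, -1.0])   # linear weights w
sigma = 0.5                      # constant noise standard deviation

# Design matrix X: each row is one input point x_i
n = 200
X = rng.normal(size=(n, 2))

# Each y_i varies around its mean x_i^T w with variance sigma^2
y = X @ w_true + sigma * rng.normal(size=n)
```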

Estimation

The likelihood is the probability of the observed data given the parameters. Since the data points are independent,

\begin{align*}
p(y|X,w,\sigma) &= \prod_{i=1}^{n} p(y_i|x_i,w,\sigma) \\
&= \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2} (y_i - x_i^Tw)^2\right) \\
&= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n}(y_i - x_i^Tw)^2\right) \\
&= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (y - Xw)^T(y-Xw)\right) \\
\end{align*}
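This factorisation is easy to check numerically. A small sketch (with made-up data and parameters) verifies that the sum of per-point Gaussian log-densities equals the vectorised form on the last line:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = rng.normal(size=(n, d))
w = np.array([1.0, 2.0])
sigma = 0.3
y = X @ w + sigma * rng.normal(size=n)

# Per-point Gaussian log-densities, summed over i = 1..n
r = y - X @ w
per_point = -0.5 * np.log(2 * np.pi * sigma**2) - r**2 / (2 * sigma**2)
ll_sum = per_point.sum()

# Vectorised form: -(n/2) ln(2 pi sigma^2) - (y - Xw)^T (y - Xw) / (2 sigma^2)
ll_vec = -n / 2 * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)
```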

Given a set of observed pairs $(x_i, y_i)$, the goal here is to find the linear weights $w$ and the variance $\sigma^2$ that maximise the likelihood (the probability above). This is essentially the same as finding the set of Gaussian distributions whose means lie as close as possible to the observed $y_i$. To find the maximum, it is easier to find the equivalent maximum of the log-likelihood:

\begin{align*}
L(w) &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y-Xw)^T(y-Xw) \\
\frac{dL(w)}{dw} &= 0 - \frac{1}{2\sigma^2} (-2X^Ty + 2X^TXw)\\
\end{align*}

Equating this to zero gives the maximum at

\hat{w} = (X^TX)^{-1}X^Ty

Similarly, we can find the best choice of $\sigma^2$ by taking the derivative with respect to $\sigma^2$ and equating it to zero, resulting in:

\hat{\sigma}^2 = \frac{1}{n}(y-X\hat{w})^T(y-X\hat{w}) = \frac{1}{n}\sum_{i=1}^n(y_i-x_i^T\hat{w})^2
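Both closed-form estimates are straightforward to compute. A minimal sketch on synthetic data (hypothetical parameters) solves the normal equations $X^TX\,\hat{w} = X^Ty$ directly rather than explicitly inverting $X^TX$, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from hypothetical true parameters
n = 500
w_true = np.array([3.0, -0.5])
sigma_true = 0.4
X = rng.normal(size=(n, 2))
y = X @ w_true + sigma_true * rng.normal(size=n)

# Maximum-likelihood weights: solve X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum-likelihood variance: mean squared residual (note it divides by n)
residuals = y - X @ w_hat
sigma2_hat = residuals @ residuals / n
```

With enough data, `w_hat` and `sigma2_hat` land close to the generating parameters.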

Prediction

The Gaussian distribution of $y_*$ at some arbitrary point $x_*$ follows from the estimated weights and variance above and can be expressed as:

y_* \sim \mathcal{N}(x_*^T\hat{w}, \hat{\sigma}^2)

Implicit here is that the probability distribution for $y_*$ above is actually a conditional probability assuming that all of the other quantities are fixed, i.e. $p(y_* \mid x_*, \hat{w}, \hat{\sigma}^2)$.
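Putting estimation and prediction together, prediction reduces to plugging the estimates into the Gaussian; a sketch on hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fit on synthetic training data (hypothetical parameters)
n = 500
w_true, sigma_true = np.array([1.5, 0.5]), 0.3
X = rng.normal(size=(n, 2))
y = X @ w_true + sigma_true * rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
r = y - X @ w_hat
sigma2_hat = (r @ r) / n

# Predictive distribution at a new point x_*: N(x_*^T w_hat, sigma2_hat)
x_star = np.array([1.0, -2.0])
pred_mean = x_star @ w_hat
pred_std = np.sqrt(sigma2_hat)
```

Note that this plug-in prediction treats $\hat{w}$ and $\hat{\sigma}^2$ as exact, ignoring any uncertainty in the estimates themselves.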


Our Naivete

The assumptions here represent a naivete about the problem that can be expounded upon in several ways.