Tuesday, March 12, 2013

Introduction to Machine Learning, Part 2: Linear Regression

Continuing our exploration of machine learning, we will discuss the use of basis functions for regression analysis. When presented with an unknown dataset, it is very common to attempt to find trends or patterns. The most basic form of this is visual inspection - how is the data trending? Does it repeat in cycles? Can we predict future data given some past events? The mathematical approach to this "trend finding" is called regression, or line fitting. As we will see, it is possible to fit more than a simple straight line, and the general technique of regression is very effective at breaking down many types of data.

What are some sample functions we might want to perform regression on? Take as an example:
$y = x^2 - 4x + 1$

With $x = 1\ldots100$ and no noise, the output $y$ looks like this:

What happens if we look at the same values, distorted by some noise?

Since we know the original function for the data, it is easy to "see" the underlying $x^2 - 4x +1$ (the blue line) is still there, but how can we do this mathematically, and without knowledge of the original generating function?

Looking at the function $x^2 - 4x +1$, rearranged as $1-4x+x^2$, we see that this is really a specialization of the general $2$nd order polynomial $w_0 + w_1x + w_2x^2$, with weight vector ${\bf w} = [1, -4, 1]$. If we also abstract the function vector ${\bf x} = \sum\limits_{n=0}^2x^n = [0, x, x^2]$, the result is a function $y({\bf w}, {\bf x}) = w_0 + w_1x_1 + w_2x_2$.

If we think of any $n$th order polynomial, $y({\bf w}, {\bf x})$, it can always be represented by the form.

$y({\bf w}, {\bf x}) = w_0 + w_1x_1 + w_2x_2+\ldots + w_nx_n$

This can also be represented in matrix form.

$[w_0\ldots w_{n}]\left[ \begin{array}{xmat} 1 & 0 & \cdots & 0\\ 0 & x_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_n \end{array} \right]$

In fact, going one step further, we see that the vector ${\bf x}$ could really be better generalized as a basis vector - that is, our current vector ${\bf x} = [0, x, x^2] = x^n$ could be generalized with something besides $x^n$. We could use sine waves (Fourier basis), Gaussians, sigmoids, wavelets, or any number of other functions to perform this same regression - in fact, this is the root of the concept of transforms in DSP lingo. 

When we talk about linear regression, we are really talking about a linear regression in the basis domain - a linear combination of basis functions multiplied by weighting values, which can be used to approximate the original signal. Since we have learned that we can use nearly any function as a basis, let's change notation from ${\bf x}$ to ${\bf \phi}$, where ${\bf \phi}$ is a function of $x$.

 $[w_o\ldots w_{n}]\left[ \begin{array}{phimat} \phi_0(x_1) & \phi_0(x_2) & \cdots & \phi_0(x_n) \\ \phi_1(x_1) & \phi_1(x_2) & \cdots & \phi_1(x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_m(x_1) & \phi_m(x_2) & \cdots & \phi_m(x_n) \end{array} \right]$

The full derivation of the maximum likelihood estimate (MLE) can be found here, but the short and sweet version is that the maximum likelihood estimate of ${\bf w}$ is $(\Phi^T\Phi)^{-1}\Phi {\bf t}$, where $\Phi$ is the general basis matrix defined above, and ${\bf t}$ is a set of test datapoints. These test datapoints are generated by $\Phi {\bf w_m}$, where ${\bf w_m}$ are the measured values (red dots shown in the above figures).

$(\Phi^T\Phi)^{-1}\Phi$ is known as the Moore-Penrose pseudoinverse, and it can be replaced by the matrix inverse $\Phi^{-1}$ if the basis matrix is square (3x3, 100x100, etc.). Let's look at the code.

import numpy as np
import matplotlib.pyplot as plot
import math

def gen_dft(m, n, N):
    return np.exp(1j*-2*m*n/N)

def gen_polynomial(x, m):
    return x**m

N = 10
N_basis = 3
noise_var = B = 5
basis = np.matrix(np.zeros((N,N)), dtype=np.complex64)
xs = np.matrix(range(N)).T
ys = np.square(xs) - 4*xs + 1

wm = ys + np.sqrt(B)*np.random.randn(N,1)
for m in range(N_basis):
    for n in range(N):
        if m == 0:
            basis[m,n] = 1.0
            basis[m,n] = gen_polynomial(xs[n], m)
            #To use the gen_dft basis, make sure to set N_basis = N
            #basis[m,n] = gen_dft(m, n, N)

test_data = t = basis*wm
#Calculate using the Moore-Penrose pseudoinverse using the following formula
#maximum_likelihood = wml = np.linalg.inv(basis.T*basis)*basis.T*t
#Direct calculation appears to have numerical instability issues...
#Luckily the pinv method calculates Moore-Penrose pseudo inverse using SVD

#which largely avoids the numerical issues
maximum_likelihood = wml = np.linalg.pinv(basis)*t

plot.title("Regression fit using polynomial basis function, number of basis functions = $\$$" + `N_basis` + "$\$$")
plot.plot(ys, 'b')
plot.plot(wm, 'ro')
plot.plot(np.real(wml), 'g')

As we can see, the approximation with 3 basis functions (2nd order polynomial fit) does a pretty good job of approximating the original function. Let's see what happens if we increase the number of basis functions.

We can see that this fit gets closer to the measured data (red dots), but is actually further from the true signal without noise! This is called overfitting, because our estimate is beginning to fit our measurement, which has noise, rather than the underlying function (the blue line).  Let's try another example, using the Fourier basis matrix. 
The Fourier basis does a decent job of fitting the line, but still has more error  than the polynomial fit with the right number of basis functions. Because the original function is polynomial, we should expect a polynomial basis to give superior results. We are overfitting our polynomial regression by including too many basis functions - in general, we want to fit the smallest number of basis functions while still minimizing overall error.

There are ways to reduce overfitting by regularizing this operation, using something called the "lasso method". There are also methods for sequential estimation, which avoid large matrix operations for higher order basis matrices. We will instead look at a Bayesian approach to regression, which naturally has regularization properties, and also eliminates most of the large matrix math by making this regression a sequential operation. This will greatly help higher order regression, and falls in line with our earlier approaches from part 1.

So how do we actually perform a Bayesian regression? There are some complicated derivations once again (pg 17 and onward here), but the idea is that we make iterative estimates, starting with some prior estimate and a likelihood to get a posterior estimate, which we will use as the prior for the next iteration. Remember, Bayesian is all about $posterior\: =\: likelihood\: \times\: prior$. After iterating through all the data points, we expect our estimate to approach the true generating function, without noise. Let's look at the code.

import numpy as np
import matplotlib.pyplot as plot

N = 10
noise_var = B = .2**2
lower_bound = lb = 0.
upper_bound = ub = 5.
xs = np.matrix((ub-lb)*np.random.rand(N)+lb).T
xaxis = np.matrix(np.linspace(lb,ub,num=N)).T
w = np.array([1, -4, 1])
ys = w[0] + w[1]*xaxis + w[2]*np.square(xaxis)
t = w[0] + w[1]*xs + +w[2]*np.square(xs) + np.sqrt(B)*np.random.randn(N,1);

def gen_polynomial(x, p):
    return x**p
N_basis = 7
alpha = a = 1.
beta = b = 1./B;
prior_m = np.zeros((N_basis, 1))
m = np.zeros((N_basis, N))
prior_s = np.matrix(np.diag(np.array([a]*N_basis)))
s = np.zeros((N_basis, N_basis))
for n in range(N):
    poly = np.vectorize(gen_polynomial)
    basis = poly(xs[n], np.matrix(np.arange(N_basis)))
    s_inv = prior_s.I*np.eye(N_basis)+b*(basis.T*basis)
    s = s_inv.I*np.eye(N_basis)
    #Need to use .squeeze() so broadcasting works correctly
    m[:,n] = (s*(prior_s.I*prior_m+(b*basis.T*t[n]))).squeeze()
    y = m[0,n] + m[1,n]*xaxis + m[2,n]*np.square(xaxis)
    plot.plot(xaxis, y, "g")
    prior_m[:,0] = m[:,n].squeeze()
    prior_s = s

plot.title("Bayesian regression using polynomial basis function, number of basis functions = $\$$" + `N_basis` + "$\$$")
plot.plot(xaxis, ys, "b")
plot.plot(xs, t, "ro")
plot.plot(xaxis, y, "k")

Adding in the progression plots (plot.plot(xaxis, y, "g"))
The green lines show how the Bayesian regression updates on each iteration through the loop - though unlabeled, we assume that the closer the green line is to the final estimate (black line), the later it ran in the code, with the n = 10 iteration resulting in the final black line shown.

The top part of the code sets up the generating function, also generates samples corrupted by noise. We could set xs differently to get linearly spaced samples (using np.arange(lb,ub,N)) instead of random spacing, but random samples are what Bishop is using in his book and diagrams (Pattern Recognition and Machine Learning, C. Bishop). I also chose to vectorize the basis matrix generation this time - which doesn't actually increase the code speed (in numpy at least), but does turn a multi-line for loop into one line of code.

In fact, looking at page 153 in Bishop's book will give all the formulas used in the processing loop, mapped as follows.
s_inv = ${\bf S}_N^{-1}$
s = ${\bf S}_N$
m[:,n] = ${\bf m}_N$
prior_s = ${\bf S}_0^{-1}$
prior_m = ${\bf m}_0$
Updating the priors with the results of the loop each time, we end up with a reasonable approximation of the generating function as long as N_basis is the same as the generating function. What happens if we don't use the right value for N_basis?
Bad things, it turns out. Instead of an overfitting problem, we now have a model selection problem - if we don't choose N_basis right, we won't converge to the right answer!

We really want to automate this model selection process, as scientists are too busy traveling the world and winning awards to babysit their award-winning scientific models. Luckily, there is a formula for calculating Bayesian model evidence - the model that maximizes this function should be the best choice! See the notes here (PDF) for more details, starting with slide 41.

         ${\large {\bf A} = {\bf S}_N^{-1} = \alpha {\bf I} + \beta {\bf \Phi}^T{\bf \Phi}}$
${\large {\bf m}_N = \beta{\bf A}^{-1}{\bf \Phi}^T{\bf t}}$
${\large E({\bf m}_N) = \frac{\beta}{2}||{\bf t}-{\bf \Phi}{\bf m}_N||^2+\frac{\alpha}{2}{\bf m}_N^T{\bf m}_N}$
${\large \ln p({\bf t}| \alpha, \beta) = \frac{M}{2}ln(\alpha)+\frac{N}{2}ln(\beta)-E({\bf m}_N) - \frac{1}{2}ln(| {\bf A} | ) - \frac{N}{2}ln(2\pi)}$

Approximations for $\alpha$ and $\beta$ are:
${\large \alpha = \frac{M}{2E_w({\bf m}_N)} = \frac{M}{{\bf w}^T{\bf w}}}$
${\large \beta = \frac{N}{2E_D({\bf m}_N)} = \frac{N}{\sum\limits^N_{n=1}\{t_n - {\bf w}^T\phi({\bf x}_n)\}^2}}$
By running each model, then computing the last equation (the model evidence function), we can see which model has the highest evidence of being correct. The full code for this calculation is listed below.
import numpy as np
import matplotlib.pyplot as plot

N = 50.
noise_var = B = .2**2
lower_bound = lb = -5.
upper_bound = ub = 5.
xs = np.matrix(sorted((ub-lb)*np.random.rand(N)+lb)).T
xaxis = np.matrix(np.linspace(lb,ub,num=N)).T
w = np.array([1, -4, 1])
ys = w[0] + w[1]*xaxis + w[2]*np.square(xaxis)
t = w[0] + w[1]*xs + +w[2]*np.square(xs) + np.sqrt(B)*np.random.randn(N,1);

prior_alpha = 0.005 #Low precision initially
prior_beta = 0.005 #Low guess for noise var
max_N = 7 #Upper limit on model order
evidence = E = np.zeros(max_N)
f, axarr = plot.subplots(3)
def gen_polynomial(x, p):
    return x**p

itr_upper_bound = 250
all_y = []

for order in range(1, max_N+1):
    m = np.zeros((order, 1))
    s = np.zeros((order, order))
    poly = np.vectorize(gen_polynomial)
    basis = poly(xs, np.tile(np.arange(order), N).reshape(N, order))
    alpha = a = prior_alpha
    beta = b = prior_beta
    itr = 0
    while not itr < itr_upper_bound:
        itr += 1
        first_part = a*np.eye(order)
        second_part = b*(basis.T*basis)
        s_inv = a*np.eye(order)+b*(basis.T*basis)
        m = b*s_inv.I*basis.T*t
        posterior_alpha = pa = np.matrix(order/(m.T*m))[0,0]
        posterior_beta = pb = np.matrix(N/((t.T-m.T*basis.T)*(t.T-m.T*basis.T).T))[0,0]
        a = pa
        b = pb
    A = a*np.eye(order)+b*(basis.T)*basis
    mn = b*(A.I*(basis.T*t))
    penalty = emn = b/2.*(t.T-mn.T*basis.T)*(t.T-mn.T*basis.T).T+a/2.*mn.T*mn
    E[order-1] = order/2.*np.log(a)+N/2.*np.log(b)-emn-1./(2*np.log(np.linalg.det(A)))-N/2.*np.log(2*np.pi)
    y = (mn.T*basis.T).T
    axarr[0].plot(xs, y ,"g")

best_model = np.ma.argmax(E)
#print E
x0label = x2label = "Input X"
y0label = y2label = "Output Y"
x1label = "Model Order"
y1label = "Score"
axarr[0].set_title("Bayesian model estimation using polynomial basis functions")
axarr[0].plot(xaxis, ys, "b")
axarr[0].plot(xs, t, "ro")
axarr[1].set_title("Model Evidence")
axarr[1].plot(E, "b")
axarr[2].set_title("Best model, polynomial order $"+`best_model`+"$")
axarr[2].plot(xs, t, "ro")
axarr[2].plot(xs, all_y[best_model], "g")

The model selection calculation has allowed us to evaluate different models and compare them mathematically. The model evidence values for this run are: 

[-200.40318142 -187.76494065 -186.72241097 -189.13724168 -191.74123924 -194.37069482 -197.01120722] 

 Let's add more noise and see what happens.
Adjusting the noise_var value from .2**2 to 1.**2, we see that that the higher order models have increased bias, just as we saw in our earlier experiments. Our model evidence coefficients are:

[-207.27716138 -195.14402961 -188.44030008 -190.92931379 -193.53242997 -196.16271701 -198.75960682]

This time, we are able to use the model evidence function to correctly compensate for model bias and select the best model for our data. Awesome! Let's push a little farther, and reduce the number of sample points as well.

We can see that with a reduced number of samples (N = 10), there is not enough information to accurately fit or estimate the model order. Our coefficients for this run were:

[-46.65574626 -42.33981333 -43.95725922 -46.19200103 -49.68466447 -51.98193588 -54.56531748]

This is important point - all of these techniques help determine a model which is best for your data, but if there are not enough data points, or the data points are not unique enough to make a good assessment, you will not be able to get a grasp of the underlying function, which will lead to an imprecise answer. In the next installment of this series, we will cover another method of linear regression, which is easily extended to simplistic classification.

Pattern Recognition and Machine Learning, C. Bishop
Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern, and D. Rubin
Classnotes from Advanced Topics in Pattern Recognition, UTSA, Dr. Zhang, Spring 2013

Friday, March 8, 2013

Amazon EC2 Setup and Connection

First, you will need to sign up for Amazon EC2 (I am currently in the Free tier).
Next, follow these directions to launch a cloud OS so that we can start serving up webpages! I chose Ubuntu 12.10 64bit, so these instructions apply for that instance type, though they may work for other types

To ssh to the newly launched instance, find the public DNS from the Amazon EC2 management page, then use these commands:
chmod 600 <keyname>.pem
ssh -i <keyname>.pem -l ubuntu <public DNS name>

Now we are in! Time to do some quick security setup, using the instructions found here. The only liberties I have taken are setting the ufw firewall rules to 22 allow, and not setting IP rules or allowed users for ssh (since we have set it to key only access).

We need to get python-pip and python-virtualenv packages. Do this by running sudo apt-get install python-{pip,virtualenv}. virtualenv isolates at special instance of python from everything else, so packages won't interfere with your main python install.

Running virtualenv venv; source venv/bin/activate should put you into the virtual environment, we can install flask here by running sudo venv/bin/pip install flask . For an easy setup of flask with authentication, SQLAlchemy, and twitter bootstrap, thanks to esbullington, we can run the following commands

sudo apt-get install python-dev libpq-dev
git clone git://github.com/esbullington/flask-bootstrap.git
sudo venv/bin/pip install -r flask-bootstrap/requirements.txt
To finish launching our app, we need to create a postgreSQL database for the application to connect to. The basic instructions list the following steps:
sudo apt-get install postgreql
sudo -u postgres psql postgres
    \password postgres
sudo -u postgres createuser <username>  
sudo -u postgres psql 
    create databse <dbname> with owner <username>
sudo -u postgres psql
    \password <username>

To login to the database, use psql -d <dbname> -U <username> -h localhost.

Go into flask-boostrap, and adjust the app.cfg and app.py settings.

Change the postgresql settings to

Change the app.run line so that we bind to port 80 and allow connections from any external IP

To run this script, do
nohup sudo venv/bin/python flask-bootstrap/app.py &

Now the next step is figuring out how to link this server to a real deal domain name!

Friday, March 1, 2013

Introduction to Machine Learning, Part 1: Parameter Estimation

Machine learning is a fascinating field of study which is growing at an extreme rate. As computers get faster and faster, we attempt to harness the power of these machines to tackle difficult, complex, and/or unintuitive problems in biology, mathematics, engineering, and physics. Many of the purely mathematical models used in these fields are extremely simplified compared to their real-world counter parts (see the spherical cow). As scientists attempt more and more accurate simulations of our universe, machine learning techniques have become critical to building accurate models and estimating experimental parameters. 

Also, even very basic machine learning can be used to effectively block spam email, which puts it somewhere between water and food on the "necessary for life" scale.

Machine learning seems very complicated, and advanced methods in the field are usually very math heavy - but it is all based in common statistics, mostly centered around the ever useful Gaussian distribution. One can see the full derivation for the below formulas here (PDF) or here (PDF), a lot of math is involved but it is relatively straightforward as long as you remember Bayes rule:
$posterior \:=\: likelihood \:\times\: prior$

From the links above, we can see that our best estimate for the mean $\mu$ given a known variance $\sigma^2$ and some Gaussian distributed data vector $X$ is:

$\sigma(\mu)^2_n = {\Large \frac{1}{\frac{N}{\sigma^2}+\frac{1}{\sigma_o^2}}}$

$\mu_n = {\Large \frac{1}{\frac{N}{\sigma^2}+\frac{1}{\sigma^2}}(\frac{\sum\limits_{n=1}^N x_n}{\sigma^2}+\frac{\mu_o}{\sigma_o^2})}$

where $N$ (typically 1) represents the number of $X$ values used to generate the mean estimate $\mu$ (just $x_n$ if $N$ is 1), $\mu_o$ is the previous "best guess" for the mean ($\mu_{n-1}$), $\sigma_o^2$ is the previous confidence in the "best guess" for $\mu$ ($\sigma(\mu)^2_{n-1})$), and $\sigma$ was known prior to the calculation. Lets see what this looks like in python.

import numpy as np
import matplotlib.pyplot as plot
total_obs = 1000
primary_mean = 5.
primary_var = known_var = 4.
x = np.sqrt(primary_var)*np.random.randn(total_obs) + primary_mean
f, axarr = plot.subplots(3)
f.suptitle("Unknown mean ($\$$\mu=$\$$"+`primary_mean`+"), known variance ($\$$\sigma^2=$\$$"+`known_var`+")")
y0label = "Timeseries"
y1label = "Estimate for mean"
y2label = "Doubt in estimate"
prior_mean = 0.
prior_var = 1000000000000.
all_mean_guess = []
all_mean_doubt = []
for i in range(total_obs):
    posterior_mean_doubt = 1./(1./known_var+1./prior_var)
    posterior_mean_guess = (prior_mean/prior_var+x[i]/known_var)*posterior_mean_doubt


This code results in this plot:
We can see that there are two basic steps - generating the test data, and iteratively estimating the "unknown parameter", in this case the mean. We begin our estimate for the mean at any value (prior_mean = 0) and set the prior_var variable extremely large, indicating that our confidence in the mean actually being 0 is extremely low.

What happens in the opposite (though still "academic" case) where we know the mean but not the variance? Our best estimate for unknown variance $\sigma^2$ given the mean $\mu$ and some Gaussian distributed data vector $X$ will use some extra variables, but still represent our estimation process.:

$a_n = {\Large a_o + \frac{N}{2}}$
$b_n = {\Large b_o + \frac{1}{2}\sum\limits^N_{n=1}(x_n-\mu)^2}$
 $\lambda = {\Large \frac{a_n}{b_n}}$
$\sigma(\lambda)^2 = {\Large \frac{a_n}{b_n^2}}$
$\lambda = {\Large \frac{1}{\sigma^2}}$

This derivation is made much simpler by introducing the concept of precision ($\lambda$), which is simply 1 over the variance. We estimate the precision, which can be converted back to variance if we prefer. $\mu$ is known in this case.

import numpy as np
import matplotlib.pyplot as plot
total_obs = 1000
primary_mean = known_mean = 5
primary_var = 4
x = np.sqrt(primary_var)*np.random.randn(total_obs) + primary_mean
all_a = []
all_b = []
all_prec_guess = []
all_prec_doubt = []
f,axarr = plot.subplots(3)
f.suptitle("Known mean ($\$$\mu=$\$$"+`known_mean`+"), unknown variance ($\$$\sigma^2=$\$$"+`primary_var`+"; $\$$\lambda$\$$="+`1./primary_var`+")")
y0label = "Timeseries"
y1label = "Estimate for precision"
y2label = "Doubt in estimate"

for i in range(1,total_obs):

Here I chose to set the values for prior_a and prior_b to the "first" values of the estimation, we could just as easily have reversed the formulas by setting the precision to some value, and the "doubt" about that precision very large, then solving a system of two equations, two unknowns for a and b.

What happens if both values are unknown? All we really know is that the data is Gaussian distributed!

$\nu_o={\Large \kappa_o-1}$
$\mu_n={\Large \frac{\kappa_o\mu_o+\overline{X}N}{\kappa_o+N}}$
$\kappa_n={\Large \kappa_o+N}$
$\nu_n={\Large \nu_o+N}$
$\sigma_n^2={\Large \frac{\nu_o\sigma^2+(N-1)s^2+\frac{\kappa_oN}{\kappa_o+N}(\overline{X}-\mu_o)^2}{\nu_n}}$
$s={\Large var(X)}$

The technique here is the same as other derivations for unknown mean, known variance and known mean, unknown variance. To estimate the mean and variance of data taken from a single Gaussian distribution, we need to iteratively update our best guesses for both mean and variance. In many cases, $N=1$, so the value for $s$ is not necessary and $\overline{X}$ becomes $x_n$. Let's look at the code.

import matplotlib.pyplot as plot
import numpy as np

total_obs = 1000
primary_mean = 5.
primary_var = 4.
x = np.sqrt(primary_var)*np.random.randn(total_obs) + primary_mean
f, axarr = plot.subplots(3)
f.suptitle("Unknown mean ($\$$\mu=$\$$"+`primary_mean`+"), unknown variance ($\$$\sigma^2=$\$$"+`primary_var`+")")
y0label = "Timeseries"
y1label = "Estimate for mean"
y2label = "Estimate for variance"
prior_mean = 0.
prior_var = 1.
prior_kappa = 1.
prior_v = 0.
all_mean_guess = []
all_var_guess = []
for i in range(total_obs):
    posterior_mean = (prior_kappa*prior_mean+x[i])/(prior_kappa + 1)
    posterior_var = (prior_v*prior_var + prior_kappa/(prior_kappa + 1)*(x[i]-prior_mean)**2)/(prior_v + 1)
    prior_kappa += 1
    prior_v += 1
    prior_mean = posterior_mean
    prior_var = posterior_var

We can see that the iterative estimation has successfully approximated the mean and variance of the underlying distribution! We "learned" these parameters, given only a dataset and the knowledge that it could be approximated by the Gaussian distribution. In the next installment of this series, I will cover linear regression. These two techniques (parameter estimation and linear regression) form the core of many machine learning algorithms - all rooted in basic statistics.

Pattern Recognition and Machine Learning, C. Bishop
Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern, and D. Rubin
Classnotes from Advanced Topics in Pattern Recognition, UTSA, Dr. Zhang, Spring 2013