Sunday, February 19, 2012

A Bayesian Approach To Kalman Filtering

INTRODUCTION
Ripped straight from Wikipedia...
The Kalman filter is an algorithm, commonly used since the 1960s for improving vehicle navigation (among other applications, although aerospace is typical), that yields an optimized estimate of the system's state (e.g. position and velocity). The algorithm works recursively in real time on streams of noisy input observation data (typically, sensor measurements) and filters out errors using a least-squares curve-fit optimized with a mathematical prediction of the future state generated through a modeling of the system's physical characteristics.

TL;DR
Rockets

There are both linear and non-linear forms of the Kalman filter, with the non-linear forms being the Extended Kalman Filter (EKF), the invariant Extended Kalman Filter, and the Unscented Kalman Filter (UKF). The non-linear versions of this algorithm represent the current "gold standard" in many application areas, including GPS and navigation.

Before jumping in the deep end of the pool, I decided to implement a simple example that shows the ideas and implementation of Kalman filtering, using a recursive Bayesian approach.

TL;DR
Homework

WEIGHTING FUNCTION FOR KALMAN UPDATING
The Kalman filter is often derived from a matrix equation standpoint. The math, at least to me, is long, involved, and fairly nasty to solve without the help of some MATLAB matrix witchery. However, there is also a second, more "gut level" way to approach the Kalman filter - by approaching it as a case of recursive Bayesian filtering.

The derivation for the following equations can be found on pg. 47 of Bayesian Data Analysis, Second Edition, by Gelman, Carlin, Stern, and Rubin.

Basically, it is possible estimate the mean of a normal distribution by following a weighting equation for mean and variance, which can be represented as follows.

$\Large{\mu_1=\frac{\frac{1}{\tau_0^2}\mu_0+\frac{1}{\sigma^2}y}{\frac{1}{\tau_0^2}+\frac{1}{\sigma^2}}}$

$\Large{\frac{1}{\tau_1^2}=\frac{1}{\tau_0^2}+\frac{1}{\sigma^2}}$

Here, $\mu_1$ represents the new estimate for the mean and $\tau_1^2$ is the estimate for the new variance. $\mu_0$ represents the old mean, $\tau_0^2$ represents the old variance, $y$ is the new measured value, and $\sigma^2$ is the measurement variance. I will do an example later on to demonstrate this, but first, let's try to decide what this equation is really saying.

If the old value for variance, $\tau_0^2$, is very large, then $u_0$ is multiplied by a very small number, effectively making the equation...

$\Large{\mu_1 = \frac{\frac{1}{\sigma^2}y}{\frac{1}{\sigma^2}}}$

which can be further reduced to simply...

$\Large{\mu_1 = y}$

What this means, is that if we are not very confident ($\tau_0^2$ is large) of our old mean, $\mu_0$, our best guess for $\mu_1$ is simply the value we measured, $y$. Our new variance also reduces to

$\Large{\frac{1}{\tau_1^2}=\frac{1}{\sigma^2}}$

Conversely, if we are not very confident in the new result $y$, our new mean $\mu_1$ is basically the same as $\mu_0$. The new mean is based largely on the weighting between the new variance and the old - so if we are equally confident in both the previous mean and the new measurement, our best estimate for $\mu_1$ will split the difference.

Using this update function, it is easy to wrap a new estimate back into the prior distribution, which is the core idea behind Bayesian analysis. But how does it come into play with Kalman filtering?

APPLICATION
The Kalman filter can be boiled down to a few basic steps.
1. Predict a future value, based on previous information. (Predict)
2. Compare prediction and measured value, using results as the new previous information. (Update)
3. If you run out of data, or are satisfied with the estimate, exit. Otherwise, GOTO 1.

Pretty simple, right? This method can yield some very interesting results, cutting through measurement error and giving close approximation of the root signal.

From a high level, Bayesian derivation is

$P(X_i|Y_{1:i}) \alpha P(Y_i|X_i,Y_{1:i-1}) P(X_i|Y_{1:i-1})$

Because Bayesian analysis usually wraps it's results back into the prior model,
this reduces to

$P(X_i|Y_{1:i}) \alpha P(Y_i|X_i) P(X_i|Y_{1:i-1})$

What this means is we have a predictive function, $P(Y_i|X_i)$, which will be described a bit later, used in conjunction with a likelihood function, $P(X_i|Y_{1:i-1})$, which was described above. Together these two functions form the core of the Kalman filtering operation: Predict, Measure, Update, Repeat.

SETUP
The key things we need to know are:
The standard deviation of our source. I chose .6
The scale factor of the source. This can be calculated from
$1 = (scale factor)^2+(source std dev)^2$
The standard deviation of the measurements. I chose .5, but this is a parameter you can play around with. The higher the measurement sigma gets, the worse our filter performs, due to signal to noise ratio effects. An intial deviation of .5 is fairly high, so dropping this number to something like .25 will give "prettier" results.

In a real system, you could reasonably expect to have tolerances for both your source equipment and your measurement equipment. Maybe you could make measurements of this, or it could be found in the manufacturer's datasheet.

I will be showing this method through an example, using R. We will create a test case by following a few generic steps:
1. Create a source signal with some noise variance $X$.
2. Create a measurement signal, which will introduce measurement variance $Y$.
3. Kalman filter the measurement signal $k = filt(Y)$.
4. Compare the filtered data $k$ with the original transmission $X$.

So let's begin.

Step 1.
Create a source signal. I chose to generate a random walk.

randomwalk <- function(num, source_scale, source_sigma) {
randy <- c(0,1:num)
for(i in 2:length(randy)) {
randy[i] <- source_scale*randy[i-1]+rnorm(1, sd=source_sigma)
}
return(randy[2:length(randy)])
}

Step 2.
Now, we need a function to create a measurement signal, which will further distort our input.

measured <- sent+rnorm(length(sent), sd=meas_sigma)
return(measured)
}

Step 3.
The signal is pretty badly distorted now... time to write a Kalman filter function.

pred_mean <- function(source_scale, prev_mean) {
return(source_scale*prev_mean)
}

pred_sigma <- function(source_scale, prev_sigma, source_sigma) {
return(sqrt((source_scale**2)*(prev_sigma**2)+source_sigma**2))
}

update_mean <- function(pred_mean, pred_sigma, meas_val, meas_sigma) {
numerator <- (pred_mean/(pred_sigma**2))+(meas_val/(meas_sigma**2))
denominator <- (1/(pred_sigma**2))+(1/(meas_sigma**2))
return(numerator/denominator)
}

update_sigma <- function(pred_sigma, meas_sigma) {
r =(1/(pred_sigma**2))+(1/(meas_sigma**2))
return(1/sqrt(r))
}

filt <- function(y, source_scale, source_sigma, meas_sigma) {
last_mean <- 0
last_sigma <- source_sigma
k <- 1:length(y)
for(i in 1:length(y)) {
est_mean <- pred_mean(source_scale, last_mean)
est_sigma <- pred_sigma(source_scale, last_sigma, source_sigma)
k[i] <- est_mean+rnorm(1, sd=est_sigma)
last_mean <- update_mean(est_mean, est_sigma, y[i], meas_sigma)
last_sigma <- update_sigma(est_sigma, meas_sigma)
}
return(k)
}

A few quick derivations are required for this code to make sense. The update_mean and the update_variance functions were described at the start of the blog, but where on earth did the pred_mean and pred_sigma functions come from?
According to notation of the Kalman filter, we currently have
$X_i = \alpha_s X_{prev}+\omega_s$, where $\alpha_s$ is our source scale, and $\omega_s$ is the source standard deviation. In Kalman filter terms, this equation is the state model, and we got this from the knowledge we have about how the randomwalk values were generated.
The second equation, our space model, is $Y_i = X_i + \omega_m$, where $\omega_m$ is the measurement standard deviation.
In order to get the equations for our pred_mean and pred_sigma we want to find the expected value and variance of the state model, which looks like this.
$E[X_i] = E[\alpha_s X_{prev}+\omega_s]$

using the rules of associativity for expected value...

$E[X_i] = \alpha_s E[X_{prev}]+E[\omega_s]$

The expected value of zero-mean noise is 0, so now we find

$E[X_i] = \alpha_s E[X_{prev}]$

This matches exactly with the code for pred_mean - our new guess is the old mean multiplied by the scale factor.
The pred_sigma function is a little trickier.
$Var(X_i) = Var(\alpha_s X_{prev}+\omega_s)$
Since $Var(Q)$ is the same as $E[(Q-E[Q])^2]$, we now see
$Var(X_i) =$
$E[(\alpha_s X_{prev}+\omega_s)^2]-E[\alpha_s X_{prev}+\omega_s]^2$
Expanding, this now becomes
$Var(X_i) =$
$\alpha_s^2 E[X_{prev}^2]+2\alpha_s E[X_{prev}] E[\omega_s]+$
$E[\omega_s^2] - E[\alpha_s X{prev} + \omega_s]^2$

Continuing, we see that

$- E[\alpha_s X{prev} + \omega_s]^2$ can be expanded to
$- E[\alpha_s X_{prev}+\omega_s]E[\alpha X_{prev}+\omega_s]$

which becomes

$-(E[\alpha_s X_{prev}]+E[\omega_s])(E[\alpha X_{prev}]+E[\omega_s])$

FOIL, and we now see

$-\alpha_s^2E[X_{prev}]^2-2\alpha_s E[\omega_s]E[X_{prev}]-E[\omega_s]^2$

One view of this expression is that $\omega_s$ is a normally distributed random variable with a mean of 0. $E[\omega_s]$ is 0, and squaring that is still 0, so both the $E[\omega_s]$ and $E[\omega_s]^2$ end up going to 0. It is important to note that $E[\omega_s^2]$ will NOT go to zero, as the mean of the squared distribution is no longer centered about 0, and instead becomes $\omega_s^2$.

However, another, more natural view is to note that $E[\omega_s^2]-E[\omega_s]^2$ is identical to $Var(\omega_s)$ which is actually  $\omega_s^2$. The $2\alpha_s$ terms cancel.

Either way, the final result is the same.

$Var(X_i) = \alpha_s^2 Var(X_{prev}) + \omega_s^2$

This matches the code for pred_sigma - square root of scale factor squared times the previous sigma + the measurement sigma is our best guess for the new sigma.

See http://mathworld.wolfram.com/Variance.html for an alternate derivation that leads to the same result.

Step 4.
Now we can write a function to tie it all together.

runit <- function() {
source_sigma <- .01
source_scale <- sqrt(1-source_sigma**2)
meas_sigma <- .4
x <- randomwalk(1000, source_scale, source_sigma)
k <- filt(y, source_scale, source_sigma, meas_sigma)
plot(y, type="l", col="red")
lines(k, col="green")
lines(x, col="blue")

}

The blue is the source signal, the red is the signal at measurement, and green is the recovered signal. Even when measurement noise washes out the root signal, we can recover the original fairly well.

Try tuning each of the sigmas, and see how the results change - it's pretty interesting.

﻿
There you have it! A simple, logical derivation of the Kalman filter as a recursive Bayesian filter. In the future I plan to write about more complex statistical processing methods as I learn them, such as how to run this simulation with 0 known parameters, or implementation of one of the non-linear Kalman filter algorithms.

As always, critique is both welcome and appreciated.

CODE
To run this code, simply copy and paste into a source file. Then open an R interpreter, do source("path/to/source/file"), then do runit().

randomwalk <- function(num, source_scale, source_sigma) {
randy <- c(0,1:num)
for(i in 2:length(randy)) {
randy[i] <- source_scale*randy[i-1]+rnorm(1, sd=source_sigma)
}
return(randy[2:length(randy)])
}

measured <- sent+rnorm(length(sent), sd=meas_sigma)
return(measured)
}

pred_mean <- function(source_scale, prev_mean) {
return(source_scale*prev_mean)
}

pred_sigma <- function(source_scale, prev_sigma, source_sigma) {
return(sqrt((source_scale**2)*(prev_sigma**2)+source_sigma**2))
}

update_mean <- function(pred_mean, pred_sigma, meas_val, meas_sigma) {
numerator <- (pred_mean/(pred_sigma**2))+(meas_val/(meas_sigma**2))
denominator <- (1/(pred_sigma**2))+(1/(meas_sigma**2))
return(numerator/denominator)
}

update_sigma <- function(pred_sigma, meas_sigma) {
r =(1/(pred_sigma**2))+(1/(meas_sigma**2))
return(1/sqrt(r))
}

filt <- function(y, source_scale, source_sigma, meas_sigma) {
last_mean <- 0
last_sigma <- source_sigma
k <- 1:length(y)
for(i in 1:length(y)) {
est_mean <- pred_mean(source_scale, last_mean)
est_sigma <- pred_sigma(source_scale, last_sigma, source_sigma)
k[i] <- est_mean+rnorm(1, sd=est_sigma)
last_mean <- update_mean(est_mean, est_sigma, y[i], meas_sigma)
last_sigma <- update_sigma(est_sigma, meas_sigma)
}
return(k)
}

runit <- function() {
source_sigma <- .01
source_scale <- sqrt(1-source_sigma**2)
meas_sigma <- .4
x <- randomwalk(1000, source_scale, source_sigma)
k <- filt(y, source_scale, source_sigma, meas_sigma)
plot(y, type="l", col="red")
lines(k, col="green")
lines(x, col="blue")

}

1. Is that pseudocode? If not, what language is it?

1. I guess it is R http://www.r-project.org/

2. It is R - I am trying to cross the gap from MATLAB so I'll probably be doing several more posts with R code.

Though I still L O V E python.

2. Discussion fro reddit which prompted an update in my derivation...
user Kometes_:
I understood everything except the pred_sigma part. To simplify properly shouldn't the last line of the expansion be:
Var(X) = ... E[ω2 ] − α2 E[Xprev]2 - 2αE[Xprev]E[ωs] - E[ω]2
Also this gives the variance of X, so shouldn't pred_sigma be the square root of Var(X)?

pred_sigma <- function...
return sqrt(source_scale^2*prev_sigma^2 + source_sigma^2)

My response:
You are right on the money. I messed up pretty good - time to go correct myself.

Because we are expanding −E[αXprev+ωs]2 , I should have expanded to

−E[Xprev+ωs]E[αXprev+ωs]

which then becomes

−(E[αXprev]+E[ωs])(E[αXprev]+E[ωs])

FOIL, and we see

−α2E[Xprev]2−2αE[ωs]E[Xprev]−E[ωs]2

The reason it works either way is the ωs is a normally distributed random variable with a mean of 0. So E[ωs] is 0, and squaring that is also 0, so both of the bolded terms end up going to 0... I think.

It is important to note that E[ω2s] will NOT go to zero, as the mean of the squared distribution is no longer centered about 0, and instead becomes ω2s.

It was a mistake on my notes that worked out through luck - your expansions are the logical (and correct!) next step. I will correct the blog.

On the pred_sigma part - you are most definitely correct! I have updated my code and things look much better now.
Thanks for helping me get this right.

3. can you explain why in the filt function, k value has a ranif(1) in it? I asked this because i thought noise from the input are from the randomwalk/signal functions.

1. All of my noise functions are a little bit off - I mistranslated form my MATLAB code. I should have been using rnorm, not runif - I will update accordingly. And really, multiplying by the sigma isn't as clean as rnorm(1, sd=sigma) so I will do that as well.

4. For another simple introduction to the Kalman filter, in the setting of learning theory, see the third page of this article:

http://www.indiana.edu/~kruschke/articles/Kruschke2008.pdf

1. Thanks John - Good link! Your book and articles have helped me quite a bit the last few weeks.

5. Thanks a lot for this article. I ported your R code to C# to help me understand how it works:
https://github.com/TaylorClark/KalmanCS