Monday, March 26, 2012

How Fish Shoal (3): From Linear Fits to Neural Networks

In the last post we looked at how we could adjust the slope of a simple straight line fit to minimise the residual error between the data and the approximating function:

Behaviour = Slope x Stimulus + residual

Although linear regression like this can be very powerful, and is often used to determine if two variables are linked (via correlation analysis), in general we cannot expect that animal behaviour is always so straightforwardly related to the stimuli or environmental cues we are interested in.

Consider, for example, how good you might feel over the course of an evening of heavy drinking. Here we can quantify the 'stimulus' as the number of drinks you consume. While some people are lucky enough to always drink in moderation, I would wager you have encountered evenings where your general state of enjoyment follows a trend like the one below.

Enjoy inference responsibly...

How can we fit a function to a more complicated-looking curve like this? One solution would be to try to guess the shape of the curve... but what function does it look like? If we choose badly then we'll never get a good fit to the data.

We'd like a more flexible alternative that doesn't require so much guesswork. To see how we can get there, first we need to understand how we might apply the previous straight line fitting techniques when we have multiple stimuli. Imagine we think that an animal's behaviour is a function of two stimuli:

Behaviour = f(Stimulus1, Stimulus2)

Well, how about starting by extending the equation we had for a straight line? The original equation for just one stimulus was:

Behaviour = Slope x Stimulus + residual

For reasons that will become clear, we can draw this relationship using the following schematic:



Extending this to two stimuli we have two slopes:

Behaviour = Slope1 x Stimulus1 + Slope2 x Stimulus2 + residual

and similarly a new picture, showing that the behaviour comes from adding the two different factors:




Now the behaviour varies in response to both stimuli. Each slope tells us how strong the relationship between stimulus and behaviour is, and we can find out what these slopes are in much the same way as before: either by trying different values until we get the lowest sum of squared errors (see the last post), or by using the special formulae that tell us the best solution directly. I don't intend to worry about these formulae here, but if you want to find them or see where they come from then the Wikipedia page on Ordinary Least Squares is a good place to look.
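
To make this concrete, here is a minimal sketch in Matlab (the language we'll use later in this series) of finding the two slopes by least squares. The data and variable names here are made up purely for illustration:

% Made-up data: behaviour depends on two stimuli, plus noise
n = 100;
stim1 = rand(n, 1);                 % first stimulus
stim2 = rand(n, 1);                 % second stimulus
behaviour = 2*stim1 - 0.5*stim2 + 0.1*randn(n, 1);
% Ordinary least squares: the backslash operator returns the
% slopes that minimise the sum of squared residual errors
X = [stim1, stim2];
slopes = X \ behaviour;             % should come out close to [2; -0.5]

The backslash operator here is doing exactly what those special formulae do: solving for the slopes with the lowest possible sum of squared errors in one step.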

We can extend this basic approach further, including more and more stimuli. But at this point you may be wondering: "How does this help us fit functions like the one about drinking enjoyment above?" Here we employ a wonderful little trick. Remember we talked about guessing the form of that function? I said that this would be too restrictive. But what if we guess lots of functions?

The trick here can be seen simply by taking the equation for two stimuli we had above, but now, instead of two different stimuli, imagining that we replace Stimulus2 with a function of Stimulus1. Let's try something simple, like Stimulus1^2:

Behaviour = Slope1 x Stimulus1 + Slope2 x Stimulus1^2 + residual

Now, if we find the best-fit values for these slopes, based on the observed values of Stimulus1 and Stimulus1^2, we are actually fitting a non-linear, quadratic function to the data. If we know what Stimulus1 is then we can easily calculate Stimulus1^2, and then we can treat them as if they were different stimuli. Fitting this function is still the same process as before - we choose different values for the slopes and find those that produce the lowest residual errors. Notice that our schematic of the model now has an additional layer between the stimulus and the behaviour.
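
As a quick sketch of this trick (again with made-up data), we simply add a squared copy of the stimulus as an extra column and fit exactly as before:

% Treat Stimulus1^2 as if it were a second, separate stimulus
stimulus = linspace(0, 10, 50)';
behaviour = 3*stimulus - 0.4*stimulus.^2 + randn(50, 1);  % made-up data
X = [stimulus, stimulus.^2];   % columns for Stimulus1 and Stimulus1^2
slopes = X \ behaviour;        % best-fit coefficients of the quadratic
fitted = X * slopes;           % the fitted curve at each stimulus value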

We can take this simple trick further. Instead of many stimuli, we can use many different functions of the same stimulus. Consider a set of different functions of the stimulus, g_i(Stimulus), where each value of i specifies a different function. We can model the behaviour as being a sum of these functions, each with its own slope:

Behaviour = Slope1 x g1(Stimulus) + Slope2 x g2(Stimulus) + Slope3 x g3(Stimulus) + ... + residual

These functions could each be different powers of Stimulus, giving us a polynomial fit, or they could be any other set of functions. The key point here is that instead of guessing one kind of function to fit, we can try lots of different functions and weight them according to how important they are. Just as in every case before, when we try different values for the slopes, we get different residual errors, and we aim to find the best values that minimise those errors. Of course, as the number of slopes increases it becomes harder to find the best values easily, but in principle the task is the same.
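
To illustrate, here is a sketch in which the g_i are Gaussian 'bumps' spread along the stimulus axis - just one possible choice of functions, assumed here for the example. The fitting step is identical to before:

% A set of basis functions g_i: Gaussian bumps at different centres
stimulus = linspace(0, 10, 50)';
behaviour = sin(stimulus) + 0.1*randn(50, 1);   % made-up wiggly data
centres = 1:2:9;                                % centres of the bumps
G = zeros(length(stimulus), length(centres));
for i = 1:length(centres)
    G(:, i) = exp(-(stimulus - centres(i)).^2); % g_i(Stimulus)
end
% Each column of G behaves like a separate stimulus, so the slopes
% are found by the same least-squares step as before
slopes = G \ behaviour;
fitted = G * slopes;    % the weighted sum of the basis functions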

The schematic of this kind of model (below) shows that we now have an expanding middle layer, where each circle represents a different function of the stimulus. The value of Stimulus is passed to each of these functions. The output of each function is then weighted according to the value of its Slope and passed to the Behaviour, which is made from the sum of all these bits.



The next and final stage is to develop this middle layer one step further, so we can once again consider multiple different stimuli. Instead of just one stimulus at the start, leading into our middle layer, imagine we have many stimuli to consider. Each function in the middle layer is now dependent on all the different stimuli, e.g. g1 = g1(S1, S2, S3, ...). We connect each stimulus to every function in the middle layer...

Don't worry about the switch from circles to squares or from black arrows to blue; I'm just reusing an old image!
Now we have a model which is going to look very complicated if we write it down as an equation. In essence, though, the idea remains the same as our first linear model. Along each line we simply multiply the input by some number, just like the slope of the linear plot. The behaviour predicted by the model is given by adding up all the different bits coming out of the middle layer into the final node, and we can adjust the multiplying numbers on each line to get the lowest error possible. When we find the best values for these numbers we have something which acts as a function taking stimuli to behaviour. We can then put in any values of the stimuli we are interested in and see what the function predicts the animal will do.

You may have seen something like the above picture before. It is an example of what is known as an artificial neural network and is of a special type called a multi-layer perceptron. In such models the functions in the middle layer are usually sigmoidal functions that aim to mimic the highly thresholded response of real neurons to stimuli in the brain. Each function takes in a weighted sum of the stimuli, S, and sends out an output according to a profile like that below.


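Here is a minimal sketch of how such a network turns stimuli into a predicted behaviour, using made-up weights; in practice a toolbox like Netlab chooses and trains these weights for us:

% A tiny multi-layer perceptron evaluated once (made-up weights)
S = [0.5; 1.2; -0.3];    % three stimuli
W1 = randn(4, 3);        % weights from stimuli to 4 middle-layer units
b1 = randn(4, 1);        % biases of the middle-layer units
w2 = randn(1, 4);        % weights from middle layer to behaviour
b2 = randn;              % output bias
% Each middle-layer unit takes a weighted sum of the stimuli and
% passes it through a sigmoidal function
hidden = 1 ./ (1 + exp(-(W1*S + b1)));
behaviour = w2*hidden + b2;   % weighted sum gives predicted behaviour
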
These neural networks provide us with a tool which can fit extremely variable non-linear functions of many stimuli. Neural networks have a lot of parameters that can be varied, each one essentially like the slopes we learnt for the linear model. Just as before, when these parameters are changed they alter the residual error between the behaviour predicted by the model and the observed behaviour. Unfortunately there is no general solution that quickly tells us what these parameters should be, as there was for the linear model, but there are lots of clever ways to find good values for them by iteratively changing them, making sure the error keeps going down. But this is a long way beyond the scope of today's post.

Obviously a short blog post like this leaves many aspects of this kind of model fitting unaddressed, such as how we learn the parameters, or which functions we use for the middle layers. Although I have tried to give you some idea of how such a model works by relating it back to linear straight line fitting, the important thing to remember is what we shall see in the next post: you can use models like this while understanding almost nothing of what is going on inside. Lots of computer scientists have generously studied these sorts of models for decades, creating neat little toolboxes like Netlab for Matlab that allow us to fit complicated functions without getting our hands dirty with the modelling machinery! The important thing is to be secure enough in knowing what is going on in principle that you are happy to look away and let the toolbox do its work. In the next post I will try to show, with a bit of Matlab code, how we actually apply a neural network from this toolbox to (finally!) learn how fish shoal...

[If you are interested in more of the details surrounding this topic, I can highly recommend David MacKay's textbook, Information Theory, Inference, and Learning Algorithms (free online) - try Section V]

Thursday, March 22, 2012

How Fish Shoal (2): Fitting a Simple Function

In the last post we saw that understanding how a fish responds to a stimulus (such as the position of its neighbours) can be seen as determining an unknown function. The function takes the value of the stimulus as an input and outputs the response of the fish. Indeed, we can even imagine the fish as a black box that takes in stimuli and outputs behaviours.

Behaviour = f(Stimulus)


The question then is how we find out what this function is. If the data is quite simple we can get some idea by plotting all the recorded stimuli against the response they produced, e.g.



Here we only have one stimulus and one response, so it's quite easy to plot the data and see a pattern. Clearly the behaviour increases in line with the size of the stimulus, and the relationship seems close to a straight line. But of course the points don't lie exactly on a straight line; that would be too easy! We could try to fit a function that passed through every point exactly, but that would be rather complicated and wouldn't really add to our understanding...

Too complicated

Instead we can choose to see some of the variation from point to point as noise. Noise is simply variation in the observed behaviour that we either can't predict or are not interested in predicting. It might be due to inaccuracy in the measuring equipment, variation in the environment that we do not measure, or simply "biological variation" i.e. the fact that animals often act quite unpredictably!

Allowing for the existence of noise, we can now see the observed behaviour as the sum of predictable and unpredictable parts. The predictable part is given by the function we want to estimate, while the unpredictable part is the noise, also known as the residual:

Behaviour = f(Stimulus) + residual

Generally we want to predict as much about the animal's behaviour as possible, so we want to minimise the unpredictable part of this equation and maximise the predictable part. This means we need to find a function that minimises the distance between f(Stimulus) and the observed Behaviour. This is the justification behind least-squares regression (LSR).

To see LSR in action, let's try to fit a straight line to the data we saw above. Doubtless you will have had to do this in a maths lesson at some point, using some long formula to calculate the right line. Here instead we are going to examine exactly what we're doing when we try to fit a straight line to the data. For simplicity, let's say that our straight line will go through the origin (0, 0). So the equation of any straight line we might try is:

Behaviour = Slope x Stimulus + residual

If we imagine trying different slopes, we can see by eye that some are clearly better than others...

Good and bad straight lines

But why are some lines better than others? Because they 'go through the data better' - in other words, they minimise the overall distance between the line and the data points. This can be measured by taking the sum of squared errors (SSE) - a fancy way of saying 'measure the distance from the line to each data point, square each distance and add them all together':

SSE = r1^2 + r2^2 + r3^2 + r4^2 + ... + r10^2

Measuring the errors

The important thing to realise is that the SSE quantifies how good the fit is, and that there is a value of SSE for every possible slope. The task of least-squares regression is to find the slope that minimises the SSE. But note that there is no slope that is "right" - we're just trying to find the one that is least wrong.

Every slope has its own quality of fit

Here the best-fit line is found to have a slope of about 1.1, which is close to the slope of 1 that I used to make the data. Notice how the SSE (the total error) gets bigger as we move away from the best fit - making the line either steeper or shallower increases the error, which means that the fit is worse.

By seeing how quickly the fit gets worse when we change the slope, we can estimate how certain we are about the best fit. If there is a wide, flat region at the bottom where the SSE is almost the same, then there are lots of slopes that are almost as good as the best line, which tells us that the data isn't enough to estimate the function well.

It is important to see that we could find the best slope without using a special formula. We could simply use trial and error, calculating the SSE for one slope, then another, moving in the same direction if the SSE improves and going back if it gets worse. Or we could do what I did to plot that figure and simply try lots of possible values of the slope and find the SSE for each one. When we start fitting more complicated functions we won't always have a nice neat formula that tells us what the best fit will be.
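
Here is what that trial-and-error approach looks like as a small Matlab sketch, using made-up data generated with a true slope of 1:

% Try lots of candidate slopes and measure the SSE for each
stimulus = 10*rand(10, 1);
behaviour = stimulus + randn(10, 1);   % true slope is 1, plus noise
slopes = linspace(0, 2, 201);          % candidate slopes to try
SSE = zeros(size(slopes));
for i = 1:length(slopes)
    residuals = behaviour - slopes(i)*stimulus;
    SSE(i) = sum(residuals.^2);        % sum of squared errors
end
[~, best] = min(SSE);                  % the least-wrong slope
fprintf('Best-fit slope: %.2f\n', slopes(best));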

In the next post we'll look at how linear regression can be used for more than one stimulus and how we can fit non-linear functions.

Monday, March 19, 2012

How Fish Shoal (1): Understanding Regression

In late 2011 our research group at Uppsala University and our colleagues at the University of Sydney published a study that aimed to show how fish in shoals change their speed and direction of movement in response to the other fish. We had placed groups of fish together in a tank and tracked their movements (see video below). How do we get from that to understanding how they respond to each other?



Let us begin by asking what we mean by saying 'we understand' why a fish changes its motion. My definition is that I understand why a fish does this if I can predict it. That is, I can say what a fish will do next if I know what its environment is like (I'm leaving the exact definition of 'the environment' intentionally vague at this point).

Environment → Next Behaviour

Mathematically we call this finding a mapping between the things on the left and the things on the right. We often write a mapping as a function, f.


Next Behaviour = f(Environment)

Understanding requires us to find how the environment influences behaviour. That is, we must find out what the function, f, is.

Having defined our goal, our next task is to make things a little more concrete. What behaviours are we interested in predicting? Which aspects of the environment do we think could be important? We chose to try to predict how a fish will change its speed and direction, using the positions and directions of the other fish around it. Considering only the change of speed, we now want to find a function such that:

speed change = f(positions of other fish, directions of other fish)

Now these are things we can actually measure from the recorded tracks (see figure below). So we can say where, for example, the nearest fish to our focal fish was at each moment, and correspondingly how much the focal fish accelerated or decelerated at that moment.

Measuring the position of the nearest fish

So we have data giving both the inputs to the function (the positions/directions of the other fish) and the output (the acceleration of the focal fish). How do we learn what the function is? This task is generally known as regression. You will almost certainly have come across linear regression at school, which is a special case of this method, but generally regression simply means learning how one variable predicts another. Specifically, we use the term 'regression' in contrast to 'classification': regression is used when we are dealing with an output (such as the acceleration) that can take many values. Classification is used for either/or type outputs, such as whether someone recovers from a disease.

In the next post we'll see how we can use our data to learn what the function mapping the environment to behaviour looks like, starting by recapping on the principles of linear regression...

Of Prawns and Probability: Machine Learning in Animal Behaviour

My name is Richard, and I'm a Bayesian...

My research is centred on the application of techniques from machine learning and computational statistics to analyse data from animal behaviour experiments. Examples of such techniques might be Artificial Neural Networks, Gaussian Processes and Hidden Markov Models. These might be used to understand how fish follow each other, how pigeons navigate or how prawns interact with their neighbours.

Typically when we come to write up such analyses in a paper we refer the reader to 1-3 standard texts where they could, in principle, find out all they need to know to reproduce our work, noting only the specific details that distinguish our use of whichever tool we employ, in a short paragraph in the small print.

When our readers (gambling on the plural), or even my colleagues, refer to my work they usually say something along the lines of 'that Bayesian stuff you do', with the suggestion that such things are practically incomprehensible and perhaps somehow almost like witchcraft.

In fact, far from being a maze of complex mathematics that only committed disciples can penetrate, my use of these methods is predicated most importantly on knowing what I can afford not to understand in detail. I am very much an applications man, taking established tools off the shelf, tweaking them a little and employing them for new purposes. 

In this blog I'd like to demystify this process a little, explaining what I know about different methods, building up from some very basic ideas, and mostly showing how powerful a little knowledge can be. Over a series of posts I will try to show how analyses we have published actually work, trying to explain with a minimum of mathematics what is actually going on. So when I next say of some analysis 'this bit is fairly trivial', maybe someone will agree.