Prawns and Probability

Tuesday, November 6, 2018

Noise and collective behaviour

"You should never do anything random!", Michael Osborne told the room at large for the umpteenth time during my PhD. Mike, now an Associate Professor at Oxford, is still telling anyone who will listen about the evils of using random numbers in machine learning algorithms. He contends that for any algorithm that makes decisions at random, there is a better algorithm that makes those decisions according to some non-random criterion.

The strongest arguments for random numbers in computation seem to be their

1. low cost and their
2. unbiasedness.

Below are my best attempts at counter-arguments.
— Michael A Osborne (@maosbot) 28 September 2018

Having seen this debate sporadically erupt over a decade or so has burned the issue of random decision making deep into my psyche, where it has lurked until I recently had reason to think a bit more deeply about the use of random decision making in my own research area: modelling animal behaviour.

Many models of animal behaviour make use of random decisions, or 'noise'. For example, an animal may choose a new direction in which to move by averaging the current directions of the other individuals around itself, and then adding some random error to that average to give its own new direction. But why should an animal do something random? Surely there is a 'best' action to be taken based on what the animal knows, and it should do this. Indeed, how would an animal do something random? It is remarkably difficult for humans to write down 'random' series of numbers without the aid of a random number generator, such as a coin or a dice. If you were asked to pick one of two options with a probability of 0.65, how would you do it? Why should an animal be any different?
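For readers who have not met this kind of model, the 'average your neighbours, then add noise' rule can be sketched in a few lines of Python. This is only an illustration: the function name, the use of a circular mean and the Gaussian form of the noise are assumptions made for the sketch, not a description of any particular published model.

```python
import math
import random


def new_heading(neighbour_headings, noise_sd, rng):
    """Return a new heading (radians) from the neighbours' headings plus noise.

    Hypothetical helper: averages the neighbours' directions as unit vectors
    (a circular mean), then adds Gaussian error with standard deviation noise_sd.
    """
    # Sum unit vectors so that headings near +pi and -pi average sensibly.
    x = sum(math.cos(h) for h in neighbour_headings)
    y = sum(math.sin(h) for h in neighbour_headings)
    mean_direction = math.atan2(y, x)

    # The 'noise': whatever part of the decision the modeller does not explain.
    noisy = mean_direction + rng.gauss(0.0, noise_sd)

    # Wrap the result back into the range (-pi, pi].
    return math.atan2(math.sin(noisy), math.cos(noisy))


rng = random.Random(42)
print(new_heading([0.1, 0.3, 0.2], noise_sd=0.1, rng=rng))
```

Setting `noise_sd` to zero gives a fully deterministic animal; everything the rest of this post questions lives in that one noise term.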

Usually when we ascribe noise to animal or human decisions, what we are really doing is modelling that part of the decision that we either don't understand, or that we choose to ignore. For example, in a recent paper my coauthors and I looked at factors influencing neighbourhood choice in Stockholm. We modelled choices as being influenced by the characteristics of the moving household and the neighbourhoods they were choosing from, but ultimately being probabilistic - i.e. random. As we say in the paper, this is equivalent to assuming that the households are influenced by many other factors that we don't observe, and that they make the best choice given all this extra information. Because we don't see everything that influences the decision, it appears 'noisy' to us.

So far, so good. This is fundamentally no more controversial than treating a coin toss as random, even though we know that the coin is obeying deterministic physical rules. As long as we use a decent model for the stochasticity in these decisions, we can happily treat what are really deterministic decisions as being random, and still make solid inferences about what influenced them. But we can run into trouble when we forget that only we are playing this trick. This becomes a problem in the world of collective behaviour, where we want to understand how animals are influencing and being influenced by each other. Though we might treat individual animals' decisions as being partly random, we cannot guarantee that the animals themselves also do the same thing. Indeed, it is likely that the animals themselves have a better idea about what factors motivate and influence each other than we do. Where we might, in our ignorance, see a random action, another animal might well see a response to some cue that we haven't thought to look for.

To illustrate, let's imagine that you and I are trying to choose a restaurant. For the purposes of simplicity I will assume that we like very much the same things in a restaurant - we have the same tastes in food and ambience. We approach two similar-looking restaurants, A and B. I can tell that the food in restaurant A smells somewhat more appetising than in B. Nonetheless, I see you starting to walk in the direction of restaurant B. I know we can both smell that the food in A is better, so what should I make of your decision? If I assume your decision is partly random, I might just assume you made a mistake - A really is better, but you randomly picked B instead. I am then free to pick A. But if I assume you made the best choice with the information available, I must conclude that you have some private information that outweighs the information we share - maybe you earlier read an excellent review of restaurant B. Since our tastes are very similar, I should also conclude that if I had access to your private information as well, I would have made the same choice, since the choice is determined exactly by the information. So now I really ought to pick restaurant B as well.

This place looked great on the web...
[Kyle Moore, CC-SA 1.0]

Looking at collective decision making this way shows that how individuals should respond to each other depends on how much they ascribe the choices made by others to random chance, not how much we do. We therefore need to be careful not to assume that 'noise' in the behaviour of animals in groups is an intrinsic property of the decisions, but instead remember that it depends on choices we make in deciding what to measure, and what to care about. The animals themselves may make very different choices. The consequences of adopting this viewpoint are laid out in detail in my recent paper: Collective decision making by rational individuals. In short they are:

1. The order in which previous decisions have been made is crucial in determining what the next individual will do - the most recent decisions are the most important.

2. Because of the above, how animals appear to interact depends strongly on what we choose to measure. Social information isn't something we can measure in a model-free, neutral way.

3. Group behaviour should change predictably when we observe animals (or humans) in their natural habitat versus the laboratory. In general, social behaviour will probably be stronger in the lab.

None of this is to say that animals or humans always (or ever) do behave rationally. Rather, that they make decisions on the basis of reasons, not the roll of a dice. And their reaction to the choices made by others will be shaped by what they perceive those reasons to be in other individuals. Perhaps, to paraphrase Michael Osborne, we should never assume that other people or animals are doing anything random. Or at least we shouldn't assume that other people are assuming that.........

Thursday, November 1, 2018

Yet more reasons to fund diverse basic science

Research is an incremental, iterative process. New advances build on those that came before, and open up new lines of research to follow afterwards. But not all research leads anywhere. The office drawers of academics are full of manuscripts that never got published, or data from studies that never showed any results. Whole fields such as phrenology enjoy periods in the sun before fading away (if you know of any modern research that directly descends from phrenology, let me know in the comments).

In this respect, research is a lot like the Tree of Life, with each project or study being a species. Species may give rise to new species (new research questions), or they may go extinct, but the Tree of Research (hopefully) endures.

Mathematicians have tools for understanding tree-generating processes such as these: birth-death models. These specify what types of tree are likely to be generated based on the rates of speciation and extinction for individual species.

Graham Budd and I recently published a study investigating the properties of these processes. Trees generated by birth-death processes are very vulnerable when they are young; a newly created tree with only a few species can easily stop growing if all of those species go extinct. On the flip side, trees that have already generated many species can be very robust and are hard to push towards extinction. A consequence of this is that trees that do survive a long time tend to have bursts of rapid diversification at the start. Looking more deeply into the trees that survive, we find that the surviving lineages (those species that have modern descendants) are always diversifying like crazy, speciating at twice the rate we would otherwise expect.

Trees that survive for a long time tend to diversify quickly when they are small (Budd & Mann 2018)
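The vulnerability of small trees is easy to see in a toy simulation. The sketch below is not the analysis from the paper: the speciation and extinction rates, the size cap used to declare a tree 'safe', and the function names are all illustrative assumptions. It only tracks the number of living lineages, where each event is a speciation with probability b/(b+d) and an extinction otherwise.

```python
import random


def survives(start_lineages, birth_rate=1.0, death_rate=0.5, cap=100, rng=None):
    """Simulate one tree; return True if it reaches `cap` lineages before dying out.

    Assumed rates are per lineage, so the chance that the next event is a
    speciation is the same whatever the current size of the tree.
    """
    rng = rng or random.Random()
    p_birth = birth_rate / (birth_rate + death_rate)
    n = start_lineages
    while 0 < n < cap:
        # Next event: one lineage either splits (+1) or goes extinct (-1).
        n += 1 if rng.random() < p_birth else -1
    return n >= cap


def survival_fraction(start_lineages, runs=4000, seed=1):
    """Fraction of simulated trees that survive, from a given starting size."""
    rng = random.Random(seed)
    return sum(survives(start_lineages, rng=rng) for _ in range(runs)) / runs


for n in (1, 2, 5, 10):
    print(n, survival_fraction(n))
```

With these assumed rates the standard result is that a tree starting from n lineages goes extinct with probability (d/b)^n, so a single-species tree dies about half the time while a tree of ten species almost never does.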

What does this have to do with research funding? Increasingly research funding is allocated on the basis of competitive grant applications. I have written before about the waste involved in this, but another consequence is that research diversity suffers. To get a grant in the UK for example, you must convince the funder and reviewers that you have a very good chance to make notable findings and have impact in academia, industry and elsewhere. This requirement, along with the notable and growing bias towards funding senior academics who have substantial previous funding, favours research that is predictable, which follows the researcher's previously demonstrated expertise and where preliminary results are already available. This in turn reduces the diversity of possible research avenues that might be explored.

What is the result of reducing diversity? Our research suggests that if we depress the diversification of research we risk extinguishing the Tree of Research altogether. If we focus research efforts too narrowly we put too many eggs in too few baskets. The future success of those research areas is less predictable than we might like to think - few phrenologists thought that their expertise would one day be seen as quackery. If those bets don't pay off then scientific progress may slow down or stop altogether.

Lineages that give rise to long-term descendants are always diversifying quickly (red lines). Green lines diversify slowly and go extinct (Budd & Mann 2018)

But surely, you might reply, isn't it a good idea to check on the track record of scientists and look at their ideas before giving them lots of public money? No doubt there is some value in scrutiny, but given the competition for academic jobs I think we can safely say that most academics have already been scrutinised before they start asking for money. As stated above, I believe our ability to predict what will be a success is highly limited. Moreover, several studies have shown that we can't even agree on what is good or not anyway, reducing weeks or months of labour to a lottery. Just as importantly, as another of my recent papers, this time with Dirk Helbing, has shown, the way that we allocate rewards and resources based on past success can distort the things that people choose to research, and as a result reduce the collective wisdom of academia as a whole. Dirk and I showed that too much diversity in what people choose to research is greatly preferable to too little: as a collective we need the individuals who research seemingly mad questions with little chance of success. Unfortunately, the most natural ways to reward and fund academics based on their track record would seem to create far too little diversity of research.

Rewards influence diversity and collective wisdom. Too much diversity (orange line) is better than too little (black and blue lines). (Mann & Helbing 2017).

So what can be done? Dirk and I showed that collective intelligence can be optimised by retrospectively rewarding individuals who are proved right when the majority is wrong. This mirrors approaches in statistics for ensemble learning called Boosting, wherein we train models to predict data that other models were unable to predict accurately. So I would be in favour of targeting grants to those who have gone against prevailing opinion and been proved right. However, we also showed that if agents choose what to research at random this will create greater collective intelligence than many reward schemes. This would support funding many scientists with unconditional funding that supports research wherever their curiosity takes them. This would have the additional advantage of removing much of the deadweight cost of grant applications.

References:

Budd & Mann (2018): History is written by the victors: The effect of the push of the past on the fossil record. Evolution

Mann & Helbing (2017): Optimal incentives for collective intelligence. PNAS

Pier et al. (2018) Low agreement among reviewers evaluating the same NIH grant applications. PNAS

Price (2014), The NIPS Experiment: http://blog.mrtz.org/2014/12/15/the-nips-experiment.html

Monday, May 21, 2018

What crosswords can teach us about collective intelligence

Dear reader: it is only fair to give you advance warning that this post will be a thinly-veiled excuse for me to crow about winning the prize for the weekly Times Jumbo Cryptic Crossword...

...that being said, I have long meant to write a post about crosswords, and in particular what they can teach us about collective intelligence. So here we go:

Alexander wept, for there were no more worlds to conquer.

Most weeks I complete several crosswords in The Times (London not New York). I'm not an especially good crossword solver, and solving a typical crossword might take me anywhere from 30 minutes to several hours depending on the difficulty. Clearly solving a cryptic crossword is a task that requires 'intelligence' to perform, though exactly how transferable that concept of intelligence is can be debated. You don't need to be a maths whizz or a language expert - most of it is about learning a few basic rules of cryptic clueing and fostering a reasonably open mind. In the case of The Times it also helps to absorb a lot of weirdly specific knowledge and jargon of the sort that a certain demographic of person possesses - picture an English man in his 50s-70s who went to private or grammar school and then Oxbridge, and who grew up on a diet of Enid Blyton books and cricket. For reasons that are completely inexplicable you also need to know that 'rhino' can be a synonym for 'money'.

All of that is to say that I am not positing crosswords as a benchmark for general intelligence, but that they can be used as an example of a task that requires some type of intelligence to perform. In terms of crosswords, we can measure 'intelligence' firstly by how many clues one gets right, and among those who get all clues right, by the speed of completion.

What does this have to do with collective intelligence? Well, on many occasions I complete crosswords together with my friend Graham Budd [1]. As the saying goes, two heads are better than one, and when solving together we typically finish the crossword more quickly than on my own, despite us wasting time bemoaning the particularly excruciating clues and otherwise dwelling on our perceptions of ongoing societal collapse. As such, this is an example of collective intelligence - together we are able to solve a problem with greater intelligence than either of us alone.

An excruciating clue: Sunday Times Cryptic 4619

So far this is not especially noteworthy. Of course we are faster together! We can divide the labour. When one of us gets a clue we both get it. Even if we don't agree to split the clues, neither of us has to solve all the clues ourselves. However, what is surprising is that we often finish the crossword in less than half the time it would take me alone. That means that our 'intelligence' has more than doubled.

Such a case is called superadditive: if I write the performance of some individuals as f(Individuals) then:

f(Me + Graham) > f(Me) + f(Graham)

Or in plain language, Graham and I are 'more than the sum of our parts'.

Conversely, many cases of collective intelligence are subadditive, i.e.

f(Me + Graham) < f(Me) + f(Graham)

For instance, one of the most famous examples of collective wisdom comes from Francis Galton. Galton observed punters guessing the weight of a bull at a fair, and noted that the average of their guesses was uncannily accurate. We know that this is a consequence of the Law of Large Numbers, and thus we also know that the error in the average guess scales as 1/√N, where N is the number of guessers. This is a subadditive relation. If we double the number of guessers we do not halve the error, but only reduce it by a factor of about 1.4.
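The 1/√N scaling can be checked with a quick simulation. This is a minimal sketch under stated assumptions: the true weight, the spread of individual guesses and the choice of independent, unbiased Gaussian guessers are all invented for illustration and are not Galton's data.

```python
import random
import statistics


def mean_abs_error(n_guessers, true_weight=1200.0, guess_sd=100.0,
                   trials=2000, seed=0):
    """Average absolute error of the crowd's mean guess over many simulated fairs.

    Assumes each guess is independent and Gaussian around the true weight;
    all default values are hypothetical.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        guesses = [rng.gauss(true_weight, guess_sd) for _ in range(n_guessers)]
        errors.append(abs(statistics.fmean(guesses) - true_weight))
    return statistics.fmean(errors)


small_crowd = mean_abs_error(100)
double_crowd = mean_abs_error(200)

# Doubling the crowd should shrink the error by roughly sqrt(2), not by 2.
print(small_crowd, double_crowd, small_crowd / double_crowd)
```

The ratio printed at the end should land close to 1.4 rather than 2, which is the subadditive relation described above. Real crowds are rarely this well behaved: correlated or biased guesses would make the returns diminish even faster.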

These illustrate two fundamentally different regimes of collective intelligence. The superadditive relationship is one we typically see when groups have evolved specifically to work together, such as a colony of insects or the cells in your brain. A termite colony is truly more intelligent than the sum of its parts: no single termite could build the large intricate nest that the colony inhabits, even if it were given a huge amount of time to try. Likewise, no single neuron in your brain could learn...almost anything. The interactions between individuals produce something far beyond what they can do alone. In these situations the group can grow very large, as the benefits of group living increase with each new member.

Grand designs

On the flipside, subadditive collective intelligence is what we often see in groups of unrelated individuals, like the punters guessing the weight of Galton's bull. Other examples with similar properties are seen in the way that navigating birds pool their knowledge about how to fly home, or how groups of fish become better at avoiding predators. In each case the group is better than one individual, but there are diminishing benefits of adding more and more group members. In such situations the benefits of being in a group are naturally limited: for example, you might get better at finding or catching food, but then you have to share it with more other individuals.

Humans are not like insect colonies - we do not live in groups of genetically identical individuals who specialise and collaborate for the common good. But the most interesting examples of human collective intelligence occur when, despite this, we still find superadditive scenarios, where we can become more than the sum of our parts. Some problems naturally lend themselves to this type of collective solution. A good example is mathematics, where someone may work on a problem for years until they meet just the right person with the right knowledge to solve it together. On a more humdrum level, consider Graham and me completing the crossword. Despite sharing some things in common, we also have a lot of different knowledge. This means that Graham will easily solve some of the clues I find most difficult and vice versa. And due to the nature of the puzzle, when Graham solves a clue he may make the one I am looking at easier, by giving me some of the letters. We don't just divide the clues at random, we naturally each tend to look at the ones we are most likely to solve. Researchers in the USA have in fact shown that group intelligence is more related to the diversity of group members and the extent to which all individuals are able to participate than it is to the intelligence of the individuals themselves.

In the example of the crossword, this diversity of skills is a happy accident. But in other cases there are incentives for people to be specialised. Adam Smith noted that division of labour made industrial production much more efficient. Similarly, markets such as the stock exchange can reward a diversity of knowledge - the best way to make a profit is to know something about a company that other people do not. Some of my recent research has looked at how these incentives can be manipulated, and we find exactly this: rewarding people for accurately predicting something that other people were unable to predict creates the best environment for fostering collective intelligence.

What is the future for collective intelligence? Globalisation, increasing urbanisation and the internet have created ever greater rewards for specialisation. This has fostered economic growth and associated improvements in health and education, especially in what were once under-developed countries now enjoying the fruits of industrialisation. There has thus naturally been a drive to follow this trend further. But we should be wary of continuing this drive to specialisation indefinitely for the sake of group performance. One of the most depressing anecdotes I have ever heard [2] relates to specialisation: an accountant, finding his job rather unfulfilling, began spending large amounts of time playing the online multiplayer game World of Warcraft. In this game [3], players typically join together in 'guilds' to complete 'quests' together. Completing quests can gain players experience points and prizes which can be used to improve their characters in the game. However, when a guild completes quests together, the resources they expend or win, and the new materials they buy or sell, must be managed and divided equitably. Slowly over time this accountant's guild found they needed to devote more and more time to managing resources - a task that called for some specialisation. Eventually our hero found himself coming home from 8 hours of real world accountancy only to spend his evening doing the guild's accounts, while other players did the fighting on his behalf! As this anecdote illustrates, if we follow our specialities too closely we may eventually become alienated and lose our motivation to participate in the group at all, which will then reduce the group performance.

Feel the wrath of my double-entry bookkeeping [4]

In contrast, as I noted at the start, I am not a particularly good crossword solver. It can easily take me 4-5 hours in total to solve one of the 'Jumbo' crosswords on Saturdays. Given that one is unlikely to win the prize even if the crossword is completed correctly, and that the prize is a set of books that one could get for about £50 on Amazon, this is not an efficient way for me to acquire an atlas and a dictionary. I complete the crosswords because I find the puzzle intrinsically interesting and diverting. What's more, I value my ability to do a range of tasks, some of which I might even be actively bad at (as erstwhile members of Uppsala Wanderers FC will attest). Robert Heinlein said 'specialisation is for insects', and I'm inclined to agree; while a degree of specialisation is useful, too much goes against what makes us human, and deprives us of motivation and intrinsic reward in activities. To perform well, and to lead fulfilling lives, we need not just diversity between individuals, but also diversity within ourselves and our own minds - to experience the joy of mastering multiple tasks, and to have agency over our own lives rather than to feel like ever smaller cogs in an ever larger machine. This also makes us more flexible and robust. Mastering one task makes you vulnerable to that task becoming redundant. Being able to work in many different groups in a multitude of ways makes you more able to contribute as society's needs change.

So in conclusion, as someone who studies collective intelligence, I am most interested in finding how this can be fostered without crushing individual autonomy. I don't want us to end up looking like an insect colony. I'd rather we ended up like Graham and me, coming together to solve tasks that bring us satisfaction in a job collectively well done. I for one will be enjoying my atlas far more than any I could have bought on Amazon!

To the victor, the spoils

[1] Though not in the case of my immortal triumph in Cryptic Jumbo 1313!

[2] I vaguely recall this coming from Daniel Strömbom, of robot sheepdog fame.

[3] My knowledge of WoW is all at least 3rd hand so please excuse any inaccuracies in this description.

[4] As an academic statistician, I'm aware that I really shouldn't be nerd-shaming anyone.

Monday, November 6, 2017

Feedback in academic selection

I recently finished reading Cathy O'Neil's excellent book, 'Weapons of Math Destruction', which describes how the large-scale use of algorithms and metrics is producing dangerous systems that create and perpetuate unfairness and pathological behaviour. I highly recommend it to anyone interested in how the seemingly opaque systems that increasingly govern our lives work and came into existence.

One thing that O'Neil wrote about that struck a chord with me was systems that have no feedback to tell them whether they are working well. For instance, O'Neil writes about the use of baseball statistics to select a winning team. In this case, if the algorithms don't work, the team won't win and the team's statisticians are forced to change their models. She compares this to university ranking systems, where the true quality of a university is measured via a range of proxies, such as entrance scores, employment stats, publication metrics etc. In this case there is no external factor that can determine whether these measures are right or wrong, so in effect the proxies become the quality. As a result universities spend a lot of time chasing good scores on these proxies, rather than attending to their fundamental purpose of research and education.

As I was reading this I started thinking about how many systems in academia, and elsewhere, operate with a similar lack of useful feedback. As a result, many decisions are being made without any meaningful opportunity to reflect on whether these decisions, and the criteria on which they were based, were any good. For example, in the past few years I have sat on both sides of various hiring committees. These typically involve a group of faculty members interviewing several candidates, reviewing their work and watching their presentations, before collectively deciding which would best serve the needs of the department. This collective decision can be more or less equally shared between members of the committee, and may focus on particular immediate needs such as teaching shortages, or more generalised goals such as departmental research directions and reputation. In some institutions the candidates face a relatively short interview, while in others (particularly in the USA), they meet with many members of the department over several days. Different systems no doubt have their own particular merits and downsides.

What is rarely done though is to precisely define what the department hopes to achieve with this hire. Even rarer is to evaluate later whether the hire was the right decision. For instance, a department may want to increase its research reputation. This is a goal which may mean different things to different people - some may think it implies gaining more research funding, others may consider that publications in top tier journals are more important. To define a measure of success, the department could decide that it wants the hired candidate to publish as many papers as possible in a defined set of acceptable journals, or to bring in as much grant income as they can. It can then measure the success of the decision with respect to these numbers later.

But there remains a problem here. What threshold determines a good decision? The goal of the hiring committee was to select the best candidate. They should not be considered a success if they picked one good candidate from many others, nor a failure if they hired one poor candidate from a generally weak field. To decide if the hiring process was successful or not, it is necessary to keep track of the paths not taken, the candidates not selected. Academics are fairly easy to keep track of online - we have a strong tendency to build up elaborate online presences to advertise our research. Therefore it should be possible to keep an eye on shortlisted candidates who were not hired, and see how they perform.

Such a process raises statistical and ethical issues. Selected candidates may perform better simply because they were given a chance while others were not. Would it be ethical or wise for the department to make tenure contingent on outperforming the other shortlisted candidates? (I would argue not, but this would be similar to the practice of hiring more assistant professors than the department plans to give tenure to.) Nonetheless, applied sparingly and with a little common sense, it could give some idea as to whether hiring committees were able to judge which candidates were genuinely the best for the job more accurately than picking from the shortlist at random. This could then be used as evidence for improving hiring procedures in the future. For example, a department aiming to improve its research ranking might choose to employ young academics with papers in top journals, only to find that they struggled to replicate this success without the support of their previous supervisor. Over time they could recognise this pattern and look for more evidence of independent work and research leadership.

Similar questions could be asked of selection procedures for allocating grant money and publishing papers. In some cases there is a process for evaluating success (grant funders ask for reports, journals check their citations and impact factors), but all too rarely do those doing the selecting evaluate whether the people, papers or proposals that they rejected would have been better than those they selected, i.e. whether they succeeded in the task of selecting the best. Without this feedback, it is easy for institutions to lapse into making selections based on intuitively sensible criteria which have little hard evidence to support them.

Saturday, September 16, 2017

Guest post on Academic Life Histories

I have a new guest post up on the Academic Life Histories blog, detailing how luck and selection biases influence how we perceive success in academia.

Friday, June 16, 2017

Rethinking Retractions: Rethought

Four years ago I published what turned out to be one of my most popular blogposts: 'Rethinking Retractions'. In that post I related the story of how I managed to mess up the analysis in one of my papers, leading to a horrifying realisation when I gave my code to a colleague: a bug in my code had invalidated all our results. I had to retract the paper, before spending another year reanalysing the data correctly, and finally republishing our results in a new paper.

Since I wrote that blogpost I have found there are a lot of people out there who want to talk about retractions, the integrity of the scientific literature and the incentives researchers face around issues to do with scientific honesty.

Here are a few of the things that have resulted from that original blogpost:

David Duvenaud (who spotted the original bug in my code) created a presentation and a paper on the pitfalls of creating code for analysis, and sanity checks the analyst can use to avoid the same thing happening to them
The story was picked up by Times Higher Education
I was invited to tell my story at a symposium at the World Conference on Research Integrity 2017. The symposium was organised by Elizabeth Moylan, who with co-authors wrote a proposal for a new system of post-publication article alterations.
After speaking at the symposium, the story of my retracted paper was covered by the founders of Retraction Watch in STAT and then by Science

Speaking at the World Conference on Research Integrity

Looking back now at the original blogpost, I can see the situation with some more distance and detachment. The most important thing I have to report, five years after the original cock-up and retraction, is that I never suffered any stigma from having to retract a paper. Sometimes scientists talk about retractions as if they are the end of the world. Of course, if you are forced to retract half of your life's work because you have been found to have been acting fraudulently then you may have to kiss your career goodbye. But the good news is that most scientists seem smart enough to tell the difference between an honest error and fraud! There are several proposals going around now to change the terminology around corrections and retractions of honest errors to avoid stigma, but I think the most important thing to say is that, by and large, the system works - if you have made an honest mistake you should go ahead and correct the literature, and trust your colleagues to see that you did the right thing.

Meanwhile, I'm just hoping I still have something to offer the scientific community beyond being 'the retraction guy'...

Analogues between student learning and machine learning

Back in 2014 I was trying to make some progress towards my docent (Swedish habilitation) by fulfilling the requirement to undertake formal pedagogic training. As it happens, I left Sweden before either could be completed, but I recently went back through my materials, and found this essay I had written as part of that course. In the absence of anything else to do with it, here it now lies...

Introduction

Over time people have developed increasingly sophisticated theories of learning and education, and correspondingly teaching methods have changed and adapted. As a result, much is now known about what activities most promote student learning, and the differences between individuals in their learning techniques and strategies.

At the same time, computer scientists have developed increasingly powerful artificial intelligences. The creation of powerful computational methods for learning patterns, making predictions and understanding signals has drawn attention to a more mathematical understanding of how learning happens and can be facilitated.

Some of the parallels between these fields are obvious. For example, the development of artificial neural networks was driven by the analogy between these mathematical structures and the neuronal structure of the brain, and encouraged scientists to describe the brain from a computational perspective (e.g. in [Kovács, 1995]). However, the analogies between theories of learning in education and computer science are deeper than these surface resemblances, and go to the heart of what we consider useful information and knowledge, and what we mean by understanding.

In this report I will review elements of both the pedagogical and machine learning literature to draw attention to specific examples of what I consider to be direct analogues in these two fields, and how these analogies help organise our knowledge of the learning process and motivate approaches to student learning.

Learning to learn

When computer scientists first began creating an artificial intelligence, their first approach was to try to encode useful knowledge about the world directly in the machine, by explicit inclusion in the computer’s programming. For example, in attempting to create a computer vision system that could recognise handwritten letters, the programmer would try to describe in computer code what an ‘A’ or a ‘B’ looked like in terms that the computer could recognise in the images it received. However, this procedure generally proved dramatically ineffective. The sheer range of ways in which an ‘A’ can be written, the possible permutations on the basic design and the different angles and lighting that the computer could receive defeated the attempt to systematically describe the pattern in this top-down fashion.

Instead, success was first achieved in these tasks when researchers tried the radically different approach not of teaching the computer each concept individually, but instead teaching the computer how to learn for itself. In 1959 Arthur Samuel defined machine learning as a ‘Field of study that gives computers the ability to learn without being explicitly programmed’ [Simon, 2013]. By providing the computer with algorithms that allowed it to observe examples of different letters, and learn to distinguish these itself from the examples, much greater success was possible in identifying the letters. In essence, by teaching the computer good methods for learning, the computer could gain much greater understanding itself, and with less input from the programmer.

The parallel here with the teacher-student relationship is very direct. A teacher is responsible, of course, for providing a great deal of information to a student. But the best teachers are more successful because they teach the students how to learn for themselves, how to fit new examples into their existing understanding and how to seek the new information and examples they need. At the higher levels of tuition, encouraging and enabling this self-directed learning is essential. Anne Davis Toppins argues that within 30 minutes ‘I can convince most graduate students that they are self-directed learners’ [Toppins, 1987]. However, much as programmers initially tried to directly tell computers what they needed to know, before realising the greater efficiency of teaching them to learn for themselves, so has the pedagogical approach taken a similar path [Gustafsson et al., 2011]:

'For some lecturers, thinking in terms of empathising with and supporting the students’ learning and “teaching them to learn”, i.e. supporting them in their development of study skills, can constitute a new or different perspective. [...] Some teachers claim that since the students have studied for such a long time in other school situations, the higher education institution should not have to devote time to the learning procedure.'

In other words, there have been, and indeed still are, many lecturers who view their role primarily in terms of transmitting information, rather than in developing the students’ abilities to think and learn for themselves.

Conceptual understanding

In the modern teaching literature, much importance is placed on aiming for, and testing, students’ conceptual knowledge. That is, students are expected to learn not simply a series of factual statements, or isolated results, but instead to incorporate their knowledge into higher level abstract concepts that they can use to understand unfamiliar situations, solve unseen problems and extrapolate their knowledge to new domains. The prevailing doctrine of constructive alignment [Biggs, 1999] that forms the basis for recommended teaching approaches in European countries under the Bologna process is designed to make sure that teaching methods, student activities and assessment assignments all align towards this goal of promoting and testing whether students understand the ‘big picture’.

According to a computer scientist’s view of knowledge and information, there is a very good reason why we should aim to promote such a concept-centred approach for students. Identifying unifying principles that tie knowledge together and understanding how apparently different fields may link together reduces the amount and the complexity of the information that a student or computer must store, access and process, and maximises the effectiveness of extrapolating to new domains.

Consider as a simple example the data shown in figure 1. How can this data be effectively stored? The simplest method would be to record each pair of (x, y) co-ordinates. Assuming we use 4 bytes per number (single-precision floating point accuracy), this will take us 80 bytes (10 x’s, 10 y’s). But visually we can immediately recognise an important pattern; the data clearly lie along a straight line. If we know the gradient and intercept of this line we can immediately translate any value of x into a value of y. Therefore we can reproduce the whole data set by specifying just 12 numbers (48 bytes): the 10 values of x, one value for the intercept and one value of the gradient. Therefore by understanding one big idea, one concept about the data, that they lie along a line, we have cut the effort of learning and storing that information by 40%. Furthermore, we can now extrapolate to any new value of x, immediately knowing the correct corresponding value of y. If we had simply memorised the 10 pairs of co-ordinates we would have no way to do this.

In the field of machine learning this line of reasoning has been formalised into the principles of Minimum Message Length and Minimum Description Length, first proposed by Chris Wallace [Wallace and Boulton, 1968] and Jorma Rissanen [Rissanen, 1978] respectively. These state that the best model, or description, of a data set is the one which requires the least information to store. Modern texts on machine learning theory focus heavily on the superiority of the simplest possible models that enable reconstruction of the necessary information and stress the connection to the well established principle of Occam’s Razor (e.g. [MacKay, 2003]). Applications of machine learning theory to animal behaviour have further suggested that animals apply the same principles to maximise the value of their limited processing and storage capabilities [Mann et al., 2011], so it is likely that humans also apply similar methods.

Figure 1: By observing conceptual patterns in the data we can reduce the amount of memory needed to store it, whether on a machine or in a human mind. In this simple example identifying the linear relation between the X and Y co-ordinates (Y = 2X), we need to store only the X values, the intercept and the gradient, reducing the number of stored numbers from 20 to 12.
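The storage argument can be checked with a few lines of Python. This is only a minimal sketch: the ten x values are made up for illustration (the essay does not list the data in Figure 1), the numbers are assumed to be stored as 4-byte single-precision floats, and the function name `predict` is hypothetical.

```python
import struct

# Hypothetical data lying exactly on the line y = 2x, as in Figure 1.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]

# Option 1: memorise every co-ordinate (20 numbers, 4 bytes each).
raw = struct.pack("<20f", *xs, *ys)

# Option 2: store the x values plus the two parameters of the line (12 numbers).
gradient = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
intercept = ys[0] - gradient * xs[0]
compact = struct.pack("<12f", *xs, intercept, gradient)


def predict(x):
    """Recover (or extrapolate) y from x using only the stored concept."""
    return intercept + gradient * x


# The compact description reproduces every original y value...
reconstructed = [predict(x) for x in xs]

# ...and also answers questions about x values that were never stored.
unseen_y = predict(25.0)

print(len(raw), "bytes memorised vs", len(compact), "bytes with the line")
```

The memorised version has no counterpart to `predict(25.0)`: a list of pairs can only return what was put into it, which is the extrapolation point made above.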

An analogous example in student learning might be seen in teaching mathematics students to solve equations. The most naive way for students to learn how to solve a particular type of problem in an exam would be to observe many, many examples of the problem, remember the solution to each one and then attempt to identify a match in the exam and recall the solution for the matching equation. Such an approach, while not entirely unknown among students cramming for final exams, is likely doomed to failure. It requires an enormous amount of (trustworthy!) memory to store even a fraction of the possible problems one might see in the exam, and if a new problem is encountered there is no way to generalise from the known solutions to other equations in order to solve it. A much more efficient method is to learn general techniques that can be applied to any possible equation. In this case the student need only remember a few core principles and how to apply them. They can then solve both equations they have seen before and new examples.

Strategic learning

A common characteristic of high-achieving students is a strategic approach to learning. They have a good overview of what they need to learn to achieve their life goals. They set realistic but challenging learning goals for themselves to the end of learning this material. And they actively seek out information from teachers, reading materials and other sources to aid their learning. Whether their goals are intrinsic (interest in the subject, desire for knowledge) or extrinsic (obtaining a degree, getting a job), this strategic approach to learning systematically produces better outcomes than passively receiving whatever information is offered.

Analogously, in the field of machine learning, recent developments have tended more and more towards ideas termed ‘active learning’ [Settles, 2010]. The previous paradigm of simply offering many examples to the computer to learn from and then assessing or using the results of that process has been overturned. Instead, the programmer/mathematician devises a strategy for the computer to seek out new examples, based on what it wants to achieve (e.g. identifying written letters successfully) and what it currently knows. For example, if the computer has a good idea how to recognise an ‘A’, but frequently confuses a ‘U’ and a ‘V’, it will seek out or request more examples of these letters so that it can improve its knowledge. This way it does not waste time learning redundant material, but maximises the result of its effort by focusing on the most rewarding areas.
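As a toy illustration of the difference, the sketch below compares a learner that labels examples in the order they are offered with one that always asks about the example it is least sure of. Everything here is an assumption made for the sake of the example: the one-dimensional pool, the hidden boundary at 0.62, and the function names are all hypothetical, and real active learning methods (see [Settles, 2010]) use more general measures of uncertainty.

```python
def oracle(x, boundary=0.62):
    """Hypothetical teacher: labels a point 1 if it lies above a hidden boundary."""
    return 1 if x > boundary else 0


def passive_learner(pool, ask):
    """Accept examples in the order offered until the first positive one appears."""
    queries = 0
    for i, x in enumerate(pool):
        queries += 1
        if ask(x) == 1:
            return i, queries
    return len(pool), queries


def active_learner(pool, ask):
    """Always request the label of the point the learner is least certain about.

    For a sorted pool and a single boundary, that is the midpoint of the
    region whose labels cannot yet be inferred from earlier answers.
    """
    lo, hi = 0, len(pool) - 1  # pool[lo..hi] is still uncertain
    queries = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        queries += 1
        if ask(pool[mid]) == 1:
            hi = mid - 1  # everything above mid is now known to be 1
        else:
            lo = mid + 1  # everything below mid is now known to be 0
    return lo, queries


pool = [i / 100 for i in range(101)]  # sorted candidate examples
passive_index, passive_queries = passive_learner(pool, oracle)
active_index, active_queries = active_learner(pool, oracle)
print("passive:", passive_queries, "labels; active:", active_queries, "labels")
```

Both learners locate the same boundary, but the active one needs far fewer labels because it never asks about examples whose answer it could already predict.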

Likewise a high-performing student will focus their attentions on areas where they are weak and/or particularly crucial concepts that provide a pivot for understanding. They will ask their teachers for more feedback on their efforts in these areas, spend more time on mastering them and prioritise them ahead of areas of less importance or that are already understood. McKeachie’s Teaching Tips [McKeachie and Svinicki, 2013] devotes a chapter to the importance of encouraging strategic and self-regulated learning. One of their descriptions of a strategic learner states:

‘Strategic learners know when they understand new information and, perhaps more important, when they do not. When they encounter problems studying or learning, they use help-seeking strategies’.

This emphasis on the importance of knowing where understanding is lacking and the resultant help-seeking strategy perfectly aligns with what information theory tells us is the optimal way to gain useful knowledge.

McKeachie’s Teaching Tips [McKeachie and Svinicki, 2013] also focuses on the importance of student learning goals. My own research in the field of active learning corroborates this view, demonstrating that even when a learner has a good learning strategy, the success of that strategy depends intimately on the goals that the learner sets themselves. Indeed, without a suitable goal the learner is unable to define a useful strategy [Garnett et al., 2012]. Thus, in order to develop students’ strategic learning skills, it is essential first to help them define and identify what their individual goals are. A student for whom this is an essential course, but who is otherwise uninterested, may be best helped by helping them to clarify what they wish to achieve (a certain final grade for instance), and then working with them to establish what strategy will most likely allow them to reach that outcome. A student with greater intrinsic motivation for the course may need help setting specific staged learning goals that enable a learning strategy. The teacher’s experience in understanding the most effective path through the material would therefore be essential in establishing effective goals that the student can then apply a strategy to achieve.

Discussion

While student and machine learning are clearly not direct parallels of each other (could one imagine a machine equivalent for tiredness, or skipping class to watch TV?), the analogies that do exist between the two help us to understand why certain approaches to student learning are more successful than others, via the large body of technical knowledge that exists regarding how machines can be taught. In this report I have analysed a selection of those analogies, aiming to draw conclusions about how students should be taught.

In particular, a common theme of modern pedagogical approaches is to move from information transfer to a student-directed learning approach. In a sense, computer scientists have been down this path already, switching from a programmer-led to a computer-led learning approach that has resulted in far superior learning outcomes. This should motivate and support the equivalent transition in student learning.

In teaching computers how to think and learn, we have also needed to help them establish goals and strategies for learning, and this is now the forefront of machine learning research. The dramatic improvement in computer learning outcomes when well-developed strategies are employed should remind us that it is the manner in which the student approaches new information and requests help and feedback that matters at least as much as the amount of information they are presented with. Such knowledge demands that we devote time to monitoring and developing students’ learning strategies and discussing what they hope to achieve via our courses.

Students, like all of us, are presented with a great deal more information than they can easily process and digest. If computer science in the 21st century has taught us anything, it is the importance of identifying general patterns in the vast body of information we are now exposed to via the media, the Internet and other sources. Without relatively simple general principles, information can easily become overwhelming. That the same principle applies in student learning should not surprise us. How is a student to retain all the information we attempt to transfer to them without organising it into general principles rather than a huge array of specific cases? The content of any course therefore should revolve as much around this organisational structure as the raw information itself, demanding generalised understanding rather than specific regurgitation. Thankfully this is the direction modern pedagogy is taking, with such concepts as constructive alignment and the SOLO taxonomy.

References

[Biggs, 1999] Biggs, J. (1999). What the student does: teaching for enhanced learning. Higher Education Research & Development, 18(1):57–75.
[Garnett et al., 2012] Garnett, R., Krishnamurthy, Y., Xiong, X., Schneider, J., and Mann, R. (2012). Bayesian optimal active search and surveying. In Proceedings of the International Conference on Machine Learning.
[Gustafsson et al., 2011] Gustafsson, C., Fransson, G., Morberg, Å., and Nordqvist, I. (2011). Teaching and learning in higher education: challenges and possibilities.
[Kovács, 1995] Kovács, I. (1995). Maturational windows and adult cortical plasticity, volume 24. Westview Press.
[MacKay, 2003] MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge: Cambridge University Press.
[Mann et al., 2011] Mann, R., Freeman, R., Osborne, M., Garnett, R., Armstrong, C., Meade, J., Biro, D., Guilford, T., and Roberts, S. (2011). Objectively identifying landmark use and predicting flight trajectories of the homing pigeon using Gaussian processes. Journal of The Royal Society Interface, 8(55):210–219.
[McKeachie and Svinicki, 2013] McKeachie, W. and Svinicki, M. (2013). McKeachie’s teaching tips. Cengage Learning.
[Rissanen, 1978] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5):465–471.
[Settles, 2010] Settles, B. (2010). Active learning literature survey. University of Wisconsin, Madison, 52:55–66.
[Simon, 2013] Simon, P. (2013). Too Big to Ignore: The Business Case for Big Data. John Wiley & Sons.
[Toppins, 1987] Toppins, A. D. (1987). Teaching students to teach themselves. College Teaching, 35(3):95–99.
[Wallace and Boulton, 1968] Wallace, C. S. and Boulton, D. M. (1968). An information measure for classification. The Computer Journal, 11(2):185–194.