We have so many aspirations for big data and evidence-based policy, but apparently a fatally limited capacity to see the obvious: voters were furious about immigration and the EU. Techniques exist to build better empirical evidence regarding issues that matter to citizens; we should use them or risk a repeat of the referendum.

Commentators from all over the spectrum believe that the leave vote represents not (only) a desire to leave the EU, but also the release of a tidal wave of pent-up anger. That anger is often presumed to be partly explained by stagnating living standards for large parts of the population. As the first audience question on the BBC’s Question Time program asked the panel: “Project Fear has failed, the peasants have revolted, after decades of ignoring the working class how does it feel to be punched in the nose?” The Daily Mail’s victorious front page said the “Quiet people of Britain rose up against an arrogant, out-of-touch, political class”. The message is not subtle.

Amazingly, until the vote, no one seemed to have known anything: markets and betting odds all suggested remain would win. Politicians, even those on the side of Leave, thought Brexit was unlikely. The man bankrolling the Brexit campaign lost a fortune betting that it wouldn’t actually happen (the only good news I’ve seen in days). Niall Ferguson was allegedly paid $500,000 to predict that the UK would remain.

This state of ignorance contrasts radically with what we do know about the country. We know, in finicky detail, the income of every person and company. We measure changes in price levels, productivity, house prices, interest rates, and employment. Detailed demographic and health data are available – we have a good idea of what people eat, how long they sleep for and where they shop; we even have detailed evidence about people’s sex lives.

Yet there seems to have been very little awareness of (or weight attached to) what the UK population itself was openly saying in large numbers.

Part of the reason must be that the government didn’t want to hear. Post crisis everything was refracted through the prism of TINA – There Is No Alternative. There was no money for anything, so why even think about it? Well, now we have an alternative.

The traditional method for registering frustration is obviously to vote – a channel which was jammed in the last election. Millions of people voted UKIP, or for the Green Party, and got one MP apiece: no influence for either point of view. A more proportional voting system is one well known idea, and I think an excellent one, but there are lots of other possibilities too.

What if there was a more structured way to report on citizens’ frustrations on a rolling basis? An Office for Budget Responsibility, but for national sentiment – preparing both statistical and qualitative reports that act as a radar for public anger. It would have to go beyond the existing ‘issue tracking’ polling to provide something more comprehensive and persuasive. Perhaps the data could be publicly announced with the same fanfare as quarterly GDP.

Consultative processes at the local level are much more advanced than at the national level. Here is some of the current thinking on the best ways to build a national ‘anger radar’, drawing on methods widely used at the local level.

Any such process faces the problem of ‘strategic behaviour’. If someone asks you your opinion on immigration, you might be tempted to pretend you are absolutely furious about it, even if you are only mildly piqued by the topic. Giving extreme answers might seem like the best way to advocate for the change you want to see. Such extreme responses could mask authentically important signals. Asking respondents to rank responses in order or assign monetary values to outcomes are classic ways to help mitigate strategic behaviour.

Strategic behaviour can also be avoided by looking at actions that are hard to fake. Economists refer to these as ‘revealed’ preferences – often revealed by the act of spending money on buying something. It’s awful to think about, but house prices might encode public opinions on immigration. If house prices are lower in areas of high immigration, it might reveal to us the extent to which citizens truly find it to be an issue. Any such analysis would have to use well established techniques for removing confounding factors, for example accounting for the fact that immigration might disproportionately be to areas with lower house prices anyway. This approach might not be relevant for the issues in the EU referendum, but might be important for other national policies. Do people pay more for a house which falls in the catchment of an academy school, for example? (More technical detail on all these approaches).

Social media is another source of data. Is the public discourse, as measured on Twitter or Facebook (if they allowed access to the data) increasingly mentioning immigration? What is the sentiment expressed in those discussions? Certainly a crude measure, but perhaps part of a wider analysis – and ultimately no cruder than the methods used to estimate inflation.

All these approaches are valuable because they tell us about ‘raw’ sentiment – what people believe before they are given a space to reflectively consider. ‘Raw’ views are important since they are the ones that determine how people will act, for example at a referendum.

But that is not enough on its own. As discussed in a previous post, good policy will also be informed by a knowledge of what people want when they have thought more deeply and have information that allows them to act in their own best interests. These kinds of views could be elicited using processes such as the RSA’s recently announced Citizens’ Economic Council, where 50-60 (presumably representative) citizens will be given time and resources to help them think deeply about economic issues of the day, and subsequently give their views to policy makers.

Delib, a company that provides digital democracy software, offers a budget simulator which achieves a similar goal. The affordances of the interface mean that users have to allocate a fixed budget between different options using sliders. In the process of providing a view, users intrinsically become aware of the various compromises that must be made, and deliver a more informed decision.

We live in a society where more data is available about citizens’ behaviour than ever before. As is widely discussed, that represents a privacy challenge that is still being understood. The same data represents an opportunity for governments to be responsive in new ways. Did the intelligence services know which way the vote would go using their clandestine monitoring of our private communications? Who knows.

We cannot predict everything: famously, a single Tunisian street vendor’s protest set off the whole of the Arab Spring. But we can see the contexts that make that kind of volatility possible, and I believe the anti-immigration context could easily have been detected in the run-up to the referendum.

There is no longer any reason for a referendum about the EU to become a channel for anger about tangentially related issues. The political class would not have been ‘punched on the nose’ if they were a little better at listening.

Hat tip: Thanks to the Delib Twitter account, which has been keeping track of the conversation about new kinds of democracy post Brexit, which I’ve used in this post.

Couple of notes from the Long Now Foundation health panel, both regarding how we aggregate and distribute knowledge.

Alison O’Mara-Eves (Senior Researcher in the Institute of Education at University College London) told us about the increasing difficulty of producing systematic reviews. Systematic reviews attempt to synthesise all the research on a particular topic into one view point: how much can you drink while pregnant, what interventions improve diabetes outcomes, etc. These reviews, such as the venerable Cochrane reviews, are struggling to sift through the increasing volumes of research to decide what actionable advice to give doctors and the public. The problem is getting worse as the rate of medical research increases (although more research is obviously a good thing in itself). We were told the research repository Web of Science indexes over 1 billion items of research. (I’m inclined to question what an item is, since there must be far fewer than 100 million scientists in the world, and most of them must have contributed fewer than 10 items; however, I take the point that there’s a lot of research.)

Alison sounded distinctly hesitant about using automation (such as machine learning) to assist in selecting papers to be included in a systematic review, as a way of making one of the steps of the process less burdensome. The problem is transparency: a systematic review ought to explain exactly what criteria it uses to include papers, so that those criteria can be interrogated by the public. That can be hard to do if an algorithm has played a part in the process. This problem is clearly going to have to be solved: research is no use if we can’t synthesise it into an actionable form. And it seems tractable – we already have IBM Watson delivering medical diagnoses, apparently better than a doctor. In any case, I’m sure current systematic reviews of medical papers are carried out using various databases’ search functions – who knows how those work or what malarkey those search algorithms might be up to in the background?

Mark Bale (Deputy Director in the Health Science and Bioethics Division at the Department of Health) was fascinating on the ethics of giving genetic data to the NHS, through their program the 100,000 Genomes Project. He described a case where a whole family who suffered with kidney complaints were treated due to one member having their genome sequenced, thus identifying a faulty genetic pathway. Good for that family, but potentially good for the NHS too – Mark described the possibility that quickly identifying the root cause of a chronic, hard-to-diagnose ailment through genetic sequencing might save money too.

But – what of the ethics? What happens if your genome is on the database and subsequent research indicates that you may be vulnerable to a particular disease – do you want to know? Can I turn up at the doctors with my 23andMe results? Can I take my data from the NHS and send it to 23andMe to get their analysis? What happens if the NHS decides a particular treatment is unethical and I go abroad to a more permissive regulatory climate? What happens if I have a very rare disease and refuse to be sequenced, is that fair on the other sufferers? What happens if I refuse to have my rare disease sequenced, but then decide I’d like to benefit from treatments developed through other people’s contributions? I’ll stop now…

To me, part of the answer is that patients are going to have to acquire – at least to some extent – a technical understanding of the underlying process so they can make informed decisions. If that isn’t possible, perhaps smaller representative groups of patients who receive higher levels of training can play into decisions. One answer that’s very ethically questionable from my perspective is to take an extremely precautionary approach. This would be a terrible example of the status quo bias: many lives would be needlessly lost if we decided to be overly cautious. There’s no “play it safe” option.

It’s interesting that with genomics the ethical issues are so immediate and visceral that they get properly considered, and have rightly become the key policy concern with this new technology. If only that happened for other new technologies…

The final question was whether humanity would still exist in 1000 years – much more in the spirit of the Long Now Foundation. Everyone agreed it would, at least from a medical perspective, so don’t worry.




Matt Biddulph (one of the Dopplr founders) is to blame. At least I think he’s the one that started the “Silicon Roundabout” name off. What did Larry Page say when someone told him Google were buying space at London’s Silicon Roundabout? Probably, “what’s a roundabout?”. Americans are so cut-and-thrust they don’t have roundabouts; roundabouts imply too much collaboration between drivers. At least Brighton’s Silicon Beach makes sense, in that sand is made of silicon (only, there’s no sand on Brighton beach). Anyway, the organisers of Silicon Milkroundabout can’t be blamed for perpetuating a silly name, or making it sillier by punning it with the idea of the university milk round.

In case you haven’t come across it, Silicon Milkroundabout is a job fair – startups (and mature companies) have stalls. On Saturday product managers went round the stalls and tried to find jobs; on Sunday developers did the same, and in much greater numbers. Having tried to hire developers, I can say that anything that makes finding them easier is a good thing.

I’ve never quite known what my job title ought to be, but it seems like I’m mostly in the product manager camp. So it was nice to meet a bunch of people who do the same thing and chat about our shared experiences. But I’m also a bit of a dev, so I went on Sunday too.

Aside from trying to find some work, it was an opportunity to see what kind of companies are growing. Distilling customer tastes from big data was definitely the standout theme. There were companies that mined data on previous purchases to discover what products you might like, others that looked at your results in personality tests, and others that looked across your social graph. The objective was either to serve better targeted adverts, or to customise a website to highlight the products that a particular user is most likely to buy.

I’m slightly inclined to question a fundamental assumption in all this: that I have a fixed propensity to purchase any given product, and that propensity can be discovered by looking at my behaviour, or my friends’ behaviour.

I have quite a vivid memory of going to a party where a guy with a massive beard and a comforting northern accent was playing music. Everything he put on was different, unknown to me and really good. Every time he put something on I asked what it was, and he told me some interesting things about the song and its context. He worked in a record shop, and if I was a customer I think I might have bought about 50% of the stuff he was playing.

Instead I got home and YouTubed most of it. It turned out that, listening off my laptop on my own, I liked much less of it – Yacht was the only band that really stuck with me. Even then, by playing perhaps 6 records he found one that was of genuine interest. He had a good conversion rate.  Obviously this is anecdotal, but there are two things that might be interesting:

  1. I felt “active” in the discovery process. I was at the right party speaking to the right guy to find these things out. I had exclusivity: if someone asked how I found out about Yacht I had a story to tell. Not a great story, but there was a real connection behind it. I would have dismissed exactly the same results if they’d appeared to me as automatically generated recommendations in a UI. In fact, I would probably have said they were stupid, because there was no personal investment in the selection process.
  2. No amount of looking through my previous purchases would have shown artists similar to Yacht. No amount of looking through my social graph would have shown that my friends liked Yacht. That was what made them a great discovery, I could say to my friends – “Hey, I found a cool thing”, and be reasonably sure I had new information.

Often I want website suggestion algorithms to fail, confirming what I like to think of as my unique and distinctive tastes.

Liking Yacht isn’t a deep-seated feature of my brain that could be discovered if you had enough data about me. It was something that happened when a guy that I thought was cool, but not too cool, told me about them. Meeting him in that context made it better.

I hate Amazon Books suggestions. Even if they could perfectly predict what I would have bought, as soon as I see the suggestions I change my mind. My reading is a deep part of my individuality; if it can be predicted by an Amazon algorithm then I feel obliged to switch it up a bit. Conversely, I’d be more than happy to have a film recommended to me: film taste isn’t something that’s particularly important to me.

I love the Hype Machine, a site that finds music from blogs. It understands that my musical taste is not going to be formed by a suggestion engine, but by what other identifiable humans have said. Each track is presented with a snippet from the blog it appeared in. I have to search for it – I’m an active participant. I discover the music, rather than it being suggested to me.

Obviously, suggestion systems do work, enough for Netflix to invest in a million dollar prize for anyone who could improve their algorithm by 10%. Small increases in conversion rate are worth a lot of money – I just wonder if they would work better if they took a deeper account of the social factor than crawling my Facebook friends. Or if you can generate the “meeting a guy at a party” moment on a website.

Turns out I’m not into Yacht that much anymore. I don’t want to be identified with the kind of people who “like” them on YouTube.



I’ve been to the Royal Society once before for an event about understanding risk, and I was surprised to see some of the same people at the Web Science conference. I’m envious that for some people the Royal Society is a way of life. Especially the man who wears two pairs of glasses at the same time and always asks questions from the perspective of torpedo design – I should say that, so far as I can understand them, his questions always appear to be pertinent.

You might reasonably ask what Web Science means; ironically, it’s a question that Google will not help you answer. I’m not sure there is a short answer, but there were strong and consistent links between the speakers so it definitely designates something. In terms of what university department Web Science belongs in, it seems to be something of a coalition of disciplines, mainly social science, network mathematics and computer science.

However you triangulate the location of Web Science, it’s in an area that I think is very exciting. I hope to have a tiny claim to have played some part in the area through having worked on the BBC’s Lab UK project, which uses the web as a social laboratory.

Despite the spectrum of intellectual backgrounds day one was remarkably focused. Other than to call it Web Science the only way I can think of to elucidate the commonality is to use the example of Jon Kleinberg’s talk, which seemed most neatly to encapsulate it. Here goes…

You may have heard of Stanley Milgram for the famous electric shock experiment, but he also did an investigation which gave prominence to the idea of ‘6 degrees of separation’. His ingenious method was to randomly send letters out which contained the name of a target person and a short description of that target (e.g. Jeff Adams, a Boston based lawyer). In the letter there were also instructions indicating that it should be forwarded to someone who might know the target, or know someone who might know someone who would know the target, etc.

Famously the letter will arrive at its target in six steps, on average, hence the frequently cited idea that you are six friendships away from everyone in the world (though his experiment was US based).

There’s a strikingly effective way to understand how it can be that the letter finds its destination. It involves imagining the balance between your local friends and your distant friends.

If you only had local friends then a letter would take a large number of steps to find a target individual in, say, Australia. The reason the average can be as low as six steps is that everyone has friends who live abroad, or in another part of the country, so the letter can cover long distances in big hops.

However, imagine all friendships were long distance. If I live in London and I want to get a letter to the lawyer in Boston then I’m going to have a problem. I could send the letter to a friend who lives in Boston, but then his friends are spread equally around the globe, just like mine are. So although the letter can travel great distances its course is so unpredictable that no one can tell which direction to forward the letter to get it nearer to its target.

It turns out that there is a specific ratio of long and short links which allows the notional letter to get to its destination in the shortest number of links.

This discovery came some time ago, but nobody could measure the actual ratio of short and long range friends that real people have. To measure it would require a list of millions of people, their location and the names of their friends. Cue Facebook…

Computer scientists have analysed the data on Facebook and it turns out that the actual ratio of short to long links is very close to the optimal ratio, in terms of getting that letter to its destination. That is, the mixture of distant contacts and local ones as indicated by the information on Facebook is almost exactly the right one to deliver the letter in the shortest number of links – six.

That’s pretty incredible, and of course it probably isn’t a coincidence. Social scientists posit that perhaps in some way people will their friendships to exhibit this distribution – after all, as we’ve just demonstrated in one sense it’s the most effective mode of linkage. Whatever the eventual explanation, it’s a fascinating insight into human behavior.
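The balance between local and distant friends can be played with in a toy simulation. This is only a sketch under loud assumptions, not the analysis from the talk: people sit on a one-dimensional ring rather than real geography, everyone knows their two neighbours plus one distant contact, and the chance of a contact at distance d is proportional to d to the power of minus r. The function names and the exponent `r` are mine, chosen for illustration. Each holder of the letter forwards it to whichever friend is closest to the target.

```python
import random


def ring_distance(a, b, n):
    """Shortest distance between two positions on a ring of n people."""
    d = abs(a - b)
    return min(d, n - d)


def build_contacts(n, r, rng):
    """Give every person one long-range contact (illustrative model).

    A contact at ring distance d is picked with weight d ** -r, so r = 0
    spreads contacts evenly around the ring and a large r keeps them local.
    """
    distances = list(range(1, n // 2 + 1))
    weights = [d ** -r for d in distances]
    contacts = []
    for person in range(n):
        d = rng.choices(distances, weights=weights)[0]
        direction = rng.choice((-1, 1))
        contacts.append((person + direction * d) % n)
    return contacts


def greedy_hops(source, target, n, contacts):
    """Forward the letter to whichever friend is nearest the target."""
    hops = 0
    current = source
    while current != target:
        friends = [(current - 1) % n, (current + 1) % n, contacts[current]]
        current = min(friends, key=lambda f: ring_distance(f, target, n))
        hops += 1
    return hops


def average_hops(n, r, trials, seed=0):
    """Average number of forwarding steps over random sender/target pairs."""
    rng = random.Random(seed)
    contacts = build_contacts(n, r, rng)
    total = 0
    for _ in range(trials):
        source, target = rng.randrange(n), rng.randrange(n)
        total += greedy_hops(source, target, n, contacts)
    return total / trials


if __name__ == "__main__":
    for r in (0, 1, 3):
        print(f"r={r}: {average_hops(1000, r, 200):.1f} hops on average")
```

Because a neighbour is always one step nearer the target, the letter always arrives; the interesting part is how the average hop count changes as `r` moves between all-distant and all-local friendships.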

Stepping back from the specifics of this argument, here is a perfect example of web science: mathematical theory posing a hypothesis (calculating the optimum ratio), computer science providing empirical evidence (working out the real world ratio), and then a social scientific search for explanation. It’s the combo of these three areas which seems to constitute the “new frontier” described in the title of the Royal Society event.

There are other configurations of the various disciplines. Jennifer Chayes, of Microsoft Research, pointed out that mathematicians like herself will study any kind of network for its intellectual beauty. She suggested that a very important role for social scientists was to pose meaningful real-world questions which mathematicians and computer scientists could then collaborate to answer.

The ‘web science approach’ has produced all kinds of exciting results. For example Albert-László Barabási (whose excellent book Bursts I can highly recommend) has used the data to discover that the web is a ‘rich get richer’ type of network, meaning that it has a distribution of a few highly connected websites (e.g. Google) and many less connected web pages (e.g. this one) – which it turns out makes it similar to many other types of network. It’s by using this kind of understanding of how the web grows naturally that Google can tell a potentially spammy website from a real one.
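A ‘rich get richer’ network is easy to grow in a few lines. The sketch below is a generic preferential attachment toy, not Barabási’s own code or data: each new page makes a single link (my simplification), and it picks its target with a chance proportional to how many links that target already has. The function name is hypothetical.

```python
import random
from collections import Counter


def grow_network(n_pages, seed=0):
    """Grow a toy web where well-linked pages attract more new links.

    Returns a Counter mapping each page number to its number of links.
    """
    rng = random.Random(seed)
    # Every link adds both of its ends to this list, so a page appears
    # once per link it has. Start with two pages linked to each other.
    link_ends = [0, 1]
    for new_page in range(2, n_pages):
        # Picking a random entry favours pages that appear often,
        # i.e. pages that already have many links.
        target = rng.choice(link_ends)
        link_ends.extend((new_page, target))
    return Counter(link_ends)


if __name__ == "__main__":
    links = grow_network(10_000)
    print("best connected pages:", links.most_common(5))
    print("pages with a single link:", sum(1 for c in links.values() if c == 1))
```

Running it should show a handful of heavily linked pages alongside a long tail of pages with only one link, which is the shape described above.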

A number of predictions flow from this work which I won’t go into here, but there are plenty of practical results coming out of his work. To prove this he showed a graph of citations for ‘network science’ papers which has peaked recently at 800 a year, compared with approximately 300 for the famous Lorenz attractor paper which more or less defined chaos theory, and even fewer for various other epochal chaos papers. That isn’t surprising: Barabási used examples from yeast proteins to human genomics in his talk – the work is much deeper and more widely applicable than just the web.

If you’re still thinking this research might have limited practical application then Robert May’s talk should convince you otherwise. He demonstrated that understanding of ecological networks has spilled over into modeling the extremely real subject of HIV transmission. One of the most ingenious ideas he brought up was that of giving a vaccine for an infectious disease to a population and asking them to administer it to a friend. That means the person with most friends gets the most vaccine. This is handy, because the person with the most friends is also the person most likely to spread the disease.
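The ‘give it to a friend’ trick can be checked on a tiny example. The friendship network below is invented purely for illustration, and the sketch assumes each person passes a single dose to one friend chosen at random; neither detail comes from the talk.

```python
from statistics import mean

# Hypothetical friendship network: Ann is the well-connected one.
FRIENDS = {
    "Ann": ["Bob", "Cat", "Dan", "Eve"],
    "Bob": ["Ann"],
    "Cat": ["Ann", "Dan"],
    "Dan": ["Ann", "Cat"],
    "Eve": ["Ann"],
}


def mean_friend_count(friends):
    """Average number of friends of a person picked at random."""
    return mean(len(own) for own in friends.values())


def mean_nominee_friend_count(friends):
    """Average number of friends of the friend a random person nominates."""
    return mean(
        mean(len(friends[f]) for f in own) for own in friends.values()
    )


def expected_doses(friends):
    """Expected doses each person receives if everyone passes on one dose."""
    doses = {person: 0.0 for person in friends}
    for own in friends.values():
        for friend in own:
            doses[friend] += 1 / len(own)
    return doses


if __name__ == "__main__":
    print("random person:", mean_friend_count(FRIENDS))
    print("nominated friend:", mean_nominee_friend_count(FRIENDS))
    print("expected doses:", expected_doses(FRIENDS))
```

In this made-up network the nominated friends have more friends on average than people picked at random, and Ann, the best connected person, can expect the most doses.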

There were so many other contributions that an exhaustive list of even the most exciting points would also be exhausting to read, so I’ll stop now. But it was an exciting event, not least for the fact that it’s a genuine intellectual frontier, but one that seems to be surprisingly easy to understand for people who don’t work in full time academia, at least in a broad sense.

What do we really think about music? I’ve tried to find some data about how people think about musical genres using the Last FM API.

Ishkur’s strangely compelling guide to electronic music is a map of the relationships between various kinds of music, and a perfect example of the incredibly complex genre structures that music builds up around itself. He lists eighteen different sub-genres of Detroit techno including gloomcore, which I suspect isn’t for me. I wanted to try and create a similar musical map using data from Last FM.

I’ve written a bit before about the way in which the web might change the development of genres – what I didn’t ask was how important the concept of genre would continue to be. It’s difficult to listen to music in a shop, so having a really good system of classification means you have to listen to fewer tracks before you find something you like. Also, in a shop you have to put the CD in a section, so it can only have one genre attributed to it.

But on the web it’s easy to listen to lots of 30 second samples of music, so arguably you don’t need to be so assiduous about categorisation. In addition, the fact that music doesn’t have to be physically located in any particular section of a shop also undermines the old system – one track can have two genres (or tags, in internet parlance).

Despite this, online music shops like Beatport still separate music into finely differentiated categories, much as you would find in a bricks and mortar record shop. But do they reflect the way people actually think about their musical tastes?

Interestingly, two of the most commonly used tags on Last FM are “seen live” and “female vocalist” (yes, women have been defined as “the other” again), which aren’t traditional genres at all. “Seen live” is obviously personal, and “female vocalist” isn’t a part of the normal lexicon. Looking through people’s tags other anomalies crop up – “music that makes me cry” and tags based on where a person intends to listen to the music are examples. The more obscure genres from Ishkur’s guide are lost in the noise of random tags that people have made for themselves. I would suggest gloomcore isn’t used in the functional way that ‘metal’ or ‘pop’ are. It’s a classification that people do not naturally use to denote a particular kind of music on Last FM – perhaps it’s a useful term for writing about music, but nobody thinks they’d like to stick on some gloomcore while they make breakfast.

I searched the Last FM database of top tags – the 5 tags most used by a user, and assumed that there was a link between any two genres that one person liked. For example, if you have ‘gothic’ and ‘industrial’ as top tags then I marked those two tags as linked. In the diagrams below I show the links that occurred between 1000 random Last FM users. If a link between two tags occurred more than about 15 times then it shows up on the diagram below.
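For anyone wanting to repeat the linking step, here is a minimal sketch of how it could be done. It is not the script I used: it assumes the top tags have already been fetched and are held as one list per user, the sample data is made up, and the function names and the `min_count` threshold parameter are hypothetical. It counts how often each pair of tags appears together and writes the frequent pairs out in .dot format.

```python
from collections import Counter
from itertools import combinations


def count_links(users_top_tags):
    """Count how many users have each pair of tags among their top tags."""
    links = Counter()
    for tags in users_top_tags:
        # Sorting makes ('indie', 'rock') and ('rock', 'indie') the same link.
        for pair in combinations(sorted(set(tags)), 2):
            links[pair] += 1
    return links


def to_dot(links, min_count):
    """Write links seen at least min_count times as an undirected .dot graph."""
    lines = ["graph tags {"]
    for (tag_a, tag_b), count in sorted(links.items()):
        if count >= min_count:
            lines.append(f'  "{tag_a}" -- "{tag_b}" [label={count}];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample standing in for real Last FM users.
    sample_users = [
        ["indie", "rock", "seen live"],
        ["rock", "indie", "electronic"],
        ["gothic", "industrial"],
    ]
    print(to_dot(count_links(sample_users), min_count=2))
```

With real data the threshold would be set to roughly 15, as described above, so that only the commonly shared pairings survive.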

Unsurprisingly, indie and rock are things that people often note they have seen live. By contrast, though people might talk of having heard electronic music ‘out’ (i.e. not at home), they don’t care enough about it to define a tag around it.

I was surprised to see tags such as ‘British’ and ‘German’, so I broke the above diagram down by country. Last FM has significant UK, German and Japanese user bases. Here is the result for Germany:


I think it’s very telling that while most of the connections are as you might expect, ‘black metal’ and ‘death metal’ are not connected to the main graph. I’m not particularly aware of these genres, but it certainly seems plausible they are very insular.

Here is the Japanese version:


Yep, plenty of references to Japan. The only nation to feature Jazz too. Here is the British version:

Lost in the noise: what we really think about musical genres

In Japan and Germany a defining feature of music is that it is Japanese or German. In Britain we don’t care. I suspect that’s because our musical tastes aren’t defined against a background of lyrics in a foreign language, as perhaps they are in the other two countries.

Last FM may well have a particular ‘subculture’ of user in each country, so it’s hard to draw any firm conclusions because of this potential skew. As with so many of the insights you can gain from data gleaned from the web, at the moment it’s only possible to tell that one day this kind of tool could be very revealing about our psychology – what it will reveal isn’t very clear yet.

None the less, it will be interesting to see how these diagrams evolve over time – perhaps they will gradually diverge from the old names we’ve used to identify music, or perhaps there will be less and less consensus about what genres are called.

Incidentally, this would have been a post about data from LinkedIn, looking at the way your profession affects the kind of friendship group you have, but the LinkedIn API is so restricted that I gave up.

The data is available below. It’s in the .dot format that creates these not very sexy spider diagrams.


I can provide a better version of this data if anyone wants it – send me a message.

Over the course of the General Election I recorded 1000 random tweets every hour and sent them to tweetsentiments.com for sentiment analysis.

Tweetsentiment have a service which gives one of three values to each tweet. ‘0’ means a negative sentiment (unhappy tweet), ‘2’ a neutral or undetermined sentiment and ‘4’ positive (happy tweet). Similar technology is used to detect levels of customer satisfaction at call centres by monitoring phone calls.

Obviously it’s difficult for a machine to detect the emotional meaning of a sentence, especially with the strange conventions used on Twitter. Despite this Tweetsentiment seems to be fairly reliable – tweets which express happy emotions tend to be rated as such, and vice versa. More accurately, if Tweetsentiment does make a classification it tends to get it right. Sometimes an obviously positive / negative tweet gets a ‘2’, but that shouldn’t affect things here.
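To turn the 0 / 2 / 4 scores into a single mood value per hour, one plausible approach is simply to average them. The sketch below assumes exactly that, and assumes the scored tweets are already available as (hour, score) pairs; it does not call the Tweetsentiment service, and the function name and hour labels are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

VALID_SCORES = (0, 2, 4)  # negative, neutral or undetermined, positive


def hourly_mood(scored_tweets):
    """Average the sentiment scores of the tweets sampled in each hour.

    scored_tweets is an iterable of (hour_label, score) pairs.
    """
    by_hour = defaultdict(list)
    for hour, score in scored_tweets:
        if score not in VALID_SCORES:
            raise ValueError(f"unexpected sentiment score: {score!r}")
        by_hour[hour].append(score)
    return {hour: mean(scores) for hour, scores in sorted(by_hour.items())}


if __name__ == "__main__":
    # Made-up sample: three tweets in one hour, two in the next.
    sample = [
        ("Thu 22:00", 0), ("Thu 22:00", 2), ("Thu 22:00", 4),
        ("Thu 23:00", 4), ("Thu 23:00", 2),
    ]
    print(hourly_mood(sample))
```

An hour with as many happy tweets as unhappy ones comes out at 2, so anything above 2 reads as a slightly positive mood.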

My hypothesis was that the Twitterati would be less happy if there was a Conservative victory. Of course I can’t prove that Twitter has a bias to the left, but I would presume that young, techy, early adopters are more likely to be left leaning. The reaction to the Jan Moir Stephen Gately article perhaps supports this.

David Cameron famously noted that Twitter is for twats, I wondered if Twitter would reciprocate…


The graph indicates that usually Twitter is just slightly positive, with a mood value of 2.1 on average. As predicted, as a Conservative victory becomes apparent on Thursday evening there is a decline in mood which lasts until Saturday lunchtime. Then everyone cheers up, presumably goes down the pub, and is pretty chirpy for Sunday lunch. Sentiment only returns to average for the beginning of work on Monday morning.

In short, it does look like the election result was a disappointment to Twitter.

Obviously we need to know what normal Twitter behaviour is over the course of the week to draw very much information from the graph, and this is something that I’m going to try and produce a graph for soon.

It does look as though the size of negative reaction to a once-a-decade change in government is about the same magnitude as the positive mood elicited by the prospect of Sunday lunch – which I think is fairly consistent with the vicissitudes of Twitter as I experienced them.

I used Twitter’s API to gather the data, and frankly, it’s not great, particularly if you want to get Tweets from the past. I was surprised to discover that any Tweets more than about 24 hours old simply disappear from the search function on Twitter.com – in effect they only exist in public for a day. For this reason the hourly sample size wasn’t always exactly 1000, but it was on average.

I’ll post again when I have some more data on normal behaviour. I’m also curious to find out if different countries have different average happiness levels on Twitter, but I think finding a Tweetsentiment-style service for other languages might prove difficult.

My last post used Wikipedia’s list of dates of births and deaths to build a timeline showing the lifespans of people who have pages on Wikipedia. There are a lot of people with Wikipedia pages, so I limited it to only include dead people.

That still leaves you with a lot of people to fit on one timeline, so I wanted to prioritise ‘important’ or ‘interesting’ people at the top and show only the most ‘important’ 1000. Some have been confused by my method for doing this, and others have questioned its validity, so this post will address both issues. I’m also going to suggest an improvement. It turns out that whatever I do Michael Jackson is more important than Jesus. I’m just the messenger.

Explaining the method
To get a measure of ‘importance’ I used work done by Stephan Dolan. He has developed a system for ranking Wikipedia pages which is very similar to the PageRank system which Google uses to prioritise its search results.

Wikipedia’s pages link to one another, and Stephan Dolan’s algorithm gives a measure of how well linked a particular page is to all the other Wikipedia pages. If we want to know how well linked the page about Charles Darwin is, the algorithm examines every other page in Wikipedia and works out how many links you would have to follow to get from that page to the Charles Darwin page using the shortest route.

For example, to get from Aldous Huxley to Charles Darwin takes two links, one from Aldous to Thomas Henry Huxley (Aldous’s grandfather) and then another to Darwin (TH Huxley famously defended evolution as a theory). Dolan’s method counts the clicks needed to reach the Charles Darwin page from every page in Wikipedia, and then takes the average. To get to Charles Darwin takes an average of 3.88 clicks from other Wikipedia pages.
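The idea can be sketched with a breadth-first search over a tiny, made-up link graph. This is only an illustration of the "average number of clicks" concept, not Stephan Dolan's actual code: the function name and the toy graph are hypothetical, and ignoring pages that cannot reach the target at all is an assumption.

```python
from collections import deque


def average_clicks(links, target):
    """Mean shortest-path length to `target` from every page that can reach it."""
    # Reverse the link graph so that a single breadth-first search starting
    # at the target finds the distance from every other page to it.
    incoming = {}
    for page, outgoing in links.items():
        for dest in outgoing:
            incoming.setdefault(dest, set()).add(page)

    distance = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        for source in incoming.get(page, ()):
            if source not in distance:
                distance[source] = distance[page] + 1
                queue.append(source)

    # Assumption: pages with no route to the target are simply left out.
    others = [d for page, d in distance.items() if page != target]
    if not others:
        raise ValueError("no other page links to the target")
    return sum(others) / len(others)


# Toy graph: each page maps to the pages it links to (invented for the example).
toy_links = {
    "Aldous Huxley": ["Thomas Henry Huxley"],
    "Thomas Henry Huxley": ["Charles Darwin"],
    "Evolution": ["Charles Darwin"],
    "Charles Darwin": ["Evolution"],
}
clicks = average_clicks(toy_links, "Charles Darwin")
```

In the toy graph two pages are one click away and one page is two clicks away, so the average is 4/3. On the real Wikipedia the same calculation would run over millions of pages.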

Similarly, Google shows pages that have many links pointing to them nearer the top in its search results.

This method works OK, but it could be better. For example Mircea Eliade ranks as the fifth most important dead person on Wikipedia, taking on average 3.78 clicks to find him. But Mircea Eliade is a Romanian historian of religion – hardly a household name. We can take this as a positive statement, perhaps Mircea Eliade is a figure of hitherto unrecognised importance and influence. On the other hand it seems impossible that he can be more ‘important’ than Darwin.

Testing the validity of the Dolan Index
I decided it would be interesting to compare what I’m going to call the Dolan index (the average number of clicks as described above) with two other metrics that could be construed as measuring the importance of a person. Before we do that, here is a graph of what the Dolan index of dead people on Wikipedia looks like.

The bottom axis shows the rank order of pages, from Pope John Paul II, who has the 275th highest Dolan index on Wikipedia, to Zi Pitcher, who comes 430900th in terms of Dolan index. It makes a very tidy log plot.

As I mentioned previously, the Dolan index is very similar to a Google PageRank, so let’s compare them.




The x axis is the same as the first graph, Wikipedia pages from highest to lowest Dolan index. A well linked page has a low Dolan index, but a high PageRank, so I used the reciprocal of PageRank for the y axis. I’ve also added a log best fit line.

There seems to be a reasonable correlation between Dolan index and PageRank: the first and second graphs have a similar shape.

PageRank is only given in integer values between 1-10 (realistically, all Wikipedia pages have a PageRank between 3-7), so I’ve smoothed the curve using a moving average.
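For anyone unfamiliar with the smoothing step, here is a minimal sketch of a centred moving average over integer PageRank values. The window size and the sample values are assumptions chosen for illustration; the window actually used for the graph isn't stated.

```python
def moving_average(values, window=5):
    """Centred moving average; the window shrinks at the two ends of the list."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Take up to `half` neighbours on each side of position i.
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical PageRank values for pages ordered by Dolan index.
pageranks = [7, 6, 6, 5, 6, 4, 5, 3]
smoothed = moving_average(pageranks, window=3)
```

The smoothed list has the same length as the input, but each point is now a fractional value, which is what turns a staircase of integers into a curve.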

This seems to lend some weight to the Dolan Index as a measure.

I’ve also made a comparison between the Dolan index and the number of results returned when searching for a person’s name (without quotes) in Google search. It should be noted that this number seems to be quite unstable – a search will give a slightly different number of results from one day to the next. I’ve used a log scale because of the range of results.



There is barely any correlation here, except at very low values of Dolan index. Despite this, it’s still possible for the number of Google results to be useful, as becomes apparent when trying to improve my measure of ‘importance’.

A suggestion for improvement
The problem with all the measures seems to be the noise inherent in the system. While Dolan Index, PageRank and number of Google results all provide a rough guide to ‘importance’ or ‘interest’ overall, each of them frequently gives unlikely results. How about using a mixture of all three? Here is a table comparing the top 25 dead people by Dolan index and using a hybrid measure of importance constructed from all three metrics.

Dolan index | Hybrid measure
Pope John Paul II | Michael Jackson
Michael Jackson | Jesus
John F. Kennedy | Ronald Reagan
Gerald Ford | Jimi Hendrix
Mircea Eliade | Abraham Lincoln
Peter Jennings | Adolf Hitler
John Lennon | Albert Einstein
Adolf Hitler | William Shakespeare
Harry S. Truman | Charles Darwin
Ronald Reagan | Oscar Wilde
J. R. R. Tolkien | Woodrow Wilson
James Brown | Isaac Newton
Anthony Burgess | Elvis Presley
Elvis Presley | Walt Disney
Christopher Reeve | John Lennon
Susan Oliver | George Washington
Franklin D. Roosevelt | John F. Kennedy
Winston Churchill | Timur
Ernest Hemingway | Martin Luther
Theodore Roosevelt | Voltaire


To get the hybrid measure I just messed around until things felt right. Here is the formula I came up with:


Hybrid measure = ((1 / Dolan index) × 20) + (PageRank × 0.6) + (log(number of results) × 0.6)

For some reason additive formulas give better results than multiplicative ones.
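Written as code, the hybrid measure is a one-liner. This sketch assumes the log is base 10, which the formula doesn't specify, and the input values in the example are invented rather than taken from any real page.

```python
import math


def hybrid_measure(dolan_index, pagerank, num_results):
    """Weighted sum of the three metrics, following the formula in the post."""
    # Assumption: base-10 logarithm; the original formula only says "log".
    return (
        (1 / dolan_index) * 20
        + pagerank * 0.6
        + math.log10(num_results) * 0.6
    )


# Hypothetical inputs: Dolan index 4.0, PageRank 6, one million Google results.
score = hybrid_measure(4.0, 6, 1_000_000)
```

With these made-up inputs the three terms contribute 5.0, 3.6 and 3.6, which shows how the weights keep each metric on a roughly comparable scale.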

Using the hybrid measure seems to have removed the surprises (like Peter Jennings) although you might still argue that Oscar Wilde or Jimi Hendrix are much too high. Michael Jackson comes out as bigger than Jesus, but then he is an exceptionally famous person, and he died much more recently than Jesus. Timur (AKA Tamerlane) is a bit of a curiosity.

I considered ignoring the number of Google results because it’s such a noisy dataset. However, it’s the only reason that Jesus appears at all in this list: he gets a very low ranking (4.01) from the Dolan index. Any formula which brings Jesus out on top (which I think you could make a reasonable case for his deserving, at least over Michael Jackson!) gives all kinds of strange results elsewhere.

I am a bit suspicious of the “number of Google results” metric. In addition to its volatility, the number of results fails to take into account that occurrences of words such as “Newtonian” should probably count towards Newton’s ranking, while people called David Mitchell benefit artificially from the fact that at least two famous people share the name.

Any further investigation would have to consider what made a person ‘important’ – would it simply be how prominent they are in the minds of people (Michael Jackson and Jimi Hendrix) or would it reflect how influential they were (Charles Darwin for example, or the notably absent Karl Marx)?

I love the idea that the web reflects the collective consciousness, a kind of super-brain aggregation of human knowledge.

Just this week the idea of reflecting the whole of reality in one enormous computer system was promoted by Dirk Helbing, although my formula doesn’t rate him as very important, so I’m unsure as to how seriously to take this.

DBpedia mashup: the most important dead people according to Wikipedia

The timeline below shows the names of dead people and their lifespans, as retrieved from Wikipedia. They are arranged so that people nearer the top are the best linked in on Wikipedia, as measured by the average number of clicks it would take to get from any Wikipedia page to the page of the person in question.

I had imagined that Wikipedia ‘linkedin-ness’ would serve as a proxy for celebrity, which it kind of does – but only in a loose way.

Values range from 3.72 (at the top) to 4.04 (at the bottom). This means that if you were to navigate from a large number of Wikipedia pages, using only internal Wikipedia links, it would take you, on average, 3.72 clicks to get to Pope John Paul II. This data set was made by Stephan Dolan, who explains the concept better than me. Basically, it’s the 6 degrees of Kevin Bacon on Wikipedia.

I looped through the data set and queried DBpedia to see if the Wikipedia article was about a person, and if so retrieved their dates of birth and death.
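A lookup of this kind could be sketched as below. This is not the code used for the timeline: the function names are hypothetical, and the query assumes DBpedia's public SPARQL endpoint and the `dbo:Person`, `dbo:birthDate` and `dbo:deathDate` terms from the current DBpedia ontology, which may differ from what was available when the timeline was built.

```python
import json
import urllib.parse
import urllib.request

# Assumption: DBpedia's public SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_person_query(article_title):
    """Build a SPARQL query that matches only if the article is about a person."""
    resource = "http://dbpedia.org/resource/" + article_title.replace(" ", "_")
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?birth ?death WHERE {\n"
        f"  <{resource}> a dbo:Person ;\n"
        "      dbo:birthDate ?birth ;\n"
        "      dbo:deathDate ?death .\n"
        "}"
    )


def fetch_lifespan(article_title):
    """Return (birth, death) strings, or None if there is no matching person."""
    params = urllib.parse.urlencode({
        "query": build_person_query(article_title),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{DBPEDIA_ENDPOINT}?{params}") as response:
        bindings = json.load(response)["results"]["bindings"]
    if not bindings:
        return None
    return bindings[0]["birth"]["value"], bindings[0]["death"]["value"]


query = build_person_query("Charles Darwin")
```

Calling `fetch_lifespan("Charles Darwin")` needs network access. A page that isn't about a person, or a person with no recorded death date, returns no rows, which is one way to filter the list down to dead people.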

The timeline does show a certain amnesia on the part of Wikipedia: Shakespeare and Newton are absent, while Romanian historian of religion Mircea Eliade comes 5th. If I had included people who are alive, tennis players would have dominated the list (I don’t know why) – Billie Jean King is the second best-linked article on Wikipedia, one ahead of the USA (the UK is number one!).

Any mistakes (I have seen some) are due to the sketchiness of the DBpedia data, though I can’t rule out having made some mistakes myself…

The results are limited to the top 1000, and they only go back to 1650. Almost no names previous to 1650 appeared, the exceptions being Jesus (who was still miles down) and Guy Fawkes.

In case you were wondering ‘Who’s Saul Bellow below?’, the answer is Rudolph Hess.