Thursday, June 18, 2009

Benford's law and The Commons
A modest proposal

Have you ever tried to make up numbers? I mean, to write a series of random numbers on a bit of paper? It's a problem faced by fraudsters: how to fabricate a series of plausible data. 

Consider the digits. I had always supposed that, in a given series of numbers, the likelihood of a number starting with a 1, a 2, a 3, ..., or a 9 would be about equal (for a large enough sample of numbers), at ~11% each.

I supposed erroneously. 

Actually, for a surprisingly wide range of data types (from the lengths of rivers in an atlas to amounts of money in an account), numbers beginning with a '1' are much more common, appearing about 30% of the time. Numbers beginning with a '2' occur about 18% of the time, and higher digits appear with decreasing frequency. This is called Benford's law.
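
These percentages come from a simple formula: under Benford's law, the chance that a number's first digit is d is log10(1 + 1/d). A minimal Python sketch that tabulates the expected shares (the function name is purely illustrative):

```python
import math

def benford_first_digit(d: int) -> float:
    """Expected share of numbers whose leading digit is d (1-9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Tabulate the expected share for each possible leading digit.
for d in range(1, 10):
    print(d, f"{benford_first_digit(d):.1%}")
```

The nine shares sum to 1, with the digit 1 taking roughly 30% and the digit 9 under 5%.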

This remarkable result has been put to use in fraud detection, by using it to pick out suspicious sets of accounts for further investigation.

Now, there are a few caveats: it doesn't work for all data (sequentially assigned numbers like bank account numbers, for instance, don't follow it). However, the MPs' expenses scandal (via the Guardian data blog) provides a natural test for this, as MPs' expenses are all either:

(a) amounts over £250, which had to be supported by a receipt, or

(b) amounts under £250, which didn't. 

Will both sets of figures follow Benford's law? Will neither? Or will most sets of expenses (a) follow the law, while the un-receipted expenses (b) won't? This might raise some interesting questions. 
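
One way such a comparison might be run, sketched in Python: tally the first digits of each set of claims and compute Pearson's chi-square statistic against the Benford shares. The function names here are hypothetical, and the 5% threshold is just a conventional choice; 15.507 is the standard critical value for 8 degrees of freedom. The test only makes sense for a reasonably large set of claims.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of freedom
# (nine possible first digits, minus one).
CRITICAL_5_PERCENT = 15.507

def first_digit(amount: float) -> int:
    """Leading non-zero digit of a positive amount, e.g. 2.99 -> 2, 250 -> 2."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    # Scientific notation always puts the leading digit first: 250 -> '2.50...e+02'.
    return int(f"{amount:.15e}"[0])

def benford_chi_square(amounts) -> float:
    """Pearson chi-square statistic of observed first digits against Benford's law."""
    counts = Counter(first_digit(a) for a in amounts)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one amount")
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic

def looks_suspicious(amounts) -> bool:
    """True if the first digits depart from Benford's law at the 5% level."""
    return benford_chi_square(amounts) > CRITICAL_5_PERCENT
```

Running `looks_suspicious` separately on the receipted and the un-receipted claims would give a first, rough answer to the questions above. A large statistic flags a set for a closer look; it is not proof of anything.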


pj said...

Oh that sounds fun! I guess you'd predict that expenses over £250 would tend to follow the law with the caveat that you've excluded all data below £250. On the other hand, figures under £250 would, in theory, follow the law (with the caveat) but will actually tend to overrepresent larger numbers because the thieving MPs will have tried to push their expenses claims up against the £250 limit.

LemmusLemmus said...

Couldn't the problem of the limit introducing additional bias be dealt with by only looking at stuff below 100 pounds? What about second digits only?

A sociologist called Andreas Diekmann once reported some fun with Benford's law: First he asked students to make up regression coefficients so that they fit some theory. Sure enough, they looked fraudulent. He then looked at the regressions from one volume of the American Journal of Sociology and found no evidence of fraud.

Political Scientist said...

Email from my mate Ed:

"Rather than go through the rigmarole of setting up a Blogger account to comment, I thought I'd go straight to source. A test of Benford's law would be quite fun, though how will your statistics cope with Mickey Fab's spectacularly constant ACA claims? Having said that, I have long suspected that Lichfield is in something of a timewarp..."

Will respond to everyone, later today.

Political Scientist said...

"with the caveat that you've excluded all data below £250."

I'm not sure what the arbitrary limit will do. On the one hand, the scale-invariant nature of the distribution might mean it isn't affected by the cut-off; on the other, the cut-off excludes numbers like 350, 450 etc. but not 150, 250. I think the best bet is to find someone who has submitted receipts for amounts under 250 (I think there must be some of these, because the Torygraph's been running short comedy excerpts from them, e.g. 2.99 for a dog bowl). This would allow us to establish whether it's reasonable to expect a Benford distribution for (demonstrably) honest claims. Alternatively, LemmusLemmus's sub-hundred-quid cut-off is the way to go, assuming there are enough items <100.
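
The effect of the cut-off can be explored with a toy calculation. The sketch below assumes, purely for illustration, that honest claims are spread evenly on a log scale between £1 and £10,000 (an idealisation that follows Benford's law by construction, not real expenses data), and then compares first-digit shares below £250 and below £100. All names are hypothetical.

```python
import math
from collections import Counter

def first_digit_shares(values):
    """Share of values starting with each digit 1-9."""
    # Scientific notation puts the leading digit first: 250 -> '2.50...e+02'.
    counts = Counter(int(f"{v:.15e}"[0]) for v in values)
    n = sum(counts.values())
    return {d: counts[d] / n for d in range(1, 10)}

# ASSUMPTION: idealised "honest" amounts, evenly spaced on a log scale
# from 1 up to (but not including) 10,000.
N = 40_000
amounts = [10 ** (4 * k / N) for k in range(N)]

shares_all = first_digit_shares(amounts)
shares_under_250 = first_digit_shares([a for a in amounts if a < 250])
shares_under_100 = first_digit_shares([a for a in amounts if a < 100])

for d in range(1, 10):
    print(d,
          f"all: {shares_all[d]:.1%}",
          f"<250: {shares_under_250[d]:.1%}",
          f"<100: {shares_under_100[d]:.1%}")
```

Under this assumption the sub-£100 set keeps the Benford shares, because it covers whole powers of ten, while the sub-£250 set over-represents leading 1s (about 38% rather than 30%) even with no dishonesty at all. That is an argument for the sub-hundred-quid cut-off, or for comparing against a baseline computed with the same limit.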

"What about second digits only?" There is a Benford's law for second digits, which certainly could be used to check the distribution. In the paper on forensic accounting, they discuss using this as a test for rounding (which affects the 2nd and 3rd digits more than the 1st). Good to hear sociologists look at the sociology of sociology as well!

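For reference, the second-digit version of the law follows from the first-digit one by summing over every possible first digit: P(second digit = d) is the sum, for first digits 1 to 9, of log10(1 + 1/(10 × first + d)). A minimal Python sketch, with an illustrative function name:

```python
import math

def benford_second_digit(d: int) -> float:
    """Expected share of numbers whose second digit is d (0-9) under Benford's law."""
    if not 0 <= d <= 9:
        raise ValueError("second digit must be between 0 and 9")
    # Sum over every possible first digit 1-9.
    return sum(math.log10(1 + 1 / (10 * first + d)) for first in range(1, 10))

for d in range(10):
    print(d, f"{benford_second_digit(d):.1%}")
```

The second-digit distribution is much flatter than the first-digit one, running from about 12% for 0 down to about 8.5% for 9, so it needs more data to show a departure.
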
" how will your statistics cope with Mickey Fab's spectacularly constant ACA claims?"

There are a few things MPs can just claim for without accounting for them, e.g. 250 quid a month petty cash (IIRC Alan Duncan is one of those). As long as we stick to physical items (ignoring petty cash etc.), I think it shouldn't affect the frequency distribution. (I also need to remind everyone that George Osborne charged to expenses the cost of 2 DVDs of his speech "value for taxpayers' money". Some people think this has stopped being funny. They are mistaken.)

It also occurs to me, following on from pj's comments above, that it might be interesting to compare the mean unreceipted expense with the mean receipted expense (for sub-250 quid expenses). Would this show a tendency to inflate the claims with no receipt?