Given a large set of data (bank accounts, river lengths, populations, etc) what is the probability that the first non-zero digit is a one? My first thought was that it would be 1/9. There are nine non-zero numbers to choose from and they should be uniformly distributed, right? Turns out that for almost all data sets naturally collected, this is not the case. In most cases, one occurs as the first digit most frequently, then two, then three, etc. That this seemingly paradoxical result should be the case is the essence of Benford's Law. Benford's Law [1] states that for most real-life lists of data, the first significant digit in the data is distributed in a specific way, namely: $$ P(d) = \mbox{log}{10}\left(1 + \frac{1}{d}\right) $$ The probabilities for leading digits are roughly P(1) = 0.30, P(2) = 0.18, P(3) = 0.12, P(4) = 0.10, P(5) = 0.08, P(6) = 0.07, P(7) = 0.06, P(8) = 0.05, P(9) = 0.04. So we would expect the first significant digit to be a one almost 30% of the time! But where would such a distribution come from? Well, it turns out that it comes from a distribution that is logarithmically uniform. We can map the interval [1,10) to the interval [0,1) by just taking a logarithm (base ten). These logarithms are then distributed uniformly on the interval [0,1). We can now get some grasp for why one should occur as the first digit more often in a uniform log distribution. In the figure below, I have plotted 1-10 on a logarithm scale. In a uniform log distribution, a given point is equally likely to be found anywhere on the line. So the probability of getting any particular first digit is just its length along that line. Clearly, the intervals get smaller as the numbers get bigger. But we can quantify this, too. For a first digit on the interval [1,10), the probability that the first digit is d is given by: $$ P(d) = \frac{\mbox{log}(d+1) -\mbox{log}{10}(d)}{\mbox{log}(10) -\mbox{log}{10}(1)} $$ which is just $$ P(d) =\mbox{log}(d+1) -\mbox{log}{10}(d) $$ or $$ P(d) = \mbox{log}\left( 1 + \frac{1}{d} \right) $$ which is the distribution of Benford's Law. So how well do different data sets follow Benford's Law? I decided to test it out on a couple easily available data sets: pulsar periods, U.S. city populations, U.S. county sizes and masses of plant genomes. Let's start first with pulsar periods. I took 1875 pulsar periods from the ATNF Pulsar Database (found here). The results are plotted below. The bars represent the fraction of numbers that start with a given digit and the red dots are the fractions predicted by Benford's Law. From this plot, we see that the pulsar period data shows the general trend of Benford's Law, but not exactly. Now let's try U.S. city populations. This data was taken from the U.S. census bureau from the 2009 census and contains population data for over 81,000 U.S. cities. We see from the chart below that there is a near exact correspondence between the observed first-digit distribution and Benford's Law. Also from the U.S. census bureau, I got the data for the land area of over 3000 U.S. counties. These data also conform fairly well to Benford's Law. Finally, I found this neat website that catalogs the genome masses of over 2000 different species of plants. I'm not totally sure why they do this, but it provided a ton of easy-to-access data, so why not? Neat, so we see that wide variety of natural data follow Benford's Law (some more examples here). But why should they? Well, as far as I have gathered, there are a few reasons for this. The first two come from a paper published by Jeff Boyle [2]. Boyle makes (and proves) two claims about this distribution. First, he claims that "the log distribution [Benford's Law] is the limiting distribution when random variables are repeatedly multiplied, divided, or raised to integer powers." Second, he claims that once such a distribution is achieved, it "persists under all further multiplications, divisions and raising to integer powers." Since most data we accumulate (scientific, financial, gambling,...) is the result of many mathematical operations, we would expect that they would tend towards the logarithmic distribution as described by Boyle. Another reason for why natural data should fit Benford's Law is given by Roger Pinkham (in this paper). Pinkham proves that"the only distribution for the first significant digits which is invariant under scale change of the underlying distribution" is Benford's Law. This means that if we have some data, say the lengths of rivers in feet, it will have some distribution in the first digit. If we require that this distribution remain the same under unit conversion (to meters, yards, cubits, ... ), the only distribution that satisfies this distribution would be the uniform logarithmic distribution of Benford's Law. This "scale-invariant" rationale for this first digit law is probably the most important when it comes to data that we actually measure. If we find some distribution for the first digit, we would like it to be the same no matter what units we have used. But this should also be really easy to test. The county size data used above was given in square miles, so let's try some new units. First, we can try square kilometers. Slightly different than square miles, but still a very good fit. Now how about square furlongs? Neat! Seems like the distribution holds true regardless of the units we have used. So it seems like a wide range of data satisfy Benford's Law. But is this useful in any way or is it just a statistical curiosity? Well, it's mainly just a curiosity. But people have found some pretty neat applications. One field in which it has found use is Forensic Accounting, which I can only assume is a totally rad bunch of accountants that dramatically remove sunglasses as they go over tax returns. Since certain types of financial data (for example, see here) should follow Benford's Law, inconsistencies in financial returns can be found if the data is faked or manipulated in any way. Moral of the story: If you're going to cook the books, remember Benford! [1] Benford's Law, in the great tradition of Stigler's Law, was discovered by Simon Newcomb. [2] Paper can be found here. Unfortunately, this is only a preview as the full version isn't publicly available without a library license. The two points that I use from this paper are at least stated in this preview.

Comments