What Is Zipf’s Law?
Alright, we’re not going to lie, “Zipf’s Law” sounds like something you got after running your hand up and down your keyboard. But no, that’s not it. Zipf’s law is actually a really weird quirk of language that has helped shed light on some other really strange aspects of human society. It’s one of those coincidences-aren’t-coincidences-but-also-are-coincidences types of deals. So, what is Zipf’s law, anyway? How does it work?
Here’s a fun fact: Despite the law being named after George Kingsley Zipf, Zipf himself never claimed ownership over the discovery. Jean-Baptiste Estoup of France and Felix Auerbach of Germany were the first to note the statistics that would become Zipf’s law.
If you were to look up the top 10 words in the English language, you’d get the following list.
It’s probably a little difficult to construct a sentence out of these, but hey. Statistically, someone’s probably tried… and failed? Here’s a list of the 3,000 most common words (unranked) if you feel like making some English sentences. But back to our list above. It turns out that these 10 words make up about 25% of all the words used in English. Heck, the word “the” alone makes up about 7% of the words in the Brown Corpus. The corpus itself is a compilation of roughly a million words of American English text published in 1961, so the words that appear most frequently there are a little different than in English as a whole.
The top 100 words (including these 10) make up about 50% of what’s used in English.
That in itself may seem a little funky, but there’s even more to it. If you were to rank every word in the English language by frequency, you’d find that the second most common word appears roughly half as often as the most common. The third appears roughly a third as often, and so on. What’s more, you’d find roughly the same pattern within individual English texts, like a book or an article. The most commonly used word in that book will be used about twice as often as the second most common word.
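If you want to try the rank-and-frequency idea on a text of your own, here is a minimal Python sketch. The function name `zipf_table` and the tiny sample sentence are made up for illustration, and the word splitting is deliberately crude (lowercase letters and apostrophes only), so treat it as a rough check rather than a proper linguistic analysis.

```python
import re
from collections import Counter


def zipf_table(text, top=5):
    """Return (rank, word, count, predicted) rows for the most common words.

    `predicted` is what the simplest form of Zipf's law would expect:
    the count of the top word divided by the rank.
    """
    # Crude tokenizer (an assumption): runs of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    # Made-up sample sentence; a real test needs a much longer text.
    sample = "the cat and the dog and the bird saw the fox"
    for rank, word, count, predicted in zipf_table(sample, top=3):
        print(f"{rank}. {word!r}: seen {count}, Zipf predicts about {predicted:.1f}")
```

A sentence this short is only there to show the mechanics. Feed in a whole book and compare the "seen" and "predicts" columns to see how closely the text follows the pattern.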
Skew In Zipf’s Law & More Stats
Unfortunately, not everything is perfect, and Zipf’s law is actually quite strict. Taken literally, the formula holds in full for only about 15% of English texts. That’s where we get the Zipf-Mandelbrot law, which generalizes Zipf’s law with extra parameters so it fits real data a little better. Anyway, let’s have some fun with Zipf’s law. We took the American National Corpus and found that the word “the” occurred 1,204,816 times. The word “of” occurred 606,545 times. Divide the second count by the first and you get about 50.3%, almost exactly the one-half that Zipf’s law predicts.
And it turns out Zipf’s law isn’t just an English thing. It seems to show up in pretty much every language that has been studied, even extinct languages we haven’t translated yet!
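The arithmetic on those two counts is easy to check yourself. The sketch below also writes out the usual form of the Zipf-Mandelbrot law, where frequency is proportional to 1 / (rank + q)^s. The function name `zipf_mandelbrot_weight` and the default values are our own choices for illustration; with s = 1 and q = 0 the formula reduces to plain Zipf’s law. The counts are the ones quoted above, taken as ranks 1 and 2.

```python
def zipf_mandelbrot_weight(rank, s=1.0, q=0.0):
    """Relative frequency of the word at `rank` under Zipf-Mandelbrot.

    Frequency is proportional to 1 / (rank + q) ** s.
    With s=1 and q=0 this is plain Zipf's law (1 / rank).
    """
    return 1.0 / (rank + q) ** s


# Counts quoted in the article for the American National Corpus.
the_count = 1_204_816
of_count = 606_545

observed_ratio = of_count / the_count
predicted_ratio = zipf_mandelbrot_weight(2) / zipf_mandelbrot_weight(1)

if __name__ == "__main__":
    print(f"observed:  {observed_ratio:.4f}")
    print(f"predicted: {predicted_ratio:.4f}")
```

Changing `s` and `q` shifts the predicted curve, which is the extra wiggle room that lets Zipf-Mandelbrot fit texts that plain Zipf’s law misses.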
Zipf’s Law Outside of Language
Zipf’s law can also be applied to city populations, solar flares, earthquakes, and more!
Let’s go with city populations in the USA. New York sits at a population estimate of 8,336,817 as of 2019. Second place? Los Angeles with 3,979,576. LA’s population is about 48% of New York’s. Alright, you might be skeptical. So let’s go to Chicago, in third place at 2,693,976. That population is 32% of New York’s, and remember that 3rd place is supposed to be about 1/3 of the highest population. Yay, stats!
Linguistics is fun, so let’s look at who speaks what the most here.
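Here is the same city comparison as a short Python sketch, using only the three 2019 estimates quoted above. The function name `compare_to_zipf` is made up for this example. It ranks the cities by population and puts each one’s share of the largest city next to the 1/rank share that Zipf’s law would expect.

```python
# 2019 population estimates quoted in the article.
populations = {
    "New York": 8_336_817,
    "Los Angeles": 3_979_576,
    "Chicago": 2_693_976,
}


def compare_to_zipf(populations):
    """Return (rank, city, observed_share, expected_share) rows.

    observed_share is the city's population divided by the largest city's;
    expected_share is 1 / rank, the simplest Zipf prediction.
    """
    ranked = sorted(populations.items(), key=lambda item: item[1], reverse=True)
    largest = ranked[0][1]
    return [
        (rank, city, pop / largest, 1 / rank)
        for rank, (city, pop) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    for rank, city, observed, expected in compare_to_zipf(populations):
        print(f"{rank}. {city}: {observed:.0%} of the largest (Zipf expects {expected:.0%})")
```

Three cities is a tiny sample, so this shows how to do the comparison, not proof that the pattern holds all the way down the list.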