Zipf’s Law

In natural language, a word's frequency is roughly inversely proportional to its rank in the frequency table: the most frequent word occurs about twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
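As a rough sketch of the idealized rule, the snippet below computes the share of occurrences predicted for each rank when frequency is taken to be exactly proportional to 1/rank. The function name zipf_shares and the vocabulary size of 50,000 are illustrative assumptions; real text follows the pattern only approximately.

```python
def zipf_shares(vocab_size):
    """Return the predicted share of occurrences for ranks 1..vocab_size,
    assuming frequency is exactly proportional to 1/rank (an idealization)."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)  # normalizing constant (a harmonic number)
    return [w / total for w in weights]


shares = zipf_shares(50_000)  # hypothetical vocabulary size

# Under this idealization, rank 1 is twice as frequent as rank 2
# and three times as frequent as rank 3.
print(shares[0] / shares[1], shares[0] / shares[2])
```

The shares sum to 1 by construction, so each value can be read as a predicted fraction of all word occurrences.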

In the Brown Corpus, a text collection of about a million words, the most frequent word, "the", accounts for nearly 7% of all word occurrences, and the second most frequent, "of", accounts for about 3.5%. A mere 135 vocabulary items account for half the corpus, and about half of the total vocabulary of about 50,000 words are hapax legomena, words that occur only once.
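The same kinds of statistics can be gathered for any text. The following is a minimal sketch, not the method used to compile the Brown Corpus figures: the helper names rank_frequency and summarize, the crude regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def summarize(text):
    """Report token count, vocabulary size, hapax count, and how many
    top-ranked word types are needed to cover half of all tokens."""
    ranked = rank_frequency(text)
    total = sum(count for _, count in ranked)
    hapaxes = sum(1 for _, count in ranked if count == 1)
    covered, types_for_half = 0, 0
    for _, count in ranked:
        if covered * 2 >= total:
            break
        covered += count
        types_for_half += 1
    return {
        "tokens": total,
        "types": len(ranked),
        "hapaxes": hapaxes,
        "types_for_half": types_for_half,
    }


sample = "the cat and the dog and the bird saw the fox"
print(summarize(sample))
# {'tokens': 11, 'types': 7, 'hapaxes': 5, 'types_for_half': 2}
```

Even in this toy sentence, two word types cover more than half of the tokens while most of the vocabulary occurs only once, which is the shape the Brown Corpus figures describe at scale.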

Similar distributions are found in data throughout the physical and social sciences. The law is named after the American linguist George Kingsley Zipf.