Ecologists often have to estimate the number of unseen species in an ecosystem: If I count x species of butterfly during my time on an island, how many species probably live there that I did not see? In 1975, Stanford statisticians Bradley Efron and Ronald Thisted applied the same question to the works of William Shakespeare: If we take the Bard’s existing works as a sample, what can we infer about the size of his total vocabulary?
Shakespeare’s known works comprise 884,647 words, which fall into 31,534 “types,” or distinguishable arrangements of letters. Efron and Thisted applied two approaches and found that they produced the same estimate: If a new cache of the playwright’s works were discovered today, equal in size to the old, it would likely contain about 11,460 new word types, with an expected error of less than 150.
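For a feel of how such a forecast can be computed, here is a minimal sketch of one standard nonparametric estimator for this kind of question, the Good-Toulmin alternating series. The post does not say which two approaches were used, so treating this as representative of the paper's method is an assumption. The function name `good_toulmin_new_types` and the toy sample are hypothetical and chosen only for illustration; none of the figures above are reproduced here.

```python
from collections import Counter


def good_toulmin_new_types(tokens, t=1.0):
    """Estimate how many unseen types a further sample would contain.

    `t` is the size of the further sample relative to the original
    (t=1.0 means "a new cache equal in size to the old").
    Sketch only: the series is known to be unstable for t > 1.
    """
    # How many times each type occurs in the observed sample.
    type_counts = Counter(tokens)
    # n_x: how many types occur exactly x times.
    freq_of_freq = Counter(type_counts.values())
    # Alternating series: n_1*t - n_2*t^2 + n_3*t^3 - ...
    return sum((-1) ** (x + 1) * (t ** x) * n_x
               for x, n_x in freq_of_freq.items())


# Toy sample (hypothetical data, not the real Shakespeare counts).
sample = "to be or not to be that is the question".split()

# Six types occur once and two occur twice, so the estimate is 6 - 2 = 4.
estimate = good_toulmin_new_types(sample)
print(estimate)
```

The intuition is that words seen only once carry most of the information about what has not been seen yet: a sample full of one-off words suggests that many more are waiting.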
So how many word types altogether did Shakespeare know? No upper bound is possible, but Efron and Thisted established a lower bound of 35,000 unused word types beyond the 31,534 that appear in his works. In other words, to write the works that we know of, he likely used less than half his total vocabulary.
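The "less than half" claim follows from simple arithmetic on the two figures quoted above, sketched here with hypothetical variable names:

```python
# Figures quoted in the text above.
known_types = 31_534          # word types found in the known works
lower_bound_unused = 35_000   # lower bound on types known but never used

# Smallest total vocabulary consistent with the lower bound.
min_total_vocabulary = known_types + lower_bound_unused

# Largest share of that vocabulary the known works could represent.
max_fraction_used = known_types / min_total_vocabulary

print(f"at least {min_total_vocabulary:,} types in total")
print(f"at most {max_fraction_used:.1%} of them used")
```

Because 35,000 is only a lower bound on the unused types, the true share used could be smaller still.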
(Bradley Efron and Ronald Thisted, “Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?”, Biometrika 63:3 (1976), 435-447.) (Thanks, Brent.)