English by Degrees

In his landmark paper “A Mathematical Theory of Communication,” Claude Shannon experimented with a series of stochastic approximations to English. He started with a sample message in which the 26 letters and the space all appear with equal probability:

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.

In the next message, the symbols’ frequencies are weighted according to how commonly they appear in English text (for example, E is more likely than W):

OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
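
For readers who want to try these first two steps, here is a minimal Python sketch. It is not Shannon's procedure (he worked by hand from printed tables and books). The function names and the tiny sample text are invented for illustration, and the letter frequencies are estimated from whatever sample you supply rather than from a reference table.

```python
import random
from collections import Counter

# The 27 symbols Shannon used: 26 letters plus the space.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def normalize(text):
    """Uppercase the text and keep only the 26 letters and the space."""
    return "".join(ch for ch in text.upper() if ch in ALPHABET)

def uniform_message(length, rng):
    """First message: every symbol is equally likely."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def weighted_message(sample, length, rng):
    """Second message: symbols drawn in proportion to their counts in `sample`."""
    counts = Counter(normalize(sample))
    if not counts:
        raise ValueError("sample contains no usable symbols")
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=length))

if __name__ == "__main__":
    rng = random.Random(1948)  # fixed seed so runs are repeatable
    sample = "the quick brown fox jumps over the lazy dog and the eager reader"
    print(uniform_message(40, rng))
    print(weighted_message(sample, 40, rng))
```

With a sample this small the weights are crude; a longer body of text gives frequencies closer to those of English at large.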

In the third he links each letter to its predecessor: after one letter is recorded, the next is chosen with a probability weighted by how often that pair of letters (a “digram”) appears in natural English:

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.

In the fourth he applies the same idea to sets of three letters (“trigrams”):

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
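
The digram and trigram steps can be sketched in the same way. The version below is one possible approach under stated assumptions, not a reconstruction of Shannon's method: `build_model` and `generate` are hypothetical names, the text is assumed to be already reduced to letters and spaces, and `order` is the number of preceding letters consulted (1 for digrams, 2 for trigrams). When the generator reaches a context that never had a successor in the sample, it simply restarts from a random context, which is a shortcut of this sketch.

```python
import random
from collections import Counter, defaultdict

def build_model(text, order):
    """Map each `order`-letter context to a Counter of the letters that follow it."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model

def generate(text, order, length, rng):
    """Generate `length` symbols, each weighted by what followed the last `order` symbols."""
    if order < 1:
        raise ValueError("order must be at least 1")
    model = build_model(text, order)
    if not model:
        raise ValueError("text is too short for this order")
    contexts = list(model)
    out = rng.choice(contexts)  # start from a context known to have a successor
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:
            # Dead end: this context appeared only at the very end of the sample.
            out += rng.choice(contexts)
            continue
        symbols = list(followers)
        weights = [followers[s] for s in symbols]
        out += rng.choices(symbols, weights=weights)[0]
    return out[:length]

if __name__ == "__main__":
    rng = random.Random(1948)
    sample = "THE THEORY OF THE THING IS THAT THESE THREE THINGS THRIVE THERE"
    print(generate(sample, 1, 60, rng))  # digram-style
    print(generate(sample, 2, 60, rng))  # trigram-style
```

Raising `order` makes the output look more like the sample, and with a small sample it soon begins copying whole stretches of it.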

In the fifth he shifts from letters to words. Words appear in a manner weighted by their frequency in English (without regard to the prior word):

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

Finally, he applies the digram technique to words: each word is chosen according to how often it follows the preceding word in English:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Already this is starting to look like English — Shannon notes that the 10-word phrase ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS could find a home in a natural sentence without much strain.
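
The word-level digram step is just as easy to imitate. In this rough sketch the names `word_successors` and `word_digram_message` are made up for the example, words are split on whitespace only, and the sample sentence is a placeholder. Each word keeps a list of every word that followed it in the sample, duplicates included, so a uniform pick from that list is already weighted by how often each pair occurred.

```python
import random
from collections import defaultdict

def word_successors(text):
    """Return the word list and, for each word, every word that followed it."""
    words = text.upper().split()
    successors = defaultdict(list)
    for current, following in zip(words, words[1:]):
        successors[current].append(following)  # duplicates act as weights
    return words, successors

def word_digram_message(text, length, rng):
    """Chain `length` words, each picked from the successors of the one before."""
    words, successors = word_successors(text)
    if not words:
        raise ValueError("text contains no words")
    word = rng.choice(words)
    out = [word]
    while len(out) < length:
        options = successors.get(word)
        # If a word was never followed by anything, fall back to a random word.
        word = rng.choice(options) if options else rng.choice(words)
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    rng = random.Random(1948)
    sample = ("the character of the writer is the point of the problem "
              "and the method of the attack is the time of the letters")
    print(word_digram_message(sample, 20, rng))
```

Fed a full book instead of one sentence, a chain like this produces passages of the same flavor as Shannon's.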

He had to stop there, as this was 1948 and he was using paper books. “But the modern availability of computing power has made carrying out such calculations automatically a near-trivial task for reasonably-sized bodies of sample text,” writes UC Santa Cruz computer scientist Noah Wardrip-Fruin. “As Shannon also pointed out, the stochastic processes he described are commonly considered in terms of Markov models. And, interestingly, the first application of Markov models was also linguistic and literary — modeling letter sequences in Pushkin’s poem ‘Eugene Onegin.’ But Shannon was the first to bring this mathematics to bear meaningfully on communication, and also the first to use it to perform text-generation play.”

(Noah Wardrip-Fruin, “Playable Media and Textual Instruments,” in Peter Gendolla and Jörgen Schäfer, eds., The Aesthetics of Net Literature, 2007.)