One Billion Words In The English Language

Ken Ashford | Popular Culture

The Associated Press headline proclaims "English Language Hits 1 Billion Words" and the first three paragraphs of the article read:

A massive language research database responsible for bringing words such as "podcast" and "celebutante" to the pages of the Oxford dictionaries has officially hit a total of 1 billion words, researchers said Wednesday.

Drawing on sources such as weblogs, chatrooms, newspapers, magazines and fiction, the Oxford English Corpus spots emerging trends in language usage to help guide lexicographers when composing the most recent editions of dictionaries.

The press publishes the Oxford English Dictionary, considered the most comprehensive dictionary of the language, which in its most recent August 2005 edition added words such as "supersize," "wiki" and "retail politics" to its pages.

Wow.  That’s a lot of words, you’re thinking.  (You’re also trying to figure out how many words you probably know.)

But . . . not so fast, hombre.  The next graf spoils the party:

Oxford University Press lexicographer Catherine Soanes said the database is not a collection of 1 billion different words, but of sentences and other examples of the usage and spelling.

And if you go to the actual Oxford English Corpus website, you learn this:

Because the corpus is a collection of texts, there are not one billion different words: the humble word ‘the’, the commonest in the written language, accounts for 50 million of all the words in the corpus!

And you see this screenshot of a small snippet of the database, showing twenty instances of the word "sublime":

[Screenshot: Oxford English Corpus concordance for "sublime"]

And for all we know, there may be more than twenty appearances of the word "sublime" in the database.

So there aren’t one billion English words.  It’s just that the database used to catalogue and monitor English usage has collected one billion words, a far different (and less interesting) story.  And even then, most of the one billion words are repeated dozens, hundreds, thousands, or even millions of times.
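If the difference between "one billion words" and "one billion different words" still feels slippery, here is a rough sketch in Python. To be clear, this is my own toy illustration, not anything Oxford actually runs: the sample sentence, the function name `token_and_type_counts`, and the crude regex for splitting words are all made up for the example. It counts every word occurrence (the kind of number in the headline) separately from the distinct words (the kind of number the headline implies).

```python
import re
from collections import Counter


def token_and_type_counts(text):
    """Return (total word occurrences, distinct words, per-word counts).

    Hypothetical helper for illustration only; a real corpus tool would
    use a far more careful tokenizer than this simple regex.
    """
    # Lowercase so "The" and "the" count as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return len(words), len(counts), counts


# A made-up mini "corpus".
sample = "The cat saw the dog. The dog saw the cat, and the cat ran."

total, distinct, counts = token_and_type_counts(sample)
print("total words:   ", total)
print("distinct words:", distinct)
print("most common:   ", counts.most_common(2))
```

This tiny sample has 14 words in total but only 6 different ones, and "the" alone accounts for 5 of the 14. Scale that up and you get the Oxford figure: 50 million occurrences of "the" out of one billion words is 5 percent of the whole corpus spent on a single word.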

Now go back to the top of this post and read the AP headline.  Deceptive, isn’t it?