13.5 Application: Making a Lexicon

A lexicon is any list of words, usually the words of a language or of some body of texts produced in that language. In this section we’ll combine our file-handling skills with regular expressions to create a lexicon for L. Frank Baum’s classic The Wizard of Oz.

First you need to get hold of the text itself. The Wizard of Oz has been in the public domain for many years now, and is available from Project Gutenberg. You could look it up there and find the URL for download:

http://www.gutenberg.org/cache/epub/55/pg55.txt

Go ahead and download the file:

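ozURL <- "http://www.gutenberg.org/cache/epub/55/pg55.txt"
# make sure the destination directory exists before downloading
if ( !dir.exists("downloads") ) {
  dir.create("downloads")
}
if ( !(file.exists("downloads/oz.txt")) ) {
  download.file(url = ozURL,
                destfile = "downloads/oz.txt")
}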

Next, read in the file, line by line:

oz <- readLines(con = "downloads/oz.txt")

Take some time to look through oz.txt. If we plan to make a lexicon for the book, then we probably don’t want to include Gutenberg’s header material or their discussion of licensing at the end. We’ll need to cut this material out of oz:

# a helper function
findIndex <- function(pattern, text) {
  str_detect(text, pattern = pattern) %>% which()
}

# now find lines to start and end at:
firstLine <- findIndex("^\\*\\*\\* START OF THIS PROJECT GUTENBERG", oz) + 1
lastLine <- findIndex("^End of Project Gutenberg's", oz)  - 1

# trim oz to the desired text:
oz2 <- oz[firstLine:lastLine]
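
To see what findIndex() and the trimming step do, here is a minimal sketch on a made-up six-line vector standing in for oz. The vector toy and its marker lines are assumptions for illustration only; the markers in the real file may be worded differently, depending on the edition you download.

```r
library(stringr)

# same helper as above
findIndex <- function(pattern, text) {
  str_detect(text, pattern = pattern) %>% which()
}

# hypothetical stand-in for the lines of the downloaded file
toy <- c("Title: Toy Book",
         "*** START OF THIS PROJECT GUTENBERG EBOOK ***",
         "Dorothy lived in Kansas.",
         "Toto played all day long.",
         "End of Project Gutenberg's Toy Book",
         "License text")

# one line after the start marker, one line before the end marker
first <- findIndex("^\\*\\*\\* START OF THIS PROJECT GUTENBERG", toy) + 1
last  <- findIndex("^End of Project Gutenberg's", toy) - 1

toyBody <- toy[first:last]
toyBody
```

If a pattern matches no line at all, findIndex() returns a zero-length vector and the subsetting step fails, so it is worth confirming that firstLine and lastLine each hold exactly one number before trimming.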

Let us now take a first pass at splitting the text into all of its “words.” Our first thought is that words are the parts of the text that are separated by one or more white-space characters, so we might try this:

ozwds <- 
  oz2 %>% 
  str_split(pattern = "\\s+") %>% 
  unlist()
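
Here is a small sketch of what this split produces, using two made-up lines (the vector toyLines is an assumption for illustration). Notice that a line with leading white space yields an empty string as its first piece, which is one way an empty "word" can end up in the list.

```r
library(stringr)

toyLines <- c("The cyclone had set", "  the house down")

# str_split() returns a list with one character vector per input line
pieces <- str_split(toyLines, pattern = "\\s+")

# unlist() flattens the list into a single character vector
toyWords <- unlist(pieces)
toyWords
```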

Next, let’s convert all of the words to lowercase, then make a new word-list with no repeats, then sort that list:

ozWords <- 
  ozwds %>% 
  str_to_lower() %>% 
  unique() %>% 
  str_sort()
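
The order of these steps matters: lowercasing before unique() is what lets "The" and "the" collapse into a single entry. A minimal sketch on a made-up vector (toyWords is hypothetical):

```r
library(stringr)

toyWords <- c("The", "witch", "the", "Witch", "cap")

toyLexicon <-
  toyWords %>%
  str_to_lower() %>%  # "The" and "the" become identical
  unique() %>%        # drop the repeats
  str_sort()          # alphabetical order
toyLexicon

# without lowercasing first, the case variants survive as separate entries
nDistinctRaw <- length(unique(toyWords))
nDistinctRaw
```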

It’s time now to examine our prospective lexicon. Look around a bit in ozWords.

There are some numbers. We’ll get rid of them.
There are an awful lot of strings that begin or end with punctuation. The punctuation should not be difficult to remove with str_replace().
There are plenty of contractions, like "aren't". It seems reasonable to count these as words, so we’ll leave them alone.
There are hyphenated words. What should we do with them? In a book for grown-ups you’ll find lots of hyphenated words where the component words are meaningful in themselves. That’s because of a grown-up grammar rule that says we should “hyphenate two or more words when they come before a noun they modify and act as a single idea.” This is important: after all, “short-listed candidates” refers to candidates who appear on our short list, whereas “short listed candidates” appears to refer to short candidates who appear on our list. If this were a grown-up text then I’d want to split on hyphens in order to capture the meaningful word-components. In The Wizard of Oz, though, it seems that hyphenation is used mainly to create interesting new words out of nonsense sound-fragments, as in:

So the Wicked Witch took the Golden Cap from her cupboard and placed it upon her head. Then she stood upon her left foot and said slowly:

“Ep-pe, pep-pe, kak-ke!”

Next she stood upon her right foot and said:

“Hil-lo, hol-lo, hel-lo!”

After this she stood upon both feet and cried in a loud voice:

“Ziz-zy, zuz-zy, zik!”

It would seem that (for example) “zuz-zy” should count as a word, but its parts “zuz” and “zy” should not count as words.
On the other hand, we often find pairs of hyphens that appear to act as the em dash character (—):

The cyclone had set the house down very gently--for a cyclone--in the midst of a country of marvelous beauty.

This causes such strings as “gently--for” and “cyclone--in” to appear in our lexicon. We don’t want that, so we need to split on double-hyphens.

Let’s go back and try the splitting again:

ozwds <-
  oz2 %>% 
  str_split(pattern = "(?x)    # allow comments
                      (-{2,})  # two or more hyphens
                      |        # or
                      (\\s+)   # whitespace
                      ") %>% 
  unlist()
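
As a quick check of the new pattern, here is a sketch that applies it to part of the cyclone sentence quoted above and to one of the Witch's chants. The variable names are made up for illustration.

```r
library(stringr)

# same pattern as above, written on one line without the comments
splitPattern <- "(-{2,})|(\\s+)"

sentence <- "set the house down very gently--for a cyclone--in the midst"
sentenceWords <- str_split(sentence, pattern = splitPattern) %>% unlist()
sentenceWords

chant <- "Ziz-zy, zuz-zy, zik!"
chantWords <- str_split(chant, pattern = splitPattern) %>% unlist()
chantWords
```

The double hyphens now act as separators, while the single hyphens inside "Ziz-zy" and "zuz-zy" are left alone. The commas and the exclamation mark are still attached at this stage.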

Now let’s strip off any leading or trailing punctuation. In ICU regular expressions (the flavor that stringr uses), \p{P} is the Unicode property class that matches any punctuation character:

ozWords <-
  ozwds %>% 
  str_replace(pattern = "^\\p{P}+",
              replacement = "") %>%   # strip leading punctuation
  str_replace(pattern = "\\p{P}+$",
              replacement = "") %>%   # strip trailing punctuation
  str_to_lower() %>% 
  unique() %>% 
  str_sort()
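
Because both patterns are anchored (^ at the start, $ at the end), only punctuation on the edges of a string is removed; apostrophes and hyphens inside a word stay in place. A minimal sketch, using a hypothetical helper stripPunct() and made-up input:

```r
library(stringr)

# hypothetical helper wrapping the two replacements used above
stripPunct <- function(x) {
  x %>%
    str_replace(pattern = "^\\p{P}+", replacement = "") %>%  # leading
    str_replace(pattern = "\\p{P}+$", replacement = "")      # trailing
}

stripped <- stripPunct(c("\"humbug,\"", "aren't", "zuz-zy,", "(zik!)", "..."))
stripped
```

A string that consists of nothing but punctuation, such as "...", is reduced to the empty string, so empty entries can appear in the word list at this step.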

Taking a second look at ozWords, we see that we need to get rid of some numbers and a spurious empty string:

isNumber <- str_detect(ozWords, pattern = "^\\d+")
isEmpty <- ozWords == ""
validWord <- !isNumber & !isEmpty
ozWords <- ozWords[validWord]
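
Logical vectors like isNumber and isEmpty work as masks: each position is TRUE or FALSE, and subsetting keeps only the TRUE positions. Here is a sketch on a made-up vector. Note that "^\\d+" flags any string that starts with a digit, so an entry such as "1st" is dropped as well. The stricter pattern "^\\d+$" is shown only for comparison; it is an alternative, not what the code above uses.

```r
library(stringr)

toyWords <- c("", "16", "1900", "1st", "aunt", "zuz-zy")

isNumber <- str_detect(toyWords, pattern = "^\\d+")  # starts with a digit
isEmpty  <- toyWords == ""
kept <- toyWords[!isNumber & !isEmpty]
kept

# alternative: flag only strings made up entirely of digits
isAllDigits <- str_detect(toyWords, pattern = "^\\d+$")
keptStrict <- toyWords[!isAllDigits & !isEmpty]
keptStrict
```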

A final check of ozWords does not appear to turn up any serious problems. We’ll take it as our lexicon for The Wizard of Oz.

For fun, let’s make an index to the book. First, a little helper function:

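indexFactory <- function(lexicon, fn) {
  index <- list()
  fileLines <- readLines(con = fn)
  for ( i in seq_along(lexicon) ) {
    word <- lexicon[i]
    # match the word literally, ignoring case, so that any regex
    # metacharacters inside a word are not treated as special
    pattern <- fixed(word, ignore_case = TRUE)
    hasWord <- str_detect(fileLines, pattern = pattern)
    # store the numbers of the lines that contain the word
    index[[word]] <- which(hasWord)
  }
  index
}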

Now we call our helper function to create the index:

ozIndex <- indexFactory(ozWords, "downloads/oz.txt")
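
One thing to keep in mind when reading the index: a word is recorded for every line in which its letters occur, even inside a longer word. The sketch below shows the effect on three made-up lines and, for comparison, a whole-word variant that uses the \b word-boundary anchor. The variant is an assumption for illustration, not part of indexFactory(), and it presumes the word contains no regex metacharacters.

```r
library(stringr)

toyLines <- c("The Witch laughed.",
              "He ran to the gate.",
              "Dorothy smiled.")

# substring match, ignoring case: "he" is also found inside "The"
substringHits <- which(str_detect(toyLines, pattern = "(?i)he"))
substringHits

# whole-word match: only the line with the separate word "He"
wholeWordHits <- which(str_detect(toyLines,
                                  pattern = regex("\\bhe\\b", ignore_case = TRUE)))
wholeWordHits
```

For a long, distinctive word like "humbug" the two approaches give the same lines; the difference shows up mainly for short words such as "a" or "he".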

We can use the index to look up words in the original text, without having to use regular expressions explicitly. First we make a convenience function for looking up words:

ozLookup <- function(word) {
  
  fn <- "downloads/oz.txt"
  lexicon <- ozWords
  index <- ozIndex
  
  file <- readLines(con = fn)
  if ( !(word %in% lexicon ) ) {
    message <- paste0("\"", word, "\" is not in the lexicon!\n")
    return(cat(message))
  }
  matchLines <- index[[word]]
  number <- length(matchLines)
  cat("There are ", number, "lines that contain your request.\n\n")
  hrule <- rep("-", times = 30)
  for ( i in seq_len(number) ) {
    lineNum <- matchLines[i]
    cat(hrule, "\n")
    cat(lineNum, ":  ", file[lineNum], "\n")
  }
}
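
The check against the lexicon at the top of ozLookup() matters because of how R lists behave: asking a list for a name it does not have returns NULL rather than raising an error. A minimal sketch with a made-up index (toyIndex and its line numbers are hypothetical):

```r
# hypothetical index: each word maps to the line numbers where it occurs
toyIndex <- list(humbug = c(61L, 3421L), witch = c(12L, 40L, 97L))

found <- toyIndex[["humbug"]]    # line numbers for a word in the index
found
length(found)

notFound <- toyIndex[["lolliop"]]
is.null(notFound)                # TRUE: no error for an unknown name
length(notFound)                 # 0
```

With nothing to print for an unknown word, returning early with a clear message is friendlier than producing an empty report.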

Now we give it a try:

ozLookup("lolliop")

## "lolliop" is not in the lexicon!

And another try:

ozLookup("humbug")

## There are  10 lines that contain your request.
## 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 61 :     16.  The Magic Art of the Great Humbug 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3421 :   a humbug." 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3424 :   it pleased him.  "I am a humbug." 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3438 :   "Doesn't anyone else know you're a humbug?" asked Dorothy. 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3472 :   being such a humbug." 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3603 :   one I am a humbug." 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3607 :   Terrible Humbug," as she called him, would find a way to send her back 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3613 :     16.  The Magic Art of the Great Humbug 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3738 :   they wanted.  "How can I help being a humbug," he said, "when all these 
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
## 3802 :   "Yes, of course," replied Oz.  "I am tired of being such a humbug.  If