Exercises
The word “caracara” is in words. Note that in this word a sequence of four letters is immediately repeated. Write a function that uses a regular expression to find all of the words in words that contain a doublet like this: specifically, a sequence of four or more letters that is immediately repeated. (The repeated sequence need NOT make up the entire word, as it does in the case of “caracara.”) The program should cat() the words to the console, one per line.
Write a program that uses a regular expression to find all of the words in words that contain a “q” that is not immediately followed by a “u.” The program should cat() the words to the console, one per line.
Write a program that uses a regular expression to find all of the words in words that are 4-letter or 5-letter palindromes. The program should cat() the words to the console, one per line. (Hint: consult Section 12.3.8.)
It can be proven mathematically that there is no regular expression, no matter how complex, that matches all and only the palindromes. We can write regular expressions to match all and only the palindromes that are shorter than some fixed number of characters, but we can’t match palindromes of arbitrary length. Write a program to find all of the palindromes in words. Obviously this program won’t have to use a regular expression!
(*) Project Gutenberg’s version of Jane Austen’s classic novel Pride and Prejudice may be downloaded from the URL:
Your mission is to create a lexicon for Austen’s novel. Follow the pattern of the work done in the text to make a lexicon for The Wizard of Oz. You will have to make different choices, though, about what constitutes a “word.” For example, Austen’s prose is complex and “grown-up” in comparison to the prose of Oz, so the parts of hyphenated words probably constitute valid words in and of themselves. Austen has a habit of concealing certain place-names with dashes, and occasionally in letters to each other Austen’s characters will refer to a person by an initial. Should initials and sequences of dashes count as words? And what about “12th,” as in “the 12th of December”? As you examine the text and your initial word list you will have to make decisions about these and other matters.
Write a report in R Markdown in which you create the lexicon. Include the code for all of your work, beginning from the file download, so that a person who runs all of the code in the document will create the very same lexicon you made. As you develop the lexicon, explain your code and the rationale for your choices about what counts as a “word” in Austen’s novel.
Conclude the report with a data frame of the twenty most common words in Austen’s novel that are more than eight letters long. The frame should have two variables: one for the words and another for the number of occurrences. The frame should be sorted so that the words appear in decreasing order of frequency (i.e., the most common word comes first).
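The shape of the concluding frame can be sketched on a toy vector. This is only one possible approach: the names toy_words, long_words, sorted, freq and top are hypothetical, and the nine-element vector merely stands in for whatever word list your own lexicon work produces.

```r
# Hypothetical toy vector standing in for the words extracted from the novel.
toy_words <- c("happiness", "gentleman", "happiness", "the", "acquaintance",
               "gentleman", "happiness", "and", "civility")

# Keep only the words that are more than eight letters long.
long_words <- toy_words[nchar(toy_words) > 8]

# Count occurrences, then sort so the most common word comes first.
sorted <- sort(table(long_words), decreasing = TRUE)

# Build a frame with one variable for the words and one for the counts.
freq <- data.frame(
  word = names(sorted),
  n = as.vector(sorted),
  stringsAsFactors = FALSE
)

# Keep at most the first twenty rows.
top <- head(freq, 20)
print(top)
```

With a real word list, ties in frequency are possible, so you may want to decide (and document) how tied words should be ordered.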
(*) Install the tidytext and gutenbergr packages. gutenbergr simplifies the task of downloading a Project Gutenberg text, stripping off the spurious leading and trailing material. tidytext automates a number of basic text-analysis tasks. Read the vignettes for these packages and learn how to find the words in a text. Find the unique words in The Wizard of Oz as determined by tidytext, and use the setdiff() function to compare the ozWords in this chapter with the word list according to tidytext. What are the primary differences?
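When comparing two word lists, keep in mind that setdiff() is not symmetric: it returns the elements of its first argument that are absent from its second. A minimal sketch, using two small hypothetical vectors (list_a and list_b) in place of ozWords and the tidytext word list:

```r
# Hypothetical stand-ins for the two word lists being compared.
list_a <- c("dorothy", "toto", "kansas", "cyclone")
list_b <- c("dorothy", "toto", "kansas", "scarecrow")

# Words that appear in list_a but not in list_b.
only_in_a <- setdiff(list_a, list_b)

# Words that appear in list_b but not in list_a.
only_in_b <- setdiff(list_b, list_a)

cat("Only in list_a:", only_in_a, "\n")
cat("Only in list_b:", only_in_b, "\n")
```

Running the comparison in both directions shows what each method counts as a word that the other does not.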