Moby Dick is Herman Melville’s epic novel of 1851. It tells the story of Captain Ahab’s pursuit of a great white sperm whale that has bitten off his leg. Gripped by a narcissistic rage, he risks his own life and the lives of his crew on the whaling ship Pequod in a single-minded voyage of revenge. The tale is narrated by Ishmael, who is taking part in his first whaling expedition, and we encounter a multinational crew including Queequeg, Starbuck, Stubb, Tashtego, Flask and Daggoo. The story is interspersed with detailed chapters on whales, almost in the form of a mini-encyclopedia.
Below is an analysis of the text:
First, download the book from Project Gutenberg, number each line and chapter, and display the first rows:
## # A tibble: 23,571 x 3
##    text               linenumber chapter
##    <chr>                   <int>   <int>
##  1 MOBY DICK;                  1       0
##  2 OR THE WHALE                2       0
##  3 ""                          3       0
##  4 by Herman Melville          4       0
##  5 ""                          5       0
##  6 ""                          6       0
##  7 ""                          7       0
##  8 ""                          8       0
##  9 "  CHAPTER 1"               9       0
## 10 ""                         10       0
## # ... with 23,561 more rows
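A table of this shape could be built roughly as follows. This is a minimal sketch, not the author's code: it assumes the dplyr and stringr packages, the helper name `number_lines` and the chapter regex are illustrative, and 2701 is assumed to be the Project Gutenberg ID for Moby Dick. Unlike the output above, this sketch assigns a chapter heading line to the chapter it opens, so the author's rule evidently differs.

```r
library(dplyr)
library(stringr)

# Add a line number and a running chapter counter to a data frame
# that has one line of the book per row in a `text` column.
# (Hypothetical helper; the heading pattern is an assumption.)
number_lines <- function(book) {
  book %>%
    mutate(
      linenumber = row_number(),
      # Count the chapter headings seen so far; lines before the first get 0.
      chapter = cumsum(str_detect(text, regex("^\\s*chapter \\d", ignore_case = TRUE)))
    )
}

# With network access and the gutenbergr package, the full text could be
# fetched like this (2701 is assumed to be the ID for Moby Dick):
# book <- gutenbergr::gutenberg_download(2701) %>% select(text) %>% number_lines()

# Small offline stand-in for the downloaded text.
sample_lines <- tibble(text = c("MOBY DICK;", "", "  CHAPTER 1", "Call me Ishmael."))
numbered <- number_lines(sample_lines)
numbered
```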
Then, tidy the text into a more manageable format, with one lower-case word per row:
| linenumber | chapter | word | 
|---|---|---|
| 1 | 0 | moby | 
| 1 | 0 | dick | 
| 2 | 0 | or | 
| 2 | 0 | the | 
| 2 | 0 | whale | 
| 4 | 0 | by | 
| 4 | 0 | herman | 
| 4 | 0 | melville | 
| 9 | 0 | chapter | 
| 9 | 0 | 1 | 
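The one-word-per-row layout above is what `unnest_tokens()` from the tidytext package produces, so the step might look something like this. It is a sketch on a small stand-in data frame (the names `book` and `tidy_book` are illustrative), not necessarily the author's exact call.

```r
library(dplyr)
library(tidytext)

# Stand-in for the numbered book; the real input has 23,571 rows.
book <- tibble(
  text = c("MOBY DICK;", "OR THE WHALE", "", "by Herman Melville"),
  linenumber = 1:4,
  chapter = 0L
)

# unnest_tokens() splits `text` into one word per row, lower-cases it,
# strips punctuation, and drops lines that contain no words.
tidy_book <- book %>%
  unnest_tokens(word, text)

tidy_book
```

Blank lines disappear at this stage, which is why line 3 is missing from the table above.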
Next, remove the stop words – the uninteresting, common words, such as: I, me, my, myself, we, our, ours, ourselves, you…
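A common way to drop stop words in R is an anti-join against the `stop_words` data set that ships with tidytext. The sketch below assumes that list is the one used, and runs on a handful of illustrative words rather than the whole book.

```r
library(dplyr)
library(tidytext)

# Illustrative sample of tokenised words.
tidy_book <- tibble(word = c("the", "whale", "and", "ahab", "me", "my", "sea"))

# anti_join() keeps only the rows whose word does NOT appear in stop_words.
# Assumption: all lexicons in tidytext's built-in list are applied.
tidy_book_clean <- tidy_book %>%
  anti_join(stop_words, by = "word")

tidy_book_clean
```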
Of the remaining words, find the most frequent in the book and display the top 10 in a table and then in a word cloud:
| word | n | 
|---|---|
| whale | 1094 | 
| sea | 451 | 
| ahab | 436 | 
| ship | 431 | 
| ye | 430 | 
| head | 343 | 
| time | 332 | 
| captain | 308 | 
| boat | 291 | 
| white | 282 | 
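A frequency table like this one can be sketched with `count()` from dplyr, and the cloud with the wordcloud package. Both are assumptions about tooling, and the data here is a toy sample, so the counts will not match the table.

```r
library(dplyr)

# Toy sample of words left after stop-word removal.
words <- tibble(word = c("whale", "sea", "whale", "ahab", "sea", "whale"))

# Count each word, sort by frequency (highest first), and keep the top 10.
top_words <- words %>%
  count(word, sort = TRUE) %>%
  head(10)

top_words

# Draw the word cloud only if the wordcloud package is installed.
# min.freq = 1 because the toy counts are tiny (the package default is 3).
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(top_words$word, top_words$n, min.freq = 1)
}
```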
The R code used is available on GitHub.