Moby Dick is Herman Melville’s epic novel of 1851. It tells the story of Captain Ahab’s pursuit of a great white sperm whale that has bitten off his leg. Gripped by a narcissistic rage, he risks his own life and the lives of his crew on the whaling ship Pequod in a single-minded voyage of revenge. The tale is narrated by Ishmael, who is taking part in his first whaling expedition, and we encounter a multinational crew including Queequeg, Starbuck, Stubb, Tashtego, Flask and Daggoo. The story is interspersed with detailed chapters on whales, almost in the form of a mini encyclopedia.
Below is an analysis of the text:
First, download the book from Project Gutenberg and display the output:
A tibble of 23,571 rows and 3 columns (first 10 rows shown, 23,561 more rows omitted):

text (chr) | linenumber (int) | chapter (int) |
---|---|---|
MOBY DICK; | 1 | 0 |
OR THE WHALE | 2 | 0 |
"" | 3 | 0 |
by Herman Melville | 4 | 0 |
"" | 5 | 0 |
"" | 6 | 0 |
"" | 7 | 0 |
"" | 8 | 0 |
" CHAPTER 1" | 9 | 0 |
"" | 10 | 0 |
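A minimal sketch of how a table like this might be built in R, assuming the dplyr and stringr packages. The helper `annotate_lines()`, the chapter-heading pattern, and the small in-memory sample are illustrative assumptions, not the author's code. The commented-out download line further assumes the gutenbergr package and that Project Gutenberg ID 2701 is the edition used here.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: number each line and keep a running chapter count.
# The heading pattern is an assumption; it only matches lines that START
# with "chapter <number>", so an indented heading would not be counted.
annotate_lines <- function(book) {
  book %>%
    mutate(
      linenumber = row_number(),
      chapter = cumsum(str_detect(text, regex("^chapter \\d+", ignore_case = TRUE)))
    )
}

# Small in-memory stand-in for the downloaded book (one row per line of text).
sample_book <- tibble(
  text = c("MOBY DICK;", "OR THE WHALE", "", "CHAPTER 1", "Call me Ishmael.")
)

annotated <- annotate_lines(sample_book)
print(annotated)

# With network access, the full text could be fetched instead (assumed ID):
# moby <- gutenbergr::gutenberg_download(2701) %>%
#   select(text) %>%
#   annotate_lines()
```

Keeping the download separate from the annotation step means the numbering logic can be checked on a few lines before running it on the whole book.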
Then, tidy the text into a more manageable format, with one lowercase word per row:
linenumber | chapter | word |
---|---|---|
1 | 0 | moby |
1 | 0 | dick |
2 | 0 | or |
2 | 0 | the |
2 | 0 | whale |
4 | 0 | by |
4 | 0 | herman |
4 | 0 | melville |
9 | 0 | chapter |
9 | 0 | 1 |
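One common way to get a one-word-per-row layout like this is `unnest_tokens()` from the tidytext package; that the original analysis used it is an assumption. The sketch below runs on a few hand-typed lines rather than the full book.

```r
library(dplyr)
library(tidytext)

# A few lines in the same shape as the downloaded table (sample data only).
book_lines <- tibble(
  text = c("MOBY DICK;", "OR THE WHALE", "", "by Herman Melville"),
  linenumber = 1:4,
  chapter = 0L
)

# Split each line into words: unnest_tokens() lowercases the text,
# strips punctuation, and replaces the `text` column with `word`.
tidy_words <- book_lines %>%
  unnest_tokens(word, text)

print(tidy_words)
```

Blank lines produce no tokens, so they contribute no rows, which is consistent with the gaps in `linenumber` in the table above.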
Next, remove the stop words – the uninteresting, common words, such as: I, me, my, myself, we, our, ours, ourselves, you…
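A hedged sketch of this step, assuming the `stop_words` data set that ships with the tidytext package (it combines several stop-word lexicons). The author's actual stop-word list may differ, and the sample words are typed in by hand.

```r
library(dplyr)
library(tidytext)

# Sample tidy words, one per row (stand-in for the full book).
tidy_words <- tibble(
  word = c("moby", "dick", "or", "the", "whale", "by", "herman", "melville")
)

# Keep only the rows whose word does NOT appear in the stop-word list.
content_words <- tidy_words %>%
  anti_join(stop_words, by = "word")

print(content_words)
```

An anti-join keeps the filtering declarative: swapping in a different stop-word table only changes the second argument.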
Of the remaining words, find the most frequent in the book and display the top 10, first in a table and then in a word cloud:
word | n |
---|---|
whale | 1094 |
sea | 451 |
ahab | 436 |
ship | 431 |
ye | 430 |
head | 343 |
time | 332 |
captain | 308 |
boat | 291 |
white | 282 |
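Counting and plotting could be sketched as follows, assuming dplyr for the frequency table and the wordcloud package for the plot. The sample words are made up for illustration, and the plot is skipped if wordcloud is not installed.

```r
library(dplyr)

# Sample words after stop-word removal (illustrative data only).
content_words <- tibble(
  word = c("whale", "sea", "whale", "ship", "sea", "whale")
)

# Count occurrences, sort by descending frequency, keep at most 10 rows.
top_words <- content_words %>%
  count(word, sort = TRUE) %>%
  head(10)

print(top_words)

# Draw the word cloud only when the (assumed) wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(1)  # word placement is random; fix the seed for a repeatable layout
  wordcloud::wordcloud(words = top_words$word, freq = top_words$n, min.freq = 1)
}
```

On the full book, the same `count()` call would yield a table in the shape shown above, with columns `word` and `n`.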
The R code used is available on GitHub.