Text mining in R had quite the boost in 2016. David Robinson’s fascinating analysis of Donald Trump’s real and ‘official’ tweets got a lot of publicity (something the president-elect was probably all too happy with), and his collaboration with Julia Silge resulted in Tidy Text Mining with R, one of the best books yet published using the bookdown package.
Professor Silge also released a couple of R packages:
- tidytext - useful for tidying text for subsequent analyses
- janeaustenr - a dataset of Jane Austen’s novels
I’m not completely sold on the value of textual analysis for works of fiction, at least at its current stage of development, though I’m prepared to be convinced otherwise. To me, it is the equivalent of perusing the list of ingredients on a packaged good in order to assess its taste. When I want to know whether to read a novel, I’m interested in themes, settings, characters, quality of writing etc. which I doubt this can provide.
Nevertheless, it is now a lot easier (and more fun) to process novels - at least those in the public domain and on Project Gutenberg - thanks again to David Robinson and his gutenbergr package.
An interesting comparison to Jane Austen is Charles Dickens. His books are more wide ranging than Austen’s and have many memorable characters mixed in with social comment on Victorian England.
First we load the libraries and see what titles are available
# load libraries
library(tidyverse)
library(tidytext)
library(gutenbergr)
library(plotly)
library(stringr)
library(feather)
library(wordcloud2)

dickens <- gutenberg_works(author == "Dickens, Charles")
glimpse(dickens)
## Observations: 74
## Variables: 8
## $ gutenberg_id        <int> 46, 98, 564, 580, 588, 644, 650, 653, 675,...
## $ title               <chr> "A Christmas Carol in Prose; Being a Ghost...
## $ author              <chr> "Dickens, Charles", "Dickens, Charles", "D...
## $ gutenberg_author_id <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37...
## $ language            <chr> "en", "en", "en", "en", "en", "en", "en", ...
## $ gutenberg_bookshelf <chr> "Christmas/Children's Literature", "Histor...
## $ rights              <chr> "Public domain in the USA.", "Public domai...
## $ has_text            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
## "Christmas/Children's Literature"
## "Historical Fiction"
## "Mystery Fiction"
## "Best Books Ever Listings"
## NA
## "Christmas"
## "Children's Literature"
## "Children's History/United Kingdom"
## "Harvard Classics"
## "Detective Fiction"
## "Children's Picture Books"
So, extremely prolific and wide-ranging. I will probably want to limit this analysis to his novels and will start with one of his most highly-regarded, Great Expectations.
I probably read the book as a child but definitely remember a BBC series and the excellent 1946 film version (not on first release), which differs somewhat from the novel.
We can download its text via the gutenberg_id, which takes barely a second, and then follow (or, less charitably, plagiarise) the Tidy Text book’s code to get it into a ‘tidy’ format.
## Needed to add a mirror when the main site went down; this was an issue raised by someone else
expectations <- gutenberg_download(1400, mirror = "http://mirrors.xmission.com/gutenberg/")
glimpse(expectations)
## Observations: 20,024
## Variables: 2
## $ gutenberg_id <int> 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1...
## $ text         <chr> "GREAT EXPECTATIONS", "", "[1867 Edition]", "", "...
tidy_expectations <- expectations %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>% # 186,000+
  # remove most common words of which there are 1149 in total
  anti_join(stop_words) # 55,000

tidy_expectations
## # A tibble: 55,583 x 4
##    gutenberg_id linenumber chapter word
##           <int>      <int>   <int> <chr>
##  1         1400       3056      10 tramps
##  2         1400       6743      20 lounging
##  3         1400       4086      13 strengthening
##  4         1400      19276      57 strengthening
##  5         1400       3860      12 daily
##  6         1400       4362      14 daily
##  7         1400      17060      51 daily
##  8         1400      18781      56 daily
##  9         1400      18906      56 daily
## 10         1400      19713      58 daily
## # ... with 55,573 more rows
We now have a tidy data frame with each row a single word by linenumber/chapter.
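To see what the chapter counter and the stop-word filter are doing, here is a minimal sketch on a few made-up lines. The `toy` data and the object names are illustrative assumptions, not part of the book's code; the functions are the same dplyr, stringr and tidytext calls used above.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-in for a downloaded text: one row per line
toy <- tibble(text = c("PREFACE",
                       "Chapter I",
                       "My father's family name being Pirrip",
                       "Chapter II",
                       "My sister was more than twenty years older"))

# cumsum() over the TRUE/FALSE heading matches gives a running chapter number;
# lines before the first heading stay at chapter 0
toy_chapters <- toy %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

# one row per lower-cased word, minus anything listed in tidytext's stop_words
toy_tidy <- toy_chapters %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

toy_chapters$chapter
toy_tidy
```

The chapter column comes out as 0, 1, 1, 2, 2, which is why the later chapter plots filter out chapter 0 (the front matter).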
Interestingly, the word ‘expectations’ does not appear until Chapter 18, when the lawyer, Jaggers, informs Joe Gargery and Pip that the latter ‘will come into a handsome property’.
We can now visualize the most common words in a couple of ways. Hover over the plots for exact numbers.
word_count <- tidy_expectations %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n))

word_count %>%
  head(10) %>%
  plot_ly(x = ~n, y = ~word) %>%
  layout(title = "Most common words (excluding stop-words) in Great Expectations",
         xaxis = list(title = "Total Occurrences"),
         yaxis = list(title = "")) %>%
  config(displayModeBar = F, showLink = F)
word_count %>% head(100) %>% wordcloud2()
As is often the case in novels, character names predominate but it is of interest that Joe is so well in the lead. ‘expectations’ ranks in the low 200’s and ‘great’ is a stop word.
Let’s have a look at the occurrence of Joe throughout the story
tidy_expectations %>%
  filter(word == "joe") %>%
  group_by(chapter) %>%
  count(word) %>%
  plot_ly(x = ~chapter, y = ~n) %>%
  add_bars(color = I("blue"), alpha = 0.5) %>%
  layout(title = "Occurrences of word 'Joe' by Chapter",
         yaxis = list(title = "Occurrences"),
         xaxis = list(title = "Chapter")) %>% # was a second yaxis, which mislabelled the axes
  config(displayModeBar = F, showLink = F)
As you may recall, or can read here, Joe is Pip’s brother-in-law and surrogate father. He is a strong, positive influence on Pip as a boy. Chapter 27 is when Joe visits a mortified Pip in London, which brings out the worst in our ‘hero’, and Chapter 57 is when Joe comforts Pip in his illness, by which time Pip realizes how badly he has treated a true friend.
We can use the tools of text mining to approach the emotional content of text programmatically. The tidytext package has three sentiment lexicons for evaluating opinion or emotion in text. Here I will replicate some of the code in the book, with the occasional tangent.
# let's look at how one of the lexicons classifies words
nrc <- get_sentiments("nrc")
unique(nrc$sentiment)

## "trust" "fear" "negative" "sadness"
## "anger" "surprise" "positive" "disgust"
## "joy" "anticipation"
# config() is a plotly helper and does not belong on a plain tibble;
# it was what added the stray list-column x to the printed result
get_sentiments("nrc") %>%
  filter(sentiment == "positive")

## # A tibble: 2,312 x 2
##    word           sentiment
##    <chr>          <chr>
##  1 abba           positive
##  2 ability        positive
##  3 abovementioned positive
##  4 absolute       positive
##  5 absolution     positive
##  6 absorbed       positive
##  7 abundance      positive
##  8 abundant       positive
##  9 academic       positive
## 10 academy        positive
## # ... with 2,302 more rows
Good to see that ‘academic’ is positive! However, I will leave positive and negative out at this stage.
Let’s look at the other emotions as a percentage of all (non-stop) words in each chapter.
# first all words
words_chapter <- tidy_expectations %>%
  group_by(chapter) %>%
  count() %>%
  rename(total = n)

# sentiments to exclude
chuck <- c("negative", "positive")

tidy_expectations %>%
  inner_join(nrc) %>%
  filter(!sentiment %in% chuck) %>%
  group_by(sentiment, chapter) %>%
  count() %>%
  inner_join(words_chapter) %>%
  mutate(pc = round(100 * n / total, 1)) %>%
  filter(chapter != 0) %>%
  plot_ly(x = ~chapter, y = ~pc, color = ~sentiment) %>%
  add_bars() %>%
  layout(barmode = 'stack',
         title = "Great Expectations - % of each Chapter with words of varying emotions ",
         yaxis = list(title = "Percentage"))
Stacked bar-charts are not often the best method of visualization, but you can toggle entries in the legend to remove or add emotions. For instance, the fear factor peaks in the chapter when Pip has just attempted to rescue Miss Havisham from the fire and he determines that Estella is Magwitch’s daughter.
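The percentage calculation behind that chart can be checked by hand on a toy example. This is only a sketch: `toy_words` and `toy_nrc` are made-up stand-ins (a real lexicon is far larger, and in NRC a word can carry several emotions, so a join can return more than one row per word).

```r
library(dplyr)

# Hypothetical tidy words, already split by chapter
toy_words <- tibble(chapter = c(1, 1, 1, 1, 2, 2),
                    word    = c("fear", "joe", "pip", "joy", "fear", "dread"))

# Hypothetical mini-lexicon standing in for get_sentiments("nrc")
toy_nrc <- tibble(word      = c("fear", "joy", "dread"),
                  sentiment = c("fear", "joy", "fear"))

# denominator: number of words per chapter
totals <- toy_words %>%
  count(chapter, name = "total")

# numerator: words per emotion per chapter, then the percentage
toy_pc <- toy_words %>%
  inner_join(toy_nrc, by = "word") %>%
  count(sentiment, chapter) %>%
  inner_join(totals, by = "chapter") %>%
  mutate(pc = round(100 * n / total, 1))

toy_pc
```

Chapter 1 has four words, one of them a fear word, so fear is 25%; both words in chapter 2 are fear words, giving 100%.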
Another use of sentiment analysis is to examine the flow throughout the novel by breaking the text into equal chunks (here, blocks of 100 lines). This time I use the bing lexicon, which simply splits words into a binary positive/negative. The bing lexicon has more negative words than positive ones, so the absolute level for a single novel should be treated with caution. The trajectory over the course of the book, and any comparison with other novels, would be more robust.
tidy_expectations %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 100, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  plot_ly(x = ~index, y = ~sentiment) %>%
  add_bars()
Even with the caveat above, this is a bit of a downer, especially given that apparently (i.e. as referenced in Wikipedia) G.K. Chesterton admired the novel’s optimism.
Here is the tidytext book’s outcome for the Jane Austen novels.
I guess Dickens’s novel is a little grittier than life in upper middle-class country homes.
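The chunked net-sentiment calculation used for the bing plot is easy to verify on a handful of rows. A minimal sketch, assuming a made-up lexicon (`toy_lexicon`) in place of `get_sentiments("bing")` and invented line numbers:

```r
library(dplyr)
library(tidyr)

# Hypothetical mini-lexicon standing in for get_sentiments("bing")
toy_lexicon <- tibble(word      = c("happy", "kind", "dark", "fear"),
                      sentiment = c("positive", "positive", "negative", "negative"))

# Hypothetical tidy words with the line each one came from
toy_words <- tibble(linenumber = c(1, 50, 99, 100, 150, 199),
                    word       = c("happy", "kind", "dark", "fear", "dark", "kind"))

toy_flow <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  # integer division puts lines 0-99 in chunk 0, 100-199 in chunk 1, and so on
  count(index = linenumber %/% 100, sentiment) %>%
  # one column per sentiment, 0 where a chunk has none of that kind
  spread(sentiment, n, fill = 0) %>%
  mutate(net = positive - negative)

toy_flow
```

Chunk 0 holds two positive words and one negative (net 1); chunk 1 holds one positive and two negative (net -1). Changing the divisor changes the chunk size, which is worth experimenting with on the real text.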
Julia (I trust I am not being over-familiar) has extended her analysis in a blog post on readability. If you want to read more about the technique (and you should) head off there but suffice to say it starts with the premise that useful categories include
- Number of sentences
- Number of words with three or more syllables
Let’s have a look at sentences first.
# easiest just to download again
ge <- gutenberg_download(c(1400),
                         mirror = "http://mirrors.xmission.com/gutenberg/",
                         meta_fields = "title")

tidy_ge <- ge %>%
  mutate(text = iconv(text, to = 'latin1')) %>%
  nest(-title) %>%
  mutate(tidied = map(data, unnest_tokens, 'sentence', 'text', token = 'sentences'))

tidy_ge
## # A tibble: 1 x 3
##   title              data                  tidied
##   <chr>              <list>                <list>
## 1 Great Expectations <tibble [20,024 x 2]> <tibble [10,637 x 2]>
# we are only interested in the tidied column, which should be in sentences. Let's check
tidy_ge <- tidy_ge %>%
  unnest(tidied)

tidy_ge %>%
  sample_n(5) %>%
  select(sentence)
## # A tibble: 5 x 1
##   sentence
##   <chr>
## 1 the mist was heavier yet when i got out upon the marshes, so that instead o
## 2 but it was very pleasant to see the pride with which he hoisted it up and m
## 3 his blue bag was slung over his shoulder, honest industry beamed in his eye
## 4 miss havisham was taking exercise in the room with the long spread table, l
## 5 politely omitting young fellow.
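# Mine look good
# What is the distribution like?
# note: unique() removes duplicate rows, so a word repeated within a
# sentence is counted once and length is the number of distinct words
sentences_ge <- tidy_ge %>%
  unnest_tokens(word, sentence, drop = FALSE) %>%
  unique() %>%
  group_by(sentence) %>%
  summarize(length = n())

summary(sentences_ge)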
##    sentence             length
##  Length:10033       Min.   :  1.00
##  Class :character   1st Qu.:  6.00
##  Mode  :character   Median : 13.00
##                     Mean   : 15.97
##                     3rd Qu.: 23.00
##                     Max.   :105.00
sentences_ge %>% plot_ly(x=~length)
# and the longest
sentences_ge %>%
  arrange(desc(length)) %>%
  head(1) %>%
  .$sentence
##  "again among the tiers of shipping, in and out, avoiding rusty chain-cables frayed hempen hawsers and bobbing buoys, sinking for the moment floating broken baskets, scattering floating chips of wood and shaving, cleaving floating scum of coal, in and out, under the figure-head of the john of sunderland making a speech to the winds (as is done by many johns), and the betsy of yarmouth with a firm formality of bosom and her knobby eyes starting two inches out of her head; in and out, hammers going in ship-builders' yards, saws going at timber, clashing engines going at things unknown, pumps going in leaky ships, capstans going, ships going out to sea, and unintelligible sea-creatures roaring curses over the bulwarks at respondent lightermen, in and out,--out at last upon the clearer river, where the ships' boys might take their fenders in, no longer fishing in troubled waters with them over the side, and where the festooned sails might fly out to the wind."
The longest sentence is a reference to the River Thames when they are trying to effect Magwitch’s escape
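The two-step tokenization (text to sentences, then sentences to words) can be tried on a short made-up passage. This is a sketch under stated assumptions: the `toy` text is invented, and unlike the chunk above it counts every word in a sentence rather than de-duplicating first.

```r
library(dplyr)
library(tidytext)

# Hypothetical three-sentence passage
toy <- tibble(text = "It was a dark night. Joe smiled. Pip ran across the marshes towards home.")

# step 1: one row per (lower-cased) sentence
toy_sentences <- toy %>%
  unnest_tokens(sentence, text, token = "sentences")

# step 2: one row per word, keeping the sentence alongside, then count words per sentence
toy_lengths <- toy_sentences %>%
  unnest_tokens(word, sentence, drop = FALSE) %>%
  count(sentence, name = "length")

toy_lengths
```

Because the count is grouped on the sentence text itself, two identical sentences would be pooled into one row; that is one reason the number of rows in a summary like this can be lower than the number of sentences tokenized.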
Now let’s look at syllables. The function is a long one so I used ‘echo = FALSE’ in the code chunk
This is the code which takes an age to run but is available in case you want to render it yourself
# check that function is working correctly
txt <- "at a time and place of our own choosing. Some of it may be explicit and publicized; some of it may not be"
count_syllables(txt)
##  28
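Since the real `count_syllables()` is hidden, here is a deliberately crude stand-in to show the general idea of rule-based syllable counting. To be clear, `rough_syllables()` is a hypothetical helper written for illustration, not the function used in this post: it simply counts runs of consecutive vowels, which is a much weaker rule set than the one in the readability post.

```r
# Hypothetical helper, NOT the post's count_syllables():
# counts runs of consecutive vowels (a, e, i, o, u, y) in each word
rough_syllables <- function(words) {
  words <- tolower(words)
  vowel_runs <- regmatches(words, gregexpr("[aeiouy]+", words))
  # every word gets at least one syllable, even with no vowels matched
  pmax(1L, lengths(vowel_runs))
}

rough_syllables(c("pip", "joe", "expectations"))
rough_syllables("time")
```

The first call gives 1, 1 and 4, which is right, but "time" comes out as 2 because the silent e is counted. Handling cases like that is exactly why the full function is so long.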
# tidy_ge <- tidy_ge %>%
#   unnest_tokens(word, sentence, drop = FALSE) %>%
#   rowwise() %>%
#   mutate(n_syllables = count_syllables(word)) %>%
#   ungroup()
# write_feather(tidy_ge, "data/tidy_ge.feather")

# loading precompiled file
tidy_ge <- read_feather("data/tidy_ge.feather")

# plot the distribution
tidy_ge %>%
  plot_ly(x = ~n_syllables)
##  1.329031
tidy_ge %>% filter(n_syllables>6) %>% select(word) %>% count(word, sort=TRUE)
## # A tibble: 7 x 2
##   word                    n
##   <chr>               <int>
## 1 apologetically          4
## 2 inaccessibility         2
## 3 unceremoniously         2
## 4 architectooralooral     1
## 5 incompatibility         1
## 6 irreconcilability       1
## 7 unsympathetically       1
The word with the most syllables (8) is ‘irreconcilability’, whilst ‘architectooralooral’ is predictably from the mouth of Joe when he is up in London-town trying his best to meet Pip’s high-and-mighty standards.
We can now gauge a readability level for the book
left_join(tidy_ge %>%
            group_by(title) %>%
            summarise(n_sentences = n_distinct(sentence)),
          tidy_ge %>%
            group_by(title) %>%
            filter(n_syllables >= 3) %>%
            summarise(n_polysyllables = n())) %>%
  mutate(SMOG = 1.0430 * sqrt(30 * n_polysyllables/n_sentences) + 3.1291)
## Joining, by = "title"
## # A tibble: 1 x 4
##   title              n_sentences n_polysyllables     SMOG
##   <chr>                    <int>           <int>    <dbl>
## 1 Great Expectations       10033           12997 9.631162
The SMOG (“Simple Measure of Gobbledygook”) value of 9.6 indicates that, for an average reader, around the middle of Grade 9 would be an appropriate starting point. I would imagine that the action could be followed by someone younger, but some of the themes, such as pride, love (both unrequited and unknown) and ambition, make it an interesting read into maturity.
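The arithmetic is simple enough to check by hand. A minimal sketch, wrapping the same constants used in the code above in a small function (the name `smog_grade` is my own, purely for illustration):

```r
# SMOG grade from a polysyllable count and a sentence count.
# 30 * n_polysyllables / n_sentences rescales the count to a 30-sentence sample.
smog_grade <- function(n_polysyllables, n_sentences) {
  1.0430 * sqrt(30 * n_polysyllables / n_sentences) + 3.1291
}

# the counts reported above for Great Expectations
smog_grade(12997, 10033)
```

With 12,997 polysyllabic words across 10,033 sentences this reproduces the 9.63 in the table, and it makes the sensitivity clear: the score moves with the square root of polysyllables per sentence, so it changes quite slowly.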