# Textual Analysis

## Great Expectations

Text mining in R got quite a boost in 2016. David Robinson’s fascinating analysis of Donald Trump’s real and ‘official’ tweets got a lot of publicity (something the president-elect was probably all too happy with), and his collaboration with Julia Silge resulted in Tidy Text Mining with R, one of the best books yet published using the bookdown package.

Dr Silge also released a couple of R packages:

  • tidytext - useful for tidying text for subsequent analyses
  • janeaustenr - a dataset of Jane Austen’s novels

I’m not completely sold on the value of textual analysis for works of fiction, at least at its current stage of development, though I’m prepared to be convinced otherwise. To me, it is the equivalent of perusing the list of ingredients on a packaged good in order to assess its taste. When I want to know whether to read a novel, I’m interested in themes, settings, characters, quality of writing, etc., which I doubt this approach can provide.

Nevertheless, it is now a lot easier (and more fun) to process novels - at least those in the public domain and on Project Gutenberg - thanks again to David Robinson and his gutenbergr package.

An interesting comparison to Jane Austen is Charles Dickens. His books are more wide-ranging than Austen’s and have many memorable characters mixed in with social commentary on Victorian England.


First, we load the libraries and see what titles are available.

# load libraries
library(tidyverse)
library(tidytext)
library(gutenbergr)
library(plotly)
library(stringr)
library(feather)
library(wordcloud2)

dickens <- gutenberg_works(author == "Dickens, Charles")
glimpse(dickens)
## Observations: 74
## Variables: 8
## $ gutenberg_id        <int> 46, 98, 564, 580, 588, 644, 650, 653, 675,...
## $ title               <chr> "A Christmas Carol in Prose; Being a Ghost...
## $ author              <chr> "Dickens, Charles", "Dickens, Charles", "D...
## $ gutenberg_author_id <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37...
## $ language            <chr> "en", "en", "en", "en", "en", "en", "en", ...
## $ gutenberg_bookshelf <chr> "Christmas/Children's Literature", "Histor...
## $ rights              <chr> "Public domain in the USA.", "Public domai...
## $ has_text            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
unique(dickens$gutenberg_bookshelf)
##  [1] "Christmas/Children's Literature"  
##  [2] "Historical Fiction"               
##  [3] "Mystery Fiction"                  
##  [4] "Best Books Ever Listings"         
##  [5] NA                                 
##  [6] "Christmas"                        
##  [7] "Children's Literature"            
##  [8] "Children's History/United Kingdom"
##  [9] "Harvard Classics"                 
## [10] "Detective Fiction"                
## [11] "Children's Picture Books"

So, extremely prolific and wide-ranging. I will probably want to limit this analysis to his novels and will start with one of his most highly-regarded, Great Expectations.

I probably read the book as a child, but I definitely remember a BBC series and the excellent 1946 film version (not on first release), which differs somewhat from the novel.

We can download its text via the gutenberg_id, which takes barely a second, and then ~~plagiarise~~ follow the Tidy Text Mining book’s code to get it into a ‘tidy’ format.
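Incidentally, if you need to confirm the id, it can be looked up in the dickens works table we already have; a quick sketch (the str_detect filter is just one way to do it):

# look up the id for Great Expectations among Dickens' works
dickens %>% 
  filter(str_detect(title, "Great Expectations")) %>% 
  select(gutenberg_id, title)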

# Needed to add a mirror when the main site went down (an issue raised by someone else)
expectations <- gutenberg_download(1400, mirror = "http://mirrors.xmission.com/gutenberg/")

glimpse(expectations)
## Observations: 20,024
## Variables: 2
## $ gutenberg_id <int> 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1...
## $ text         <chr> "GREAT EXPECTATIONS", "", "[1867 Edition]", "", "...
tidy_expectations <- expectations %>% 
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  unnest_tokens(word, text) %>%  # c. 186,000 words
  # remove the most common words (the stop_words list has 1,149 of them)
  anti_join(stop_words)          # c. 55,000 words remain
tidy_expectations
## # A tibble: 55,583 x 4
##    gutenberg_id linenumber chapter          word
##           <int>      <int>   <int>         <chr>
##  1         1400       3056      10        tramps
##  2         1400       6743      20      lounging
##  3         1400       4086      13 strengthening
##  4         1400      19276      57 strengthening
##  5         1400       3860      12         daily
##  6         1400       4362      14         daily
##  7         1400      17060      51         daily
##  8         1400      18781      56         daily
##  9         1400      18906      56         daily
## 10         1400      19713      58         daily
## # ... with 55,573 more rows

We now have a tidy data frame with one row per word, indexed by linenumber and chapter.
Interestingly, the word ‘expectations’ does not appear until Chapter 18, when the lawyer Jaggers informs Joe Gargery and Pip that the latter ‘will come into a handsome property’.
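A quick way to check this from the tidy data frame (a small sketch using the tidy_expectations object built above):

# first chapter in which the word 'expectations' occurs
tidy_expectations %>% 
  filter(word == "expectations") %>% 
  summarise(first_chapter = min(chapter), occurrences = n())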


We can now visualize the most common words in a couple of ways. Hover over the plots for exact numbers.

word_count <- tidy_expectations %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) 

word_count %>%
  head(10) %>% 
  plot_ly(x = ~n, y = ~word) %>% 
  layout(title = "Most common words (excluding stop-words) in Great Expectations",
         xaxis = list(title = "Total Occurrences"),
         yaxis = list(title = "")) %>%
  config(displayModeBar = FALSE, showLink = FALSE)

word_count %>%
  head(100) %>% 
  wordcloud2()

As is often the case in novels, character names predominate, but it is interesting that Joe is so far in the lead. ‘expectations’ ranks in the low 200s and ‘great’ is a stop word.
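Both of those observations are easy to confirm from the objects already created; a quick sketch (word_count is already sorted by frequency, so row_number() gives the rank):

# where does 'expectations' rank among the non-stop-words?
word_count %>% 
  mutate(rank = row_number()) %>% 
  filter(word == "expectations")

# 'great' sits in the stop-word list, so the anti_join removed it
"great" %in% stop_words$word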

Let’s have a look at the occurrence of ‘Joe’ throughout the story.

tidy_expectations %>%
  filter(word == "joe") %>% 
  group_by(chapter) %>% 
  count(word) %>% 
  plot_ly(x = ~chapter, y = ~n) %>% 
  add_bars(color = I("blue"), alpha = 0.5) %>% 
  layout(title = "Occurrences of word 'Joe' by Chapter",
         xaxis = list(title = "Chapter"),
         yaxis = list(title = "Occurrences")) %>%
  config(displayModeBar = FALSE, showLink = FALSE)

As you may recall, or can read here, Joe is Pip’s brother-in-law and surrogate father, and a strong, positive influence on Pip as a boy. Chapter 27 is when Joe visits a mortified Pip in London, which brings out the worst in our ‘hero’, and Chapter 57 is when Joe comforts Pip in his illness, by which time Pip realizes how badly he has treated a true friend.
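If you want the exact numbers behind those two bars rather than hovering, a quick filter on the tidy data gives them (a small sketch):

# occurrences of 'joe' in the two chapters discussed above
tidy_expectations %>% 
  filter(word == "joe", chapter %in% c(27, 57)) %>% 
  count(chapter)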


## Sentiment Analysis

We can use the tools of text mining to approach the emotional content of text programmatically. The tidytext package provides three sentiment lexicons for evaluating opinion or emotion in text. Here I will replicate some of the code in the book, with the occasional tangent.

# let's look at how one of the lexicons classifies words 

nrc <- get_sentiments("nrc")
unique(nrc$sentiment)
##  [1] "trust"        "fear"         "negative"     "sadness"     
##  [5] "anger"        "surprise"     "positive"     "disgust"     
##  [9] "joy"          "anticipation"
get_sentiments("nrc") %>% 
  filter(sentiment == "positive")  %>%
config(displayModeBar = F, showLink = F)
## # A tibble: 2,312 x 2
##              word sentiment
##             <chr>     <chr>
##  1           abba  positive
##  2        ability  positive
##  3 abovementioned  positive
##  4       absolute  positive
##  5     absolution  positive
##  6       absorbed  positive
##  7      abundance  positive
##  8       abundant  positive
##  9       academic  positive
## 10        academy  positive
## # ... with 2,302 more rows

Good to see that ‘academic’ is positive! However, I will leave positive and negative out at this stage.

Let’s look at the other emotions as a percentage of all words in each chapter.

# first, the total number of words in each chapter
words_chapter <- tidy_expectations %>%
  group_by(chapter) %>% 
  count() %>% 
  rename(total = n)

# sentiments to exclude
chuck <- c("negative", "positive")

tidy_expectations %>%
  inner_join(nrc) %>% 
  filter(!sentiment %in% chuck) %>% 
  group_by(sentiment, chapter) %>% 
  count() %>% 
  inner_join(words_chapter) %>% 
  mutate(pc = round(100 * n / total, 1)) %>% 
  filter(chapter != 0) %>% 
  plot_ly(x = ~chapter, y = ~pc, color = ~sentiment) %>% 
  add_bars() %>%
  layout(barmode = 'stack',
         title = "Great Expectations - % of each chapter made up of words of varying emotions",
         xaxis = list(title = "Chapter"),
         yaxis = list(title = "Percentage"))

Stacked bar charts are not often the best method of visualization, but just toggle the legend to remove/add emotions. For instance, the fear factor peaks in the chapter when Pip has just attempted to rescue Miss Havisham from the fire and he determines that Estella is Magwitch’s daughter.
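To pull that peak out without hovering, the same pipeline can be re-run for a single emotion; a quick sketch reusing the nrc and words_chapter objects created above:

# chapters with the highest proportion of 'fear' words
tidy_expectations %>%
  inner_join(nrc) %>% 
  filter(sentiment == "fear", chapter != 0) %>% 
  count(chapter) %>% 
  inner_join(words_chapter) %>% 
  mutate(pc = round(100 * n / total, 1)) %>% 
  arrange(desc(pc)) %>% 
  head(3)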


Another use of sentiment analysis is to examine the flow of the narrative by breaking the word count into equal-sized chunks, this time using the Bing lexicon, which simply splits words into a binary positive/negative. The Bing lexicon contains more negative than positive words, so some wariness is needed when looking at a single novel in isolation; the trajectory over the course of the book, and comparisons with other novels, are more robust.

tidy_expectations %>%
  inner_join(get_sentiments("bing")) %>%
  count( index = linenumber %/% 100, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>% 
  plot_ly(x=~index,y=~sentiment) %>% 
  add_bars()

Even with the caveat above, this is a bit of a downer, especially given that apparently (as referenced on Wikipedia) G.K. Chesterton admired the novel’s optimism.

Here, for comparison, is the Tidy Text Mining book’s equivalent chart for the Jane Austen novels.

I guess Dickens’s novel is a little grittier than life in upper-middle-class country homes.


## Readability

Julia (I trust I am not being over-familiar) has extended her analysis in a blog post on readability. If you want to read more about the technique (and you should), head over there, but suffice to say it starts from the premise that the useful measures include the following (which feed straight into the SMOG formula shown after the list):

  • Number of sentences
  • Number of words with three or more syllables
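For reference, these two counts are all that the SMOG grade formula applied at the end of this section needs:

$$\text{SMOG grade} = 1.0430\sqrt{30 \times \frac{n_{\text{polysyllables}}}{n_{\text{sentences}}}} + 3.1291$$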

Let’s have a look at sentences first.

# easiest just to download again
ge <- gutenberg_download(1400, mirror = "http://mirrors.xmission.com/gutenberg/",
                         meta_fields = "title")

tidy_ge <- ge %>%
  mutate(text = iconv(text, to = 'latin1')) %>%
  nest(-title) %>% 
  mutate(tidied = map(data, unnest_tokens, 'sentence', 'text', token = 'sentences'))

tidy_ge
## # A tibble: 1 x 3
##                title                  data                tidied
##                <chr>                <list>                <list>
## 1 Great Expectations <tibble [20,024 x 2]> <tibble [10,637 x 2]>
# we are only interested in the tidied column, which should now hold sentences. Let's check
  
tidy_ge <- tidy_ge %>% 
  unnest(tidied)

tidy_ge %>% 
  sample_n(5) %>% 
  select(sentence)
## # A tibble: 5 x 1
##                                                                      sentence
##                                                                         <chr>
## 1 the mist was heavier yet when i got out upon the marshes, so that instead o
## 2 but it was very pleasant to see the pride with which he hoisted it up and m
## 3 his blue bag was slung over his shoulder, honest industry beamed in his eye
## 4 miss havisham was taking exercise in the room with the long spread table, l
## 5                                             politely omitting young fellow.
# Mine look good
 
# What is the distribution of sentence lengths like?
sentences_ge <- tidy_ge %>%
  unnest_tokens(word, sentence, drop = FALSE) %>%  
  unique() %>% 
  group_by(sentence) %>% 
  summarize(length = n())   # length = number of distinct words in each sentence
   
summary(sentences_ge)
##    sentence             length      
##  Length:10033       Min.   :  1.00  
##  Class :character   1st Qu.:  6.00  
##  Mode  :character   Median : 13.00  
##                     Mean   : 15.97  
##                     3rd Qu.: 23.00  
##                     Max.   :105.00
sentences_ge %>% 
  plot_ly(x = ~length) 

# and the longest?
sentences_ge %>% 
  arrange(desc(length)) %>% 
  head(1) %>% 
  .$sentence
## [1] "again among the tiers of shipping, in and out, avoiding rusty chain-cables frayed hempen hawsers and bobbing buoys, sinking for the moment floating broken baskets, scattering floating chips of wood and shaving, cleaving floating scum of coal, in and out, under the figure-head of the john of sunderland making a speech to the winds (as is done by many johns), and the betsy of yarmouth with a firm formality of bosom and her knobby eyes starting two inches out of her head; in and out, hammers going in ship-builders' yards, saws going at timber, clashing engines going at things unknown, pumps going in leaky ships, capstans going, ships going out to sea, and unintelligible sea-creatures roaring curses over the bulwarks at respondent lightermen, in and out,--out at last upon the clearer river, where the ships' boys might take their fenders in, no longer fishing in troubled waters with them over the side, and where the festooned sails might fly out to the wind."

The longest sentence is a reference to the River Thames, when they are trying to effect Magwitch’s escape.


Now let’s look at syllables. The count_syllables() function is a long one, so I used ‘echo = FALSE’ in the code chunk.
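Since the real function is hidden, here is a minimal sketch of the kind of heuristic such a function relies on. count_syllables_sketch is a made-up name for illustration only; it counts vowel groups with a crude silent-e rule and will not match the real count_syllables() exactly.

# Rough illustration only: lower-case the text, split into words, drop a
# trailing silent 'e', then count groups of consecutive vowels as syllables
# (with a minimum of one per word)
count_syllables_sketch <- function(text) {
  words <- unlist(str_split(tolower(text), "[^a-z']+"))
  words <- words[words != ""]
  words <- sub("e$", "", words)             # crude silent-e rule
  groups <- str_count(words, "[aeiouy]+")   # vowel groups roughly track syllables
  sum(pmax(groups, 1))                      # every word counts for at least one
}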

The actual syllable-counting step takes an age to run, so it is shown commented out below, with the results saved to (and reloaded from) a feather file, in case you want to render it yourself.

# check that the count_syllables function is working correctly
txt <- "at a time and place of our own choosing. Some of it may be explicit and publicized; some of it may not be"

count_syllables(txt)
## [1] 28
# tidy_ge <- tidy_ge %>%
#     unnest_tokens(word, sentence, drop = FALSE) %>%
#     rowwise() %>%
#     mutate(n_syllables = count_syllables(word)) %>%
#     ungroup()
 
#   write_feather(tidy_ge,"data/tidy_ge.feather")

# loading precompiled file
tidy_ge <- read_feather("data/tidy_ge.feather")
  
# plot the distribution of syllables per word
tidy_ge %>% 
  plot_ly(x = ~n_syllables)

mean(tidy_ge$n_syllables, na.rm = TRUE)
## [1] 1.329031
tidy_ge %>% 
  filter(n_syllables > 6) %>% 
  select(word) %>% 
  count(word, sort = TRUE)
## # A tibble: 7 x 2
##                  word     n
##                 <chr> <int>
## 1      apologetically     4
## 2     inaccessibility     2
## 3     unceremoniously     2
## 4 architectooralooral     1
## 5     incompatibility     1
## 6   irreconcilability     1
## 7   unsympathetically     1

The word with the most syllables (8) is ‘irreconcilability’, whilst ‘architectooralooral’ is predictably from the mouth of Joe when he is up in London-town, trying his best to meet Pip’s high-and-mighty standards.
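To see the estimated syllable counts themselves, rather than just the filtered word list, a small variation on the query above does it (a quick sketch using the same tidy_ge columns):

# longest words by estimated syllable count
tidy_ge %>% 
  filter(n_syllables > 6) %>% 
  distinct(word, n_syllables) %>% 
  arrange(desc(n_syllables))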


We can now gauge a readability level for the book.

left_join(tidy_ge %>%
            group_by(title) %>%
            summarise(n_sentences = n_distinct(sentence)),
          tidy_ge %>% 
            group_by(title) %>% 
            filter(n_syllables >= 3) %>% 
            summarise(n_polysyllables = n())) %>%
  mutate(SMOG = 1.0430 * sqrt(30 * n_polysyllables / n_sentences) + 3.1291)
## Joining, by = "title"
## # A tibble: 1 x 4
##                title n_sentences n_polysyllables     SMOG
##                <chr>       <int>           <int>    <dbl>
## 1 Great Expectations       10033           12997 9.631162

The SMOG (“Simple Measure of Gobbledygook”) value of 9.6 suggests that an average reader would need roughly a mid-Grade 9 reading level to tackle the book comfortably. I would imagine that the action could be followed by someone younger, but some of the themes, such as pride, love (both unrequited and unknown) and ambition, make it an interesting read well into maturity.
