From being an avid daytime moviehouse attendee, I have retreated to the occasional Netflix viewing at home. Although the ‘Because you watched’ algorithm works fairly well, the Matching on Netflix does not work for me, so I’d quite like a rating from a reputable source, perhaps along with a review.
As I’m a Guardian reader, a recent post by Maelle Salmon on extracting data from their Experience columns gave me a nudge to do something, and I have used her code as a template for this work. So my thanks to her.
Here is an extract from a typical review summary page.
This gives me some of what I want, but extracting the rating from the page proved problematic, and it would also be nice to have the reviewer’s name and the date.
So this turns into a two-phase process: garner the links from the summary pages, then call up each linked page for most of the data.
In addition to the usual suspects, I’ll use the rvest package to scrape and the robotstxt package to ensure that scraping is acceptable.
library(tidyverse)
library(plotly)
library(rvest)
library(stringr)
library(robotstxt)
Firstly, the review links. If you read Maelle Salmon’s post you will see that, to adhere to scraping etiquette, you can check whether the path you want is allowed:
robotstxt::paths_allowed("https://www.theguardian.com/film+tone/reviews")
## [1] TRUE
Phew!
You should also set a crawl delay; the Guardian specifies one second. As this example only scrapes a couple of pages of reviews I will dispense with that, though of course on a larger scale one should. That’s what nights are for, after all.
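As an aside, the robotstxt object itself reports any Crawl-delay directives a site declares, and a small wrapper makes the delay hard to forget. This is just a sketch; the `polite_read` helper name is mine, not part of any package:

```r
# rtxt <- robotstxt::robotstxt("www.theguardian.com")
# rtxt$crawl_delay   # shows any Crawl-delay directives declared by the site

# minimal polite reader: pause before each request
polite_read <- function(url, delay = 1) {
  Sys.sleep(delay)      # honour the crawl delay
  xml2::read_html(url)  # the reader that rvest re-exports
}
```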
There are over 800 summary pages going back more than a decade. I’ll just process the first two for this demonstration. The tricky part is often determining the correct nodes for your needs: either use SelectorGadget, inspect via developer tools, or ‘View Source’.
# function to obtain urls of individual reviews from a summary page
xtract_links <- function(node) {
  css <- '.fc-item__title a'
  html_nodes(node, css) %>% html_attr('href')
}

# function to obtain page content from summary pages
get_links_from_page <- function(page_number) {
  # Sys.sleep(1)
  link <- paste0("https://www.theguardian.com/film+tone/reviews?page=", page_number)
  page_content <- read_html(link)
  xtract_links(page_content)
}
# use purrr to obtain a vector of review urls
review_links <- map(1:2, get_links_from_page) %>% unlist()
review_links[1:3]
## [1] "https://www.theguardian.com/film/2017/nov/10/fireworks-review-anime"
## [2] "https://www.theguardian.com/film/2017/nov/10/a-caribbean-dream-review-shakespeare-midsummer-nights-dream-adaptation"
## [3] "https://www.theguardian.com/film/video/2017/nov/09/watch-the-trailer-for-the-florida-project-video"
So what is returned is a character vector, length 40, of links to the most recent reviews. These are generally reviews of an individual film but may be a more wide-ranging article.
OK, let’s get the details for these 40. The broad approach of the previous code chunk is replicated. It takes a few seconds, even without inserting the courtesy one-second delay.
xtract_info <- function(review_content) {
  title <- review_content %>%
    html_node(xpath = '//meta[@property="og:title"]') %>%
    html_attr('content')
  description <- review_content %>%
    html_node(xpath = '//meta[@property="og:description"]') %>%
    html_attr('content')
  author <- review_content %>%
    html_node(xpath = '//meta[@name="author"]') %>%
    html_attr('content')
  date <- review_content %>%
    html_node(xpath = '//meta[@property="article:published_time"]') %>%
    html_attr('content')
  categories <- review_content %>%
    html_node(xpath = '//meta[@property="article:tag"]') %>%
    html_attr('content')
  rating <- review_content %>%
    html_node(xpath = '//span[@itemprop="ratingValue"]') %>%
    html_text()
  data.frame(title = title, description = description, author = author,
             date = date, categories = categories, rating = rating)
}

get_info_from_review <- function(url) {
  # Sys.sleep(1)
  review_content <- read_html(url)
  xtract_info(review_content)
}
review_info <- purrr::map_df(review_links, get_info_from_review)
## tabulate data for paging/searching/ordering
review_info %>%
  DT::datatable(class = 'compact stripe hover row-border order-column',
                rownames = FALSE,
                options = list(paging = TRUE, searching = TRUE, info = FALSE))
OK, that seems to work fine, but there is some tidying up to do. Let’s just glimpse the structure:
str(review_info)
## 'data.frame': 40 obs. of 6 variables:
## $ title : chr "Fireworks review anime romance sparkles with strangeness" "A Caribbean Dream review Shakespeare goes to carnival" "Watch the trailer for The Florida Project video" "The Florida Project review a wondrous child's-eye view of life on the margins" ...
## $ description: chr "This disorientating teen tale think Japans answer to Sliding Doors follows the divided destinies of three "| __truncated__ "Shakirah Bournes tender-hearted adaptation of A Midsummer Nights Dream is a refreshingly low-key palate clean"| __truncated__ "The Florida Project is the latest film from director Sean Baker, written by Baker and Chris Bergoch, starring W"| __truncated__ "A young cast give brilliantly naturalistic performances in this glorious story about a bunch of deprived kids "| __truncated__ ...
## $ author : chr "Peter Bradshaw" "Cath Clarke" NA "Peter Bradshaw" ...
## $ date : chr "2017-11-10T09:00:14.000Z" "2017-11-10T06:00:11.000Z" "2017-11-09T18:47:57.000Z" "2017-11-09T15:40:43.000Z" ...
## $ categories : chr "Animation,World cinema,Film,Culture,Anime,Japan,Teenage" "Film adaptations,Film,Culture,William Shakespeare,Caribbean,World cinema" "Film" "The Florida Project,Culture,Film,Drama,Willem Dafoe" ...
## $ rating : chr "3" "3" NA "5" ...
There are a few things to address:
- title is really a title plus tagline
- date needs to be converted to a Date field; the time of day is irrelevant
- categories: as a list-column we could extract genres more easily
- rating would be better as an integer for future processing
- add the url for linking to the Guardian web page
df_mini <- review_info %>%
  separate(col = title, into = c("title", "tagline"), sep = "review", extra = "merge") %>%
  mutate(tagline = str_sub(tagline, 3)) %>% # removes unnecessary hyphen
  mutate(date = as.Date(date)) %>%
  mutate(rating = as.integer(rating)) %>%
  mutate(categories = str_split(categories, ","))

df_mini <- cbind(df_mini, link = review_links)
glimpse(df_mini)
## Observations: 40
## Variables: 8
## $ title <chr> "Fireworks ", "A Caribbean Dream ", "Watch the tra...
## $ tagline <chr> " anime romance sparkles with strangeness", " Shak...
## $ description <chr> "This disorientating teen tale think Japans ans...
## $ author <chr> "Peter Bradshaw", "Cath Clarke", NA, "Peter Bradsh...
## $ date <date> 2017-11-10, 2017-11-10, 2017-11-09, 2017-11-09, 2...
## $ categories <list> [<"Animation", "World cinema", "Film", "Culture",...
## $ rating <int> 3, 3, NA, 5, 3, 3, 4, 3, 4, 4, 4, 5, 3, 2, 3, 2, 3...
## $ link <fctr> https://www.theguardian.com/film/2017/nov/10/fire...
# the data cannot easily be saved in csv or feather format when there are list-columns
saveRDS(df_mini,"data/movieReviewsMini.rds")
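As a quick illustration of why the list-column is handy, here is a sketch of filtering by genre; the toy tibble merely stands in for df_mini:

```r
library(dplyr)
library(purrr)

# toy stand-in for df_mini and its categories list-column
toy <- tibble::tibble(
  title      = c("A Caribbean Dream", "Fireworks"),
  categories = list(c("Film adaptations", "Drama"), c("Animation", "Anime"))
)

# keep rows whose categories contain a given genre
drama <- toy %>% filter(map_lgl(categories, ~ "Drama" %in% .x))
```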
Meaningful Data
The above code indicates the process but is obviously pretty limited in terms of helping choose a movie on Netflix.
I have, however, collated 8,000 reviews in total, covering the latest 6+ years.
Review Links
With this number of reviews, the data is fairly large to show in a datatable, so I have cut it down to the bare minimum of name, description and rating. It may still take a while to render. If you want to play around with the data, an alternative to datatables might be considered; alternatively, the dataset could be reduced in size by filtering on date, rating, genres of interest, etc.
The ‘name’ field is a conflation of the title and link; clicking on it will open the review page in a new browser tab.
The data is ordered by descending date and is searchable.
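For instance, a pre-filter on rating and date might look like this; again the toy tibble merely stands in for the full dataset:

```r
library(dplyr)

# toy stand-in for the full review data
toy <- tibble::tibble(
  title  = c("Older so-so film", "Recent gem"),
  rating = c(2L, 5L),
  date   = as.Date(c("2013-01-01", "2017-11-01"))
)

# keep recent, highly rated reviews before rendering the table
recent_good <- toy %>% filter(rating >= 4, date >= as.Date("2016-01-01"))
```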
df <- readRDS("data/movieReviews.rds")

df %>%
  mutate(name = paste0("<a href='", link, "' target='_blank'>", title, "</a>")) %>%
  select(name, description, rating) %>%
  DT::datatable(class = 'compact stripe hover row-border order-column',
                rownames = FALSE, escape = FALSE,
                options = list(paging = TRUE, searching = TRUE, info = FALSE))