Guardian Film Reviews

From being an avid daytime moviehouse attendee, I have retreated to the occasional Netflix viewing at home. Although the ‘Because you watched’ algorithm works fairly well, the matching on Netflix does not always work for me, so I’d like to get a rating from a reputable source, perhaps along with a review.

As I’m a Guardian reader, a recent post by Maëlle Salmon on extracting data from their Experience columns gave me a nudge to do something, and I have used her code as a template for my own work. So my thanks to her.


Here is an extract from a typical reviews summary page

This seems to give me some of what I want, but extracting the rating from the page proved problematic, and it would also be nice to have the reviewer’s name and the date.
So this turns into a two-phase process: garner the links from the summary pages, and then call those pages up for most of the data.


In addition to the usual suspects, I’ll use the rvest package to scrape and the robotstxt package to ensure that scraping is acceptable.



library(tidyverse)
library(plotly)
library(rvest)      # web scraping
library(stringr)    # string handling
library(robotstxt)  # check that scraping is permitted

Firstly, the review links. If you read Maëlle Salmon’s post you will see that, to adhere to scraping etiquette, you can check whether the path you want is allowed


robotstxt::paths_allowed("https://www.theguardian.com/film+tone/reviews")
## [1] TRUE

Phew!

You should also set a crawl delay. The Guardian specifies one second. As this example only scrapes a couple of pages of reviews, I will dispense with that - though of course on a larger scale one should. That’s what nights are for, after all.
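
For reference, the requested delay can be looked up directly. This is a minimal sketch, assuming the parsed robots.txt object returned by robotstxt() exposes a crawl_delay element in your version of the package:

# check the crawl delay the site asks for (in seconds)
rt <- robotstxt::robotstxt(domain = "www.theguardian.com")
rt$crawl_delay
# in a bigger crawl, a Sys.sleep(1) between requests would honour it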

There are over 800 summary pages going back more than a decade. I’ll just process the first two for this demonstration. The tricky part is often determining the correct nodes for your needs. Either use SelectorGadget, inspect via developer tools, or ‘View Source’.


# function to obtain urls of individual reviews from a summary page
xtract_links <- function(node) {
  css <- '.fc-item__title a'
  html_nodes(node, css) %>% html_attr('href')
}

# function to obtain page content from summary pages
get_links_from_page <- function(page_number) {
  # Sys.sleep(1)
  link <- paste0("https://www.theguardian.com/film+tone/reviews?page=", page_number)
  page_content <- read_html(link)
  xtract_links(page_content)
}

# use purrr to obtain a character vector of review urls
review_links <- map(1:2, get_links_from_page) %>% unlist()
  
  review_links[1:3]
## [1] "https://www.theguardian.com/film/2017/nov/10/fireworks-review-anime"                                                
## [2] "https://www.theguardian.com/film/2017/nov/10/a-caribbean-dream-review-shakespeare-midsummer-nights-dream-adaptation"
## [3] "https://www.theguardian.com/film/video/2017/nov/09/watch-the-trailer-for-the-florida-project-video"

So what is returned is a character vector, length 40, of the most recent reviews. These are generally of an individual film but might be a more wide-ranging article, or even a video trailer.
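
If you only want written reviews, a simple URL filter weeds out most of the video items. This is just a sketch, assuming such pieces can be spotted by a "/video/" segment in the URL; I keep the full set of 40 below and let any missing ratings come through as NA.

# drop items that are clearly not written reviews, e.g. video trailers
film_review_links <- review_links[!str_detect(review_links, "/video/")]
length(film_review_links)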

OK, let’s get the details for these 40. The broad approach of the previous chunk is replicated. It takes a few seconds, even without inserting the courtesy one-second delay.


# function to extract the metadata and star rating from a review page
xtract_info <- function(review_content) {

  title <- review_content %>%
    html_node(xpath = '//meta[@property="og:title"]') %>%
    html_attr('content')

  description <- review_content %>%
    html_node(xpath = '//meta[@property="og:description"]') %>%
    html_attr('content')

  author <- review_content %>%
    html_node(xpath = '//meta[@name="author"]') %>%
    html_attr('content')

  date <- review_content %>%
    html_node(xpath = '//meta[@property="article:published_time"]') %>%
    html_attr('content')

  categories <- review_content %>%
    html_node(xpath = '//meta[@property="article:tag"]') %>%
    html_attr('content')

  rating <- review_content %>%
    html_node(xpath = '//span[@itemprop="ratingValue"]') %>%
    html_text()

  data.frame(title = title, description = description, author = author,
             date = date, categories = categories, rating = rating)
}

# function to read an individual review page and extract its details
get_info_from_review <- function(url) {
  # Sys.sleep(1)
  review_content <- read_html(url)
  xtract_info(review_content)
}

review_info <- purrr::map_df(review_links, get_info_from_review)
  
## tabulate data for paging/searching/ordering
review_info %>%
  DT::datatable(class = 'compact stripe hover row-border order-column', rownames = FALSE,
                options = list(paging = TRUE, searching = TRUE, info = FALSE))
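
As noted, scraping all 40 pages in one pass only takes a few seconds, but a single failed request would abort the whole map_df call. A hedged alternative, not used in the code above, wraps the page-level function with purrr::possibly so failures are simply skipped:

# a sketch: possibly() returns NULL for any page that errors,
# and map_df quietly drops those rows
safe_get_info <- purrr::possibly(get_info_from_review, otherwise = NULL)
# review_info <- purrr::map_df(review_links, safe_get_info)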

OK. That seems to work fine, but there is some tidying up to do. Let’s take a quick look at the structure.

str(review_info)
## 'data.frame':    40 obs. of  6 variables:
##  $ title      : chr  "Fireworks review – anime romance sparkles with strangeness" "A Caribbean Dream review – Shakespeare goes to carnival" "Watch the trailer for The Florida Project – video" "The Florida Project review – a wondrous child's-eye view of life on the margins" ...
##  $ description: chr  "This disorientating teen tale – think Japan’s answer to Sliding Doors – follows the divided destinies of three "| __truncated__ "Shakirah Bourne’s tender-hearted adaptation of A Midsummer Night’s Dream is a refreshingly low-key palate clean"| __truncated__ "The Florida Project is the latest film from director Sean Baker, written by Baker and Chris Bergoch, starring W"| __truncated__ "A young cast give brilliantly naturalistic performances in this glorious story  about a bunch of deprived kids "| __truncated__ ...
##  $ author     : chr  "Peter Bradshaw" "Cath Clarke" NA "Peter Bradshaw" ...
##  $ date       : chr  "2017-11-10T09:00:14.000Z" "2017-11-10T06:00:11.000Z" "2017-11-09T18:47:57.000Z" "2017-11-09T15:40:43.000Z" ...
##  $ categories : chr  "Animation,World cinema,Film,Culture,Anime,Japan,Teenage" "Film adaptations,Film,Culture,William Shakespeare,Caribbean,World cinema" "Film" "The Florida Project,Culture,Film,Drama,Willem Dafoe" ...
##  $ rating     : chr  "3" "3" NA "5" ...

There are a few things to address:

  • title is really a title and tagline combined
  • date needs to be converted to a Date field; the time is irrelevant
  • categories: if this were a list-column we could extract genre more easily
  • rating might be better as an integer for future processing
  • add the url for linking to the Guardian web page


df_mini <- review_info %>%
  separate(col = title, into = c("title", "tagline"), sep = "review", extra = "merge") %>%
  mutate(tagline = str_sub(tagline, 3)) %>% # remove the leading space and dash
  mutate(date = as.Date(date)) %>%
  mutate(rating = as.integer(rating)) %>%
  mutate(categories = str_split(categories, ","))

df_mini <- cbind(df_mini, link = review_links)

glimpse(df_mini)
## Observations: 40
## Variables: 8
## $ title       <chr> "Fireworks ", "A Caribbean Dream ", "Watch the tra...
## $ tagline     <chr> " anime romance sparkles with strangeness", " Shak...
## $ description <chr> "This disorientating teen tale – think Japan’s ans...
## $ author      <chr> "Peter Bradshaw", "Cath Clarke", NA, "Peter Bradsh...
## $ date        <date> 2017-11-10, 2017-11-10, 2017-11-09, 2017-11-09, 2...
## $ categories  <list> [<"Animation", "World cinema", "Film", "Culture",...
## $ rating      <int> 3, 3, NA, 5, 3, 3, 4, 3, 4, 4, 4, 5, 3, 2, 3, 2, 3...
## $ link        <fctr> https://www.theguardian.com/film/2017/nov/10/fire...

# the data cannot easily be saved in csv or feather format when there are list-columns
saveRDS(df_mini,"data/movieReviewsMini.rds")
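
One pay-off of the list-column is that genres are now easy to get at. A quick sketch, where the "Drama" tag is just an example taken from the categories shown above:

# e.g. pick out the dramas and see how they were rated
df_mini %>%
  mutate(is_drama = map_lgl(categories, ~ "Drama" %in% .x)) %>%
  filter(is_drama) %>%
  select(title, author, rating)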

Meaningful Data

The above code illustrates the process but is obviously pretty limited in terms of helping me choose a movie on Netflix.

I have, however, collated 8,000 reviews in total, covering the latest six-plus years.
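
Assuming that fuller set is saved in the same way (the file name data/movieReviews.rds below is hypothetical), choosing something to watch might start like this:

# a sketch: recent, highly rated reviews from a favourite critic
reviews <- readRDS("data/movieReviews.rds")   # hypothetical file name
reviews %>%
  filter(author == "Peter Bradshaw", rating >= 4) %>%
  arrange(desc(date)) %>%
  select(title, tagline, rating, date)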