epldata Package

I have been collating data from the English Premier League since it began in 1992 and have a complete database of every players appearances in league games, details of goals scored and assists made.

This has formed the backend of both the premiersoccerstats shiny web site and the weekly blog posts on this site for the 38 rounds of thw 2017/18 season. In addition, I will shortly be publishing an interactive on-line annual based on these post, but with many additional charts and tables


I have now decided to release the raw data as an R package named epldata - at least until I’m informed I have infringed a trademark!

The package comprises of nine datasets and a couple of nominal functions. Currently, it is only on github but the intention is to release it on CRAN in due course. The data is a full 26 years worth and I plan to issue updates shortly after the end of each season

The README file on github goes into more detail and there is also a vignette that comes with th package which gives code examples of exploring and using the data including the creation of derived tables (similar to those that can be found in the code of the weekly blogs and used to speed up user ineractivity), tables, charts and animations. The vignette as of this date is attached below

Hopefully, viewing what I have produced will encourage those starting to use R or interested in producing visualizations or more extended output for the most popular world-wide sports league to utilize the package

Enjoy!

Basics

Let’s look at one of them

## Observations: 4,690
## Variables: 7
## $ player_id     <chr> "TISDALP", "TODAK", "TODDA", "TODDL", "TODOROS",...
## $ first_name    <chr> "Paul", "Kazuyuki", "Andy", "Lee", "Svetoslav", ...
## $ last_name     <chr> "Tisdale", "Toda", "Todd", "Todd", "Todorov", "T...
## $ birth_date    <dttm> 1973-01-14, 1977-12-30, 1974-09-21, 1972-03-07,...
## $ birth_city    <chr> "Valletta", "Tokyo", "Derby", "Hartlepool", "Dob...
## $ birth_country <chr> "Malta", "Japan", "England", "England", "Bulgari...
## $ position      <chr> "M", "M", "M", "D", "F", "M", "F", "F", "F", "F"...

The tables are in SQL type with a key variable for linking separate datasets. For this data, it is the unique player_id, which also appears in the player_team dataset. The data provides some basic information on each of the, approaching 5,000, players who have appeared in the League


One-Table Analysis

Let’s use the players data to obtain the percentage distribution of players born from 1990 onwards, by birth country

## # A tibble: 79 x 3
##    birth_country     n    pc
##    <chr>         <int> <dbl>
##  1 England         457 46.9 
##  2 France           53  5.44
##  3 Spain            51  5.24
##  4 Germany          33  3.39
##  5 Netherlands      33  3.39
##  6 Belgium          22  2.26
##  7 Wales            22  2.26
##  8 Brazil           18  1.85
##  9 Ireland          18  1.85
## 10 Scotland         16  1.64
## # ... with 69 more rows

Predictably, England dominates but Scotland only scrapes into the top 10


Multi-table Analyses

Much more commonly you will need to combine tables to produce interesting information

Which player has scored the most for each team?

## # A tibble: 49 x 3
##    team         name               goals
##    <chr>        <chr>              <int>
##  1 Arsenal      Thierry Henry        176
##  2 Aston Villa  Gabriel Agbonlahor    74
##  3 Barnsley     Neil Redfearn         10
##  4 Birmingham C Mikail Forssell       29
##  5 Blackburn    Alan Shearer         112
##  6 Blackpool    DJ Campbell           13
##  7 Bolton       Kevin Davies          67
##  8 Bournemouth  Joshua King           30
##  9 Bradford C   Dean Windass          12
## 10 Brighton     Glenn Murray          12
## # ... with 39 more rows

Derived tables

The above example included quite a few joins which you will probably not wish to do for every analysis For instance, you might want to have available a summary of each match played

## # A tibble: 20,252 x 4
## # Groups:   team, team_game_id [20,252]
##    team        team_game_id game_id    GF
##    <chr>              <int>   <int> <dbl>
##  1 Blackburn              1      55     0
##  2 Derby Co.              2      55     0
##  3 Coventry C             3      56     2
##  4 Chelsea                4      56     1
##  5 Everton               38      57     0
##  6 Aston Villa           39      57     0
##  7 Man. Utd.             40      58     2
##  8 Leicester C           41      58     2
##  9 Middlesbro            42      59     0
## 10 Leeds U               43      59     0
## # ... with 20,242 more rows

So we now have the goals scored by each team. The next step is to combine this table with itself to obtain the opposing team and the goals against

## # A tibble: 6 x 7
## # Groups:   team.x, team_game_id.x [3]
##   team.x     team_game_id.x game_id  GF.x team.y     team_game_id.y  GF.y
##   <chr>               <int>   <int> <dbl> <chr>               <int> <dbl>
## 1 Blackburn               1      55     0 Blackburn               1     0
## 2 Blackburn               1      55     0 Derby Co.               2     0
## 3 Derby Co.               2      55     0 Blackburn               1     0
## 4 Derby Co.               2      55     0 Derby Co.               2     0
## 5 Coventry C              3      56     2 Coventry C              3     2
## 6 Coventry C              3      56     2 Chelsea                 4     1

We have duplication and wish to remove all those where team.x= team.y. as well as tidy up column names and calculate the points accrued for each match. This takes a few seconds to run

## # A tibble: 20,252 x 7
## # Groups:   team, team_game_id [20,252]
##    team        team_game_id game_id    GF opponents      GA points
##    <chr>              <int>   <int> <dbl> <chr>       <dbl>  <dbl>
##  1 Blackburn              1      55     0 Derby Co.       0      1
##  2 Derby Co.              2      55     0 Blackburn       0      1
##  3 Coventry C             3      56     2 Chelsea         1      3
##  4 Chelsea                4      56     1 Coventry C      2      0
##  5 Everton               38      57     0 Aston Villa     0      1
##  6 Aston Villa           39      57     0 Everton         0      1
##  7 Man. Utd.             40      58     2 Leicester C     2      1
##  8 Leicester C           41      58     2 Man. Utd.       2      1
##  9 Middlesbro            42      59     0 Leeds U         0      1
## 10 Leeds U               43      59     0 Middlesbro      0      1
## # ... with 20,242 more rows

To put the results into context, we need to add the game date, arrange it sequentially and split the results into seasons. This also takes a few seconds to run

## # A tibble: 20,252 x 14
## # Groups:   season, team [526]
##    team        team_game_id game_id    GF opponents      GA points
##    <chr>              <int>   <int> <dbl> <chr>       <dbl>  <dbl>
##  1 Arsenal            20000    1313     2 Norwich C       4      0
##  2 Chelsea            20001    1314     1 Oldham          1      1
##  3 Coventry C         20002    1315     2 Middlesbro      1      3
##  4 Crystal P          20003    1316     3 Blackburn       3      1
##  5 Everton            20004    1317     1 Sheff. Wed.     1      1
##  6 Ipswich T          20005    1318     1 Aston Villa     1      1
##  7 Leeds U            20006    1319     2 Wimbledon       1      3
##  8 Sheff. Utd.        20007    1320     2 Man. Utd.       1      3
##  9 Southampton        20008    1321     0 Tottenham H     0      1
## 10 Norwich C          21000    1313     4 Arsenal         2      3
## # ... with 20,242 more rows, and 7 more variables: game_date <dttm>,
## #   crowd <int>, referee_name <chr>, year <dbl>, month <dbl>,
## #   season <chr>, year_game_order <int>

This might be a useful derived table to save as a basis for further analyses including

  1. Create standings
  2. Team Head-to-Head matchups
  3. Sequences of results and scoring

Standings

We can now create a standings data.frame for each round of matches based on points, Goal difference , and Goals For

## # A tibble: 20,252 x 8
##    season    team      round position    GF    GA    GD points
##    <chr>     <chr>     <int>    <int> <dbl> <dbl> <dbl>  <dbl>
##  1 2017/2018 Man. City    38        1   106    27    79    100
##  2 2017/2018 Man. City    37        1   105    27    78     97
##  3 2004/2005 Chelsea      38        1    72    15    57     95
##  4 2017/2018 Man. City    36        1   102    26    76     94
##  5 2004/2005 Chelsea      37        1    71    14    57     94
##  6 2017/2018 Man. City    35        1   102    26    76     93
##  7 2016/2017 Chelsea      38        1    85    33    52     93
##  8 1993/1994 Man. Utd.    42        1    80    38    42     92
##  9 2004/2005 Chelsea      36        1    68    13    55     91
## 10 1999/2000 Man. Utd.    38        1    97    45    52     91
## # ... with 20,242 more rows

It is then a simple matter to create a function to get a table for any round of any year. e.g after 20 games in 1994/1995

## # A tibble: 22 x 8
##    season    team         round position    GF    GA    GD points
##    <chr>     <chr>        <int>    <int> <dbl> <dbl> <dbl>  <dbl>
##  1 1994/1995 Blackburn       20        1    44    16    28     46
##  2 1994/1995 Man. Utd.       20        2    39    16    23     44
##  3 1994/1995 Newcastle U     20        3    39    22    17     39
##  4 1994/1995 Liverpool       20        4    36    19    17     36
##  5 1994/1995 Nottm Forest    20        5    33    20    13     36
##  6 1994/1995 Leeds U         20        6    29    25     4     33
##  7 1994/1995 Norwich C       20        7    19    17     2     30
##  8 1994/1995 Tottenham H     20        8    34    34     0     29
##  9 1994/1995 Chelsea         20        9    28    26     2     28
## 10 1994/1995 Man. City       20       10    31    34    -3     28
## # ... with 12 more rows

Obviously you can vary what is in these derived tables to suit your own requirement

For the premiersoccerstats web site, I create around thirty derived tables weekly for speedy user interaction


Output

We have previously covered basic usage of datasets within the package including combining them to produce answers to questions and creating derived tables.

We will next look at more interesting output in the form of

  • Tables
  • Charts
  • Maps

This package is particularly suited to the first two options though there is some geographic data to play around with

You will need the the data.frames created earlier so if it they are not in your environment either load a saved version or re-run the code


Tables

I tend to use the DT package, but there are other options

Head to Head

Let’s use the match_summary_full dataframe to calculate each team’s head to head record. Over and above the current data, we need to create and sum the results

This provides a sortable, searchable table


Charts

Let’s turn attention to players. Firstly I will create a data.frame for the goals and assists for a specified player

For ease of use below, I have created it as a function and provided an example player_id

## # A tibble: 6 x 3
##   player_game_order category  count
##               <int> <chr>     <dbl>
## 1                 1 tot_goals     0
## 2                 2 tot_goals     0
## 3                 3 tot_goals     1
## 4                 4 tot_goals     0
## 5                 5 tot_goals     1
## 6                 6 tot_goals     0

You can see why you might want to create a derived player table first if you want to do varied detailed analyses particularly where the raw data is only updated annually .saves time and enhances user interactivity experience

Now just choose your plotting package of choice to display the data. I will use plotly as this allows for ease of info-activity including feature such as panning/zooming, hover, tooltips etc.

Lots of customization is available within the package.


Interactivity

Lets use the data to create some interactive output

Shiny

Lets say we use the match_summary_full data to plot a histogram of the goals scored by a team in the Premier League

Please note that I have ser eval = FALSE. This app will run but appshots of Shiny app objects are not yet supported. It has a very similar output to crosstalk alternative below

Crosstalk

This is an alternative method which does not require access to a server and allows for htmlwidgets to interact with each other

Filtering

Here is the equivalent input selection(minus a default) and chart as the shiny example above produces

Animation

If you want eye-candy… We can use the standings dataset prepared earlier Let’s look at how arch-rivals, Brighton and Crystal Palace, fared in 2017/18

Brighton, a promoted club, were expected to be struggle but Crystal Palace spent more of the season in the drop zone. In the end, they both survived relegation by placing higher than 18th

Share Comments
comments powered by Disqus