Baby Names in the UK and USA

Lost in the realms of time, when reshape2 and ggvis were flavour of the month (i.e. 20 months ago), I apparently created a shiny app built around Hadley Wickham’s babynames data package. With the recent release of a UK equivalent from Thomas Leeper, and an intriguing plot on tennis world number ones, I have decided to play around with the data, both old and new.


First, let’s load the libraries and check out the data.


library(babynames)
library(ukbabynames)
library(tidyverse)
library(plotly)
library(htmltools)

uk <- ukbabynames
us <- babynames


glimpse(uk)
## Observations: 227,449
## Variables: 5
## $ year <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 199...
## $ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ name <chr> "Sophie", "Chloe", "Jessica", "Emily", "Lauren", "Hannah"...
## $ n    <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 5206, 494...
## $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...

glimpse(us)
## Observations: 1,858,689
## Variables: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 188...
## $ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"...
## $ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 128...
## $ prop <dbl> 0.072384329, 0.026679234, 0.020521700, 0.019865989, 0.017...

The two datasets are similar, each recording the number of registered births per name per annum, although the US figures go much further back in time. For privacy purposes, names are only listed once they reach a minimum count: 3 in the UK and 5 in the US.

Concentration of names

Let’s look at the latest year, 2015, and see how diverse the names are by tracking the cumulative share of births as names are added in order of popularity.


# combine the 2015 data for both countries, dropping the columns they do not share
uk2015 <- uk %>% 
  filter(year==2015) %>% 
  select(-rank) %>% 
  mutate(country="uk")

us2015 <- us %>% 
  filter(year==2015) %>% 
  select(-prop) %>% 
  mutate(country="us")

df <- bind_rows(uk2015,us2015)

# cumulative share of births by country and sex, most popular names first
cumData <- df %>% 
  arrange(desc(n)) %>% 
  group_by(country, sex) %>% 
  mutate(prop = n / sum(n),
         cumprop = round(100 * cumsum(prop), 2),
         rank = row_number())


cumData %>% 
  #filter(rank<=100) %>% 
  group_by(country,sex) %>% 
  plot_ly(x=~rank,y=~cumprop,color=~country,
          hoverinfo="text",
          text=~paste0("Names: ",rank,
                       "<br>",cumprop,"%")) %>% 
  add_lines(linetype=~sex) %>% 
  layout(title="Cumulative distribution of names in UK and US, by gender - 2015<br>(Zoom in for finer detail)",
         yaxis=list(title="Cumulative %"),
         xaxis=list(title="Number of Names")) %>% 
  config(displayModeBar = FALSE, showLink = FALSE)
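
One way to boil each curve down to a single number is to ask how many of the most popular names are needed to cover a given share of births. The following is a minimal sketch in base R rather than anything from the original app: `names_to_reach` is a hypothetical helper, and the counts in `toy` are made up purely for illustration.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# how many of the most popular names are needed to cover at least
# `target` of all births?
names_to_reach <- function(counts, target = 0.5) {
  counts <- sort(counts, decreasing = TRUE)      # most popular names first
  cumshare <- cumsum(counts) / sum(counts)       # running share of births
  unname(which(cumshare >= target)[1])           # first rank reaching the target
}

# Made-up counts, for illustration only
toy <- c(Olivia = 50, Amelia = 30, Emily = 10, Isla = 6, Ava = 4)

names_to_reach(toy, 0.6)   # names needed to cover 60% of births
names_to_reach(toy, 0.95)  # names needed to cover 95% of births
```

Since cumData is already grouped by country and sex, something along the lines of `cumData %>% summarise(n_half = names_to_reach(n))` should give one figure per curve, which makes the comparison between the UK and US easier to state than reading values off the plot.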