Lost in the realms of time, when reshape2 and ggvis were flavour of the month (i.e. 20 months ago), I apparently created a shiny app built around Hadley Wickham’s babynames data package. With the recent release of a UK equivalent from Thomas Leeper, and an intriguing plot on tennis world-ranked number ones, I have decided to play around with the data, both old and new.
First, let’s load the libraries and check out the data.
library(babynames)
library(ukbabynames)
library(tidyverse)
library(plotly)
library(htmltools)
uk <- ukbabynames
us <- babynames
glimpse(uk)
## Observations: 227,449
## Variables: 5
## $ year <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 199...
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ name <chr> "Sophie", "Chloe", "Jessica", "Emily", "Lauren", "Hannah"...
## $ n <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 5206, 494...
## $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
glimpse(us)
## Observations: 1,858,689
## Variables: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 188...
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"...
## $ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 128...
## $ prop <dbl> 0.072384329, 0.026679234, 0.020521700, 0.019865989, 0.017...
The two datasets are similar, each recording the number of registered births per annum, although the US figures go much further back in time. For privacy purposes, each has a minimum count for a name to be listed: 3 in the UK and 5 in the US.
Concentration of names
Let’s look at the latest year, 2015, and see how diverse the names are by tracking the cumulative share of births as names are added in order of popularity.
# combine the UK and US data for 2015
uk2015 <- uk %>%
  filter(year == 2015) %>%
  select(-rank) %>%
  mutate(country = "uk")
us2015 <- us %>%
  filter(year == 2015) %>%
  select(-prop) %>%
  mutate(country = "us")
df <- bind_rows(uk2015, us2015)
# within each country and sex, order names by popularity and
# accumulate their share of births
cumData <- df %>%
  arrange(desc(n)) %>%
  group_by(country, sex) %>%
  mutate(prop = n / sum(n),
         cumprop = round(100 * cumsum(prop), 2),
         rank = row_number())
cumData %>%
  #filter(rank<=100) %>%
  group_by(country, sex) %>%
  plot_ly(x = ~rank, y = ~cumprop, color = ~country,
          hoverinfo = "text",
          text = ~paste0("Names: ", rank,
                         "<br>", cumprop, "%")) %>%
  add_lines(linetype = ~sex) %>%
  layout(title = "Cumulative distribution of names in UK and US, by gender - 2015<br>(Zoom in for finer detail)",
         yaxis = list(title = "Cumulative %"),
         xaxis = list(title = "Number of Names")) %>%
  config(displayModeBar = FALSE, showLink = FALSE)
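One way to boil the curves down to a single number per group is to ask how many names are needed to cover half of all births. The sketch below is not part of the original analysis: it repeats the cumulative-share steps on a small made-up table (the `toy` counts and the `names_for_half` and `names_needed` names are invented for illustration) so that it runs without either data package.

```r
library(dplyr)

# Toy counts, invented purely for illustration (not taken from either package)
toy <- tibble(
  country = c("uk", "uk", "uk", "uk", "us", "us", "us", "us"),
  sex     = "F",
  name    = c("A", "B", "C", "D", "E", "F", "G", "H"),
  n       = c(60, 20, 10, 10, 30, 25, 25, 20)
)

# Same steps as for cumData: rank names within each group and
# accumulate their share of births
toy_cum <- toy %>%
  arrange(desc(n)) %>%
  group_by(country, sex) %>%
  mutate(
    prop    = n / sum(n),
    cumprop = 100 * cumsum(prop),
    rank    = row_number()
  )

# Smallest rank at which the cumulative share reaches 50%
names_for_half <- toy_cum %>%
  filter(cumprop >= 50) %>%
  summarise(names_needed = min(rank), .groups = "drop")

print(names_for_half)
```

In the toy data a single name covers half of the "uk" births, while the "us" group needs two. Assuming `cumData` has been built as above, the same `filter()` and `summarise()` steps should apply to it directly, giving one figure per country and sex.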