Black-Capped Chickadee image credit: Public Domain Pictures

Hamilton Christmas Bird Count: Part 1

Importing and cleaning the Hamilton Christmas Bird Count data

Black-Capped Chickadee image credit: Public Domain Pictures

Hamilton Christmas Bird Count: Part 1

Importing and cleaning the Hamilton Christmas Bird Count data

Note

This is the first part of four for this dataset.

  • Part 2a contains data downloading and cleaning

  • Part 2b contains visualizations

  • Part 3 contains my Shiny app!

  • Part 4 contains a gganimated plot

Introduction

About two years ago, I was taking my dog for a walk through a park and I began to notice the birds and how fascinating they were! 🐦 I began regularly going out birding (aka “bird-watching”) and reading up on these cool little flying dinosaurs.

It turns out there’s a lot of data in the birding world as well. Birding attracts the sort of detail-oriented person who likes to count and record stuff.

So there are opportunities to get involved in citizen science projects, including a long-running project called the Christmas Bird Count (CBC). It started in 1900, when Frank Chapman, an ornithologist, came up with the idea of counting birds as an alternative to hunting them at Christmas (hunting them being the previous tradition).1

Birders have been going out every year around Christmas, to spend the day walking, biking, or driving through a census area to count all the birds they see or hear.

For the past two years, I have gone out with Hamilton’s Christmas Bird Count. I learn a lot while I’m out there and it feels like we are contributing to a larger purpose because of the data we are collecting.

So I thought I would look at the data and see what it could tell me!

Specifically, I’ve noticed birders will say things like, “the House Sparrows are getting worse every year” or, “the number of Bald Eagles has increased”, and I was wondering if the Christmas Bird Count data would agree or disagree with those statements.

To access the data, I went on the Bird Studies Canada website, clicked on Citizen Science, then Christmas Bird Count, then CBC Audubon Database, and then Historical Results by Count. I downloaded all years of data that existed for the Hamilton count.

Data import

I started by loading all of the packages I will be using:

library(dplyr)
library(janitor)
library(readr)
library(naniar)
library(lubridate)
library(stringr)
library(purrr)
library(tidyr)
library(knitr)
library(kableExtra)
library(here)

I read in the data using the readr and here packages. Click here to download the raw file on my Github page.

hamilton_cbc <- read_csv(here("content", "post", "hamilton_cbc_part_1", "hamilton-cbc-all-years-csv.csv"))

Data cleaning

As shown below, it turns out that the first row just gives information about the count name and latitude/longitude, so I extracted those two pieces of information as current_circle_name and lat_long and then sliced the file so that the first two lines were excluded from the dataset. I then used clean_names from the janitor package.

hamilton_cbc %>%
  head() %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
CircleName Abbrev LatLong X4 X5 X6 X7 X8 X9
Hamilton ONHA 43.2678790000/-79.8852320000 NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
CountYear3 LowTemp HighTemp AMCloud PMClouds AMRain PMRain AMSnow PMSnow
118 -18.0 Celsius -13.0 Celsius Clear Clear None None None None
117 -2.0 Celsius 11.0 Celsius Cloudy Cloudy Light Heavy Light None None
116 -2.0 Celsius 5.0 Celsius Partly Cloudy Partly Cloudy None Light None None
current_circle_name <- hamilton_cbc[1, 1]
lat_long <- hamilton_cbc[1, 3]

hamilton_cbc <- hamilton_cbc %>%
  slice(3 : n())

hamilton_cbc <- hamilton_cbc %>%
  clean_names()

Since I played around with the data before writing this, I know that there are actually five tables in this dataset.

The first three tables contain count day weather data. A lot of the weather data is missing and possibly inconsistent. I will remove these three tables from hamilton_cbc.

Here is the end of the first table and the start of the second table. Notice the line of NAs between the two tables:

hamilton_cbc %>%
  slice(47:54) %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
66 10 25 Clear Clear Unknown Unknown Unknown Unknown
65 18 31 Cloudy Cloudy Unknown Unknown Unknown Unknown
64 26 34 Cloudy Cloudy Unknown Unknown Unknown Unknown
63 21 36 Cloudy Cloudy Unknown Unknown Unknown Unknown
NA NA NA NA NA NA NA NA NA
CountYear5 LowTemp3 HighTemp2 AMCloud2 PMClouds2 NA NA NA NA
118 12/26/2017 85 198.75 100 NA NA NA NA
117 12/26/2016 95 216.65 97 NA NA NA NA

Here is the end of the second table and the start of the third table. Notice the line of NAs between the two tables:

hamilton_cbc %>% 
  slice(143:150) %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
26 12/26/1925 10 NA NA NA NA NA NA
25 12/27/1924 8 NA NA NA NA NA NA
23 12/26/1922 9 NA NA NA NA NA NA
22 12/23/1921 2 8 NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
CountYear4 LowTemp2 NA NA NA NA NA NA NA
118 Hamilton Naturalists’ Club NA NA NA NA NA NA NA
117 Hamilton Naturalists’ Club NA NA NA NA NA NA NA

Here is the end of the third table and the start of the fourth table. Notice that there is a line of NAs between the two tables. The fourth table is where the bird count data actually starts!

hamilton_cbc %>%
  slice(239:246) %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
26 NA NA NA NA NA NA NA NA
25 NA NA NA NA NA NA NA NA
23 NA NA NA NA NA NA NA NA
22 NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
COM_NAME CountYear how_manyCW NumberByPartyHours Flags NA NA NA NA
Snow Goose [Chen caerulescens] 1921 [22] Count Date: 12/23/1921 # Participants: # Species Reported: Total Hrs.: 8.00 NA NA NA NA NA NA NA
Snow Goose [Chen caerulescens] 1922 [23] Count Date: 12/26/1922 # Participants: # Species Reported: Total Hrs.: NA NA NA NA NA NA NA

I decided to programmatically separate the first three tables out from the dataset, in order to be able to easily replicate this analysis when more years’ data gets added, or if I want to repeat this analysis for another count area.

# Declare two lists, one for each table and one for the header of each table
meta_data_table <- list()
meta_data_table_headers <- list()


for (i in 1:3) {
  
  # For each of the three tables, put the first row of data into this variable (i.e., the header data)
  meta_data_table_headers[[i]] <- hamilton_cbc[1, ] %>%
    replace(is.na(.), "NA")  # If a column name is NA, replace that with the letters "NA", in case there is data in the column
  
  
  j <- 2  # Start j at 2, because we want to exclude the header row
  meta_data_row <- list()
  
  # Add each row of data to meta_data_row
  while (!is.na(hamilton_cbc[["circle_name"]][j])) {  # "circle_name" being the name of the first column and we know there is a row of NAs between each table
    meta_data_row[[j]] <- hamilton_cbc[j, ]
    j <- j + 1
  }
  
  meta_data_table[[i]] <- meta_data_row %>%
    bind_rows()

  
  meta_data_table[[i]] <- meta_data_table[[i]] %>%
    rename_all(funs(meta_data_table_headers[[i]])) %>%
    remove_empty(which = "cols") %>% 
    rename(count_year = 1) %>%  # rename knows we are referring to the first column!
    clean_names()

  
  # Remove the meta-data table from the rest of the dataset
  hamilton_cbc <- hamilton_cbc %>%
    slice((j + 1) : n())
}
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
## 
## # Before:
## funs(name = f(.)
## 
## # After: 
## list(name = ~f(.))
## This warning is displayed once per session.

Whew! So now we have a list meta_data_table that contains the three meta-data tables.

In the below code chunk, I converted the three tables in the meta_data_table list into one overall_meta_data table using the reduce function from the purrr package.

Then, using the replace_with_na_all function from the naniar package, anywhere the data was recorded as “Unknown”, I converted it to a proper NA.

I then mutated two of the variables: I used the mdy and year functions from the lubridate to convert the character variables low_temp3 (actually a date variable) and date into a Date and dbl variable, respectively.

I also used the word function from the stringr package. The word function extracts out only the first word from a character string.

# Reduce documentation: reduce(list(x1, x2, x3), f) is equivalent to f(f(x1, x2), x3)

overall_meta_data <- meta_data_table %>%
  reduce(full_join, by = "count_year")

overall_meta_data <- overall_meta_data %>%
  replace_with_na_all(condition = ~.x == "Unknown")

overall_meta_data <- overall_meta_data %>%
  mutate(date = mdy(low_temp3),
         low_temp = word(low_temp), # Extracts the first word from a string!
         high_temp = word(high_temp),
         year = year(date)) %>%
  select(year, date, count_year, low_temp:pm_snow)
# I removed some confusing variables high_temp2:pm_clouds2

We can finally clean up the bird count dataset a bit.

I used the rename_all function along with slice(-1) to remove the column names from the first row and make them the real column names.

hamilton_cbc <- hamilton_cbc %>%
  remove_empty(which = "cols")

hamilton_cbc <- hamilton_cbc %>%  # Do this line after removing the empty columns!
  rename_all(funs(hamilton_cbc[1, ])) %>%
  slice(-1) %>%
  clean_names()

hamilton_cbc <- hamilton_cbc %>%
  rename(species = com_name)

But we still need to deal with the fact that there is a fifth table below the bird count data table. Here is the end of the fourth table and the start of the fifth table. Notice that there is a line of NAs between the two tables:

hamilton_cbc %>%
  slice(23217:(hamilton_cbc %>%
          mutate(row_number = row_number()) %>%
          filter(species == "CountYear1") %>%
            pull(row_number))) %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
species count_year how_many_cw number_by_party_hours flags
House Sparrow [Passer domesticus] 2016 [117] Count Date: 12/26/2016 # Participants: 1 # Species Reported: 97 Total Hrs.: 216.65 2565 11.8394 NA
House Sparrow [Passer domesticus] 2017 [118] Count Date: 12/26/2017 # Participants: 1 # Species Reported: 100 Total Hrs.: 198.75 2731 13.7409 NA
NA NA NA NA NA
CountYear1 FirstName LastName Email IsPrimary

I won’t show the data of the fifth table here, but it shows the full names of the counters. I don’t want to add this data to overall_meta_table, so I will just remove the table from the dataset.

In order to remove the fifth table, I noticed that every row of the bird count data included a [ character in the species variable (for example: “Snow Goose [Chen caerulescens]”), so I filtered out every row in the dataset that didn’t have the [ character. Notice that, in the str_detect filter, I put two \’s in front of the [. That is because the [ is a special regex character and str_detect needs to know I mean the actual character [, and putting two \’s in front of the [ is how to do that.

hamilton_cbc <- hamilton_cbc %>%
  filter(str_detect(species, "\\["))

So, where are we at the moment? We have only one dataset left, and it looks like this:

hamilton_cbc %>%
  head(n = 3) %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
species count_year how_many_cw number_by_party_hours flags
Snow Goose [Chen caerulescens] 1921 [22] Count Date: 12/23/1921 # Participants: # Species Reported: Total Hrs.: 8.00 NA NA NA
Snow Goose [Chen caerulescens] 1922 [23] Count Date: 12/26/1922 # Participants: # Species Reported: Total Hrs.: NA NA NA
Snow Goose [Chen caerulescens] 1924 [25] Count Date: 12/27/1924 # Participants: # Species Reported: Total Hrs.: NA NA NA
  • species gives the species name in English and the scientific name
  • count_year data has a lot of information that we will parse out in a moment
  • how_many_cw provides the actual bird count
  • number_by_party_hours is how many birds were counted divided by the number of person-hours that year
  • flags contains values like US for “unusual” bird (as per the Christmas Bird Count documentation)

Now, we do some regex!

First, I want to split up the species variable into the common species name and the scientific species_latin name.

For the first mutate: I will use @kohske’s regex I found on StackOverflow, which, as Nettle writes:

I like @kohske’s regex, which looks behind for an open parenthesis ?<=\(, looks ahead for a closing parenthesis ?=\), and grabs everything in the middle (lazily) .+?, in other words (?<=\().+?(?=\))

For the second mutate: As you can see in the code below, there is a line break (\n) between every English name and every scientific name in species. We will use that to parse out the scientific name:

hamilton_cbc %>% 
  filter(row_number() == 1) %>% 
  pull(species)
## [1] "Snow Goose\n[Chen caerulescens]"

Here are the two mutates together:

# Putting it together: Mutating the two variables
hamilton_cbc <- hamilton_cbc %>%
  mutate(species_latin = str_extract(species, "(?<=\\[).+?(?=\\])"),
         species = word(species, start = 1, sep = fixed('\n[')))

Now we will look at the count_year variable. Let’s get a sense of what the variable looks like, using the White-Breasted Nuthatch count in 2016:

hamilton_cbc %>% 
  filter(row_number() == 15133) %>% 
  pull(count_year)
## [1] "2016 [117]\nCount Date: 12/26/2016\n# Participants: 1\n# Species Reported: 97\nTotal Hrs.: 216.65"

The count_year variable is actually several variables in one:

  • calendar year
  • [CBC count number]
  • calendar count date
  • number of participants
  • number of species reported
  • total hours spent that year on the count

This is all metadata and we can take it out of this dataset and put it in our overall_meta_data file. The only variable we will keep in the hamilton_cbc dataset is the calendar year, so that we can plot by year and so we can join to the metadata later if we want.

To join the count_participant_meta_data to the overall_meta_data, first we will parse the different variables out:

count_participant_meta_data <- hamilton_cbc %>%
  distinct(participant_info = count_year) %>%
  mutate(year = word(participant_info)) %>%
  mutate(number_of_participants = str_extract(
    participant_info, "(?<=Participants:\\s).+?(?=\\s#)")) %>%  # Gets everything between "Participants: " and " #"
  mutate(species_reported = str_extract(
    participant_info, "(?<=Reported:\\s).+?(?=\\nTotal)")) %>%  # Gets everything between "Reported: " and "Total"
  mutate(total_hours = str_extract(
    participant_info, "(?<=Hrs\\.:\\s).*$"))  # Gets everything after "Hrs.: "
# This regex is different because it goes until the end of the string:
# https://forum.sublimetext.com/t/regex-match-everything-after-this-word/20764

count_participant_meta_data %>%
  head(n = 3) %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
participant_info year number_of_participants species_reported total_hours
1921 [22] Count Date: 12/23/1921 # Participants: # Species Reported: Total Hrs.: 8.00 1921 NA NA 8.00
1922 [23] Count Date: 12/26/1922 # Participants: # Species Reported: Total Hrs.: 1922 NA NA NA
1924 [25] Count Date: 12/27/1924 # Participants: # Species Reported: Total Hrs.: 1924 NA NA NA

Since number_of_participants and species_reported ended up having lots of missing data, I decided not to include them in the overall_meta_data dataset.

count_participant_meta_data <- count_participant_meta_data %>%
  select(year, total_hours)

overall_meta_data <- count_participant_meta_data %>% 
  mutate(year = as.integer(year),
         total_hours = as.double(total_hours)) %>%
  full_join(overall_meta_data, by = "year")

So, how’s our overall_meta_data looking?

overall_meta_data %>%
  tail() %>% # Because the first few years contain a lot of NAs
  glimpse()
## Observations: 6
## Variables: 12
## $ year        <dbl> 2012, 2013, 2014, 2015, 2016, 2017
## $ total_hours <dbl> 194.55, 182.20, 179.25, 221.00, 216.65, 198.75
## $ date        <date> 2012-12-26, 2013-12-26, 2014-12-26, 2015-12-26, 2...
## $ count_year  <chr> "113", "114", "115", "116", "117", "118"
## $ low_temp    <chr> "-2.0", "-4.0", "0.6", "-2.0", "-2.0", "-18.0"
## $ high_temp   <chr> "-2.0", "-1.0", "8.3", "5.0", "11.0", "-13.0"
## $ am_cloud    <chr> "Cloudy", "Cloudy", "Cloudy", "Partly Cloudy", "Cl...
## $ pm_clouds   <chr> "Cloudy", "Cloudy", "Cloudy", "Partly Cloudy", "Cl...
## $ am_rain     <chr> "None", NA, "None", "None", "Light", "None"
## $ pm_rain     <chr> "None", NA, "None", "Light", "Heavy\nLight", "None"
## $ am_snow     <chr> "Light", "Light", "None", "None", "None", "None"
## $ pm_snow     <chr> "Light", "None", "None", "None", "None", "None"

And where are we at with the hamilton_cbc dataset?

hamilton_cbc %>%
  tail() %>% # Because the first few years contain a lot of NAs
  glimpse()
## Observations: 6
## Variables: 6
## $ species               <chr> "House Sparrow", "House Sparrow", "House...
## $ count_year            <chr> "2012 [113]\nCount Date: 12/26/2012\n# P...
## $ how_many_cw           <chr> "1473", "1802", "1318", "2326", "2565", ...
## $ number_by_party_hours <chr> "7.5713", "9.8902", "7.3529", "10.5249",...
## $ flags                 <chr> NA, NA, NA, NA, NA, NA
## $ species_latin         <chr> "Passer domesticus", "Passer domesticus"...

Let’s clean up the variables a bit more:

hamilton_cbc <- hamilton_cbc %>%
  rename(participant_info = count_year,
         how_many_counted = how_many_cw) %>%
  mutate(year = as.integer(word(participant_info)),  # We will keep year and total_hours
         total_hours = as.double(
           str_extract(
             participant_info, "(?<=Hrs\\.:\\s).*$")))

We almost have a clean dataset! 🔜 ✨✨✨ 🎉

I am going to remove the flags variable. I am also going to remove number_by_party_hours and derive it myself instead.

hamilton_cbc <- hamilton_cbc %>%
  select(year, species, species_latin, how_many_counted, total_hours)

It turns out that how_many_counted also has a cw value, which means the bird was not seen on count day itself, but was seen on a day close to the count. I am going to set these bird counts to be NA, as they don’t have a specified value.

hamilton_cbc <- hamilton_cbc %>%
  mutate(how_many_counted = ifelse(how_many_counted == "cw", NA, how_many_counted),
         how_many_counted = as.integer(how_many_counted))

In the species variable, there are some rows that are identified only to the genus level (and not to the species level). I will exclude these records, as I believe eBird excludes them too.

hamilton_cbc %>%
filter(str_detect(species, "sp\\.")) %>%
distinct(species)
## # A tibble: 25 x 1
##    species        
##    <chr>          
##  1 scoter sp.     
##  2 duck sp.       
##  3 loon sp.       
##  4 Accipiter sp.  
##  5 hawk sp.       
##  6 eagle sp.      
##  7 jaeger sp.     
##  8 gull sp.       
##  9 screech-owl sp.
## 10 owl sp.        
## # ... with 15 more rows
hamilton_cbc <- hamilton_cbc %>%
  filter(!(str_detect(species, "sp\\.")))

Two final mutates:

  • Using tidyr’s replace_na function, let’s make all of the NAs equal to 0 for how_many_counted. That means we are assuming that all birds in the area were successfully counted on count day.
  • Let’s also calculate the number of birds counted (within each species) divided by the total number of count hours that happened that year.
hamilton_cbc <- hamilton_cbc %>%
  mutate(how_many_counted = replace_na(how_many_counted, 0),
         how_many_counted_by_hour = as.double(how_many_counted) / total_hours)

And that’s it! 😄 🎉 We have cleaned the dataset and are ready to do some visualizing 👀 in Part 2!

Final dataset

Here is a glimpse of our final dataset:

hamilton_cbc %>%
  tail() %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
year species species_latin how_many_counted total_hours how_many_counted_by_hour
2012 House Sparrow Passer domesticus 1473 194.55 7.571318
2013 House Sparrow Passer domesticus 1802 182.20 9.890230
2014 House Sparrow Passer domesticus 1318 179.25 7.352859
2015 House Sparrow Passer domesticus 2326 221.00 10.524887
2016 House Sparrow Passer domesticus 2565 216.65 11.839372
2017 House Sparrow Passer domesticus 2731 198.75 13.740880

And thank you to the CBC! The CBC Data was provided by National Audubon Society and through the generous efforts of Bird Studies Canada and countless volunteers across the western hemisphere.

Avatar
Sharleen
Statistician