Introduction

While visualizing the data, I realized it still needed a bit more cleaning, so this short post outlines the extra steps.

To start, we’ll load all of the packages and the data:

# A tibble: 20,868 x 6
    year species   species_latin  how_many_counted total_hours how_many_counted~
   <int> <chr>     <chr>                     <dbl>       <dbl>             <dbl>
 1  1921 "Snow Go~ Chen caerules~                0           8                 0
 2  1922 "Snow Go~ Chen caerules~                0          NA                NA
 3  1924 "Snow Go~ Chen caerules~                0          NA                NA
 4  1925 "Snow Go~ Chen caerules~                0          NA                NA
 5  1926 "Snow Go~ Chen caerules~                0          NA                NA
 6  1928 "Snow Go~ Chen caerules~                0          NA                NA
 7  1930 "Snow Go~ Chen caerules~                0          NA                NA
 8  1931 "Snow Go~ Chen caerules~                0          NA                NA
 9  1932 "Snow Go~ Chen caerules~                0          NA                NA
10  1933 "Snow Go~ Chen caerules~                0          NA                NA
# ... with 20,858 more rows
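The loading code itself is not shown above, so here is a minimal sketch of what that step might look like. The package list is taken from the session info at the end of the post; the file path and the use of `read_rds()` are assumptions for illustration, not the actual location or method.

```r
library(dplyr)    # data manipulation verbs and the %>% pipe
library(stringr)  # str_detect(), str_remove(), word()
library(readr)    # read_rds()
library(here)     # project-relative file paths

# Hypothetical path: adjust to wherever your copy of the data lives.
cbc_path <- here("data", "hamilton_cbc.rds")

# Only read the file if it is actually there, so the sketch runs anywhere.
if (file.exists(cbc_path)) {
  hamilton_cbc <- read_rds(cbc_path)
  hamilton_cbc
}
```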

Final cleaning touches

In particular, I want to:

  1. Remove hybrid birds

  2. Consolidate the names of some species that had variations in them

Let’s strip a stray "\r" from the species names, then see how many hybrid species we have and remove them:

hamilton_cbc <- hamilton_cbc %>%
  mutate(species = str_remove(species, "\r"))  # Remove the trailing "\r"

hamilton_cbc %>%
  filter(str_detect(species, "hybrid")) %>%
  distinct(species)
# A tibble: 5 x 1
  species                                   
  <chr>                                     
1 Snow x Canada Goose (hybrid)              
2 American Black Duck x Mallard (hybrid)    
3 Mallard x Northern Pintail (hybrid)       
4 Herring x Glaucous Gull (hybrid)          
5 Herring x Great Black-backed Gull (hybrid)

hamilton_cbc <- hamilton_cbc %>%
  filter(!str_detect(species, "hybrid"))
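To see what these two steps do in isolation, here is a small sketch on made-up data (the `toy` tibble and its counts are invented for illustration). It strips the carriage return and then drops any row whose name contains "hybrid".

```r
library(dplyr)
library(stringr)

# Made-up rows, each with a stray carriage return at the end of the name
toy <- tibble(
  species = c("Mallard\r", "Mallard x Northern Pintail (hybrid)\r", "Snow Goose\r"),
  how_many_counted = c(120, 1, 3)
)

toy_clean <- toy %>%
  mutate(species = str_remove(species, "\r")) %>%  # drop the first "\r" found
  filter(!str_detect(species, "hybrid"))           # keep only non-hybrid rows

toy_clean$species
# "Mallard" "Snow Goose"
```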

Now, onto cleaning the trickier stuff. Sometimes, species have sub-species names or groups that have different total counts. For example, the Juncos (where total_counted is the sum of the counts over all years for that species):

hamilton_cbc %>%
  filter(str_detect(species, "Junco")) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 4 x 3
  species                       species_latin                      total_counted
  <chr>                         <chr>                                      <dbl>
1 Dark-eyed Junco               Junco hyemalis                             14426
2 Dark-eyed Junco (Oregon)      Junco hyemalis [oreganus Group                39
3 Dark-eyed Junco (Slate-color~ Junco hyemalis hyemalis/carolinen~         46764
4 Dark-eyed Junco (White-winge~ Junco hyemalis aikeni                          1

I just want there to be one Dark-eyed Junco species in this dataset, so I am going to consolidate these four sub-species into one species. (Even though people get way more excited about seeing the Oregon sub-species in Hamilton than the Slate-colored 😄.)

The first step is to only keep the first two words of the species_latin variable:

hamilton_cbc <- hamilton_cbc %>%
  mutate(species_latin = word(species_latin, start = 1, end = 2))
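As a quick illustration of what `word()` does here, the sketch below applies it to a few Latin names. Names with a third word lose it, while two-word names, including ones with a slash, are left unchanged.

```r
library(stringr)

latin <- c("Junco hyemalis aikeni",
           "Junco hyemalis",
           "Icterus bullockii/galbula",
           "Columba livia")

# word() splits on single spaces by default and keeps words 1 through 2
first_two <- word(latin, start = 1, end = 2)
first_two
# "Junco hyemalis" "Junco hyemalis" "Icterus bullockii/galbula" "Columba livia"
```

Because the slash is not a space, names like "Icterus bullockii/galbula" pass through untouched.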

We can also see who else is in this list:

hamilton_cbc %>%
  group_by(species_latin) %>%
  filter(n_distinct(species) > 1) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 26 x 3
   species                         species_latin      total_counted
   <chr>                           <chr>                      <dbl>
 1 American Kestrel                Falco sparverius            1520
 2 American Kestrel (Northern)     Falco sparverius               4
 3 Brant                           Branta bernicla                8
 4 Brant (Atlantic)                Branta bernicla                1
 5 Common Grackle                  Quiscalus quiscula           173
 6 Common Grackle (Purple)         Quiscalus quiscula            17
 7 Dark-eyed Junco                 Junco hyemalis             14426
 8 Dark-eyed Junco (Oregon)        Junco hyemalis                39
 9 Dark-eyed Junco (Slate-colored) Junco hyemalis             46764
10 Dark-eyed Junco (White-winged)  Junco hyemalis                 1
# ... with 16 more rows
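The grouped `filter()` with `n_distinct()` is the piece doing the work in that query. Here is a sketch of it on made-up data (the `toy` tibble is invented for illustration): only Latin names that appear under more than one common name survive.

```r
library(dplyr)

toy <- tibble(
  species = c("Brant", "Brant (Atlantic)", "Brant", "Mallard", "Mallard"),
  species_latin = c("Branta bernicla", "Branta bernicla", "Branta bernicla",
                    "Anas platyrhynchos", "Anas platyrhynchos"),
  how_many_counted = c(5, 1, 3, 200, 150)
)

multi_name <- toy %>%
  group_by(species_latin) %>%
  filter(n_distinct(species) > 1) %>%  # evaluated once per Latin name
  ungroup() %>%
  distinct(species, species_latin)

multi_name
# Two rows: "Brant" and "Brant (Atlantic)", both Branta bernicla
```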

The second step is to sum the counts within each year across all of the sub-species, so that every row for a species carries the same yearly total, and then keep only the first row for each species (which, when arranged alphabetically, is the one with the shortest species name):

hamilton_cbc <- hamilton_cbc %>%
  group_by(year, species_latin) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%
  arrange(year, species) %>%
  filter(row_number() == 1) %>%
  ungroup()
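Here is the same pipeline run on a small made-up example (the `toy` tibble and its counts are invented), which makes it easier to check by hand what is summed and which row is kept.

```r
library(dplyr)

toy <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001),
  species = c("Dark-eyed Junco (Slate-colored)", "Dark-eyed Junco", "Mallard",
              "Dark-eyed Junco (Oregon)", "Dark-eyed Junco"),
  species_latin = c("Junco hyemalis", "Junco hyemalis", "Anas platyrhynchos",
                    "Junco hyemalis", "Junco hyemalis"),
  how_many_counted = c(40, 10, 300, 2, 25)
)

consolidated <- toy %>%
  group_by(year, species_latin) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%  # yearly total per Latin name
  arrange(year, species) %>%                            # sorts all rows; grouping is kept
  filter(row_number() == 1) %>%                         # first row within each group
  ungroup()

consolidated
# 2000 Dark-eyed Junco  50
# 2000 Mallard         300
# 2001 Dark-eyed Junco  27
```

One thing to keep in mind: in a year where only a sub-species row exists, that longer name is the one that gets kept.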

hamilton_cbc %>%
  filter(str_detect(species, "Junco")) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 1 x 3
  species         species_latin  total_counted
  <chr>           <chr>                  <dbl>
1 Dark-eyed Junco Junco hyemalis         61230

Perfect! No more sub-species. The last group to deal with is species whose names contain either a ( or a /:

hamilton_cbc %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup() %>%
  filter(str_detect(species, "\\(|/"))  # The "|" is an "or" within the regex
# A tibble: 11 x 3
   species                             species_latin               total_counted
   <chr>                               <chr>                               <dbl>
 1 Barn Owl (American)                 Tyto alba                               1
 2 Bullock's/Baltimore Oriole          Icterus bullockii/galbula               1
 3 Great Blue Heron (Blue form)        Ardea herodias                        362
 4 Greater/Lesser Scaup                Aythya marila/affinis               26558
 5 Pacific/Winter Wren                 Troglodytes pacificus/hiem~           498
 6 Palm Warbler (Western)              Setophaga palmarum                      1
 7 Rock Pigeon (Feral Pigeon)          Columba livia                       60114
 8 Spotted/Eastern Towhee (Rufous-sid~ Pipilo maculatus/erythroph~            28
 9 Western/Eastern Meadowlark          Sturnella neglecta/magna               49
10 Wilson's/Common Snipe               Gallinago delicata/gallina~            13
11 Yellow-rumped Warbler (Myrtle)      Setophaga coronata                     65
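The regular expression in that filter can be checked on its own. In this sketch, `\\(` matches a literal opening parenthesis (it has to be escaped because parentheses are special in a regex), and `|/` adds "or a forward slash".

```r
library(stringr)

species <- c("Barn Owl (American)", "Greater/Lesser Scaup", "Mallard", "Snow Goose")

flagged <- str_detect(species, "\\(|/")
flagged
# TRUE TRUE FALSE FALSE
```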

I am going to make some executive decisions about what to do with these species:

  1. Delete the ambiguous species guess: Greater/Lesser Scaup
  2. Assume super-rare species were in fact the more common species:
    • Bullock’s/Baltimore Oriole were Baltimore Orioles
    • Western/Eastern Meadowlark were Eastern Meadowlarks
    • Wilson’s/Common Snipe were Common Snipes
    • Spotted/Eastern Towhee (Rufous-sided Towhee) were Eastern Towhees
    • Pacific/Winter Wren were Winter Wrens
  3. Remove parentheses on the remaining species for neatness

hamilton_cbc <- hamilton_cbc %>%
  filter(!(species == "Greater/Lesser Scaup")) %>%
  mutate(species = case_when(species == "Bullock's/Baltimore Oriole" ~ "Baltimore Oriole",
                             species == "Western/Eastern Meadowlark" ~ "Eastern Meadowlark",
                             species == "Wilson's/Common Snipe" ~ "Common Snipe",
                             species == "Spotted/Eastern Towhee (Rufous-sided Towhee)" ~ "Eastern Towhee",
                             species == "Pacific/Winter Wren" ~ "Winter Wren",
                             TRUE ~ species),
         species_latin = case_when(species_latin == "Icterus bullockii/galbula" ~ "Icterus galbula",
                                   species_latin == "Sturnella neglecta/magna" ~ "Sturnella magna",
                                   species_latin == "Gallinago delicata/gallinago" ~ "Gallinago gallinago",
                                   species_latin == "Pipilo maculatus/erythrophthalmus" ~ "Pipilo erythrophthalmus",
                                   species_latin == "Troglodytes pacificus/hiemalis" ~ "Troglodytes hiemalis",
                                   TRUE ~ species_latin),
         species = case_when(species == "Barn Owl (American)" ~ "Barn Owl",
                             species == "Great Blue Heron (Blue form)" ~ "Great Blue Heron",
                             species == "Palm Warbler (Western)" ~ "Palm Warbler",
                             species == "Rock Pigeon (Feral Pigeon)" ~ "Rock Pigeon",
                             species == "Yellow-rumped Warbler (Myrtle)" ~ "Yellow-rumped Warbler",
                             TRUE ~ species))

# Consolidate the counts between the species whose names were just updated
# This is the same step as was done in the earlier sub-species section
hamilton_cbc <- hamilton_cbc %>%
  group_by(year, species) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%
  arrange(year, species) %>%
  filter(row_number() == 1) %>%
  ungroup()
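For comparison, here is a compact sketch of the same idea on made-up data (the `toy` tibble is invented). It differs from the code above in two ways, both swapped in for brevity: a regex removes any trailing parenthetical instead of listing each name, and `summarise()` re-sums the counts, which drops every column other than the grouping columns and the total.

```r
library(dplyr)
library(stringr)

toy <- tibble(
  year = c(2010, 2010, 2010),
  species = c("Western/Eastern Meadowlark", "Eastern Meadowlark", "Barn Owl (American)"),
  how_many_counted = c(1, 4, 1)
)

recoded <- toy %>%
  mutate(
    # Explicit recode, as in the post
    species = case_when(
      species == "Western/Eastern Meadowlark" ~ "Eastern Meadowlark",
      TRUE ~ species
    ),
    # Alternative to naming each species: strip a trailing " (...)"
    species = str_remove(species, " \\(.*\\)$")
  ) %>%
  group_by(year, species) %>%
  summarise(how_many_counted = sum(how_many_counted), .groups = "drop")

recoded
# 2010 Barn Owl            1
# 2010 Eastern Meadowlark  5
```

The explicit list has the advantage that nothing gets renamed unless it has been looked at first, which matters when the parenthetical carries real information.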

Finally, I am going to recalculate the how_many_counted_by_hour variable that depends on how_many_counted:

hamilton_cbc <- hamilton_cbc %>%
  mutate(how_many_counted_by_hour = as.double(how_many_counted) / total_hours)
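A small sketch on made-up rows shows how this rate behaves, in particular for years where total_hours is missing: the division returns NA rather than failing.

```r
library(dplyr)

toy <- tibble(
  year = c(1921, 1922, 1990),
  how_many_counted = c(0, 0, 50),
  total_hours = c(8, NA, 100)
)

rates <- toy %>%
  mutate(how_many_counted_by_hour = as.double(how_many_counted) / total_hours)

rates$how_many_counted_by_hour
# 0.0  NA 0.5
```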

Number of species counted each year

While creating a plot, I noticed what I believe is an error in the total hours recorded for 1982: the total was only 64 hours, yet there was no drop in the number of species counted that year. I think it should actually have been 164 hours, because there were 167 hours in 1981 and 168 hours in 1983. So, in the chunk below, I’ve set 1982 to 164 total hours.

# Mutating total_hours and how_many_counted_by_hour that depends on it

hamilton_cbc <- hamilton_cbc %>%
  mutate(total_hours = ifelse(year == 1982, 164, total_hours),
         how_many_counted_by_hour = as.double(how_many_counted) / total_hours)
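I spotted this value by eye in a plot, but a similar check could be sketched in code by comparing each year with its neighbours. The three values below are the ones quoted above; the "less than half the neighbouring average" threshold is an arbitrary assumption for illustration, not a rule used in this analysis.

```r
library(dplyr)

hours <- tibble(
  year = c(1981, 1982, 1983),
  total_hours = c(167, 64, 168)
)

flagged <- hours %>%
  arrange(year) %>%
  mutate(
    # Average of the previous and next year's hours (NA at either end)
    neighbour_mean = (lag(total_hours) + lead(total_hours)) / 2,
    # Arbitrary threshold: under half of the neighbouring average
    suspicious = !is.na(neighbour_mean) & total_hours < 0.5 * neighbour_mean
  )

flagged$suspicious
# FALSE TRUE FALSE

# The same correction as above, written with dplyr's type-strict if_else()
fixed <- hours %>%
  mutate(total_hours = if_else(year == 1982, 164, total_hours))
```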

If you would like to download this final, cleaned dataset in .rds format, you can do so here.

We are now ready to visualize! Please see the next post in this series for the visualizations!

And thank you to the Christmas Bird Count! The Christmas Bird Count Data was provided by National Audubon Society and through the generous efforts of Bird Studies Canada and countless volunteers across the western hemisphere.


Session info

- Session info ---------------------------------------------------------------
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  English_Canada.1252         
 ctype    English_Canada.1252         
 tz       America/New_York            
 date     2020-09-04                  

- Packages -------------------------------------------------------------------
 ! package     * version    date       lib source                     
 P assertthat    0.2.1      2019-03-21 [?] CRAN (R 4.0.0)             
 P backports     1.1.8      2020-06-17 [?] CRAN (R 4.0.0)             
 P blogdown      0.20       2020-06-23 [?] CRAN (R 4.0.2)             
 P bookdown      0.20       2020-06-23 [?] CRAN (R 4.0.0)             
 P callr         3.4.3      2020-03-28 [?] CRAN (R 4.0.0)             
 P cli           2.0.2      2020-02-28 [?] CRAN (R 4.0.0)             
 P crayon        1.3.4      2017-09-16 [?] CRAN (R 4.0.0)             
 P desc          1.2.0      2018-05-01 [?] CRAN (R 4.0.0)             
 P devtools    * 2.3.1      2020-07-21 [?] CRAN (R 4.0.2)             
 P digest        0.6.25     2020-02-23 [?] CRAN (R 4.0.0)             
 P dplyr       * 1.0.1      2020-07-31 [?] CRAN (R 4.0.2)             
 P ellipsis      0.3.1      2020-05-15 [?] CRAN (R 4.0.2)             
 P emo         * 0.0.0.9000 2020-07-07 [?] Github (hadley/emo)
 P evaluate      0.14       2019-05-28 [?] CRAN (R 4.0.0)             
 P fansi         0.4.1      2020-01-08 [?] CRAN (R 4.0.0)             
 P fs            1.5.0      2020-07-31 [?] CRAN (R 4.0.2)             
 P generics      0.0.2      2018-11-29 [?] CRAN (R 4.0.0)             
 P glue          1.4.1      2020-05-13 [?] CRAN (R 4.0.2)             
 P here        * 0.1        2017-05-28 [?] CRAN (R 4.0.2)             
 P hms           0.5.3      2020-01-08 [?] CRAN (R 4.0.0)             
 P htmltools     0.5.0      2020-06-16 [?] CRAN (R 4.0.2)             
 P knitr         1.29       2020-06-23 [?] CRAN (R 4.0.2)             
 P lifecycle     0.2.0      2020-03-06 [?] CRAN (R 4.0.0)             
 P lubridate     1.7.9      2020-06-08 [?] CRAN (R 4.0.2)             
 P magrittr      1.5        2014-11-22 [?] CRAN (R 4.0.0)             
 P memoise       1.1.0      2017-04-21 [?] CRAN (R 4.0.0)             
 P pillar        1.4.6      2020-07-10 [?] CRAN (R 4.0.2)             
 P pkgbuild      1.1.0      2020-07-13 [?] CRAN (R 4.0.2)             
 P pkgconfig     2.0.3      2019-09-22 [?] CRAN (R 4.0.0)             
 P pkgload       1.1.0      2020-05-29 [?] CRAN (R 4.0.2)             
 P prettyunits   1.1.1      2020-01-24 [?] CRAN (R 4.0.0)             
 P processx      3.4.3      2020-07-05 [?] CRAN (R 4.0.2)             
 P ps            1.3.4      2020-08-11 [?] CRAN (R 4.0.2)             
 P purrr         0.3.4      2020-04-17 [?] CRAN (R 4.0.0)             
 P R6            2.4.1      2019-11-12 [?] CRAN (R 4.0.0)             
 P Rcpp          1.0.5      2020-07-06 [?] CRAN (R 4.0.2)             
 P readr       * 1.3.1      2018-12-21 [?] CRAN (R 4.0.0)             
 P remotes       2.2.0      2020-07-21 [?] CRAN (R 4.0.2)             
   renv          0.11.0     2020-06-26 [1] CRAN (R 4.0.2)             
 P rlang         0.4.7      2020-07-09 [?] CRAN (R 4.0.2)             
 P rmarkdown     2.3        2020-06-18 [?] CRAN (R 4.0.2)             
 P rprojroot     1.3-2      2018-01-03 [?] CRAN (R 4.0.0)             
 P sessioninfo   1.1.1      2018-11-05 [?] CRAN (R 4.0.0)             
 P stringi       1.4.6      2020-02-17 [?] CRAN (R 4.0.0)             
 P stringr     * 1.4.0      2019-02-10 [?] CRAN (R 4.0.0)             
 P testthat      2.3.2      2020-03-02 [?] CRAN (R 4.0.0)             
 P tibble        3.0.3      2020-07-10 [?] CRAN (R 4.0.2)             
 P tidyselect    1.1.0      2020-05-11 [?] CRAN (R 4.0.2)             
 P usethis     * 1.6.1      2020-04-29 [?] CRAN (R 4.0.2)             
 P utf8          1.1.4      2018-05-24 [?] CRAN (R 4.0.0)             
 P vctrs         0.3.2      2020-07-15 [?] CRAN (R 4.0.2)             
 P withr         2.2.0      2020-04-20 [?] CRAN (R 4.0.0)             
 P xfun          0.16       2020-07-24 [?] CRAN (R 4.0.2)             
 P yaml          2.2.1      2020-02-01 [?] CRAN (R 4.0.0)             

[1] C:/Users/shw/Desktop/blog2/renv/library/R-4.0/x86_64-w64-mingw32
[2] C:/Users/shw/AppData/Local/Temp/Rtmp6dkm3x/renv-system-library

 P -- Loaded and on-disk path mismatch.