TL;DR

Today we will be making this in a PDF, starting from a tidy data frame of city population data:

Image of the pdf that we will be making

Introduction

As part of my job, there is a report I create every year.

Last year, I wrote the report without using R. My workflow went like this:

  • Import the CSV file into SPSS
  • Manually clean the data and create the tables
  • Paste the tables into Excel for formatting
  • Paste the tables into Word for more formatting
  • Send it to another team to professionally graphic-design the report
  • Send the PDF to be published on the website

Here is what last year’s report looked like: http://ccrnr.ca/assets/2018-nclex-rn-2017-canadian-and-international-results-ccrnr-en.pdf

I mean, it’s pretty good! But it’s not very reproducible…

Since then, at my work, we’ve been converting all of our processes to be more reproducibly written in R! This is much more efficient, and more fun!

So this year, while writing the report using R, my workflow went like this:

  • Import the CSV file into R
  • Run my .R script to clean the data
  • Run my .Rmd script to create the PDF
  • Commit any changes, because the report is now version-controlled thanks to Jenny Bryan’s Happy Git and GitHub for the useR! 😍
  • Send the PDF to be published on the website

This is much more reproducible and will make the creation of next year’s report so much faster! Click here to see what this year’s report looked like: http://ccrnr.ca/assets/2018-ccrnr-nclex-canadian-and-international-results-report-en.pdf

How did I create the tables in PDF?

After posting on Twitter about this updated process, some people asked how I created the tables in PDF using R.

Well, the first few set-up steps in order to create tables in PDF are as follows:

  • First, I downloaded the tinytex package in R
  • I also installed the bookdown package in R, in order to be able to cross-reference tables (then the YAML output document type is bookdown::pdf_document2)
  • I also installed the knitr and kableExtra packages, in order to be able to create the tables themselves using the kable() function

🍁 A Canadian example! 🍁

Now I’ll go through an example that uses all of the techniques I used while I was writing the report!

The Wikipedia page List of the 100 largest population centres in Canada has a table containing the largest 100 population centres (“cities”) in Canada, along with their populations in 2011 and 2016.

My goal is to make a table that shows:

  • the number of cities in each province that are in the top 100
  • the population of each province in 2011 and 2016 (including the populations of the top 100 cities only)
  • a “Total” row for Canada that shows the total number of cities (this number should equal 100)
  • the population of Canada in 2011 and 2016 (including the populations of the top 100 cities only)

I used the website Convert Wiki Tables to CSV to turn the table on the Wikipedia page into a CSV file. Click here to see the full raw file on my Github page.

Data import

Here is the R code where I import the CSV file I created ("wiki_data.csv"). Underneath the code, I’ve displayed what the raw data file looks like.

library(dplyr)
library(janitor)
library(stringr)
library(knitr)
library(kableExtra)
library(readr)

wiki_data_raw <- read_csv(here::here("content",
                                     "post",
                                     "2019-09-01-tables-in-pdf",
                                     "wiki_data.csv")) %>%
  clean_names()

wiki_data_raw %>%
  head()
# A tibble: 6 x 7
   rank population_cent~ province population_in_2~ population_in_2~
  <dbl> <chr>            <chr>               <dbl>            <dbl>
1     1 "Toronto"        Ontario           5429524          5144412
2     2 "Montreal"       Quebec            3519595          3387653
3     3 "Vancouver"      British~          2264823          2124443
4     4 "Calgary"        Alberta           1237656          1094379
5     5 "Edmonton"       Alberta           1062643           935361
6     6 "Ottawa\x96Gati~ Ontario~           989657           945592
# ... with 2 more variables: percent_change <chr>, class <chr>

Data cleaning

I am only interested in the province of each city and what its population was in 2011 and 2016, so my first step in cleaning will be to select only those three columns. I will then use group_by and summarize to get the number of cities and populations on a per-province basis.

wiki_data_by_province <- wiki_data_raw %>%
  select(province,
         population_2016 = population_in_2016,
         population_2011 = population_in_2011) %>%
   # Since some provinces were actually two provinces put together
   # (e.g., "Alberta/Saskatchewan"), I used regex code from this website
   # (https://www.perlmonks.org/?node_id=908348) to get everything before
   # the first forward slash in the string
  mutate(province = str_extract(province, "^([^\\/]+)")) %>%
  group_by(province) %>%
  summarize(number_of_cities_in_top_100 = n(),
            pop_of_largest_cities_2011 = sum(population_2011),
            pop_of_largest_cities_2016 = sum(population_2016))

wiki_data_by_province
# A tibble: 10 x 4
   province        number_of_cities_in_~ pop_of_largest_cit~ pop_of_largest_cit~
   <chr>                           <int>               <dbl>               <dbl>
 1 Alberta                            13             2538547             2891712
 2 British Columb~                    14             3254203             3471292
 3 Manitoba                            2              715649              760249
 4 New Brunswick                       4              244910              250811
 5 Newfoundland a~                     1              172312              178427
 6 Nova Scotia                         2              335154              346605
 7 Ontario                            39            10140286            10659522
 8 Prince Edward ~                     1               41613               44739
 9 Quebec                             20             5140554             5337846
10 Saskatchewan                        4              479228              527638

Since I also want a “Total” row for all of Canada, I will take the above wiki_data_by_province tibble and I will summarize the three columns in a new tibble to get the total sums for the number of cities and their populations in 2011 and 2016. Since using summarize means I lose the province variable, I will recreate it using mutate to have a value of “Canada”.

wiki_data_total_row <- wiki_data_by_province %>%
  summarize(number_of_cities_in_top_100 = sum(number_of_cities_in_top_100),
            pop_of_largest_cities_2011 = sum(pop_of_largest_cities_2011),
            pop_of_largest_cities_2016 = sum(pop_of_largest_cities_2016)) %>%
  mutate(province = "Canada")

wiki_data_total_row
# A tibble: 1 x 4
  number_of_cities_in_to~ pop_of_largest_cities~ pop_of_largest_cities~ province
                    <int>                  <dbl>                  <dbl> <chr>   
1                     100               23062456               24468841 Canada  

Now I want to merge both the wiki_data_by_province tibble and the wiki_data_total_row tibble on top of one another (using bind_rows). This will be the table that I will save and then read into my .Rmd file in order to create the table in PDF.

wiki_data_final_table <- wiki_data_by_province %>%
  bind_rows(wiki_data_total_row) %>%
  # This mutate_at
  # (created using code from https://suzan.rbind.io/2018/02/dplyr-tutorial-2/#mutate-at-to-change-specific-columns)
  # converts all variables containing the word "pop" to have commas separating the thousands.
  mutate_at(vars(contains("pop")),
            list(. %>%
                   scales::comma()))

wiki_data_final_table
# A tibble: 11 x 4
   province        number_of_cities_in_~ pop_of_largest_cit~ pop_of_largest_cit~
   <chr>                           <int> <chr>               <chr>              
 1 Alberta                            13 2,538,547           2,891,712          
 2 British Columb~                    14 3,254,203           3,471,292          
 3 Manitoba                            2 715,649             760,249            
 4 New Brunswick                       4 244,910             250,811            
 5 Newfoundland a~                     1 172,312             178,427            
 6 Nova Scotia                         2 335,154             346,605            
 7 Ontario                            39 10,140,286          10,659,522         
 8 Prince Edward ~                     1 41,613              44,739             
 9 Quebec                             20 5,140,554           5,337,846          
10 Saskatchewan                        4 479,228             527,638            
11 Canada                            100 23,062,456          24,468,841         

Now I will save the wiki_data_final_table tibble as an .rds file so that I can read it into my .Rmd file.

wiki_data_final_table %>%
  saveRDS(here::here("content",
                     "post",
                     "2019-09-01-tables-in-pdf",
                     "cleaned_wiki_data_for_table.rds"))

Creating the tables in PDF

Below is the .Rmd file that reads in the wiki_data_final_table tibble and uses the kable and kableExtra packages in order to get the table to look the way I want it to.

Also, notice that in the YAML, my output format is bookdown::pdf_document2. This allows me to cross-reference my tables with the text of my document.

So, what’s the real secret to creating tables in PDF from RMarkdown?

The secret is…

The document Create Awesome LaTeX Table with knitr::kable and kableExtra, written by Hao Zhu! It’s got everything you need in order to make fantastic tables in PDF using RMarkdown.

To see the final PDF of the below .Rmd file, click ✨here✨!

---
output: 
  bookdown::pdf_document2:
    toc: no
    number_sections: FALSE
    keep_tex: TRUE
always_allow_html: yes
geometry: "left=1.5cm,right=7cm,top=2cm,bottom=2cm"

---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = FALSE,
                      warning = FALSE,
                      message = FALSE,
                      out.width= "8.5in")

library(dplyr)
library(knitr)
library(kableExtra)

# Colours for the table
blue_table_colour <- "#9BD4F5"

orange_table_colour <- "#FF9000"

light_striping_table_colour <- "#DDDDDD"

```

```{r import-cleaned-data}

wiki_data_final_table <- readRDS(here::here("content",
                                            "post",
                                            "2019-09-01-tables-in-pdf",
                                            "cleaned_wiki_data_for_table.rds"))

```

# The top 100 cities in Canada

Table \@ref(tab:table-population-by-province) shows the populations of each province and of Canada in total for the years 2011 and 2016 (this data includes only the top 100 largest cities in Canada).

```{r table-population-by-province}

wiki_data_final_table %>%
  knitr::kable(
    "latex",
    booktabs = TRUE,
    linesep = "",
    caption = "Number of cities and population of each province (including only the largest 100 cities in Canada)",
    col.names = c("Province", "Number of largest 100 cities in this province", rep(c("2011", "2016"), 1)),
    align = c("l", rep("r", 3))) %>%
  kable_styling(latex_options = "HOLD_position") %>%  # This line holds the table where you want it, so LaTeX won't move it around
  add_header_above(
    c(" " = 1,  # There has to be a space here, like this " ", and not like this ""
      " " = 1,
      "Population" = 2),
    bold = TRUE,
    line = FALSE,
    background = blue_table_colour
  ) %>%
  column_spec(1,
              width = "6cm") %>%
  column_spec(2:4,
              width = "3cm") %>%
  footnote(general = "There are other cities in Canada not included in this table; the excluded cities are those smaller than the 100 largest cities in Canada.",
           threeparttable = TRUE,
           general_title = "Footnote:") %>%
  row_spec(
    row = 0,
    background = blue_table_colour,
    bold = TRUE,
    align = "c"
  ) %>%
  row_spec(
    row = c(2,4,6,8,10),
    background = light_striping_table_colour
  ) %>%
  row_spec(
    row = 11,
    background = orange_table_colour,
    bold = TRUE
  ) %>%
  row_spec(
    row = 10,
    hline_after = TRUE)  # This hline unfortunately gets hidden by the orange colouring of the final row, so this line of code doesn't really do anything :(

```

Session info

- Session info ---------------------------------------------------------------
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  English_Canada.1252         
 ctype    English_Canada.1252         
 tz       America/New_York            
 date     2020-09-04                  

- Packages -------------------------------------------------------------------
 ! package     * version    date       lib source                     
 P assertthat    0.2.1      2019-03-21 [?] CRAN (R 4.0.0)             
 P backports     1.1.8      2020-06-17 [?] CRAN (R 4.0.0)             
 P blogdown      0.20       2020-06-23 [?] CRAN (R 4.0.2)             
 P bookdown      0.20       2020-06-23 [?] CRAN (R 4.0.0)             
 P callr         3.4.3      2020-03-28 [?] CRAN (R 4.0.0)             
 P cli           2.0.2      2020-02-28 [?] CRAN (R 4.0.0)             
 P colorspace    1.4-1      2019-03-18 [?] CRAN (R 4.0.0)             
 P crayon        1.3.4      2017-09-16 [?] CRAN (R 4.0.0)             
 P desc          1.2.0      2018-05-01 [?] CRAN (R 4.0.0)             
 P devtools    * 2.3.1      2020-07-21 [?] CRAN (R 4.0.2)             
 P digest        0.6.25     2020-02-23 [?] CRAN (R 4.0.0)             
 P dplyr       * 1.0.1      2020-07-31 [?] CRAN (R 4.0.2)             
 P ellipsis      0.3.1      2020-05-15 [?] CRAN (R 4.0.2)             
 P emo         * 0.0.0.9000 2020-07-07 [?] Github (hadley/[email protected])
 P evaluate      0.14       2019-05-28 [?] CRAN (R 4.0.0)             
 P fansi         0.4.1      2020-01-08 [?] CRAN (R 4.0.0)             
 P fs            1.5.0      2020-07-31 [?] CRAN (R 4.0.2)             
 P generics      0.0.2      2018-11-29 [?] CRAN (R 4.0.0)             
 P glue          1.4.1      2020-05-13 [?] CRAN (R 4.0.2)             
 P here          0.1        2017-05-28 [?] CRAN (R 4.0.2)             
 P hms           0.5.3      2020-01-08 [?] CRAN (R 4.0.0)             
 P htmltools     0.5.0      2020-06-16 [?] CRAN (R 4.0.2)             
 P httr          1.4.2      2020-07-20 [?] CRAN (R 4.0.2)             
 P janitor     * 2.0.1      2020-04-12 [?] CRAN (R 4.0.0)             
 P kableExtra  * 1.1.0      2019-03-16 [?] CRAN (R 4.0.0)             
 P knitr       * 1.29       2020-06-23 [?] CRAN (R 4.0.2)             
 P lifecycle     0.2.0      2020-03-06 [?] CRAN (R 4.0.0)             
 P lubridate     1.7.9      2020-06-08 [?] CRAN (R 4.0.2)             
 P magrittr      1.5        2014-11-22 [?] CRAN (R 4.0.0)             
 P memoise       1.1.0      2017-04-21 [?] CRAN (R 4.0.0)             
 P munsell       0.5.0      2018-06-12 [?] CRAN (R 4.0.0)             
 P pillar        1.4.6      2020-07-10 [?] CRAN (R 4.0.2)             
 P pkgbuild      1.1.0      2020-07-13 [?] CRAN (R 4.0.2)             
 P pkgconfig     2.0.3      2019-09-22 [?] CRAN (R 4.0.0)             
 P pkgload       1.1.0      2020-05-29 [?] CRAN (R 4.0.2)             
 P prettyunits   1.1.1      2020-01-24 [?] CRAN (R 4.0.0)             
 P processx      3.4.3      2020-07-05 [?] CRAN (R 4.0.2)             
 P ps            1.3.4      2020-08-11 [?] CRAN (R 4.0.2)             
 P purrr         0.3.4      2020-04-17 [?] CRAN (R 4.0.0)             
 P R6            2.4.1      2019-11-12 [?] CRAN (R 4.0.0)             
 P Rcpp          1.0.5      2020-07-06 [?] CRAN (R 4.0.2)             
 P readr       * 1.3.1      2018-12-21 [?] CRAN (R 4.0.0)             
 P remotes       2.2.0      2020-07-21 [?] CRAN (R 4.0.2)             
   renv          0.11.0     2020-06-26 [1] CRAN (R 4.0.2)             
 P rlang         0.4.7      2020-07-09 [?] CRAN (R 4.0.2)             
 P rmarkdown     2.3        2020-06-18 [?] CRAN (R 4.0.2)             
 P rprojroot     1.3-2      2018-01-03 [?] CRAN (R 4.0.0)             
 P rstudioapi    0.11       2020-02-07 [?] CRAN (R 4.0.0)             
 P rvest         0.3.6      2020-07-25 [?] CRAN (R 4.0.2)             
 P scales        1.1.1      2020-05-11 [?] CRAN (R 4.0.2)             
 P sessioninfo   1.1.1      2018-11-05 [?] CRAN (R 4.0.0)             
 P snakecase     0.11.0     2019-05-25 [?] CRAN (R 4.0.0)             
 P stringi       1.4.6      2020-02-17 [?] CRAN (R 4.0.0)             
 P stringr     * 1.4.0      2019-02-10 [?] CRAN (R 4.0.0)             
 P testthat      2.3.2      2020-03-02 [?] CRAN (R 4.0.0)             
 P tibble        3.0.3      2020-07-10 [?] CRAN (R 4.0.2)             
 P tidyselect    1.1.0      2020-05-11 [?] CRAN (R 4.0.2)             
 P usethis     * 1.6.1      2020-04-29 [?] CRAN (R 4.0.2)             
 P utf8          1.1.4      2018-05-24 [?] CRAN (R 4.0.0)             
 P vctrs         0.3.2      2020-07-15 [?] CRAN (R 4.0.2)             
 P viridisLite   0.3.0      2018-02-01 [?] CRAN (R 4.0.0)             
 P webshot       0.5.2      2019-11-22 [?] CRAN (R 4.0.2)             
 P withr         2.2.0      2020-04-20 [?] CRAN (R 4.0.0)             
 P xfun          0.16       2020-07-24 [?] CRAN (R 4.0.2)             
 P xml2          1.3.2      2020-04-23 [?] CRAN (R 4.0.2)             
 P yaml          2.2.1      2020-02-01 [?] CRAN (R 4.0.0)             

[1] C:/Users/shw/Desktop/blog2/renv/library/R-4.0/x86_64-w64-mingw32
[2] C:/Users/shw/AppData/Local/Temp/Rtmp4azd69/renv-system-library

 P -- Loaded and on-disk path mismatch.