TL;DR
Today we will be making this in a PDF, starting from a tidy data frame of city population data:
Image of the pdf that we will be making
Introduction
As part of my job, there is a report I create every year.
Last year, I wrote the report without using R. My workflow went like this:
- Import the CSV file into SPSS
- Manually clean the data and create the tables
- Paste the tables into Excel for formatting
- Paste the tables into Word for more formatting
- Send it to another team to professionally graphic-design the report
- Send the PDF to be published on the website
Here is what last year’s report looked like: http://ccrnr.ca/assets/2018-nclex-rn-2017-canadian-and-international-results-ccrnr-en.pdf
I mean, it’s pretty good! But it’s not very reproducible…
Since then, at my work, we’ve been converting all of our processes to be more reproducibly written in R! This is much more efficient, and more fun!
So this year, while writing the report using R, my workflow went like this:
- Import the CSV file into R
- Run my
.R
script to clean the data - Run my
.Rmd
script to create the PDF - Commit any changes, because the report is now version-controlled thanks to Jenny Bryan’s Happy Git and GitHub for the useR! 😍
- Send the PDF to be published on the website
This is much more reproducible and will make the creation of next year’s report so much faster! Click here to see what this year’s report looked like: http://ccrnr.ca/assets/2018-ccrnr-nclex-canadian-and-international-results-report-en.pdf
How did I create the tables in PDF?
After posting on Twitter about this updated process, some people asked how I created the tables in PDF using R.
Well, the first few set-up steps in order to create tables in PDF are as follows:
- First, I downloaded the
tinytex
package in R - I also installed the
bookdown
package in R, in order to be able to cross-reference tables (then the YAML output document type isbookdown::pdf_document2
) - I also installed the
knitr
andkableExtra
packages, in order to be able to create the tables themselves using thekable()
function
🍁 A Canadian example! 🍁
Now I’ll go through an example that uses all of the techniques I used while I was writing the report!
The Wikipedia page List of the 100 largest population centres in Canada has a table containing the largest 100 population centres (“cities”) in Canada, along with their populations in 2011 and 2016.
My goal is to make a table that shows:
- the number of cities in each province that are in the top 100
- the population of each province in 2011 and 2016 (including the populations of the top 100 cities only)
- a “Total” row for Canada that shows the total number of cities (this number should equal 100)
- the population of Canada in 2011 and 2016 (including the populations of the top 100 cities only)
I used the website Convert Wiki Tables to CSV to turn the table on the Wikipedia page into a CSV file. Click here to see the full raw file on my Github page.
Data import
Here is the R code where I import the CSV file I created ("wiki_data.csv"
). Underneath the code, I’ve displayed what the raw data file looks like.
library(dplyr)
library(janitor)
library(stringr)
library(knitr)
library(kableExtra)
library(readr)
wiki_data_raw <- read_csv(here::here("content",
"post",
"2019-09-01-tables-in-pdf",
"wiki_data.csv")) %>%
clean_names()
wiki_data_raw %>%
head()
# A tibble: 6 x 7
rank population_cent~ province population_in_2~ population_in_2~
<dbl> <chr> <chr> <dbl> <dbl>
1 1 "Toronto" Ontario 5429524 5144412
2 2 "Montreal" Quebec 3519595 3387653
3 3 "Vancouver" British~ 2264823 2124443
4 4 "Calgary" Alberta 1237656 1094379
5 5 "Edmonton" Alberta 1062643 935361
6 6 "Ottawa\x96Gati~ Ontario~ 989657 945592
# ... with 2 more variables: percent_change <chr>, class <chr>
Data cleaning
I am only interested in the province of each city and what its population was in 2011 and 2016, so my first step in cleaning will be to select only those three columns. I will then use group_by
and summarize
to get the number of cities and populations on a per-province basis.
wiki_data_by_province <- wiki_data_raw %>%
select(province,
population_2016 = population_in_2016,
population_2011 = population_in_2011) %>%
# Since some provinces were actually two provinces put together
# (e.g., "Alberta/Saskatchewan"), I used regex code from this website
# (https://www.perlmonks.org/?node_id=908348) to get everything before
# the first forward slash in the string
mutate(province = str_extract(province, "^([^\\/]+)")) %>%
group_by(province) %>%
summarize(number_of_cities_in_top_100 = n(),
pop_of_largest_cities_2011 = sum(population_2011),
pop_of_largest_cities_2016 = sum(population_2016))
wiki_data_by_province
# A tibble: 10 x 4
province number_of_cities_in_~ pop_of_largest_cit~ pop_of_largest_cit~
<chr> <int> <dbl> <dbl>
1 Alberta 13 2538547 2891712
2 British Columb~ 14 3254203 3471292
3 Manitoba 2 715649 760249
4 New Brunswick 4 244910 250811
5 Newfoundland a~ 1 172312 178427
6 Nova Scotia 2 335154 346605
7 Ontario 39 10140286 10659522
8 Prince Edward ~ 1 41613 44739
9 Quebec 20 5140554 5337846
10 Saskatchewan 4 479228 527638
Since I also want a “Total” row for all of Canada, I will take the above wiki_data_by_province
tibble and I will summarize
the three columns in a new tibble to get the total sums for the number of cities and their populations in 2011 and 2016. Since using summarize
means I lose the province
variable, I will recreate it using mutate
to have a value of “Canada”.
wiki_data_total_row <- wiki_data_by_province %>%
summarize(number_of_cities_in_top_100 = sum(number_of_cities_in_top_100),
pop_of_largest_cities_2011 = sum(pop_of_largest_cities_2011),
pop_of_largest_cities_2016 = sum(pop_of_largest_cities_2016)) %>%
mutate(province = "Canada")
wiki_data_total_row
# A tibble: 1 x 4
number_of_cities_in_to~ pop_of_largest_cities~ pop_of_largest_cities~ province
<int> <dbl> <dbl> <chr>
1 100 23062456 24468841 Canada
Now I want to merge both the wiki_data_by_province
tibble and the wiki_data_total_row
tibble on top of one another (using bind_rows
). This will be the table that I will save and then read into my .Rmd
file in order to create the table in PDF.
wiki_data_final_table <- wiki_data_by_province %>%
bind_rows(wiki_data_total_row) %>%
# This mutate_at
# (created using code from https://suzan.rbind.io/2018/02/dplyr-tutorial-2/#mutate-at-to-change-specific-columns)
# converts all variables containing the word "pop" to have commas separating the thousands.
mutate_at(vars(contains("pop")),
list(. %>%
scales::comma()))
wiki_data_final_table
# A tibble: 11 x 4
province number_of_cities_in_~ pop_of_largest_cit~ pop_of_largest_cit~
<chr> <int> <chr> <chr>
1 Alberta 13 2,538,547 2,891,712
2 British Columb~ 14 3,254,203 3,471,292
3 Manitoba 2 715,649 760,249
4 New Brunswick 4 244,910 250,811
5 Newfoundland a~ 1 172,312 178,427
6 Nova Scotia 2 335,154 346,605
7 Ontario 39 10,140,286 10,659,522
8 Prince Edward ~ 1 41,613 44,739
9 Quebec 20 5,140,554 5,337,846
10 Saskatchewan 4 479,228 527,638
11 Canada 100 23,062,456 24,468,841
Now I will save the wiki_data_final_table
tibble as an .rds
file so that I can read it into my .Rmd
file.
Creating the tables in PDF
Below is the .Rmd
file that reads in the wiki_data_final_table
tibble and uses the kable
and kableExtra
packages in order to get the table to look the way I want it to.
Also, notice that in the YAML, my output format is bookdown::pdf_document2
. This allows me to cross-reference my tables with the text of my document.
So, what’s the real secret to creating tables in PDF from RMarkdown?
The secret is…
The document Create Awesome LaTeX Table with knitr::kable and kableExtra, written by Hao Zhu! It’s got everything you need in order to make fantastic tables in PDF using RMarkdown.
To see the final PDF of the below .Rmd
file, click ✨here✨!
---
output:
bookdown::pdf_document2:
toc: no
number_sections: FALSE
keep_tex: TRUE
always_allow_html: yes
geometry: "left=1.5cm,right=7cm,top=2cm,bottom=2cm"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
warning = FALSE,
message = FALSE,
out.width= "8.5in")
library(dplyr)
library(knitr)
library(kableExtra)
# Colours for the table
blue_table_colour <- "#9BD4F5"
orange_table_colour <- "#FF9000"
light_striping_table_colour <- "#DDDDDD"
```
```{r import-cleaned-data}
wiki_data_final_table <- readRDS(here::here("content",
"post",
"2019-09-01-tables-in-pdf",
"cleaned_wiki_data_for_table.rds"))
```
# The top 100 cities in Canada
Table \@ref(tab:table-population-by-province) shows the populations of each province and of Canada in total for the years 2011 and 2016 (this data includes only the top 100 largest cities in Canada).
```{r table-population-by-province}
wiki_data_final_table %>%
knitr::kable(
"latex",
booktabs = TRUE,
linesep = "",
caption = "Number of cities and population of each province (including only the largest 100 cities in Canada)",
col.names = c("Province", "Number of largest 100 cities in this province", rep(c("2011", "2016"), 1)),
align = c("l", rep("r", 3))) %>%
kable_styling(latex_options = "HOLD_position") %>% # This line holds the table where you want it, so LaTeX won't move it around
add_header_above(
c(" " = 1, # There has to be a space here, like this " ", and not like this ""
" " = 1,
"Population" = 2),
bold = TRUE,
line = FALSE,
background = blue_table_colour
) %>%
column_spec(1,
width = "6cm") %>%
column_spec(2:4,
width = "3cm") %>%
footnote(general = "There are other cities in Canada not included in this table; the excluded cities are those smaller than the 100 largest cities in Canada.",
threeparttable = TRUE,
general_title = "Footnote:") %>%
row_spec(
row = 0,
background = blue_table_colour,
bold = TRUE,
align = "c"
) %>%
row_spec(
row = c(2,4,6,8,10),
background = light_striping_table_colour
) %>%
row_spec(
row = 11,
background = orange_table_colour,
bold = TRUE
) %>%
row_spec(
row = 10,
hline_after = TRUE) # This hline unfortunately gets hidden by the orange colouring of the final row, so this line of code doesn't really do anything :(
```
Session info
- Session info ---------------------------------------------------------------
setting value
version R version 4.0.2 (2020-06-22)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Canada.1252
ctype English_Canada.1252
tz America/New_York
date 2020-09-04
- Packages -------------------------------------------------------------------
! package * version date lib source
P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.0.0)
P backports 1.1.8 2020-06-17 [?] CRAN (R 4.0.0)
P blogdown 0.20 2020-06-23 [?] CRAN (R 4.0.2)
P bookdown 0.20 2020-06-23 [?] CRAN (R 4.0.0)
P callr 3.4.3 2020-03-28 [?] CRAN (R 4.0.0)
P cli 2.0.2 2020-02-28 [?] CRAN (R 4.0.0)
P colorspace 1.4-1 2019-03-18 [?] CRAN (R 4.0.0)
P crayon 1.3.4 2017-09-16 [?] CRAN (R 4.0.0)
P desc 1.2.0 2018-05-01 [?] CRAN (R 4.0.0)
P devtools * 2.3.1 2020-07-21 [?] CRAN (R 4.0.2)
P digest 0.6.25 2020-02-23 [?] CRAN (R 4.0.0)
P dplyr * 1.0.1 2020-07-31 [?] CRAN (R 4.0.2)
P ellipsis 0.3.1 2020-05-15 [?] CRAN (R 4.0.2)
P emo * 0.0.0.9000 2020-07-07 [?] Github (hadley/[email protected])
P evaluate 0.14 2019-05-28 [?] CRAN (R 4.0.0)
P fansi 0.4.1 2020-01-08 [?] CRAN (R 4.0.0)
P fs 1.5.0 2020-07-31 [?] CRAN (R 4.0.2)
P generics 0.0.2 2018-11-29 [?] CRAN (R 4.0.0)
P glue 1.4.1 2020-05-13 [?] CRAN (R 4.0.2)
P here 0.1 2017-05-28 [?] CRAN (R 4.0.2)
P hms 0.5.3 2020-01-08 [?] CRAN (R 4.0.0)
P htmltools 0.5.0 2020-06-16 [?] CRAN (R 4.0.2)
P httr 1.4.2 2020-07-20 [?] CRAN (R 4.0.2)
P janitor * 2.0.1 2020-04-12 [?] CRAN (R 4.0.0)
P kableExtra * 1.1.0 2019-03-16 [?] CRAN (R 4.0.0)
P knitr * 1.29 2020-06-23 [?] CRAN (R 4.0.2)
P lifecycle 0.2.0 2020-03-06 [?] CRAN (R 4.0.0)
P lubridate 1.7.9 2020-06-08 [?] CRAN (R 4.0.2)
P magrittr 1.5 2014-11-22 [?] CRAN (R 4.0.0)
P memoise 1.1.0 2017-04-21 [?] CRAN (R 4.0.0)
P munsell 0.5.0 2018-06-12 [?] CRAN (R 4.0.0)
P pillar 1.4.6 2020-07-10 [?] CRAN (R 4.0.2)
P pkgbuild 1.1.0 2020-07-13 [?] CRAN (R 4.0.2)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.0.0)
P pkgload 1.1.0 2020-05-29 [?] CRAN (R 4.0.2)
P prettyunits 1.1.1 2020-01-24 [?] CRAN (R 4.0.0)
P processx 3.4.3 2020-07-05 [?] CRAN (R 4.0.2)
P ps 1.3.4 2020-08-11 [?] CRAN (R 4.0.2)
P purrr 0.3.4 2020-04-17 [?] CRAN (R 4.0.0)
P R6 2.4.1 2019-11-12 [?] CRAN (R 4.0.0)
P Rcpp 1.0.5 2020-07-06 [?] CRAN (R 4.0.2)
P readr * 1.3.1 2018-12-21 [?] CRAN (R 4.0.0)
P remotes 2.2.0 2020-07-21 [?] CRAN (R 4.0.2)
renv 0.11.0 2020-06-26 [1] CRAN (R 4.0.2)
P rlang 0.4.7 2020-07-09 [?] CRAN (R 4.0.2)
P rmarkdown 2.3 2020-06-18 [?] CRAN (R 4.0.2)
P rprojroot 1.3-2 2018-01-03 [?] CRAN (R 4.0.0)
P rstudioapi 0.11 2020-02-07 [?] CRAN (R 4.0.0)
P rvest 0.3.6 2020-07-25 [?] CRAN (R 4.0.2)
P scales 1.1.1 2020-05-11 [?] CRAN (R 4.0.2)
P sessioninfo 1.1.1 2018-11-05 [?] CRAN (R 4.0.0)
P snakecase 0.11.0 2019-05-25 [?] CRAN (R 4.0.0)
P stringi 1.4.6 2020-02-17 [?] CRAN (R 4.0.0)
P stringr * 1.4.0 2019-02-10 [?] CRAN (R 4.0.0)
P testthat 2.3.2 2020-03-02 [?] CRAN (R 4.0.0)
P tibble 3.0.3 2020-07-10 [?] CRAN (R 4.0.2)
P tidyselect 1.1.0 2020-05-11 [?] CRAN (R 4.0.2)
P usethis * 1.6.1 2020-04-29 [?] CRAN (R 4.0.2)
P utf8 1.1.4 2018-05-24 [?] CRAN (R 4.0.0)
P vctrs 0.3.2 2020-07-15 [?] CRAN (R 4.0.2)
P viridisLite 0.3.0 2018-02-01 [?] CRAN (R 4.0.0)
P webshot 0.5.2 2019-11-22 [?] CRAN (R 4.0.2)
P withr 2.2.0 2020-04-20 [?] CRAN (R 4.0.0)
P xfun 0.16 2020-07-24 [?] CRAN (R 4.0.2)
P xml2 1.3.2 2020-04-23 [?] CRAN (R 4.0.2)
P yaml 2.2.1 2020-02-01 [?] CRAN (R 4.0.0)
[1] C:/Users/shw/Desktop/blog2/renv/library/R-4.0/x86_64-w64-mingw32
[2] C:/Users/shw/AppData/Local/Temp/Rtmp4azd69/renv-system-library
P -- Loaded and on-disk path mismatch.