Toronto image credit: Unsplash

Making tables in PDF using R

A guide on making tables in PDF using RMarkdown and LaTeX.

Toronto image credit: Unsplash

Making tables in PDF using R

A guide on making tables in PDF using RMarkdown and LaTeX.

Introduction

As part of my job, there is a report I create every year.

Last year, I wrote the report without using R. My workflow went like this:

  • Import the CSV file into SPSS
  • Manually clean the data and create the tables
  • Paste the tables into Excel for formatting
  • Paste the tables into Word for more formatting
  • Send it to another team to professionally graphic-design the report
  • Send the PDF to be published on the website

Here is what last year’s report looked like: http://ccrnr.ca/assets/2018-nclex-rn-2017-canadian-and-international-results-ccrnr-en.pdf

I mean, it’s pretty good! But it’s not reproducible…

Since then, at my work, we’ve been converting all of our processes to be reproducibly written in R! This is much more efficient, and more fun! 😄

So this year, while writing the report using R, my workflow went like this:

  • Import the CSV file into R
  • Run my .R script to clean the data
  • Run my .Rmd script to create the PDF
  • Commit any changes, because the report is now version-controlled thanks to Jenny Bryan’s Happy Git and GitHub for the useR! 😍
  • Send the PDF to be published on the website

This is totally reproducible and will make the creation of next year’s report so much faster! Here is what this year’s report looked like: http://ccrnr.ca/assets/2018-ccrnr-nclex-canadian-and-international-results-report-en.pdf

How did I create the tables in PDF?

After posting on Twitter about this updated process, some people asked how I created the tables in PDF using R.

Well, the first few set-up steps in order to create tables in PDF are as follows:

  • First, I downloaded the tinytex package in R
  • I also installed the bookdown package in R, in order to be able to cross-reference tables (then the YAML output document type is bookdown::pdf_document2)
  • I also installed the knitr and kableExtra packages, in order to be able to create the tables themselves using the kable() function

🍁 A Canadian example! 🍁

Now I’ll go through an example that uses all of the techniques I used while I was writing the report!

The Wikipedia page List of the 100 largest population centres in Canada has a table containing the largest 100 population centres (“cities”) in Canada, along with their populations in 2011 and 2016.

My goal is to make a table that shows:

  • the number of cities in each province that are in the top 100
  • the population of each province in 2011 and 2016 (including the populations of the top 100 cities only)
  • a “Total” row for Canada that shows the total number of cities (this number should equal 100)
  • the population of Canada in 2011 and 2016 (including the populations of the top 100 cities only)

I used the website Convert Wiki Tables to CSV to turn the table on the Wikipedia page into a CSV file. Click here to see the full raw file on my Github page.

Data import

Here is the R code where I import the CSV file I created ("wiki_data.csv"). Underneath the code, I’ve displayed what the raw data file looks like.

library(dplyr)
library(janitor)
library(stringr)
library(knitr)
library(kableExtra)
library(readr)

wiki_data_raw <- read_csv(here::here("content",
                                     "post",
                                     "tables_in_pdf",
                                     "wiki_data.csv")) %>%
  clean_names()

wiki_data_raw %>%
  head() %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
rank population_centre province population_in_2016 population_in_2011 percent_change class
1 Toronto Ontario 5429524 5144412 +5.5% Large urban
2 Montreal Quebec 3519595 3387653 +3.9% Large urban
3 Vancouver British Columbia 2264823 2124443 +6.6% Large urban
4 Calgary Alberta 1237656 1094379 +13.1% Large urban
5 Edmonton Alberta 1062643 935361 +13.6% Large urban
6 Ottawa㤼㸶Gatineau Ontario/Quebec 989657 945592 +4.7% Large urban

Data cleaning

I am only interested in the province of each city and what its population was in 2011 and 2016, so my first step in cleaning will be to select only those three columns. I will then use group_by and summarize to get the number of cities and populations on a per-province basis.

wiki_data_by_province <- wiki_data_raw %>%
  select(province,
         population_2016 = population_in_2016,
         population_2011 = population_in_2011) %>%
  mutate(province = str_extract(province, "^([^\\/]+)")) %>%  # Since some provinces were actually two provinces put together (e.g., "Alberta/Saskatchewan"), I used regex code from this website (https://www.perlmonks.org/?node_id=908348) to get everything before the first forward slash in the string
  group_by(province) %>%
  summarize(number_of_cities_in_top_100 = n(),
            pop_of_largest_cities_2011 = sum(population_2011),
            pop_of_largest_cities_2016 = sum(population_2016))

wiki_data_by_province %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
province number_of_cities_in_top_100 pop_of_largest_cities_2011 pop_of_largest_cities_2016
Alberta 13 2538547 2891712
British Columbia 14 3254203 3471292
Manitoba 2 715649 760249
New Brunswick 4 244910 250811
Newfoundland and Labrador 1 172312 178427
Nova Scotia 2 335154 346605
Ontario 39 10140286 10659522
Prince Edward Island 1 41613 44739
Quebec 20 5140554 5337846
Saskatchewan 4 479228 527638

Since I also want a “Total” row for all of Canada, I will take the above wiki_data_by_province tibble and I will summarize the three columns in a new tibble to get the total sums for the number of cities and their populations in 2011 and 2016. Since using summarize means I lose the province variable, I will recreate it using mutate to have a value of “Canada”.

wiki_data_total_row <- wiki_data_by_province %>%
  summarize(number_of_cities_in_top_100 = sum(number_of_cities_in_top_100),
            pop_of_largest_cities_2011 = sum(pop_of_largest_cities_2011),
            pop_of_largest_cities_2016 = sum(pop_of_largest_cities_2016)) %>%
  mutate(province = "Canada")

wiki_data_total_row %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
number_of_cities_in_top_100 pop_of_largest_cities_2011 pop_of_largest_cities_2016 province
100 23062456 24468841 Canada

Now I want to merge both the wiki_data_by_province tibble and the wiki_data_total_row tibble on top of one another (using bind_rows). This will be the table that I will save and then read into my .Rmd file in order to create the table in PDF.

wiki_data_final_table <- wiki_data_by_province %>%
  bind_rows(wiki_data_total_row) %>%
  mutate_at(vars(contains("pop")),
            list(. %>%
                   scales::comma()))  # This mutate_at (created using code from https://suzan.rbind.io/2018/02/dplyr-tutorial-2/#mutate-at-to-change-specific-columns) converts all variables containing the word "pop" to have commas separating the thousands.

wiki_data_final_table %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left")
province number_of_cities_in_top_100 pop_of_largest_cities_2011 pop_of_largest_cities_2016
Alberta 13 2,538,547 2,891,712
British Columbia 14 3,254,203 3,471,292
Manitoba 2 715,649 760,249
New Brunswick 4 244,910 250,811
Newfoundland and Labrador 1 172,312 178,427
Nova Scotia 2 335,154 346,605
Ontario 39 10,140,286 10,659,522
Prince Edward Island 1 41,613 44,739
Quebec 20 5,140,554 5,337,846
Saskatchewan 4 479,228 527,638
Canada 100 23,062,456 24,468,841

Now I will save the wiki_data_final_table tibble as an .rds file so that I can read it into my .Rmd file.

wiki_data_final_table %>%
  saveRDS(here::here("content",
                     "post",
                     "tables_in_pdf",
                     "cleaned_wiki_data_for_table.rds"))

Creating the tables in PDF

Below is the .Rmd file that reads in the wiki_data_final_table tibble and uses the kable and kableExtra packages in order to get the table to look the way I want it to.

Also, notice that in the YAML, my output format is bookdown::pdf_document2. This allows me to cross-reference my tables with the text of my document.

So, what’s the real secret to creating tables in PDF from RMarkdown?

The secret is…

The document ✨Create Awesome LaTeX Table with knitr::kable and kableExtra✨, written by Hao Zhu! It’s got everything you need in order to make fantastic tables in PDF using RMarkdown.

To see the final PDF of the below .Rmd file, click here!

And if you’d like to see the actual .Rmd file on my Github page, click here!




---
output: 
  bookdown::pdf_document2:
    toc: no
    number_sections: FALSE
    keep_tex: TRUE
always_allow_html: yes
geometry: "left=1.5cm,right=7cm,top=2cm,bottom=2cm"

---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = FALSE,
                      warning = FALSE,
                      message = FALSE,
                      out.width= "8.5in")

library(dplyr)
library(knitr)
library(kableExtra)

# Colours for the table
blue_table_colour <- "#9BD4F5"

orange_table_colour <- "#FF9000"

light_striping_table_colour <- "#DDDDDD"

```

```{r import-cleaned-data}

wiki_data_final_table <- readRDS(here::here("content",
                                            "post",
                                            "tables_in_pdf",
                                            "cleaned_wiki_data_for_table.rds"))

```

# The top 100 cities in Canada

Table \@ref(tab:table-population-by-province) shows the populations of each province and of Canada in total for the years 2011 and 2016 (this data includes only the top 100 largest cities in Canada).

```{r table-population-by-province}

wiki_data_final_table %>%
  knitr::kable(
    "latex",
    booktabs = TRUE,
    linesep = "",
    caption = "Number of cities and population of each province (including only the largest 100 cities in Canada)",
    col.names = c("Province", "Number of largest 100 cities in this province", rep(c("2011", "2016"), 1)),
    align = c("l", rep("r", 3))) %>%
  kable_styling(latex_options = "HOLD_position") %>%  # This line holds the table where you want it, so LaTeX won't move it around
  add_header_above(
    c(" " = 1,  # There has to be a space here, like this " ", and not like this ""
      " " = 1,
      "Population" = 2),
    bold = TRUE,
    line = FALSE,
    background = blue_table_colour
  ) %>%
  column_spec(1,
              width = "6cm") %>%
  column_spec(2:4,
              width = "3cm") %>%
  footnote(general = "There are other cities in Canada not included in this table; the excluded cities are those smaller than the 100 largest cities in Canada.",
           threeparttable = TRUE,
           general_title = "Footnote:") %>%
  row_spec(
    row = 0,
    background = blue_table_colour,
    bold = TRUE,
    align = "c"
  ) %>%
  row_spec(
    row = c(2,4,6,8,10),
    background = light_striping_table_colour
  ) %>%
  row_spec(
    row = 11,
    background = orange_table_colour,
    bold = TRUE
  ) %>%
  row_spec(
    row = 10,
    hline_after = TRUE)  # This hline unfortunately gets hidden by the orange colouring of the final row, so this line of code doesn't really do anything :(

```
Avatar
Sharleen
Statistician

Related