Join two data sets


rtlas

Problem

marks.xlsx has the following columns:

# A tibble: 30 × 6
      id midterm final  lab1  lab2  lab3
   <int>   <int> <int> <int> <int> <int>
 1     2      33    15    NA    NA    NA
 2     3      16     5    NA    NA    NA
 3     7      13    27    NA    NA    NA
 4    10      22    11    NA    NA    NA
 5    13      33    17    NA    NA    NA
 6    14      28    50    NA    NA    NA
 7    18      25    35    NA    NA    NA
 8    20      18    43    NA    NA    NA
 9    22       8    35    NA    NA    NA
10    23      39    23    NA    NA    NA
# … with 20 more rows

Columns lab1, lab2, lab3 are currently empty. Their values are stored in another file labs.xlsx,

# A tibble: 100 × 4
      id  lab1  lab2  lab3
   <int> <int> <int> <int>
 1     1     3     9     1
 2     2     9    10     7
 3     3     1     5     7
 4     4     7     6     7
 5     5     4     7     1
 6     6     9     6    10
 7     7     2     2     0
 8     8     1     4     1
 9     9     9     0     4
10    10    10     6     4
# … with 90 more rows

Solution

The students included in marks.xlsx is a subset of labs.xlsx, evidenced by their different number of rows. To grab the lab1 - lab3 data from labs.xlsx and fill them in marks.xlsx, I can use dplyr::left_join().

marks <- dplyr::left_join(marks[, 1:3], labs, by = "id")
marks
# A tibble: 30 × 6
      id midterm final  lab1  lab2  lab3
   <int>   <int> <int> <int> <int> <int>
 1     2      33    15     9    10     7
 2     3      16     5     1     5     7
 3     7      13    27     2     2     0
 4    10      22    11    10     6     4
 5    13      33    17     2    10     0
 6    14      28    50     0     6     6
 7    18      25    35     8     3     7
 8    20      18    43     8     1    10
 9    22       8    35     5     6     8
10    23      39    23     3    10     1
# … with 20 more rows

I can then export marks

readr::write_csv(tmp, file = "path/to/marks_new.csv")

Epilogue

When I implemented this solution, there was one more complication: the labs.xlsx was originally not a single file, but separated into 9 files. I imported and concatenated them in one go with purrr.

# assuming the the 9 files are names "lab#.xlsx"
labs_file <- list.files(path = "./marks", pattern = "^lab", full.names = TRUE)
labs_data <- purrr::map_dfr(labs_file, ~readxl::read_excel(.x))