Acknowledgements
Thanks to Dorothea Lesche, Peter Mac postdoc and R course attendee (May 2019) for the idea.
Challenge
You’ve collected some information on patients for your research study using REDCap and now you want to analyse the data. First you need to split the patient information into separate files according to tumour history. You need one file containing the information for all patients who had a single tumour occurrence and one file containing the information for all patients who had tumour recurrences. You also need a file containing the information for the recurrence patients who had a tumour recur in the same location.
The aim of this challenge is to separate the REDCap patient information into files based on the information collected.
Steps
- Read in the csv file called redcap_patients.csv and save it as an object called patients.
library(tidyverse)
patients <- read_csv("/training/r-intro-tidyverse/data/redcap_patients.csv")
Parsed with column specification:
cols(
.default = col_character(),
rep_inst = col_double(),
age_diag = col_double(),
size_dcis = col_double()
)
See spec(...) for full column specifications.
patients
- The data extracted from REDCap is a bit messy (see above). For example, information for the date of birth (DOB) and last communication date (last_coms) is on a separate row to the tumour information for each patient. However, you have figured out how to format it better with tidyverse (nice work!). You’ve used tidyr’s fill function to fill the DOB and last_coms columns down and tidyr’s drop_na to remove rows that are NA for the rep_inst column (run the two commands below to clean the data).
# fill DOB and last_coms down into the NA rows below each recorded value
patients <- fill(patients, DOB, last_coms, .direction = "down")
# delete rows where the rep_inst column is NA
patients <- drop_na(patients, rep_inst)
patients
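To see what the two cleaning commands do, here is a minimal sketch on a small made-up table. The toy values and the object names toy, toy_filled and toy_clean are assumptions for illustration only and are not taken from redcap_patients.csv.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: DOB sits on its own row above the tumour rows
toy <- tribble(
  ~StudyID, ~DOB,         ~rep_inst,
  "P1",     "1960-01-01", NA,
  "P1",     NA,           1,
  "P2",     "1972-05-20", NA,
  "P2",     NA,           1,
  "P2",     NA,           2
)

# fill DOB down into every NA row below a recorded value
toy_filled <- fill(toy, DOB, .direction = "down")

# drop the rows that carry no tumour information (rep_inst is NA)
toy_clean <- drop_na(toy_filled, rep_inst)
toy_clean
```

Note that fill works down the whole column, so it relies on each patient's first row holding the value to be copied.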
- Now you need to create the separate files for the patients with a single tumour occurrence and those with tumour recurrences. The StudyID column contains the patient id. StudyIDs that appear more than once belong to the patients with recurrences. First get the ids of the patients that appear only once.
- Count how many times each patient appears. Hint: use dplyr’s count function.
- From the count result extract the patient ids that appear once. Hint: use filter.
- Make a vector of these single-occurrence patient ids. Hint: you can use dplyr’s pull function as we did in the volcano plot tutorial.
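The three sub-steps above can be sketched on a small made-up table as follows. The toy data and the names id_counts, single_rows and single_ids are assumptions for illustration; try the steps on the real patients object yourself before comparing.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data: P2 appears twice, P1 and P3 once
toy <- tribble(
  ~StudyID, ~breast_side,
  "P1",     "Left",
  "P2",     "Left",
  "P2",     "Right",
  "P3",     "Right"
)

# 1. count how many rows each StudyID has (the count is stored in column n)
id_counts <- count(toy, StudyID)

# 2. keep only the ids that appear exactly once
single_rows <- filter(id_counts, n == 1)

# 3. extract the StudyID column as a plain vector
single_ids <- pull(single_rows, StudyID)
single_ids
```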
- Use the vector of patient ids to extract the information for single-occurrence patients from the patients object. Write out the file as redcap_single.csv. Hint: use filter and %in%.
- Use the vector of patient ids to extract the information for multiple-occurrence patients from the patients object. Write out the file as redcap_recur.csv. Hint: use filter, %in% and ! (! means not).
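As a rough illustration of how filter, %in% and ! combine, here is a sketch on made-up data. The toy table, the single_ids values and the output file names are hypothetical, and the files are written to a temporary directory rather than to the names requested in the challenge.

```r
library(tibble)
library(dplyr)
library(readr)

# Hypothetical toy data and a hypothetical vector of single-occurrence ids
toy <- tribble(
  ~StudyID, ~breast_side,
  "P1",     "Left",
  "P2",     "Left",
  "P2",     "Right",
  "P3",     "Right"
)
single_ids <- c("P1", "P3")

# rows whose StudyID is in the vector
toy_single <- filter(toy, StudyID %in% single_ids)

# rows whose StudyID is NOT in the vector
toy_recur <- filter(toy, !(StudyID %in% single_ids))

# write both tables out as csv files (to a temporary directory here)
out_dir <- tempdir()
write_csv(toy_single, file.path(out_dir, "toy_single.csv"))
write_csv(toy_recur, file.path(out_dir, "toy_recur.csv"))
```

The parentheses after ! are optional in R, but they make it obvious that the whole %in% test is being negated.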
- Email the instructor your code and the csv files.
- The extra task is to identify the recurrence patients who have had at least two tumours occur in the same location. The location is given in the column called breast_side. Hint: use dplyr’s group_by function followed by filter. Write out the file as redcap_recur_same_loc.csv.
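One possible reading of the extra task, sketched on made-up data: group by patient and location, then keep the groups with at least two rows. The toy table, the "Left"/"Right" values and the name same_loc are assumptions, and this version keeps only the rows at the repeated location rather than every row for those patients.

```r
library(tibble)
library(dplyr)

# Hypothetical recurrence data: only P4 has two tumours on the same side
toy_recur <- tribble(
  ~StudyID, ~breast_side,
  "P2",     "Left",
  "P2",     "Right",
  "P4",     "Left",
  "P4",     "Left",
  "P4",     "Right"
)

same_loc <- toy_recur %>%
  group_by(StudyID, breast_side) %>%  # one group per patient and location
  filter(n() >= 2) %>%                # keep groups with two or more tumours
  ungroup()
same_loc
```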