Redcap to cBioPortal ETL¶
---¶
Step-by-step break down of setting up a Redcap study for data to be exposed on cBioPortal, pulling data from Redcap, and creating configuration files for transforming data into a cBioPortal data format
There are three (3) configuration files that can be used for exporting data from Redcap and transforming the data into the cBioPortal format: - API mapping file (Required) - Key variable map file (Required) - Data transformation file (Optional)
Setting up Redcap for the ETL¶
Required from Redcap before starting¶
- A Redcap API token for the study
- At least one (1) Redcap report
- These variables in the study codebook:
- DOB column
- record id column
- de-identified ID column
- IMPACT sample id column
Requirements for a Redcap report¶
Data on cBioPortal cannot contain PHI such as dates, accession numbers, or identifiable patient IDs.
Therefore, the following is needed for each Redcap reports: - The date of birth column, or another reference date. This is only used when the report contains a date attribute that needs to be de-identified. - The de-identified patient ID column. This will map to the IDs used when initially creating a cBioPortal study. - The IMPACT sample id column. Only required for sample-level data.
Ideal setup of Redcap reports¶
Data that will be imported into cBioPortal will need to fit these three categories: - Patient-level summaries (data_clinical_patient.txt) - Sample-level summaries (data_clinical_sample.txt) - Timeline files (data_timeline_specimen.txt)
Therefore, Redcap reports will ideally be constructed in a similar way: - Non-repeating instruments, patients (Every row contains data for each patient) - Repeating instruments, samples (Every row contains data for an sample id) - Repeating instruments, events (Every row contains data for a date/time stamp)
If Redcap report data does not fit this ideal setup, that's okay! Steps below can be taken to format the data. Once your Redcap study is setup, the following configuration files need to be created.
Configuration files for exporting data from Redcap¶
API mapping file¶
Purpose¶
This file is used as high-level instructions for code to perform these functions: - Export the Redcap reports of interest onto your server. - Decide which reports are needed for transfer and which are to be imported onto cbioPortal - Indicate how Redcap report data should be transformed to fit the cBioPortal format
Columns¶
- REPORT_NAME
- Name of report on redcap, but can be named anything.
- API_ID
- API ID number located in the Redcap report summary/edit page
- FOR_CBIOPORTAL
- Indication if Redcap report will be loaded onto cbioportal or if left as a csv at destination
- TIMELINE_LEVEL
- Column indicates that the Redcap report will be converted into a cBioPortal timeline file according to the specifications in the data transformation files. Entry for this column should be the filename for the data transformation file.
- PATIENT_LEVEL
- Column indicates that the Redcap report will be converted into a cBioPortal patient summary file. For direct mapping into a summary file, enter (x). Otherwise, entry for this column should be the filename for the data transformation file.
- SAMPLE_LEVEL
- Column indicates that the Redcap report will be converted into a cBioPortal sample summary file. For direct mapping into a summary file, enter (x). Otherwise, entry for this column should be the filename for the data transformation file.
- TIMELINE_TYPE
- Label for the expected cBioPortl timeline file type that the Redcap report will be converted to. Each cBioPortal format file will have to adhere to a specific format. The format types are: treatment, toxicity, and diagnosis (please see the cbioportal docs for more info: cBioPortal Docs)
Redcap key variable map file¶
Purpose¶
This file is used to identify the patient and sample ID mapping (record_id to a de-identified id), while also using the DOB as a marker for de-identifying dates to age, in days
Columns¶
- variable
- DOB column (col_dte_birth)
- record id column (col_darwin_id)
- de-identified ID column (col_sample_id)
- IMPACT sample id column (col_record_id)
- redcap_field
- Entries for each attribute defined in Redcap study
Data transformation files¶
Purpose¶
This file is used to determine the specifications for how data will be transformed and aggregated to fit the patient-, sample-, or timeline-level data files.
Columns¶
- COLUMN_NAME_REDCAP
- Column name used in Redcap
- COLUMN_NAME_CBIOPORTAL
- Column name that will be used in cbioportal ("heading" name). This name should be all CAPS. For timeline transformations, use START_DATE and END_DATE
- KEEP_FOR_CBIOPORTAL
- (This column will be deprecated) should be all (x).
- TIMELINE
- Marker if column shoud be included in a timeline file type. Mark as (x) to include.
- AGGREGATE_PATIENT
- Marker if column shoud be included in a patient-level file type. Mark as (x) to include. If report transformed with this file is a repeated instrument, notes can be used here to aggregate by other functions (max, min, mean, etc).
- AGGREGATE_SAMPLE
- Marker if column shoud be included in a sample-level file type. Mark as (x) to include. If report transformed with this file is a repeated instrument, unaligned with sample-level data, notes can be used here to aggregate by other functions (max, min, mean, etc).
- MELT_ID_VARS
- When melting a Redcap report, (ie. transforming data from patient-level to timeline values) mark the row (x) that indicates the patient ID that will be used.
- MELT_VAL_VARS
- When melting data, indicate the Redcap report columns to be used as date events. Note that COLUMN_NAME_CBIOPORTAL must be labeled as START_DATE or END_DATE when using.
- PIVOT_IDX
- (This column will be deprecated) When pivoting a Redcap report (ie. transforming data from repeated-instrument to patient-level values), mark the row (x) that indicates the patient ID that will be used.
- PIVOT_SUMMARY
- (This column will be deprecated) When pivoting data, mark the rows (x) that will be converted from multiple selection (dropdown, checkbox) into a pivot table of multiple columns of Yes/No
Exporting reports from your Redcap study¶
Once the API mapping configuration file has been created, Redcap reports can be pulled to a destination using:
data-curation/redcap_tools/redcap_api_report_pull.py
Requirements¶
Create a virtual environment containing these packages: - redcap - pandas - numpy - re - argparse
Alternatively, after creating a virtual environment, install the required packages using the requirements.txt file using
pip install -r /path/to/requirements.txt
Command Line¶
1 2 3 4 5 6 |
|
Example¶
1 2 3 4 5 6 |
|
Formatting Redcap reports in the cBioPortal format¶
Once data and codebook is exported from Redcap, data must be transformed and formatted into the cBioPortal format before importing. The cBioPortal format consists of three file types: - Patient-level summary - Sample-level summary - Timeline (event) data
Depending on the complexity of the Redcap reports being exported (i.e. reports from one instrument [simple] vs. reports from multiple instruments [complex]), built-in or custom made functions can be used.
Built-in functionality¶
Requirements¶
Create a virtual environment containing these packages: - pandas - numpy
Alternatively, after creating a virtual environment, install the required packages using the requirements.txt file using
pip install -r /path/to/requirements.txt
Python Function¶
1 2 3 4 5 |
|
Parameters:¶
1 2 3 4 5 6 7 8 9 |
|
Attributes:¶
1 2 3 4 5 |
|
Returns:¶
1 2 3 4 |
|
Example¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Custom Summary Data Formatting¶
If Redcap report data, or any data file requires custom aggregation, cBioPortal formatting can be done manually. Two files need to be generated:
1) The datafile containing summary data 2) The cBioPortal summary header file
In addition, one other file needs to be modified or created: 3) summary_manifest_patient.csv and/or summary_manifest_sample.csv
Header file¶
The header file is a dataframe containing 5 columns:
1) label
: The text shown on the portal summary page
2) comment
: The "hover-over" text shown when mouse hovers over widet
3) data_type
: Data type of columns (STRING or NUMBER)
4) visible
: Mark 0 or 1 to make visible on summary page
5) heading
: Column names contained in the datafile (below). These columns MUST match the datafile!
For summary and header files, the heading
column must be PATIENT_ID
or SAMPLE_ID
. In addition, the label
column must contain #Patient Identifier
or #Sample Identifier
.
Datafile¶
The datafile contains data to be imported. The column header (first row) must match the heading
values in the header file.
For summary files, the first column must be PATIENT_ID
or SAMPLE_ID
Modifying the manifest file¶
The patient and sample manifest files contain three columns:
1) REPORT_NAME
: The name of the summary file created. Typically taken from API map file, but can be named anything.
2) SUMMARY_FILENAME
: Complete path, filename of the summary datafile.
3) SUMMARY_HEADER_FILENAME
: Complete path, filename of the corresponding header file.
For each set of data and header files, a new row must be entered into the manifest file. This data is required when merging summary data!!
Merging formatted files for datahub import¶
Once summary data and header files are created, a Python script is required to merge the summary files.
Built-in merge functionality¶
Requirements¶
Create a virtual environment containing these packages: - pandas
Alternatively, after creating a virtual environment, install the required packages using the requirements.txt file using
pip install -r /path/to/requirements.txt
Python Function¶
1 2 3 |
|
Parameters:¶
1 2 3 4 5 |
|
Attributes:¶
1 2 3 4 5 |
|
Example¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Manual summary merging¶
Manual merging can be done using the class cBioPortalSummaryMergeTool
Source
More under construction...