Leiden-CID: How to organize a large multimodal longitudinal dataset
During the Leiden-CID study, an immense amount of data was collected: two cohorts, twelve waves, over 120 measures, three different types of informants, and many different types of data (hair samples, saliva samples, video data, MRI and EEG data, questionnaires, etc.). Add three different processing stages and a metadata file for each separate data file, and you end up with more than 10,000(!) files that need to be organized.
So we needed a structured plan to make sure that anyone could easily work with our data, and that even 10 years from now everyone, including us, would still be able to understand it.
To get to this point, we had to face many challenges. The first:
What standardized structure to use?
We wanted to use a standardized format for our data that was intuitive and logical. However, the standardized data structures we came across did not work for our multimodal longitudinal dataset.
To illustrate, we looked at the Brain Imaging Data Structure (BIDS). This is a standardized way to structure MRI data and participant data. The MRI data is organized on an individual level, which means that every participant gets their own folder and files. There is also a format to add demographics or behavioral data on a measurement level. Yet with the large amount of data that we collected over a number of waves, we needed some deviations from the standardized procedure. As such, we decided to use one structure for our brain imaging data and a separate one for our behavioral data.
We used roughly the same naming conventions (in line with BIDS), and made sure to always use the same naming structure for subject id, so our ‘bids-data’ and ‘behavioral data’ could always be linked to each other.
But arriving at these naming conventions was not that straightforward either; there were many decisions to make. It was clear to us that the naming of files and variables needed to be as consistent as possible. Without consistent variable and file naming, it would not be possible to merge files, for example questionnaire data from several waves. We also wanted to make sure that the structure of names would facilitate automated processes. This is why we chose to use a BIDS-adapted naming convention.
File names: cohort_ses-[session]_task_respondent
Variable names: cohort_session_task_subpart task_q[question #]_respondent
The rules that we applied to the names are: no capital letters; distinct parts or chunks separated by an underscore; and, for file names, separate words within a chunk joined by a hyphen. But there was another challenge on the road: we couldn’t use hyphens in the variable names of our SPSS data files (part of the data was already in this format), as SPSS does not allow hyphens in variable names. While we will share .tsv files, it is important that anyone can use our data, including people who use SPSS and want to convert .tsv files to .sav files. Therefore, we decided to only use underscores in variable names.
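As a minimal sketch of how these rules could be checked automatically, the Python snippet below builds a file name from its chunks and validates names against the rules described above. The function names, the regular expressions, and the example values (cohort `c1`, session `w01`, task `example-task`, respondent `parent`) are assumptions made for illustration; they are not the project's actual tooling or measure names.

```python
import re

# Assumed reading of the rules: lowercase letters and digits only, chunks
# joined by underscores, hyphens allowed inside a chunk of a file name.
FILE_NAME_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*(_[a-z0-9]+(-[a-z0-9]+)*)*")
# Variable names: no hyphens at all, and a leading letter, because SPSS
# variable names must start with a letter.
VARIABLE_NAME_RE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def build_file_name(cohort: str, session: str, task: str, respondent: str) -> str:
    """Assemble cohort_ses-[session]_task_respondent and validate the result."""
    name = f"{cohort}_ses-{session}_{task}_{respondent}"
    if not FILE_NAME_RE.fullmatch(name):
        raise ValueError(f"file name breaks the naming rules: {name!r}")
    return name


def is_valid_variable_name(name: str) -> bool:
    """True if the name is lowercase, underscore-separated, and hyphen-free."""
    return VARIABLE_NAME_RE.fullmatch(name) is not None


def to_variable_chunk(file_chunk: str) -> str:
    """Swap hyphens for underscores so a file-name chunk can be reused in a variable name."""
    return file_chunk.replace("-", "_")


# Hypothetical example values, not real Leiden-CID identifiers.
file_name = build_file_name("c1", "w01", "example-task", "parent")
```

A check like this can run over every file and variable name before data is shared, so that a stray capital letter or hyphen is caught early rather than when files fail to merge.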
Another decision had to be made regarding processing stages. What kind of data were we going to share? It was clear to us that raw data (i.e., data without any adjustments, such as exported Qualtrics files) would never be shared, as this includes personal information such as names and addresses, and the privacy of participants is our top priority. We decided to make the distinction between raw data, processed data, and derivatives, and to share only the derivatives.
Raw data is data that has not gone through any sort of processing.
Processed data, for us, consists of raw data that has been ‘cleaned’, so personal information is removed, a data check has been performed, data cleaning remarks are saved, etc.
Derivative data will have gone through the full cleaning processes: our standard variables are added, including subject ids so all data can be linked, all labels and values are checked and consistently named, variables are recoded, and sum/scale scores are computed. Additionally, all data is converted to long format so files can easily be merged.
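To illustrate why long format makes merging easy, here is a minimal Python sketch using only the standard library. It assumes that long format means one row per subject per session, and that every derivative file carries the same two key columns. The column names (`subject_id`, `session`), the subject ids, and the sum-score variables are hypothetical.

```python
def merge_long(*tables):
    """Merge long-format tables on the shared (subject_id, session) key."""
    merged = {}
    for table in tables:
        for row in table:
            key = (row["subject_id"], row["session"])
            # Start a new output row for an unseen key, then add this table's columns.
            merged.setdefault(key, {"subject_id": key[0], "session": key[1]}).update(row)
    return [merged[key] for key in sorted(merged)]


# Two hypothetical derivative files: measure A was collected in two waves,
# measure B in only one.
measure_a = [
    {"subject_id": "sub-001", "session": "w01", "example_a_sum": 12},
    {"subject_id": "sub-001", "session": "w02", "example_a_sum": 15},
]
measure_b = [
    {"subject_id": "sub-001", "session": "w01", "example_b_sum": 7},
]

merged = merge_long(measure_a, measure_b)
```

Because both files use the same key columns, no file-specific logic is needed. Rows that appear in only one file simply lack the other file's variables.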
So, we were at the point where we had a good, logical data structure. We had consistent naming conventions, and we had decided on processing stages and on what data to share. This increased the accessibility of our data.
But we also needed information about the data, metadata, so that people can work with our data in a flexible way. This means that we wanted researchers to look through our data in whatever way they liked, to structure our data in whatever way works for them, and also to use whatever data extension works for them. Just because a structure is intuitive for one person, doesn’t mean it is the case for someone else.
We decided that JSON files were a good option for storing metadata, as they give us the flexibility and accessibility of our data that we were looking for. Within these JSON files we save general and specific information about data (see Template JSON file here).
- General information. The tags in the general information part give us flexibility in data structure and facilitate automatic processes on our data, such as making a codebook or metadata explorer.
- Specific information. The specific information gives us flexibility when it comes to using different data extensions. Not all data extensions (.tsv, .csv) retain variable-specific information such as labels, values, value labels, etc. The extensions that do (e.g., .sav) are not in line with our open science vision of accessibility, as they do not originate from an open source platform.
Therefore, all of this metadata is saved in JSON files that accompany every single data file we share. The variable information can then be attached to the data file again, for example through R.
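The text mentions R for re-attaching variable information; the same idea can be sketched in Python with the standard library. The JSON layout below (a `general` part with tags and a `variables` part with labels and value labels) is a hypothetical stand-in, since the actual template is not reproduced here, and the variable names and values are invented for illustration.

```python
import csv
import io
import json

# Hypothetical sidecar file; the real Leiden-CID template may be laid out differently.
sidecar_text = """
{
  "general": {"task": "example-task", "respondent": "parent", "session": "w01"},
  "variables": {
    "example_q1": {
      "label": "Example question 1",
      "value_labels": {"1": "never", "2": "sometimes", "3": "often"}
    }
  }
}
"""
# Stand-in for the .tsv data file that the sidecar accompanies.
data_text = "subject_id\texample_q1\nsub-001\t3\nsub-002\t1\n"

sidecar = json.loads(sidecar_text)
rows = list(csv.DictReader(io.StringIO(data_text), delimiter="\t"))


def labelled(row, variables):
    """Return a copy of the row with coded values replaced by their value labels."""
    out = {}
    for name, value in row.items():
        value_labels = variables.get(name, {}).get("value_labels", {})
        # Columns without value labels (such as subject_id) pass through unchanged.
        out[name] = value_labels.get(value, value)
    return out


labelled_rows = [labelled(row, sidecar["variables"]) for row in rows]

# The same sidecar can feed a simple codebook: one line per variable.
codebook = [f"{name}: {info['label']}" for name, info in sidecar["variables"].items()]
```

The data file itself stays a plain .tsv that any program can open, while labels and value labels remain available to anyone who wants them.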
In addition to JSON files, we also decided to use README files to save information that did not necessarily need to be used for automated processes, data analysis, etc. In our READMEs, we therefore save information about data processing, including steps taken, special cases, exclusions, and general data cleaning remarks. We also include information about the researchers who worked on the data file and the timeframe of processing.
How to organize your data structure?
We are not telling you our data structure is the only way to go; there is no one way of doing this. But it may save you a lot of time if we share our ideas, lessons learned, and templates on structuring our data.
Leiden Consortium on Individual Development - a template of our data structure
I think the most important thing I learned is to be consistent within your structure and to make sure there is room for flexibility. Therefore, I would advise making metadata files, such as JSON files, and a clear measure overview from the start. You do not need to worry about things changing later on (adding measures, changing names, etc.), because when structure and naming conventions are consistent, these changes can be processed quickly and automatically over all files with the help of JSON files.
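As a rough sketch of what such an automated change could look like, the Python snippet below renames one variable in every JSON metadata file in a folder. It assumes a hypothetical layout in which each JSON file has a `variables` object keyed by variable name; the file names and variable names are invented for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def rename_variable(folder: Path, old: str, new: str) -> int:
    """Rename a variable key in every JSON file under folder; return the number of files changed."""
    changed = 0
    for path in sorted(folder.rglob("*.json")):
        metadata = json.loads(path.read_text(encoding="utf-8"))
        variables = metadata.get("variables", {})
        if old in variables:
            variables[new] = variables.pop(old)
            path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
            changed += 1
    return changed


# Demonstration on a throwaway folder with two hypothetical metadata files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for session in ("w01", "w02"):
        sidecar = {"variables": {"example_q1": {"label": "Example question 1"}}}
        (folder / f"c1_ses-{session}_example-task_parent.json").write_text(
            json.dumps(sidecar), encoding="utf-8"
        )
    n_changed = rename_variable(folder, "example_q1", "example_q01")
    remaining = json.loads(next(folder.glob("*.json")).read_text(encoding="utf-8"))
```

This sketch only updates the metadata; the column header in the accompanying data file would need the same change, which is straightforward when file names follow one predictable pattern.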
To realize all this, it may be helpful to get help from a data manager or data specialist from the start of your project. That isn’t to say that researchers are not capable of implementing the aforementioned aspects, but it takes time, dedication, and specialization, and over the course of the project we learned that it is more efficient to have one person specifically assigned to this.
Another lesson that helped me a lot in working with the data is something a colleague of mine once said to me: “Be kind to your future self”. Any steps you take now to organize your data properly will help you and other researchers tremendously in the future and will ultimately improve the quality of research. And isn’t that what we all want?