It is October 26th, 2022 when the last completed questionnaire from the last participants in the last wave of data collection arrives in our mailbox. More than 10 years after the founders of the study started planning and recruiting participants, and after 8 years of full-time data collection by a team of over 200 researchers and students, resulting in more than 10,000 data files, the last planned data is collected. An amazing moment that we have all looked forward to for so long. Let’s start processing, analyzing and publishing this data! At the same time, we are well aware of the incredible value of this study: there are so many research questions to explore! Too many to explore ourselves, even if we had a thousand years to do so.
Within our team we quickly agreed that we would love to share the study with our colleagues in the field. We are all strong advocates of open science: we have published open access for years, and we now also publish preprints and pre-register our plans. We all have positive experiences with collaborations with (inter)national colleagues in the field. Yet we soon realized that sharing data openly is a different game. Where do we even start? This was quite a journey for us, and we would like to share what we have learned and the decisions we took to get where we are now in our aim to make the study publicly available in May 2023.
Our open science vision
In our exploration phase on how we could share the data of the Leiden-CID study, we first noticed that researchers have many different reasons for practicing open science. So as a start we had a number of brainstorm sessions within the team on why open science is specifically important to us. Three common themes emerged in our motivations to contribute to open science. First, we all believe that being transparent in all the steps of the study process will improve the quality of data and science. One of the doubts that was mentioned was that transparency also makes you vulnerable: sharing your findings can result in unsupportive comments on social media, or the dreaded feedback from reviewers. But simply acknowledging this hesitation and talking about it soon made us realize that this disadvantage does not outweigh the advantages. After all, we all want to answer scientific questions the best way we can, driven by curiosity and the wish to help understand well-being in children and adolescents. Sharing our practices will facilitate reproducibility and replication. Moreover, sharing all the steps in the scientific process will help other scientists conduct their research by learning from our practices. These practices often cannot be fully reported in scientific publications, although that would be very helpful for other researchers. We learned a lot in all these years of data collection, and we would like to prevent researchers from reinventing the wheel. This increases the value of the costly and time-consuming data collection. In addition, we are so grateful to our amazing participants, who have shared so much of their time and life with us. Sharing data also reduces participant burden: when existing data can be reused, fewer new participants need to be asked for the same time and effort.
Accessibility and findability
Second, we ran into a specific challenge: although we all have programming experience to a smaller or larger degree, we simply do not have the time to keep up with all the new tools that become available. Some of these tools are inaccessible to us because of their complexity, and we lack the time to acquire the skills needed to use them. So it was really important to us that our open science products are accessible to many different scientists from various backgrounds, thereby appreciating diversity in training, talents, and profiles (read more on this topic here [link to VSNU article on diversity]). We therefore decided to include tutorials on data processing and analysis that are accessible to researchers with different levels of programming expertise, and to make the steps to get access to our protocols and data as easy as possible. Third, all of these open science ideas wouldn’t have much impact if no one knew we exist. Therefore, findability is another important aspect of our vision.
Challenges from vision to practice
So yes, we have now developed a vision in which transparency, findability and accessibility are central. We realized that we do not only want to share the final product of our study (data), but also our experiences and best practices in data collection and management (process). But in doing so we came across a number of challenges. Some are specific to our study, as it is so large and includes many different data types that were collected longitudinally. So for some of our aims there are no suitable open science solutions out there (yet).
Challenge 1: Centrality
First, bits of our study are all over the internet: scripts on GitHub, publications with journals, data packages on Dataverse, pre-registrations and protocols on OSF. We needed a central place where everything comes together, a central hub that would be easy to find. This is where the idea for this website was born.
Challenge 2: Privacy of our participants
The privacy of our participants is our highest priority. Therefore, we will not be able to share the full dataset with collaborators. What we can do is share anonymized parts of the data, which greatly reduces the traceability of our participants. We were therefore looking for a data sharing platform that would enable us to do so, and that could accommodate our enormous dataset (over 100 different types of measures of 1000 children and their parents over 6 waves of data collection). This requires specific solutions, such as storage capacity, archiving, and restricted permission settings. Moreover, we want the data to be accessible to all our users, including those with limited programming skills. No platform currently available solves this for us. As such, we are developing our own: in close collaboration with SURF and our university support staff, we are building add-ons to existing platforms that meet our needs. The early test phase is planned for September 2022, so stay tuned.
Challenge 3: Organizing an accessible/findable database
By now you probably understand how extensive the Leiden-CID dataset is. For such a large dataset, it is relatively well organized, following plans created by the data managers who were involved at the start of the study. Nevertheless, the research team that worked on this data was large, and over the years many researchers and assistants came and went. You can imagine that there are inconsistencies in the way the data is stored. So once all the data was collected, we needed a structure in which to save it. Yet no existing method fitted our multimodal longitudinal dataset perfectly. So we developed an adapted version of the existing Brain Imaging Data Structure (BIDS). This allows for easy access to the data, in a way that lets users decide for themselves how to organize it. For example, say you’d like to know which assessments were used to measure ‘cognitive control’; then you may want to organize the data by assessment type. Yet if you want to know which measures we collected in the 6 waves of the middle childhood cohort, you might prefer an organization by wave, laid out as a timeline. To achieve this flexibility we have created .json files that contain all relevant metadata for each measure that we assessed. We are developing an interactive metadata explorer to make the metadata of this study accessible. Another advantage of .json files is that we can connect them to the data as well as to other metadata platforms, such as CD2. This will at the same time increase our visibility. Read more about the CD2 metadata initiative.
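To give a feel for how per-measure .json metadata can support both views, here is a minimal Python sketch. It is an illustration only, not our actual implementation: the measure names, the field names ("construct", "type", "waves") and the group_measures helper are all made up for this example, and the records are kept in one string instead of being read from separate .json files.

```python
import json
from collections import defaultdict

# Hypothetical metadata, one record per measure. In a real setup each record
# would live in its own .json file; all names and fields here are invented.
EXAMPLE_METADATA_JSON = """
[
  {"measure": "task_a", "construct": "cognitive control",
   "type": "behavioral task", "waves": [1, 3, 5]},
  {"measure": "questionnaire_b", "construct": "cognitive control",
   "type": "questionnaire", "waves": [1, 2]},
  {"measure": "mri_scan_c", "construct": "brain structure",
   "type": "MRI", "waves": [1, 3, 5]}
]
"""


def group_measures(records, field):
    """Group measure names by the value(s) of one metadata field.

    A field may hold a single value (e.g. "construct") or a list of
    values (e.g. "waves"); a measure is listed under every value it has.
    """
    groups = defaultdict(list)
    for record in records:
        value = record.get(field)
        if value is None:
            continue  # skip measures that lack this field
        values = value if isinstance(value, list) else [value]
        for item in values:
            groups[item].append(record["measure"])
    return dict(groups)


records = json.loads(EXAMPLE_METADATA_JSON)

# View 1: which measures relate to a given construct?
by_construct = group_measures(records, "construct")
print(by_construct["cognitive control"])

# View 2: which measures were collected in each wave (timeline view)?
by_wave = group_measures(records, "waves")
for wave in sorted(by_wave):
    print(wave, by_wave[wave])
```

Because the grouping key is just a field name, the same metadata can be reorganized by construct, by assessment type, or by wave without touching the data files themselves.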
Our planning
So what are our plans for the next year? With the launch of the website in March 2022, we will start posting our content online. We will begin by sharing blog posts (including tutorials and lessons learned), protocols, and scripts. In May 2022, we plan to add an extensive publication overview where you will be able to find all the articles that have been published about the Leiden-CID data so far. This overview will also contain links to DOIs of publications, pre-registrations, preprints, scripts, and data packages. In July 2022 we will share a publication package guide. As we consider publication packages to be crucial in open science, we feel that it is important to give researchers a helping hand. We also aim to upload a preview of the codebook. In September 2022 we will test a first version of the data sharing platform we are currently developing. By October 2022, all Leiden-CID data will be fully organized. At that point we will publish the complete version of the codebook. As our data will become open access in May 2023, DRFs can be submitted from March 2023 onward. In May 2023, a large part of the processed data of Leiden-CID will become open access! Simultaneously, the platform we will use to share our data will be introduced. Note that not all data can be shared due to privacy regulations.
Get in touch!
We hope that by sharing our journey and products we can help researchers in the field. All of our tools are free to use. If you have any questions or suggestions, please connect with us on Neurostars or by e-mail.