In September 2022, the Wellcome Trust commissioned the King’s College London team to conduct a global mapping of large-scale longitudinal datasets with the potential to facilitate transformative mental health research. For this project, we searched for and identified longitudinal datasets on mental health and other topics from across the world and various sectors. We reviewed each dataset individually, collecting information about its sample, geographical coverage, data sharing policies, data linkage, and more. We conducted this work in partnership with charities (MQ Mental Health Research), non-profit organisations (the Open Data Institute), and lived experience expert (LEE) groups, and worked with a range of national and international collaborators throughout. We shared our findings in a report for the Wellcome Trust in July 2023, which describes our search strategy in detail.
Throughout this process, we identified over 3,000 longitudinal datasets from across the world. The Atlas of Longitudinal Datasets shares information about these datasets to make them more discoverable to researchers worldwide. The list includes a range of observational longitudinal studies and datasets designed to capture diverse health outcomes, life events, and social factors. Key study designs include cohort studies, such as birth cohorts, twin cohorts, caregiver and child cohorts, household and community panel studies, and registry studies using linked administrative records and biobanks. We continue to work towards adding all of these datasets to the Atlas, and have so far prioritised datasets focused on mental health, large-scale datasets, and datasets from low- and middle-income countries.
Our list of longitudinal datasets is constantly growing as we identify more datasets through both active and passive approaches. Since publishing the report, we have added over 600 new datasets, primarily through a snowballing process in which reviewing datasets on our existing list leads us to discover additional ones. Conferences have also been valuable, with team members finding new consortia and initiatives featuring longitudinal datasets. Study teams have contacted us via email or our website to share datasets that they manage or are involved with, as well as other datasets that they know about.
In addition to these passive approaches, we continue to actively search for datasets by exploring nearly 300 repositories, which we regularly revisit and mine for longitudinal data. We also conduct internet searches using platforms like Google to locate more datasets. Details on the other methods used to identify the initial 3,000 datasets are available in the Landscaping report.
For each longitudinal dataset that we have identified, we search for and extract information (metadata) of interest to researchers, in a process we call reviewing. Reviewing involves identifying metadata from various sources, such as journal articles, study websites, aggregator websites, repositories, and reports.
We collect information about these datasets using manual methods enhanced by artificial intelligence (AI) processes, described below.
Our core approach is manual, focusing on accuracy, completeness, and verification for each review. This process allows us to gain valuable insight into the landscape of available data and understand how best to make it discoverable. While AI is a valuable tool, it cannot replace the depth of insight we achieve through hands-on, manual work.
We use a custom-made data entry system to allocate, review, and check reviews. Each step of the data entry process corresponds to one of three roles: data manager, reviewer, and data checker. We rotate these roles periodically.
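As an illustration, this workflow can be thought of as a simple state machine in which each dataset review moves between the three roles. The sketch below is hypothetical: our actual data entry system is custom-built, and the class, field, and status names here are chosen purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    """Hypothetical states a dataset review can move through."""
    ALLOCATED = auto()   # data manager assigns the dataset to a reviewer
    IN_REVIEW = auto()   # reviewer extracts metadata from sources
    IN_CHECK = auto()    # data checker verifies accuracy and completeness
    RETURNED = auto()    # checker sends it back to the reviewer for updates
    PUBLISHED = auto()   # checker publishes the entry to the Atlas


@dataclass
class DatasetReview:
    """Minimal record tracking one dataset through the workflow."""
    dataset_name: str
    reviewer: str
    checker: str
    status: Status = Status.ALLOCATED
    notes: list[str] = field(default_factory=list)

    def submit_for_check(self) -> None:
        self.status = Status.IN_CHECK

    def return_to_reviewer(self, reason: str) -> None:
        self.notes.append(reason)
        self.status = Status.RETURNED

    def publish(self) -> None:
        self.status = Status.PUBLISHED
```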
We use AI to enhance our manual data-gathering process. We have found that it can help with our initial searches for documentation, extracting information from documents, and validating the information that we have manually extracted.
We use Microsoft Copilot as our primary AI tool. We developed and iteratively refined a specific set of prompts to direct Copilot to consistently extract key information about the datasets from sources across the internet. The prompts require that every answer includes a source so that we can check the accuracy of its responses. While we use Copilot at this stage, AI technologies, particularly generative AI, are constantly evolving, so the methods we use will adapt accordingly.
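To illustrate, a prompt of this kind might look like the sketch below. This is not one of our production prompts; the wording and fields are illustrative only, but it shows the key constraint that every extracted answer must be accompanied by a verifiable source.

```python
# Illustrative prompt template (not our production wording).
# {dataset_name} is filled in for each dataset under review.
PROMPT_TEMPLATE = """\
You are helping to document the longitudinal dataset "{dataset_name}".
For each field below, answer from publicly available sources only:

1. Sample size and age range at baseline
2. Geographical coverage
3. Data access / sharing policy
4. Funders and primary institutions

Rules:
- Every answer MUST cite its source as a full URL.
- If you cannot find a field, reply "Not found" rather than guessing.
"""

prompt = PROMPT_TEMPLATE.format(dataset_name="Example Birth Cohort Study")
```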
We acknowledge that responses from generative AI can be unreliable, so we always check its answers manually using the sources it provides. As a result, we use it more like a second pair of eyes alongside our mostly manual process because, although generative AI models can make mistakes, they can also find information that we might miss when using purely manual methods.
We maintain accuracy, completeness, and clarity in the Atlas by assigning each dataset review to a “data checker” in the team. This role involves thoroughly reviewing each dataset entry to ensure all information is accurate, concise, and current. If any information is missing or inconsistent, the data checker returns it to the reviewer for updates. Once finalised, the data checker publishes the dataset to the Atlas platform. We also employ AI after manual data entry to confirm that all information is legible, and that no sources have been missed.
When reviewers cannot locate important details—such as data access information, funding sources, or primary institutions—we contact the study teams for verification.
We also conduct a seasonal cleaning process in which all entries are harmonised. This ensures consistency across the Atlas and provides an extra layer of error checking.
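As a simple illustration of what harmonisation can involve, the sketch below normalises inconsistent country labels to a single canonical form. The mapping and function names are hypothetical, and our actual cleaning covers many more fields than this.

```python
# Hypothetical harmonisation step: map inconsistent country labels
# to one canonical form so entries are consistent across the Atlas.
CANONICAL_COUNTRIES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "great britain": "United Kingdom",
    "usa": "United States",
    "u.s.a.": "United States",
    "united states of america": "United States",
}


def harmonise_country(raw: str) -> str:
    """Return the canonical country name for a raw entry."""
    key = raw.strip().lower()
    return CANONICAL_COUNTRIES.get(key, raw.strip())


assert harmonise_country(" U.K. ") == "United Kingdom"
assert harmonise_country("Nigeria") == "Nigeria"  # unmapped values pass through
```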
The Atlas is co-produced by our team of researchers, charity partners, and Lived Experience Experts (LEEs) recruited through MQ Mental Health Research.
Our LEE advisory group are based around the world, with representation from Nigeria, Sweden, the United Kingdom, and Zimbabwe, and bring expertise from personal or caregiving experience of mental health conditions, neurodiversity, or sensory impairments.
Since the beginning of the project, we have worked together to plan the creation and development of the Atlas. This collaboration has been integral to our process, and our LEEs have provided expert advice on aspects such as the Atlas' layout, language, and user-friendliness, ensuring its accessibility to a global audience.
This co-production continues to guide our work.