In September 2022, the Wellcome Trust commissioned the King’s College London team to conduct a global mapping of large-scale longitudinal datasets with the potential to facilitate transformative mental health research. For this project, we searched for and identified longitudinal datasets on mental health and other topics from across the world and various sectors. We reviewed each dataset individually, collecting information about its sample, geographical coverage, data sharing policies, data linkage, and more. We conducted this work in partnership with charities (MQ Mental Health Research), non-profit organisations (the Open Data Institute), and lived experience expert (LEE) groups, and worked with a range of national and international collaborators throughout. We shared our findings in a report for the Wellcome Trust in July 2023, which describes our search strategy in detail.
Throughout this process, we identified over 3,000 longitudinal datasets from across the world. The Atlas of Longitudinal Datasets shares information about these datasets to make them more discoverable to researchers worldwide. The list includes a range of observational longitudinal studies and datasets designed to capture diverse health outcomes, life events, and social factors. Key study designs include cohort studies, such as birth cohorts, twin cohorts, caregiver and child cohorts, household and community panel studies, and registry studies using linked administrative records and biobanks. We continue to work towards adding all of these datasets to the Atlas, and have so far prioritised datasets focused on mental health, large-scale datasets, and datasets from low- and middle-income countries.
Our list of longitudinal datasets is constantly growing as we identify more datasets through both active and passive approaches. Since publishing the report, we have added over 600 new datasets, primarily through a snowballing process in which reviewing datasets on our existing list leads us to discover additional ones. Conferences have also been valuable, with team members finding new consortia and initiatives featuring longitudinal datasets. Study teams have contacted us via email or our website to share datasets that they manage or are involved with, as well as other datasets that they know about.
In addition to these passive approaches, we continue to actively search for datasets by exploring nearly 300 repositories, which we regularly revisit and mine for longitudinal data. We also conduct internet searches using platforms like Google to locate more datasets. Details on the other methods used to identify the initial 3,000 datasets are available in the Landscaping report.
For each longitudinal dataset that we have identified, we search for and extract information (metadata) of interest to researchers, in a process we call reviewing. Reviewing involves identifying metadata from various sources, such as journal articles, study websites, aggregator websites, repositories, and reports.
We collect information about these datasets using manual methods enhanced by artificial intelligence (AI) processes, described below.
Our core approach is manual, focusing on accuracy, completeness, and verification for each review. This process allows us to gain valuable insight into the landscape of available data and understand how best to make it discoverable. While AI is a valuable tool, it cannot replace the depth of insight we achieve through hands-on, manual work.
We use a custom-made data entry system to allocate, review, and check reviews. Each step of the data entry process corresponds to one of three roles: data manager, reviewer, and data checker. We rotate these roles periodically.
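As an illustration, this workflow can be thought of as a simple state machine in which each dataset review moves between the three roles. The sketch below is hypothetical: our actual data entry system is custom-built, and the class, field, and status names here are chosen purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    """Hypothetical states a dataset review can move through."""
    ALLOCATED = auto()   # data manager assigns the dataset to a reviewer
    IN_REVIEW = auto()   # reviewer extracts metadata from sources
    IN_CHECK = auto()    # data checker verifies accuracy and completeness
    RETURNED = auto()    # checker sends it back to the reviewer for updates
    PUBLISHED = auto()   # checker publishes the entry to the Atlas


@dataclass
class DatasetReview:
    """Minimal record tracking one dataset through the workflow."""
    dataset_name: str
    reviewer: str
    checker: str
    status: Status = Status.ALLOCATED
    notes: list[str] = field(default_factory=list)

    def submit_for_check(self) -> None:
        self.status = Status.IN_CHECK

    def return_to_reviewer(self, reason: str) -> None:
        self.notes.append(reason)
        self.status = Status.RETURNED

    def publish(self) -> None:
        self.status = Status.PUBLISHED
```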
We use AI to enhance our manual data-gathering process. We have found that it can help with our initial searches for documentation, extracting information from documents, and validating the information that we have manually extracted.
We use Microsoft Copilot as our primary AI tool. We developed and iteratively refined a specific set of prompts to direct Copilot to consistently extract key information about the datasets from sources across the internet. The prompts require that every answer includes a source so that we can check the accuracy of its responses. While we use Copilot at this stage, AI technologies, particularly generative AI, are constantly evolving, so the methods we use will adapt accordingly.
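To illustrate, a prompt of this kind might look like the sketch below. This is not one of our production prompts; the wording and fields are illustrative only, but it shows the key constraint that every extracted answer must be accompanied by a verifiable source.

```python
# Illustrative prompt template (not our production wording).
# {dataset_name} is filled in for each dataset under review.
PROMPT_TEMPLATE = """\
You are helping to document the longitudinal dataset "{dataset_name}".
For each field below, answer from publicly available sources only:

1. Sample size and age range at baseline
2. Geographical coverage
3. Data access / sharing policy
4. Funders and primary institutions

Rules:
- Every answer MUST cite its source as a full URL.
- If you cannot find a field, reply "Not found" rather than guessing.
"""

prompt = PROMPT_TEMPLATE.format(dataset_name="Example Birth Cohort Study")
```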
We acknowledge that responses from generative AI can be unreliable, so we always check its answers manually using the sources it provides. As a result, we use it more like a second pair of eyes alongside our mostly manual process because, although generative AI models can make mistakes, they can also find information that we might miss when using purely manual methods.
We maintain accuracy, completeness, and clarity in the Atlas by assigning each dataset review to a “data checker” in the team. This role involves thoroughly reviewing each dataset entry to ensure all information is accurate, concise, and current. If any information is missing or inconsistent, the data checker returns it to the reviewer for updates. Once finalised, the data checker publishes the dataset to the Atlas platform. We also employ AI after manual data entry to confirm that all information is legible, and that no sources have been missed.
When reviewers cannot locate important details—such as data access information, funding sources, or primary institutions—we contact the study teams for verification.
We also conduct a seasonal cleaning process in which all entries are harmonised. This ensures consistency across the Atlas and provides an extra layer of error checking.
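As a simple illustration of what harmonisation can involve, the sketch below normalises inconsistent country labels to a single canonical form. The mapping and function names are hypothetical, and our actual cleaning covers many more fields than this.

```python
# Hypothetical harmonisation step: map inconsistent country labels
# to one canonical form so entries are consistent across the Atlas.
CANONICAL_COUNTRIES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "great britain": "United Kingdom",
    "usa": "United States",
    "u.s.a.": "United States",
    "united states of america": "United States",
}


def harmonise_country(raw: str) -> str:
    """Return the canonical country name for a raw entry."""
    key = raw.strip().lower()
    return CANONICAL_COUNTRIES.get(key, raw.strip())


assert harmonise_country(" U.K. ") == "United Kingdom"
assert harmonise_country("Nigeria") == "Nigeria"  # unmapped values pass through
```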
The Atlas is co-produced by our team of researchers, charity partners, and Lived Experience Experts (LEEs) recruited through MQ Mental Health Research.
Our LEE advisory group are based around the world, with representation from Nigeria, Sweden, the United Kingdom, and Zimbabwe, and bring expertise from personal or caregiving experience of mental health conditions, neurodiversity, or sensory impairments.
Since the beginning of the project, we have worked together to plan the creation and development of the Atlas. This collaboration has been integral to our process, and our LEEs have provided expert advice on aspects such as the Atlas' layout, language, and user-friendliness, ensuring its accessibility to a global audience.
This co-production continues to guide our work.