Infectious, Allergic, and Immune-Mediated Disease Data Resources: a Landscape Overview and Subset Assessment
Darya Pokutnaya, Lisa M. Mayer, Sydney Foote, Meghan Hartwick, Sepideh Mazrouee, Willem G. Van Panhuis, Reed Shabman

TL;DR
This paper provides an overview of data resources for infectious, allergic, and immune-mediated diseases to help researchers comply with NIH data sharing requirements.
Contribution
The paper compiles and assesses data resources for infectious, allergic, and immune-mediated diseases to support NIH Data Management and Sharing Plans.
Findings
58 data resources focused on infectious, allergic, and immune-mediated diseases were identified.
19 of these resources support data submission and are suitable for NIH DMS Plans.
Inconsistent data submission requirements and management practices were observed across resources.
Abstract
The Data Management and Sharing (DMS) Policy issued by the National Institutes of Health (NIH) requires most grant applications to include a DMS Plan, detailing data type(s), resources (e.g., data repositories, knowledgebases, portals) for data sharing, and a dissemination timeline. Researchers face challenges navigating the complex data landscape to identify data resources to fulfill the DMS Policy requirements. The National Institute of Allergy and Infectious Diseases (NIAID) aims to support researchers in preparing DMS Plans for applications that align with its mission areas. To support depositing and accessing infectious, allergic, and immune-mediated disease (IID) data, we compiled a list of IID data resources. The list was developed by reviewing online resources and collecting recommendations from subject matter experts. Additionally, we developed a questionnaire based on NIH…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3- —National Institute of Allergy and Infectious Diseases (NIAID)
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsResearch Data Management Practices · Biomedical Text Mining and Ontologies · Scientific Computing and Data Management
Background
To promote the transparency, accessibility, and usability of scientific data, the National Institutes of Health (NIH) implemented the Data Management and Sharing (DMS) Policy (1). The policy requires a DMS Plan for most grant applications that describes the data resource(s) (e.g., data repositories, knowledgebases, portals) where data derived from the corresponding project will be deposited. Navigating the landscape of data resources, including generalist resources that accept a wide variety of data types and domain-specific resources that accept data from a particular field, is a time-consuming task (2).
The NIH provides several recommendations for choosing appropriate data resources to include in a DMS Plan, including considerations of scientific discipline, data type, volume, and long-term access, storage, security, and reuse (3–5). These characteristics align with frameworks such as the FAIR (Findable, Accessible, Interoperable, Reusable) and TRUST (Transparency, Responsibility, User Community, Sustainability, and Technology) Principles, CoreTrustSeal requirements, Office of Science and Technology Policy (OSTP) Desirable Characteristics of Data Repositories for Federally Funded Research, as well as other domain-specific repository evaluations (3, 6–9). The NIH also offers several discipline-specific tools, including the National Cancer Institute Data Catalog and the National Institute of Child Health and Human Development Data Repository Finder. Despite these recommendations and tools, there is currently no dedicated tool for identifying data resources specific to infectious and immune-mediated disease (IID) research. As a result, IID researchers, including principal investigators, data managers, and grant writers, continue to face challenges in identifying data resources that align with the NIH guidelines outlined in the DMS Plan.
Given this lack of targeted tools for IID research, deciding where to submit data can be an obstacle for researchers. These decisions are often influenced by multiple factors, including data use limitations, the types of data being shared, sustainability of the resource, and the ease with which others can discover and use the data. Many biomedical researchers lack formal training in informatics, making it difficult to evaluate resources effectively (10). To help mitigate this challenge, our curated list emphasizes resources that are commonly used and recommended within the IID resource community, reflecting real-world use cases and practical relevance. We also categorize resources by data submission acceptance, scientific content, and access features to help users better understand and navigate data resources. The assessment and questionnaire presented in this manuscript can serve as a valuable example of how researchers might assess potential resources for hosting their data, and could also inform NIH DMS guidance, for example through an updated resource finder with disease-specific filters and richer information on resource characteristics.
Our goal is to characterize data resources to assist IID researchers in deciding where to deposit data. Recognizing the complexity of this decision, shaped by factors such as data submission requirements, metadata standards, and access features, we aim to (1) describe data resources that store IID data, (2) provide resources to help researchers develop DMS Plans, and (3) highlight resources that contain datasets suitable for secondary analyses.
We identified 58 IID-specific data resources and conducted a comprehensive assessment of 19 that support data submission using a questionnaire developed in this study. Our assessment focused on key attributes that support long-term data access, storage, security, and data reuse, as these features enhance data accessibility, usability, and interoperability. Together, these findings provide an overview of the current IID data resource landscape and establish a foundation for guiding researchers in selecting resources that meet their needs while supporting future IID data sharing and reuse.
Methods
Initial Landscape Review of Infectious, Allergic, and Immune-mediated Disease Data Resources
Data resources were identified between November 2022 and March 2025 through a curated, expert-informed review of publicly available websites and in consultation with NIAID-affiliated subject-matter experts (SMEs) including data scientists, NIAID Program staff who oversee and coordinate extramural grants and contracts, as well as NIAID-funded researchers involved in the generation, management, or storage of IID data (Fig. 1). SMEs were consulted via email or during regularly scheduled meetings and asked to share data resources commonly used by their teams. These recommendations were supplemented through a review of publicly available websites that listed IID-related data resources. Through SME consultation and website reviews conducted over several years, we added resources until new suggestions largely repeated existing entries, at which point the list was considered saturated using our expert-informed approach. The resources included were not limited to NIH- or NIAID-funded resources (Supplementary Table 1).Fig. 1. Infectious and immune-mediated (IID) data resource inclusion and exclusion criteria workflow
Data resources classified as having primarily IID data were evaluated for exclusion criteria. Resources were excluded if they were nested within larger resources, if they were no longer accessible, lacked information about data access, or if the resource only contained reference materials. The remaining resources were described further, identifying features such as the primary diseases or pathogens captured, scientific content, data access requirements, and data submission capabilities. Scientific content was categorized using terms and definitions based on the National Library of Medicine Medical Subject Headings (Table 1). Data access requirements were classified as either open, registration required, or controlled access based on established definitions (11). Each data resource was identified as either accepting or not accepting data submissions. Resources that accepted data submissions were subset for the questionnaire assessment.Table 1. Data resource scientific content terms, definitions, and NLM MeSH entries referenced when developing the definitionsScientific Content TermDefinitionMeSH EntriesBiological assay dataConsists of data that measure the effects of a biologically active substance using an intermediate in vivo or in vitro tissue or cell model under controlled conditionsBiological AssayBiospecimensRepresents tissue samples retained from their initial research or medical purpose in a biorepositoryBiological Specimen BanksClinical dataComprises patient data collected through medical care such as medical records and results from diagnostic techniques and proceduresMedical RecordsEpidemiological dataIncludes data from or related to studies in epidemiology (e.g., data related to causes, incidence, and characteristics behaviors of disease outbreaks affecting human populations)EpidemiologyImaging dataConsists of image data, such as outputs from diagnostic imaging or microscopyDiagnostic ImagingLaboratory chemicalsDescribes chemicals used or produced in laboratory research, such as reagents or drug formulationsLaboratory ChemicalsMetadata catalogContains metadata describing datasets which are stored elsewhereMetadataOmics dataContains data from genomics, proteomics, metabolomics, multi-omics, and related disciplinesGenomics, Proteomics, Metabolomics, MultiomicsSoftwareIncludes downloadable software or computational toolsSoftwareAbbreviations: NLM, National Library of Medicine; MeSH, Medical Subject HeadingsNote: All definitions for scientific content terms were sourced from the MeSH database
Questionnaire Development and Subset Assessment
Following our external assessment, a 23-item questionnaire was developed based on key criteria from the FAIR and TRUST Principles, Office of Science and Technology Policy (OSTP) Desirable Characteristics of Data Repositories, CoreTrustSeal requirements, and several domain-specific repository evaluations (3, 6–9). We sought consensus across these sources to reflect commonly recognized elements of high-quality data resources. Furthermore, to promote consistency and objectivity, questions were designed to avoid subjective language and be answerable with a clear yes/no based on publicly available documentation. The questionnaire was developed to help researchers identify suitable resources as part of their DMS Plans and report on key characteristics that may influence resource selection.
The questionnaire items were categorized into four groups: (1) Data access and submission, (2) Identification, provenance and quality assurance, (3) Data retrieval and analytical tools, and (4) Documentation and compliance. Co-authors LM and DP each independently reviewed all IID data resources that accept submissions by completing the questionnaire using publicly available documentation on the data resources’ websites. This included content accessible without logging in, as well as information available to users who created a free account using an email address. No direct communication with the data resource staff occurred. Discrepancies between reviewers were resolved through discussion.
Data Access and Submission
For each resource, data access and submission were classified using three categories. “Submission Allowed with Registration/Account” included resources that required users to register for an account or sign up for the platform before submitting data. “Submission Allowed with Additional Approval or Contracts” defined resources that require additional steps beyond registration, such as contracts, agreements, or formal approval like those from an Institutional Review Board (IRB). “Submission Allowed with Membership” applied to data resources that require users to join a specific network or program prior to data submission. In addition to these classifications and previously described data access requirements, the questionnaire captured whether resources provided open metadata, supported authentication of data submitters, enforced formatting and size limitations for data submission, and whether fees were associated with data deposition.
Identification, Provenance, and Quality Assurance
Resources were categorized as using persistent, local, or no identifiers to describe their data. Persistent identifiers were defined as stable, long-term references to digital objects, which can include Digital Object Identifiers (DOIs) or Internationalized Resource Identifiers/Uniform Resource Locators (IRIs/URLs) (12). Local identifiers were defined as those that are only guaranteed to be unique within the resource itself. The questionnaire also assessed whether each data resource tracks the provenance of metadata and data, and whether there is expert curation or quality assurance support once the data are submitted. We considered a resource to support provenance tracking if it publicly documented any aspect of data versioning, including submission dates, version history, or automated update mechanisms. This broad approach meant that we considered a resource to have provenance tracking in place if it documented any practice along the spectrum, ranging from systems where users manually updated files with version numbers and deposit dates to those with automated provenance tracking.
Curation and quality assurance was considered present when documentation indicated that the data underwent additional review prior to release. This included mention of data curation or harmonization steps (e.g., processing data into standardized formats or integrating with existing datasets), data quality checks or preprocessing pipelines, or review teams that would follow up with submitters to verify data and metadata.
Data Retrieval and Analytical Tools
The third section of the questionnaire assessed the presence of data retrieval and analytical tools. Resources were evaluated based on whether the data or metadata could be accessed through an Application Programming Interface (API) or downloaded onto the user’s machine. It also evaluated whether the site provided any analytical tools or a dedicated workspace for analysis. If a workspace was available, additional questions addressed associated costs with maintaining or analyzing data and whether researchers could use their own tools within the workspace.
Documentation and Compliance
The final section of the questionnaire assessed whether the data resources provided documentation on risk management, data retention policies, and security measures to protect against unauthorized access or modification based on data sensitivity. It also evaluated whether the resource outlined its terms of data use.
This review identified and described IID data resources, supported researchers in selecting appropriate repositories for DMS Plans, and highlighted resources that may have data suitable for secondary analyses. The assessment of IID resources that allow data submission focused on attributes that promote long-term data access, storage, security, and reuse to enhance overall data accessibility, usability, and interoperability.
Results
Initial Landscape Assessment of Infectious, Allergic, and Immune-mediated Disease Data Resources
We performed a landscape assessment to identify data resources considered IID-specific. The complete list of the 303 data resources and website URLs is provided in Supplementary Table 1. Of these, 197 were excluded because the data they contained were not related to IID. The remaining 106 resources were then screened against additional exclusion criteria, which removed an additional 48 resources, including 22 nested resources, four that were inaccessible due to broken links, 11 that lacked available data or access guidelines, and 11 that contained only reference materials. This process yielded a ncurated set of 58 IID-specific data resources.
We summarized the subset of 58 IID data resources, including the primary disease or pathogens captured, scientific content categories, data access categories, and data submission acceptance (Table 2). Additional information such as resource abbreviations and URLs are found in Supplementary Table 2. Of the 58 data resources, most were categorized by their primary disease or pathogen as General Infectious Diseases and Pathogens (n = 29, 50%), Respiratory Pathogens (n = 10, 17%), and HIV/AIDS (n = 8, 14%) (Fig. 2 and Supplementary Table 3). The ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database was categorized as having both Respiratory Pathogen and HIV/AIDS data. Five (9%) resources were categorized under Immunological and Autoimmune Diseases, and three (5%) under Hemorrhagic Fever Viruses. Arthropod-borne Pathogens, Aspergillosis, Papillomaviruses, and Hepatitis C were each classified as the primary disease or pathogen for a single data resource (2% each).Table 2. Core characteristics of identified infectious and immune-mediated (IID) data resources (n = 58) ordered by data submission acceptance and alphabeticallyData Resource NamePrimary Disease or PathogenScientific ContentData AccessData AcceptedAccessClinicalData@NIAIDRespiratory PathogensBiological Assay; Clinical; EpidemiologicalControlledYesCenter for International Blood & Marrow Transplant ResearchImmunological and Autoimmune DiseasesBiospecimens; ClinicalControlled; OpenYesClinEpiDBGeneral Infectious Diseases and PathogensClinical; EpidemiologicalControlled; OpenYesCOVID RADx Data HubRespiratory PathogensBiological Assay; Clinical; Epidemiological; OmicsControlledYesDatabase of Genotypes and PhenotypesGeneral Infectious Diseases and PathogensOmicsControlledYesGlobal Initiation Sharing All Influenza DataRespiratory PathogensClinical; Epidemiological; OmicsRegistrationYesHIV Prevention Trials NetworkHIV/AIDSBiological Assay; Biospecimens; Clinical; Epidemiological; OmicsControlled; OpenYesImmPortGeneral Infectious Diseases and PathogensBiological Assay; Clinical; OmicsControlled; RegistrationYesInfectious Diseases Data ObservatoryGeneral Infectious Diseases and PathogensClinical; Epidemiological; OmicsControlledYesInternational Committee Taxonomy of VirusesGeneral Infectious Diseases and PathogensMetadata CatalogOpenYesMalaria Genomic Epidemiology NetworkArthropod-borne PathogensEpidemiological; OmicsControlled; OpenYesmapMECFSImmunological and Autoimmune DiseasesOmics; Epidemiological; Biological AssayControlledYesNational Center for Biotechnology Information VirusGeneral Infectious Diseases and PathogensOmicsOpenYesNational COVID Cohort CollaborativeRespiratory PathogensBiological Assay; Clinical; EpidemiologicalControlledYesPathoplexusHemorrhagic Fever VirusesOmicsOpenYesQiitaGeneral Infectious Diseases and PathogensOmicsRegistrationYesStructural Database of Allergenic ProteinsImmunological and Autoimmune DiseasesOmicsOpenYesTB PortalsRespiratory PathogensClinical; Epidemiological; Imaging; OmicsControlledYesUniversity of Santa Cruz Genome BrowserGeneral Infectious Diseases and PathogensOmicsOpenYesVDJServerGeneral Infectious Diseases and PathogensOmicsOpenYesACTG/IMPAACT Specimen RepositoryHIV/AIDSBiospecimens; ClinicalControlledNoAspergillus Genome DatabaseAspergillosisOmicsOpenNoBacDiveGeneral Infectious Diseases and PathogensBiological AssayOpenNoBacterial and Viral Bioinformatics Resource CenterGeneral Infectious Diseases and PathogensOmicsOpenNoBEIResourcesGeneral Infectious Diseases and PathogensLaboratory ChemicalsControlled; OpenNoBioCycGeneral Infectious Diseases and PathogensOmicsOpenNoBiological General Repository for Interaction DatasetsGeneral Infectious Diseases and PathogensOmicsOpenNoCenter for Viral Systems BiologyHemorrhagic Fever VirusesBiological Assay; Biospecimens; Clinical; Epidemiological; OmicsOpenNoChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics DatabaseHIV/AIDS; Respiratory PathogensLaboratory ChemicalsOpenNoCOVID-19 Research DatabaseRespiratory PathogensClinical; Epidemiological; OmicsControlledNoDatabase of Antimicrobial Activity and Structure of PeptidesGeneral Infectious Diseases and PathogensOmicsOpenNoData Discovery Engine-registered DatasetsGeneral Infectious Diseases and PathogensMetadata CatalogOpenNoThe Global Health ObservatoryGeneral Infectious Diseases and PathogensEpidemiologicalOpenNoHemorrhagic Fever Viruses Database ProjectHemorrhagic Fever VirusesBiological Assay; OmicsOpenNoHepatitis C Virus Database ProjectHepatitis CBiological Assay; OmicsOpenNoHeterogeneity in Human Immune CellsGeneral Infectious Diseases and PathogensBiological AssayOpenNoHIV DatabasesHIV/AIDSBiological Assay; OmicsOpenNoHIV Vaccine Trials NetworkHIV/AIDSBiospecimens; ClinicalControlledNoHuman Microbiome Project PortalGeneral Infectious Diseases and PathogensOmicsOpenNoImmune Epitope DatabaseGeneral Infectious Diseases and PathogensBiological AssayOpenNoImmuneSpaceGeneral Infectious Diseases and PathogensBiological Assay; OmicsOpenNoImmune Tolerance Network TrialShareGeneral Infectious Diseases and PathogensBiospecimens; ClinicalControlledNoImmunological Genome ProjectGeneral Infectious Diseases and PathogensBiological Assay; OmicsOpenNoThe Institute for Genome Sciences at the University of Maryland School of Medicine Genomic Center for Infectious DiseasesGeneral Infectious Diseases and PathogensOmicsOpenNoiReceptorGeneral Infectious Diseases and PathogensBiological AssayRegistrationNoMACS/WIHS Combined Cohort StudyHIV/AIDSBiospecimens; Clinical; EpidemiologicalControlledNoMTB Network PortalRespiratory PathogensOmics; SoftwareOpenNoMicrobicide Trials NetworkHIV/AIDSClinicalControlledNoMycobrowserRespiratory PathogensOmicsOpenNoNCATS Open Data PortalRespiratory PathogensBiological Assay; ClinicalOpenNoOpen Germline Receptor DatabaseImmunological and Autoimmune DiseasesOmicsOpenNoPapillomavirus EpistemePapillomavirusesOmicsOpenNoProject TYCHOGeneral Infectious Diseases and PathogensEpidemiologicalOpenNoStanford University HIV Drug Resistance DatabaseHIV/AIDSBiological Assay; Clinical; OmicsOpenNoUnited States Immunodeficiency NetworkImmunological and Autoimmune DiseasesBiospecimens; Clinical; OmicsControlledNoVaccine Investigation and Online Information NetworkGeneral Infectious Diseases and PathogensBiological Assay; Metadata Catalog; OmicsOpenNoVDJbaseGeneral Infectious Diseases and PathogensOmicsOpenNoVEuPathDBGeneral Infectious Diseases and PathogensBiological Assay; Clinical; Epidemiological; OmicsOpenNoCharacteristics for each data resource including the resource name, primary disease or pathogen and scientific content (i.e., categories or types) of hosted data, data access status (either open, controlled, or registration indicating that registering an account with the data resource is necessary to view the data), and indication whether data can be deposited.Abbreviations: ACTG AIDS Clinical Trials Group, IMPAACT International Maternal Pediatric Adolescent AIDS Clinical Trial Network, HIV Human Immunodeficiency Virus, MACS Multicenter AIDS Cohort Study, WIHS Women’s Interagency HIV Study, TB Tuberculosis. Primary Infectious Disease: General Infectious Diseases and Pathogens indicates data resources include data relevant to various diseases and conditions, including infectious and immune-mediated disease dataFig. 2Data submission acceptance, scientific content categories, and data access categories of infectious and immune-mediated data resources (n = 58) Resource classification by data access
Resource Classification by Data Access
Thirty-four (59%) of the 58 IID resources provided only open access data (Fig. 2. and Supplementary Table 4). Fifteen (26%) resources contained controlled access data; access required additional measures beyond registration. Five (9%) resources offered both open and controlled access data. Three (5%) were classified as having registration only data, indicating that an account registration is required. One (2%) resource was categorized as having both registration and controlled access data, signifying that some data are available upon user registration, while others require additional steps for controlled access.
Resource Classification by Scientific Content and Data Submission Acceptance
Data resources contained scientific content from one or more of the categories. Thirty-eight (66%) included “omics” data (e.g., genomics, proteomics, metabolomics, multi-omics, and related disciplines) with 15 (26%) allowing for data submission. Twenty-one (36%) contained clinical data (e.g., medical records and results from diagnostic techniques and procedures), of which ten (17%) accepted submissions. Biological assay data appeared in 20 (34%) resources, with six (10%) enabling user submission and 16 (28%) had epidemiological data, ten (17%) of which allowed for submission (Fig. 3 and Supplementary Table 5). Additionally, eight (14%) resources featured biospecimens, with two (3%) accepting data deposition. Three (5%) were categorized as metadata catalogs, two (3%) featured laboratory chemicals, one (2%) provided a list of software, and one (2%) contained imaging data, which was the only resource among these that supported data submission.Fig. 3. Counts of data resources (n = 58) by scientific content categories and data submission acceptance. Percentages reflect the proportion of the total cell. Some resources are represented in multiple categories
Questionnaire and Subset Assessment
Out of the 58 IID data resources, we identified 19 (33%) that allow for data submission. Among them, eight (14%) required registration or an account prior to submission (Table 3). Seven (12%) required additional contracts or approvals, such as a data use agreement or IRB approval prior to accessing the data. Four (7%) required researchers to be part of a specific collaborative network or consortium to submit data.Table 3. Aggregate evaluation results of infectious and immune-mediated disease data resources that accept data depositsCategory#QuestionResponseSubmission Allowed with Registration/AccountSubmission Allowed with Additional Approvals or ContractsSubmission Allowed with Membership1) Data access and submission1.1Does the resource accept data submission?874YesNoNA1.2Does the data resource provide open access data?811-1.3Does the data resource require registration (e.g., email) for data access?712-1.4Does the data resource provide controlled access data?136-1.5Does the data resource provide open access metadata?154-1.6Does the data resource support authentication of data submitters?190-1.7Does the data resource have formatting requirement for data submission?109-1.8Does the data resource have size limit requirements for data submission?118-1.9Are there costs associated with depositing the data?118-2) Identification, provenance, and quality assurancePersistent IdentifierLocal IdentifierNo Identifier2.1Does the data resource assign each dataset an identifier? If yes, is it a persistent or local identifier?5113YesNoNA2.2Does the data resource have a system in place to track provenance to the (meta)data?145-2.3Does the data resource support expert curation or quality assurance to improve the accuracy and integrity of datasets and metadata?145-3) Data retrieval and analytical tools3.1Can the (meta)data be accessed through an API?127-3.2Can the user download the data to their local machine?190-3.3Does the data resource provide data analytical tools?1453.4Does the data resource provide a workspace?1093.5If there is a workspace, are these costs associated with maintaining data in the workspace?1993.6If there is a workspace, are users able to utilize their own analytical tools within the workspace?2893.7If there is a workspace, are there costs associated with analyzing the data in the workspace?01094) Documentation and compliance4.1Does the data resource provide documentation on risk management (e.g., data breach, natural disasters)?109-4.2Does the data resource provide documentation on its data retention policies?514-4.3Does the data resource have security policies in place that ensure protection against unauthorized access, modification, or release of data, with appropriate security levels based on data sensitivity?172-4.4Does the data resource provide documentation for its terms for data use?181-
Data Access and Submission
Seven (37%) of the 19 data resources accept data submissions with registration or an account, seven (37%) accept data submission with additional approval or contracts, and five (26%) require membership before submission. Six (32%) resources provide open access data, six (32%) require users to register, and 12 (63%) offer controlled access data. Regardless of data access, most data resources (n = 14, 74%) provide access to at least some metadata. All 19 data resources support authentication of the data submitter. Nine (47%) of the resources have public-facing documentation on formatting requirements for data submissions. One (5%) data resource specifies a size limit requirement, and one (5%) charges a fee for data deposition.
Identification, Provenance, and Quality Assurance
Five (26%) data resources assign persistent identifiers (e.g., DOIs) to each dataset, 11 (58%) resources use local identifiers specific to their platform, and three (16%) resources do not assign any dataset identifier. Most data resources (n = 14, 74%) have a system in place to track provenance of the metadata or data. Most resources (n = 14, 74%) also support expert curation and quality assurance to improve the accuracy and integrity of the data and metadata.
Data Retrieval and Analytical Tools
Out of the 19 data resources evaluated, 12 (63%) provide access to the metadata or data through an API. All 19 resources allow users to download data to their local machine. Thirteen (68%) offer at least one analytical tool, while nine (47%) provide a workspace. Among the data resources with workspaces, one (5%) requires payment to maintain data and two (11%) allow users to utilize their own analytical tools within the workspace. Notably, none charge a fee to analyze data in the workspace.
Documentation and Compliance
Nine (47%) data resources provide documentation on risks such as data breaches and natural disasters. Documentation on data retention policies is provided by five (26%) resources. Security policies ensuring protection against unauthorized access, modification, and release of data, with appropriate security levels based on data sensitivity, are in place for 16 (84%) data resources. All 19 resources provide some documentation outlining their terms for data use.
Discussion
Our assessment highlights the diversity and complexity of IID research, reflected in the wide range of data resources available. Selecting appropriate data resources for a DMS Plan is challenging and requires consideration of data types and formats, security, storage, retention policies, and the trade-offs between using a single or multiple resources for data deposition (2). These decisions affect data reuse, particularly if the resource is not widely recognized or lacks interoperability features (13). Early resource selection in the study process influences data and metadata formatting, access, and sharing policies. For example, repository requirements may influence informed consent language (14, 15). These considerations underscore the need to integrate data resource selection not only during DMS Plan preparation, but even earlier during the study design phase so that data management strategies are aligned from the outset and can support future sharing and reuse effectively.
Given the importance of selecting appropriate data resources early in the study design, our assessment highlights an imbalance in the availability of resources for IID research. While “omics” and clinical data are well-represented, other categories including imaging and biospecimens are notably underrepresented. These imbalances may stem from differences in investment, data sharing culture, and technical or ethical challenges. For example, the bioinformatics community was among the first to embrace data sharing, leading to the development of specialized “omics” resources (9, 16). In contrast, imaging data requires significant storage space and is difficult to de-identify (17). Sharing biospecimen data presents unique challenges due to need for strict ethical oversight, governance structures, and compliance with clinical and laboratory standards (18). As a result of these complexities, fields outside of “omics” may have fewer specialized resources available to support data deposition and long-term access. Although generalist resources such as Zenodo and FigShare remain options, domain-specific resources are better suited to support IID researchers by organizing data in ways that maximize discovery and utility.
We observed considerable variation in submission processes, access controls, metadata practices, and documentation quality in our subset assessment of 19 IID data resources that support data deposition. Data access models ranged from open to controlled, and authentication requirements varied from email registration to institutional approval. These variations in access and authentication have implications for both data reuse and DMS Plan development. While important to ensure data is appropriately protected, controlled access may delay reuse and secondary analyses. Complex authentication or institutional restrictions on data deposition can also complicate resource selection and should be considered early by researchers to ensure compliance and feasibility (19). Despite variation in data access, 15 of the 19 resources (Table 3) provided open-access metadata, enabling researchers to assess data relevance, structure, and quality before initiating access requests. This transparency is especially valuable for planning secondary analyses and selecting suitable resources during proposal development.
Data submission requirements varied across resources. In some cases, resources did not publicly provide guidance on file formatting and size. Researchers are asked in the DMS Plan to specify which standards, if any, will be applied to the scientific data and associated metadata, including data formats, data dictionaries, unique identifiers, and other documentation (1). However, this can be difficult to address when a data resource does not provide clear guidance on formatting, as the researcher is trying to align the data and metadata required for their study with resource guidelines. Beyond formatting, additional barriers included limited user support, unclear documentation, and administrative hurdles such as data use agreements. These barriers were not uniform with some providing straightforward submission processes with transparent guidance, while others required additional steps that may make it difficult for researchers to meet data sharing expectations. This lack of consistency reflects a broader fragmentation across the data sharing ecosystem. Addressing these barriers will necessitate more standardized submission guidelines, stronger metadata requirements and templates to support consistency, expanded user support and training, and a broader adoption of community standards (20). Increased coordination across resources could also address variability and make the submission process easier to navigate for IID researchers.
Practices for assigning dataset identifiers varied across resources. While some resources issued globally unique identifiers such as DOIs, others used local identifiers that may not be resolvable outside their original context or interoperable across platforms. In contrast, DOIs support consistent data citation, long-term accessibility, and integration across resources. Aligning with the FAIR Principles and recent NIH and OSTP guidance, resources are increasingly expected to assign unique, citable, persistent identifiers to support access and tracking of federally funded research (3, 9, 21). These differences highlight the importance of reviewing resource documentation before submitting a DMS Plan, and when needed, engaging directly with resource staff to ensure alignment with data sharing goals (22).
Provenance, or the origin and history of the data, differed by resource. We considered a resource to support provenance tracking if it publicly documented any aspect of data versioning This broad approach meant that we considered a resource to have provenance tracking in place if it documented any practice along the spectrum, ranging from systems where users manually updated files with version numbers and deposit dates to those with automated provenance tracking. Automated systems are significantly more reliable and consistent than manual methods, which are prone to human error (7). While variation in identifiers, tracking systems, and submission requirements may present challenges, they also offer researchers flexibility in selecting resources that best align with their data types, access needs, and management goals.
The variability in features highlights the importance of active support within resources to help researchers manage their data efficiently and prepare for DMS Plan submission. Fourteen of the 19 resources provided expert curation or quality assurance practices that support improvements to data and metadata post-submission (Table 3) (23). These practices not only support compliance with data sharing policies but also promote greater confidence in the reliability and reusability of shared data.
All resources allowed users to download data locally, and a majority supported access through APIs, offering flexibility in how metadata and data are accessed and integrated into workflows. Fourteen resources provided at least one built-in analytical tool, while over half offered workspace environments. Only one resource reported costs associated with maintaining data in the workspace, and none required payment for data analysis. These findings suggest that while workspace availability is not universal, they are generally low-cost and accessible when offered. However, only two of the resources with workspaces allowed researchers to utilize their own analytical tools. This limited flexibility in tool integration may influence resource selection for DMS Plans based on project-specific needs.
Documentation on risk management, data retention, and security policies was often difficult to locate and interpret across the 19 resources. While some level of risk management documentation, covering potential threats such as data breaches and natural disasters, was available, nine resources did not offer any. This disparity suggests that researchers may need to conduct additional assessments of risk management or reach out to data resource staff directly to ensure adequate protection against unforeseen events that could impact their data. Only five resources documented data retention policies, while the remaining 14 provided no clear guidance. This gap is important, as understanding data retention terms supports long-term project planning and data accessibility. Furthermore, DMS Plans require researchers to provide a timeline specifying how long scientific data will be available to others (1). In contrast, most resources provided some documentation on policies designed to protect data from unauthorized access. However, the phrase “appropriate security levels” in our assessment was interpreted broadly; we assumed that each resource’s verification process met the necessary security requirements unless documentation indicated otherwise. Most resources also included terms for data use, helping ensure that legal and ethical considerations for data sharing are clearly addressed.
Limitations
Limitations in this assessment arose from reliance on public documentation, flexible handling of variation between resources, and changes in resources over the course of the evaluation. The SMEs consulted to develop the list of resources were predominantly U.S.-based and NIAID-funded, potentially biasing the list toward NIH/NIAID-associated IID resources and limiting global coverage. Future work could include a more systematic review of resources or engaging in a broader, international pool of SMEs.
Our review relied solely on publicly available documentation, which may not capture all information about each resource (24). This limitation is inherent in the process that researchers also face when selecting a data resource for their DMS Plan or secondary analysis. Additionally, due to inconsistencies in documentation, the reviewers took a flexible approach, giving credit to the data resource if any publicly available documentation was found for each question. Future work could assess each question along a spectrum rather than a binary response, to better understand the nuances in documentation and implementation of each resource Another limitation relates to professional backgrounds and experiences of the reviewers. Our training in data science may have influenced the categorization of data types and interpretation of technical documentation. While this perspective likely impacted our review, it was applied consistently across resources, supporting comparability of results. In some instances, through conversations with SMEs, we were aware that certain resources included specific features. However, to reduce bias and ensure consistency, all information was verified by reviewers using only publicly available documentation. Future reviews would benefit from inclusion of reviewers from a broader range of institutions and disciplinary backgrounds.
Although the purpose of this review is to provide IID-specific information, we acknowledge that scientific discovery often requires integration from multiple domains. We have not included data resources with other scientific focuses that may help researchers develop new tools for diagnosis, treatment, or prevention of disease. For example, environmental data is highly relevant to allergic disease management and environmentally transmitted infections like nontuberculous mycobacteria, but no environmental datasets were included in this review. Exploring the integration of data resources from related areas of research would be an important direction for future work (25, 26).
Finally, changes in funding or infrastructure may result in inactive links or outdated resources provided in the tables. For example, during the assessment, we evaluated the original version of VDJ Server, which was later deprecated and replaced by VDJ Server 2. We intend to create a dynamic, publicly available list of IID resources, which will be maintained through a NIAID GitHub repository currently in development. This platform will support continuous updates, ensuring users have access to the latest information as changes occur.
Conclusions
Our assessment differs from prior studies in two ways: (1) it focuses specifically on IID data resources, and (2) it assesses each resource by describing the presence of relevant features for DMS Plan development and secondary data analysis. The findings highlight the diversity and flexibility of resources available to researchers, spanning “omics,” clinical, epidemiological, and biological assay data, but also underscoring the significant challenges posed by variability in submission requirements and data management practices. These challenges emphasize the need for greater transparency and standardization across data resources. Our assessment calls for efforts to simplify and standardize this information, enabling researchers to more easily evaluate and select appropriate resources when developing DMS Plans or seeking data for secondary analyses. Such improvement would enhance data findability and streamline data sharing in IID research.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary Material 1 (DOCX 41.1 KB)
Supplementary Material 2 (DOCX 73.4 KB)
Supplementary Material 3 (DOCX 41.8 KB)
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Office of The Director, National Institutes of Health. Final NIH Policy for Data Management and Sharing [Internet]. Jan 25, 2023. Available from: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
- 2White House. Desirable characteristics of data repositories [Internet]. May, 2022. Available from: https://www.whitehouse.gov/wp-content/uploads/2022/05/05-2022-Desirable-Characteristics-of-Data-Repositories.pdf
- 3National Institutes of Health. Selecting a Data Repository [Internet]. 2024 Oct. Available from: https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/dms/selecting-a-data-repository#selecting-a-data-repository
- 4National Institutes of Health. NOT-OD-21-016: Supplemental Information to the NIH Policy for Data Management and Sharing: Selecting a Repository for Data Resulting from NIH-Supported Research [Internet]. 2024 [cited 2024 Aug 27]. Available from: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-016.html
- 5Core Trust Seal [Internet]. 2017 [cited 2024 Aug 28]. Core Trust Seal Data Repositories Requirements. Available from: https://www.coretrustseal.org/why-certification/requirements/
- 6Murphy F, Bar-Sinai M, Martone ME. A tool for assessing alignment of biomedical data repositories with open, FAIR, citation and trustworthy principles. Naudet F, editor. PLOS ONE. 2021;16(7):e 0253538.10.1371/journal.pone.0253538 PMC 827016834242248 · doi ↗ · pubmed ↗
- 7National Institutes of Health. NIH Genomic Data Sharing Policy [Internet]. 2014 Aug. Available from: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-14-124.html
- 8Inter-university Consortium for Political and Social Research (ICPSR). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle. [Internet]. 2020. Available from: https://www.icpsr.umich.edu/web/pages/deposit/guide/index.html
