Data submission, use, and access
POLICIES AND PLAN FOR DATA SUBMITTED TO THE DCC AND SHARED ON THE T2D KNOWLEDGE PORTAL
The Broad Institute will serve as the Data Coordinating Center (DCC) for AMP-T2D. On behalf of AMP-T2D, the DCC will aggregate data, support analyses, and continue to update capabilities to disseminate results relevant to the genetics of type 2 diabetes (T2D) and related traits, while coordinating collaboration within the AMP-T2D Project. The DCC will also be responsible for sharing the results from the data coordination and analysis activities on the AMP-T2D Knowledge Portal (T2DKP).
Data Aggregation, Analysis, and Resource Distribution
As the DCC for the AMP-T2D Consortium, the Broad Institute (along with its AMP-funded partners) intends to (a) serve as the gateway to a large (and growing) aggregation of data relevant to the genetics of type 2 diabetes and its complications; (b) perform and automate analyses required to interpret those data; and (c) communicate results to diverse audiences via an open-access Web Portal (T2DKP), presentations, and publications. Each of these goals involves distinct categories of resources and activities that we will share and/or manage:
We will aggregate data on behalf of AMP-T2D. Data aggregated under this effort will not be generated by the Knowledge Portal development work funded through the Portal-specific grants, but rather obtained from other investigators and repositories who wish to collaborate on and contribute to this effort, including other investigators funded by the wider AMP-T2D program. Moreover, the role of the Portal will not be to redistribute individual-level data, but rather to generate results attained via standard and customized queries that can be widely shared with the scientific community. Because the primary, individual-level data are neither generated by this project, nor redistributed to other users, the role of the Portal is limited to secure intake, storage and management, automated analyses, and dissemination of results in summary (i.e., not individual-level) form while complying with intended use of the data and all relevant regulations.
We will focus on three classes of data: individual-level genotypes, individual-level phenotypes, and external precomputed results or annotations (e.g., results from individual studies or meta-analyses of multiple studies, processed annotations).
The data and results currently stored in the Portal, as of September 2016, have either been generated at the Broad Institute as part of IRB-approved secondary use protocols or, in the case of meta-analysis results from published GWAS datasets, obtained in such a form that Broad was determined to be "not engaged in human subjects research" (per the criteria described in the U.S. Health and Human Services Office for Human Research Protection's 2008 Guidance on Engagement of Institutions in Human Subjects Research). All data have been de-identified prior to being sent to Broad; at no time and under no circumstances will investigators funded by this grant have information linking data back to subject identifiers.
For datasets subsequently added to the Portal, raw data will be obtained through formal NIH systems for data-sharing such as dbGaP, or directly from investigators who collected the data. It will be the decision of NIH and the AMP-T2D Steering Committee whether to accept into the Portal data that are not in dbGaP and, if so, the terms on which the data can be made accessible for analyses by other parties. For the raw data transferred to the DCC for representation on the Knowledge Portal, in all cases, we will obtain a Data Transfer Agreement (DTA) between the Submitter and the Broad Institute (outlined below). The DTA outlines the use and protection of the transferred data. The Submitting site will be responsible for ensuring that the datasets transferred to the Broad are consented for transfer, genetic analysis, and representation on the T2D Knowledge Portal. At the DCC, we will develop software for storing and managing datasets, and will not redistribute the raw data to any outside third parties. However, as part of our analysis process, we may send array data to the University of Michigan (an AMP-T2D funded site) imputation server for imputation purposes only. See Appendix C of our data transfer agreement for additional details. For bulk data (e.g., raw and harmonized individual-level genotype data), we will use object storage systems, with access controlled through application programming interfaces (APIs). Only authenticated and authorized users can access data; all such access is logged and auditable.
Additionally, we are currently building a database tool that captures data use restrictions for each dataset electronically using an ontology-based consent database, and can match those restrictions against potential research usage to ensure that only appropriate users can query specific datasets. Web-based tools, currently in development, support both the entry of data-use restrictions and the review of access requests. This will enable the aggregation of additional and more diverse datasets for the Portal.
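The matching idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record fields, the class and function names, and the example cohorts are hypothetical and do not describe the DCC's actual consent database or ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataUseRestriction:
    """Hypothetical, simplified restriction record for one dataset."""
    dataset: str
    general_research_use: bool = False
    allowed_diseases: frozenset = frozenset()
    non_profit_only: bool = False


@dataclass(frozen=True)
class ResearchPurpose:
    """Hypothetical description of what a requester intends to do."""
    disease: str
    commercial: bool = False


def is_permitted(restriction: DataUseRestriction, purpose: ResearchPurpose) -> bool:
    """Return True if the stated purpose is compatible with the restriction."""
    if restriction.non_profit_only and purpose.commercial:
        return False
    if restriction.general_research_use:
        return True
    # Otherwise the dataset is limited to an explicit list of diseases.
    return purpose.disease in restriction.allowed_diseases


def queryable_datasets(restrictions, purpose):
    """List the datasets a requester with this purpose may query."""
    return [r.dataset for r in restrictions if is_permitted(r, purpose)]


# Illustrative records only; these cohorts do not exist.
restrictions = [
    DataUseRestriction("cohort_a", general_research_use=True),
    DataUseRestriction("cohort_b", allowed_diseases=frozenset({"type 2 diabetes"})),
    DataUseRestriction("cohort_c", general_research_use=True, non_profit_only=True),
]
```

A real ontology-based system would likely reason over term hierarchies (for example, treating a complication as covered by a broader disease term) rather than comparing exact strings as this sketch does.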
Examples of datasets
- Current and future genetic studies of type 2 diabetes
- Current and future genetic studies of related quantitative traits
- Current and future studies of type 2 diabetes-related complications
- Annotations of function
Classes of datasets for storage and analysis in the Portal
- Those that do not require ethical and regulatory approval
- Results from publicly available datasets
- Summary statistics
- Access-controlled exports (dbGaP, EGA, etc.) of individual-level data, for which we will serve results and summary statistics
- Data generated directly at the Broad on de-identified DNA samples
- All de-identified individual-level genetic and phenotypic data generated externally
Data Transfer Agreement
For all de-identified individual-level genetic and phenotypic data generated externally (genotype, phenotypes, annotations, etc.) received by the Broad Institute as DCC for the AMP-T2D Knowledge Portal, we will execute a Data Transfer Agreement (DTA) with the submitting institution. We will ensure that the usage of the data is compliant with the Data Use Restrictions associated with the dataset. It will be the responsibility of the submitting site or institution to outline the appropriate Data Use Restrictions, as part of the executed DTA, for the data coming to the DCC for the Knowledge Portal. Below we outline the data use and analysis plan for the DCC.
Find complete information on data submission in our AMP T2D Knowledge Portal Submitter and Analysis Guide for Data at the DCC.
Data use and analysis
For all de-identified individual-level genetic and phenotypic data generated externally (genotype, phenotypes, annotations, etc.) received by the Broad Institute as DCC for the AMP-T2D Knowledge Portal, we will perform quality control assessment, harmonization, and association analysis for T2D and related traits. This process will be a collaborative "handshake" process with the submitter. At the completion of each major phase we will share a report and results with the submitter. All results must be approved by the submitting site before they are made available through the T2DKP. We have outlined our procedures for data use and the analysis steps in a document entitled "AMP T2D Knowledge Portal Submitter and Analysis Guide for Data at the DCC," which may be downloaded here.
We request a number of traits for association analysis at the DCC for the purposes of deposition in the AMP-T2D Knowledge Portal, including: type 2 diabetes, fasting insulin, fasting glucose, lipids (HDL, LDL, TG, total cholesterol), blood pressure, creatinine, BMI (height and weight), waist circumference, and lipid medications. We encourage submitters to share additional phenotypes, where approved by their local IRB/Ethics Committee (based on the cohort-level patient consent forms), as outlined in the AMP T2D Submitter’s Guide to Sending Data to the DCC.
Assuming the submitting site has approved all these traits for analysis and display on the T2DKP, we will perform association analysis in the following tiered manner:
- Phase 1: T2D status, fasting glucose, fasting insulin
- Phase 2: 2hr glucose, 2hr insulin, HbA1C, HOMA-B, HOMA-IR
- Phase 3: BMI, HDL, LDL, triglycerides, total cholesterol, diastolic blood pressure, systolic blood pressure, WHR, waist circumference, hip circumference, height
- Phase 4: Longitudinal and Complications data: specifications for these trait types have yet to be defined by the AMP-T2D Phenotype Working Group
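The tiering above can be expressed as a small lookup, sketched here under stated assumptions. The trait names are copied from the list; the function name `plan_analyses` and the case-insensitive matching are hypothetical choices for illustration, not the DCC's pipeline. Phase 4 is left out because its trait specifications have not yet been defined.

```python
# Phases 1-3 as listed above; Phase 4 traits are not yet specified.
ANALYSIS_PHASES = {
    1: ["T2D status", "fasting glucose", "fasting insulin"],
    2: ["2hr glucose", "2hr insulin", "HbA1C", "HOMA-B", "HOMA-IR"],
    3: ["BMI", "HDL", "LDL", "triglycerides", "total cholesterol",
        "diastolic blood pressure", "systolic blood pressure", "WHR",
        "waist circumference", "hip circumference", "height"],
}


def plan_analyses(approved_traits):
    """Group the submitter-approved traits by analysis phase.

    Traits that the submitting site has not approved are skipped, and
    phases with no approved traits are omitted from the plan.
    """
    approved = {trait.lower() for trait in approved_traits}
    plan = {}
    for phase, traits in ANALYSIS_PHASES.items():
        selected = [t for t in traits if t.lower() in approved]
        if selected:
            plan[phase] = selected
    return plan
```

For example, a site that approved only T2D status, fasting glucose, and BMI would get a plan with entries for Phase 1 and Phase 3 and nothing for Phase 2.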
The DCC will share the results of all approved analyses directly with the submitter upon completion for review. We will partner to address any quality control matters or confounders in the data before deposition in the T2DKP. Once the results are finalized, the DCC will make the data available for query via the T2DKP.
The individual-level data sent by data submitters, stored, and analyzed by the DCC will never be shared with Portal users; only results will be shared. The individual-level and summary data will reside in one or several data vaults behind a secure firewall. User-activated analytical modules will be deployed behind the firewall to analyze the data or query precomputed results. The Portal will provide results in response to queries for information, obtained from genetic analyses performed on the data. The purpose of the AMP-T2D web Portal will be to enable broad access to the comprehensive results of genetic studies of type 2 diabetes, related traits, and diabetic complications. To ensure that the web Portal is effective in allowing access to results and data – both within AMP-T2D and with the broader biomedical research community – we will develop an interface to provide access to results in a form designed to meet user needs while maintaining the individual data privacy requirements, and will engage Portal users in assessing the value of these features.
Results from studies included in the Portal will be available genome-wide (i.e., not limited to "top hits"), and results from different studies and types will be integrated and presented simultaneously. In most cases the results on the T2DKP will be queryable by study. For studies representing multiple cohorts with differential data sharing approvals, we will display the approved results by cohort. Where a study represents multiple ethnicities, we will also allow query of the results by cohort-reported ethnicity. Metadata and other technical details (e.g., analysis parameters, explanations of terms, documentation of methods) will be available at lower levels of drilldown.
Resource Distribution and Sharing
We will share software, methods, and code developed as part of consortium efforts. Specifically, we envision three types of sharing: (a) sharing of software source code; (b) sharing services; and (c) sharing of effort between groups with the intention of maintaining or extending existing software.
Sharing of software source code
We are producing open-source software under the terms of the BSD 3-Clause open-source license. As such, this code will be freely available for use by any other parties; the software will be supplied “AS IS” with no implied warranty or promises of support. We will maintain a GitHub repository from which interested parties can download the source code. The source code of the AMP-T2D Portal, entitled Framework for investigating Genetic Associations (FGA), is located here.
Our software will be constructed as a distributed system in which computers communicate using standard protocols (HTTP for transport, with REST as an organizing principle and data payloads defined with JSON), with well-defined interfaces specific to the computational topics addressed by each computer system. These services will in principle be accessible by any other party willing to adopt the conventions used by our services. To the extent that data on these services may be under privacy and use restrictions, these running services will be designed to provide information only in forms that protect privacy and security, or in secure, encrypted mode for other parties with permission to access and receive the information.
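To make the JSON-over-HTTP convention more concrete, the sketch below shows what the body-handling part of such a service might look like. It is an assumption-laden illustration: the field names (`variant`, `trait`, `summary`), the function `handle_query`, and the stored values are all hypothetical placeholders, not the Portal's actual API or data. It omits the HTTP transport layer and focuses on the point made above, namely that responses carry only summary-level information.

```python
import json

# Hypothetical precomputed summary statistics (placeholder values only).
SUMMARY_RESULTS = {
    ("example_variant", "T2D"): {"p_value": 1e-8, "beta": 0.12, "n_samples": 10000},
}


def handle_query(request_body: str) -> str:
    """Answer a JSON query with summary-level results only.

    The request is expected to be a JSON object with 'variant' and
    'trait' keys; the response is always a JSON object with a 'status'.
    """
    try:
        query = json.loads(request_body)
        key = (str(query["variant"]), str(query["trait"]))
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"status": 400,
                           "error": "expected JSON with 'variant' and 'trait'"})
    result = SUMMARY_RESULTS.get(key)
    if result is None:
        return json.dumps({"status": 404, "error": "no results for this query"})
    return json.dumps({"status": 200, "variant": key[0], "trait": key[1],
                       "summary": result})
```

Because the handler only ever reads from a table of precomputed summaries, no individual-level record can appear in a response, whatever the query contains.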
Sharing of effort between groups to maintain or extend existing software
The Portal architecture will be designed to facilitate front-end contributions (e.g., extensions of existing widgets for data exploration) from a wide community of developers. Data or computations for REST servers will be encapsulated as loosely coupled "plug-in" modules that may be written in different languages (e.g., Python, JVM-based languages, shell scripting). This approach anticipates the contribution of computational modules from other individuals and groups, both within and outside of the AMP consortium.
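One common way to realize loosely coupled plug-in modules is a registry that maps a name to a callable, sketched below. This is an illustrative assumption rather than the Portal's actual design: `PLUGINS`, `register`, `dispatch`, and the example `count_variants` module are hypothetical names, and the sketch covers only Python plug-ins living in the same process.

```python
from typing import Callable, Dict

# Registry of computational modules, keyed by a public name.
PLUGINS: Dict[str, Callable[[dict], dict]] = {}


def register(name: str):
    """Decorator that adds a function to the plug-in registry."""
    def decorator(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        if name in PLUGINS:
            raise ValueError(f"plug-in {name!r} is already registered")
        PLUGINS[name] = func
        return func
    return decorator


@register("count_variants")
def count_variants(request: dict) -> dict:
    """Example module: count the variants named in a request."""
    return {"count": len(request.get("variants", []))}


def dispatch(name: str, request: dict) -> dict:
    """Route a request to the named plug-in."""
    try:
        plugin = PLUGINS[name]
    except KeyError:
        raise LookupError(f"no plug-in named {name!r}") from None
    return plugin(request)
```

Modules written in other languages would need a process or network boundary (for instance, the REST interfaces described earlier) in place of the in-process function call used here.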
Policies for Data Release, Accessing the Portal, and Terms of Conduct
Data processing and availability (applicable for both data coming to the DCC and to Federated nodes)
We will have a 3-stage process for data release on the Portal. The figure below outlines the stages and timelines. The stages are:
(1) Data Deposit: The DCC (and Federated nodes) will receive data from submitters on an ongoing basis. The Data deposition stage has several components that must be completed for the data to be ready for release into the Portal. These are:
- Data use agreements and ethical approvals for data transfer to DCC (or the Federated nodes) and release into Portal.
- Physical transfer of data and all meta-data, in required formats, into Data Intake System at the DCC (or at a Federated node).
- Data storage, curation, QC, and harmonization.
In general, upon depositing data into the Portal, a QC filter on genotypes and phenotypes will be deployed as per standard operating procedures in the field. Data will only be available after these initial filters are applied. Filters will include automated steps and final human curation, as determined by the AMP-T2D investigative team.
(2) Early Access Period: After the Data deposition steps are complete, data will be released to the Portal. This denotes the start of the Early Access Period. This 6-month time period will be divided into 2 phases. The first 3 months will be the "Early access phase 1" window where data are available to all users. At this point, project-sanctioned QC and analyses have been performed, but the data are NOT considered final or fully integrated on the Portal. The goal of the "Early access phase 1" time period is to allow users to review the data, to perform additional QC and analyses (ideally in a "crowd-sourced" manner), and to finalize and integrate the dataset. At the end of the first 3 months, the data will move to the "Early access phase 2" state, where data are considered final and fully integrated.
During this 6-month period, all analyses, results, and publications proposed are subject to the "Fort Lauderdale Principles" articulated for the sharing of genomic data. Users must not submit a manuscript concerning newly deposited data for publication until both phase 1 and phase 2 of the Early Access Period have ended.
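The timeline described above reduces to simple date arithmetic, sketched here. The sketch assumes that "3 months" and "6 months" mean calendar months counted from the release date, and that a phase ends at the start of the boundary day; neither detail is specified in this policy. The function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month, so
    30 November plus 3 months gives 28 (or 29) February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_status(released: date, today: date) -> str:
    """Classify a dataset by how long ago it was released to the Portal."""
    if today < released:
        return "not released"
    if today < add_months(released, 3):
        return "Early access phase 1"
    if today < add_months(released, 6):
        return "Early access phase 2"
    return "Early Access Period ended"


def may_submit_manuscript(released: date, today: date) -> bool:
    """Manuscripts may be submitted only after both phases have ended."""
    return release_status(released, today) == "Early Access Period ended"
```

This covers only the Early Access rule; datasets labeled "Unpublished" carry a separate restriction tied to publication of the primary paper.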
Data release timeline:
Data use and availability
Datasets labeled "Open access": All users are welcome to use results from analyses of these data to further their research without seeking explicit permission from the Portal team or funders. Users are also welcome to cite the data in scientific publications, provided that they cite the Portal as the source. If users are citing a single dataset represented in the Portal, they should cite both the Portal and the relevant paper for that dataset.
Datasets labeled "Unpublished": These data have been submitted to the Portal by authors in advance of publication in order to provide the immediate benefits of data access to the T2D research community. Portal users may explore these data via all of the Portal tools and interfaces, but are not permitted to submit for publication the results of any such analyses until the primary paper has been published.
Datasets labeled "Early access phase 1" and "Early access phase 2": all analyses, results, and proposed publications are subject to the "Fort Lauderdale Principles" articulated for the sharing of genomic data (see above).
To access the Portal, users must obtain a Google ID, which will be used for quality control (QC) and monitoring purposes (see the "User tracking" tab). In the future, should the AMP-T2D Consortium and Data Submitters agree, we may develop a more stringent registration process, requiring identification and authentication of the user and institutional affiliation.
Portal users are expected to abide by the following provisions on data use:
- Users will not attempt to download any dataset in bulk from the Portal
- Users will not attempt to identify or contact research participants
- Users will protect data confidentiality
- Users will not share any of the data with unauthorized users
- Users will report any inadvertent data release, security breach, or other data management incidents of which they become aware
- Users will abide by all applicable laws and regulations for handling genomic data
- Users will not submit a manuscript for publication until the Early Access Period is over (6 months after the clean dataset becomes available in the Portal), to allow for beta testing on the integrity of the dataset and finalization of the results on the Portal.
Agreeing to these provisions is a requirement of Portal use. Violating them may result in an NIH investigation and sanctions including revocation of access to the Portal.
Citing portal data
Users who wish to cite data in this Portal in a scientific publication should do so in the following format:
AMP-T2D Program; T2D-GENES Consortium, SIGMA T2D Consortium. Title of page you are citing. type2diabetesgenetics.org. Year Month Date of access; URL of page you are citing.
For instance, a user who viewed the Portal's page on the gene SLC30A8 on February 1, 2015, and wanted to cite it would use this citation:
AMP-T2D Program; T2D-GENES Consortium, SIGMA T2D Consortium. SLC30A8. type2diabetesgenetics.org. 2015 Feb 1; http://www.type2diabetesgenetics.org/gene/geneInfo/SLC30A8.
The Portal does not yet have a PubMed identifier.
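For users who cite many Portal pages, a citation in the layout of the SLC30A8 example above can be generated programmatically. The helper below is a convenience sketch, not an official Portal tool; the function name `format_citation` is hypothetical, and the field order is inferred from that example.

```python
from datetime import date

# English month abbreviations, hard-coded to avoid locale-dependent output.
MONTH_ABBREVIATIONS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

AUTHORS = "AMP-T2D Program; T2D-GENES Consortium, SIGMA T2D Consortium"
SITE = "type2diabetesgenetics.org"


def format_citation(page_title: str, url: str, accessed: date) -> str:
    """Build a citation string for one Portal page and access date."""
    when = f"{accessed.year} {MONTH_ABBREVIATIONS[accessed.month - 1]} {accessed.day}"
    return f"{AUTHORS}. {page_title}. {SITE}. {when}; {url}."


citation = format_citation(
    "SLC30A8",
    "http://www.type2diabetesgenetics.org/gene/geneInfo/SLC30A8",
    date(2015, 2, 1),
)
```

With these inputs, `citation` matches the SLC30A8 example citation shown above.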
Re-using written content on the portal
Except where otherwise noted, text on this site is licensed under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International License.
User tracking
The Portal team tracks a limited set of usage statistics. We do this to improve functionality based on how users interact with the Portal and to ensure that Portal data are being used properly (see our data use policy). Two types of people are allowed to view usage statistics at different levels of detail:
- Our website developer tracks de-identified, aggregate analytics (such as hit counts for specific pages) in order to improve the Portal's user experience. The developer does not view statistics attached to individual user accounts.
- NIH personnel may be asked to examine individual user histories in cases of suspected misuse of Portal data.