Genomic Resources for Cancer Epidemiology

Note: This web page provides links to research resources that may be of interest to genetic epidemiologists conducting cancer research, but is not exhaustive. Within each section, the resources are listed in alphabetical order.

If you have suggestions for additional resources to add, please contact nciepimatters@mail.nih.gov.


Data Resources, Genotyping and Sequencing Centers, and NCI- and NIH-Sponsored Networks and Programs

  • Databases and Catalogues of Genetic Variation
    • 1000 Genomes Project
      The goal of the 1000 Genomes Project is to provide a comprehensive resource on human genetic variation. The Project is sequencing the genomes of approximately 2,500 samples at 4x coverage, to provide data on genetic variants with frequencies of at least 1% in the populations studied.
    • Cancer Genome Anatomy Project
      NCI's Cancer Genome Anatomy Project (CGAP) sought to determine the gene expression profiles of normal, precancer, and cancer cells. Interconnected modules provide access to all CGAP data, bioinformatic analysis tools, and biological resources.
    • Database of Genomic Structural Variation (dbVar)
      dbVar is the NCBI central repository for structural variation. Structural variation is generally defined as any region of DNA involved in inversions and balanced translocations, insertions and deletions, or copy number variation.
    • Database of Single Base Nucleotide Substitutions (dbSNP)
      dbSNP is the NCBI central repository for single base nucleotide substitutions (SNPs) and short deletion and insertion polymorphisms.
    • Encyclopedia of DNA Elements (ENCODE) data
      ENCODE provides a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
    • Exome Aggregation Consortium (ExAC)
      A coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community.
    • Genetic EUropean VAriation in Health and DISease (GEUVADIS)
      A medical sequencing consortium committed to gaining insights into the human genome and its role in health and medicine by sharing data, experience, and expertise in high-throughput sequencing.
    • Genome of the Netherlands
      A project to characterize DNA sequence variation among 250 Dutch families in order to create an ultra-sharp genetic group portrait of the Dutch.
    • Known VARiants (Kaviar)
      A compilation of human genetic mutations and variants that is designed to facilitate testing for the novelty and frequency of observed variants.
    • National Heart Lung and Blood Institute (NHLBI) Exome Variant Server (EVS)
      The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of exome sequencing across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community. The current EVS data release represents all variants identified from exome sequencing of 6503 ESP samples.
    • SweGen Variant Frequency Database
      A database of whole-genome variant frequencies for 1,000 Swedish individuals that can be used as a resource for the research community and clinical genetics laboratories.
    • UK10K
      A high-quality whole-genome sequence data repository that includes 24 million rare variants from nearly 4,000 European-ancestry individuals.
  • Genomic Datasets for Cancer Research: Datasets and Access Policy
    • Data Access Request Process
      This page contains instructions for submitting a Data Access Request for dataset(s) under the purview of the NCI's Extramural Data Access Committee.
    • Database of Genotypes and Phenotypes (dbGaP)
      dbGaP was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype. Such studies include GWAS, medical sequencing, molecular diagnostic assays, as well as studies of associations between genotype and non-clinical traits. dbGaP provides two levels of access, open and controlled, in order to allow broad release of non-sensitive data, while providing oversight and investigator accountability for sensitive data sets involving personal health information.
    • Genomic Datasets for Cancer Research
      This page provides information on a variety of datasets from genome-wide association studies (GWAS) of cancer and other genotype-phenotype studies, including sequencing and molecular diagnostic assays. These data are available to approved investigators through the National Cancer Institute (NCI)'s Extramural Data Access Committee (DAC).
    • Genomic Data Sharing Policy Home Page
      To promote sharing of human and non-human genomic data and to provide appropriate protections for research involving human data, the National Institutes of Health (NIH) issued the Genomic Data Sharing (GDS) Policy on August 27, 2014. This NCI webpage supports the policy's implementation.
  • Genotyping and Sequencing Centers
    • Cancer Genomics Research Laboratory (CGR)
      NCI established the CGR to investigate the contribution of germline genetic variation to cancer susceptibility and outcomes. Working in concert with epidemiologists, biostatisticians and basic research scientists in the intramural research program, the CGR has developed the capacity to conduct genome-wide association studies and next-generation sequencing to identify the heritable determinants of various forms of cancer.
    • Center for Inherited Disease Research (CIDR)
      CIDR provides high-quality next generation sequencing and genotyping services to investigators working to discover genes that contribute to common diseases.
    • Mendelian Genome Centers
      These Centers, funded by the National Human Genome Research Institute (NHGRI), apply next-generation sequencing and computational approaches to discover the genes and variants that underlie Mendelian conditions, including certain forms of cancer.
    • National Human Genome Research Institute (NHGRI) Large Scale Sequencing Program
      NHGRI funds large-scale genome sequencing capacity at several centers located in the U.S. This program undertakes sequencing projects to provide critical genomic information that can be of significant value to the scientific community in areas of very broad scientific interest.
  • Literature and Knowledge Base Resources
    • Genetic Testing Registry
      The Genetic Testing Registry (GTR®) provides a central location for voluntary submission of genetic test information by providers. The scope includes the test's purpose, methodology, validity, evidence of the test's usefulness, and laboratory contacts and credentials.
    • HuGE Navigator
      The Navigator is an integrated, searchable knowledge base of genetic associations and human genome epidemiology.
    • Pharmacogenomic Resources
      This page provides links to pharmacogenomics collaborative opportunities, consortia, and networks; databases related to pharmacogenomics research; knowledge synthesis resources; reports; and toolkits.
    • SEQanswers
      SEQanswers was founded to be an information resource and user-driven community focused on all aspects of next-generation genomics. The site aims to be a central location for next generation sequencing technology discussion and education. The site will always attempt to cater to everyone, regardless of scientific background or knowledge.
  • NCI/NIH Sponsored Networks and Programs
    • Cancer Genetics Markers of Susceptibility (CGEMS)
      CGEMS was launched to identify common inherited genetic variations associated with risk for breast and prostate cancer. It involves genome-wide association studies (GWAS) for a number of cancers, and more recently, exposures and survival. The raw genotype data from each of the CGEMS projects will be available for download to accredited investigators, upon approval of a Data Access Request.
    • Environmental Polymorphism Registry (EPR)
      The EPR is a long-term research project to collect and store DNA from up to 20,000 North Carolinians in a biobank. The DNA samples are available to scientists to study variations in genes (known as polymorphisms) that might be linked to common diseases such as diabetes, heart disease, cancer, asthma and others. While many types of genes are studied as part of the EPR, the focus is on a category known as environmental response genes.
    • Genes, Environment and Health Initiative (GEI)
      The GEI is an NIH-wide initiative that aims to accelerate understanding of genetic and environmental contributions to health and disease. There are two components to GEI: genetics and exposure biology. The genetics component includes a genome-wide association program called GENEVA (Gene Environment Association Studies).
    • Genetic Associations and Mechanisms in Oncology (GAME-ON)
      GAME-ON comprises five NCI sponsored cooperative agreements for transdisciplinary research projects addressing two overall goals: 1) To pursue promising scientific leads from previously generated GWAS of cancer; and 2) To coordinate and accelerate integrative post-GWAS discovery research, which could provide the basis for expediting clinical translation and public health dissemination of the findings.
  • Data Repositories with Functional Genomics Data
    • ArrayExpress
      A database housing gene expression data from microarray and high-throughput sequencing studies. Data are submitted directly to ArrayExpress or imported from Gene Expression Omnibus.
    • Gene Expression Omnibus (GEO)
      A functional genomics data repository that stores microarray, next-generation sequencing, and other high-throughput functional genomics data and makes them accessible to the research community.
    • Multiple Tissue Human Expression Resource (MuTHER)
      A resource of genome-wide association, sequencing, expression, and methylation data from a range of tissues, intended to help researchers understand the molecular basis of genetic susceptibility.
    • The Cancer Genome Atlas (TCGA)
      A catalogue of tumor data from various cancers and subtypes that is available to the research community for understanding the molecular basis of cancer.

Analytical Tools and Statistical Software

  • Analysis Tools
    • Broad Institute Software Tools
      Scientists in the Broad community have developed many critical software tools for the analysis of increasingly large genome-related datasets, and they make these tools openly available to the scientific community. Includes GATK and Haploview.
    • Genetic Simulation Resources (GSR)
      This web tool provides a catalogue of existing computer simulation programs that simulate genetic data of the human genome for studies in population and evolutionary genetics, genetic epidemiology, and other relevant application areas. It contains computer programs that generate samples by simulating evolutionary processes backward (coalescent) or forward in time, resampling empirical data, or using other novel methods. It is intended to aid in selecting the most appropriate genetic simulation tools for specific genetic epidemiology questions.
    • Genome Variation Server (GVS)
      GVS provides information on allele frequencies, linkage disequilibrium, tagSNP selection and SNP summaries. Fed by a local database, GVS enables rapid access to human genotype data found in dbSNP, and provides tools for analysis of genotype data.
    • NGSpeAnalysis
      This pipeline uses the Burrows-Wheeler Aligner (BWA), Genome Analysis Toolkit (GATK), Picard, ANNOVAR, and BEDTools to process paired-end short reads generated by next-generation sequencing machines, from alignment through high-quality variant and genotype calling.
    • PLINK
      PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
    • SEQanswers Software List
      A dynamic and comprehensive table of next-generation sequence analysis software compiled on the SEQanswers website. Includes programs that recalibrate the quality scores produced by next-generation sequencing base callers (ShortRead, SHREC, BING, GATK) and algorithms for aligning DNA sequence reads (BWA, MAQ, BFAST, SOAP, etc.).
    • University of Michigan Software Tools
      Scientists at the University of Michigan have developed software tools for statistical genetics analysis, and they make these tools openly available to the scientific community. Includes LocusZoom, MACH and the CaTS Power Calculator.
    • VarScan
      This statistical package can be used to detect germline variants, somatic mutations, and copy number variations for next-generation sequencing platforms.
  • Genome Browsers and Map Viewers
    • Ensembl
      Ensembl is a joint project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes.
    • Integrative Genomics Viewer
      The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
    • National Center for Biotechnology Information (NCBI) Human Genome Resources
      NCBI's website strives to offer an integrated, one-stop, genomic information resource for data emerging from the Human Genome Project and other sequencing projects worldwide.
    • The University of California, Santa Cruz (UCSC) Genome Browser
      The UCSC Genome Browser contains the reference sequence and working draft assemblies for a large collection of genomes. This interactive website offers access to genome sequence data integrated with aligned annotations.
      • UCSC Xena
        Securely analyze and visualize your private functional genomics data set in the context of public and shared genomic/phenotypic data sets.
  • Toolkits for Harmonizing or Generating Standardized Measures for Phenotypes and Exposures
  • Tools and Applications for Functional Genomics Analysis
    • Catalogue of Somatic Mutations in Cancer (COSMIC)
      A catalogue of somatic mutation information for human cancers extracted from primary literature.
    • DTP/NCI Molecular Target Databases
      A database used to search for molecular targets near a sequence, gene, or variant of interest. Molecular target data are presented from the NCI panel of 60 human tumor cell lines.
    • Encyclopedia of DNA Elements (ENCODE)
      A catalogue of regulatory elements across the human genome.
    • Epigenome Browser
      Hosts data from the Roadmap Epigenomics Project for comparative analysis with large scale genomic and epigenomic data.
    • eQTL Browser
      A tool for searching for expression quantitative trait loci (eQTL) near sequences, genes, and loci.
    • FunciSNP
      Software that identifies the functionality of SNPs in coding and non-coding regions through integration of GWAS data, 1000 Genomes data, and chromatin features.
    • HaploReg
      A tool that links SNPs to their predicted chromatin state and other regulatory motifs.
    • JASPAR
      A database profiling transcription factor binding sites.
    • RegulomeDB
      A database used to identify regulatory elements in non-coding regions of the human genome.
    • SNP and CNV Annotation Database (SCANdb)
      A database that can be used in follow-up analyses of GWAS data to determine whether a SNP is an eQTL for a specified gene.
    • SNPexp
      A web tool for calculating and visualizing correlation between HapMap genotypes and gene expression levels.
    • SNP Function Prediction (FuncPred)
      A tool that queries SNP function predictions.
    • TRANSFAC
      This database links the genome to transcription factors, transcription factor binding sites, and regulated genes.

Interpretative Tools for Genomic Data

  • Biological Pathway Analysis Programs and Databases
    • Ariadne Pathway Studio
      This pathway analysis software may be used to interpret gene expression and other high-throughput data and is a useful resource for building, expanding, and analyzing pathways. Investigators may also use MedScan as a data mining tool to extract relevant information from publications. Pathways can be used in publications.
    • Hotnet
      Hotnet, an algorithm created by the Raphael Lab in the Department of Computer Science at Brown University, can be used with Matlab and Python statistical packages to find significantly altered sub-networks in large gene interaction networks. Visualizations for Hotnet output require Cytoscape.
    • Ingenuity Pathway Analysis
      This program allows researchers to analyze gene expression, RNA-Seq, microRNA, qPCR, proteomics, metabolomics, and genotyping data through the identification of relevant pathways, relationships, mechanisms, and functions. The program can be used for large-scale genomic data.
    • MuSiC
      This statistical package uses multiple tools to find gene alterations and relationships in cancer. Investigators can compare their mutations with mutations found in COSMIC and OMIM, compare their data with clinical data, and use Pathscan to find altered pathways in cancer.
    • Netpath
      This database, created in collaboration with the Johns Hopkins University Pandey Lab and the Institute of Bioinformatics, is a useful resource for curated signal transduction pathways in humans. Netpath provides information on 10 immune pathways and 10 cancer pathways. Each pathway includes information on protein-protein interactions, enzyme catalysis, protein translocation, and gene regulation. All pathways are available for batch download.
    • Regulome Explorer
      Regulome Explorer is a web-based tool that allows researchers to use TCGA data for cancer comparisons, random forest regression, and individual genome aberrations. Investigators can use Pubcrawl for data mining and for building gene networks. These networks can be based on literature distances in MEDLINE or on protein domain interactions.
  • Cancer Genome and Somatic Mutation Information
    • cBio Cancer Genomics Portal
      The cBio Cancer Genomics Portal, developed by the Computational Biology Center at Memorial Sloan-Kettering Cancer Center, provides visualization, analysis and download of subsets of large-scale cancer genomics data sets.
    • Catalogue of Somatic Mutations in Cancer (COSMIC)
      The COSMIC database is designed to store and display somatic mutation information and related details and contains information relating to human cancers.
    • The Cancer Genome Atlas (TCGA)
      TCGA is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. TCGA data are available to the research community for use in developing better ways of diagnosing, treating, and preventing cancer.
  • Catalogues and Databases of Relationships Between Genotypes and Phenotypes
    • CanDL
      An expert-curated database of potentially actionable driver mutations for molecular pathologists and laboratory directors to facilitate literature-based annotation of genomic testing of tumors.
    • Clinical Interpretation of Variants in Cancer (CIViC)
      An open access, open source, community-driven web resource for Clinical Interpretation of Variants in Cancer.
    • ClinVar
      This is a freely accessible, public archive of reports of the relationships among human variations and phenotypes along with supporting evidence.
    • Database of Curated Mutations (DoCM)
      A highly curated database of known, disease-causing mutations that provides easily explorable variant lists with direct links to source citations for easy verification.
    • Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER)
      DECIPHER is an interactive web-based database which incorporates a suite of tools designed to aid the interpretation of submicroscopic chromosomal imbalance. This database collects clinical information about chromosomal microdeletions/duplications/insertions, translocations and inversions.
    • GeneNetwork
      A group of linked data sets and tools used to study complex networks of genes, molecules, and higher order gene function and phenotypes from the University of Tennessee.
    • Human Gene Mutation Database (HGMD)
      This database provides a comprehensive core collection of germline mutations in nuclear genes that underlie or are associated with human inherited disease.
    • Jackson Laboratory Clinical Knowledgebase (JAX CKB)
      The Jackson Laboratory Clinical Knowledgebase is a semi-automated curated database of gene/variant annotations, therapy/diagnostic/prognostic information, and clinical trials related to oncology.
    • NHGRI Catalog of Published Genome-Wide Association Studies
      This resource provides information on SNP-trait associations abstracted from GWAS publications.
    • Online Mendelian Inheritance in Man (OMIM)
      OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.
    • Phenotype-Genotype Integrator (PheGenI)
      PheGenI merges NHGRI GWAS catalog data with several databases housed at the NCBI, including Gene, dbGaP, OMIM, Genotype-Tissue Expression (GTEx), and the Database of Single Nucleotide Polymorphisms (dbSNP).
    • Precision Medicine Knowledgebase (PMKB)
      The Precision Medicine Knowledgebase provides editable, pathologist-reviewed information about clinical cancer variants and interpretations in a structured way.
  • Tools for Predicting Impact of Amino Acid Substitutions
    • Polymorphism Phenotyping (PolyPhen-2)
      This tool predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations.
    • PMut
      This software, aimed at the annotation and prediction of whether a mutation is pathological, formulates predictions with neural networks, using internal databases, secondary structure prediction and sequence conservation.
    • The Sorting Tolerant From Intolerant (SIFT) Algorithm
      This tool predicts whether an amino acid substitution affects protein function based on the degree of conservation of amino acid residues in sequence alignments derived from closely related species.
    • Variant Effect Predictor
      This system (formerly known as the SNP Effect Predictor) categorizes Ensembl genomic variants in known transcripts by their potential effect.
