Genomic Resources for Cancer Epidemiology
- Data Resources, Genotyping and Sequencing Centers, and NCI- and NIH-Sponsored Networks and Programs
- Analytical Tools and Statistical Software
- Interpretative Tools for Genomic Data
Note: This web page provides links to research resources that may be of interest to genetic epidemiologists conducting cancer research, but is not exhaustive. Within each section, the resources are listed in alphabetical order.
If you have suggestions for additional resources to add, please contact Carolyn Hutter, Ph.D., a Program Director in the Epidemiology and Genomics Research Program's Host Susceptibility Factors Branch.
Data Resources, Genotyping and Sequencing Centers, and NCI- and NIH-Sponsored Networks and Programs
- Databases and Catalogues of Genetic Variation
  - 1000 Genomes Project
    The goal of the 1000 Genomes Project is to provide a comprehensive resource on human genetic variation. The Project is sequencing the genomes of approximately 2,500 samples at 4x coverage, to provide data on genetic variants with frequencies of at least 1% in the populations studied.
  - Database of Genomic Structural Variation (dbVar)
    dbVar is the NCBI central repository for structural variation. Structural variation is generally defined as any region of DNA involved in inversions and balanced translocations, insertions and deletions, or copy number variation.
  - Database of Single Base Nucleotide Substitutions (dbSNP)
    dbSNP is the NCBI central repository for single base nucleotide substitutions (SNPs) and short deletion and insertion polymorphisms.
  - Encyclopedia of DNA Elements (ENCODE) data
    ENCODE provides a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
  - International HapMap Project
    The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world.
  - National Heart Lung and Blood Institute (NHLBI) Exome Variant Server (EVS)
    The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of exome sequencing across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community. The current EVS data release represents all variants identified from exome sequencing of 6503 ESP samples.
  - SNP500Cancer
    SNP500Cancer provides a central resource for sequence verification of SNPs in genetic regions of importance to molecular epidemiology studies in cancer.
- Genomic Datasets for Cancer Research: Datasets and Access Policy
  - Data Access Request Process
    This page contains instructions for submitting a Data Access Request for dataset(s) under the purview of the NCI's Extramural Data Access Committee.
  - Database of Genotypes and Phenotypes (dbGaP)
    dbGaP was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype. Such studies include GWAS, medical sequencing, molecular diagnostic assays, as well as studies of associations between genotype and non-clinical traits. dbGaP provides two levels of access, open and controlled, in order to allow broad release of non-sensitive data, while providing oversight and investigator accountability for sensitive data sets involving personal health information.
  - Genomic Datasets for Cancer Research
    This page provides information on a variety of datasets from genome-wide association studies (GWAS) of cancer and other genotype-phenotype studies, including sequencing and molecular diagnostic assays. These data are available to approved investigators through the National Cancer Institute (NCI)'s Extramural Data Access Committee (DAC).
  - GWAS Policy Home Page
    In January 2008, the National Institutes of Health (NIH) implemented a policy for the sharing of data obtained in NIH-supported or conducted GWAS. The purpose of the policy is to foster science for the benefit of the public through the creation of a centralized NIH GWAS data repository. This website supports the GWAS policy's implementation.
  - Notice on Development of Data Sharing Policy for Sequence and Related Genomic Data
    This notice details NIH plans to: 1) update data sharing policies for NIH-supported research involving sequence and related genomic data; 2) encourage investigators and IRBs to consider the potential for broad sharing of these genomic data in developing informed consent processes and documents for such studies; and 3) communicate the agency's intent to develop a policy pertaining to the deposition of these large datasets into centralized databases.
- Genotyping and Sequencing Centers
  - Cancer Genomics Research Laboratory (CGR)
    NCI established the CGR to investigate the contribution of germline genetic variation to cancer susceptibility and outcomes. Working in concert with epidemiologists, biostatisticians and basic research scientists in the intramural research program, the CGR has developed the capacity to conduct genome-wide association studies and next-generation sequencing to identify the heritable determinants of various forms of cancer.
  - Center for Inherited Disease Research (CIDR)
    CIDR provides high-quality next-generation sequencing and genotyping services to investigators working to discover genes that contribute to common diseases.
  - Mendelian Genome Centers
    These Centers, funded by the National Human Genome Research Institute (NHGRI), apply next-generation sequencing and computational approaches to discover the genes and variants that underlie Mendelian conditions, including certain forms of cancer.
  - National Human Genome Research Institute (NHGRI) Large Scale Sequencing Program
    NHGRI funds large-scale genome sequencing capacity at several centers located in the U.S. This program undertakes sequencing projects to provide critical genomic information that can be of significant value to the scientific community in areas of very broad scientific interest.
- Literature and Knowledge Base Resources
  - Cancer Genome-Wide Association and Meta Analyses database (Cancer GAMAdb)
    Cancer GAMAdb provides a continually updated database containing key descriptive characteristics of each genetic association extracted from published GWAS and meta-analyses relevant to cancer risk.
  - Cancer Genomic Evidence-Based Medicine Knowledge Base (CancerGEM KB)
    CancerGEM KB is a resource for researchers, public health professionals, policy makers, and health care providers who are interested in the use of genomic information in cancer care and prevention.
  - GeneReviews
    GeneReviews are overviews providing expert-authored, peer-reviewed, current disease descriptions that apply genetic testing to the diagnosis, management, and genetic counseling of patients and families with specific inherited conditions.
  - HuGE Navigator
    The Navigator is an integrated, searchable knowledge base of genetic associations and human genome epidemiology.
  - Pharmacogenomic Resources
    This page provides links to pharmacogenomics collaborative opportunities, consortia, and networks; databases related to pharmacogenomics research; knowledge synthesis resources; reports; and toolkits.
  - SEQanswers
    SEQanswers was founded to be an information resource and user-driven community focused on all aspects of next-generation genomics. The site aims to be a central location for next generation sequencing technology discussion and education. The site will always attempt to cater to everyone, regardless of scientific background or knowledge.
NCI/NIH Sponsored Networks and Programs
View Resources- Cancer Genetics Markers of Susceptibility (CGEMS)

CGEMS was launched to identify common inherited genetic variations associated with risk for breast and prostate cancer. It involves genome-wide association studies (GWAS) for a number of cancers, and more recently, exposures and survival. The raw genotype data from each of the CGEMS projects will be available for download to accredited investigators, upon approval of a Data Access Request. - Environmental Polymorphism Registry (EPR)

The EPR is a long-term research project to collect and store DNA from up to 20,000 North Carolinians in a biobank. The DNA samples are available to scientists to study variations in genes (known as polymorphisms) that might be linked to common diseases such as diabetes, heart disease, cancer, asthma and others. While many types of genes are studied as part of the EPR, the focus is on a category known as environmental response genes. - Genes, Environment and Health Initiative (GEI)

The GEI is an NIH-wide initiative that aims to accelerate understanding of genetic and environmental contributions to health and disease. There are two components to GEI: genetics and exposure biology. The genetics component includes a genome-wide association program called GENEVA (Gene Environment Association Studies)
. - Genetic Associations and Mechanisms in Oncology (GAME-ON)
GAME-ON comprises five NCI sponsored cooperative agreements for transdisciplinary research projects addressing two overall goals: 1) To pursue promising scientific leads from previously generated GWAS of cancer; and 2) To coordinate and accelerate integrative post-GWAS discovery research, which could provide the basis for expediting clinical translation and public health dissemination of the findings.
- Cancer Genetics Markers of Susceptibility (CGEMS)
Analytical Tools and Statistical Software
- Analysis Tools
  - Alphabetical List of Genetic Analysis Software
    Curated at Rockefeller University, a list of computer software on the following topics: genetic linkage analysis for human pedigree data, QTL analysis for animal/plant breeding data, genetic marker ordering, genetic association analysis, haplotype construction, pedigree drawing, and population genetics.
  - Broad Institute Software Tools
    Scientists in the Broad community have developed many critical software tools for the analysis of increasingly large genome-related datasets, and they make these tools openly available to the scientific community. Includes GATK and Haploview.
  - Genetic Simulation Resources (GSR)
    This web tool provides a catalogue of existing computer simulation programs that simulate genetic data of the human genome for studies in population and evolutionary genetics, genetic epidemiology, and other relevant application areas. It contains computer programs that generate samples by simulating evolutionary processes backward (coalescent) or forward in time, resampling empirical data, or using other novel methods. It is intended to help researchers select the most appropriate genetic simulation tools for specific genetic epidemiology questions.
  - Genome Variation Server (GVS)
    GVS provides information on allele frequencies, linkage disequilibrium, tagSNP selection and SNP summaries. Fed by a local database, GVS enables rapid access to human genotype data found in dbSNP, and provides tools for analysis of genotype data.
  - PLINK
    PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
  - SEQanswers Software List
    Dynamic and comprehensive table of next-generation sequence analysis software compiled on the SEQanswers website. Includes programs that recalibrate the quality scores produced by next-generation sequencing base callers (ShortRead, SHREC, BING, GATK) and algorithms for aligning DNA sequencing reads (BWA, MAQ, BFAST, SOAP, etc.).
  - University of Michigan Software Tools
    Scientists at the University of Michigan have developed software tools for statistical genetics analysis, and they make these tools openly available to the scientific community. Includes LocusZoom, MACH and the CaTS Power Calculator.
  - VarScan
    This statistical package can be used to detect germline variants, somatic mutations, and copy number variations for next-generation sequencing platforms.
- Genome Browsers and Map Viewers
  - Ensembl
    Ensembl is a joint project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes.
  - Integrative Genomics Viewer
    The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
  - National Center for Biotechnology Information (NCBI) Human Genome Resources
    NCBI's website strives to offer an integrated, one-stop, genomic information resource for data emerging from the Human Genome Project and other sequencing projects worldwide.
  - NCBI Map Viewer
    The Map Viewer provides graphical displays of features on the human reference genome sequence assembly maintained by the Genome Reference Consortium and the alternate HuRef genome assembly, as well as cytogenetic, genetic, physical, and radiation hybrid maps.
  - The University of California, Santa Cruz (UCSC) Genome Browser
    The UCSC Genome Browser contains the reference sequence and working draft assemblies for a large collection of genomes. This interactive website offers access to genome sequence data integrated with aligned annotations.
    - UCSC Cancer Genomics Browser
      This browser allows researchers to investigate cancer genomics data and its corresponding clinical information. The browser can be used to view biological pathways, chromosomal locations, and gene expression data. Statistical analysis can be performed on subsets of the data.
- Toolkits for Harmonizing or Generating Standardized Measures for Phenotypes and Exposures
  - Consensus Measures for Phenotypes and EXposures (PhenX)
    PhenX, funded by NHGRI, is intended to integrate genetics and epidemiologic research. The toolkit is a web-based catalog of high-priority measures of phenotypes and exposures for use in GWAS and other research efforts.
  - Data Schema and Harmonization Platform for Epidemiological Research (DataSHaPER)
    DataSHaPER is both a scientific approach and a suite of practical tools. Its primary aims are to facilitate the prospective harmonization of emerging biobanks, provide a template for retrospective synthesis and support the development of questionnaires and information-collection devices.
Interpretative Tools for Genomic Data
- Biological Pathway Analysis Programs and Databases
  - Ariadne Pathway Studio
    This pathway analysis software may be used to interpret gene expression and other high-throughput data and is a useful resource for building, expanding, and analyzing pathways. Investigators may also use MedScan as a data mining tool to extract relevant information from publications. Pathways can be used in publications.
  - Hotnet
    Hotnet, an algorithm created by the Raphael Lab in the Department of Computer Science at Brown University, can be used with Matlab and Python statistical packages to find significantly altered sub-networks in large gene interaction networks. Visualizations for Hotnet output require Cytoscape.
  - Ingenuity Pathway Analysis
    This program allows researchers to analyze gene expression, RNA-Seq, microRNA, qPCR, proteomics, metabolomics, and genotyping data through the identification of relevant pathways, relationships, mechanisms, and functions. The program can be used for large-scale genomic data.
  - MuSiC
    This statistical package uses multiple tools to find gene alterations and relationships in cancer. Investigators can compare their mutations with mutations found in COSMIC and OMIM, compare their data with clinical data, and use Pathscan to find altered pathways in cancer.
  - Netpath
    This database, created in collaboration with the Johns Hopkins University Pandey Lab and the Institute of Bioinformatics, is a useful resource for curated signal transduction pathways in humans. Netpath provides information on 10 immune pathways and 10 cancer pathways. Each pathway includes information on protein-protein interactions, enzyme catalysis, protein translocation, and gene regulation. All pathways are available for batch download.
  - Regulome Explorer
    Regulome Explorer is a web-based tool that allows researchers to use TCGA data for cancer comparisons, random forest regression, and individual genome aberrations. Investigators can use Pubcrawl for data mining and for building gene networks. These networks can be based on the literature distances in MEDLINE or protein domain interactions.
- Cancer Genome and Somatic Mutation Information
  - cBio Cancer Genomics Portal
    The cBio Cancer Genomics Portal, developed by the Computational Biology Center at Memorial Sloan-Kettering Cancer Center, provides visualization, analysis and download of subsets of large-scale cancer genomics data sets.
  - Catalogue of Somatic Mutations in Cancer (COSMIC)
    The COSMIC database is designed to store and display somatic mutation information and related details and contains information relating to human cancers.
  - The Cancer Genome Atlas (TCGA)
    TCGA is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. TCGA data are available to the research community for use in developing better ways of diagnosing, treating, and preventing cancer.
- Catalogues and Databases of Relationships Between Genotypes and Phenotypes
  - ClinVar
    This is a freely accessible, public archive of reports of the relationships among human variations and phenotypes along with supporting evidence.
  - Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER)
    DECIPHER is an interactive web-based database which incorporates a suite of tools designed to aid the interpretation of submicroscopic chromosomal imbalance. This database collects clinical information about chromosomal microdeletions/duplications/insertions, translocations and inversions.
  - GeneNetwork
    Hosted at the University of Tennessee, GeneNetwork is a group of linked data sets and tools used to study complex networks of genes, molecules, and higher-order gene function and phenotypes.
  - Human Gene Mutation Database (HGMD)
    This database provides a comprehensive core collection of germline mutations in nuclear genes that underlie or are associated with human inherited disease.
  - NHGRI Catalog of Published Genome-Wide Association Studies
    This resource provides information on SNP-trait associations abstracted from GWAS publications.
  - Online Mendelian Inheritance in Man (OMIM)
    OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.
  - Phenotype-Genotype Integrator (PheGenI)
    PheGenI merges NHGRI GWAS catalog data with several databases housed at the NCBI, including Gene, dbGaP, OMIM, Genotype-Tissue Expression (GTEx), and the Database of Single Nucleotide Polymorphisms (dbSNP).
  - wikiGWA
    wikiGWA is a Wikipedia-style platform for researchers to share their GWA findings.
- Tools for Predicting Impact of Amino Acid Substitutions
  - Polymorphism Phenotyping (PolyPhen-2)
    This tool predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations.
  - PMut
    This software, aimed at the annotation and prediction of whether a mutation is pathological, formulates predictions with neural networks, using internal databases, secondary structure prediction and sequence conservation.
  - The Sorting Tolerant From Intolerant (SIFT) Algorithm
    This tool predicts whether an amino acid substitution affects protein function based on the degree of conservation of amino acid residues in sequence alignments derived from closely related species.
  - Variant Effect Predictor
    This system (formerly known as the SNP Effect Predictor) categorizes Ensembl genomic variants in known transcripts by their potential effect.