Integrative Analysis

Many research projects involve multiple sources of data, and integration of these types of data is often one of the primary challenges in conducting the data analysis. Such projects could include multiple -omics data sets, such as measuring the effect of an epigenomic mark from ChIP-seq on gene expression from RNA-seq; or a mixture of molecular and clinical data, such as associations between genomic variants and disease states, response to drugs, and confounding factors like age, sex, and ethnicity. Correctly bringing together different data types requires a deep understanding of the biological context of the project, knowledge of the available statistical methods for describing and testing associations between different types of data, and awareness of the limitations of each type of measurement.

The CRI provides extensive expertise in data integration, and uses a number of resources to supplement the data sets collected within individual projects, including publicly available resources such as GEO, TCGA, and DAVID; and commercial tools and databases, such as MetaCore and Transfac. Researchers interested in our assistance are encouraged to contact us to schedule a consultation or submit a service request, so that we can discuss the solutions available for their project.

As examples, past projects requiring data integration have included:

  • Systems-level integration of multiple -omics data sets across two species: we analyzed transcriptomics (RNA-seq) and proteomics (mass spectrometry) data in Alzheimer's disease, collected from both mouse models and human subjects. Differentially expressed genes and proteins were identified for each platform and each species, and pathway analysis was performed on each set separately in MetaCore to identify biological functions impacted by molecular-scale changes. The pathway-level results allowed us to directly compare the affected systems both across platforms and between species. Furthermore, we mined the affected pathways for key molecular players and constructed molecular interaction networks within MetaCore to identify molecules to target in follow-up studies.
  • Whole-transcriptome gene expression microarray data from patients with sickle cell disease (SCD) were analyzed individually (N-of-1) for significantly enriched pathways using FAIME, and patients were compared based on the significance of each misregulated pathway. This strategy is more robust to the large person-to-person variability typically seen in gene expression, and is more easily applied across transcriptomics platforms (e.g., microarray vs. RNA-seq). From the pathway-level results we identified two distinct SCD subtypes. One subtype was strongly associated with higher mortality and poorer clinical prognosis, suggesting the need for more intensive therapeutic strategies for these patients. A classifier was built on the pathway enrichments from this gene set and validated in several additional cohorts.
  • Comparison of ChIP-seq and gene expression data to evaluate the epigenetic effects of nitric oxide (•NO) in breast cancer development. ChIP-seq profiles of multiple histone modifications in •NO-treated breast cancer cells were compared to those in untreated cells, and peak locations with increased or decreased enrichment were identified. These peak groups were associated with both proximal and distal genes, and we tested whether such putatively regulated genes tended to be up- or down-regulated in response to •NO. We then searched for enriched transcription factor binding motifs around the differentially enriched histone modification peaks to identify transcription factors potentially co-involved in gene regulation. We also obtained enriched gene ontologies from DAVID to highlight the affected biological pathways.
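
To give a flavor of the kind of association testing described in the last example, the sketch below asks whether genes near peaks with increased enrichment are over-represented among up-regulated genes. It is a minimal illustration under stated assumptions, not the pipeline used in that project: the choice of a one-sided Fisher's exact test, the function names, and the gene identifiers are all hypothetical, and only the Python standard library is used.

```python
from math import comb


def contingency(peak_genes, up_genes, all_genes):
    """Build a 2x2 table (a, b, c, d) from gene sets.

    a: near a gained peak and up-regulated
    b: near a gained peak, not up-regulated
    c: up-regulated, not near a gained peak
    d: neither
    """
    # Restrict both sets to the background so the margins are consistent.
    peak_genes = peak_genes & all_genes
    up_genes = up_genes & all_genes
    a = len(peak_genes & up_genes)
    b = len(peak_genes - up_genes)
    c = len(up_genes - peak_genes)
    d = len(all_genes - peak_genes - up_genes)
    return a, b, c, d


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns the probability, under the hypergeometric null with fixed
    margins, of observing `a` or more genes in the top-left cell.
    """
    row1 = a + b          # genes near a gained peak
    col1 = a + c          # up-regulated genes
    n = a + b + c + d     # background size
    max_a = min(row1, col1)
    favorable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, max_a + 1)
    )
    return favorable / comb(n, row1)


# Hypothetical toy data: 20 background genes, 6 near gained peaks, 6 up-regulated.
all_genes = {f"gene{i}" for i in range(20)}
gained_peak_genes = {f"gene{i}" for i in range(6)}
up_regulated = {f"gene{i}" for i in (0, 1, 2, 3, 10, 11)}

table = contingency(gained_peak_genes, up_regulated, all_genes)
p_value = fisher_exact_greater(*table)
print(f"table={table}, one-sided p={p_value:.4f}")
```

In practice the background set would be all genes assayed on both platforms, and p-values from several peak groups or histone marks would need correction for multiple testing.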