Frequently Asked Questions

Yes. We frequently get requests for statistical analysis of additional types of data, for example metabolomics data from LCMS or protein array data, or clinical/patient data. Please contact us for more information about your particular project, and we can discuss the services we can provide for you.
Because we are in the College of Medicine West building on UIC’s west campus, we are well placed to work closely with UIC’s medical and biological researchers. However, CRI services are also available to outside academic and commercial institutions, at a higher rate than for internal UIC users. Researchers from Rush University, Northwestern University, and University of Chicago may use CRI services at internal rates.
We have standardized analysis methodologies for many services, such as RNA-seq, but due to the rapidly changing nature of the bioinformatics field and the analysis complexities that are specific to different projects, we are unable to provide static quotes for services. If you would like a quote for your project, please contact us to schedule a free consultation. After reviewing your project and analysis needs we will be able to provide a project plan and budget.
This depends on the scope and budget of your project. As a general rule, microarrays are cheaper and faster to run and analyze than NGS, but your measurements are limited to the features on the array so there is less scope for novel discovery.

For genotyping/resequencing projects, microarray platforms generally have hundreds of thousands to millions of SNPs, and aim to cover the most common variants in a population. They usually do not cover insertions and deletions, and are generally not suitable for detecting somatic mutations. If your goal is to genotype a large cohort of patients, a microarray will likely be the most effective platform; the Sequenom platform used by the UIC Core Genomics Facility also offers the ability to cheaply genotype a customized set of SNPs (10s to 100s) on a large number of patients. If your goal is to study rare variants that may not be present on an array, variants more complex than SNVs, or somatic mutations (e.g., in cancer) in a smaller set of samples, NGS is most appropriate.

For gene expression, the extra benefits greatly outweigh the slightly higher costs of RNA-seq compared to microarray: RNA-seq provides a broader dynamic range, the ability to detect novel transcripts and sequencing variations, and better forwards and backwards compatibility as analysis methods or databases change.
Technical replicates are essentially repeated measurements of the same sample, for example, preparing multiple libraries from the same cell population and sequencing the libraries separately. Biological replicates are measurements of different samples that are in the same biological condition, for example cells obtained from two mice that are the same age, genotype, treatment, etc.

Collecting biological replicates is critical for accurately detecting changes between different biological conditions (such as WT vs KO). This analysis typically boils down to some version of comparing within-group variation to between-group variation, as you would do in a t-test (though the model of variability may be different), and so estimating within-group variation correctly is very important. There is always natural variation from measurement-to-measurement – some comes from the measurement itself, and some is inherent to the biological system, such as individual mouse-to-mouse differences. Technical replicates will only reflect the former, but biological replicates will reflect both. Since different samples from different conditions must vary at least as much as different samples from the same condition, biological replicates are necessary to differentiate conditions.
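
As a rough illustration of why this matters, the sketch below computes a Welch-style t statistic for the same WT vs KO difference twice: once with biological replicates and once with technical replicates. The expression values are made up for illustration only, and `welch_t` is a hypothetical helper, not part of any CRI pipeline. Real RNA-seq analyses use different variability models, as noted above.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Between-group difference divided by the pooled within-group standard error."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Made-up expression values for one gene (illustrative assumption, not real data).
# Biological replicates: three different mice per condition.
wt_biological = [10.0, 12.0, 14.0]
ko_biological = [13.0, 15.0, 17.0]

# Technical replicates: one mouse per condition, measured three times.
wt_technical = [11.9, 12.0, 12.1]
ko_technical = [14.9, 15.0, 15.1]

t_biological = welch_t(wt_biological, ko_biological)
t_technical = welch_t(wt_technical, ko_technical)

print(f"t with biological replicates: {t_biological:.2f}")
print(f"t with technical replicates:  {t_technical:.2f}")
```

Both comparisons have the same difference in group means, but the technical replicates capture only measurement noise, so the within-group variation is underestimated and the statistic is inflated well beyond what the mouse-to-mouse variation would support.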
For a study of a model system with high expected reproducibility between samples, we recommend at least 3 biological replicates per condition for RNA-seq, and ideally 4 or 5 in case one sample is of low quality. If samples are collected from patient cohorts, rather than an animal model or cell line, you will need considerably more than 3 per group, but the exact number is difficult to predict, as it depends on the person-to-person variation within each group. In some circumstances, fewer than 3 replicates are sufficient. For instance, if you are following a time series, 1 or 2 samples per time point may be sufficient depending on the type of patterns you hope to detect.

The level of sequencing depth depends on the scope of the experiment, and the type of RNA-seq performed. For differential expression from whole-transcript RNA-seq (the standard experiment), we recommend at least 20-30M reads for a mammalian genome. Keep in mind that the deeper you sequence, the better you will be able to distinguish changes between low-expressed genes, where most of the noise is. If you are interested in discovering novel isoforms or non-coding transcripts, we recommend much deeper sequencing, 100M reads or more. We also strongly recommend paired-end sequencing for RNA-seq, especially if differential splicing/isoforms are of interest. These depths may change based on other factors as well, such as the quality of the RNA sample (how degraded transcripts are), and the strategy used to exclude ribosomal RNA from sequencing (rRNA depletion versus polyA capture).

On the other hand, if you are primarily interested in gene expression, and not differentiation of isoforms or discovery of novel transcripts, 3' RNA-seq (where only the 3' end of each transcript is sequenced) offers a more economical option: ribosomal RNAs are not a concern, sequencing depth as low as 5M reads may be acceptable, and only single-end sequencing is needed.

For miRNA-seq, as few as 5M reads are sufficient for differential expression of annotated miRNAs. If you would like to discover unannotated miRNAs as well, we recommend closer to 25M reads. Single-end sequencing is sufficient for miRNA-seq.

For more information, we recommend reading the ENCODE guidelines for RNA-seq.
The choice of platform depends on the scope of the project and the samples being genotyped. For instance, Affymetrix offers arrays tailored to different ethnic groups (such as the Axiom Pan-African arrays for people of African descent) that capture the bulk of genetic variation within that group. Additionally, you can design custom arrays for a set of 10s to 100s of individual SNVs to test, using Sequenom. We recommend that investigators contact the UIC Core Genomics facility to get more information about the available platforms.

In general, one replicate is sufficient for genotyping arrays, as long as the quality of the DNA is sufficiently high.
The sequencing depth depends on the type of variation you are looking for, namely germline genetic variants or somatic mutations. For germline variation we recommend at least 50x coverage (average reads per base). For somatic mutations we recommend at least 150x coverage. The overall recommended sequencing depth then depends on the genomic domain being resequenced. For example, for detecting germline variation in whole-exome resequencing in humans (~30Mb of coding sequences) with 2x100 paired-end reads, we would recommend ~11M paired-end reads: 30Mb * (1 read pair/200 bases) * (50 reads/base coverage) * 1.4. The “buffer” factor of 1.4 adds 40% depth to account for deviations from the ideal coverage, including PCR duplication, variance in coverage, and low-quality reads.

A couple of extra notes about somatic mutations: the ability to detect these mutations depends strongly on the purity of the affected tissue (histology of the tumor sample). Samples from tumor tissue should be paired with a control sample from the same individual to differentiate somatic mutations from germline variants. Finally, the deeper sequencing recommended for these experiments typically creates more redundant reads (PCR duplicates), and thus we recommend considering a larger buffer factor, possibly as high as 2.0. With that factor, the recommended depth for somatic mutation detection in exome sequencing with 2x100 reads is closer to 45M paired-end reads.
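
The depth arithmetic above can be sketched as a small Python helper. The function name `required_read_pairs` and its parameter names are illustrative assumptions, not part of any CRI tool. The inputs are the figures quoted in this answer: a ~30Mb exome, 2x100 paired-end reads, 50x or 150x coverage, and buffer factors of 1.4 or 2.0.

```python
def required_read_pairs(target_bases, read_length, coverage, buffer_factor):
    """Estimate the paired-end reads needed to reach a target average coverage.

    target_bases  -- size of the region being resequenced, in bases
    read_length   -- length of each read in the pair (100 for 2x100)
    coverage      -- desired average reads per base (e.g. 50 for germline)
    buffer_factor -- extra depth for duplicates, uneven coverage, low-quality reads
    """
    bases_per_pair = 2 * read_length  # each read pair covers two reads' worth of bases
    return round(target_bases / bases_per_pair * coverage * buffer_factor)


EXOME_BASES = 30_000_000  # ~30Mb of human coding sequence, as quoted above

germline_pairs = required_read_pairs(EXOME_BASES, 100, coverage=50, buffer_factor=1.4)
somatic_pairs = required_read_pairs(EXOME_BASES, 100, coverage=150, buffer_factor=2.0)

print(f"Germline exome: {germline_pairs / 1e6:.1f}M read pairs")
print(f"Somatic exome:  {somatic_pairs / 1e6:.1f}M read pairs")
```

The same function can be reused for other target regions or read lengths by changing the arguments. The appropriate buffer factor for a given library is a judgment call and is best discussed during a consultation.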

We generally recommend paired-end reads for resequencing projects, as longer fragments yield higher-confidence alignments, better differentiation of PCR duplicates, and more accurate SNP calls. One replicate is usually sufficient for DNA resequencing, as long as the data are of sufficient quality.
We recommend 2 replicates per condition, with a paired input (no-IP DNA-sequencing) for ChIP-seq. Sequencing depth depends on the type of protein being studied, whether a narrow mark or broad mark. For narrow marks, where the protein binds in a site-specific manner (true of most transcription factors) or is highly localized (promoter- or enhancer-associated histone marks, like H3K9Ac or H3K4me3), we recommend 40-60M reads; similar recommendations hold for ATAC-seq, though there is no paired input for ATAC. For broad marks, where the ChIP enrichment may span large (>50kb) domains of the genome (histone marks like H3K27me3, H3K9me2), we recommend at least 75-100M reads. We recommend paired-end reads for ChIP-seq due to the need to identify PCR duplicates, but short reads are acceptable.
There is a large variety of experiments you can choose from to measure various epigenomic marks. Histone modifications, which are associated with a variety of chromatin states like promoters, enhancers, active transcription, and silenced transcription, can be measured by ChIP-seq.

Regions of open chromatin (i.e., absence of nucleosomes), which are associated with active transcription and protein-DNA binding, can be measured by DNase-seq or FAIRE-seq. Alternatively, nucleosome positioning can be measured by MNase-seq. However, a new methodology, ATAC-seq, can be used to measure both open chromatin and nucleosome positioning (the latter only if paired-end sequencing is done), and is an easier protocol to follow.

DNA methylation can be measured by MeDIP-seq or bisulfite-seq (BS-seq). MeDIP uses an antibody to pull down methylated regions, and thus gives a broad measure of DNA methylation in a gene locus. BS-seq relies on chemical conversion of unmethylated cytosines, and thus gives single-nucleotide resolution of methylated DNA, at the risk of false positives due to incomplete conversion.

Finally, long-range looping interactions consistent with enhancer-promoter regulation or long-scale DNA structure can be measured by the chromosome conformation capture family of methodologies (3C, 4C, 5C, Hi-C). These protocols can also be linked to an immunoprecipitation step to measure looping in the context of a specific protein (ChIA-PET).
This depends greatly on what you wish to study, and some information can be found in the answers above. As a general rule, longer reads are helpful when (A) high-confidence alignments are crucial or (B) you are looking for potential deviations from the normal genomic structure. (A) is typically true of genome resequencing projects (whole-genome resequencing, whole-exome resequencing, etc.), where alignment biases from short reads can cause SNP calling errors. (B) is often the case for RNA-seq projects, especially where differentiating between gene isoforms is important: reads mapping across exon-exon junctions are the key evidence, and longer reads give more resolution about where splicing is occurring.

Read length is currently limited to 100-150 bases on the Illumina HiSeq platform (and a bit longer on the NextSeq), so in cases where long reads are necessary, paired-end sequencing is a powerful approach. Illumina MiSeq can sequence up to 250 bases reliably, but is limited in overall library size to a few million reads.
The CRI works closely with the UIC Core Genomics Facility (CGF), which processes microarrays for both genotyping and gene expression as well as RNA-seq, and DNA services (DNAS), which offers both Illumina (MiSeq, NextSeq, and HiSeq) and Ion Torrent next-generation sequencing services. However, we will work with data collected from any facility, and any institution.