With the rapid development of next-generation sequencing technology, the ability to generate data has grown much more rapidly than the ability to analyze and interpret it. To meet this challenge, the bioinformatics specialists at CRI combine a deep understanding of the properties of large-scale data and of appropriate analytical techniques with a biomedical background and access to high-performance computing resources. Over the course of a collaboration with biomedical researchers, bioinformaticians at CRI may recognize the need to develop novel computational techniques as part of an analytical project.
Here, we share two success stories of method development that grew out of collaborations with PIs:
Automated Batch Randomization for Better Study Design
Data collected on high-throughput biological platforms, such as microarray and next-generation sequencing (NGS), can often be processed in parallel in batches, greatly lowering the cost and time of collection. However, details of the personnel, protocol, or instrument settings and calibration often vary slightly from batch to batch. When large studies with hundreds or thousands of samples are conducted, these variations may produce statistically significant but biologically irrelevant differences between batches, confounding efforts to determine true biological differences between sample conditions. Such batch effects can be mitigated by proper randomization, in which sample traits, such as diseased or control status, are evenly distributed across batches.
To mitigate batch effects at the study design stage, the bioinformaticians at CRI developed a computational tool called ARTS (Automated Randomization of multiple Traits for Study design) for automated study randomization. It can be applied to a study of any size, with any number of traits and any batch size. ARTS uses a genetic algorithm to optimize an objective function based on mutual information, a rigorous statistic from information theory. In practice, ARTS strikes a good balance between computational speed and optimization quality. Researchers may access ARTS as a downloadable command-line tool, as well as through the Galaxy installation hosted by the UIC Center for Research Informatics (CRI) at galaxy.cri.uic.edu.
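To make the idea of a mutual-information objective concrete, the following is a minimal sketch in Python. It is not the ARTS implementation: the function names `mutual_information` and `randomize` are hypothetical, only a single trait is scored, and a simple swap-based local search stands in for the genetic algorithm that ARTS uses. A mutual information near zero means that knowing a sample's batch reveals almost nothing about its trait, which is the goal of randomization.

```python
import math
import random
from collections import Counter


def mutual_information(batches, traits):
    """Mutual information (in bits) between batch labels and trait labels."""
    n = len(batches)
    joint = Counter(zip(batches, traits))
    batch_counts = Counter(batches)
    trait_counts = Counter(traits)
    mi = 0.0
    for (b, t), c in joint.items():
        # p(b,t) * log2( p(b,t) / (p(b) * p(t)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (batch_counts[b] * trait_counts[t]))
    return mi


def randomize(traits, n_batches, n_iter=2000, seed=0):
    """Assign samples to equal-sized batches, then reduce the mutual
    information by swapping pairs of samples (illustrative local search,
    NOT the genetic algorithm used by ARTS)."""
    rng = random.Random(seed)
    n = len(traits)
    batches = [i % n_batches for i in range(n)]  # fixed batch sizes
    rng.shuffle(batches)
    best = mutual_information(batches, traits)
    for _ in range(n_iter):
        i, j = rng.randrange(n), rng.randrange(n)
        if batches[i] == batches[j]:
            continue
        batches[i], batches[j] = batches[j], batches[i]
        score = mutual_information(batches, traits)
        if score <= best:
            best = score  # keep the swap
        else:
            batches[i], batches[j] = batches[j], batches[i]  # undo the swap
    return batches, best


if __name__ == "__main__":
    traits = ["diseased"] * 8 + ["control"] * 8
    confounded = [0] * 8 + [1] * 8  # every diseased sample in batch 0
    print("confounded design:", mutual_information(confounded, traits))
    assignment, score = randomize(traits, n_batches=2)
    print("after swap search:", score)
```

In this sketch a fully confounded two-batch design scores 1 bit, while a design with equal numbers of diseased and control samples in each batch scores 0. Several traits could be scored in the same way, for example by summing the mutual information of each trait with the batch labels, though how ARTS combines traits is not described here.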
Detection and interpretation of extrachromosomal microDNAs from next-generation sequencing data
Extrachromosomal microDNAs are short, circular DNA molecules derived from genomic DNA. They are typically hundreds of nucleotides long and appear to be omnipresent in mammalian cells. However, their mechanism of formation and their function in cells are far from understood.
The major roadblocks in microDNA studies include the lack of a robust computational methodology for detecting microDNAs in next-generation sequencing (NGS) data and the lack of a clear path to interpreting their presence in cells. Compounding these problems is the extremely low molecular reproducibility observed for microDNAs: biologically replicated experiments turn up very few identical microDNAs.
CRI has developed a systematic and flexible pipeline for detecting microDNAs in NGS data. Using the pipeline, we were able to provide a system-based interpretation of microDNAs that substantially increases the concordance between biological replicates and distinguishes different conditions from each other, even though the underlying microDNA data have low molecular reproducibility.
The methods developed at CRI represent novel and potentially impactful developments that may be of interest to the wider research community, and we may choose to publish them. Publishing helps bioinformaticians remain active members of this community and benefits collaborators as well. Much of the time, we enlist the advice of the PI and their group in preparing the paper, in which case they are included as authors. Additionally, when researchers publish the results of the original project, they can cite a peer-reviewed method for their analysis. Any data required to demonstrate the novel method would be drawn from publicly available sources (e.g., GEO), reserving the biological insights gained in the collaboration for publication by the PI.