Methods Development


The enormous amount of data generated by clinical and biological studies is the ultimate driving force for bioinformatics research, especially methods development. With the rapid development of next-generation sequencing (NGS) technology, the ability to generate data has grown much faster than the ability to analyze and interpret it. On the other hand, emerging NGS technologies make it possible to measure the activity of new biomarkers that were previously infeasible to detect. Helping PIs decipher the underlying biomedical principles from massive-scale data therefore remains a great challenge for bioinformaticians.

To meet these challenges, the bioinformatics specialists at CRI combine a deep understanding of the properties of large-scale data with biomedical background knowledge and access to high-performance computing clusters. Over the course of collaborating with biomedical researchers, bioinformaticians at CRI have gained experience both in assembling state-of-the-art workflows from available tools and in developing new methods for hard questions that had not been adequately addressed before. These pipelines draw on the latest developments and the common understanding of the bioinformatics community to address issues including, but not limited to, the following:

  • Optimize the experimental design for better accuracy and statistical significance under a limited budget
  • Improve the efficiency of alignment of the short reads generated under various experimental protocols to the reference genome
  • Accurately quantify the abundance of short reads at each genomic position through normalization
  • Improve the detection of significantly different biological markers between samples or experimental conditions

Here are two success stories in method development from our collaborations with PIs:

    Automated Batch Randomization for Better Study Design

    Data collected on high-throughput biological platforms, such as microarray and next-generation sequencing (NGS), can often be processed in parallel in batches, greatly lowering the cost and time of collection. However, details of the personnel, protocol, or instrument settings and calibration often vary slightly from batch to batch. When large studies with hundreds or thousands of samples are conducted, these variations may result in statistically significant, but biologically irrelevant, anomalies between batches, confounding efforts to determine true biological differences between sample conditions. Such batch effects can be mitigated by proper randomization, where sample traits, such as diseased or control, are evenly distributed across batches.

    To help mitigate batch effects at the design stage, the bioinformaticians at CRI developed a computational tool called ARTS (Automated Randomization of multiple Traits for Study design) for automated study randomization. It can be applied to a study of any size, with any number of traits and any batch size. ARTS uses a genetic algorithm to optimize an objective function based on mutual information, a rigorous statistic from information theory. ARTS’ performance shows a good balance between computational speed and optimization quality. Researchers may access ARTS via a downloadable command-line tool, as well as at the Galaxy installation hosted by the UIC Center for Research Informatics (CRI) at
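To make the idea of a mutual-information objective concrete, here is a minimal sketch in Python. It is not the ARTS implementation: the function names (`mutual_information`, `objective`, `randomize`) are hypothetical, the objective simply sums the mutual information between the batch label and each trait (an assumption), and a simple swap-based hill climb stands in for the genetic algorithm. A score of 0 means the batch label carries no information about any trait, i.e. the traits are evenly distributed across batches.

```python
import math
import random
from collections import Counter


def mutual_information(batches, traits):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(batches)
    joint = Counter(zip(batches, traits))
    batch_counts = Counter(batches)
    trait_counts = Counter(traits)
    mi = 0.0
    for (b, t), count in joint.items():
        # p(b,t) * log(p(b,t) / (p(b) * p(t))), expressed with raw counts
        mi += (count / n) * math.log(count * n / (batch_counts[b] * trait_counts[t]))
    return mi


def objective(batches, trait_columns):
    """Assumed objective: sum of MI between the batch label and each trait."""
    return sum(mutual_information(batches, column) for column in trait_columns)


def randomize(trait_columns, batch_size, iterations=2000, seed=0):
    """Hill-climbing stand-in for a genetic algorithm.

    Starts from a random assignment with fixed batch sizes, then swaps
    pairs of samples between batches, keeping a swap only if the
    objective does not get worse.
    """
    rng = random.Random(seed)
    n = len(trait_columns[0])
    batches = [i // batch_size for i in range(n)]
    rng.shuffle(batches)
    best = objective(batches, trait_columns)
    for _ in range(iterations):
        i, j = rng.sample(range(n), 2)
        if batches[i] == batches[j]:
            continue  # swapping within a batch changes nothing
        batches[i], batches[j] = batches[j], batches[i]
        score = objective(batches, trait_columns)
        if score <= best:
            best = score
        else:
            batches[i], batches[j] = batches[j], batches[i]  # undo the swap
    return batches, best


if __name__ == "__main__":
    # Hypothetical study: 8 diseased and 8 control samples, batches of 4.
    status = ["diseased"] * 8 + ["control"] * 8
    assignment, score = randomize([status], batch_size=4)
    print(assignment, score)
```

Because samples are only ever swapped between batches, batch sizes stay fixed throughout the search. With several traits, a local search like this can stall in a local optimum, which is one reason a population-based method such as a genetic algorithm is a natural choice for this problem.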

    Detection and interpretation of extrachromosomal microDNAs from next-generation sequencing data

    Extrachromosomal microDNAs are short, circular DNA molecules derived from genomic DNA. They are typically hundreds of nucleotides long and appear to be omnipresent in mammalian cells. However, their mechanism of formation and their function in cells are far from understood.
    The major roadblocks in microDNA studies include the lack of a robust computational methodology for detecting them in next-generation sequencing (NGS) data and the lack of a clear path to interpreting their presence in cells. Compounding these problems is the extremely low molecular reproducibility observed for microDNAs: biologically replicated experiments turn up very few identical microDNAs.

    CRI has developed a systematic and flexible pipeline for detecting microDNAs in NGS data. Using the pipeline, we were able to provide a system-based interpretation of microDNAs that substantially increases the concordance between biological replicates and distinguishes different conditions from each other, even though the microDNA data have low molecular reproducibility.

    The methods developed at CRI can represent novel and potentially impactful developments that may be of interest to the wider research community, so we may choose to publish a method. Publishing helps bioinformaticians remain active members of this community and benefits collaborators as well. Much of the time, we enlist the advice of the PI and their group in preparing the paper, in which case they are included in the authorship. Additionally, when researchers publish the results of the original project, they can cite a peer-reviewed method for their analysis. Any data required to demonstrate the novel method would be drawn from publicly available sources (e.g., GEO), reserving the biological insights gained in the collaboration for publication by the PI.