Bioinformatics & Computational Biology

Bioinformatics bridges computer science and molecular biology with the goal of storing, organizing, and analyzing biological information. Research includes developing efficient and scalable algorithms for biomolecular simulation and applying data mining, statistical machine learning, natural language processing, and information retrieval to analyze biological data of all kinds, including DNA sequences, protein sequences and structures, microarray data, and the biology literature, in order to facilitate biological discovery.

Illinois researchers focus on problems pertaining to the phenomenon of “gene regulation” and its evolution. Gene regulation refers to how genes in a cell are switched on (or off) to determine the cell’s functions. It is the reason why, for example, skin and muscle cells are different despite having the same DNA. It is central to a range of biological phenomena such as development and disease. Moreover, the evolution of gene regulation underlies the amazing diversity of life forms around us.

Learn more about Bioinformatics research at Illinois:


Jian Peng: protein function & structure, systems biology, machine learning and optimization
Saurabh Sinha: gene regulation, comparative genomics, sequence analysis
Tandy Warnow: multiple sequence alignment, phylogenomics, metagenomics, and historical linguistics
Jiawei Han: data mining
ChengXiang Zhai: information retrieval, text mining, bioinformatics
Bruce Schatz: bioinformatics


  • Biology literature access and mining
  • Cis-regulatory modules and their discovery through comparative genomics
  • Coalescent-based species tree estimation
  • Comprehensive maps of gene regulation in various organisms
  • Evolution of modules
  • Gene regulation and social behavior
  • Metagenomics
  • Models of regulatory function
  • Multiple sequence alignment
  • Phylogenetic network estimation
  • Phylogenomics
  • Probabilistic alignment
  • Protein structure and function prediction


Computational tools developed by Illinois researchers are available for download and free use by the academic community. 

  • ASTRAL (coalescent-based species tree estimation)
  • CRM discovery benchmark (Data sets from D. melanogaster)
  • D2Z software (Alignment-free comparison of regulatory sequences)
  • DIPS software (For finding discriminative PWM motifs) 
  • EMMA (Prediction and alignment of cis-regulatory modules)
    This Linux-based program is meant for prediction of regulatory targets of a motif using two-species comparison. If you have a sequence window of length ~100 bp - 2000 bp, and its orthologous window from another species, use EMMA to score the window for matches to a given motif. EMMA is also useful for alignment of cis-regulatory modules (enhancers) between two species, if you have knowledge of the relevant transcription factor motifs.
  • GEMSTAT (Thermodynamics-based modeling of gene expression from regulatory sequences)
  • GenomeSurveyor (Prediction of motif targets in D. melanogaster)
    This web-based Genome Browser allows you to find regulatory targets of a large collection of transcription factors in the Drosophila genome. You may use cross-species comparison among 12 genomes to see conserved targets.
  • Indelign software (Probabilistically annotating indels in multiple alignments) 
  • Morph software (Probabilistic alignment of cis-regulatory modules) 
  • PASTA (co-estimation of multiple sequence alignments and phylogenetic trees)
  • PhyME software (Motif finding in orthologous sequences) 
  • SEPP
    Three methods based on ensembles of HMMs: SEPP (SATe-enabled phylogenetic placement), TIPP (Taxonomic Identification and Phylogenetic Profiling), and UPP (Ultra-large alignments using Phylogeny-aware Profiles).
  • Stubb software (For finding cis-regulatory modules) 
  • SWAN (Prediction of binding targets of a transcription factor, characterized by a position weight matrix)
    This Linux-based program is meant for genome-wide prediction of regulatory targets of a motif using a Hidden Markov Model. It differs from Stubb in the question it asks. Instead of asking “Does the sequence have more sites than expected from a random (background) model of sequences?”, it asks “Does the sequence have more sites than the average genome-wide frequency of sites?” We have found this new approach to lead to more accurate motif target predictions overall.
  • YMF software, YMF Web Server (Motif finding)
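
Several of the tools above score a sequence window for matches to a motif given as a position weight matrix (PWM), and SWAN asks whether a window has more sites than the genome-wide average. The Python sketch below illustrates that general idea only. It is not the Hidden Markov Model used by SWAN, nor the method implemented in EMMA or Stubb. The four-position motif, the uniform background, the score threshold, and the function names `log_odds`, `count_sites`, and `enrichment` are all hypothetical choices made for illustration, and the code assumes uppercase sequences containing only A, C, G, and T.

```python
import math

# Hypothetical 4-position PWM (assumption): probability of each base per position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Uniform background base frequencies (assumption).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(site, pwm=PWM, background=BACKGROUND):
    """Log2-odds score of one candidate site under the PWM versus background."""
    return sum(math.log2(col[base] / background[base])
               for col, base in zip(pwm, site))


def count_sites(seq, threshold, pwm=PWM):
    """Count positions in seq whose motif-length window scores >= threshold."""
    width = len(pwm)
    return sum(
        1
        for i in range(len(seq) - width + 1)
        if log_odds(seq[i:i + width], pwm) >= threshold
    )


def enrichment(window, genome, threshold, pwm=PWM):
    """Ratio of site density in the window to site density genome-wide."""
    width = len(pwm)
    window_density = count_sites(window, threshold, pwm) / (len(window) - width + 1)
    genome_density = count_sites(genome, threshold, pwm) / (len(genome) - width + 1)
    if genome_density == 0:
        return float("inf")  # no sites genome-wide, so the ratio is undefined
    return window_density / genome_density


if __name__ == "__main__":
    toy_genome = "AGCT" + "T" * 16   # toy stand-in for a genome
    toy_window = "AGCTAGCT"          # toy stand-in for a candidate region
    print(count_sites(toy_window, threshold=5.0))
    print(enrichment(toy_window, toy_genome, threshold=5.0))
```

A ratio well above 1 suggests the window carries more motif matches than is typical for the genome as a whole. A real analysis would need a principled threshold, both strands, and a statistical model of site counts, which is what the dedicated tools listed above provide.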


  • CS 466: Introduction to Bioinformatics
  • CS 598TW: Algorithmic Genomic Biology
  • CS 598SS: Probabilistic Methods for Biological Sequence Analysis
  • Summer course on Computational Genomics

Bioinformatics & Computational Biology Centers & Labs