Our ultimate research goal is to develop computational methods that help discover new or more efficient treatments for complex diseases and enable fast and accurate diagnostics. Our current research interests are in (1) bioinformatics and computational systems biology with applications to infectious diseases and complex genetic disorders, including cancer, neurological disorders, diabetes, and others; (2) data mining and machine learning with applications to the biomedical domain; and (3) computational genomics. We are also expanding these interests by pursuing research in Big Data and data analytics in biomedicine. Below, we list some of our recent projects.


Bioinformatics and systems biology of disease

Determining functional effects of disease-associated genetic variations

As one of the most prevalent types of genetic variation in humans, single nucleotide polymorphisms (SNPs) are associated with a number of Mendelian diseases and complex genetic disorders. Despite the growing number of high-throughput studies, our knowledge of disease-causing SNPs is still limited. We have recently developed the first semi-supervised learning approach to characterizing the effects of genetic variation on protein-protein interactions, which we call the SNP-IN tool. Our assessment demonstrated that the SNP-IN tool outperforms other methods, making it readily applicable to system-wide studies of the rewiring of disease-centered interaction networks induced by non-synonymous SNPs.

Studying Influenza evolution

We use a computational approach that integrates evolutionary, structural, functional, and population information to study the evolution of different subtypes of influenza A virus. Specifically, we are interested in finding evolutionary, structural, and functional patterns of the virus in human, swine, and avian hosts. Our recently published study reveals an intriguing consensus among the functions associated with the extremely conserved structural regions in H1N1 across various hosts. Remarkably, we found that these regions contribute exclusively to intra-viral macromolecular interactions and co-localize with the RNA-binding or protein-binding sites. Our findings may provide novel insights into previously unknown reassortment events and help identify new drug targets.

Characterizing molecular mechanisms of soybean resistance to pathogens

In collaboration with experimental scientists, we often apply our computational methods to specific biological systems, characterizing specific diseases in humans, animals, and plants. One such application is the study of plant-nematode interactions, which aims to understand the molecular mechanisms behind the damage caused by these plant parasites and to discover new routes to plant resistance. Recently, we have structurally characterized a homodimeric complex of a novel protein, SHMT, related to nematode resistance in soybeans. The structure-guided functional characterization of natural mutations in the protein has provided insights into the molecular mechanisms behind the resistance. This is an ongoing collaboration with the experimental labs of Melissa Mitchum at the University of Missouri and Khalid Meksem at Southern Illinois University.

Biomedical data mining

DOMMINO: A comprehensive Database Of MacroMolecular INteractiOns

We have recently developed DOMMINO, the largest currently available database of interactions mediated by proteins (including domains, inter-domain linkers, N- and C-termini, or peptides), RNA, and DNA molecules. The database is automatically updated following the weekly PDB release and has (as of March 2013) ~660,000 protein-protein interaction entries and 31,081 entries that involve interactions with a nucleotide sequence (DNA/RNA). We have implemented a web interface for DOMMINO that allows a user to flexibly search and study macromolecular interactions at the network level as well as the atomic level.

Text mining of host-pathogen interactions

Despite the immense amount of literature reporting experimentally validated molecular and genetic interactions between host and pathogen organisms, this information has so far been scattered across individual publications. To address this problem, we have developed a computational approach that (i) determines whether the title or abstract of a biomedical article contains information about host-pathogen protein-protein interactions (HPIs) and (ii) extracts the HPI information in terms of the interacting proteins or genes and the corresponding host and pathogen organisms. Using our approach, we have recently processed the entire PubMed database, identifying more than 21,000 putative HPIs. We made the data publicly available through a web server, Phi2Web, that integrates automated text mining with a Web 2.0 crowdsourcing platform.
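The two steps described above can be illustrated with a deliberately simple sketch. It is not the method behind Phi2Web: it swaps in plain dictionary lookups for whatever models the actual approach uses, and every lexicon entry and function name below (HOSTS, PATHOGENS, PROTEIN_ORGANISM, mentions_hpi, extract_hpis) is a hypothetical placeholder chosen only for illustration.

```python
import re

# Hypothetical toy lexicons (assumptions for illustration only);
# a real system would rely on curated dictionaries or trained models.
HOSTS = {"human", "mouse"}
PATHOGENS = {"hiv-1", "salmonella"}
INTERACTION_WORDS = {"binds", "interacts", "targets"}
PROTEIN_ORGANISM = {"gp120": "hiv-1", "cd4": "human"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/hyphen tokens."""
    return re.findall(r"[a-z0-9\-]+", text.lower())


def mentions_hpi(text):
    """Step (i): flag text naming a host, a pathogen, and an interaction word."""
    tokens = set(tokenize(text))
    return bool(tokens & HOSTS and tokens & PATHOGENS
                and tokens & INTERACTION_WORDS)


def extract_hpis(text):
    """Step (ii): pair host and pathogen proteins that share a sentence
    containing an interaction word."""
    records = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = tokenize(sentence)
        if not INTERACTION_WORDS & set(tokens):
            continue
        host_proteins = [t for t in tokens if PROTEIN_ORGANISM.get(t) in HOSTS]
        pathogen_proteins = [t for t in tokens
                             if PROTEIN_ORGANISM.get(t) in PATHOGENS]
        for hp in host_proteins:
            for pp in pathogen_proteins:
                records.append({
                    "host": PROTEIN_ORGANISM[hp],
                    "host_protein": hp,
                    "pathogen": PROTEIN_ORGANISM[pp],
                    "pathogen_protein": pp,
                })
    return records


abstract = ("The HIV-1 envelope protein gp120 binds human CD4. "
            "Expression was measured in mouse cells.")
if mentions_hpi(abstract):
    for record in extract_hpis(abstract):
        print(record)
```

Separating the relevance filter from the extraction step mirrors the (i)/(ii) split in the description: the cheap filter can discard most abstracts before the more detailed extraction runs.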

BacPAC: A database of bacterial effectors predicted and curated

Bacterial infections affect hundreds of millions of lives worldwide and are among the deadliest tropical and neglected diseases, especially in third-world countries. The key players in bacterial infections are effectors, bacterial proteins that are injected through the bacteria's secretion system into the host cells. Unfortunately, experimental identification of bacterial effectors is a costly and time-consuming process. We have recently developed a database, BacPAC, a "Facebook" of bacterial effectors that currently contains predicted and annotated effectors from 14 genomes of Gram-negative bacteria spanning 7 different secretion systems. The effectors are determined using our in-house supervised learning method, PREFFECTOR, for accurate sequence-based prediction of effectors on the whole-genome scale across multiple secretion systems.

Computational studies of protein-protein interactions

Supervised and semi-supervised classification of native and non-native protein-protein interactions

In this project, we address two problems of determining whether a protein-protein interaction is physiological or an artifact of an experimental or computational method. The first problem is concerned with distinguishing between experimentally obtained physiological and crystal-packing protein-protein interactions. The second problem deals with the classification of near-native and inaccurate docking models. We defined a universal set of interface features and employed an SVM-based approach to classify the interactions for both problems. To further improve the classification for the second, more challenging, problem, we developed a semi-supervised learning approach, which to the best of our knowledge is the first of its kind among methods that study protein-protein interactions. The resulting scoring function has been successfully used in the CAPRI competition.
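To give a flavor of how semi-supervised learning can use unlabeled examples, here is a minimal self-training sketch. It is an illustration under stated assumptions, not our actual method: it substitutes a nearest-centroid classifier for the SVM, the two-dimensional feature vectors are made up rather than real interface features, and the names self_train and predict are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def fit_centroids(labeled):
    """One centroid per class; assumes both classes 0 and 1 are present."""
    return {y: centroid([x for x, lab in labeled if lab == y]) for y in (0, 1)}


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Self-training: repeatedly pseudo-label the unlabeled points that the
    current model classifies with a distance margin of at least `margin`."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = fit_centroids(labeled)
        confident, rest = [], []
        for x in pool:
            d0 = math.dist(x, cents[0])
            d1 = math.dist(x, cents[1])
            if abs(d0 - d1) >= margin:
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = rest
    return fit_centroids(labeled)


def predict(cents, x):
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda y: math.dist(x, cents[y]))


# Made-up 2-D features: label 0 = non-native, label 1 = near-native.
labeled = [((0.0, 0.0), 0), ((4.0, 4.0), 1)]
unlabeled = [(0.5, 0.2), (3.8, 4.1), (2.1, 2.0)]
model = self_train(labeled, unlabeled)
print(predict(model, (1.0, 1.0)), predict(model, (3.0, 3.5)))
```

Ambiguous points, such as the one near the middle of the two classes here, never reach the confidence margin and are left unlabeled, which limits the risk of reinforcing early mistakes.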

Unexpected conservation and orchestrated divergence of charged residues at interaction interfaces

In this project, we study the role of charged residues in protein interaction interfaces by analyzing their conservation patterns. We have found that charged residues exhibit an unexpected conservation pattern, which we call correlated reappearance. The analysis of conservation patterns across different superkingdoms as well as structural classes of proteins has revealed that the correlated reappearance of charged residues is by far the most prevalent conservation pattern. This intriguing phenomenon provides an explanation for a seeming contradiction between the well-documented role of charged residues in protein interactions and the fact that, on average, charged residues are less conserved in interaction interfaces than residues of other types.

Computational genomics

Long identical interspecies elements in plants and animals

Genomic elements of extreme conservation are DNA sequences that are exactly or nearly 100% identical (only a few base substitutions, insertions, or deletions are allowed). In 2004, hundreds of such elements were discovered in syntenic positions across several mammalian genomes and were named ultraconserved elements (UCEs). Using an advanced data-mining technique capable of ultra-fast comparisons of multiple eukaryotic genomes, we discovered that the phenomenon of extreme genomic conservation extends beyond the animal kingdom by detecting similarly conserved elements in plants. We have also discovered new classes of non-syntenic elements in the original mammalian genomes. Called Long Identical Multispecies Elements (LIMEs), these genomic regions include UCEs but are not limited to them.
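As a rough illustration of what searching for exactly identical elements involves, the sketch below finds maximal exact matches of a minimum length between two sequences by indexing short seeds and extending them. It is a simple seed-and-extend stand-in, not our data-mining technique: it handles only exact identity (no substitutions, insertions, or deletions), compares just two sequences, and would not scale to whole genomes. The function name shared_identical_segments is hypothetical.

```python
from collections import defaultdict


def shared_identical_segments(seq_a, seq_b, min_len):
    """Return maximal exact matches of length >= min_len as sorted
    (start_in_a, start_in_b, length) tuples."""
    # Index every min_len-mer of seq_a by its start positions.
    index = defaultdict(list)
    for i in range(len(seq_a) - min_len + 1):
        index[seq_a[i:i + min_len]].append(i)

    matches = set()
    for j in range(len(seq_b) - min_len + 1):
        for i in index.get(seq_b[j:j + min_len], ()):
            # Skip seeds that can be extended to the left: they belong to
            # a match that is reported from its leftmost seed.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend to the right while the bases agree.
            length = min_len
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            matches.add((i, j, length))
    return sorted(matches)


# Toy sequences sharing the 8-base segment "ACGTACGG".
print(shared_identical_segments("TTACGTACGGAT", "CCACGTACGGCC", min_len=5))
```

Because matches are reported by position in both sequences, the same output can be used to check whether a shared element sits in corresponding (syntenic) or unrelated (non-syntenic) locations.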