Our ultimate research goal is to develop computational methods that help discover new or more efficient treatments for complex diseases and
enable fast and accurate diagnostics. Our current research interests are in (1) bioinformatics and computational systems biology with applications to
infectious diseases and complex genetic disorders, including cancer, neurological disorders, diabetes, and others; (2) data mining and machine learning
with applications to the biomedical domain; and (3) computational genomics. We are also expanding these interests by pursuing research in the areas
of Big Data and data analytics in biomedicine. Below, we list some of our recent projects.

Bioinformatics and systems biology of disease
Determining functional effects of disease-associated genetic variations
As one of the most prevalent types of genetic variation in humans, single nucleotide polymorphisms (SNPs) are associated with a number of Mendelian
diseases and complex genetic disorders. Despite the growing number of high-throughput studies, our knowledge of the SNPs that cause disease is
still limited. We have recently developed the first semi-supervised learning approach to characterizing the effects of genetic variation on
protein-protein interactions, which we call the SNP-IN tool. The assessment demonstrated that the SNP-IN tool outperforms other methods, making it
readily applicable to system-wide studies of the rewiring of disease-centered interaction networks induced by non-synonymous SNPs.
Studying Influenza evolution
We use a computational approach that integrates evolutionary, structural, functional, and population information to study the evolution of
different subtypes of influenza A virus. Specifically, we are interested in finding evolutionary, structural, and functional patterns of the virus
in human, swine, and avian hosts. Our recently published study reveals an intriguing consensus among the functions associated with the extremely
conserved structural regions in H1N1 across various hosts. Remarkably, we found that these regions contribute exclusively to intra-viral
macromolecular interactions and co-localize with RNA-binding or protein-binding sites. Our findings may provide novel insights into
previously unknown reassortment events and help identify new drug targets.
Characterizing molecular mechanisms of soybean resistance to pathogens
In collaboration with experimental scientists, we often apply our computational methods to study
specific biological systems and to characterize specific diseases in humans, animals, and plants. One such
application is the study of plant-nematode interactions, which aims to understand the molecular mechanisms
behind the damage caused by these plant parasites and to discover new mechanisms of plant resistance.
Recently, we have structurally characterized a homodimeric complex of a novel protein, SHMT, related to
nematode resistance in soybeans. The structure-guided functional characterization of natural mutations in
the protein has provided insights into the molecular mechanisms behind the resistance. This is an ongoing
collaboration with the experimental labs of Melissa Mitchum at the University of Missouri and Khalid Meksem
at Southern Illinois University.
Biomedical data mining
DOMMINO: A comprehensive Database Of MacroMolecular INteractiOns
We have recently developed DOMMINO, the largest currently available database of interactions mediated by proteins (including domains,
inter-domain linkers, N- and C-termini, or peptides), RNA, and DNA molecules. The database is automatically updated following each weekly PDB
release and, as of March 2013, contains ~660,000 protein-protein interaction entries and 31,081 entries that involve interactions with a
nucleotide sequence (DNA/RNA). We have implemented a web interface for DOMMINO that allows a user to flexibly search and study macromolecular
interactions at the network level as well as the atomic level.
Text mining of host-pathogen interactions
Despite the immense amount of literature reporting experimentally validated molecular and
genetic interactions between host and pathogen organisms, this information has so far been scattered across
individual publications. To address this problem, we have developed a computational approach that
(i) determines whether the title or abstract of a biomedical article contains information about host-pathogen
protein-protein interactions (HPIs) and (ii) extracts the HPI information in terms of the interacting proteins or
genes and the corresponding host and pathogen organisms. Using our approach,
we have recently processed the entire PubMed database, identifying more than 21,000 putative HPIs. We have made the
data publicly available through a web server, Phi2Web, that integrates automated text mining and Web 2.0
crowdsourcing platforms.
BacPAC: A database of bacterial effectors predicted and curated
Bacterial infections affect hundreds of millions of lives worldwide and are among
the deadliest tropical and neglected diseases, especially in third-world countries. The key players in
bacterial infections are effectors, bacterial proteins that are injected through the bacterium's secretion system
into the host cells. Unfortunately, experimental identification of bacterial effectors is a costly and time-consuming
process. We have recently developed the database BacPAC, a "Facebook" of bacterial
effectors that currently contains predicted and annotated effectors from 14 genomes of Gram-negative bacteria spanning 7
different secretion systems. The effectors are determined using our in-house supervised learning method, PREFFECTOR,
which performs accurate sequence-based prediction of effectors on the whole-genome scale across multiple secretion systems.
Computational studies of protein-protein interactions
Supervised and semi-supervised classification of native and non-native protein-protein interactions
In this project, we address two problems of determining whether a protein-protein interaction is physiological or
an artifact of an experimental or computational method. The first problem is concerned with distinguishing between
experimentally obtained physiological and crystal-packing protein-protein interactions. The second problem deals
with the classification of near-native and inaccurate docking models. We defined a universal set of interface features
and employed an SVM-based approach to classify the interactions for both problems. To further improve the classification
for the second, more challenging problem, we developed a semi-supervised learning approach, which to the best of our
knowledge is the first of its kind among methods that study protein-protein interactions. The resulting scoring function
has been successfully used in the CAPRI competition.
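The general idea of semi-supervised classification, in which unlabeled examples are used alongside a small labeled set, can be illustrated with a short self-training loop. This is a minimal sketch under stated assumptions and not the method used in the project: a nearest-centroid classifier stands in for the SVM, the two-dimensional feature vectors are made up, and the names `self_train`, `predict`, `margin`, and `max_rounds` are hypothetical.

```python
import math

def centroid(points):
    """Mean of a non-empty list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def fit(labeled):
    """Return the centroids of class 0 and class 1."""
    c0 = centroid([x for x, y in labeled if y == 0])
    c1 = centroid([x for x, y in labeled if y == 1])
    return c0, c1

def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Self-training with a nearest-centroid base classifier.

    labeled:   list of (features, label) pairs, label in {0, 1},
               with at least one example of each class
    unlabeled: list of feature vectors
    margin:    hypothetical confidence threshold on the distance gap
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        c0, c1 = fit(labeled)
        remaining = []
        added = 0
        for x in pool:
            d0, d1 = math.dist(x, c0), math.dist(x, c1)
            # Pseudo-label only points that are clearly closer to one class.
            if abs(d0 - d1) >= margin:
                labeled.append((x, 0 if d0 < d1 else 1))
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:
            break  # nothing confident enough was left to add
    return fit(labeled)

def predict(x, c0, c1):
    """Assign x to the class with the nearer centroid."""
    return 0 if math.dist(x, c0) <= math.dist(x, c1) else 1

# Toy data: label 1 could stand for "near-native", 0 for "inaccurate".
labeled = [([0.0, 0.0], 0), ([4.0, 4.0], 1)]
unlabeled = [[0.5, 0.2], [3.8, 4.1], [1.0, 0.5], [2.0, 2.0]]
c0, c1 = self_train(labeled, unlabeled)
```

In this toy run the ambiguous point `[2.0, 2.0]` is never pseudo-labeled, because it stays roughly equidistant from both centroids. Keeping low-confidence examples out of the training set is the main design choice in self-training.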
Unexpected conservation and orchestrated divergence of charged residues at interaction interfaces
In this project, we study the role of charged residues in protein interaction interfaces by analyzing their conservation patterns.
We have found that the charged residues exhibit an unexpected conservation pattern, which we call correlated reappearance.
The analysis of the conservation patterns across
different superkingdoms as well as structural classes of proteins has revealed that the correlated reappearance of charged
residues is by far the most prevalent conservation pattern. This intriguing phenomenon provides an explanation for a seeming
contradiction between the well-documented role of charged residues in protein interactions and the fact that, on average,
charged residues are less conserved in the interaction interfaces than residues of other types.
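The baseline comparison mentioned above, average conservation of charged versus other residues at interface positions, can be sketched as follows. This is only an illustration under stated assumptions: conservation is measured as the frequency of the most common residue in an alignment column, charged residues are taken to be D, E, K, and R, and the toy alignment, the interface columns, and the function names are hypothetical. The sketch does not attempt to detect correlated reappearance.

```python
from collections import Counter

CHARGED = set("DEKR")  # assumption: histidine is not counted as charged

def column_conservation(alignment, col):
    """Fraction of non-gap residues equal to the most common residue."""
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(residues)

def mean_conservation_by_type(alignment, interface_cols):
    """Average conservation of interface columns, grouped by whether the
    reference (first) sequence carries a charged residue in that column."""
    groups = {"charged": [], "other": []}
    for col in interface_cols:
        key = "charged" if alignment[0][col] in CHARGED else "other"
        groups[key].append(column_conservation(alignment, col))
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

# Toy alignment of four homologous sequences; columns 0-2 are "interface".
alignment = ["KDLA", "RDLA", "EDLG", "KELA"]
result = mean_conservation_by_type(alignment, interface_cols=[0, 1, 2])
```

In the toy alignment the charged columns vary between K, R, E, and D while the hydrophobic column is invariant, so the charged group receives the lower average score even though a charge is present at those positions in every sequence.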
Computational genomics
Long identical interspecies elements in plants and animals
Genomic elements of extreme conservation are DNA sequences that are exactly or nearly 100% identical (only a few base substitutions, insertions, or
deletions are allowed). In 2004, hundreds of such elements were discovered in syntenic positions across several mammalian genomes and were named
ultraconserved elements (UCEs). Using an advanced data-mining technique capable of ultra-fast comparisons of multiple eukaryotic genomes, we
discovered that the phenomenon of extreme genomic conservation extends beyond the animal kingdom by detecting similar elements in plants. We have
also discovered new classes of non-syntenic elements in the original mammalian genomes. Called Long Identical Multispecies Elements (LIMEs), these
genomic regions include UCEs but are not limited to them.
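The notion of an exactly identical element shared by two sequences can be illustrated with a simple seed-and-extend search. The sketch below is an illustration under stated assumptions, not the data-mining technique used in the project: the function name `long_identical_elements`, the parameters `min_len` and `seed`, and the toy sequences are hypothetical; only exact matches on the forward strand of two sequences are considered; and the in-memory k-mer index would not scale to whole genomes.

```python
from collections import defaultdict

def long_identical_elements(seq_a, seq_b, min_len=8, seed=4):
    """Return maximal exact matches of length >= min_len.

    Each match is reported once as (start_a, start_b, length).
    """
    if seed > min_len:
        raise ValueError("seed must not exceed min_len")

    # Index every seed-length k-mer of seq_b by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_b) - seed + 1):
        index[seq_b[j:j + seed]].append(j)

    matches = []
    for i in range(len(seq_a) - seed + 1):
        for j in index.get(seq_a[i:i + seed], ()):
            # Skip seeds that are not the left end of a maximal match.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend to the right while the bases agree.
            length = seed
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            if length >= min_len:
                matches.append((i, j, length))
    return matches

# Toy sequences sharing one 11-base element, "ACGTACGGTCA".
genome_a = "TTTTACGTACGGTCACC"
genome_b = "GGACGTACGGTCAAAAA"
hits = long_identical_elements(genome_a, genome_b)
```

Because every maximal match of at least `min_len` bases begins with a seed that cannot be extended to the left, checking only such seeds reports each element exactly once. Tolerating the few substitutions, insertions, or deletions allowed in nearly identical elements would require an additional inexact extension step that is not shown here.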