Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside protein-protein interaction (PPI) interfaces. Determining whether such an nsSNP will disrupt or preserve a PPI is challenging, both experimentally and computationally.

Here, we present a new computational method that predicts the effects of nsSNPs on PPIs, given the interaction’s structure. The method, called SNPIN (non-synonymous SNP INteraction effect predictor), includes six feature-based classifiers obtained using supervised and semi-supervised machine learning. The classifiers were trained on a dataset of comprehensive mutagenesis studies for 151 PPI complexes with experimentally determined binding affinities of the mutant and wild-type interactions. Each nsSNP was assigned to one of three types (beneficial, detrimental, or neutral) based on the change in binding affinity that it causes. In addition to the labeled data, 17,692 unlabeled nsSNPs were computationally generated for each complex.
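As an illustration of how a change in binding affinity can be turned into one of the three labels, the following Python sketch may help. It is not the labeling procedure used for SNPIN: the sign convention (a more negative binding free energy means tighter binding), the 0.5 kcal/mol cutoff, and the function names `delta_delta_g` and `label_effect` are all assumptions made for the example.

```python
def delta_delta_g(dg_mutant, dg_wild_type):
    """Change in binding free energy (kcal/mol) caused by the mutation.

    Assumed convention: a more negative dG means tighter binding, so a
    positive result means the mutant binds more weakly than the wild type.
    """
    return dg_mutant - dg_wild_type


def label_effect(ddg, threshold=0.5):
    """Assign one of the three effect types from a ddG value.

    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    if ddg >= threshold:
        return "detrimental"   # binding is weakened
    if ddg <= -threshold:
        return "beneficial"    # binding is strengthened
    return "neutral"           # change is within the cutoff


if __name__ == "__main__":
    # Hypothetical (mutant, wild-type) binding free energies in kcal/mol.
    for dg_mut, dg_wt in [(-8.0, -10.0), (-10.0, -10.2), (-12.0, -10.0)]:
        ddg = delta_delta_g(dg_mut, dg_wt)
        print(f"ddG = {ddg:+.1f} kcal/mol -> {label_effect(ddg)}")
```

A symmetric cutoff keeps the example short; a real labeling scheme might use different thresholds for the beneficial and detrimental classes.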

Three classification problems were considered: (1) a 3-class classification (detrimental, neutral, and beneficial effects), (2) a 2-class classification (detrimental or beneficial effects), and (3) another 2-class classification (detrimental or non-detrimental effects). For each problem, Random Forest and self-training Random Forest were used as the supervised and semi-supervised learning approaches, respectively. For the most difficult, 3-class problem, the best performance was achieved by the semi-supervised approach, with a weighted average f-measure of 69.9%. The supervised method showed the best performance for the 2-class problem, with an f-measure of 87%. Most importantly, both methods demonstrated near-perfect detection of the detrimental nsSNPs for each classification problem.
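The three problem formulations and the weighted average f-measure can be sketched in a few lines of Python. This is a minimal illustration, assuming that "weighted average" means the per-class F1 scores averaged with weights proportional to the number of true samples in each class; the helper names `relabel` and `weighted_f_measure` and the toy labels are hypothetical and not taken from SNPIN.

```python
from collections import Counter


def relabel(label, problem):
    """Map a 3-class label onto problem 1, 2, or 3.

    Returns None when the sample does not belong to the problem
    (neutral samples are left out of problem 2).
    """
    if problem == 1:
        return label
    if problem == 2:
        return None if label == "neutral" else label
    if problem == 3:
        return "detrimental" if label == "detrimental" else "non-detrimental"
    raise ValueError("problem must be 1, 2, or 3")


def weighted_f_measure(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        # The denominator is positive because the class has n >= 1 true samples.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = ["detrimental", "detrimental", "neutral", "beneficial"]
    y_pred = ["detrimental", "neutral", "neutral", "beneficial"]
    print(weighted_f_measure(y_true, y_pred))

    # The same samples viewed as problem 3 (detrimental vs. non-detrimental).
    t3 = [relabel(t, 3) for t in y_true]
    p3 = [relabel(p, 3) for p in y_pred]
    print(weighted_f_measure(t3, p3))
```

Weighting by support means that a large class dominates the score, which is one reason the per-class detection rate for detrimental nsSNPs is worth reporting separately.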
