Iowa State University

PepMIL: Predicting Flexible Length MHC-II binding Peptides

Introduction
A major step in identifying potential T-cell epitopes is predicting which peptides will bind to a specific major histocompatibility complex (MHC) molecule. There are two classes of MHC molecules: MHC class I (MHC-I) molecules, characterized by short binding peptides that usually consist of 9 residues; and MHC class II (MHC-II) molecules, whose binding peptides have a broad length distribution (typically 11 to 30 residues, although shorter and longer peptides are not uncommon) [1]. Most computational methods for predicting MHC binding peptides rely on identifying a 9-mer core peptide. Predicting MHC-II binding peptides is therefore more difficult than predicting MHC-I binding peptides, because the location of the core region within an MHC-II peptide is not known in advance.

A number of computational approaches for predicting MHC binding peptides from sequence have been developed. These include methods that use a position weight matrix to model an ungapped multiple sequence alignment of MHC binding peptides [2-4], and a statistical approach based on Hidden Markov Models (HMMs), which can model variable length peptides and is independent of the multiple sequence alignment of the binding peptides [5,6]. Machine learning methods, including Artificial Neural Networks (ANNs) [4,7,8] and Support Vector Machines (SVMs) [9,10], have also been applied. All of these methods except HMMs require the core region of each binding peptide to be identified before the model is built. This peptide alignment step is critical: it is computationally difficult and, in some cases, an incorrect region of a peptide may be assigned as the binding core. Moreover, building the model only from the core region of each training peptide discards information that could improve the discrimination between binding and non-binding peptides. For example, Chang et al. [11] showed that there is a nonlinear relationship between peptide length and binding affinity to MHC-II. They also demonstrated that adding peptide length as an extra feature to two MHC-II peptide prediction algorithms improved prediction accuracy on the HLA-DRB1*0101, -DRB1*0401, and -DRB1*1501 alleles.
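The position weight matrix idea mentioned above can be illustrated with a minimal sketch. It is not the implementation used in [2-4]: the three training 9-mers are invented for illustration, and the uniform background distribution and the pseudocount of 1 are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(aligned_cores, pseudocount=1.0):
    """Build a log-odds matrix from equal-length, ungapped peptides."""
    length = len(aligned_cores[0])
    if any(len(p) != length for p in aligned_cores):
        raise ValueError("all aligned peptides must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_cores) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        # Count how often each residue occurs in this alignment column.
        counts = Counter(p[pos] for p in aligned_cores)
        pwm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def score(pwm, peptide):
    """Sum the per-position log-odds of a peptide of matching length."""
    if len(peptide) != len(pwm):
        raise ValueError("peptide length must match the matrix length")
    return sum(column[aa] for column, aa in zip(pwm, peptide))


# Hypothetical aligned 9-mer cores, made up for this example only.
toy_cores = ["YVKQNTLKL", "FVKQNAAAL", "YIKANSKFI"]
toy_pwm = build_pwm(toy_cores)
print(score(toy_pwm, "YVKQNTLKL"), score(toy_pwm, "GGGGGGGGG"))
```

Because the matrix has one column per aligned position, it can only score peptides of that fixed length, which is why the core region has to be identified before such a model is built or applied.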

Although HMMs can be trained from a set of variable length sequences, it is still desirable to design other machine learning methods that can model flexible length peptides, because HMMs do not deal well with correlations between residues: they assume that each residue depends only on one underlying state [12]. Recently, two machine learning methods for predicting MHC-binding peptides of flexible length have been proposed. The first method [13] uses a Bayesian network to predict class I HLA-A2 binding peptides. The second method [14] represents each variable length peptide by a set of features extracted from amino acid composition and a number of physico-chemical properties, as described in [15]. Both methods were trained on datasets of flexible length peptides and evaluated by assigning labels to flexible length test peptides, assuming that the actual peptide length is the length of the test peptide. In other words, no attempt has been made to evaluate how well either method predicts the length of the binding peptide.

In this work, we introduce a machine learning method for predicting flexible length MHC-II binding peptides using multiple-instance learning (MIL). Our experimental results demonstrate that learning from the whole peptide sequence not only removes the need to compute the core regions of the training peptides, but also allows our method to significantly outperform other machine learning methods trained only on the core peptides.
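To make the multiple-instance view concrete, the following sketch treats a peptide as a bag of overlapping 9-mers and labels the bag by its best-scoring instance. This is an assumption-laden illustration, not the PepMIL algorithm: the text above does not specify how bags are built or combined, so the 9-mer windows, the max aggregation (the standard MIL assumption), and the placeholder scorer `toy_instance_score` are all hypothetical.

```python
CORE_LENGTH = 9  # assumption: instances are 9-mer windows


def peptide_to_bag(peptide, k=CORE_LENGTH):
    """Return every contiguous k-mer of the peptide (empty if too short)."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def best_instance(bag, instance_score):
    """Return the highest-scoring instance in a non-empty bag."""
    if not bag:
        raise ValueError("bag is empty: peptide shorter than the window")
    return max(bag, key=instance_score)


def bag_score(bag, instance_score):
    """Standard MIL assumption: a bag scores as high as its best instance."""
    return instance_score(best_instance(bag, instance_score))


# Placeholder scorer, chosen only so the example runs. A real system
# would learn this function from labeled bags.
TOY_LETTERS = set("AILMFVWY")


def toy_instance_score(ninemer):
    """Fraction of residues that fall in an arbitrary fixed letter set."""
    return sum(aa in TOY_LETTERS for aa in ninemer) / len(ninemer)


toy_peptide = "GGGG" + "ILVFMAWYI" + "GGGG"  # made-up 17-residue peptide
toy_bag = peptide_to_bag(toy_peptide)
print(len(toy_bag), best_instance(toy_bag, toy_instance_score))
```

In this framing, only the peptide as a whole needs a binder or non-binder label for training, so no core has to be assigned to any training peptide beforehand.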

Citation
1. EL-Manzalawy Y, Honavar V: MICCLLR: A Generalized Multiple-Instance Learning Algorithm Using Class Conditional Log Likelihood Ratio. Tech. rep., Computer Science Department, Iowa State University 2007.

2. EL-Manzalawy Y, Dobbs D, Honavar V: PepMIL: A Novel Method for Predicting Flexible Length MHC-II binding Peptides. Submitted to BMC Bioinformatics.