Binding to MHC-I is the most selective requirement for a peptide to become an epitope through the MHC-I pathway. This has directed most experimental and theoretical work aimed at MHC-I epitope prediction towards this step. In this chapter, a new method to predict the affinity of a peptide to an MHC-I molecule is introduced and compared to existing prediction methods. Special attention is paid to the independent binding assumption, which states that each residue within a peptide contributes independently to the overall binding of the peptide. Several published prediction methods make this assumption, while others claim it to be invalid because it neglects interactions between peptide positions. Here, the violation of the independent binding assumption is quantified for the first time.

This chapter is organized as follows: The first section (2.1) gives an overview of existing prediction algorithms, followed by a description of the available datasets consisting of peptides with measured affinities, which are needed for training and testing of the algorithms (section 2.2). Section 2.3 explains how predictions from existing algorithms were obtained. This is followed by the introduction of the novel SMM prediction method (section 2.4), and a description of the means to compare the prediction quality of different methods (section 2.5). In section 2.6, all algorithms making the independent binding assumption are compared, including the novel SMM method. This is followed by a comparison of general prediction methods not making that assumption in section 2.7. In section 2.8, the SMM prediction method is extended to include pair coefficients describing the pair-wise interactions between peptide positions, thus dropping the independent binding assumption. Finally, the distribution of these quantified pair-wise interactions between peptide positions is analyzed (section 2.9).

Most of the results presented in this chapter are taken from (Peters, et al., 2003).

When the first peptides binding to MHC-I molecules were sequenced (Rotzschke, et al., 1990), it soon became clear that each MHC-I molecule has its own narrow binding preference. Only peptides of a certain length were found, typically 8-10 amino acids long. Some positions in the peptides were almost always occupied by one of a small set of amino acids. These positions are called anchor positions, and together their preferred residues define the anchor motif of an MHC-I molecule.

With more data available, it became apparent that several peptides bind to MHC-I without complying with the anchor motif or, more often, that peptides containing the motif do not bind (Jameson and Bevan, 1992). This led to extensions of the motif method, listing more and more good, bad or tolerated residues for binding at the different peptide positions. One deficiency of such a listing is that it does not say whether a good residue at one position can compensate for a bad residue somewhere else. To achieve this, scores were assigned to each amino acid at each position. These can be summed up for a given peptide, and the total score predicts whether the peptide is likely to bind or not. This is called a matrix based prediction, because the scores for each peptide position and amino acid can be written in the form of a matrix like the one shown in Table 1.

The first scoring matrix for MHC-I binding was introduced by (Parker, et al., 1994), which is in the following called the BIMAS method. The experimental basis was a set of peptides with measured half-life dissociation constants for the HLA-A0201 allele. The matrix coefficients were determined mathematically by minimizing the distance between predicted matrix scores for the peptides and the logarithms of their measured half-life constants. This makes the matrix score a prediction of a peptide's half-life of dissociation, and therefore of its strength of binding.

A major contribution to the prediction of epitopes was the establishment of the SYFPEITHI database (Rammensee, et al., 1999;Rammensee, et al., 1995). This database is a collection of sequences of naturally processed epitopes and peptides known to bind to MHC-I molecules, which serves as the basis for a scoring algorithm. To determine a scoring matrix for a given allele, the frequencies of amino acids in peptides binding to that allele are used to manually assign a score to each amino acid at each peptide position.

Another matrix based prediction approach is called the polynomial method (PM) (Gulukota, et al., 1997). It has the advantage of being very easy to calculate from a set of peptides with known affinity values. Each matrix entry corresponding to a particular amino acid type at a particular peptide position is calculated as the mean affinity of all peptides containing that amino acid type at that position.
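As an illustrative sketch (not the original implementation), the polynomial method can be written in a few lines of Python. Filling residue/position combinations that never occur in the training set with the overall mean affinity is an assumption made here for completeness:

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def polynomial_method_matrix(peptides, log_affinities, length=9):
    """Sketch of the polynomial method (PM): each matrix entry is the mean
    (log) affinity of all training peptides carrying that amino acid at
    that position. Unobserved combinations fall back to the overall mean
    (an assumption, not specified in the original description)."""
    overall = sum(log_affinities) / len(log_affinities)
    matrix = {}
    for pos in range(length):
        for aa in AMINO_ACIDS:
            vals = [y for pep, y in zip(peptides, log_affinities) if pep[pos] == aa]
            matrix[(pos, aa)] = sum(vals) / len(vals) if vals else overall
    return matrix
```

Because each entry is a simple average, the matrix can be computed in a single pass over the training data, which is what makes the method "very easy to calculate".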

A matrix based prediction necessarily assumes that the contribution to binding of each residue in a peptide is independent of the other residues in the peptide. This is a daring assumption, known to be false for many other biological problems. However, matrix based algorithms work surprisingly well for MHC-I affinity predictions: Scoring matrices are usually calculated with little experimental data (typically hundreds of peptides) compared to the size of sequence space (for 9-mers: 20^{9}), and still give good predictions. This shows that the relationship between sequence and affinity can at least roughly be approximated by the independent binding assumption. The assumption is also supported by structural data showing that peptides bind in an extended conformation in the MHC-I groove, so that contacts between residues of the bound peptide are limited.

Even if independent binding is a justified approximation, a method without this restriction should be able to make better predictions if interactions between peptide positions play a role at all. In the following, such methods are called general methods, in contrast to those making the independent binding assumption. The first general methods used to predict binding to MHC-I were artificial neural networks (ANN) (Gulukota, et al., 1997;Milik, et al., 1998). ANNs are a machine learning technique that can, in principle, model arbitrary relationships between a peptide sequence and its affinity, including interactions between peptide positions.

Another general prediction method are Classification and Regression Trees (CART). These are described in detail in (Breiman, et al., 1984), and were first used by (Segal, et al., 2001) to predict MHC-I binding. Briefly, a classification tree is built by introducing splits in a set of peptides with known binding capability according to what amino acid is at a certain position of a peptide. The splitting is repeated, leading to a tree shaped classification scheme (e.g. Figure 8). Each split is chosen so that it maximizes the homogeneity of the peptides in both daughter nodes. A perfectly homogenous node contains only binders or only non-binders. A very large tree is built first so that all nodes are perfectly homogenous. It is then pruned back to an optimal size determined by cross-validation. Each terminal node in the optimal tree is assigned a binding score, computed as the percentage of non-binders the node contains.
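The split selection at the heart of such a tree can be illustrated as follows. This sketch uses the Gini impurity as the homogeneity measure, which is the standard CART criterion; whether (Segal, et al., 2001) used exactly this measure is an assumption:

```python
def gini(labels):
    # Gini impurity of a set of binary labels (1 = binder, 0 = non-binder);
    # a perfectly homogeneous node has impurity 0.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(peptides, labels):
    """Find the single question 'is amino acid X at position i?' that
    minimizes the weighted Gini impurity of the two daughter nodes."""
    n = len(peptides)
    best, best_score = None, float("inf")
    for pos in range(len(peptides[0])):
        for aa in set(p[pos] for p in peptides):
            yes = [y for p, y in zip(peptides, labels) if p[pos] == aa]
            no = [y for p, y in zip(peptides, labels) if p[pos] != aa]
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / n
            if score < best_score:
                best, best_score = (pos, aa), score
    return best, best_score
```

Repeating this split recursively on each daughter node yields the tree; pruning and the terminal-node binding scores are omitted from this sketch.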

The third general prediction method used in this chapter is the additive method (Doytchinova, et al., 2002). It consists of a scoring matrix plus coefficients describing pair-wise interactions between amino acids. The score of a peptide is calculated by adding those coefficients belonging to pairs of amino acids present in that peptide to its matrix score, as described below in equation (6). All interactions between neighboring and next-neighbor amino acids are considered. The matrix and interaction coefficients are determined by a partial least squares fit to a set of peptides with known affinities.

The prediction methods used in this chapter were chosen because they were freely accessible or reasonably easy to reproduce. Other classes of prediction methods based on binding data which are not treated here are Hidden Markov Models (Mamitsuka, 1998) and Support Vector Machines (Donnes and Elofsson, 2002). There are also prediction methods based on structural data of solved peptide-MHC-I complexes. These use threading or molecular modeling techniques to identify potential binders (Altuvia, et al., 1995;Schueler-Furman, et al., 2000).

Four non-overlapping sets of 9-mer peptides with known affinities to the HLA-A0201 allele are used in this chapter. All datasets are separated into 'binders' and 'non-binders'. The cutoff for making this separation is not critical as long as all the binders have a higher affinity than any of the non-binders. All sets have similar sequence composition: at the anchor positions 2 and 9, amino acids A, I, L, M, T and V are over-represented; at other positions, all 20 amino acids are roughly equally represented. For the SYFPEITHI-set, this reflects the occurrence of amino acids at these positions in naturally processed epitopes. For the *in vitro* datasets, where the choice of peptides is made by an experimentalist, this bias is also found because experimentalists want to include as many binders in their datasets as possible. While it would be better - from a mathematical viewpoint - to have binding information of randomly selected peptides, this would require far more measurements as only very few random peptides bind. As the described bias is present in all learning and test sets, no particular prediction method is favored.

**Train-set**: This set consists of 533 peptides with IC50 values measuring their binding affinity to the HLA-A0201 molecule, as described in (Gulukota, et al., 1997). IC50 values are measured in an assay in which both the peptide of interest and a reference peptide compete for binding to HLA-A0201. The IC50 value of a peptide is the concentration at which it suppresses 50% of the reference peptide's binding to MHC-I. The logarithm of the IC50 value can be interpreted as the difference in binding free energy between a peptide and the reference.

Several peptides in the dataset are ‘heavy non-binders’, i.e. their IC50 value is too large to be measured. Using an IC50 cutoff of 500 nM, the Train-set is split into 127 binders and 406 non-binders, which of course include the ‘heavy non-binders’. The cutoff lies within the intermediate to low affinity range; peptides in this set with IC50 < 50 nM are considered to be high-affinity binders.

**Blind-set**: This set of 173 peptides with IC50 values was measured with the same experimental setup as the Train-set. Using the IC50 cutoff of 500 nM, the set is split into 67 binders and 108 non-binders.

**BIMAS-set**: This set of peptides was published by (Parker, et al., 1994). For 134 peptides, β2 microglobulin dissociation half-lives have been measured. Four of the peptides overlap with the Train-set. By interpolating between their known IC50 and half-life values, the IC50=500nM cutoff in the Train-set is translated to a half-life cutoff of approximately 650 minutes. Using this cutoff and excluding the overlapping peptides, the BIMAS-set has 25 binders and 105 non-binders.

**SYFPEITHI-set**: 143 peptides binding to HLA-A0201 were taken from the SYFPEITHI database, most of which are naturally processed epitopes. All of these were classified as binders, even though there is no measured affinity available for them. To have non-binders in this set that definitely have lower affinities than these binders, the 59 heavy non-binders from the Blind-set were included.

The BIMAS and SYFPEITHI predictions were obtained from web servers (BIMAS: http://bimas.dcrt.nih.gov/molbio/hla_bind/, SYFPEITHI: http://www.uni-tuebingen.de/uni/kxi). Note that the training data of these methods contained the test sets of the same name described above.

The PM, CART and ANN predictions were trained using the Train-set described above. For the CART method, two commercial software packages (SPSS and CART) were used, leading to an identical optimal classification tree shown in Figure 8. Switching from a classification tree used here, which needs binary experimental binding data for training, to a regression tree, which uses quantitative IC50 values, improves the prediction performance only slightly.

The ANN was designed as described in (Gulukota, et al., 1997): A feed-forward neural network with three layers was built consisting of an input layer with 180 neurons, a hidden layer with 50 neurons and an output layer with one neuron. The Aspirin/MIGRAINES software package from the MITRE Corporation (http://www.emsl.pnl.gov:2080) was used to simulate the network.
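The 180-neuron input layer corresponds to one input per (position, amino acid) combination of a 9-mer (9 × 20 = 180). A sparse one-hot encoding along these lines might look as follows; the exact encoding used in the original work is an assumption:

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def one_hot_encode(peptide):
    """Encode a peptide as a flat vector with one input per
    (position, amino acid) pair: 9 positions x 20 amino acids = 180
    inputs, matching the input layer size described above."""
    vec = [0.0] * (len(peptide) * len(AMINO_ACIDS))
    for i, aa in enumerate(peptide):
        vec[i * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return vec
```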

Scoring matrices quantify the contributions of individual residues in a peptide to binding. The matrix element s_{i,a} corresponds to amino acid a at position i of the peptide. The total score S_{k} for a given peptide k with the amino acids a_{k}(i) at positions i is then given by the summation:

S_{k} = s_{0} + Σ_{i} s_{i,a_{k}(i)}    (1)

where s_{0} is a constant offset.
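Equation (1) translates directly into code. In this illustrative sketch, the matrix is represented as a dictionary mapping (position, amino acid) to a score, with 0-based positions:

```python
def matrix_score(peptide, matrix, offset=0.0):
    """Equation (1): total score = offset s_0 plus the sum of the
    per-position entries s_{i,a} for the amino acids of the peptide.
    Entries missing from the dictionary contribute zero."""
    return offset + sum(matrix.get((i, aa), 0.0) for i, aa in enumerate(peptide))
```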

In this section, the novel stabilized matrix method (SMM) is developed. It determines the values for s_{i,a} and s_{0} by minimizing the distance between predicted scores S_{k} and measured affinities for the peptides in the Train-set:

Φ = Σ_{k} || S_{k} - ln(IC50_{k}) || → min    (2)

For a peptide with a measurable IC50 value, the norm in equation (2) has the form:

|| S_{k} - ln(IC50_{k}) || = (S_{k} - ln(IC50_{k}))^{2}    (3)

Several peptides in the Train-set have affinities too low to measure an IC50 value. For these ‘heavy non-binders’, it is only known that their log IC50 values are equal to or greater than the largest experimentally measurable value, which was cutoff = ln(50,000) for the Train-set. Accordingly, for these peptides the norm in (2) is set to be

|| S_{k} - ln(IC50_{k}) || = (cutoff - S_{k})^{2} if S_{k} < cutoff, and 0 otherwise    (4)

To avoid over-fitting, a second term is added to the minimization function in (2):

Φ + λ Σ_{i,a} (s_{i,a})^{2} → min    (5)

By minimizing with a non-zero λ value, a tradeoff is introduced between optimally reproducing the experimental IC50 values (including their inevitable experimental error) and keeping the parameters s_{i,a} small. This forces towards zero all parameters s_{i,a} that do not significantly lower the distance Φ. The optimal value for λ is determined by 10-fold cross-validation on the Train-set, i.e. splitting the total set of peptides into 10 subsets, establishing a scoring matrix using 9 of these subsets and making predictions with that matrix for the left-out subset. Figure 5 depicts the sum of the distances between these predictions and the experimental values as a function of λ.

The minimum lies at λ_{opt}=1. The resulting SMM scoring matrix is shown in Table 1. The lower a score, the better an amino acid is suited for binding at the given position.

A similar mathematical concept is used to solve 'inverse problems', where λ is called the regularization parameter. A short introduction to inverse problems is given in (Press, et al., 1992), chapter 18. To minimize equation (5), a commercial non-linear optimizer (Frontline Systems, 1999) using a generalized-reduced-gradient method is applied.
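For illustration, the regularized objective of equations (2) and (5) can be minimized for a toy example with plain gradient descent instead of the generalized-reduced-gradient optimizer used here. Heavy non-binders (equation 4) are omitted from this sketch, and all names are illustrative:

```python
def smm_fit(peptides, log_ic50, aas, lam=1.0, steps=2000, lr=0.01):
    """Minimize sum_k (S_k - y_k)^2 + lam * sum (s_{i,a})^2 by gradient
    descent, where S_k = s0 + sum_i s_{i,a_k(i)} as in equation (1).
    Returns the matrix coefficients s and the offset s0."""
    length = len(peptides[0])
    s = {(i, a): 0.0 for i in range(length) for a in aas}
    s0 = 0.0
    for _ in range(steps):
        # gradient of the regularization term lam * s^2
        grad = {k: 2.0 * lam * v for k, v in s.items()}
        g0 = 0.0
        for pep, y in zip(peptides, log_ic50):
            err = s0 + sum(s[(i, a)] for i, a in enumerate(pep)) - y
            g0 += 2.0 * err
            for i, a in enumerate(pep):
                grad[(i, a)] += 2.0 * err
        for k in s:
            s[k] -= lr * grad[k]
        s0 -= lr * g0
    return s, s0
```

With λ=0 this reduces to an unregularized least-squares fit; a non-zero λ shrinks all coefficients towards zero, as described above.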

**Table 1**: The SMM scoring matrix. Rows are amino acids, columns are peptide positions; the offset s_{0} is 10.14.

| | **Pos 1** | **Pos 2** | **Pos 3** | **Pos 4** | **Pos 5** | **Pos 6** | **Pos 7** | **Pos 8** | **Pos 9** |
|---|---|---|---|---|---|---|---|---|---|
| **A** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| **R** | 0.58 | 0.74 | 3.17 | 0.13 | 0.78 | 1.30 | 1.35 | 0.33 | 2.76 |
| **D** | 4.90 | 0.74 | 1.46 | -1.50 | -0.15 | -0.14 | 1.50 | 1.06 | 2.76 |
| **N** | 1.47 | 0.74 | 2.03 | -1.48 | -0.36 | -1.78 | 0.51 | -0.76 | 2.76 |
| **C** | 0.93 | 0.74 | 0.96 | -0.96 | -0.82 | -1.15 | 0.56 | 0.57 | 2.76 |
| **E** | 3.46 | 0.74 | 2.79 | -1.24 | 0.40 | -0.74 | -0.17 | 0.39 | 2.76 |
| **Q** | 1.63 | 0.74 | 1.24 | -0.34 | -0.41 | -0.52 | 0.61 | 0.36 | 2.76 |
| **G** | 0.50 | 0.74 | 1.08 | -0.83 | -0.64 | -0.60 | 1.22 | 0.54 | 2.76 |
| **H** | 1.33 | 0.74 | 0.90 | -1.44 | -0.39 | -0.60 | -0.43 | 0.06 | 2.76 |
| **I** | 0.58 | -0.42 | 1.54 | 1.67 | -0.78 | -1.40 | -0.29 | 0.59 | -0.50 |
| **L** | 0.29 | -3.06 | -0.38 | -0.17 | -0.03 | -1.48 | -0.01 | 0.26 | 0.50 |
| **K** | -0.20 | 0.74 | 2.25 | 0.22 | 0.05 | 1.43 | 1.73 | 1.20 | 2.76 |
| **M** | -1.46 | -2.72 | -1.02 | 0.36 | -0.41 | -1.43 | -1.08 | 0.58 | 0.91 |
| **F** | -1.24 | 0.74 | 0.13 | -0.45 | -1.84 | -1.48 | -1.64 | -1.10 | 2.76 |
| **P** | 3.63 | 0.74 | 1.12 | -0.67 | 0.93 | -0.90 | -1.60 | -0.66 | 2.76 |
| **S** | 0.24 | 0.74 | 1.05 | -0.25 | -0.26 | -0.26 | 0.18 | 0.08 | 2.76 |
| **T** | 0.82 | 0.17 | 1.70 | -0.63 | -0.75 | -2.20 | 0.17 | -0.00 | 1.06 |
| **W** | 0.38 | 0.97 | -0.03 | -2.38 | -1.49 | -1.07 | -1.67 | 0.02 | 2.76 |
| **Y** | -1.91 | 0.74 | -0.89 | -0.12 | -1.74 | -1.37 | -2.34 | -0.77 | 2.76 |
| **V** | -0.09 | -0.31 | 1.28 | 1.15 | -0.44 | -2.13 | 0.03 | 1.04 | -0.81 |
| **Offset** | 10.14 | | | | | | | | |

In this chapter, the predictions of diverse methods on equally diverse datasets have to be compared. To do this fairly, ROC curves are used (Bradley, 1997): For a given cut-off value which separates peptides by their predicted score into potential binders and non-binders, the two variables sensitivity (true positives / total positives) and 1-specificity (false positives / total negatives = false alarm rate) are calculated. By systematically varying the cut-off value from the lowest to the highest predicted score, a ROC curve like Figure 6 is plotted. Prediction quality is measured by the area under the curve (AUC), which is 0.5 for random predictions and 1.0 for perfect predictions. The AUC is equivalent to the probability that the score of a randomly chosen binder is higher than that of a randomly chosen non-binder. This measure has the advantage of not relying on a single arbitrarily chosen cut-off value for the prediction score, and can be equally applied to all datasets and prediction methods.
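Because the AUC equals the probability that a randomly chosen binder scores higher than a randomly chosen non-binder, it can be computed directly from that interpretation, without explicitly tracing the ROC curve. A minimal sketch, with ties counted as one half:

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve: the fraction of (binder, non-binder)
    pairs in which the binder receives the higher score (ties = 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 0.5 corresponds to random predictions and 1.0 to a perfect separation of binders from non-binders, as stated above.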

To assess if one prediction is significantly better than another, the set of peptides for which predictions are made was re-sampled. Using bootstrapping with replacement, 50 new datasets were generated with a constant ratio of positives to negatives. The difference in AUC for the two predictions on each new dataset is then calculated. One prediction is significantly better than another if the distribution of the differences in AUC values is significantly above zero, which is measured using a standard t-test with a p-value of 0.001.
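The re-sampling procedure can be sketched as follows. The constant ratio of positives to negatives is maintained by re-sampling binders and non-binders separately; details such as the random seed are illustrative assumptions:

```python
import random

def auc(pos_scores, neg_scores):
    # AUC as the probability that a binder outscores a non-binder
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_auc_differences(pos_a, neg_a, pos_b, neg_b, n_sets=50, seed=0):
    """Draw 50 bootstrap sets (sampling binders and non-binders with
    replacement, separately, so the positive/negative ratio is fixed)
    and return the AUC difference of method A over method B on each.
    pos_a[i] and pos_b[i] are the two methods' scores for the same
    binder peptide, so the same resampled indices apply to both."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sets):
        pi = [rng.randrange(len(pos_a)) for _ in pos_a]
        ni = [rng.randrange(len(neg_a)) for _ in neg_a]
        a = auc([pos_a[i] for i in pi], [neg_a[j] for j in ni])
        b = auc([pos_b[i] for i in pi], [neg_b[j] for j in ni])
        diffs.append(a - b)
    return diffs
```

The resulting list of differences is then tested against zero (here with a t-test, p = 0.001) to decide whether one method is significantly better.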

The four matrix based methods had to predict which of the peptides in the Blind-set are binders. The Blind-set is the only set truly ‘blind’ to all methods, i.e. none of the peptides in this set were included in the training data of any of the prediction methods. Figure 6 depicts ROC curves for all predictions. It indicates that the performance ranks in the order SMM > BIMAS > PM over almost the entire range. SYFPEITHI is the worst method for sensitivities above 0.42 and becomes the best for specificities above 0.97. This is due to the fact that SYFPEITHI predictions reflect other components of the antigen presentation pathway in addition to MHC-I binding, since the database consists mainly of naturally processed epitopes.

**Table 2**: AUC values of all prediction methods on the three test sets. Values in parentheses indicate that the test set was part of the method's training data.

| **Prediction Method** | **Independent Binding Assumption** | **Blind** | **SYFPEITHI** | **BIMAS** |
|---|---|---|---|---|
| SMM, λ=1 | Yes | **0.869** | **0.848** | **0.866** |
| SMM, λ=0 | Yes | 0.856 | 0.846 | 0.865 |
| BIMAS | Yes | 0.846 | 0.829 | (0.875) |
| PM | Yes | 0.795 | 0.792 | 0.757 |
| SYFPEITHI | Yes | 0.745 | (0.865) | 0.754 |
| SMM + pair coef. | No | **0.873** | **0.852** | **0.869** |
| ANN | No | 0.796 | 0.788 | 0.762 |
| Additive method | No | 0.820 | 0.770 | 0.830 |
| CART | No | 0.708 | 0.620 | 0.539 |

Two elements make the SMM approach different from other matrix based methods. First, it incorporates the experimental information of heavy non-binders precisely into the distance defined in equation (2). Since only a lower bound of the IC50 value of a heavy non-binder can be determined, previous approaches have either left these peptides out entirely or tried to fit them exactly to the lower bound. Second, the regularization technique is used. Because of errors in the experimental measurements, there can be multiple sets of matrix coefficients that reproduce the experimental data within the range of error. Choosing the set of coefficients that gives the minimum distance can overfit the problem. By incorporating a regularization parameter (λ in equation 5), a set of coefficients is chosen that reproduces the experimental results reasonably well while keeping the parameter values small. This effectively prevents overfitting. Table 2 indicates that the AUC values at λ_{opt}=1 are better than those at λ=0 for all test sets.

Figure 7 shows ROC curves for the general prediction methods (not making the independent binding assumption) on the Blind-set. The corresponding AUC values are also listed in Table 2. None of the general methods reaches the prediction quality of the SMM matrix method.

In the case of the additive method and the ANN, lack of data has led to overfitting. The additive method has 1850 free parameters; the ANN architecture taken from (Gulukota, et al., 1997) has more than 9000 free parameters (network weights). Such numbers of parameters cannot be determined reliably from the Train-set, which contains only 533 experimental data points. For the ANN, choosing a different architecture with fewer free parameters would probably have improved its prediction quality. For the additive method, the overfitting seems to affect mainly the interaction coefficients: when these coefficients, which describe interactions between neighboring amino acids, are neglected and only the terms compatible with the independent binding assumption are kept, the prediction quality improves.

The results of the previous section show that, given the limited data, general methods tend to perform worse than matrix based ones. In this section, a general method is developed that uses the matrix predictions as a starting point. This is done by quantifying the contribution to binding of interactions between peptide positions with pair coefficients s'_{i,a,i',a'}. For example, the coefficient s'_{3A,7L} (..A...L..) describes the difference in binding between the following two scenarios: (1) an Alanine at the third AND a Leucine at the seventh position of the peptide, and (2) the sum of the average contributions of having an Alanine at the third position, described by matrix value s_{3A} (..A......), and having a Leucine at the seventh position, described by s_{7L} (......L..). The total score for a given peptide k with the amino acids a_{k}(i) at positions i is then given by

S'_{k} = S_{k} + Σ_{i<i'} s'_{i,a_{k}(i),i',a_{k}(i')}    (6)

where S_{k} is the matrix score defined in equation (1), and the sum runs over all pairs of amino acids found in the peptide. For 9-mer peptides, with 36 possible position pairs, this results in 20*20*36 = 14400 different pair coefficients. Since it is impossible to determine that many coefficients from the limited data, a two-stage selection process is applied: In the first step, all coefficients for which fewer than N_{min}=10 peptides exist in the Train-set are eliminated, as they lack sufficient experimental information. The optimal values for the remaining 269 pair coefficients can then be calculated by minimizing

Φ' + λ' Σ_{i,a,i',a'} (s'_{i,a,i',a'})^{2} → min    (7)

where Φ' is the same as Φ in equation (2), except that scores S'_{k} are used instead of S_{k}. When optimizing the pair coefficients s'_{i,a,i',a'} in equation (7), the matrix coefficients s_{i,a} are frozen at their optimal value determined from equation (5).
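Equation (6) can be sketched as follows, with the pair coefficients stored sparsely so that eliminated coefficients simply contribute zero. This is an illustrative implementation with 0-based positions, not the original code:

```python
def pair_score(peptide, matrix, offset, pair_coefs):
    """Equation (6): the matrix score of equation (1) plus all applicable
    pair coefficients. pair_coefs maps (pos_i, aa_i, pos_j, aa_j) with
    pos_i < pos_j to a coefficient; missing pairs contribute zero."""
    total = offset + sum(matrix.get((i, aa), 0.0) for i, aa in enumerate(peptide))
    n = len(peptide)
    for i in range(n):
        for j in range(i + 1, n):
            total += pair_coefs.get((i, peptide[i], j, peptide[j]), 0.0)
    return total
```

For a 9-mer, the double loop visits exactly the 36 position pairs mentioned above.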

In a second selection step, the Train-set is split into 10 equal-size non-overlapping subsets, and 10 different optimal values for each pair coefficient are determined by leaving out one subset at a time and minimizing equation (7). If the 10 values determined for a pair coefficient contain both positive and negative entries, the information on this coefficient in the Train-set is considered conflicting, and the coefficient is discarded.
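Assuming the selection criterion is that all cross-validation estimates of a coefficient must share the same sign (an assumption consistent with the coefficients described later as discarded due to conflicting information), the second selection step can be sketched as:

```python
def consistent_pairs(fold_estimates):
    """Keep only pair coefficients whose per-fold estimates all share the
    same sign; coefficients with conflicting signs are discarded.
    fold_estimates maps a coefficient identifier to its list of
    cross-validation values; kept coefficients map to the mean value
    (averaging is an assumption of this sketch)."""
    kept = {}
    for coef, values in fold_estimates.items():
        if all(v > 0 for v in values) or all(v < 0 for v in values):
            kept[coef] = sum(values) / len(values)
    return kept
```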

The optimal value for λ' is again determined using cross-validation, similar to λ in equation (5). Figure 9 shows the cross-validated distance, calculated using equation (2), compared to that of the plain matrix predictions for various values of λ'. For very high values of λ', the pair coefficients are forced to stay near zero, and the prediction accuracy is equal to that of the plain matrix approach. For λ' = 0, there is no restriction on the values of the pair coefficients, leading to over-fitting, and the resulting performance is much worse than that of the plain matrix. In between, the distance reaches its minimum at λ'_{opt} = 5. The values of the pair coefficients are listed in Table 3.

Using the combined SMM matrix + pair coefficient method leads to an improvement in prediction quality over the plain SMM matrix on all test sets (Table 2). No other general method even reaches the plain SMM matrix predictions on any of the test sets. There are three main advantages of the novel matrix + pair coefficients approach. First, the pair coefficients are determined by systematic investigation of differences between the matrix predictions and the experimental values. As the matrix method is highly accurate, it is a better starting point than trying to determine both position contributions and position interactions all at once. Second, the interactions under investigation are limited to those with a sufficient amount of consistent training data. Third, a regularization parameter is again used (λ' in equation 7), which prevents the pair coefficients from overfitting the data. Its importance can be seen clearly in Figure 9: without it (λ'=0), the pair coefficients reduce prediction quality below that of the matrix alone.

**Table 3**: Values of the retained pair coefficients. Each entry lists the two (position, amino acid) pairs and the coefficient value.

| Pair | Value | Pair | Value | Pair | Value |
|---|---|---|---|---|---|
| 5S, 9L | -0.81 | 2L, 4P | -0.16 | 2V, 6A | 0.18 |
| 2V, 5A | -0.52 | 4E, 9L | -0.15 | 8P, 9L | 0.19 |
| 2L, 4Q | -0.43 | 2L, 8L | -0.14 | 7N, 9V | 0.20 |
| 3S, 9L | -0.42 | 1A, 7A | -0.13 | 8S, 9L | 0.20 |
| 8H, 9L | -0.41 | 1A, 9V | -0.13 | 2L, 5E | 0.20 |
| 2L, 6F | -0.41 | 4P, 9L | -0.13 | 2L, 7T | 0.21 |
| 6A, 9V | -0.39 | 2L, 7P | -0.12 | 4S, 9V | 0.21 |
| 8I, 9L | -0.36 | 8A, 9V | -0.11 | 1E, 2L | 0.22 |
| 2L, 3A | -0.36 | 5A, 9V | -0.11 | 5I, 9L | 0.22 |
| 2V, 7G | -0.36 | 6V, 9L | -0.11 | 5V, 9V | 0.23 |
| 1G, 2L | -0.34 | 4A, 9V | -0.10 | 3A, 7A | 0.24 |
| 6L, 9V | -0.34 | 4K, 9V | -0.10 | 2L, 7G | 0.24 |
| 2L, 4C | -0.33 | 6S, 9L | -0.09 | 6Q, 9L | 0.25 |
| 5T, 9V | -0.32 | 2L, 4K | -0.06 | 6P, 9V | 0.26 |
| 2L, 6Y | -0.32 | 3A, 9L | -0.06 | 5E, 9L | 0.28 |
| 1A, 2L | -0.31 | 1A, 6A | -0.06 | 2L, 3S | 0.28 |
| 2L, 3L | -0.31 | 5G, 9V | -0.01 | 3A, 5A | 0.28 |
| 2L, 5I | -0.31 | 5A, 7A | 0.03 | 2L, 4G | 0.29 |
| 2L, 3Y | -0.29 | 5A, 8A | 0.04 | 2V, 3R | 0.30 |
| 2V, 5V | -0.27 | 2V, 9L | 0.04 | 7S, 9V | 0.30 |
| 2L, 4S | -0.26 | 3V, 9V | 0.07 | 5Q, 9L | 0.32 |
| 4K, 9L | -0.26 | 2L, 6H | 0.09 | 2L, 7A | 0.33 |
| 7L, 9L | -0.26 | 7A, 8A | 0.10 | 2L, 5V | 0.33 |
| 6P, 9L | -0.25 | 5L, 9A | 0.10 | 2V, 3G | 0.34 |
| 1V, 9L | -0.25 | 1I, 9V | 0.11 | 1D, 2L | 0.35 |
| 2L, 6V | -0.25 | 2L, 5R | 0.12 | 6T, 9L | 0.35 |
| 1Q, 9L | -0.25 | 4K, 5A | 0.13 | 3L, 9V | 0.38 |
| 8V, 9L | -0.24 | 2L, 4E | 0.13 | 6F, 9L | 0.40 |
| 2L, 7S | -0.24 | 2L, 6G | 0.13 | 4L, 9V | 0.40 |
| 7A, 9V | -0.22 | 7T, 9L | 0.14 | 1A, 3A | 0.41 |
| 7P, 9L | -0.22 | 3A, 6A | 0.14 | 1P, 2L | 0.41 |
| 2L, 8P | -0.21 | 3A, 4K | 0.14 | 1C, 9L | 0.41 |
| 1L, 9V | -0.21 | 2V, 4S | 0.15 | 2L, 6L | 0.41 |
| 2I, 9V | -0.20 | 7V, 9V | 0.15 | 2L, 7R | 0.42 |
| 6A, 7A | -0.20 | 1D, 9L | 0.15 | 2L, 5S | 0.46 |
| 2L, 6I | -0.18 | 7G, 9L | 0.16 | 6I, 9L | 0.47 |
| 8G, 9L | -0.18 | 1L, 9L | 0.16 | 6L, 7L | 0.52 |
| 2V, 6P | -0.17 | 2V, 5L | 0.17 | 4Q, 9L | 0.56 |
| 1T, 2V | -0.17 | 2L, 3P | 0.17 | 8D, 9L | 0.67 |
| 7A, 9L | -0.17 | 2L, 8G | 0.18 | 5S, 9V | 0.74 |
| 2L, 6A | -0.17 | 1C, 2L | 0.18 | | |
| 2L, 4T | -0.17 | 1A, 8A | 0.18 | | |

The new scoring matrix + pair coefficients approach should work over a large range of training set sizes. If little data is available, few or no pair coefficients will meet the criteria for inclusion, and the method is reduced to the SMM matrix. With more training data available, more pair coefficients are included, thus adjusting the complexity of the method to the available training data.

Another advantage of the pair coefficients over methods like ANNs is that the extracted rules for binding are easy to interpret. The values determined for the pair coefficients provide direct information about the MHC-I-peptide binding mechanism. Since peptides bind in an extended conformation, one would expect the absolute values of the pair coefficients to be lower if their associated amino acid positions are farther apart. In Figure 10, two quantities that reflect the influence of position distance are plotted: the percentage of pair coefficients discarded due to conflicting information and the average absolute value of the retained pair coefficients at each distance. Distances 6 and 7 have the highest percentages of discarded coefficients, and distance 7 has the lowest average value of retained coefficients, indicating weak or no interactions between positions at such distances. To a lesser extent, the levels of interaction at distances 2, 5, 6 and 8 are also weaker than those at distances 1, 3 and 4. This agrees with the expected trend of stronger interactions for closer positions, but to a much smaller degree than expected. Also, this study does not confirm that (i, i+2) neighbors influence each other more strongly than (i, i+1) neighbors, which was expected because the side chains of next-neighbor amino acids face in the same direction, allowing for direct interactions. Taken together, this shows that interactions are not limited to amino acids in direct contact, but can also play a role over longer distances, probably through the conformation of the peptide backbone.

While the presented results show that interactions between peptide positions do exist, the values of the pair coefficients are roughly an order of magnitude lower than the entries of the scoring matrix. This could have two reasons: the values of these coefficients are indeed small, or the noise level in the datasets is high, so that the coefficients are forced to low values by the regularization parameter λ'. Since the plain SMM already performs very well, it is plausible that the interactions themselves are indeed small.

A novel sequence based algorithm was introduced that predicts the affinity of peptides to MHC-I molecules. Its basis is a matrix based prediction (SMM) where the entries describe the contributions of individual residues in a peptide to binding. Determining the matrix entries by minimizing the distance between predicted scores and measured IC50 values for a set of training peptides leads to significantly better predictions on three independent test sets than were obtained for a number of previously published methods.

The SMM matrix was combined with pair coefficients describing interactions between peptide positions, which further improve prediction quality. The pair-coefficient values for the first time quantify the influence of these interactions on peptide binding. Compared to the values of the matrix entries, they are rather small, which explains why good predictions are possible without taking peptide interactions into account at all. The distribution of the coefficient values also shows that interactions in a peptide are not limited to residues in direct contact, but can also play a role over longer distances, probably through the conformation of the peptide backbone.