Binding to MHCI is the most selective requirement for a peptide to become an epitope through the MHCI pathway. This has directed most experimental and theoretical work aimed at MHCI epitope prediction towards this step. In this chapter, a new method to predict the affinity of a peptide to an MHCI molecule is introduced and compared to existing prediction methods. Special attention is paid to the independent binding assumption, which states that each residue within a peptide contributes independently to the overall binding of the peptide. Several published prediction methods make this assumption, while others claim it to be invalid because it neglects interactions between peptide positions. Here, the violation of the independent binding assumption is quantified for the first time.
This chapter is organized as follows: The first section (2.1) gives an overview of existing prediction algorithms, followed by a description of the available datasets consisting of peptides with measured affinities, which are needed for training and testing of the algorithms (section 2.2). Section 2.3 explains how predictions from existing algorithms were obtained. This is followed by the introduction of the novel SMM prediction method (section 2.4), and a description of the means to compare the prediction quality of different methods (section 2.5). In section 2.6, all algorithms making the independent binding assumption are compared, including the novel SMM method. This is followed by a comparison of general prediction methods not making that assumption in section 2.7. In section 2.8, the SMM prediction method is extended to include pair coefficients describing the pairwise interactions between peptide positions, thus dropping the independent binding assumption. Finally, the distribution of these quantified pairwise interactions between peptide positions is analyzed (section 2.9).
Most of the results presented in this chapter are taken from (Peters, et al., 2003).
When the first peptides binding to MHCI molecules were sequenced (Rotzschke, et al., 1990), it soon became clear that each MHCI molecule has its own narrow binding preference. Only peptides of a certain length were found, typically 8-10 amino acids long. Some positions in the binding peptides, called anchor positions, could only be occupied by a few different amino acids. Apparently, peptides interact more strongly with the MHCI molecule at these anchor positions, limiting the number of amino acids tolerated there. Using pool sequencing of peptides eluted from one type of MHCI allele (Falk, et al., 1991), it was possible to characterize which amino acids are allowed at the anchor positions. Demanding compliance with this anchor motif was the first method to predict which peptides are potential MHCI binders (Rotzschke, et al., 1991).
With more data available, it became apparent that there are several peptides binding to MHCI that do not comply with the anchor motif or, more often, that peptides containing the motif did not bind (Jameson and Bevan, 1992). This led to extensions of the motif method, listing more and more good, bad or tolerated residues for binding at the different peptide positions. One deficiency of such a listing is that it does not say whether a good residue at one position can compensate for a bad residue somewhere else. To achieve this, scores were assigned to each amino acid at each position. These can be summed up for a given peptide, and the total score predicts whether the peptide is likely to bind. This is called a matrix based prediction, because the scores for each peptide position and amino acid can be written in the form of a matrix like the one shown in Table 1.
The first scoring matrix for MHCI binding was introduced by (Parker, et al., 1994); it is in the following called the BIMAS method. The experimental basis was a set of peptides with measured dissociation half-lives for the HLA-A0201 allele. The matrix coefficients were determined mathematically by minimizing the distance between the predicted matrix scores for the peptides and the logarithms of their measured half-lives. This makes the matrix score a prediction of a peptide's dissociation half-life, and therefore of its strength of binding.
A major contribution to the prediction of epitopes was the establishment of the SYFPEITHI database (Rammensee, et al., 1999; Rammensee, et al., 1995). This database is a collection of sequences of naturally processed epitopes and peptides known to bind to MHCI molecules, which serves as the basis for a scoring algorithm. To determine a scoring matrix for a given allele, the frequencies of amino acids in peptides binding to that allele are used to manually determine scores for each peptide position. There is no experimental interpretation of the SYFPEITHI score; it simply reflects the agreement of a peptide's sequence with that of previously known binders and epitopes. As the sequences of naturally processed epitopes also reflect the selectivity of other components in the antigen processing pathway, these matrices do not describe the pure binding specificity of an MHCI allele, but it is assumed that the influence of other pathway agents is small compared to the selectivity of MHCI binding.
Another matrix based prediction approach is called the polynomial method (PM) (Gulukota, et al., 1997). It has the advantage of being very easy to calculate from a set of peptides with known affinity values. Each matrix entry corresponding to a particular amino acid type at a particular peptide position is calculated as the mean affinity of all peptides containing that amino acid type at that position.
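The PM calculation is simple enough to sketch in a few lines of Python. The function names and the dictionary-of-tuples data layout below are illustrative assumptions, not taken from the original publication:

```python
from collections import defaultdict

def pm_matrix(peptides, affinities):
    """Polynomial-method scoring matrix (sketch): each entry is the
    mean affinity of all training peptides carrying a given amino
    acid at a given position."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pep, aff in zip(peptides, affinities):
        for i, aa in enumerate(pep):
            sums[(i, aa)] += aff
            counts[(i, aa)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def pm_score(matrix, peptide):
    # total score: sum of per-position mean affinities; a position/amino
    # acid combination absent from the training data contributes nothing
    # in this simplified sketch
    return sum(matrix.get((i, aa), 0.0) for i, aa in enumerate(peptide))
```

Because each entry is a plain average, the matrix can be computed in a single pass over the training data, which is what makes the method "very easy to calculate".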
A matrix based prediction necessarily assumes that the contribution to binding of each residue in a peptide is independent of the other residues in the peptide. This is a daring assumption, known to be false for many other biological problems. However, matrix based algorithms work surprisingly well for MHCI affinity predictions: scoring matrices are usually calculated with little experimental data (typically hundreds of peptides) compared to the size of sequence space (for 9mers: 20^{9}), and still give good predictions. This shows that the relationship between sequence and affinity can at least roughly be approximated by the independent binding assumption. The assumption is also supported by structural data showing that peptides bind in an extended conformation in the MHCI groove, so that contacts between residues of the bound peptide are limited.
Even if independent binding is a justified approximation, a method without this restriction should be able to make better predictions if interactions between peptide positions play a role at all. In the following, such methods are called general methods, in contrast to those making the independent binding assumption. The first general methods used to predict binding to MHCI were artificial neural networks (ANN) (Gulukota, et al., 1997; Milik, et al., 1998). ANNs are a biologically motivated approach to machine learning. An ANN consists of several layers of interconnected ‘neurons’, each of which transforms its input into an output according to some mathematical function. The free parameters of this function are determined by learning from a set of data for which the correct output belonging to an input is known (see (Agatonovic-Kustrin and Beresford, 2000) for a review of ANN applications in the biomedical sciences).
Another general prediction method are Classification and Regression Trees (CART). These are described in detail in (Breiman, et al., 1984), and were first used by (Segal, et al., 2001) to predict MHCI binding. Briefly, a classification tree is built by introducing splits in a set of peptides with known binding capability according to which amino acid is at a certain position of a peptide. The splitting is repeated, leading to a tree-shaped classification scheme (e.g. Figure 8). Each split is chosen so that it maximizes the homogeneity of the peptides in both daughter nodes. A perfectly homogeneous node contains only binders or only nonbinders. A very large tree is built first, in which all nodes are perfectly homogeneous. It is then pruned back to an optimal size determined by cross-validation. Each terminal node in the optimal tree is assigned a binding score, computed as the percentage of nonbinders the node contains.
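The splitting step can be illustrated with a toy implementation. The sketch below uses Gini impurity as the homogeneity measure; the original CART software may use a different criterion, and all names here are illustrative:

```python
def gini(labels):
    """Gini impurity of a node: 0 for a perfectly homogeneous node
    (only binders or only nonbinders), maximal for a 50/50 mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of binders (labels are 0/1)
    return 2 * p * (1 - p)

def best_split(peptides, binds):
    """Toy version of one CART splitting step: pick the (position,
    amino acid) split that minimizes the size-weighted impurity of
    the two daughter nodes."""
    n, best = len(binds), None
    for pos in range(len(peptides[0])):
        for aa in set(p[pos] for p in peptides):
            left = [b for p, b in zip(peptides, binds) if p[pos] == aa]
            right = [b for p, b in zip(peptides, binds) if p[pos] != aa]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or cost < best[0]:
                best = (cost, pos, aa)
    return best  # (weighted impurity, position, amino acid)
```

Growing the full tree repeats this step recursively on each daughter node until all nodes are homogeneous, after which the tree is pruned back.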
The third general prediction method used in this chapter is the additive method (Doytchinova, et al., 2002). It consists of a scoring matrix plus coefficients describing pairwise interactions between amino acids. The score of a peptide is calculated by adding those coefficients belonging to pairs of amino acids present in that peptide to its matrix score, as described below in equation (6). All interactions between neighboring and next-neighbor amino acids are considered. The matrix and interaction coefficients are determined by a partial least squares fit to a set of peptides with known affinities.
The prediction methods used in this chapter were chosen because they were freely accessible or reasonably easy to reproduce. Other classes of prediction methods based on binding data which are not treated here are Hidden Markov Models (Mamitsuka, 1998) and Support Vector Machines (Donnes and Elofsson, 2002). There are also prediction methods based on structural data of solved peptide-MHCI complexes. These use threading or molecular modeling techniques to identify potential binders (Altuvia, et al., 1995; Schueler-Furman, et al., 2000).
Four nonoverlapping sets of 9mer peptides with known affinities to the HLA-A0201 allele are used in this chapter. All datasets are separated into 'binders' and 'nonbinders'. The cutoff for making this separation is not critical as long as all the binders have a higher affinity than any of the nonbinders. All sets have a similar sequence composition: at the anchor positions 2 and 9, the amino acids A, I, L, M, T and V are overrepresented; at other positions, all 20 amino acids are roughly equally represented. For the SYFPEITHI-set, this reflects the occurrence of amino acids at these positions in naturally processed epitopes. For the in vitro datasets, where the choice of peptides is made by an experimentalist, this bias is also found, because experimentalists want to include as many binders in their datasets as possible. While it would be better, from a mathematical viewpoint, to have binding information for randomly selected peptides, this would require far more measurements, as only very few random peptides bind. As the described bias is present in all learning and test sets, no particular prediction method is favored.
Trainset: This set consists of 533 peptides with IC50 values measuring their binding affinity to the HLA-A0201 molecule, as described in (Gulukota, et al., 1997). IC50 values are measured in an assay in which the peptide of interest and a reference peptide compete for binding to HLA-A0201. The IC50 value of a peptide is the concentration at which it suppresses binding of the reference peptide to MHCI by 50%. The logarithm of the IC50 value can be interpreted as the difference in binding free energy between the peptide and the reference.
Several peptides in the dataset are ‘heavy nonbinders’, i.e. their IC50 value is too large to be measured. Using an IC50 cutoff of 500 nM, the Trainset is split into 127 binders and 406 nonbinders, which of course include the 'heavy nonbinders'. The cutoff lies within the intermediate to low affinity range; peptides in this set with IC50 < 50 nM are considered to be high-affinity binders.
Blindset: This set of 173 peptides with IC50 values was measured with the same experimental setup as the Trainset. Using the IC50 cutoff of 500 nM, the set is split into 67 binders and 108 nonbinders.
BIMAS-set: This set of peptides was published by (Parker, et al., 1994). For 134 peptides, β2-microglobulin dissociation half-lives have been measured. Four of the peptides overlap with the Trainset. By interpolating between their known IC50 and half-life values, the IC50 = 500 nM cutoff in the Trainset is translated into a half-life cutoff of approximately 650 minutes. Using this cutoff and excluding the overlapping peptides, the BIMAS-set has 25 binders and 105 nonbinders.
SYFPEITHI-set: 143 peptides binding to HLA-A0201 were taken from the SYFPEITHI database, most of which are naturally processed epitopes. All of these were classified as binders, even though no measured affinity is available for them. To have nonbinders in this set that definitely have lower affinities than these binders, the 59 heavy nonbinders from the Blindset were included.
The BIMAS and SYFPEITHI predictions were obtained from web servers (BIMAS: http://bimas.dcrt.nih.gov/molbio/hla_bind/, SYFPEITHI: http://www.unituebingen.de/uni/kxi). The training data of these methods contained the identically named test sets described above.
The PM, CART and ANN predictions were trained using the Trainset described above. For the CART method, two commercial software packages (SPSS and CART) were used, leading to an identical optimal classification tree shown in Figure 8. Switching from a classification tree used here, which needs binary experimental binding data for training, to a regression tree, which uses quantitative IC50 values, improves the prediction performance only slightly.
The ANN was designed as described in (Gulukota, et al., 1997): A feedforward neural network with three layers was built consisting of an input layer with 180 neurons, a hidden layer with 50 neurons and an output layer with one neuron. The Aspirin/MIGRAINES software package from the MITRE Corporation (http://www.emsl.pnl.gov:2080) was used to simulate the network.
For the Additive Method, the coefficient values determined in (Doytchinova, et al., 2002) were used. The training data used in that paper to determine the coefficient values was a subset of the Trainset described above.
Scoring matrices quantify the contributions of individual residues in a peptide to binding. The matrix element s_{i,a} corresponds to amino acid a at position i of the peptide. The total score S_{k} for a given peptide k with the amino acids a_{k}(i) at positions i is then given by the summation:

S_{k} = s_{0} + Σ_{i} s_{i,a_{k}(i)}    (1)

where s_{0} is a constant offset.
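As a minimal sketch, equation (1) translates directly into code. The dictionary-of-tuples layout for the matrix is an illustrative assumption, not the original implementation:

```python
def matrix_score(smm, offset, peptide):
    """Equation (1): sum the matrix entry s_{i,a} for the amino acid
    a found at each position i of the peptide, plus the constant
    offset s_0. `smm` maps (position, amino_acid) -> score."""
    return offset + sum(smm[(i, aa)] for i, aa in enumerate(peptide))
```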
In this section, the novel stabilized matrix method (SMM) is developed. It determines the values for s_{i,a} and s_{0} by minimizing the distance between predicted scores S_{k} and measured affinities for the peptides in the Trainset:

Φ = Σ_{k} || S_{k} − ln(IC50_{k}) ||    (2)

For a peptide with a measurable IC50 value, the norm in equation (2) has the form:

|| S_{k} − ln(IC50_{k}) || = (S_{k} − ln(IC50_{k}))^{2}    (3)
Several peptides in the Trainset have affinities too low to yield a measurable IC50 value. For these ‘heavy nonbinders’, the IC50 values are only known to be equal to or greater than the largest experimentally measurable value, which for the Trainset corresponds to cutoff = ln(50,000). Accordingly, for these peptides the norm in (2) is set to be

|| S_{k} − ln(IC50_{k}) || = (S_{k} − cutoff)^{2} if S_{k} < cutoff, and 0 if S_{k} ≥ cutoff    (4)
To avoid overfitting, a second term is added to the minimization function in (2):

Φ_{λ} = Φ + λ Σ_{i,a} s_{i,a}^{2}    (5)

By minimizing with a nonzero λ value, a tradeoff is introduced between optimally reproducing the experimental IC50 values (including their inevitable experimental error) and keeping the parameters s_{i,a} small. This forces towards zero all parameters s_{i,a} that do not significantly lower the distance Φ. The optimal value for λ is determined by 10-fold cross-validation on the Trainset, i.e. by splitting the total set of peptides into 10 subsets, establishing a scoring matrix using 9 of these subsets and making predictions with that matrix for the left-out subset. Figure 5 depicts the sum of the distances between these predictions and the experimental values as a function of λ.
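For the case of measurable IC50 values, the minimization of equation (5) is equivalent to ridge regression on a one-hot peptide encoding, and the cross-validated choice of λ can be sketched as follows. This is a simplification: the published SMM used an iterative nonlinear optimizer and the special norm of equation (4) for heavy nonbinders, which are omitted here; all names are illustrative:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptides):
    # one-hot encoding of 9mers: 9 positions x 20 amino acids = 180 columns
    X = np.zeros((len(peptides), 9 * len(AA)))
    for k, pep in enumerate(peptides):
        for i, aa in enumerate(pep):
            X[k, i * len(AA) + AA.index(aa)] = 1.0
    return X

def smm_fit(X, y, lam):
    # minimize ||X s - y||^2 + lam * ||s||^2 in closed form (ridge
    # regression); y holds the log-transformed IC50 values
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_distance(X, y, lam, folds=10):
    # cross-validated distance: fit on 9 subsets, predict the left-out
    # subset, sum the squared errors (this curve is used to pick lambda)
    idx = np.arange(len(y))
    total = 0.0
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        s = smm_fit(X[train], y[train], lam)
        total += float(np.sum((X[test] @ s - y[test]) ** 2))
    return total
```

Scanning `cv_distance` over a grid of λ values reproduces the shape of Figure 5: the cross-validated distance rises both for λ too small (overfitting) and for λ too large (all coefficients forced to zero).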
Figure 5: Cross-validated distance as a function of λ
The optimal predictions were made with λ_{opt}=1. The resulting SMM scoring matrix is shown in Table 1. The lower a score, the better an amino acid is suited for binding at the given position.
A similar mathematical concept is used to solve 'inverse problems', where λ is called the regularization parameter. A short introduction to inverse problems is given in (Press, et al., 1992), chapter 18. To minimize equation (5), a commercial nonlinear optimizer (Frontline Systems, 1999) using a generalized reduced gradient method is applied.
Table 1: SMM scoring matrix for the HLA-A0201 allele (9mer peptides)

        Pos 1  Pos 2  Pos 3  Pos 4  Pos 5  Pos 6  Pos 7  Pos 8  Pos 9
A        0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
R        0.58   0.74   3.17   0.13   0.78   1.30   1.35   0.33   2.76
D        4.90   0.74   1.46   1.50   0.15   0.14   1.50   1.06   2.76
N        1.47   0.74   2.03   1.48   0.36   1.78   0.51   0.76   2.76
C        0.93   0.74   0.96   0.96   0.82   1.15   0.56   0.57   2.76
E        3.46   0.74   2.79   1.24   0.40   0.74   0.17   0.39   2.76
Q        1.63   0.74   1.24   0.34   0.41   0.52   0.61   0.36   2.76
G        0.50   0.74   1.08   0.83   0.64   0.60   1.22   0.54   2.76
H        1.33   0.74   0.90   1.44   0.39   0.60   0.43   0.06   2.76
I        0.58   0.42   1.54   1.67   0.78   1.40   0.29   0.59   0.50
L        0.29   3.06   0.38   0.17   0.03   1.48   0.01   0.26   0.50
K        0.20   0.74   2.25   0.22   0.05   1.43   1.73   1.20   2.76
M        1.46   2.72   1.02   0.36   0.41   1.43   1.08   0.58   0.91
F        1.24   0.74   0.13   0.45   1.84   1.48   1.64   1.10   2.76
P        3.63   0.74   1.12   0.67   0.93   0.90   1.60   0.66   2.76
S        0.24   0.74   1.05   0.25   0.26   0.26   0.18   0.08   2.76
T        0.82   0.17   1.70   0.63   0.75   2.20   0.17   0.00   1.06
W        0.38   0.97   0.03   2.38   1.49   1.07   1.67   0.02   2.76
Y        1.91   0.74   0.89   0.12   1.74   1.37   2.34   0.77   2.76
V        0.09   0.31   1.28   1.15   0.44   2.13   0.03   1.04   0.81
Offset  10.14
In this chapter, the predictions of diverse methods on equally diverse datasets have to be compared. To do this fairly, ROC curves are used (Bradley, 1997): for a given cutoff value, which separates peptides by their predicted score into potential binders and nonbinders, the two variables sensitivity (true positives / total positives) and 1 − specificity (false positives / total negatives = false alarm rate) are calculated. By systematically varying the cutoff value from the lowest to the highest predicted score, a ROC curve like the one in Figure 6 is plotted. Prediction quality is measured by the area under the curve (AUC), which is 0.5 for random predictions and 1.0 for perfect predictions. The AUC is equivalent to the probability that the score of a randomly chosen binder is higher than that of a randomly chosen nonbinder. This measure has the advantage of not relying on a single arbitrarily chosen cutoff value for the prediction score, and can be applied equally to all datasets and prediction methods.
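The probabilistic interpretation of the AUC makes it straightforward to compute without explicitly tracing the ROC curve. The sketch below uses that rank-based form (ties counted as 1/2); the names are illustrative:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen binder scores
    higher than a randomly chosen nonbinder (ties count as 1/2).
    Equivalent to the area under the ROC curve described in the text."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Note that this form assumes higher scores indicate binders; for matrices such as the SMM matrix, where lower scores indicate better binding, the sign of the scores must be flipped first.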
To assess whether one prediction is significantly better than another, the set of peptides for which predictions are made was resampled. Using bootstrapping with replacement, 50 new datasets were generated with a constant ratio of positives to negatives. The difference in AUC between the two predictions is then calculated on each new dataset. One prediction is significantly better than another if the distribution of the differences in AUC values lies significantly above zero, which is tested using a standard t-test with a p-value of 0.001.
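The resampling step can be sketched as follows; a one-sample t-test on the returned differences then decides significance. This is an illustrative reconstruction, not the original analysis code, and the rank-based AUC helper is included to keep the sketch self-contained:

```python
import random

def _auc(pos, neg):
    # rank-based AUC: probability that a binder outscores a nonbinder,
    # with ties counted as 1/2
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diffs(pos_a, neg_a, pos_b, neg_b, n_sets=50, seed=1):
    """Resample the test peptides with replacement, keeping the ratio
    of positives to negatives fixed, and collect the AUC difference
    between methods A and B on each resampled set. Both methods score
    the same peptides, so the same resampled indices are applied to
    both score lists."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sets):
        pi = [rng.randrange(len(pos_a)) for _ in pos_a]
        ni = [rng.randrange(len(neg_a)) for _ in neg_a]
        a = _auc([pos_a[i] for i in pi], [neg_a[i] for i in ni])
        b = _auc([pos_b[i] for i in pi], [neg_b[i] for i in ni])
        diffs.append(a - b)
    return diffs
```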
The four matrix based methods had to predict which of the peptides in the Blindset are binders. The Blindset is the only set truly ‘blind’ to all methods, i.e. none of the peptides in this set were included in the training data of any of the prediction methods. Figure 6 depicts ROC curves for all predictions. It indicates that the performance ranks in the order SMM > BIMAS > PM over almost the entire range. SYFPEITHI is the worst method for sensitivities above 0.42 and becomes the best for specificities above 0.97. This is due to the fact that SYFPEITHI predictions reflect other components of the antigen presentation pathway in addition to MHCI binding, leading to a decrease in sensitivity when predicting binding alone. The area under the ROC curve (AUC) gives a single number describing the prediction accuracy of a method. Figure 6 translates to AUC values of 0.869 for SMM, 0.846 for BIMAS, 0.795 for PM and 0.745 for SYFPEITHI. The AUC values for the predictions of all methods on the other test sets (BIMAS-set and SYFPEITHI-set) are listed in Table 2. For both of these sets, SMM makes the best predictions of all truly blind methods.
Table 2: AUC values for all prediction methods on the three test sets. Values in parentheses indicate that the test set was contained in the training data of the method.

Prediction Method   Independent Binding Assumption   AUC on Test Set
                                                     Blind    SYFPEITHI   BIMAS
SMM, λ=1            Yes                              0.869    0.848       0.866
SMM, λ=0            Yes                              0.856    0.846       0.865
BIMAS               Yes                              0.846    0.829       (0.875)
PM                  Yes                              0.795    0.792       0.757
SYFPEITHI           Yes                              0.745    (0.865)     0.754
SMM + pair coef.    No                               0.873    0.852       0.869
ANN                 No                               0.796    0.788       0.762
Additive method     No                               0.820    0.770       0.830
CART                No                               0.708    0.620       0.539
Two elements make the SMM approach different from other matrix based methods. First, it incorporates the experimental information on heavy nonbinders correctly into the distance defined in equation (2). Since only the lower bound of the IC50 values for heavy nonbinders can be determined, previous approaches have either left them out entirely or tried to fit them exactly to the lower bound. Second, the regularization technique was used. With errors in the experimental measurements, there can be multiple sets of matrix coefficients that reproduce the experimental data within the range of error. Choosing the set of coefficients that gives the minimum distance can mean overfitting the problem. By incorporating a regularization parameter (λ in equation 5), a set of coefficients is chosen that reproduces the experimental results reasonably well while keeping the parameter values small. This effectively prevents overfitting. Table 2 shows that the AUC values at λ_{opt}=1 are better than those at λ=0 for all test sets.
Figure 7 shows ROC curves for the general prediction methods (those not making the independent binding assumption) on the Blindset. The corresponding AUC values are also listed in Table 2. The ANN and the additive method make consistently better predictions than the CART tree. None of these previously published general methods reaches the prediction quality of SMM or BIMAS, which make the independent binding assumption.
Figure 7: ROC curves for general prediction methods on the HLAA2 Blindset.  
The Figure contains ROC curves described in the legend of Figure 6. Since there are only three terminal nodes in the CART tree, corresponding to three different scores, its ROC curve consists of only two nontrivial points. 
At first glance, this is surprising, as the general methods should be able to describe all binding mechanisms, including the simple case of independent binding. Why does the more restrictive matrix approach perform better? This can best be seen for the CART algorithm. As shown in Figure 8A, CART suggests splitting node 3, but not node 2. If this exactly described reality, peptides in node 3 would bind differently than peptides in node 2, signifying an interaction between positions 2 and 1 only for peptides with an L or M at position 2. In contrast, if the independent binding assumption is true and there are no interactions between positions 2 and 1, the split described for node 3 should also be applicable to node 2. Testing this on the Blindset (Figure 8B) shows that transferring the split actually works, as the new nodes resulting from the transferred split are more homogeneous than node 2. The CART algorithm cannot identify this split, because it can only use the peptides in node 2 to decide about splits at node 2, and there are only 11 binders left in that node. This shows that general methods simply require more training data to achieve the prediction quality of matrix based methods if the independent binding assumption holds to a high degree.
In the case of the additive method and the ANN, lack of data has led to overfitting. The additive method has 1850 free parameters; the ANN architecture taken from (Gulukota, et al., 1997) has more than 9000 free parameters (weights). This number of parameters cannot be determined reliably from the Trainset, which contains only 533 experimental data points. For the ANN, choosing a different architecture with fewer free parameters would probably have improved its prediction quality. For the additive method, the overfitting seems to affect mainly the interaction coefficients. When these coefficients, which describe interactions between neighboring amino acids, are neglected and only those compatible with the independent binding assumption are kept, the prediction quality improves.
The results of the previous section show that, given the limited data, general methods tend to perform worse than matrix based ones. In this section, a general method that uses the matrix predictions as a starting point is developed. This is done by quantifying the contribution to binding of interactions between peptide positions with pair coefficients s'_{i,a,i',a'}. For example, coefficient s'_{3A,7L} (**A***L**) describes the difference in binding between the following two scenarios: (1) an Alanine at the third AND a Leucine at the seventh position of the peptide, and (2) the sum of the average contributions of having an Alanine at the third position, described by matrix value s_{3A} (**A******), and having a Leucine at the seventh position, described by s_{7L} (******L**). The total score for a given peptide k with the amino acids a_{k}(i) at positions i is then given by

S'_{k} = S_{k} + Σ_{i<i'} s'_{i,a_{k}(i),i',a_{k}(i')}    (6)

where S_{k} is the matrix score defined in equation (1), and the sum includes all pairs of amino acids found in the peptide. For 9mer peptides, this would result in 20*20*36 = 14400 different pair coefficients. Since it is impossible to determine that many coefficients given the limited data, a two-stage selection process is applied: in the first step, all coefficients for which fewer than N_{min}=10 peptides exist in the Trainset are eliminated, as they lack sufficient experimental information. The optimal values for the remaining 269 pair coefficients can then be calculated by minimizing

Φ'_{λ'} = Φ' + λ' Σ_{i,a,i',a'} s'_{i,a,i',a'}^{2}    (7)

where Φ' is the same as Φ in equation (2), except that scores S'_{k} are used instead of S_{k}. When optimizing the pair coefficients s'_{i,a,i',a'} in equation (7), the matrix coefficients s_{i,a} are frozen at their optimal values determined from equation (5).
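The scoring of equation (6) can be sketched as follows; the dictionary layouts are illustrative assumptions, and pairs that were discarded (or never fitted) simply default to zero:

```python
def pair_score(smm, offset, pair_coef, peptide):
    """Equation (6): the plain matrix score of equation (1) plus the
    pair coefficients s'_{i,a,i',a'} for every pair of positions in
    the peptide. `pair_coef` maps (i, aa_i, j, aa_j) with i < j to a
    coefficient; missing pairs contribute zero."""
    s = offset + sum(smm[(i, aa)] for i, aa in enumerate(peptide))
    for i in range(len(peptide)):
        for j in range(i + 1, len(peptide)):
            s += pair_coef.get((i, peptide[i], j, peptide[j]), 0.0)
    return s
```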
In a second selection step, the Trainset is split into 10 equal-sized nonoverlapping subsets, and 10 different optimal values for each pair coefficient are determined by leaving out one subset at a time and minimizing equation (7). If the optimal values determined for a pair coefficient include both positive and negative values, the coefficient cannot be estimated reliably from the Trainset and is therefore discarded. Setting all 145 discarded coefficients to zero, equation (7) is minimized again to determine the values of the remaining 124 pair coefficients.
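The sign-consistency filter of this second selection step amounts to a few lines of code. The sketch below assumes the 10 leave-one-subset-out values for each coefficient have already been collected; all names are illustrative:

```python
def consistent_pairs(coef_sets):
    """Second selection step (sketch): `coef_sets` maps each pair
    coefficient to the 10 optimal values obtained by leaving out one
    subset at a time. A coefficient whose values disagree in sign
    cannot be estimated reliably and is discarded."""
    keep = []
    for key, values in coef_sets.items():
        has_pos = any(v > 0 for v in values)
        has_neg = any(v < 0 for v in values)
        if not (has_pos and has_neg):  # consistent sign -> keep
            keep.append(key)
    return keep
```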
The optimal value for λ' is again determined using cross-validation, similar to λ in equation (5). Figure 9 shows, for various values of λ', the cross-validated distance calculated using (2) compared to that of the plain matrix predictions. For very high values of λ', the pair coefficients are forced to stay around zero, and the prediction accuracy equals that of the plain matrix approach. For λ' = 0, there is no restriction on the values of the pair coefficients, leading to overfitting, and the resulting performance is much worse than that of the plain matrix. In between, there is a maximum in the distance improvement at the optimal value λ'_{opt} = 5. The values of the pair coefficients are listed in Table 3.
Using the combined SMM matrix + pair coefficient method leads to an improvement in prediction quality over the plain SMM matrix on all test sets (Table 2). No other general method even reaches the SMM matrix predictions on any of the test sets. There are three main advantages of the novel matrix + pair coefficients approach. First, the pair coefficients are determined by systematic investigation of the differences between the matrix predictions and the experimental values. As the matrix method is highly accurate, this is a better starting point than trying to determine both position contributions and position interactions all at once. Second, the interactions under investigation are limited to those with a sufficient amount of consistent training data. Third, a regularization parameter is again used (λ' in equation 7), which prevents the pair coefficients from overfitting the data. Its importance can be seen clearly in Figure 9: without it (λ'=0), the pair coefficients reduce prediction quality below that of the matrix alone.

Previously there was one published study comparing several methods to predict peptide binding to MHCI (Yu, et al., 2002). There it was reported that the optimal choice of a prediction method depends on the number of peptides available for training: an ANN was outperformed by scoring matrices when the training data consisted of 234 peptides, while the ANN outperformed scoring matrices when trained on a set of over a thousand peptides. In principle this agrees with the results reported here.
The new scoring matrix + pair coefficients approach should work over a large range of training set sizes. If little data is available, few or no pair coefficients will meet the criteria for inclusion, and the method is reduced to the SMM matrix. With more training data available, more pair coefficients are included, thus adjusting the complexity of the method to the available training data.
Another advantage of the pair coefficients over methods like ANNs is that the extracted rules for binding are easy to interpret. The values determined for the pair coefficients provide direct information about the MHCI-peptide binding mechanism. Since peptides bind in an extended conformation, one would expect the absolute values of the pair coefficients to be lower if their associated amino acid positions are farther apart. In Figure 10, two quantities that reflect the influence of position distance are plotted: the percentage of pair coefficients discarded due to conflicting information, and the average absolute value of the retained pair coefficients at each distance. Distances 6 and 7 have the highest percentages of discarded coefficients, and distance 7 has the lowest average value of retained coefficients, indicating weak or no interactions between positions at such distances. To a lesser extent, the levels of interaction at distances 2, 5, 6 and 8 are also weaker than those at distances 1, 3 and 4. This agrees with the expected trend of stronger interactions for closer positions, but to a much lesser extent than expected. Also, this study does not confirm that (i, i+2) neighbors influence each other more strongly than (i, i+1) neighbors, which was expected because the side chains of next-neighbor amino acids face in the same direction, thus allowing for direct interactions. Taken together, this shows that interactions are not limited to amino acids in direct contact, but can also play a role over longer distances, probably through the conformation of the peptide backbone.
While the presented results show that interactions between peptide positions do exist, the values of the pair coefficients are roughly an order of magnitude lower than the entries of the scoring matrix. This could have two reasons: the true values of these coefficients are indeed small, or the noise level in the datasets is high, so that the coefficients are forced to low values by the regularization parameter λ'. Since the plain SMM performs very well, it is likely that the true values of the pair coefficients are indeed significantly lower than the values of the matrix entries. The low impact of pair interactions compared to the contributions of individual residues explains why methods making the independent binding assumption can make good predictions at all.
A novel sequence based algorithm was introduced that predicts the affinity of peptides to MHCI molecules. Its basis is a matrix based prediction (SMM) where the entries describe the contributions of individual residues in a peptide to binding. Determining the matrix entries by minimizing the distance between predicted scores and measured IC50 values for a set of training peptides leads to significantly better predictions on three independent test sets than were obtained for a number of previously published methods.
The SMM matrix was combined with pair coefficients describing interactions between peptide positions, which further improve prediction quality. The pair coefficient values quantify the influence of these interactions on peptide binding for the first time. Compared to the values of the matrix entries, they are rather small, which explains why good predictions are possible without taking interactions within the peptide into account at all. The distribution of the coefficient values also shows that interactions in a peptide are not limited to residues in direct contact, but can also play a role over longer distances, probably through the conformation of the peptide backbone.