
Peptide binding to MHC-I

Binding to MHC-I is the most selective requirement for a peptide to become an epitope through the MHC-I pathway. This has directed most experimental and theoretical work aimed at MHC-I epitope prediction towards this step. In this chapter, a new method to predict the affinity of a peptide to an MHC-I molecule is introduced and compared to existing prediction methods. Special attention is paid to the independent binding assumption, which states that each residue within a peptide contributes independently to the overall binding of the peptide. Several published prediction methods make this assumption, while others claim it to be invalid because it neglects interactions between peptide positions. Here, the violation of the independent binding assumption is quantified for the first time.

This chapter is organized as follows: The first section (2.1) gives an overview of existing prediction algorithms, followed by a description of the available datasets consisting of peptides with measured affinities, which are needed for training and testing of the algorithms (section 2.2). Section 2.3 explains how predictions from existing algorithms were obtained. This is followed by the introduction of the novel SMM prediction method (section 2.4), and a description of the means to compare the prediction quality of different methods (section 2.5). In section 2.6, all algorithms making the independent binding assumption are compared, including the novel SMM method. This is followed by a comparison of general prediction methods not making that assumption in section 2.7. In section 2.8, the SMM prediction method is extended to include pair coefficients describing the pair-wise interactions between peptide positions, thus dropping the independent binding assumption. Finally, the distribution of these quantified pair-wise interactions between peptide positions is analyzed (section 2.9).

Most of the results presented in this chapter are taken from (Peters, et al., 2003).

2.1  Overview of existing prediction methods

When the first peptides binding to MHC-I molecules were sequenced (Rotzschke, et al., 1990), it soon became clear that each MHC-I molecule has its own narrow binding preference. Only peptides of a certain length were found, typically 8-10 amino acids long. Some positions in the binding peptides, called anchor positions, could only be occupied by a few different amino acids. Apparently, peptides interact more strongly with the MHC-I molecule at these anchor positions, limiting the number of amino acids tolerated there. Using pool sequencing of peptides eluted from one type of MHC-I allele (Falk, et al., 1991), it was possible to characterize which amino acids are allowed at the anchor positions. Demanding compliance with this anchor motif was the first method to predict which peptides are potential MHC-I binders (Rotzschke, et al., 1991).

With more data available, it became apparent that several peptides bind to MHC-I without complying with the anchor motif or, more often, that peptides containing the motif did not bind (Jameson and Bevan, 1992). This led to extensions of the motif method, listing more and more good, bad or tolerated residues for binding at the different peptide positions. One deficiency of such a listing is that it does not say whether a good residue at one position can compensate for a bad residue somewhere else. To achieve this, scores were assigned to each amino acid at each position. These can be summed up for a given peptide, and the total score predicts whether the peptide is likely to bind. This is called a matrix based prediction, because the scores for each peptide position and amino acid can be written in the form of a matrix like the one shown in Table 1.

2.1.1 Matrix prediction methods: BIMAS, SYFPEITHI and PM

The first scoring matrix for MHC-I binding was introduced by (Parker, et al., 1994), which is in the following called the BIMAS method. The experimental basis was a set of peptides with measured half-life dissociation constants for the HLA-A0201 allele. The matrix coefficients were determined mathematically by minimizing the distance between the predicted matrix scores for the peptides and the logarithms of their measured half-life constants. This makes the matrix score a prediction of a peptide's half-life of dissociation, and therefore of its strength of binding.

A major contribution to the prediction of epitopes was the establishment of the SYFPEITHI database (Rammensee, et al., 1999; Rammensee, et al., 1995). This database is a collection of sequences of naturally processed epitopes and peptides known to bind to MHC-I molecules, which serves as the basis for a scoring algorithm. To determine a scoring matrix for a given allele, the frequencies of amino acids in peptides binding to that allele are used to manually determine scores for each peptide position. There is no experimental interpretation of the SYFPEITHI score; it simply reflects the agreement of a peptide's sequence with that of previously known binders and epitopes. As the sequences of naturally processed epitopes also reflect the selectivity of other components in the antigen processing pathway, these matrices do not describe the pure binding specificity of an MHC-I allele, but it is assumed that the influence of other pathway components is small compared to the selectivity of MHC-I binding.

Another matrix based prediction approach is called the polynomial method (PM) (Gulukota, et al., 1997). It has the advantage of being very easy to calculate from a set of peptides with known affinity values. Each matrix entry corresponding to a particular amino acid type at a particular peptide position is calculated as the mean affinity of all peptides containing that amino acid type at that position.
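As a concrete illustration of this calculation, the sketch below computes such a mean-affinity matrix from a list of peptides with affinity values. It is only a minimal re-implementation of the idea described above, not the original PM code; all function and variable names are illustrative.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pm_matrix(peptides, affinities, length=9):
    """Polynomial-method style matrix: each entry is the mean affinity of all
    training peptides carrying a given amino acid at a given position."""
    sums = np.zeros((length, len(AMINO_ACIDS)))
    counts = np.zeros((length, len(AMINO_ACIDS)))
    for pep, aff in zip(peptides, affinities):
        for i, aa in enumerate(pep):
            j = AMINO_ACIDS.index(aa)
            sums[i, j] += aff
            counts[i, j] += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)  # NaN where no data exists

# toy usage with two made-up peptides and log-affinity values
matrix = pm_matrix(["ILKEPVHGV", "LLFGYPVYV"], [2.1, 0.8])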

2.1.2 The independent binding assumption

A matrix based prediction necessarily assumes that the contribution to binding of each residue in a peptide is independent of the other residues in the peptide. This is a daring assumption, known to be false for many other biological problems. However, matrix based algorithms work surprisingly well for MHC-I affinity predictions: Scoring matrices are usually calculated from little experimental data (typically hundreds of peptides) compared to the size of sequence space (for 9-mers: 20^9 ≈ 5×10^11), and still give good predictions. This shows that the relationship between sequence and affinity can at least roughly be approximated under the independent binding assumption. The assumption is also supported by structural data showing that peptides bind in an extended conformation in the MHC-I groove, so that contacts between residues of the bound peptide are limited.

2.1.3 General prediction methods: ANN, CART and the additive method

Even if independent binding is a justified approximation, a method without this restriction should be able to make better predictions if interactions between peptide positions play a role at all. In the following, such methods are called general methods, in contrast to those making the independent binding assumption. The first general methods used to predict binding to MHC-I were artificial neural networks (ANN) (Gulukota, et al., 1997; Milik, et al., 1998). ANNs are a biologically motivated approach to machine learning. An ANN consists of several layers of interconnected ‘neurons’, each of which transfers its input into an output according to some mathematical function. The free parameters of this function are determined by learning from a set of data where the correct output belonging to an input is known (see (Agatonovic-Kustrin and Beresford, 2000) for a review of ANN applications in biomedical sciences).

Another general prediction method is Classification and Regression Trees (CART). These are described in detail in (Breiman, et al., 1984), and were first used by (Segal, et al., 2001) to predict MHC-I binding. Briefly, a classification tree is built by introducing splits in a set of peptides with known binding capability according to which amino acid is at a certain position of a peptide. The splitting is repeated, leading to a tree shaped classification scheme (e.g. Figure 8). Each split is chosen so that it maximizes the homogeneity of the peptides in both daughter nodes. A perfectly homogeneous node contains only binders or only non-binders. A very large tree is built first, so that all terminal nodes are perfectly homogeneous. It is then pruned back to an optimal size determined by cross-validation. Each terminal node in the optimal tree is assigned a binding score, computed as the percentage of non-binders the node contains.
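The following sketch outlines this grow-then-prune procedure in code. It uses scikit-learn's DecisionTreeClassifier with cost-complexity pruning as a stand-in for the commercial CART/SPSS implementations mentioned later in this chapter; the one-hot peptide encoding and all names are assumptions made for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(peptide):
    """Encode a 9-mer as a flat 9 x 20 binary vector (position x amino acid)."""
    x = np.zeros(9 * len(AMINO_ACIDS))
    for i, aa in enumerate(peptide):
        x[i * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return x

def fit_pruned_tree(peptides, is_binder):
    """Grow a large tree, then pick the pruning strength by cross-validation."""
    X = np.array([one_hot(p) for p in peptides])
    y = np.array(is_binder, dtype=int)
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    best_alpha, best_score = 0.0, -np.inf
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        score = cross_val_score(tree, X, y, cv=5).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    return DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)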

The third general prediction method used in this chapter is the additive method (Doytchinova, et al., 2002). It consists of a scoring matrix plus coefficients describing pair-wise interactions between amino acids. The score of a peptide is calculated by adding those coefficients belonging to pairs of amino acids present in that peptide to its matrix score, as described below in equation (6). All interactions between neighboring and next-neighbor amino acids are considered. The matrix and interaction coefficients are determined by a partial least squares fit to a set of peptides with known affinities.

The prediction methods used in this chapter were chosen because they were freely accessible or reasonably easy to reproduce. Other classes of prediction methods based on binding data which are not treated here are Hidden Markov Models (Mamitsuka, 1998) and Support Vector Machines (Donnes and Elofsson, 2002). There are also prediction methods based on structural data of solved peptide-MHC-I complexes. These use threading or molecular modeling techniques to identify potential binders (Altuvia, et al., 1995; Schueler-Furman, et al., 2000).



2.2  Experimental datasets

Four non-overlapping sets of 9-mer peptides with known affinities to the HLA-A0201 allele are used in this chapter. All datasets are separated into 'binders' and 'non-binders'. The cutoff for making this separation is not critical as long as all the binders have a higher affinity than any of the non-binders. All sets have a similar sequence composition: at the anchor positions 2 and 9, the amino acids A, I, L, M, T and V are over-represented; at other positions, all 20 amino acids are roughly equally represented. For the SYFPEITHI-set, this reflects the occurrence of amino acids at these positions in naturally processed epitopes. For the in vitro datasets, where the choice of peptides is made by an experimentalist, this bias is also found, because experimentalists want to include as many binders in their datasets as possible. While it would be better from a mathematical viewpoint to have binding information for randomly selected peptides, this would require far more measurements, as only very few random peptides bind. As the described bias is present in all learning and test sets, no particular prediction method is favored.

Train-set: This set consists of 533 peptides with IC50 values measuring their binding affinity to the HLA-A0201 molecule, as described in (Gulukota, et al., 1997). IC50 values are measured in an assay in which both the peptide of interest and a reference peptide compete for binding to HLA-A0201. The IC50 value of a peptide is the concentration at which it suppresses binding of the reference peptide to MHC-I by 50%. The logarithm of the IC50 value can be interpreted as the difference in binding free energy between a peptide and the reference.
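For reference, this interpretation can be written out explicitly; the following is the standard competitive-binding relation, stated here for clarity rather than quoted from the original text:

\Delta\Delta G = \Delta G_{\text{peptide}} - \Delta G_{\text{reference}} \approx RT \, \ln\!\left( \frac{\mathrm{IC50}_{\text{peptide}}}{\mathrm{IC50}_{\text{reference}}} \right)

where R is the gas constant and T the absolute temperature.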

Several peptides in the dataset are 'heavy non-binders', i.e. their IC50 value is too large to be measured. Using an IC50 cutoff of 500 nM, the Train-set is split into 127 binders and 406 non-binders, the latter of course including the 'heavy non-binders'. The cutoff lies within the intermediate- to low-affinity range; peptides in this set with IC50 < 50 nM are considered to be high-affinity binders.

Blind-set: This set of 173 peptides with IC50 values was measured with the same experimental setup as the Train-set. Using the IC50 cutoff of 500 nM, the set is split into 67 binders and 108 non-binders.



BIMAS-set: This set of peptides was published by (Parker, et al., 1994). For 134 peptides, β2 microglobulin dissociation half-lives have been measured. Four of the peptides overlap with the Train-set. By interpolating between their known IC50 and half-life values, the IC50=500nM cutoff in the Train-set is translated to a half-life cutoff of approximately 650 minutes. Using this cutoff and excluding the overlapping peptides, the BIMAS-set has 25 binders and 105 non-binders.

SYFPEITHI-set: 143 peptides binding to HLA-A0201 were taken from the SYFPEITHI database, most of which are naturally processed epitopes. All of these were classified as binders, even though there is no measured affinity available for them. To have non-binders in this set that definitely have lower affinities than these binders, the 59 heavy non-binders from the Blind-set were included.

2.3  Obtaining predictions from published methods

The BIMAS and SYFPEITHI predictions were obtained from web servers (BIMAS: http://bimas.dcrt.nih.gov/molbio/hla_bind/, SYFPEITHI: http://www.uni-tuebingen.de/uni/kxi). Note that the training data of these methods contained the identically named test sets described above.

The PM, CART and ANN methods were trained using the Train-set described above. For the CART method, two commercial software packages (SPSS and CART) were used, both leading to the identical optimal classification tree shown in Figure 8. Switching from a classification tree as used here, which needs binary experimental binding data for training, to a regression tree, which uses quantitative IC50 values, improves the prediction performance only slightly.

The ANN was designed as described in (Gulukota, et al., 1997): A feed-forward neural network with three layers was built, consisting of an input layer with 180 neurons, a hidden layer with 50 neurons and an output layer with one neuron. The Aspirin/MIGRAINES software package from the MITRE Corporation (http://www.emsl.pnl.gov:2080) was used to simulate the network.
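A minimal sketch of this architecture is shown below, assuming a sparse 9 x 20 input encoding (which yields the 180 input neurons) and sigmoid activations; the weights here are random placeholders, whereas the original network was trained with the Aspirin/MIGRAINES package.

import numpy as np

rng = np.random.default_rng(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptide):
    """Sparse encoding of a 9-mer: 9 positions x 20 amino acids = 180 inputs."""
    x = np.zeros(180)
    for i, aa in enumerate(peptide):
        x[i * 20 + AMINO_ACIDS.index(aa)] = 1.0
    return x

# 180 -> 50 -> 1 architecture as described in the text; the weights here are
# random placeholders, not trained values
W1, b1 = rng.normal(scale=0.1, size=(50, 180)), np.zeros(50)
W2, b2 = rng.normal(scale=0.1, size=(1, 50)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(peptide):
    h = sigmoid(W1 @ encode(peptide) + b1)   # hidden layer, 50 neurons
    return sigmoid(W2 @ h + b2)[0]           # single output neuron

print(predict("ILKEPVHGV"))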



For the Additive Method, the coefficient values determined in (Doytchinova, et al., 2002) were used. The training data used in that paper to determine the coefficient values was a subset of the Train-set described above.

2.4  Introducing the stabilized matrix method (SMM)

Scoring matrices quantify the contributions of individual residues in a peptide to binding. The matrix element s_{i,a} corresponds to amino acid a at position i of the peptide. The total score S_k for a given peptide k with the amino acids a_k(i) at positions i is then given by the summation:

S_k = s_0 + \sum_{i=1}^{9} s_{i,\,a_k(i)}     (1)

where s_0 is a constant offset.
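A minimal sketch of this scoring scheme, assuming a 9-mer peptide over the 20 standard amino acids; the actual matrix values and offset for HLA-A0201 are given in Table 1 below, and the toy call uses an all-zero matrix.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def matrix_score(peptide, s, s0):
    """Equation (1): S_k = s0 + sum_i s[i, a_k(i)] for a 9-mer peptide."""
    return s0 + sum(s[i, AMINO_ACIDS.index(aa)] for i, aa in enumerate(peptide))

# toy usage with an all-zero matrix; real entries are listed in Table 1
s = np.zeros((9, 20))
print(matrix_score("ILKEPVHGV", s, s0=10.14))  # prints 10.14 for the zero matrix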

In this section, the novel stabilized matrix method (SMM) is developed. It determines the values for s_{i,a} and s_0 by minimizing the distance between the predicted scores S_k and the measured affinities of the peptides in the Train-set:

\Phi = \sum_{k} \left\| S_k - \ln(\mathrm{IC50}_k) \right\| \;\rightarrow\; \min     (2)

For a peptide with a measurable IC50 value, the norm in equation (2) has the form:

\left\| S_k - \ln(\mathrm{IC50}_k) \right\| = \left( S_k - \ln(\mathrm{IC50}_k) \right)^2     (3)

Several peptides in the Train-set have affinities too low for an IC50 value to be measured. For these 'heavy non-binders', it is only known that their ln(IC50) value is equal to or greater than the largest experimentally measurable value, which was cutoff = ln(50,000) for the Train-set. Accordingly, for these peptides the norm in (2) is set to

\left\| S_k - \ln(\mathrm{IC50}_k) \right\| = \begin{cases} \left( S_k - \mathrm{cutoff} \right)^2 & \text{if } S_k < \mathrm{cutoff} \\ 0 & \text{if } S_k \geq \mathrm{cutoff} \end{cases}     (4)

To avoid over-fitting, a second term is added to the minimization function in (2):

\Phi + \lambda \sum_{i,a} s_{i,a}^2 \;\rightarrow\; \min     (5)

By minimizing with a non-zero λ value, a tradeoff is introduced between optimally reproducing the experimental IC50 values (including their inevitable experimental error) and keeping the parameters s_{i,a} small. This forces towards zero all parameters s_{i,a} that do not significantly lower the distance Φ. The optimal value for λ is determined by 10-fold cross-validation on the Train-set, i.e. splitting the total set of peptides into 10 subsets, establishing a scoring matrix using 9 of these subsets and making predictions with that matrix for the left-out subset. Figure 5 depicts the sum of the distances between these predictions and the experimental values as a function of λ.

Figure 5: Cross-validated distance as a function of λ



The optimal predictions were made with λ_opt = 1. The resulting SMM scoring matrix is shown in Table 1. The lower a score, the better an amino acid is suited for binding at the given position.

A similar mathematical concept is used to solve 'inverse problems', where λ is called the regularization parameter. A short introduction to inverse problems is given in (Press, et al., 1992), chapter 18. To minimize equation (5), a commercial non-linear optimizer (Frontline Systems, 1999) using a generalized-reduced-gradient method is applied.
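To make the fitting procedure concrete, here is a minimal sketch of the SMM objective and its minimization. It uses scipy's L-BFGS-B optimizer in place of the commercial generalized-reduced-gradient optimizer mentioned above and a sparse 9 x 20 peptide encoding; all names are illustrative, and λ would be chosen by the 10-fold cross-validation described above (λ_opt = 1 for the Train-set).

import numpy as np
from scipy.optimize import minimize

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CUTOFF = np.log(50_000)  # largest measurable ln(IC50) for the Train-set

def encode(peptide):
    """Sparse 9 x 20 encoding of a 9-mer, flattened to 180 features."""
    x = np.zeros(9 * 20)
    for i, aa in enumerate(peptide):
        x[i * 20 + AMINO_ACIDS.index(aa)] = 1.0
    return x

def smm_objective(theta, X, y, measurable, lam):
    """Equations (2)-(5): squared error against ln(IC50) for measurable peptides,
    one-sided error for heavy non-binders, plus the regularization term."""
    s, s0 = theta[:-1], theta[-1]
    pred = X @ s + s0
    err = np.where(measurable,
                   (pred - y) ** 2,                                     # eq. (3)
                   np.where(pred < CUTOFF, (pred - CUTOFF) ** 2, 0.0))  # eq. (4)
    return err.sum() + lam * np.sum(s ** 2)                             # eq. (5)

def fit_smm(peptides, ln_ic50, measurable, lam=1.0):
    """Return the 9 x 20 scoring matrix and the offset s0.
    For heavy non-binders, pass measurable=False; their ln_ic50 entry is ignored."""
    X = np.array([encode(p) for p in peptides])
    y = np.asarray(ln_ic50, dtype=float)
    m = np.asarray(measurable, dtype=bool)
    theta0 = np.zeros(X.shape[1] + 1)
    res = minimize(smm_objective, theta0, args=(X, y, m, lam), method="L-BFGS-B")
    return res.x[:-1].reshape(9, 20), res.x[-1]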

Table 1: SMM scoring matrix for binding to HLA-A0201

       Pos 1   Pos 2   Pos 3   Pos 4   Pos 5   Pos 6   Pos 7   Pos 8   Pos 9
A       0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
R       0.58    0.74    3.17    0.13    0.78    1.30    1.35    0.33    2.76
D       4.90    0.74    1.46   -1.50   -0.15   -0.14    1.50    1.06    2.76
N       1.47    0.74    2.03   -1.48   -0.36   -1.78    0.51   -0.76    2.76
C       0.93    0.74    0.96   -0.96   -0.82   -1.15    0.56    0.57    2.76
E       3.46    0.74    2.79   -1.24    0.40   -0.74   -0.17    0.39    2.76
Q       1.63    0.74    1.24   -0.34   -0.41   -0.52    0.61    0.36    2.76
G       0.50    0.74    1.08   -0.83   -0.64   -0.60    1.22    0.54    2.76
H       1.33    0.74    0.90   -1.44   -0.39   -0.60   -0.43    0.06    2.76
I       0.58   -0.42    1.54    1.67   -0.78   -1.40   -0.29    0.59   -0.50
L       0.29   -3.06   -0.38   -0.17   -0.03   -1.48   -0.01    0.26    0.50
K      -0.20    0.74    2.25    0.22    0.05    1.43    1.73    1.20    2.76
M      -1.46   -2.72   -1.02    0.36   -0.41   -1.43   -1.08    0.58    0.91
F      -1.24    0.74    0.13   -0.45   -1.84   -1.48   -1.64   -1.10    2.76
P       3.63    0.74    1.12   -0.67    0.93   -0.90   -1.60   -0.66    2.76
S       0.24    0.74    1.05   -0.25   -0.26   -0.26    0.18    0.08    2.76
T       0.82    0.17    1.70   -0.63   -0.75   -2.20    0.17   -0.00    1.06
W       0.38    0.97   -0.03   -2.38   -1.49   -1.07   -1.67    0.02    2.76
Y      -1.91    0.74   -0.89   -0.12   -1.74   -1.37   -2.34   -0.77    2.76
V      -0.09   -0.31    1.28    1.15   -0.44   -2.13    0.03    1.04   -0.81

Offset: 10.14


2.5 Evaluating prediction quality

In this chapter, the predictions of diverse methods on equally diverse datasets have to be compared. To do this fairly, ROC curves are used (Bradley, 1997): For a given cut-off value which separates peptides by their predicted score into potential binders and non-binders, the two variables sensitivity (true positives / total positives) and 1-specificity (false positives / total negatives = false alarm rate) are calculated. By systematically varying the cut-off value from the lowest to the highest predicted score, a ROC curve like Figure 6 is plotted. Prediction quality is measured by the area under the curve (AUC), which is 0.5 for random predictions and 1.0 for perfect predictions. The AUC is equivalent to the probability that the score of a randomly chosen binder is higher than that of a randomly chosen non-binder. This measure has the advantage of not relying on a single arbitrarily chosen cut-off value for the prediction score, and can be equally applied to all datasets and prediction methods.
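A minimal sketch of this AUC measure via its probabilistic interpretation (the fraction of binder/non-binder pairs ranked correctly, with ties counted as 0.5). Note that for methods where a lower score means stronger binding, the scores would be negated first; the example values are made up.

import numpy as np

def auc(scores_binders, scores_nonbinders):
    """AUC = probability that a randomly chosen binder scores higher than a
    randomly chosen non-binder (ties count as 0.5)."""
    b = np.asarray(scores_binders, dtype=float)[:, None]
    n = np.asarray(scores_nonbinders, dtype=float)[None, :]
    return float(np.mean(b > n) + 0.5 * np.mean(b == n))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]))  # 11 of 12 pairs correct: ~0.917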

2.5.1 Statistical significance for differences in AUC

To assess whether one prediction is significantly better than another, the set of peptides for which predictions are made was re-sampled. Using bootstrapping with replacement, 50 new datasets were generated with a constant ratio of positives to negatives. The difference in AUC between the two predictions is then calculated on each new dataset. One prediction is considered significantly better than another if the distribution of the differences in AUC values lies significantly above zero, which is assessed using a standard t-test with a p-value of 0.001.
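A sketch of this resampling test, under the assumption that predictions are available as score arrays and binders are marked by a boolean label array; the helper names are illustrative, and the t-test below is scipy's two-sided default, whereas the criterion of the distribution lying above zero corresponds to a one-sided test.

import numpy as np
from scipy import stats

def _auc(scores, labels):
    """AUC for one prediction: labels is a boolean array (True = binder)."""
    b, n = scores[labels][:, None], scores[~labels][None, :]
    return np.mean(b > n) + 0.5 * np.mean(b == n)

def compare_predictions(scores_a, scores_b, labels, n_boot=50, seed=0):
    """Bootstrap the test set with a fixed binder/non-binder ratio and test
    whether the AUC difference (A minus B) differs significantly from zero."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = np.where(labels)[0], np.where(~labels)[0]
    diffs = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(pos, size=len(pos), replace=True),
                              rng.choice(neg, size=len(neg), replace=True)])
        diffs.append(_auc(scores_a[idx], labels[idx]) - _auc(scores_b[idx], labels[idx]))
    res = stats.ttest_1samp(diffs, 0.0)
    return float(np.mean(diffs)), res.pvalue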

2.6  Comparison of matrix based predictions: SMM, PM, BIMAS and SYFPEITHI

The four matrix based methods had to predict which of the peptides in the Blind-set are binders. The Blind-set is the only set truly ‘blind’ to all methods, i.e. none of the peptides in this set were included in the training data of any of the prediction methods. Figure 6 depicts ROC curves for all predictions. It indicates that the performance ranks in the order SMM > BIMAS > PM over almost the entire range. SYFPEITHI is the worst method for sensitivities above 0.42 and becomes the best for specificities above 0.97. This is due to the fact that SYFPEITHI predictions reflect other components of the antigen presentation pathway in addition to MHC-I binding, leading to a decrease in sensitivity when predicting binding alone. The area under the ROC curve (AUC) gives a single number describing the prediction accuracy of a method. Figure 6 translates to AUC values of 0.869 for SMM, 0.846 for BIMAS, 0.795 for PM and 0.745 for SYFPEITHI. The AUC values for the predictions of all methods on the other test sets (BIMAS-set and SYFPEITHI-set) are listed in Table 2. For both of these sets, SMM makes the best predictions of all truly blind methods.

Figure 6: ROC curve for matrix based prediction methods on the Blind-set.

For each method, the cutoff was varied from the lowest to the highest predicted score of any peptide in the Blind-set. For each cutoff value, the sensitivity and specificity were calculated, and plotted in the graph.



Table 2: Comparison of prediction quality

Prediction Method    Independent Binding Assumption    AUC: Blind    AUC: SYFPEITHI    AUC: BIMAS
SMM, λ=1             Yes                               0.869         0.848             0.866
SMM, λ=0             Yes                               0.856         0.846             0.865
BIMAS                Yes                               0.846         0.829             (0.875)
PM                   Yes                               0.795         0.792             0.757
SYFPEITHI            Yes                               0.745         (0.865)           0.754
SMM + pair coef.     No                                0.873         0.852             0.869
ANN                  No                                0.796         0.788             0.762
Additive method      No                                0.820         0.770             0.830
CART                 No                                0.708         0.620             0.539

AUC values in parentheses indicate that the corresponding test set was contained in the training data of that method (see section 2.3).

Two elements make the SMM approach different from other matrix based methods. First, it incorporates the experimental information of heavy non-binders precisely into the distance defined in equation (2). Since only a lower bound of the IC50 value can be determined for heavy non-binders, previous approaches have either left them out entirely or tried to fit them exactly to that bound. Second, the regularization technique was used. Because of errors in the experimental measurements, there can be multiple sets of matrix coefficients that reproduce the experimental data within the range of error. Choosing the set of coefficients that gives the minimum distance may therefore overfit the data. By incorporating a regularization parameter (λ in equation 5), a set of coefficients is chosen that reproduces the experimental results reasonably well while keeping the parameter values small. This effectively prevents overfitting. Table 2 shows that the AUC values at λ_opt = 1 are better than those at λ = 0 for all test sets.

2.7  Comparison of general predictions: ANN, CART and the additive method

Figure 7 shows ROC curves for the general prediction methods (those not making the independent binding assumption) on the Blind-set. The corresponding AUC values are also listed in Table 2. The ANN and the additive method make consistently better predictions than the CART tree. None of these previously published general methods reaches the prediction quality of SMM or BIMAS, which make the independent binding assumption.

Figure 7: ROC curves for general prediction methods on the HLA-A2 Blind-set.

The ROC curves were generated as described in the legend of Figure 6. Since there are only three terminal nodes in the CART tree, corresponding to three different scores, its ROC curve consists of only two non-trivial points.



At first glance, this is surprising, as the general methods should be able to describe all binding mechanisms, including the simple case of independent binding. Why does the more restrictive matrix approach perform better? This can best be seen for the CART algorithm. As shown in Figure 8A, CART suggests splitting node 3, but not node 2. If this exactly described reality, peptides in node 3 would bind differently than peptides in node 2, signifying an interaction between positions 2 and 1 only for peptides with an L or M at position 2. In contrast, if the independent binding assumption is true and there are no interactions between positions 2 and 1, the split described for node 3 should also be applicable to node 2. Testing this on the Blind-set (Figure 8B) shows that transferring the split actually works, as the new nodes resulting from the transferred split are more homogeneous than node 2. The CART algorithm cannot identify this split, because it can only use the peptides in node 2 to decide about splits at node 2, and there are only 11 binders left in that node. This shows that, if the independent binding assumption holds to a high degree, general methods simply require more training data to achieve the prediction quality of matrix-based methods.

In the case of the additive method and the ANN, lack of data has led to overfitting. The additive method has 1850 free parameters, and the ANN architecture taken from (Gulukota, et al., 1997) has more than 9000 free parameters (connection weights). This number of parameters cannot be determined reliably from the Train-set, which contains only 533 experimental data points. For the ANN, choosing a different architecture with fewer free parameters would probably have improved its prediction quality. For the additive method, the overfitting seems to affect mainly the interaction coefficients. When these coefficients, which describe interactions between neighboring amino acids, are neglected and only the terms compatible with the independent binding assumption are kept, the prediction quality improves.



Figure 8: Classification tree for peptides binding to HLA-A0201

Each ellipse denotes a node corresponding to a set of peptides, with the first number indicating the number of binders and the second number indicating the total number of peptides in the node. The splits in the nodes are symbolized by arrows, which lead to daughter nodes. To the right of each node is a reference number (1-5) used in the text. (A) The optimal tree generated for the Train-set. (B) The tree in (A) is used to classify peptides in the Blind-set. The additional split on the lower left with dotted arrows, which is taken from the split of node 3 in (A), improves the classification. This indicates that the tree in (A) was not optimal.



2.8  Extending SMM with pair coefficients

The results of the previous section show that, given the limited data, general methods tend to perform worse than matrix based ones. In this section, a general method that uses the matrix predictions as a starting point is developed. This is done by quantifying the contribution to binding of interactions between peptide positions with pair coefficients s'_{i,a,i',a'}. For example, the coefficient s'_{3A,7L} (**A***L**) describes the difference in binding between the following two scenarios: (1) an alanine is at the third AND a leucine at the seventh position of the peptide, and (2) the sum of the average contributions of having an alanine at the third position, described by matrix value s_{3A} (**A******), and having a leucine at the seventh position, described by s_{7L} (******L**). The total score S'_k for a given peptide k with the amino acids a_k(i) at positions i is then given by

S'_k = S_k + \sum_{i < i'} s'_{i,\,a_k(i),\,i',\,a_k(i')}     (6)

where S_k is the matrix score defined in equation (1), and the sum runs over all pairs of amino acids found in the peptide. For 9-mer peptides, this would result in 20 × 20 × 36 = 14400 different pair coefficients. Since it is impossible to determine that many coefficients given the limited data, a two-stage selection process is applied: In the first step, all coefficients for which fewer than N_min = 10 peptides containing the corresponding amino acid pair exist in the Train-set are eliminated, as they lack sufficient experimental information. The optimal values for the remaining 269 pair coefficients can then be calculated by minimizing

\Phi' + \lambda' \sum_{i,a,i',a'} \left( s'_{i,a,i',a'} \right)^2 \;\rightarrow\; \min     (7)

where Φ' is the same as Φ in equation (2), except that the scores S'_k are used instead of S_k. When optimizing the pair coefficients s'_{i,a,i',a'} in equation (7), the matrix coefficients s_{i,a} are frozen at their optimal values determined from equation (5).

In a second selection step, the Train-set is split into 10 equal-size non-overlapping subsets, and 10 different optimal values for each pair coefficient are determined by leaving out one subset at a time and minimizing equation (7). If these values for a pair coefficient include both positive and negative ones, the coefficient cannot be estimated reliably from the Train-set and is therefore discarded. Setting all 145 discarded coefficients to zero, equation (7) is minimized to determine the values of the remaining 124 pair coefficients.
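The two selection steps can be sketched as follows; the data structures and names are illustrative assumptions, and the actual coefficient fitting of equation (7) is omitted here.

import numpy as np
from itertools import combinations

def candidate_pairs(peptides, n_min=10):
    """First selection step: keep only pair coefficients (i, a, i', a') for which
    at least n_min training peptides carry amino acid a at i and a' at i'."""
    counts = {}
    for pep in peptides:
        for i, j in combinations(range(len(pep)), 2):
            key = (i, pep[i], j, pep[j])
            counts[key] = counts.get(key, 0) + 1
    return [key for key, c in counts.items() if c >= n_min]

def is_consistent(values_from_10_fits):
    """Second selection step: a coefficient is kept only if its 10
    leave-one-subset-out estimates do not disagree in sign."""
    v = np.asarray(values_from_10_fits)
    return not ((v > 0).any() and (v < 0).any())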

Figure 9: Using cross-validation to determine the optimal λ'

The distance between the predictions of SMM + pair coefficients and the experimental IC50 values of the Train-set is plotted in comparison to the distance achieved with the SMM matrix alone. The improvement in prediction quality is highest for λ'_opt = 5.

The optimal value for λ' is again determined using cross-validation, similar to λ in equation (5). The cross-validated distance, calculated using (2), is shown in Figure 9 for various values of λ', in comparison to that of the plain matrix predictions. For very high values of λ', the pair coefficients are forced to stay around zero, and the prediction accuracy is equal to that of the plain matrix approach. For λ' = 0, there is no restriction on the values of the pair coefficients, leading to over-fitting, and the resulting performance is much worse than that of the plain matrix. In between, there is a maximum in distance improvement at the optimal value λ'_opt = 5. The values of the pair coefficients are listed in Table 3.

Using the combined SMM matrix + pair coefficient method leads to an improvement in prediction quality over the plain SMM matrix on all test sets (Table 2). No other general method even reaches the quality of the plain SMM matrix predictions on any of the test sets. There are three main advantages of the novel matrix + pair coefficients approach: First, the pair coefficients are determined by systematically investigating the differences between the matrix predictions and the experimental values. As the matrix method is highly accurate, this is a better starting point than trying to determine both position contributions and position interactions all at once. Second, the interactions under investigation are limited to those with a sufficient amount of consistent training data. The third advantage is again the use of a regularization parameter (λ' in equation 7), which prevents the pair coefficients from overfitting the data. Its importance can be seen clearly in Figure 9: without it (λ' = 0), the pair coefficients reduce prediction quality below that of the matrix alone.



Table 3: Pair coefficient values

Each entry gives the two interacting peptide positions with their amino acids and the corresponding pair coefficient value.

Pos i  Pos i'  Value  |  Pos i  Pos i'  Value  |  Pos i  Pos i'  Value
5 S    9 L     -0.81  |  2 L    4 P     -0.16  |  2 V    6 A      0.18
2 V    5 A     -0.52  |  4 E    9 L     -0.15  |  8 P    9 L      0.19
2 L    4 Q     -0.43  |  2 L    8 L     -0.14  |  7 N    9 V      0.20
3 S    9 L     -0.42  |  1 A    7 A     -0.13  |  8 S    9 L      0.20
8 H    9 L     -0.41  |  1 A    9 V     -0.13  |  2 L    5 E      0.20
2 L    6 F     -0.41  |  4 P    9 L     -0.13  |  2 L    7 T      0.21
6 A    9 V     -0.39  |  2 L    7 P     -0.12  |  4 S    9 V      0.21
8 I    9 L     -0.36  |  8 A    9 V     -0.11  |  1 E    2 L      0.22
2 L    3 A     -0.36  |  5 A    9 V     -0.11  |  5 I    9 L      0.22
2 V    7 G     -0.36  |  6 V    9 L     -0.11  |  5 V    9 V      0.23
1 G    2 L     -0.34  |  4 A    9 V     -0.10  |  3 A    7 A      0.24
6 L    9 V     -0.34  |  4 K    9 V     -0.10  |  2 L    7 G      0.24
2 L    4 C     -0.33  |  6 S    9 L     -0.09  |  6 Q    9 L      0.25
5 T    9 V     -0.32  |  2 L    4 K     -0.06  |  6 P    9 V      0.26
2 L    6 Y     -0.32  |  3 A    9 L     -0.06  |  5 E    9 L      0.28
1 A    2 L     -0.31  |  1 A    6 A     -0.06  |  2 L    3 S      0.28
2 L    3 L     -0.31  |  5 G    9 V     -0.01  |  3 A    5 A      0.28
2 L    5 I     -0.31  |  5 A    7 A      0.03  |  2 L    4 G      0.29
2 L    3 Y     -0.29  |  5 A    8 A      0.04  |  2 V    3 R      0.30
2 V    5 V     -0.27  |  2 V    9 L      0.04  |  7 S    9 V      0.30
2 L    4 S     -0.26  |  3 V    9 V      0.07  |  5 Q    9 L      0.32
4 K    9 L     -0.26  |  2 L    6 H      0.09  |  2 L    7 A      0.33
7 L    9 L     -0.26  |  7 A    8 A      0.10  |  2 L    5 V      0.33
6 P    9 L     -0.25  |  5 L    9 A      0.10  |  2 V    3 G      0.34
1 V    9 L     -0.25  |  1 I    9 V      0.11  |  1 D    2 L      0.35
2 L    6 V     -0.25  |  2 L    5 R      0.12  |  6 T    9 L      0.35
1 Q    9 L     -0.25  |  4 K    5 A      0.13  |  3 L    9 V      0.38
8 V    9 L     -0.24  |  2 L    4 E      0.13  |  6 F    9 L      0.40
2 L    7 S     -0.24  |  2 L    6 G      0.13  |  4 L    9 V      0.40
7 A    9 V     -0.22  |  7 T    9 L      0.14  |  1 A    3 A      0.41
7 P    9 L     -0.22  |  3 A    6 A      0.14  |  1 P    2 L      0.41
2 L    8 P     -0.21  |  3 A    4 K      0.14  |  1 C    9 L      0.41
1 L    9 V     -0.21  |  2 V    4 S      0.15  |  2 L    6 L      0.41
2 I    9 V     -0.20  |  7 V    9 V      0.15  |  2 L    7 R      0.42
6 A    7 A     -0.20  |  1 D    9 L      0.15  |  2 L    5 S      0.46
2 L    6 I     -0.18  |  7 G    9 L      0.16  |  6 I    9 L      0.47
8 G    9 L     -0.18  |  1 L    9 L      0.16  |  6 L    7 L      0.52
2 V    6 P     -0.17  |  2 V    5 L      0.17  |  4 Q    9 L      0.56
1 T    2 V     -0.17  |  2 L    3 P      0.17  |  8 D    9 L      0.67
7 A    9 L     -0.17  |  2 L    8 G      0.18  |  5 S    9 V      0.74
2 L    6 A     -0.17  |  1 C    2 L      0.18  |
2 L    4 T     -0.17  |  1 A    8 A      0.18  |



Previously, there was one published study comparing several methods to predict peptide binding to MHC-I (Yu, et al., 2002). There it was reported that the optimal choice of a prediction method depends on the number of peptides available for training: an ANN was outperformed by scoring matrices when the training data consisted of 234 peptides, while the ANN outperformed scoring matrices when trained on a set of over a thousand peptides. In principle, this agrees with the results reported here.

The new scoring matrix + pair coefficients approach should work over a large range of training set sizes. If little data is available, few or no pair coefficients will meet the criteria for inclusion, and the method is reduced to the SMM matrix. With more training data available, more pair coefficients are included, thus adjusting the complexity of the method to the available training data.

2.9  Distribution of pair coefficient values

Another advantage of the pair coefficients over methods like ANNs is that the extracted rules for binding are easy to interpret. The values determined for the pair coefficients provide direct information about the MHC-I-peptide binding mechanism. Since peptides bind in an extended conformation, one would expect the absolute values of the pair coefficients to be lower if their associated amino acid positions are farther apart. In Figure 10, two quantities that reflect the influence of position distance are plotted: the percentage of pair coefficients discarded due to conflicting information, and the average absolute value of the retained pair coefficients at each distance. Distances 6 and 7 have the highest percentages of discarded coefficients, and distance 7 has the lowest average value of retained coefficients, indicating weak or no interactions between positions at such distances. To a lesser extent, the levels of interaction at distances 2, 5, 6 and 8 are also weaker than those at distances 1, 3 and 4. This agrees with the expected trend of stronger interactions for closer positions, but to a much lesser extent than expected. Also, this study does not confirm that (i, i+2) neighbors influence each other more strongly than (i, i+1) neighbors, which was expected because the side chains of next-neighbor amino acids face in the same direction, thus allowing for direct interactions. Taken together, this shows that interactions are not limited to amino acids in direct contact, but can also play a role over longer distances, probably through the conformation of the peptide backbone.
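A small sketch of how such a per-distance summary could be computed from the coefficients; the dictionary-based data layout is an assumption made for illustration, not the format used in the original analysis.

import numpy as np
from collections import defaultdict

def interaction_by_distance(retained, discarded):
    """Group pair coefficients by position distance |i - i'| and report, per
    distance, the fraction discarded and the mean absolute retained value.
    `retained` maps (i, a, j, b) -> coefficient; `discarded` is a set of keys."""
    kept, n_disc, n_total = defaultdict(list), defaultdict(int), defaultdict(int)
    for (i, a, j, b), value in retained.items():
        kept[abs(i - j)].append(abs(value))
        n_total[abs(i - j)] += 1
    for (i, a, j, b) in discarded:
        n_disc[abs(i - j)] += 1
        n_total[abs(i - j)] += 1
    summary = {}
    for d in sorted(n_total):
        mean_abs = float(np.mean(kept[d])) if kept[d] else 0.0
        summary[d] = (n_disc[d] / n_total[d], mean_abs)
    return summary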



Figure 10: Distribution of pair coefficients.

The percentage of discarded pair coefficients (top) and the average values of retained pair coefficients (bottom) are plotted against the distance between two interacting positions.

While the presented results show that interactions between peptide positions do exist, the values of the pair coefficients are roughly an order of magnitude lower than the entries of the scoring matrix. This could be due to two reasons: either the true values of these coefficients are indeed small, or the noise level in the datasets is high, so that the coefficients are forced to low values by the regularization parameter λ'. Since the plain SMM matrix already performs very well, it is plausible that the true values of the pair coefficients are indeed significantly lower than the values of the matrix entries. The low impact of pair interactions compared to the contributions of individual residues explains why methods making the independent binding assumption can make good predictions at all.

2.10 Summary

A novel sequence based algorithm was introduced that predicts the affinity of peptides to MHC-I molecules. Its basis is a matrix based prediction (SMM) where the entries describe the contributions of individual residues in a peptide to binding. Determining the matrix entries by minimizing the distance between predicted scores and measured IC50 values for a set of training peptides leads to significantly better predictions on three independent test sets than were obtained for a number of previously published methods.

The SMM matrix was combined with pair coefficients describing interactions between peptide positions, which further improve prediction quality. The pair coefficient values quantify, for the first time, the influence of these interactions on peptide binding. Compared to the values of the matrix entries, they are rather small, which explains why good predictions are possible without taking interactions within the peptide into account at all. The distribution of the coefficient values also shows that interactions in a peptide are not limited to residues in direct contact, but can also play a role over longer distances, probably through the conformation of the peptide backbone.

