Document retrieved from a search engine cache. Original document address: http://www.mccme.ru/albio/slides/filimonov.pdf
Last modified: Thu Oct 9 22:27:32 2008
RECOGNITION OF PROTEIN FUNCTION USING THE LOCAL SIMILARITY
Kirill E. Alexandrov Dmitry A. Filimonov Boris N. Sobolev Vladimir V. Poroikov
Institute of Biomedical Chemistry of Russian Academy of Medical Sciences, Moscow, Russia


Agenda

1. History of Problem
2. Sequence Local Similarity
3. Algorithm of Similarity Calculation
4. Local Similarity Approach Paradigm
5. Algorithm of Protein Function Recognition
6. Prediction Accuracy Estimation
7. Results of Local Similarity Approach Evaluation
8. Acknowledgements


The central dogma of SAR/QSAR/QSPR: Property = Function(Structure)
Continuity hypothesis: the smaller the difference between structures, the smaller the difference between properties.

$y_{pred} = x_0 + \sum_i x_i F_i(S)$

$F_i(S)$ = LogP, ..., (LogP)^2, ... - traditional QSAR
$F_i(S) = Sim(S, S_i)$ - similarity based QSAR

MLR - multiple linear regression
PLS - projections to latent structures
ANN - artificial neural network
SVM - support vector machine
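The similarity-based form of the model above can be sketched in a few lines. This is only an illustration: the slide does not say which similarity measure Sim is, so the Tanimoto coefficient on descriptor sets is an assumption, and the names `tanimoto` and `predict` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two descriptor sets (assumed choice of Sim)."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def predict(s, references, coeffs, x0=0.0):
    """y_pred = x0 + sum_i x_i * Sim(S, S_i) for reference structures S_i."""
    return x0 + sum(x * tanimoto(s, ref) for x, ref in zip(coeffs, references))
```

The coefficients x_i would come from one of the regression methods listed on the slide (MLR, PLS, ANN, SVM); fitting them is outside this sketch.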


The local similarity principle

QSAR with CoMFA

Tripos' patented Comparative Molecular Field Analysis (CoMFA) has been used as the method of choice in hundreds of published QSAR studies.


Neighborhoods of atoms descriptors

MOLECULAR BIOLOGY QUANTUM CHEMISTRY QUANTUM FIELD THEORY:
M = V + VgM = V + VgV + VgVgV + VgVgVgV + ...
Mi = Vi + VigM = Vi + Vig(M1 + M2 + ... + Mm)

All descriptors are based on describing each atom of a molecule together with its neighborhood:
MNA - multilevel neighborhoods of atoms
RMNA - reaction multilevel neighborhoods of atoms
QNA - quantitative neighborhoods of atoms
FNA - fuzzy neighborhoods of atoms
. ., . . (2006) , L, (2), 66-75.


Multilevel neighborhoods of atoms descriptors - MNA

[Figure: structural formula of the example molecule]

MNA/0: C
MNA/1: C(CN-H)
MNA/2: C(C(CC-H)N(CC)-H(C))


Multilevel neighborhoods of atoms descriptors - MNA

MNA/2:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))

[Figure: structural formula of the example molecule]


Prediction of activity spectra for organic compounds

According to the Bayes formula, the probability P(A|S) that compound S has activity A is:

$P(A|S) = P(S|A) \cdot P(A) / P(S)$

If the descriptors D1, ..., Dm of the organic compound are mutually independent, then:

$P(S|A) = P(D_1, \ldots, D_m | A) = \prod_i P(D_i|A)$

P(A) and P(A|Di) are calculated as sums over all organic compounds of the training set.
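A minimal sketch of the Bayes formula under the independence assumption stated above. It is not the authors' exact estimator: the slide's summation formulas did not survive, so the probabilities here are plain training-set frequencies, the +1/+2 smoothing is an added assumption, and `train` and `posterior` are hypothetical names.

```python
from math import prod


def train(training_set):
    """training_set: list of (set_of_descriptors, is_active) pairs."""
    n_act = sum(1 for _, active in training_set if active)
    n_inact = len(training_set) - n_act
    counts = {}  # descriptor -> [count among actives, count among inactives]
    for descriptors, active in training_set:
        for d in descriptors:
            counts.setdefault(d, [0, 0])[0 if active else 1] += 1
    return n_act, n_inact, counts


def posterior(descriptors, model):
    """P(A|S) = P(S|A) P(A) / P(S), with P(S|A) = prod_i P(D_i|A)."""
    n_act, n_inact, counts = model
    p_a = n_act / (n_act + n_inact)
    # Smoothed frequencies (assumption) keep unseen descriptors from zeroing the product.
    like_a = prod((counts.get(d, [0, 0])[0] + 1) / (n_act + 2) for d in descriptors)
    like_not = prod((counts.get(d, [0, 0])[1] + 1) / (n_inact + 2) for d in descriptors)
    numerator = like_a * p_a
    # P(S) is expanded over the class A and its complement.
    return numerator / (numerator + like_not * (1 - p_a))
```

Here P(S) is obtained by summing over the active and inactive classes, so the result is a proper probability between 0 and 1.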




Quantitative neighborhoods of atoms descriptors - QNA

$Q_i = a_i \sum_k [g(C)]_{ik} b_k$

a_i and b_k are parameters of atoms i and k; g(C) is a function of the connectivity matrix C.

$P_i = B_i^{-1/2} \sum_k \left(\mathrm{Exp}(-\tfrac{1}{2} C)\right)_{ik} B_k^{-1/2}$
$Q_i = B_i^{-1/2} \sum_k \left(\mathrm{Exp}(-\tfrac{1}{2} C)\right)_{ik} B_k^{-1/2} A_k$

$A = \tfrac{1}{2}(IP + EA)$, $B = IP - EA$,

IP is the first ionization potential, EA is the electron affinity.
Feynman R. Ph. Phys. Rev., 1939, 56, 340-343. Robert G. Parr et al. J. Chem. Phys., 1978, 68(8), 3801-3807. Gasteiger J, Marsili M. Tetrahedron, 1980, 36, 3219-3228. Rappe A K and W A Goddard III. J. Ph. Ch., 1991, 95, 3358-3363.


Quantitative neighborhoods of atoms descriptors - QNA

ChemNavigator database in QNA space: 976,545,026 QNA descriptors of 24,621,668 molecules

[Figure panels: Initial QNA Space; Normalized QNA Space]


Quantitative neighborhoods of atoms descriptors - QNA

[Figure panels: Nicotinic Acid; Aspirin; Sulfathiazole]


GUSAR - QNA based prediction of quantitative properties of organic compounds

[Figure panels: CDK2 inhibitors; DHFR inhibitors; ACE inhibitors; Vibrio fischeri; Chlorella vulgaris; Tetrahymena pyriformis]

[Figure: delta R2 test, delta Q2 and delta R2 (scale -0.10 to 0.20) for PLS, MLR, GFA, HQSAR, CoMFA, EVA, CoMSIA, 3D Cerius2, 2D Cerius2]


OK. But how can local similarity be used for recognition of protein function?


Pairwise sequence alignment

1996, Autumn

For a long time, homology-derived annotation based on pairwise sequence alignment was the general way to predict protein function.


Sequence Local Similarity. Frame 20, shift from -8 to +8
Shifted windows of the compared sequence, each with its number of matches to the query:

AANRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVA  2
ANRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVAL  1
NRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALR  1
RDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRA  0
DPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRAL  1
PSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALF  2
SQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFG  1
QFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGR  1
FPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRF  2
PDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFP  0
DPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPA  1
PHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPAL  0
HRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALS  9  (the best match)
RFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSL  0
FDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLG  3
DVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGI  1
VTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGID  1
TRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGIDA  2

Query sequence:
GTAINKPLSEKMMLFGMGKRRCIGEVLAKWEIFLFLAILLQQLEFSV

Ri = 9


Sequence Local Similarity. Algorithm of Similarity Calculation

i is the position number in the query sequence A
a and b are amino acid residues in sequence A and sequence B
m is the current shift between sequence A and sequence B
F is the frame size
Ri is the primary similarity value
Si is the local similarity value for position i in the query sequence A with sequence B

About 1000 sequences per second.
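The primary similarity Ri can be illustrated with a short sketch. The slide's formulas for Ri and Si did not survive extraction, so this covers only an Ri-like count under stated assumptions: residues score 1 when identical and 0 otherwise, the frame of size F is centred on position i, and the best count over all shifts m is kept. The function name `primary_similarity` and its parameters are hypothetical.

```python
def primary_similarity(a, b, i, frame=20, max_shift=8):
    """Best identity count within a frame centred at position i of query a,
    taken over shifts m in [-max_shift, max_shift] applied to sequence b."""
    half = frame // 2
    start, stop = i - half, i - half + frame
    best = 0
    for m in range(-max_shift, max_shift + 1):
        matches = 0
        for p in range(start, stop):
            q = p + m  # position in b aligned with position p in a
            if 0 <= p < len(a) and 0 <= q < len(b) and a[p] == b[q]:
                matches += 1
        best = max(best, matches)
    return best
```

Applied to the query sequence and the best-matching window from the "Frame 20" slide with no shift and the frame centred at index 22 (0-based), this count is 9, in agreement with Ri = 9 on that slide.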


Sequence Local Similarity.

13.11.1996


"If there exists correspondence between similarity of substrates and protein sequences in cytochrome P450 superfamily?"
[Figure: proportion of homologs versus number of clusters (0 to 150), panels CYP4 and CYP7. Legend: -- real data; ... average random data; *** confidence interval]

The results of substrate-based clustering correspond to the homology-based classification for families CYP 1, 2, 3, 4, 5, 6, 7, 11.
For other P450 families (CYP 8, 17, 19, 21, 24, 26, 27), substrate-based clustering contradicts the traditional classification.

Borodina Yu.V., Lisitsa A.V., Poroikov V.V., Filimonov D.A., Sobolev B.N., Archakov A.A. Nova Acta Leopoldina., 2003, 87(329), 47-55.


"Quantifying the Relationships among Drug Classes"

A subset of the MDDR database containing 65 367 compounds organized in 249 sets, each associated with a specific biological target.

"By multiple criteria, bioinformatics and chemoinformatics networks differed substantially, and only occasionally did a high sequence similarity correspond to a high ligand-set similarity."

Hert, J., Keiser, M. J., Irwin, J. J., Oprea, T. I., Shoichet, B. K. "Quantifying the Relationships among Drug Classes" J. Chem. Inf. Model., 2008, 48(4), 755-765.


[Diagram labels: Fundamental theory; Unique law of nature; Machine Learning; Ab initio principles; Learning by example; Molecular Modelling; Partial estimate; Homology]


Protein function recognition based on learning by example

[Figure: sets A, B, C and the complement ¬A]

It is based on a data set of sequences with known properties. This data set must be subdivided into "positive" and "negative" examples: group A and its complement ¬A.


Is a universal similarity measure reasonable?


Sequence Local Similarity. It is a descriptor itself!

The descriptor is defined as the similarity value $S_{ik}$ for position i of the sequence under study and the experimentally annotated sequence k.


Sequence Local Similarity. Algorithm of Classification
Membership of the sequence under study in class A is estimated using the statistical function B(A), where:

i = 1,...,n is the position number in the sequence under study;
k = 1,...,N is the number of the experimentally annotated sequence;
wk(A), wk(¬A) are the weights of the experimentally annotated sequence k in class A and in its complement ¬A;
Sik is the similarity for position i of the sequence under study and the experimentally annotated sequence k.


General Classification Problem

[Figure: calculated value versus observed value for the Negative and Positive classes; the threshold separates the TP, FP, TN and FN regions]


General Classification Problem. Criteria of classification accuracy

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
Accuracy (Concordance) = (TP+TN)/N
Predictive value positive = TP/(TP+FP)
Predictive value negative = TN/(TN+FN)
False Negative Rate = FN/(TP+FN) = Error1
False Positive Rate = FP/(TN+FP) = Error2
Positive Likelihood = SENS/(1-SPEC)
Negative Likelihood = (1-SENS)/SPEC
...
http://www.intmed.mcw.edu/clincalc/bayes.html
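The criteria above translate directly into code. The sketch below computes them from the four confusion-matrix counts; the function name `classification_metrics` and the dictionary keys are illustrative choices, not part of the original slides.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy criteria from confusion-matrix counts (N = TP + FP + TN + FN)."""
    n = tp + fp + tn + fn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / n,
        "predictive_value_positive": tp / (tp + fp),
        "predictive_value_negative": tn / (tn + fn),
        "false_negative_rate": fn / (tp + fn),  # Error1
        "false_positive_rate": fp / (tn + fp),  # Error2
        # The likelihood ratios are undefined when specificity is exactly 1 or 0;
        # this sketch does not guard against that case.
        "positive_likelihood": sens / (1 - spec),
        "negative_likelihood": (1 - sens) / spec,
    }
```

Note that sensitivity plus the false negative rate, and specificity plus the false positive rate, each sum to one, which gives a quick sanity check on any reported figures.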


General Classification Problem. Independent Accuracy of Prediction (IAP)

IAP is calculated using the Leave-One-Out Cross-Validation procedure.

Bi is the estimate for sequence i from the class A
Bj is the estimate for sequence j from its complement ¬A
θ(x) = 1 if x > 0, θ(x) = 1/2 if x = 0, θ(x) = 0 if x < 0
NA is the number of sequences in the class A
N¬A is the number of sequences in its complement ¬A
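The definitions above suggest a pairwise statistic: the step function applied to Bi - Bj, summed over all pairs of one sequence from the class and one from its complement, and divided by the number of pairs. The slide's own formula did not survive extraction, so this form, the value 1/2 for ties, and the function name `iap` are assumptions. The scores passed in are taken to be leave-one-out estimates computed elsewhere.

```python
def iap(scores_a, scores_not_a):
    """Fraction of (class, complement) pairs ranked correctly; ties count as 1/2."""
    total = 0.0
    for bi in scores_a:
        for bj in scores_not_a:
            if bi > bj:
                total += 1.0
            elif bi == bj:
                total += 0.5
    return total / (len(scores_a) * len(scores_not_a))
```

Under this reading, a value of 1 means every sequence of the class scores above every sequence of the complement, and 0.5 corresponds to random ranking.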

Poroikov V. et al. (2000) J. Chem. Inf. Comput. Sci., 40, 1349-1355. P. A. Flach, N. Lachiche, Machine Learning, 2004, 57, 233-269.


How can sequence local similarity be used?


Training sets used for the method evaluation

• Serine proteases EC 3.4.21: 28 groups of the 4th EC level, 623 sequences.
• Gold standard, specially composed to test statistical learning methods: 5 enzyme superfamilies and 56 families, 832 sequences.
• P450 superfamily (CPD database): 242 proteins classified by substrate specificity (579 compounds); 163 proteins classified by inhibitor specificity (272 compounds).


Serine proteases

The average accuracy reached its maximum (close to one) at a maximal shift of 50 and a frame of 50. At these parameter values, 24 of 28 classes were recognized with 100% accuracy.


Gold Standard
[Figure panels: Families; Superfamilies]

The average IAP exceeded 0.99. 4 superfamilies were recognized with 100% accuracy. 45 families were recognized with IAP = 1 and 11 families were recognized with IAP > 0.96. Superfamilies seem to be clearly recognized by alignment-based methods; however, families within the same superfamily are recognized less well by the analysis of aligned sequences with phylogenomics methods.


CYP450 classification. Frame 20, Band 100.
[Figure: IAP versus group size, panels: Substrate specificity; Inducer specificity]





Conclusions

Our approach showed high efficiency of function prediction with different sequence description types.
High prediction accuracy was obtained for different levels of protein functional classification.
The projection method is useful both for functional specificity prediction and for sequence mapping, i.e. for revealing the local determinants of functional specificity.
The approach "RECOGNITION OF PROTEIN FUNCTION USING THE LOCAL SIMILARITY" will be published in Journal of Bioinformatics and Computational Biology, 2008.


Acknowledgements
Co-authors

This work was supported by the Russian Foundation for Basic Research (grant N 04-04-49390- ). We are grateful to A.V. Lisitsa for providing the data on cytochrome P450 substrate and inducer specificity.