Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://www.mccme.ru/albio/slides/regnier.pdf
Äàòà èçìåíåíèÿ: Thu Oct 9 23:36:05 2008
Äàòà èíäåêñèðîâàíèÿ: Tue Oct 2 11:22:27 2012
Êîäèðîâêà:
Overlap Graph and Clumps
Mireille R´gnier e
LIX and INRIA Mireille.Regnier@inria.fr web page : algo.inria.fr/regnier

Octob er, 9-th ­ 2008

AlBio08

An Optimized Counting Graph

1


Outline

1 Introduction and principles 2 Overlap Graph 3 Combinatorics of clumps 4 Op en problems

AlBio08

An Optimized Counting Graph

2


Cis-regulation

AlBio08

An Optimized Counting Graph

3


Cis-regulation changes

AlBio08

An Optimized Counting Graph

4


Example : the caudal motif in early developmental enhancers from Drosophila
GCTTTTTTATGGTCGGC TCGCTTTTATGGCCCAA CAGTTTTTATGTCTTTA CCGTTTTGATGGCGGTG AAATTTTTAGGGAACCA GCCCGTTTATGGTTCCC GACACTTTATGTGACAA TCGGATTTATGACACAA ATGTCTTTATGATTATT GCAACTTTTGGGCCATA CCCTTTTGTTGGCCAAA

Papatsenko et al., 2002

A| C| G| T|

2 3 4 2

3 7 0 1

2 3 5 1

2 2 1 6

100 300 100 6 11 11

0 0 2 9

9 0 0 2

00 00 2 11 90

2 0 7 2

1 6 1 3

3 4 1 3

3 5 2 1

4 2 1 4

7 2 1 1

(a) Aligned Motifs

(b) Countings

AlBio08

An Optimized Counting Graph

5


Example : the caudal motif in early developmental enhancers from Drosophila
GCTTTTTTATGGTCGGC TCGCTTTTATGGCCCAA CAGTTTTTATGTCTTTA CCGTTTTGATGGCGGTG AAATTTTTAGGGAACCA GCCCGTTTATGGTTCCC GACACTTTATGTGACAA TCGGATTTATGACACAA ATGTCTTTATGATTATT GCAACTTTTGGGCCATA CCCTTTTGTTGGCCAAA

Papatsenko et al., 2002

A| C| G| T|

2 3 4 2

3 7 0 1

2 3 5 1

2 2 1 6

100 300 100 6 11 11

0 0 2 9

9 0 0 2

00 00 2 11 90

2 0 7 2

1 6 1 3

3 4 1 3

3 5 2 1

4 2 1 4

7 2 1 1

(a) Aligned Motifs

(b) Countings

A| -0.22 0.06 -0.22 -0.22 -0.62 -1.32 -1.32 -1.32 0.98 -1.32 -1.32 -0.22 -0.62 0.06 0.06 0.28 0.75 C| 0.06 0.75 0.06 -0.22 0.06 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 0.62 0.28 0.47 -0.22 -0 G| 0.28 -1.32 0.47 -0.62 -0.62 -1.32 -1.32 -0.22 -1.32 -0.22 1.16 0.75 -0.62 -0.62 -0.22 -0.62 -0 T| -0.22 -0.62 -0.62 0.62 0.62 1.16 1.16 0.98 -0.22 0.98 -1.32 -0.22 0.06 0.0 6 -0.62 0.28 -0

(c) Position Sp ecific Scoring matrix
AlBio08 An Optimized Counting Graph 6


Probability Weight Matrices

Probability function ! Threshhold s : A word (site) is similar iff score (w ) > s . ! Pvalue : Probn (H ; score (H ) > s ) .

AlBio08

An Optimized Counting Graph

7


Probability Weight Matrices

Probability function ! Threshhold s : A word (site) is similar iff score (w ) > s . ! Pvalue : Probn (H ; score (H ) > s ) . Algorithms and data structures ! candidates-motifs extraction

AlBio08

An Optimized Counting Graph

8


Probability Weight Matrices

Probability function ! Threshhold s : A word (site) is similar iff score (w ) > s . ! Pvalue : Probn (H ; score (H ) > s ) . Algorithms and data structures ! candidates-motifs extraction Model accuracy ! Improve PWM with structural information

AlBio08

An Optimized Counting Graph

9


Principles

Biological function ! Overrepresented words ! underrepresented words Statistical softwares ! candidates-motifs extraction ! statistical significance

AlBio08

An Optimized Counting Graph

10


Probability Computation
"Classic" methods vs Graphs ! induction ; [GuOd81] ! languages [ReSz98] ;automata [NiFlSa00].

AlBio08

An Optimized Counting Graph

11


Probability Computation
"Classic" methods vs Graphs ! induction ; [GuOd81] ! languages [ReSz98] ;automata [NiFlSa00]. Space/time complexity ! Exact (all n) AhoPro (NI IGenetika, Inria) ! O (n â ||) ; n : text size ; : data structure.

AlBio08

An Optimized Counting Graph

12


Probability Computation
"Classic" methods vs Graphs ! induction ; [GuOd81] ! languages [ReSz98] ;automata [NiFlSa00]. Space/time complexity ! Exact (all n) AhoPro (NI IGenetika, Inria) ! O (n â ||) ; n : text size ; : data structure. Drawback ! n dep endency ; ! numerical precision ;

AlBio08

An Optimized Counting Graph

13


Probability Computation
"Classic" methods vs Graphs ! induction ; [GuOd81] ! languages [ReSz98] ;automata [NiFlSa00].

AlBio08

An Optimized Counting Graph

14


Probability Computation
"Classic" methods vs Graphs ! induction ; [GuOd81] ! languages [ReSz98] ;automata [NiFlSa00]. Space/time complexity ! Approximation RSA-tools, Spatt, AhoSoft (NI IGenetika, Inria) ! O (1 â ||)

AlBio08

An Optimized Counting Graph

15


Probability Computation
"Classic" methods vs Graphs ! induction ; [GuOd81] ! languages [ReSz98] ;automata [NiFlSa00]. Space/time complexity ! Approximation RSA-tools, Spatt, AhoSoft (NI IGenetika, Inria) ! O (1 â ||) Drawback ! size of the data structure ; ! tightness ;
AlBio08 An Optimized Counting Graph 16


AhoCorasick searching automaton

c a t a t

gt a c a c a c a c

a t g a t t a t

c a t t t

t a c c c a c a c a

1

a

2

a

3

a

4

a

5

a

6

c

7

8

a

AlBio08

An Optimized Counting Graph

17


AhoCorasick automaton : searching and computing
! n : wn = largest prefix found =ATA ; ! n + 1 : character x found :
x = G , wx = ATAG Graph, w x = A, C , T , wx Graph
n+1

= ATAG

* x = C ; w = A · TA, wn+1 = TAC Graph * x = T ; w = AT · A, wn+1 = AT Graph * x = A; AA, TAA G , wn+1 = root

c a t a t

gt a c a c a c a c

a t g a t t a t

c a t t t

t a c c c a c a c a

1

a

2

a

3

a

4

a

5

a

6

c

7

8

a

AlBio08

An Optimized Counting Graph

18


AhoPo :pobability computation

Step n : (pn (w ))w Graph . pn (w ) = Prob (largest prefix ending at n isw ). Induction pn
+1

p

(ATAG ) = pn (ATA) · p (G )

n +1

(AT ) = pn (ATA) · p (T )

+ pn (AGA) · p (T )

+ pn (CA) · p (T ) + pn (TA) · p (T )

AlBio08

An Optimized Counting Graph

19


AhoCorasick automaton : searching and computing
Left relation H1 RL H2 FatherLOG (H1 ) = Father ~ {ATACACA, ATAGATA} ATA ATA :Largest prefix of ATACACA that is a suffix in H
LOG

(H2 )

AlBio08

An Optimized Counting Graph

20


AhoCorasick automaton : searching and computing
Left relation H1 RL H2 FatherLOG (H1 ) = Father ~ {ATACACA, ATAGATA} ATA ATA :Largest prefix of ATACACA that is a suffix in H Right relation H1 RR H2 MotherR ¯ {ATACACA, ATACACA} ACA {AGACACA, }
OG LOG

(H2 )

(H1 ) = Mother

ROG

(H2 )

ACA :Largest suffix of ATACACA that is a prefix in H
AlBio08 An Optimized Counting Graph 21


Computation on Graph : induction

AlBio08

An Optimized Counting Graph

22


AhoCorasick automaton : searching and computing
First occurrence at p osition n = 18 GGGGGGGG no H H |ATACACA | | ···

|n

AlBio08

An Optimized Counting Graph

23


AhoCorasick automaton : searching and computing
First occurrence at p osition n = 18 GGGGGGGG no H H AND NOT GGGG CATT | ATACACA| |ATACACA | | ···

|n

GGGG ACAT | ATACACA| GG ACATAT | ATACACA| ···

GG AGACAC | ATACACA|

All marked nodes in AhoGraph
AlBio08 An Optimized Counting Graph 24


Ovelap graph :pobability computation

Compute (pn (H))H LOG dep endency to the past

H

using LOG, ROG.

ROG information to transfer (memory)

AlBio08

An Optimized Counting Graph

25


Ovelap graph :pobability computation

Compute (pn (H))H LOG dep endency to the past

H

using LOG, ROG.

ROG information to transfer (memory) Graph traversals...

AlBio08

An Optimized Counting Graph

26


Clump counts

First occurrence : "small" n. k occurrences : large n. approximation clumps generating functions

AlBio08

An Optimized Counting Graph

27


Clump counts
With H1 = AACGGAA and H2 = GAATCA, AACGG AACGGAACG GAATC ACGGAA k -decomp osition counted with coef. (-1)k [BoClReVa05].

AlBio08

An Optimized Counting Graph

28


Clump counts
With H1 = AACGGAA and H2 = GAATCA, AACGG AACGGAACG GAATC ACGGAA k -decomp osition counted with coef. (-1)k [BoClReVa05]. Contribution (-1)7 = -1 With AACAACAACAA = AA(CAA)3 AACAACAACAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA· CAA·CAA·CAA·CAA·CAA·ACAACAACAA· No contribution : even = odd AACAACAACAA·CAA·CAA·CAA·CAA·CAA·CAA·CAA· CAA·CAA·CAA·CAA·ACAACAACAA·
AlBio08 An Optimized Counting Graph 29


Open problems : Frameshift and riboswitches

AlBio08

An Optimized Counting Graph

30


Open problems : Frameshift and riboswitches

Boxes : (w1 , w2 , w1 , w2 ) ~~ with : P. Nicodeme.
AlBio08 An Optimized Counting Graph 31


Open problems : Frameshift and riboswitches

AlBio08

An Optimized Counting Graph

32