Документ взят из кэша поисковой машины. Адрес оригинального документа : http://www.mccme.ru/albio/slides/raffinot.pdf
Дата изменения: Sat Oct 11 22:43:22 2008
Дата индексирования: Tue Oct 2 11:51:58 2012
Кодировка:

Common intervals of genomes
Mathieu Raffinot CNRS LIAFA

Context:
comparative genomics. set of genomes partially/totally annotated Informative group of genes or domains ?

Ex: COG database

Many difficulties !
Biology

What are two similar genes ? What about alternative splicing ? When are two genes close (notion of distance) ? What is an interesting cluster ? basis: pressure selection > keep genes working together close How to model clusters ? Graphs / strings ? How to compute those clusters ? How to manage the sets of clusters and extract useful information ? Computer science

One of the simplest model : genomes as strings of units common intervals

Simplest case in this model: 2 genomes !

E D

A A

B B

B C

D A

Common interval: one interval on each chromosome same set of gene in each interval externals bounds not in the set of gene

E D

A A

B B

B C

D A

E D

A A

B B

B C

D A

E D

A A

B B

B C

D A

E D

A A

B B

B C

D A

E D

A A

B B

B C

D A

E D

A A

B B

B C

D A

How many common intervals ?
X first chromosome, X= x1 x2 .. xn Y second chromosome, Y= y1 y2 .. ym Common alphabet , || <= max(|X|,|Y|) Y

D

A
y2

B

C

A
ym Rank(Y,1)[B]=3

Y= y1 f f f f f o(Y,1)= o(Y,2) = o(Y,3) = o (Y,4) = o (Y,5) = D A B C A B C B C A C A A

D= 1 A = 2 B = 3 C = 4 A = 1 B = 2 C = 3 B = 1 C= 2 A = 3 C = 1 A = 2 A =1

Int[k]

3 2 1

E

A

B

B

D

Y

D

A
y2

B

C

A
ym

Y= y1 fo(Y,1) = D A B C

B = 1 A =2 C = 3

Rank(Y,2)[A]=2

Int[k] are nested ! They form a tree. !

3 2 1

E

A

B

B

D

2 n valid Int[k] at max ! 2 nm common intervals at maximum The bound is reached !!

How to identify all them ?

Two approaches Direct computation (Didier) O(nm) but + Lowest common ancestor (otherwise O(n m logn) + No structure in the output ! + Complexity does not depend of the input + No index Fingerprint computation on a single string + index+ merge after + O(n+|L1|log n + m |L2| log m) (can be worst than Didier) + Structure in the output and possibility of search of fingerprint + Complexity does depend of the input + Keep the index for further computations

S = s1..sN string of length n alphabet of size ||, not fixed (possibly O(n))

A fingerprint f : set of character(s) of a substring si.. sj

General problem:

Compute and represent the set of all fingerprints of S

Examples: dccbcbabbbc {a} {b} {c} {d} {c,d} {b,c} {a,b} {b,c,d} {a,b,c} {a,b,c,d} acbdcadad

{a} {b} {c} {d} {a,c} {a,d} {b,c} {b,d} {c,d} {a,b,c} {a,c,d} {b,c,d} {a,b,c,d}

Maximal location of f i fingerprint f j

not in f, not in f +

Number of maximal locations: L <= n|| But is usually much less k = {a1,a2,..,ak}

Complexity of the bound easily reached

w1 = a1, w k = w k1 ak w k1

w2=a1(a2)a1, w3=(a1a2a1)a3(a1a2a3), ... |wk| . |Lk| = k . (2k 1) |L|k = 2k+1(k+2)

|L|k=o(|wk| . |Lk| )

Naming technique

{a,c,e,f}

= {a,b,c,d,e,f,g,h}

log || +1

a

b

c

d

e

f

g

h

{a,c,e,f,g} Names = {[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]}

{a,c,e,g} Fingerprints ={[7],[9],[10]}

Amir, Apostolico, Landau, Satta 2003

k distinct characters Changing a character: O(log || log n) (n new names maximum by level) One iteration: n log || log n || iterations: || n log || log n Important: different set of names for each iteration

a k=2

b

c

d d c c b c b a b b b c

d c c b c b a b b b c

Tsur 2005

List of fingerprints: d c a d1 b d

{d}, {c,d}, {a,c,d}, {a,c}, {a,b,c} d c

{([0],[1]), B} d c a d1 d c a d1 b

{([1],[1]), B}

{([1],[0]), A}

{([1],[0]), B}

{([1],[1]), A}

List of changes: {([0],[0]), A} {([0,0]), B} | {([0],[1]), B} {([1],[1], B} {([1],[0]), A} {([1],[0]), B} {([1],[1]), A} Radix sort on the pairs + unique > new names

Tsur 2005 List of changes: {([0],[0]), A} {([0],[0]), B} | {([0],[1]), B} {([1],[1], B} {([1],[0]), A} {([1],[0]), B} {([1],[1]), A} [2] [3] [4] [5] > > > > ([0],[ ([0],[ ([1],[ ([1],[ 0] 1] 0] 1] ) ) ) )

New list: {[2], A} {[2], B} | {[3], B} {[5], B} {[4], A} {[4], B} {[5], A}

{([2],[2]), C}

{([2],[3]),C}

New list: {([2],[2]),C} | {([2],[3]),C} {([2],[5]),C} {([4],[5]),C} {([4],[4]),C} {([5],[4]),C} Radix sort, ...

Tsur 2005

Radix sort: O(n) (bounded integers) One iteration : n log || || iterations: || n log || No more name search !

Problems does not depend of L distinct names at each iteration

Our approach (2006)

Simple sequence: no repeated character

lfo(i)

a b a c e a b a c d lfo(4)=ceab

a b a c e a b a c d lfo(2) = bace

Concatenate # to the sequence Bijection L / proper prefixes of lfo(i) cea bac a b a c e a b a c d # a b a c e a b a c d #

Compute all lfo(i) of S#

Our approach (2006) a | bcbadca# a ab | cbadca# a b b

How to calculate all lfo(i) ? abc | badca# a c bc

abcbadca abcba | dca# a c abcbadca | # bc b a ba a

abcb | adca# a c bc b b

bc

bc

bc

abcbad | ca# a c bc b a d abcbadca# a c bc b a d ba ad d d

abcbadc | a# a c bc b a d ba ad dc c d c c

a c

bc b a d

ba ad dc c

d c a

c a

a

bc

bc

bc

ba ad dc c #

d c a #

c a #

a #

a b

bc b a

ba ad d c

d c a

c a

a

bc lfo(i)

Our approach (2006)

Naming all proper prefixes of lfo(i)

a b

bc b a

ba ad d c

d c a

c a

a

n lists: Tsur algorithm Common names

Simple sequence: O(|L| log ||) General sequence: O(n+|L| log ||) |L|<= n ||

Faster or as fast as that of Tsur

Our approach (2006)

Properties and operations on our names

a unique set of names Compute the LCP of two fingerprints in log ||

names sorted by lexicographic order of fingerprints

Fingerprint trie

Chan et al, ESA 2007 bdcad

bd dc c a

c a d

a d

d

O(|F|) space

Search in O(|f|log(|f|/||))

O(|F|log||) time

Back to common intervals:

1) Build the tree for the first sequence: O(n+|L1| log ||) 2) Build the tree for the second sequence: O(m+|L2| log ||) 3) Merge the two trees ! Complexity: O((n+m)+(|L1|+|L2|) log ||) time.

Open problems Memory space reduction Order ? Approximate fingerprint Distance by fingerprints 2D fingerprints