Документ взят из кэша поисковой машины. Адрес оригинального документа : http://www.mccme.ru/albio/slides/steyaert.pdf
Дата изменения: Thu Oct 9 23:36:55 2008
Дата индексирования: Tue Oct 2 11:52:04 2012
Кодировка:
The Minisatellite Transformation Problem: The Run-Length-Encoding Approach and Further Enhancements

Behshad Behzadi & Jean-Marc Steyaert, Ecole Polytechnique Mohamed Abouelhoda, Cairo University Robert Giegerich, Bielefeld University


Biology...
Minisatellites consist of tandem arrays of short repeat units found in genome of most higher eukaryotes. High degree of polymorphism at minisatellites has applications from forensic studies to the investigation of the origins of modern human groups.


...Biology...
These repeats are called variants. MVR-PCR is designed to find the variants. As an example, MSY1 is the minisatellite on the human Y-chromosomes. There are five different repeats (variants) in MSY1.


Different Repeat Types (Variants) of MSY1
Map Types:

Distance between types:


Minisatellite Maps: The MSY1 Dataset
DNA Sequence: ... CGGCGAT CGGCGAC CGGCGAC CGGCGAC CGGAGAT... Unit types (Alphabet): X= CGGCGAT Y= CGGCGAC Z= CGGAGAT Minisatallite Map: XYYYZ

· Example Maps from the MSY1 Dataset:


Evolution Mechanism of Minisatellites The unequal crossover is a possible mechanism for tandem duplication:

s1 s2 s3 s4 s3 s4 s3 s4 s1 s2 s 3 s
4

s1 s2 s3 s4 s2 s3 s s3 s4 s1 s2 s
3 4

s

4


Evolutionary Operations
Insertion Deletion Mutation Amplification (p-plication) Contraction (p-contraction)


Examples of operations
Insertion of d abbc Deletion of c abbcb Mutation of c into d caab 4-plication of c abcb 2-contraction of b abbc abc abccccb daab abbb abbdc


Cost Functions


Hypotheses
All the costs are positive. The cost of duplications (and contractions) is less than all other operations. Triangle inequality holds: M(x,y)+M(y,z) <= M(x,z) ; M(x,x) = 0


Transformation distance between s and t
Applying a sequence of operations on s transforming it into t. The cost of a transformation is the sum of costs of its operations. TD = Minimum cost for a possible transformation of s into t. Any transformation which gives this minimum is called an optimal transformation.


Previous Works
BИrard & Rivals (RECOMB'02)

Behzadi & Steyaert (CPM'03, JDA'04)

Behzadi & Steyaert (WABI'04)


Generation vs. Reduction
· The symbols of s which generate a non-empty substring of t are called generating symbols. Other symbols of s are vanishing symbols. (These symbols are eliminated during the transformation by a deletion or contraction.) The transformation of symbol x into non-empty string s is called generation. The transformation of a non-empty string s into a unique symbol x is called reduction.


The Generation x

zbxxyb

The optimal generation of a non-empty string s from a symbol x can be achieved by a nond i ti


The schema for an optimal transformation
There exists an optimal transformation of s into t in which all the contractions are done before all amplifications.


Run-Length Encoding and Run Generation
The RLE encoding of is . The lengths of the encoded strings with length n and m is denoted by m' and n'. There exists an optimal generation of a non-empty string t from a single symbol x in which for every run of size k > 1 in t the k-1 right symbols of the run are generated by duplications of the leftmost symbol of the run


Preprocessing --> Core algorithm
Compute the generation cost of all substrings of the target string t from any symbol x of the alphabet: G(t)[x,i,j] Compute the optimal generation/reduction costs over the substrings by recurrence using dynamic programming. The running time is given by: O((m'3+n'3)|Alpha|+mn'2+nm'3+mn)


A different look at Duplication History

s

1

s2

s

3

s

4

s

5

s

6

s

7

s

8

s

1

s

2

s

3

s

4

s5

s

6

s

7

s

8

observed
s s
Right duplication Left duplication Right duplication Left duplication Right duplication Right duplication Right duplication
3 6 6 6 6 6 7 8 3

s3 s

s3 s4 s

s

s
1

6

s3 s4 s5 s

s

2

s

4

s1 s3 s4 s5 s

s1 s2 s3 s4 s5 s

s

s
5

7

s1 s2 s3 s4 s5 s 6 s s1 s
2

s3 s4 s5 s6 s 7 s

s

8


Alignment of Minisatellite Maps (1)

Example of an alignment:
s s s s s5 s s s s s s3 s4 s5 s s7 s8

S

1

2

3

4

6

7

8

1

2

6

matches

R r1 r2 r3 r4 r5 r6 r1 r2 r3 r4 r5 r6

The two maps S and R

Alignment of S and R


Alignment of Minisatellite Maps (2)

s S

1

s2

s

3

s

4

s

5

s

6

s

7

s

8

matches

R r1 r2 r3 r4 r5 r6

Alignment of S and R


Improved Model of Comparison Left and Right Simultaneous Dups
Example: :
S: S:

R:

R:

BИrard et al., Model

Our NEW Model
It has less score. Because there is a rule to allow simultaneous left/right duplications in S and R

There is no rule to allow simultaneous left/right duplications in S and R


Algorithm Layout
Observations:
s s s s s5 s s s

1

2

3

4

6

7

8

S
matches

r1

r2

r3

r4

r5

r6

R

Alignment of S and R

Therefore:


Finding an Optimal Duplication History

s

1

s

2

s

3

s

4

s5

s

6

s

7

s

8

s

3

s

s
1

6

s

2

s4 s s
5 7

[s4..s6]

s

8


Experimental Running Times

BИrard et al. · MSATcompare is ours


Detection of Duplication Bias in MSY1 Dataset
E1: run algorithm allowing left- and right- duplications EL: allow only left duplications ER: allow only right duplications