This document was retrieved from a search engine cache. Original document address: http://mouse.genebee.msu.ru/~bennigsen/nhunt_files/poster_mccmb_2011.pdf
Modified: Thu Jul 21 09:27:22 2011
Indexed: Mon Oct 1 19:30:10 2012
Nhunt: new program for DNA sequence similarity searching
Yury A. Pekov (1), Sergei S. Spirin (2)

(1) Faculty of Bioengineering and Bioinformatics of Moscow State University, Moscow, Russia
(2) Belozersky Institute of Moscow State University, Moscow, Russia

Introduction
DNA sequence similarity searching in databases is one of the most important problems of bioinformatics.

Background
In the case of coding sequences, the program TBLASTN is used successfully. In the case of noncoding sequences, a number of programs are used, e.g. FASTA, BLASTN and discontiguous MEGABLAST. But each of these programs has some significant disadvantage.

Results

Comparison with FASTA
FASTA version: 36.3.4. FASTA parameters (in command line): -f -10 -g -5, ktup = 3.

Nhunt and FASTA programs were run to search for homologues of E. coli tRNA in 5 archaean genomes.

Table 1: Number of found alignments with E-value less than given

Program   10^-19   10^-3   0.1   3    20    Run time
FASTA     1        6       14    27   49    5.5
Nhunt     2        8       17    42   129   1.4

Low complexity example
Example alignment:

Query:  ctgtttaccaggtcaggtccggaaggaagcagccaaggcagatgacgcgt
        ||||  ||||| | |   ||| ||  || |||||||  || |||  || |
Sbjct:  ctgtgaaccagcttatcgccgcaatcaaacagccaaatcatatgcagcat

Identity = 33/50 (66%)
Strand: Plus/Minus

Alignment features without scoring matrix adjustment:
Score: 97.0 (31.8 bits), E-value: 0.1586

Alignment features after scoring matrix adjustment:
Initial score: 103.5 (33.7 bits), adjusted score: 64.80 (22.3 bits), E-value: 112.932

Query frequencies: A: 0.26, T: 0.17, G: 0.33, C: 0.24
Subject frequencies: A: 0.33, T: 0.21, G: 0.17, C: 0.29

Adjusted substitution matrix (subject letters are in rows, query letters are in columns):

        A       T       G       C
A     3.6    -3.5    -1.5    -2.2
T    -3.8     6.3    -2.7    -4.3
G    -5.2   -22.1     4.8    -5.9
C    -2.4    -4.2    -1.7     4.1

Aim

The aim of this work was to create Nhunt, a computer program for DNA sequence similarity searching that would exceed both FASTA and BLASTN in sensitivity. An original algorithm for diagonal selection was applied, which makes it possible to adjust the speed/sensitivity ratio.
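For orientation, the sketch below shows the classic FASTA-style first stage that diagonal selection builds on: index the k-tuples of the query, look up each k-tuple of the subject, and count hits per diagonal. It is not Nhunt's original diagonal-selection algorithm, which the poster does not describe. The function name `diagonal_hits`, the parameter `min_hits`, and the toy sequences are assumptions made for illustration only.

```python
from collections import defaultdict


def diagonal_hits(query, subject, ktup=3, min_hits=2):
    """Count k-tuple hits per diagonal (generic FASTA-style first stage).

    A diagonal is identified by d = subject_pos - query_pos.
    Returns {diagonal: hit_count} for diagonals with at least min_hits hits.
    """
    # Index every k-tuple of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the subject and accumulate hits on each diagonal.
    counts = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            counts[j - i] += 1

    # Keep only diagonals that pass the (hypothetical) selection threshold.
    return {d: n for d, n in counts.items() if n >= min_hits}


# Toy sequences, chosen only to exercise the function.
print(diagonal_hits("acgtacgtgg", "ttacgtacgtcc", ktup=3, min_hits=3))
```

In such a scheme, lowering `ktup` or `min_hits` passes more diagonals to the later, more expensive alignment stages, which raises sensitivity at the cost of speed. This is one simple way a speed/sensitivity ratio can be exposed as a tunable setting.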

Comparison with BLASTN
BLASTN version: 2.2.24. BLASTN parameters (in command line): -W 7 -F F -r 5 -q -4 -G 10 -E 6.

Comparison with the discontiguous MEGABLAST program showed that it is inferior to BLASTN in sensitivity.

Nhunt and BLASTN programs were run to search for homologues of E. coli miscRNA in three bacterial genomes.

Table 2: Number of found alignments with E-value less than given

Genome          Program   10^-10   10^-6   10^-4   0.1   20
B. cereus       BLASTN    2        5       14      146   859
B. cereus       Nhunt     2        6       15      238   4135
P. aeruginosa   BLASTN    4        6       11      97    369
P. aeruginosa   Nhunt     4        8       13      187   2060
Y. pestis       BLASTN    23       29      34      112   684
Y. pestis       Nhunt     25       30      41      167   2796

Realization and availability
The program is implemented in the C programming language. Executable files for Linux x86 and amd64 architectures, as well as the source code of the program, are available on the Internet: http://mouse.belozersky.msu.ru/bennigsen/nhunt.html

Methods

Algorithm
The algorithm for alignment construction is based on the FASTA algorithm, but is free of its inherited disadvantages. The formula for E-value calculation is based on the Karlin-Altschul extreme value distribution. Its parameters were fitted using a large random database.
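The Karlin-Altschul statistics mentioned above are commonly written as E = K * m * n * exp(-lambda * S), with the bit score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2^(-S'). The sketch below only illustrates these standard formulas. The values of `lam`, `K` and the sequence lengths are placeholders chosen for the example, not the parameters fitted for Nhunt, which the poster does not list.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw score to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(raw_score, lam, K, query_len, db_len):
    """Expected number of chance alignments scoring at least raw_score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder parameters (assumptions, not Nhunt's fitted values).
lam, K = 0.2, 0.1
m, n = 100, 1_000_000

S = 100.0
bits = bit_score(S, lam, K)
E = e_value(S, lam, K, m, n)

# The two forms agree: E = m * n * 2 ** (-bits).
assert math.isclose(E, m * n * 2 ** (-bits), rel_tol=1e-9)
print(f"bits = {bits:.1f}, E = {E:.3g}")
```

Fitting `lam` and `K` against a large random database, as the poster describes, would replace the placeholder constants used here.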

Low complexity problem
Assuming a Bernoulli nucleotide model, any scoring matrix can be written in the form

$$s_{ij} = \frac{1}{\lambda} \ln\left(\frac{q_{ij}}{p_i p_j}\right),$$

where $p_i = p_j = \frac{1}{4}$, $q_{ij}$ is the frequency of $i,j$-nucleotide matching, and $\lambda$ is a scale constant. But for low complexity regions it is reasonable to use an adjusted scoring matrix that prevents alignments from receiving high scores by chance. We set

$$\tilde{q}_{ij} = \tilde{p}_i \tilde{p}_j + c\, r_{ij},$$

where $\tilde{p}_i$ are nucleotide frequencies, and $c$ is a parameter that depends on $\tilde{p}_i$ and makes it possible to obtain the adjusted scoring matrix $\tilde{s}_{ij}$. In the Bernoulli model $c = 1$ and $\tilde{q}_{ij} = q_{ij}$, which implies $r_{ij} = q_{ij} - p_i p_j$. The calculation of the parameter $c$ is based on conservation of the expected score:

$$\sum_{i,j} p_i p_j s_{ij} = \sum_{i,j} \tilde{p}_i \tilde{p}_j \tilde{s}_{ij}.$$

This equation is solved by numerical methods.

Comparison with BLASTN (tRNA search)
Nhunt and BLASTN programs were also run to search for homologues of E. coli tRNA in five archaean genomes. Then the coordinates of the found homologues were compared with the coordinates of annotated archaean tRNA, and the lengths of the "wrong" and "right" fragments were calculated for each alignment.

[Figure: "right" length (Y axis, 0 to 9000) versus "wrong" length (X axis, 0 to 2000). Series: Nhunt, BLASTN, BLASTN (default).]

In the plot, one point corresponds to one alignment. The X and Y axis values are, respectively, the total lengths of the wrong and right fragments of alignments with E-value less than some value. BLASTN (default) is BLASTN with default parameters.

Conclusions

1. We have created the Nhunt computer program for DNA sequence similarity searching. This program has been successfully tested on a set of genomes.
2. We have proposed a new method of scoring matrix adjustment for low complexity regions.
3. It was shown that for all examples Nhunt exceeds the FASTA program both in sensitivity and in speed. Nhunt also exceeds BLASTN in sensitivity.

Acknowledgements

The work is partly supported by grant #10-0700685-a of the Russian Foundation for Basic Research.
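As an illustration of the scoring matrix adjustment from the "Low complexity problem" section, the sketch below solves the expected-score conservation equation for the parameter c by bisection. Several things here are assumptions rather than facts from the poster: the target frequencies `q` (a simple 70% identity model), the scale (scores are in natural-log units, i.e. lambda = 1), the adjusted matrix taking the same log-odds form as the original one, r_ij = q_ij - p_i p_j, and bisection as the numerical method. The composition vectors are the subject and query frequencies printed in the low complexity example.

```python
import math

BASES = "ATGC"
P = 0.25  # uniform background frequency of the Bernoulli model


def target_freqs(identity=0.7):
    """Hypothetical q_ij: matches share `identity`, mismatches the rest."""
    return [[identity / 4 if i == j else (1 - identity) / 12
             for j in range(4)] for i in range(4)]


def expected_score(c, q, ps, pq):
    """Sum over i, j of ps_i * pq_j * ln(q~_ij / (ps_i * pq_j)),
    where q~_ij = ps_i * pq_j + c * r_ij and r_ij = q_ij - P * P."""
    total = 0.0
    for i in range(4):
        for j in range(4):
            bg = ps[i] * pq[j]
            total += bg * math.log(1 + c * (q[i][j] - P * P) / bg)
    return total


def solve_c(q, ps, pq, iters=200):
    """Bisection for c such that the expected score is conserved."""
    uniform = [P] * 4
    # With uniform composition and c = 1 this is the original expected score.
    target = expected_score(1.0, q, uniform, uniform)
    # c must keep every adjusted frequency q~_ij positive.
    c_max = min(ps[i] * pq[j] / (P * P - q[i][j])
                for i in range(4) for j in range(4) if q[i][j] < P * P)
    lo, hi = 0.0, c_max * (1 - 1e-12)
    if expected_score(hi, q, ps, pq) > target:
        raise ValueError("no solution in the admissible range of c")
    # The expected score decreases as c grows from 0, so bisect on it.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected_score(mid, q, ps, pq) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def adjusted_matrix(c, q, ps, pq):
    """Adjusted log-odds scores (rows: subject letters, columns: query)."""
    return [[math.log(1 + c * (q[i][j] - P * P) / (ps[i] * pq[j]))
             for j in range(4)] for i in range(4)]


q = target_freqs()
subject = [0.33, 0.21, 0.17, 0.29]  # A, T, G, C from the poster's example
query = [0.26, 0.17, 0.33, 0.24]
c = solve_c(q, subject, query)
print(f"c = {c:.4f}")
for base, row in zip(BASES, adjusted_matrix(c, q, subject, query)):
    print(base, " ".join(f"{s:7.2f}" for s in row))
```

With uniform composition the solver returns c = 1, in line with the Bernoulli case described in the text. With the skewed compositions above, the most negative entry falls in row G, column T, the same cell that is most negative in the poster's example matrix. The numerical values differ because the target frequencies and the score scale are assumptions here.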