getorf



Function

Finds and extracts open reading frames (ORFs)

Description

This program finds and outputs the sequences of open reading frames
(ORFs).

The ORFs can be defined as regions of a specified minimum size between
STOP codons or between START and STOP codons.
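
The two definitions can be illustrated with a short Python sketch. It is
not getorf's implementation: it scans a single forward frame, hard-codes
the Standard code's STOP codons and the single 'ATG' START, and assumes
that the ends of the sequence also bound a region. The function name and
its parameters are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}   # STOP codons of the Standard code
STARTS = {"ATG"}                # only START codon assumed here

def orfs_in_frame(seq, frame=0, minsize=30, require_start=False):
    """Return 1-based inclusive (start, end) spans of ORFs in one forward frame.

    require_start=False: regions between STOP codons.
    require_start=True:  regions from the first START codon to the next STOP.
    The STOP codon itself is not included in the span.
    """
    seq = seq.upper()
    spans = []
    region_start = frame                                  # 0-based start of the open region
    last = frame + ((len(seq) - frame) // 3) * 3          # index just past the last full codon
    for i in range(frame, last + 1, 3):
        codon = seq[i:i + 3]
        if i == last or codon in STOPS:                   # region closed by STOP or sequence end
            begin = region_start
            if require_start:
                begin = next(
                    (j for j in range(region_start, i, 3) if seq[j:j + 3] in STARTS),
                    None,
                )
            if begin is not None and i > begin and i - begin >= minsize:
                spans.append((begin + 1, i))
            region_start = i + 3
    return spans

toy = "ATGAAATAGCCCATGTTTTGA"
print(orfs_in_frame(toy, minsize=6))                      # regions between STOP codons
print(orfs_in_frame(toy, minsize=6, require_start=True))  # regions between START and STOP
```

In the toy sequence the second region starts at 'CCC' under the first
definition, but only at the following 'ATG' under the second.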

The ORFs can be output as the nucleotide sequence or as the
translation.

The program can also output the region around the START or the initial
STOP codon or the ending STOP codons of an ORF for those doing
analysis of the properties of these regions.

The START and STOP codons are defined in the Genetic Code tables. A
suitable Genetic Code table can be selected for the organism you are
investigating.

Usage

Here is a sample session with getorf


% getorf -minsize 300
Finds and extracts open reading frames (ORFs)
Input nucleotide sequence(s): tembl:eclaci
protein output sequence(s) [eclaci.orf]:

Command line arguments

Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [.] Protein sequence
                                  set(s) filename and optional format (output
                                  USA)

Additional (Optional) qualifiers:
   -table              menu       [0] Code to use (Values: 0 (Standard); 1
                                  (Standard (with alternative initiation
                                  codons)); 2 (Vertebrate Mitochondrial); 3
                                  (Yeast Mitochondrial); 4 (Mold, Protozoan,
                                  Coelenterate Mitochondrial and
                                  Mycoplasma/Spiroplasma); 5 (Invertebrate
                                  Mitochondrial); 6 (Ciliate Macronuclear and
                                  Dasycladacean); 9 (Echinoderm
                                  Mitochondrial); 10 (Euplotid Nuclear); 11
                                  (Bacterial); 12 (Alternative Yeast Nuclear);
                                  13 (Ascidian Mitochondrial); 14 (Flatworm
                                  Mitochondrial); 15 (Blepharisma
                                  Macronuclear); 16 (Chlorophycean
                                  Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus);
                                  23 (Thraustochytrium Mitochondrial))
   -minsize            integer    [30] Minimum nucleotide size of ORF to
                                  report (Any integer value)
   -maxsize            integer    [1000000] Maximum nucleotide size of ORF to
                                  report (Any integer value)
   -find               menu       [0] This is a small menu of possible output
                                  options. The first four options are to
                                  select either the protein translation or the
                                  original nucleic acid sequence of the open
                                  reading frame. There are two possible
                                  definitions of an open reading frame: it can
                                  either be a region that is free of STOP
                                  codons or a region that begins with a START
                                  codon and ends with a STOP codon. The last
                                  three options are probably only of interest
                                  to people who wish to investigate the
                                  statistical properties of the regions around
                                  potential START or STOP codons. The last
                                  option assumes that ORF lengths are
                                  calculated between two STOP codons. (Values:
                                  0 (Translation of regions between STOP
                                  codons); 1 (Translation of regions between
                                  START and STOP codons); 2 (Nucleic sequences
                                  between STOP codons); 3 (Nucleic sequences
                                  between START and STOP codons); 4
                                  (Nucleotides flanking START codons); 5
                                  (Nucleotides flanking initial STOP codons);
                                  6 (Nucleotides flanking ending STOP codons))

Advanced (Unprompted) qualifiers:
   -[no]methionine     boolean    [Y] START codons at the beginning of protein
                                  products will usually code for Methionine,
                                  despite what the codon will code for when it
                                  is internal to a protein. This qualifier
                                  sets all such START codons to code for
                                  Methionine by default.
   -circular           boolean    [N] Is the sequence circular
   -[no]reverse        boolean    [Y] Set this to be false if you do not wish
                                  to find ORFs in the reverse complement of
                                  the sequence.
   -flanking           integer    [100] If you have chosen one of the options
                                  of the type of sequence to find that gives
                                  the flanking sequence around a STOP or START
                                  codon, this allows you to set the number of
                                  nucleotides either side of that codon to
                                  output. If the region of flanking
                                  nucleotides crosses the start or end of the
                                  sequence, no output is given for this codon.
                                  (Any integer value)

Associated qualifiers:

"-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

"-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

getorf reads any nucleic acid sequence USA.

Input files for usage example

'tembl:eclaci' is a sequence entry in the example nucleic acid
database 'tembl'

Database entry: tembl:eclaci

ID ECLACI standard; DNA; PRO; 1113 BP.
XX
AC V00294;
XX
SV V00294.1
XX
DT 09-JUN-1982 (Rel. 01, Created)
DT 10-FEB-1999 (Rel. 58, Last updated, Version 2)
XX
DE E. coli laci gene (codes for the lac repressor).
XX
KW DNA binding protein; repressor.
XX
OS Escherichia coli
OC Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC Escherichia.
XX
RN [1]
RP 1-1113
RX MEDLINE; 78246991.
RA Farabaugh P.J.;
RT "Sequence of the lacI gene";
RL Nature 274:765-769(1978).
XX
DR SWISS-PROT; P03023; LACI_ECOLI.
XX
CC KST ECO.LACI
XX
FH Key Location/Qualifiers
FH
FT source 1..1113
FT /db_xref="taxon:562"
FT /organism="Escherichia coli"
FT CDS 31..1113
FT /db_xref="SWISS-PROT:P03023"
FT /note="reading frame"
FT /transl_table=11
FT /protein_id="CAA23569.1"
FT /translation="MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAEL
FT NYIPNRVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGV
FT EACKAAVHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSH
FT EDGTRLGVEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSA
FT MSGFQQTMQMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSC
FT YIPPSTTIKQDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPR
FT ALADSLMQLARQVSRLESGQ"
XX
SQ Sequence 1113 BP; 249 A; 304 C; 322 G; 238 T; 0 other;
ccggaagaga gtcaattcag ggtggtgaat gtgaaaccag taacgttata cgatgtcgca 60
gagtatgccg gtgtctctta tcagaccgtt tcccgcgtgg tgaaccaggc cagccacgtt 120
tctgcgaaaa cgcgggaaaa agtggaagcg gcgatggcgg agctgaatta cattcccaac 180
cgcgtggcac aacaactggc gggcaaacag tcgttgctga ttggcgttgc cacctccagt 240
ctggccctgc acgcgccgtc gcaaattgtc gcggcgatta aatctcgcgc cgatcaactg 300
ggtgccagcg tggtggtgtc gatggtagaa cgaagcggcg tcgaagcctg taaagcggcg 360
gtgcacaatc ttctcgcgca acgcgtcagt gggctgatca ttaactatcc gctggatgac 420
caggatgcca ttgctgtgga agctgcctgc actaatgttc cggcgttatt tcttgatgtc 480
tctgaccaga cacccatcaa cagtattatt ttctcccatg aagacggtac gcgactgggc 540
gtggagcatc tggtcgcatt gggtcaccag caaatcgcgc tgttagcggg cccattaagt 600
tctgtctcgg cgcgtctgcg tctggctggc tggcataaat atctcactcg caatcaaatt 660
cagccgatag cggaacggga aggcgactgg agtgccatgt ccggttttca acaaaccatg 720
caaatgctga atgagggcat cgttcccact gcgatgctgg ttgccaacga tcagatggcg 780
ctgggcgcaa tgcgcgccat taccgagtcc gggctgcgcg ttggtgcgga tatctcggta 840
gtgggatacg acgataccga agacagctca tgttatatcc cgccgtcaac caccatcaaa 900
caggattttc gcctgctggg gcaaaccagc gtggaccgct tgctgcaact ctctcagggc 960
caggcggtga agggcaatca gctgttgccc gtctcactgg tgaaaagaaa aaccaccctg 1020
gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca 1080
cgacaggttt cccgactgga aagcgggcag tga 1113
//

Output file format

The output is a sequence file containing predicted open reading frames
longer than the minimum size, which defaults to 30 bases (i.e. 10
amino acids).

Output files for usage example

File: eclaci.orf

>ECLACI_1 [735 - 1112] E. coli laci gene (codes for the lac repressor).
GHRSHCDAGCQRSDGAGRNARHYRVRAARWCGYLGSGIRRYRRQLMLYPAVNHHQTGFSP
AGANQRGPLAATLSGPGGEGQSAVARLTGEKKNHPGAQYANRLSPRVGRFINAAGTTGFP
TGKRAV
>ECLACI_2 [1 - 1110] E. coli laci gene (codes for the lac repressor).
PEESQFRVVNVKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN
RVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGVEACKAA
VHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSHEDGTRLG
VEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSAMSGFQQTM
QMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSCYIPPSTTIK
QDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPRALADSLMQLA
RQVSRLESGQ*
>ECLACI_3 [465 - 49] (REVERSE SENSE) E. coli laci gene (codes for the lac repressor).
RRNISAGSFHSNGILVIQRIVNDQPTDALREKIVHRRFTGFDAASFYHRHHHAGTQLIGA
RFNRRDNLRRRVQGQTGGGNANQQRLFARQLLCHAVGNVIQLRHRRFHFFPRFRRNVAGL
VHHAGNGLIRDTGILCDIV

The name of each ORF sequence is constructed from the name of the
input sequence with an underscore character ('_') and a unique ordinal
number appended. The description of the output ORF
sequence is constructed from the description of the input sequence
with the start and end positions of the ORF prepended.
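
As an illustration of this naming scheme, the following Python sketch
splits a FASTA header line from the example output back into its parts.
The regular expression is inferred from the sample output shown above
and is an assumption, not a documented format guarantee; parse_header is
a hypothetical helper. Any other bracketed notes simply remain part of
the description.

```python
import re

# Pattern inferred from the sample output: ">NAME_N [start - end] description"
HEADER = re.compile(
    r"^>(?P<name>\S+)_(?P<ordinal>\d+) \[(?P<start>\d+) - (?P<end>\d+)\]"
    r"(?P<reverse> \(REVERSE SENSE\))?\s*(?P<desc>.*)$"
)

def parse_header(line):
    """Split a getorf-style FASTA header into its components."""
    match = HEADER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a getorf-style header: %r" % line)
    return {
        "name": match.group("name"),            # name of the input sequence
        "ordinal": int(match.group("ordinal")), # unique number of the ORF
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "reverse": match.group("reverse") is not None,
        "description": match.group("desc"),
    }

info = parse_header(
    ">ECLACI_1 [735 - 1112] E. coli laci gene (codes for the lac repressor)."
)
print(info["name"], info["ordinal"], info["start"], info["end"])
```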

The unique number appended to the name is used only to create new
unique sequence names; it carries no further information about the
order, position or sense strand of the ORFs.

If the ORF has been found in the reverse sense, then the start
position will be greater than the end position. The numbering uses the
forward-sense positions, but read in the reverse sense. For example,
>ECLACI_3 [465 - 49] in the output above is a reverse-sense ORF
running from position 465 to 49. The description will also contain
'(REVERSE SENSE)'.
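
The following Python sketch shows one way to recover the nucleotides of
an ORF from the input sequence using these positions. It assumes 1-based
inclusive positions and plain A/C/G/T bases; orf_nucleotides is a
hypothetical helper, not part of EMBOSS.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def orf_nucleotides(seq, start, end):
    """Return the ORF's nucleotides given the positions from the header.

    start <= end: forward-sense ORF, read left to right.
    start >  end: reverse-sense ORF, read from 'start' down to 'end'
                  on the complementary strand.
    """
    if start <= end:
        return seq[start - 1:end]
    # Take the forward-sense slice, then reverse-complement it.
    return seq[end - 1:start].translate(COMPLEMENT)[::-1]

toy = "ATGCCCTTTAAA"
print(orf_nucleotides(toy, 1, 6))   # forward sense
print(orf_nucleotides(toy, 6, 1))   # same positions, reverse sense
```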

If the sequence has been specified as a circular genome (using the
command-line switch '-circular'), then ORFs can potentially continue
past the 'end' of the input sequence (the breakpoint of the circular
genome) and into the 'start' of the sequence again. This is dealt with
by appending the sequence to itself three times and reporting long
ORFs that are found in this extended sequence. Any ORF that is longer
than three times the sequence length (i.e. one that continues without
hitting a STOP at any point in the genome) will be reported as being a
maximum of three times the length of the input sequence. Note that the
end position of an ORF in a circular genome can appear to be greater
than the length of the input sequence if the ORF crosses the breakpoint. If the ORF
crosses the breakpoint, then the text '(ORF crosses the breakpoint)'
will be added to the description of the output sequence.
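
The effect of extending the sequence can be seen in a rough Python
sketch. It is not getorf's code: it only measures the longest STOP-free
run of codons in one frame, uses the Standard code's STOP codons, and
models the extension as three concatenated copies of the sequence, which
is one reading of the description above. Both function names are
hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}   # STOP codons of the Standard code

def longest_open_run(seq, frame=0):
    """Length in nucleotides of the longest STOP-free codon run in one frame."""
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3].upper() in STOPS:
            run = 0
        else:
            run += 3
            best = max(best, run)
    return best

def longest_open_run_circular(seq, frame=0):
    """Same measure on three concatenated copies of the sequence.

    A run that never meets a STOP is bounded by the extended length,
    i.e. three times the length of the input sequence.
    """
    return longest_open_run(seq * 3, frame)

toy = "CCCTAAGGG"
print(longest_open_run(toy))            # linear: the run ends at the sequence end
print(longest_open_run_circular(toy))   # circular: 'GGG' continues into 'CCC'
```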

Data files

The START and STOP codons used by getorf are defined in the Genetic
Code data files. By default, Genetic Code file EGC.0 is used.

The default file EGC.0 is the 'Standard Code' with the rarely used
alternate START codons omitted; it has only the normal 'AUG' START
codon. The 'Standard Code' with the rarely used alternate START codons
included is Genetic Code file EGC.1.

It is expected that users will sometimes wish to customise a Genetic
Code file. To do this, use the program embossdata.

EMBOSS data files are distributed with the application and stored in
the standard EMBOSS data directory, which is defined by the EMBOSS
environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your
current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories.
Project specific files can be put in the current directory, or for
tidier directory listings in a subdirectory called ".embossdata".
Files for all EMBOSS runs can be put in the user's home directory, or
again in a subdirectory called ".embossdata".

The directories are searched in the following order:
* . (your current directory)
* .embossdata (under your current directory)
* ~/ (your home directory)
* ~/.embossdata
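
The search order can be mimicked in a few lines of Python, for example
to check which copy of a data file would be picked up first. This is a
sketch under the assumption that the first existing file wins; it does
not model the fall-back to the standard EMBOSS data directory, and
find_data_file is a hypothetical helper.

```python
from pathlib import Path

def find_data_file(name, cwd=None, home=None):
    """Return the first matching user data file, or None if there is none.

    Directories are tried in the documented order:
    ., ./.embossdata, ~/, ~/.embossdata
    """
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    for directory in (cwd, cwd / ".embossdata", home, home / ".embossdata"):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None

print(find_data_file("EGC.0"))
```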

The Genetic Code data files are based on the NCBI genetic code tables.
Their names and descriptions are:

EGC.0
Standard (Differs from EGC.1 in that it only has initiation site
'AUG')

EGC.1
Standard

EGC.2
Vertebrate Mitochondrial

EGC.3
Yeast Mitochondrial

EGC.4
Mold, Protozoan, Coelenterate Mitochondrial and
Mycoplasma/Spiroplasma

EGC.5
Invertebrate Mitochondrial

EGC.6
Ciliate Macronuclear and Dasycladacean

EGC.9
Echinoderm Mitochondrial

EGC.10
Euplotid Nuclear

EGC.11
Bacterial

EGC.12
Alternative Yeast Nuclear

EGC.13
Ascidian Mitochondrial

EGC.14
Flatworm Mitochondrial

EGC.15
Blepharisma Macronuclear

EGC.16
Chlorophycean Mitochondrial

EGC.21
Trematode Mitochondrial

EGC.22
Scenedesmus obliquus

EGC.23
Thraustochytrium Mitochondrial

The format of these files is very simple.

It consists of several lines of optional comments, each starting with
a '#' character.

These are followed by the line 'Genetic Code [n]', where 'n' is the
number of the genetic code file.

This is followed by the description of the code and then by four lines
giving the IUPAC one-letter code of the translated amino acid, the
start codons (indicated by an 'M') and the three bases of the codon,
lined up one on top of the other.

For example:

------------------------------------------------------------------------------
# Genetic Code Table
#
# Obtained from: http://www.ncbi.nlm.nih.gov/collab/FT/genetic_codes.html
# and: http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
#
# Differs from Genetic Code [1] only in that the initiation sites have been
# changed to only 'AUG'

Genetic Code [0]
Standard

AAs = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
------------------------------------------------------------------------------
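
Because the columns line up, a file in this format can be read with a
few lines of Python. The sketch below is based only on the layout shown
in the example above (lines of the form 'key = value'); it is not the
parser used by EMBOSS, and parse_genetic_code is a hypothetical helper.
The table is written with DNA bases, so 'AUG' appears as 'ATG'.

```python
def parse_genetic_code(text):
    """Parse a Genetic Code file into (codon table, START codons, STOP codons)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()

    table, start_codons = {}, set()
    columns = zip(fields["AAs"], fields["Starts"],
                  fields["Base1"], fields["Base2"], fields["Base3"])
    for amino_acid, start_flag, b1, b2, b3 in columns:
        codon = b1 + b2 + b3              # read one column top to bottom
        table[codon] = amino_acid
        if start_flag == "M":
            start_codons.add(codon)
    stop_codons = {codon for codon, aa in table.items() if aa == "*"}
    return table, start_codons, stop_codons

example = """\
# Genetic Code Table
Genetic Code [0]
Standard

AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""
table, starts, stops = parse_genetic_code(example)
print(sorted(starts), sorted(stops))
```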

Notes

If you have selected one of the options to report the regions around a
START or STOP codon, then note that any such region that crosses the
beginning or end of the sequence will not be reported.
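
This rule can be sketched in Python as follows. The sketch assumes that
the reported region consists of the codon plus the flanking nucleotides
on each side, which the text above does not state explicitly;
flanking_region is a hypothetical helper.

```python
def flanking_region(seq, codon_start, flanking=100):
    """Return the codon with 'flanking' nucleotides on each side, or None.

    codon_start is the 1-based position of the first base of the codon.
    None is returned when the region would cross either end of the sequence.
    """
    begin = codon_start - 1 - flanking        # 0-based, inclusive
    end = codon_start - 1 + 3 + flanking      # 0-based, exclusive
    if begin < 0 or end > len(seq):
        return None
    return seq[begin:end]

toy = "A" * 10 + "ATG" + "C" * 10
print(flanking_region(toy, 11, flanking=5))   # fits inside the sequence
print(flanking_region(toy, 11, flanking=11))  # crosses both ends, so None
```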

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

'-sbegin' and '-send' do not work with this program.

See also

Program name Description
marscan Finds MAR/SAR sites in nucleic sequences
plotorf Plot potential open reading frames
showorf Pretty output of DNA translations
sixpack Display a DNA sequence with 6-frame translation and ORFs
syco Synonymous codon usage Gribskov statistic plot
tcode Fickett TESTCODE statistic to identify protein-coding DNA
wobble Wobble base plot
checktrans Reports STOP codons and ORF statistics of a protein sequence

Author(s)

Gary Williams (gwilliam © rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

2000 - written - Gary Williams

November 2002 - added indication of reverse sense ORFs

November 2002 - added indication of ORFs that cross the breakpoint at
position 1 in circular genomes.

Target users

This program is intended to be used by everyone and everything, from
naive users to embedded scripts.

Comments

None