This document was taken from a search engine cache. Address of the original document: http://mouse.belozersky.msu.ru/tools/npge.html
Last modified: Tue Mar 29 20:33:31 2016. Indexed: Sat Apr 9 22:24:10 2016.
Download prebuilt static executables for Windows and Linux.
Warning. The installer and the program do not work from a directory whose name contains non-ASCII letters.
BLAST and other dependencies (except Qt 4 in the Linux version) are included. Warning. To use the Linux version, you should install the Qt 4 library. In Debian or Ubuntu, Qt 4 can be installed with the command sudo apt-get install libqtgui4
In the following instructions, replace x.y.z with the version of NPG-explorer you use.
For Windows 32 bit, download and run the file npge_x.y.z_win32.exe as administrator.
For Windows 64 bit, download and run the file npge_x.y.z_win64.exe as administrator.
For Linux 64 bit, download and unpack the file npge_x.y.z_lin64.tar.gz (using the command tar -xf npge_x.y.z_lin64.tar.gz).
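The naming scheme of the packages can be written down in a few lines. The following Python sketch builds the expected file name from a version string. The function name package_filename and the platform keys are made up for this illustration and are not part of NPG-explorer.

```python
# Suffixes taken from the download instructions above.
PACKAGE_SUFFIXES = {
    "win32": "win32.exe",
    "win64": "win64.exe",
    "lin64": "lin64.tar.gz",
}


def package_filename(version, platform):
    """Return the package file name for a version such as "0.5.4"."""
    try:
        suffix = PACKAGE_SUFFIXES[platform]
    except KeyError:
        raise ValueError("unknown platform: %r" % platform)
    return "npge_%s_%s" % (version, suffix)
```

For example, package_filename("0.5.4", "lin64") gives the name of the Linux tarball to unpack with tar -xf.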
The Windows version adds itself to PATH, so you can use the commands npge and qnpge in the command line. In Linux, you need to add the unpacked directory npge-x.y.z to PATH. If you use bash, open ~/.bashrc in your favorite text editor and add the following line to the end:
export PATH=$PATH:/path/to/npge-x.y.z
Do not forget to replace /path/to/npge-x.y.z with the actual path.
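To verify that PATH is set up, you can check whether both commands are found. Below is a minimal Python sketch using the standard function shutil.which. The helper names find_tools and missing_tools are hypothetical and exist only in this example.

```python
import shutil


def find_tools(names=("npge", "qnpge"), path=None):
    """Map each tool name to its full path, or None if it is not found.

    With path=None the PATH environment variable is searched.
    """
    return {name: shutil.which(name, path=path) for name in names}


def missing_tools(names=("npge", "qnpge"), path=None):
    """Return the sorted names of the tools that were not found."""
    found = find_tools(names, path)
    return sorted(name for name, location in found.items() if location is None)
```

If missing_tools() returns an empty list in a new terminal session, both commands are reachable through PATH.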
All data and output files have fixed names. This is why it is recommended to make a separate working directory for each task.
Create a file genomes.tsv of the form:
all:embl:CP003176 BRUAO chr1 c Brucella abortus A13334 chr 1
all:embl:CP003177 BRUAO chr2 c Brucella abortus A13334 chr 2
fasta:refseqn:NC_016778.1 BRUCA chr1 c Brucella canis HSK A52141 chr 1
features:embl:CP003174 BRUCA chr1 c Brucella canis HSK A52141 chr 1
fasta:file:BRUCA.fasta BRUCA chr1 c Brucella canis HSK A52141 chr 1
features:file:BRUCA.fasta BRUCA chr1 c Brucella canis HSK A52141 chr 1
fasta:file:base.fasta[CP002459] BRUMM chr1 c Brucella melitensis
features:file:base.embl[CP002459] BRUMM chr1 c Brucella melitensis
Fields are:
* [...] file.fasta[name_of_sequence] as an identifier to copy a specific sequence from a file;
* [...] AAA_NNN_NNN);
The program downloads input data using dbfetch.
Annotations in the following formats are parsed by the program: GenBank, EMBL.
The string CP003175 BRUCA chr2 c corresponds to the EMBL entry CP003175, which is represented by the short genome name BRUCA and the chromosome name chr2 and is circular.
You can use contigs instead of chromosomes if a genome is not fully assembled. Set circularity to 'l' in this case.
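The row layout can be illustrated with a small parser. The Python sketch below is not the parser used by npge. It assumes, judging from the example rows only, that the first field has the form type:database:identifier, that the next three fields are the genome name, the chromosome name and the circularity ('c' or 'l'), and that the rest of the line is a free-text description. The names GenomeRow, parse_genomes_line and parse_genomes are hypothetical.

```python
from collections import namedtuple

GenomeRow = namedtuple(
    "GenomeRow",
    "record_type database identifier genome chromosome circular description",
)


def parse_genomes_line(line):
    """Parse one non-empty line of genomes.tsv into a GenomeRow."""
    # The first four fields contain no whitespace; the rest is the description.
    parts = line.strip().split(None, 4)
    if len(parts) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    source, genome, chromosome, circularity = parts[:4]
    description = parts[4] if len(parts) > 4 else ""
    # Assumed layout of the first field: type:database:identifier.
    record_type, database, identifier = source.split(":", 2)
    if circularity not in ("c", "l"):
        raise ValueError("circularity must be 'c' or 'l': %r" % circularity)
    return GenomeRow(record_type, database, identifier, genome,
                     chromosome, circularity == "c", description)


def parse_genomes(text):
    """Parse the whole table, skipping blank lines."""
    return [parse_genomes_line(line)
            for line in text.splitlines() if line.strip()]
```

Splitting on any whitespace lets the sketch accept both tab-separated and space-separated rows.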
Create an empty directory and, in it, a file genomes.tsv with the table of genomes to be used to build the pangenome. npge will create files and sub-folders in the current directory. You can change the location of output files using command line options. To see all options, add -h to a command. To set the path to the table file (instead of genomes.tsv), pass the option --table to the commands GetFasta, GetGenes and Rename.
Examples of genomes.tsv files can be found in the directory examples of the source code.
Here we provide a table for 3 Brucella genomes:
all:embl:CP002459 BRUMM chr1 c Brucella melitensis M28 chromosome 1
all:embl:CP002460 BRUMM chr2 c Brucella melitensis M28 chromosome 2
all:embl:CP003176 BRUAO chr1 c Brucella abortus A13334 chromosome 1
all:embl:CP003177 BRUAO chr2 c Brucella abortus A13334 chromosome 2
all:embl:CP002078 BRUPB chr1 c Brucella pinnipedialis B2/94 chromosome 1
all:embl:CP002079 BRUPB chr2 c Brucella pinnipedialis B2/94 chromosome 2
... and for 17 Brucella genomes as well:
all:embl:CP003176 BRUAO chr1 c Brucella abortus A13334 chromosome 1
all:embl:CP003177 BRUAO chr2 c Brucella abortus A13334 chromosome 2
all:embl:CP003174 BRUCA chr1 c Brucella canis HSK A52141 chromosome 1
all:embl:CP003175 BRUCA chr2 c Brucella canis HSK A52141 chromosome 2
all:embl:CP002459 BRUMM chr1 c Brucella melitensis M28 chromosome 1
all:embl:CP002460 BRUMM chr2 c Brucella melitensis M28 chromosome 2
all:embl:CP001851 BRUM5 chr1 c Brucella melitensis M5-90 chromosome I
all:embl:CP001852 BRUM5 chr2 c Brucella melitensis M5-90 chromosome II
all:embl:CP002931 BRUML chr1 c Brucella melitensis NI chromosome I
all:embl:CP002932 BRUML chr2 c Brucella melitensis NI chromosome II
all:embl:CP002078 BRUPB chr1 c Brucella pinnipedialis B2/94 chromosome 1
all:embl:CP002079 BRUPB chr2 c Brucella pinnipedialis B2/94 chromosome 2
all:embl:CP003128 BRUSS chr1 c Brucella suis VBI22 chromosome I
all:embl:CP003129 BRUSS chr2 c Brucella suis VBI22 chromosome II
all:embl:AE017223 BRUAB chr1 c Brucella abortus biovar 1 str. 9-941 chromosome I
all:embl:AE017224 BRUAB chr2 c Brucella abortus biovar 1 str. 9-941 chromosome II
all:embl:CP000887 BRUA1 chr1 c Brucella abortus S19 chromosome 1
all:embl:CP000888 BRUA1 chr2 c Brucella abortus S19 chromosome 2
all:embl:AM040264 BRUA2 chr1 c Brucella melitensis biovar Abortus 2308 chromosome I
all:embl:AM040265 BRUA2 chr2 c Brucella melitensis biovar Abortus 2308 chromosome II
all:embl:CP000872 BRUC2 chr1 c Brucella canis ATCC 23365 chromosome I
all:embl:CP000873 BRUC2 chr2 c Brucella canis ATCC 23365 chromosome II
all:embl:CP001488 BRUMB chr1 c Brucella melitensis ATCC 23457 chromosome I
all:embl:CP001489 BRUMB chr2 c Brucella melitensis ATCC 23457 chromosome II
all:embl:AE008917 BRUME chr1 c Brucella melitensis bv. 1 str. 16M chromosome I
all:embl:AE008918 BRUME chr2 c Brucella melitensis 16M chromosome II
all:embl:CP001578 BRUMC chr1 c Brucella microti CCM 4915 chromosome 1
all:embl:CP001579 BRUMC chr2 c Brucella microti CCM 4915 chromosome 2
all:embl:CP000708 BRUO2 chr1 c Brucella ovis ATCC 25840 chromosome I
all:embl:CP000709 BRUO2 chr2 c Brucella ovis ATCC 25840 chromosome II
all:embl:AE014291 BRUSU chr1 c Brucella suis 1330 chromosome I
all:embl:AE014292 BRUSU chr2 c Brucella suis 1330 chromosome II
all:embl:CP000911 BRUSI chr1 c Brucella suis ATCC 23445 chromosome I
all:embl:CP000912 BRUSI chr2 c Brucella suis ATCC 23445 chromosome II
The latter one is used below.
Further steps are performed in the command line.
Linux users are expected to be familiar with the command line.
How to launch the command line in Windows:
* [...] cmd and press Enter;
* [...] cd path/to/directory + Enter to navigate to the directory with the file genomes.tsv;
* [...] : + Enter, for example D: + Enter to switch to disk D;
* [...] $, for example, npge Prepare + Enter.
Run the following command:
$ npge Prepare
The following files are created by this command:
* genomes-renamed.fasta is a FASTA file with the genomes on which a nucleotide pangenome is to be built;
* genes/features.bs is a blockset of genes. One gene is represented as one block.
Files genomes-raw.fasta and features.embl contain unprocessed input data. They are not used by the following steps. You can safely remove them.
Run the following command:
$ npge Examine
The following files are created by this command in the directory examine:
* genomes-info.tsv: table of genome lengths; make sure genome lengths are about the same, otherwise do not expect NPG-explorer to build a good pangenome;
* draft.bs: draft pangenome;
* identity_recommended.txt: text file with the recommended value of the parameter MIN_IDENTITY (see below how to change this parameter in the configuration file).
This step is needed to gather some information about the input genomes. This information can be used at the next step (configuration).
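The advice to compare genome lengths can also be followed with a short script. The layout of genomes-info.tsv is not described here, so this Python sketch computes the lengths directly from a FASTA file such as genomes-renamed.fasta. It assumes that each sequence name starts with the genome name followed by the separator "&" (for example BRUAO&chr1&c); this naming is an assumption, so the separator is a parameter. The threshold max_ratio=1.5 is an arbitrary value chosen for the example, not a recommendation of the program.

```python
from collections import defaultdict


def genome_lengths(fasta_lines, separator="&"):
    """Sum sequence lengths per genome from an iterable of FASTA lines."""
    lengths = defaultdict(int)
    genome = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # Header: take the first word, then the part before the separator.
            words = line[1:].split()
            name = words[0] if words else ""
            genome = name.split(separator)[0]
        elif line and genome is not None:
            lengths[genome] += len(line)
    return dict(lengths)


def lengths_are_similar(lengths, max_ratio=1.5):
    """True if the longest genome is at most max_ratio times the shortest."""
    values = list(lengths.values())
    return bool(values) and max(values) <= max_ratio * min(values)
```

An open file object can be passed as fasta_lines, for example genome_lengths(open("genomes-renamed.fasta")).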
To change values of global options, make a file npge.conf using the command npge -g npge.conf, then edit this file. The generated file npge.conf contains the default values compiled into the program. Sometimes they have to be changed to improve results.
The configuration file looks like this:
MIN_IDENTITY = Decimal('0.9')
MIN_LENGTH = 100
Decimal values are specified using the syntax above. The precision of decimal values is 4 digits after the point.
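If you want to read such values from a script, the two kinds of lines shown above can be parsed as follows. This Python sketch only understands lines of the form NAME = Decimal('...') and NAME = integer; it is not the parser used by npge, and a real npge.conf may contain syntax that this sketch keeps as a raw string or ignores. Rounding to 4 digits uses the default rounding of Python's decimal module, which is an assumption about how the limit is applied.

```python
import re
from decimal import Decimal

_ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")
_DECIMAL = re.compile(r"^Decimal\(\s*['\"]([^'\"]+)['\"]\s*\)$")
_PLACES = Decimal("0.0001")  # 4 digits after the point, as stated above


def parse_value(text):
    """Convert the right-hand side of an assignment to a Python value."""
    match = _DECIMAL.match(text)
    if match:
        return Decimal(match.group(1)).quantize(_PLACES)
    if re.match(r"^-?\d+$", text):
        return int(text)
    return text  # anything else is kept as a raw string


def parse_conf(text):
    """Parse NAME = value lines; all other lines are ignored."""
    options = {}
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line)
        if match:
            options[match.group(1)] = parse_value(match.group(2))
    return options
```

Using Decimal instead of float keeps values such as 0.9 exact, which matches the fixed precision described above.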
The program applies the following configuration files (if they exist):
* npge.conf in the current directory;
* npge.conf in the program's directory (the directory where the executable lives);
* /etc/npge.conf
* ~/.npge.conf
* npge.conf in the current directory (again);
$ npge MakePangenome
This command creates the file pangenome/pangenome.bs. The file is in BlockSet format.
$ npge CheckPangenome
This command makes sure that the nucleotide pangenome in the file pangenome/pangenome.bs satisfies the pangenome criteria.
The command reports whether the pangenome is OK and may print some comments about the pangenome.
This step is done by post-processing as well; the result is saved to the file check/isgood.
$ npge PostProcessing
This command produces many files; some of them are located in sub-folders.
Files *.bs contain blocksets, *.bi contain tables of blocks' properties, *.ba contain blockset alignments (cells are fragments), *.blocks contain blockset alignments (cells are blocks).
Columns of files *.bi:
[...]
Files *.bi also contain numbers of occurrences of a block in a genome. Each genome adds one column to the table. To add similar columns with numbers of occurrences of a block in a sequence, add the option --info-count-seqs=1 to the command npge PostProcessing. The file pangenome/pangenome-small.bi contains a short version of pangenome/pangenome.bi that lacks the columns with occurrences in genomes.
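The commands described so far (Prepare, Examine, MakePangenome, CheckPangenome, PostProcessing) can be chained from a script. The Python sketch below is one possible way to do it. The names PIPELINE_STEPS, pipeline_commands and run_pipeline are hypothetical, and the sketch assumes that npge exits with a non-zero status when a step fails, which is not stated in this document.

```python
import subprocess

# Steps in the order they appear in this document.
PIPELINE_STEPS = (
    "Prepare",
    "Examine",
    "MakePangenome",
    "CheckPangenome",
    "PostProcessing",
)


def pipeline_commands(steps=PIPELINE_STEPS, executable="npge"):
    """Return the list of argument vectors, one per step."""
    return [[executable, step] for step in steps]


def run_pipeline(workdir, steps=PIPELINE_STEPS, run=subprocess.check_call):
    """Run every step inside workdir; stop at the first failing step."""
    for command in pipeline_commands(steps):
        # check_call raises CalledProcessError on a non-zero exit status.
        run(command, cwd=workdir)
```

Because the configuration file is normally edited after Examine, it may be more practical to call run_pipeline twice: first with steps=("Prepare", "Examine"), then, after editing npge.conf, with the remaining steps.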
Files produced by npge MakePangenome and npge PostProcessing are as follows:
* pangenome/pangenome.bs: pangenome (main output of the program);
* pangenome/fragments.tsv: table with coordinates of all fragments in the form: block, genome, chromosome, ac, start, stop, ori;
* pangenome/pangenome.ba: blockset alignment. Table file representing an alignment in which "letters" are fragments of the pangenome. This file is used by the GUI viewer qnpge;
* pangenome/pangenome.blocks: blockset alignment. Table file representing an alignment in which "letters" are block names. This file can be viewed in Excel. It is not used by qnpge.
* global-blocks/blocks.bs: global blocks (blockset). Global blocks are joined collinear s-blocks;
* global-blocks/blocks.ba: global blocks (blockset alignment of fragments). This file is used by the GUI viewer qnpge;
* global-blocks/blocks.blocks: global blocks (blockset alignment of blocks).
* global-blocks/blocks.gbi: properties of global blocks as a table;
* global-blocks/global-fragments.tsv: properties of fragments of global blocks as a table;
* pangenome/pangenome.hash: hash of the pangenome;
* pangenome/pangenome.info: summary of pangenome stats (average identity of blocks, etc.);
* extra-blocks/split.bs and extra-blocks/split.bi: blocks split by diagnostic positions. This file is used by the GUI viewer qnpge;
* extra-blocks/low.bs and extra-blocks/low.bi: subblocks of low identity sliced from pangenome blocks. See explanation for more information about low identity subblocks. This file is used by the GUI viewer qnpge;
* directory check: files related to pangenome checking;
  * isgood: result of the check that the pangenome satisfies the pangenome criteria;
  * hits.blast: output of BLAST all-against-all run on consensuses;
  * filtered-hits.blast: filtered output of BLAST all-against-all run on consensuses (self-hits, reverse hits, short hits and hits of low identity were removed);
  * all-blast-hits.bs and all-blast-hits.bi: all BLAST hits as blocksets;
  * good-blast-hits.bs and good-blast-hits.bi: BLAST hits satisfying pangenome criteria, which surpass overlapping blocks from the pangenome. If the pangenome satisfies the criteria, there are no such blocks;
  * non-internal-hits.bs and non-internal-hits.bi: BLAST hits satisfying pangenome criteria, which don't surpass overlapping blocks from the pangenome;
  * joined.bs and joined.bi: joined subsequent blocks satisfying the criteria;
* mutations: mutations related files:
  * mut.tsv: table of all mutations (columns are block, fragment, start of mutation, stop of mutation, letter(s) in consensus, letter (or gap) in the fragment). See the file src/tool/parse-mutations-file.py for an example of how to parse the file mut.tsv;
  * mutseq.fasta: FASTA file with sequences composed of columns with mutations (+ 1 column to the right and to the left) of stable blocks;
  * mutseq-with-blocks.bs: same as previous + stable blocks mapped onto these sequences;
  * consensuses.fasta: consensus sequences of blocks;
* genes: analysis of genes:
  * genes/partition-ungrouped.tsv: map of genes onto blocks;
  * genes/partition-grouped.tsv: map of genes onto blocks, grouped by gene fragment;
  * genes/good.bs: groups of genes matching each other exactly according to s-blocks;
  * genes/good-upstreams.bs: upstream sequences of good genes (genes/good.bs);
* trees: tree related files:
  * nj-global-tree.tre: global tree constructed using Neighbour-Joining applied to genome distances;
How to view .tre files using FigTree: open a file with FigTree, set branch label to "Diagnostic positions" in the pop-up window, go to the "Branch Labels" section of the left menu, enable the section's checkbox. Abstract distances between nodes are shown under branches. To show the number of diagnostic positions between corresponding clades, select "Diagnostic positions" in the drop-down list.
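Two of the output tables listed above have their columns spelled out: fragments.tsv (block, genome, chromosome, ac, start, stop, ori) and mut.tsv (block, fragment, start, stop, letters in consensus, letter or gap in the fragment). The Python sketch below reads either table. It is not the script src/tool/parse-mutations-file.py from the source code. It assumes that the files are tab-separated, that the columns come in the listed order and that start and stop are integers; rows that do not fit (for instance a header line, if there is one) are skipped. The column keys "consensus" and "change" are names chosen for this example.

```python
import csv
from collections import Counter

FRAGMENT_COLUMNS = ("block", "genome", "chromosome", "ac", "start", "stop", "ori")
MUTATION_COLUMNS = ("block", "fragment", "start", "stop", "consensus", "change")


def read_table(lines, columns):
    """Yield one dict per data row of a tab-separated table."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != len(columns):
            continue  # blank or malformed line
        record = dict(zip(columns, row))
        try:
            record["start"] = int(record["start"])
            record["stop"] = int(record["stop"])
        except ValueError:
            continue  # non-numeric coordinates, e.g. a header line
        yield record


def rows_per_block(records):
    """Count how many rows (fragments or mutations) each block has."""
    return Counter(record["block"] for record in records)
```

For example, rows_per_block(read_table(open("pangenome/fragments.tsv"), FRAGMENT_COLUMNS)) counts the fragments of every block.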
$ qnpge
This command uses pangenome/pangenome.bs and some of the files created by PostProcessing.
The program window is split into 3 parts:
* [...] pangenome/pangenome.ba);
Columns of blocks table:
[...]
You can filter blocks by block name, gene name or their sequence using the input field located above the blocks table. By default, pattern matching is wildcard. ^ before the pattern and $ after the pattern correspond to name start/end (as in regular expressions). To hide blocks of one fragment, click the checkbox "only blocks of >= 2 fragments". The blocks table can be sorted by any column.
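The filtering rules can be illustrated outside the GUI. The Python sketch below is not the code of qnpge. It assumes that the wildcard characters are * (any run of characters) and ? (any single character), that a pattern without ^ or $ may match anywhere inside the name, and that matching is case-sensitive. The function names are hypothetical.

```python
import re


def compile_filter(pattern):
    """Translate a wildcard pattern with optional ^ and $ anchors to a regex."""
    start = pattern.startswith("^")
    if start:
        pattern = pattern[1:]
    end = pattern.endswith("$")
    if end:
        pattern = pattern[:-1]
    # Wildcards become regex equivalents; everything else is matched literally.
    body = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.compile(("^" if start else "") + body + ("$" if end else ""))


def filter_names(names, pattern):
    """Return the names matched by the pattern, in their original order."""
    regex = compile_filter(pattern)
    return [name for name in names if regex.search(name)]
```

With this reading, the pattern 17 matches any name containing "17", ^s matches names starting with "s", and 17$ matches names ending with "17".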
The blockset alignment table shows the alignment of fragments on genomes. A chromosome can be selected using the drop-down list located above the blockset table. Each sequence is represented as a row of the blockset table. The name of a sequence and its orientation against the alignment are written in the first column. Fragments of a sequence are represented by cells of the blockset table. Fragments of one block are coloured similarly. Orientation of a fragment against the alignment is indicated by '<' and '>'.
When you navigate in the blocks table and the blockset alignment, the alignment of the corresponding block is shown in the bottom part of the program. The fragment name is shown to the left of the alignment itself. Background colours in the alignment correspond to nucleotide types. The name of the selected gene is shown in the read-only input field located above the alignment. You can disable the representation of genes completely by unchecking the checkbox "show genes". Genes are coloured with white foreground colour. Genes on the reverse strand (relative to the fragment orientation) are marked with underscore. Overlapping genes are coloured purple. Start codons are coloured black, stop codons are coloured gray. The consensus of the block is shown above the alignment. Identical columns without gaps are coloured black, identical columns with gaps are coloured gray, non-identical columns are white. Column numbers are shown above the consensus.
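The colouring rule for the consensus row can be written as a small function. This Python sketch is an illustration, not the code of qnpge. It assumes that a gap is written as "-" and that "identical column with gaps" means a column whose non-gap letters are all the same while at least one row has a gap. Letters are compared case-insensitively, which is also an assumption.

```python
GAP = "-"


def classify_column(column):
    """Return "black", "gray" or "white" for one alignment column."""
    letters = {ch.upper() for ch in column if ch != GAP}
    if len(letters) != 1:
        return "white"  # non-identical (or gap-only) column
    return "gray" if GAP in column else "black"


def consensus_colours(rows):
    """Classify every column of an alignment given as equal-length strings."""
    return [classify_column(column) for column in zip(*rows)]
```

For the three rows "ACG-T", "ACGAT" and "ATG-T", the second column is white, the fourth is gray and the remaining columns are black.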
Column numbers of low similarity regions are coloured red. Low similarity regions represent parts of blocks with unreliable alignment. There are 3 possible reasons for the occurrence of low similarity blocks:
* these sequences are not related,
* recombination,
* deletion and insertion in the same position.
You can use the arrow keys to navigate through the alignment. The corresponding fragment is selected in the blockset alignment. Use the keys "Home" and "End" to go to the first and last columns of the alignment respectively. If you "go away" from the alignment, the program switches to the corresponding block. You go to the next gene boundary if you press Ctrl + Arrow Right or Ctrl + Arrow Left. You go to the next low similarity region if you press Shift + Arrow Right or Shift + Arrow Left.
To change the order of sequences in the blockset alignment and the block alignment, select some rows (you can use Ctrl to select multiple rows) and press Ctrl + Arrow Up or Ctrl + Arrow Down.
* [...] MIN_LENGTH;
* [...] FRAME_LENGTH columns of any but minor block is greater than or equal to MIN_IDENTITY;
* [...] MIN_END first and last columns of any but minor block is greater than or equal to MIN_IDENTITY;
Blocks types:
[...]
Major blocks = non-minor blocks.
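To give a feeling for a criterion that compares the identity of FRAME_LENGTH columns with MIN_IDENTITY, here is a Python sketch of a sliding-window check. It rests on two assumptions that are not confirmed by this document: that the criterion applies to every window of FRAME_LENGTH consecutive columns, and that identity is the share of columns that are identical and contain no gaps. The formula used by npge may differ, for instance in how columns with gaps are counted. All function names are hypothetical.

```python
def is_identical(column):
    """True if the column has no gaps and all letters are the same."""
    return "-" not in column and len(set(column)) == 1


def window_identities(rows, frame_length):
    """Identity of every window of frame_length consecutive columns."""
    flags = [is_identical(column) for column in zip(*rows)]
    if frame_length <= 0 or frame_length > len(flags):
        return []
    count = sum(flags[:frame_length])
    identities = [count / frame_length]
    for i in range(frame_length, len(flags)):
        # Slide by one column: add the new column, drop the oldest one.
        count += flags[i] - flags[i - frame_length]
        identities.append(count / frame_length)
    return identities


def satisfies_frame_criterion(rows, frame_length, min_identity):
    """True if no window has identity below min_identity."""
    identities = window_identities(rows, frame_length)
    return all(identity >= min_identity for identity in identities)
```

The running count makes the check linear in the alignment length instead of recomputing every window from scratch. An alignment shorter than frame_length has no windows, so the sketch accepts it.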
The main executables are the command line tool src/tool/npge (or src/tool/npge.exe) and the GUI tool src/gui/qnpge (src/gui/qnpge.exe).
To change compiled-in default settings, run ccmake . in the build directory.
To generate a config file, run npge -g npge.conf and change the generated file npge.conf.
Warning. Make sure you use the same Lua which luabind was linked against. Otherwise it compiles but doesn't work: PANIC: unprotected error in call to Lua API (attempt to index a nil value)
Optional:
To build static Linux package in fresh Debian Wheezy, install curl and sudo and run
curl -L https://git.io/vmB9P | sh
Install build requirements (on Debian):
% ./linux/requirements.sh
Build the program as static executables (Qt 4 is not static!):
$ ./linux/build.sh
The program is built in the directory npge-build-linux.
To create the distribution .tar.gz file, go into npge-build-linux and run:
$ ./linux/package.sh
How to build manually:
$ ./src/init_lua-npge.sh
$ mkdir build
$ cd build
$ cmake ..
$ make
Pass the argument -DNPGE_STATIC_LINUX:BOOL=1 to cmake to get static executables (on Debian, Qt 4 is not static!).
Build README.html:
$ pandoc -s -o README.html README.md
Run tests:
$ make test
To build static Windows packages in fresh Debian Wheezy, install curl and sudo and run
curl -L https://git.io/vmB9v | sh
Windows executables are cross-compiled from Linux using the MinGW cross-compiler.
For 64-bit Windows, you need to export the MXE_TARGET variable:
$ export MXE_TARGET=x86_64-w64-mingw32.static
Install build requirements (on Debian):
% ./windows/requirements.sh
Build the program as static executables:
$ ./windows/build.sh
The program is built in the directories npge-build-windows32 and npge-build-windows64 (the first contains the 32-bit version and the second contains the 64-bit version).
To create the ZIP file and the Installation Wizard for Windows, go into the build directory (npge-build-windows32 or npge-build-windows64) and run:
$ ./windows/package.sh
[...] init_lua-npge.sh.
[...] global-fragments.tsv
[...] filtered-hits.blast
[...] npge SampleGenomesTsv generates genomes.tsv.
Version 0.5.4. Compatibility with Debian Squeeze. Starting with this version, Linux tarballs are built on a Debian Squeeze machine to produce a binary compatible with a wide range of Linux distributions.
[...] locus_tag.
[...] locus_tag.
[...] meta_test: ignore directories without script.
[...] npge.
[...] MIN_END.
[...] FRAME_LENGTH.
[...] DEV_NULL from npge.conf
[...] MIN_LENGTH. Use logarithmic gap penalty. Introduce option MIN_END. Do not require length of fragment >= MIN_LENGTH.
[...] MIN_LENGTH.
[...] *.gbi)
[...] *.blocks)
[...] *.md files)
[...] prev and next (use FragmentCollection instead),
[...] Sequence.circular() doesn't throw (linear by default),
[...] Sequence.set_name() throws if name includes space or underscore,
[...] npge.conf located in application dir,
[...] genomes.tsv (no automagic, can read from file),
[...] spreading and max_gaps.
Version 0.1.4. Bugfix:
[...] read_block_set and AddGenes,
Version 0.1.3. Bugfix:
[...]
Version 0.1.2. Bugfix:
[...] thread_pool,
Version 0.1.1. Bugfix:
[...]
Version 0.1.0. Features developed in Summer 2014 were incorporated in ver. 0.1.0. It was published prior to ECCB'14 event.
This work was presented at ECCB'14 conference: abstract (ru) and poster.
Corresponding author: Boris Nagaev, email: bnagaev@gmail.com
Copyright (C) 2012-2016 Boris Nagaev