Source: http://mouse.belozersky.msu.ru/tools/npge.html
Page dates: Tue Mar 29 20:33:31 2016; Sat Apr 9 22:24:10 2016

NPG-explorer

NPG-explorer, Nucleotide PanGenome explorer

Instruction

Download NPG-explorer

Download prebuilt static executables for Windows and Linux.

Warning. The installer and the program do not work from a directory whose name contains non-ASCII letters.

BLAST and other dependencies (except Qt 4 in the Linux version) are included. Warning. To use the Linux version, you must install the Qt 4 library. In Debian or Ubuntu, Qt 4 can be installed with the command sudo apt-get install libqtgui4.

In the following instructions, replace x.y.z with the version of NPG-explorer you use.

For Windows 32 bit, download and run file npge_x.y.z_win32.exe as administrator.

For Windows 64 bit, download and run file npge_x.y.z_win64.exe as administrator.

For Linux 64 bit, download and unpack file npge_x.y.z_lin64.tar.gz (using command tar -xf npge_x.y.z_lin64.tar.gz).

The Windows version adds itself to PATH, so you can use the commands npge and qnpge in the command line. In Linux, you need to add the unpacked directory npge-x.y.z to PATH yourself. If you use bash, open ~/.bashrc in your favorite text editor and add the following line to the end: export PATH=$PATH:/path/to/npge-x.y.z. Do not forget to replace /path/to/npge-x.y.z with the actual path.

Create a directory for the results

All data files and output files have fixed names. This is why it is recommended to create a separate working directory for each task.

Prepare files with genomes

Input file: table of genomes

Create file genomes.tsv of the form:

all:embl:CP003176 BRUAO chr1 c Brucella abortus A13334 chr 1
all:embl:CP003177 BRUAO chr2 c Brucella abortus A13334 chr 2
fasta:refseqn:NC_016778.1 BRUCA chr1 c Brucella canis HSK A52141 chr 1
features:embl:CP003174 BRUCA chr1 c Brucella canis HSK A52141 chr 1
fasta:file:BRUCA.fasta BRUCA chr1 c Brucella canis HSK A52141 chr 1
features:file:BRUCA.fasta BRUCA chr1 c Brucella canis HSK A52141 chr 1
fasta:file:base.fasta[CP002459] BRUMM chr1 c Brucella melitensis
features:file:base.embl[CP002459] BRUMM chr1 c Brucella melitensis

Fields are:

chromosome entry identifier; it is composed of a record type ('fasta', 'features' (annotation) or 'all'), a source (database name or 'file') and an identifier in that database or a file path; annotations and FASTA files may come from different sources; use file.fasta[name_of_sequence] as an identifier to copy a specific sequence from a file;
short name for the genome, chosen by the user; this name is used in output data and must not include spaces or underscores (more precisely, it must not look like a fragment name: AAA_NNN_NNN);
chromosome name (e.g., 'chr1', 'chr2');
chromosome circularity ('c' for circular and 'l' for linear);
arbitrary description (not used by the program).

The program downloads input data using dbfetch.

Annotations in the following formats are parsed by the program: GenBank, EMBL.

The string CP003175 BRUCA chr2 c corresponds to the EMBL entry CP003175, which is represented by the short genome name BRUCA and the chromosome name chr2, and is circular.

You can use contigs instead of chromosomes if the genome is not fully assembled. Set circularity to 'l' in this case.
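The field layout described above can be checked with a short script before running npge. The following is only a sketch, not the program's own parser: the names GenomeRow and parse_genome_row are hypothetical, and it assumes that fields are separated by whitespace and that everything after the fourth field is the description.

```python
import re
from typing import NamedTuple, Optional


class GenomeRow(NamedTuple):
    record_type: str          # 'fasta', 'features' or 'all'
    source: str               # database name or 'file'
    identifier: str           # accession in the database or file path
    sequence: Optional[str]   # name given as file.fasta[name], if any
    genome: str               # short genome name
    chromosome: str           # e.g. 'chr1'
    circular: bool            # True for 'c', False for 'l'
    description: str          # free text, not used by the program


def parse_genome_row(line: str) -> GenomeRow:
    # Assumption: whitespace-separated fields; the description
    # (fifth field) keeps its internal spaces.
    parts = line.split(None, 4)
    if len(parts) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    entry, genome, chromosome, circularity = parts[:4]
    description = parts[4].strip() if len(parts) == 5 else ""

    # Entry identifier has the form type:source:identifier.
    record_type, source, identifier = entry.split(":", 2)
    if record_type not in ("fasta", "features", "all"):
        raise ValueError("unknown record type: %r" % record_type)

    # file.fasta[name_of_sequence] selects one sequence from a file.
    sequence = None
    match = re.fullmatch(r"(.+)\[([^\]]+)\]", identifier)
    if match:
        identifier, sequence = match.groups()

    if "_" in genome:
        raise ValueError("genome name must not contain underscores")
    if circularity not in ("c", "l"):
        raise ValueError("circularity must be 'c' or 'l'")

    return GenomeRow(record_type, source, identifier, sequence, genome,
                     chromosome, circularity == "c", description)
```

For example, parse_genome_row("fasta:file:base.fasta[CP002459] BRUMM chr1 c Brucella melitensis") yields the file path base.fasta and the sequence name CP002459 as separate values.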

Create an empty directory and, inside it, create file genomes.tsv with the table of genomes to be used to build the pangenome. npge will create files and sub-folders in the current directory. You can change the location of output files using command line options. To see all options, add -h to a command. To set the path to the table file (instead of genomes.tsv), pass option --table to commands GetFasta, GetGenes and Rename.

Examples of genomes.tsv

Examples of genomes.tsv files can be found in directory examples of the source code.

Here we provide a table for 3 Brucella genomes:

all:embl:CP002459   BRUMM   chr1    c   Brucella melitensis M28 chromosome 1
all:embl:CP002460   BRUMM   chr2    c   Brucella melitensis M28 chromosome 2
all:embl:CP003176   BRUAO   chr1    c   Brucella abortus A13334 chromosome 1
all:embl:CP003177   BRUAO   chr2    c   Brucella abortus A13334 chromosome 2
all:embl:CP002078   BRUPB   chr1    c   Brucella pinnipedialis B2/94 chromosome 1
all:embl:CP002079   BRUPB   chr2    c   Brucella pinnipedialis B2/94 chromosome 2

... and for 17 Brucella genomes as well:

all:embl:CP003176   BRUAO   chr1    c   Brucella abortus A13334 chromosome 1
all:embl:CP003177   BRUAO   chr2    c   Brucella abortus A13334 chromosome 2
all:embl:CP003174   BRUCA   chr1    c   Brucella canis HSK A52141 chromosome 1
all:embl:CP003175   BRUCA   chr2    c   Brucella canis HSK A52141 chromosome 2
all:embl:CP002459   BRUMM   chr1    c   Brucella melitensis M28 chromosome 1
all:embl:CP002460   BRUMM   chr2    c   Brucella melitensis M28 chromosome 2
all:embl:CP001851   BRUM5   chr1    c   Brucella melitensis M5-90 chromosome I
all:embl:CP001852   BRUM5   chr2    c   Brucella melitensis M5-90 chromosome II
all:embl:CP002931   BRUML   chr1    c   Brucella melitensis NI chromosome I
all:embl:CP002932   BRUML   chr2    c   Brucella melitensis NI chromosome II
all:embl:CP002078   BRUPB   chr1    c   Brucella pinnipedialis B2/94 chromosome 1
all:embl:CP002079   BRUPB   chr2    c   Brucella pinnipedialis B2/94 chromosome 2
all:embl:CP003128   BRUSS   chr1    c   Brucella suis VBI22 chromosome I
all:embl:CP003129   BRUSS   chr2    c   Brucella suis VBI22 chromosome II
all:embl:AE017223   BRUAB   chr1    c   Brucella abortus biovar 1 str. 9-941 chromosome I
all:embl:AE017224   BRUAB   chr2    c   Brucella abortus biovar 1 str. 9-941 chromosome II
all:embl:CP000887   BRUA1   chr1    c   Brucella abortus S19 chromosome 1
all:embl:CP000888   BRUA1   chr2    c   Brucella abortus S19 chromosome 2
all:embl:AM040264   BRUA2   chr1    c   Brucella melitensis biovar Abortus 2308 chromosome I
all:embl:AM040265   BRUA2   chr2    c   Brucella melitensis biovar Abortus 2308 chromosome II
all:embl:CP000872   BRUC2   chr1    c   Brucella canis ATCC 23365 chromosome I
all:embl:CP000873   BRUC2   chr2    c   Brucella canis ATCC 23365 chromosome II
all:embl:CP001488   BRUMB   chr1    c   Brucella melitensis ATCC 23457 chromosome I
all:embl:CP001489   BRUMB   chr2    c   Brucella melitensis ATCC 23457 chromosome II
all:embl:AE008917   BRUME   chr1    c   Brucella melitensis bv. 1 str. 16M chromosome I
all:embl:AE008918   BRUME   chr2    c   Brucella melitensis 16M chromosome II
all:embl:CP001578   BRUMC   chr1    c   Brucella microti CCM 4915 chromosome 1
all:embl:CP001579   BRUMC   chr2    c   Brucella microti CCM 4915 chromosome 2
all:embl:CP000708   BRUO2   chr1    c   Brucella ovis ATCC 25840 chromosome I
all:embl:CP000709   BRUO2   chr2    c   Brucella ovis ATCC 25840 chromosome II
all:embl:AE014291   BRUSU   chr1    c   Brucella suis 1330 chromosome I
all:embl:AE014292   BRUSU   chr2    c   Brucella suis 1330 chromosome II
all:embl:CP000911   BRUSI   chr1    c   Brucella suis ATCC 23445 chromosome I
all:embl:CP000912   BRUSI   chr2    c   Brucella suis ATCC 23445 chromosome II

The latter one is used below.
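A quick sanity check of such a table is to group chromosome names by genome and confirm that every genome has the chromosomes you expect. The sketch below does this for the 3-genome table; the function name chromosomes_by_genome is hypothetical, and rows are assumed to be whitespace-separated.

```python
from collections import defaultdict

# The 3-genome example table from this document.
TABLE = """\
all:embl:CP002459   BRUMM   chr1    c   Brucella melitensis M28 chromosome 1
all:embl:CP002460   BRUMM   chr2    c   Brucella melitensis M28 chromosome 2
all:embl:CP003176   BRUAO   chr1    c   Brucella abortus A13334 chromosome 1
all:embl:CP003177   BRUAO   chr2    c   Brucella abortus A13334 chromosome 2
all:embl:CP002078   BRUPB   chr1    c   Brucella pinnipedialis B2/94 chromosome 1
all:embl:CP002079   BRUPB   chr2    c   Brucella pinnipedialis B2/94 chromosome 2
"""


def chromosomes_by_genome(table_text):
    """Map each short genome name to the list of its chromosome names."""
    genomes = defaultdict(list)
    for line in table_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Fields 2 and 3 are the genome name and the chromosome name.
        _entry, genome, chromosome = line.split()[:3]
        genomes[genome].append(chromosome)
    return dict(genomes)


if __name__ == "__main__":
    for genome, chromosomes in chromosomes_by_genome(TABLE).items():
        print(genome, ",".join(chromosomes))
```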

Open the command line

Further steps are performed in the command line.

Linux users are expected to be familiar with the command line.

How to launch the command line in Windows:

press Win+R on your keyboard to open the Run dialog;
type cmd and press Enter;
type cd path/to/directory + Enter to navigate to the directory with file genomes.tsv;
to switch to another disk, type its letter followed by : + Enter, for example D: + Enter to switch to disk D;
type the commands from the following sections without $, for example, npge Prepare + Enter.

Prepare sequences and genes

Run the following command:

$ npge Prepare

The following files are created by this command:

genomes-renamed.fasta is a FASTA file with the genomes on which the nucleotide pangenome is to be built;
genes/features.bs is a blockset of genes. One gene is represented as one block.

Files genomes-raw.fasta and features.embl contain unprocessed input data. They are not used by the following steps. You can safely remove them.

Examine prepared sequences

Run the following command:

$ npge Examine

The following files are created by this command in directory examine:

genomes-info.tsv table of genome lengths; make sure genome lengths are about the same, otherwise do not expect NPG-explorer to build a good pangenome;
draft.bs draft pangenome;
identity_recommended.txt text file with the recommended value of parameter MIN_IDENTITY (see below how to change this parameter in the configuration file).

This step gathers information about the input genomes. This information can be used in the next step (configuration).
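The advice that genome lengths should be "about the same" can be turned into a simple numeric check. The sketch below is an illustration only: the function name lengths_comparable and the threshold max_ratio=1.5 are assumptions, not values used by NPG-explorer, and the lengths are expected to be supplied by you (for example, copied from genomes-info.tsv, whose exact layout is not described here).

```python
def lengths_comparable(lengths, max_ratio=1.5):
    """Return True if the longest genome is at most max_ratio times
    longer than the shortest one.

    lengths   -- dict mapping genome name to total length in bp
    max_ratio -- hypothetical tolerance, not an NPG-explorer setting
    """
    if not lengths:
        raise ValueError("no genome lengths given")
    shortest = min(lengths.values())
    longest = max(lengths.values())
    return longest <= max_ratio * shortest


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    print(lengths_comparable({"GENA": 3300000, "GENB": 3400000}))
    print(lengths_comparable({"GENA": 1000000, "GENB": 3000000}))
```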

Set values of global options

To change values of global options, make file npge.conf using command npge -g npge.conf, then edit this file to change values of global options. File npge.conf contains default values compiled into the program. Sometimes they have to be changed to improve results.

Configuration file looks like this:

MIN_IDENTITY = Decimal('0.9')
MIN_LENGTH = 100

Decimal values are specified using the syntax above. Accuracy of decimal values is 4 digits after the point.
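If you want to read such a file from a script (for instance, to record the settings used for a run), the two value forms shown above can be parsed as follows. This is a sketch under assumptions: the function name parse_conf is hypothetical, only NAME = Decimal('...') and NAME = integer are interpreted, and any other value is kept as a raw string. The rounding to 4 digits mirrors the accuracy stated above; the rounding mode is Python's default and may differ from the program's.

```python
import re
from decimal import Decimal

# Matches the Decimal('0.9') form shown in the configuration example.
_DECIMAL = re.compile(r"Decimal\('([^']+)'\)")
_FOUR_DIGITS = Decimal("0.0001")


def parse_conf(text):
    """Parse NAME = VALUE lines into a dict of option values."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not an assignment
        name, value = (part.strip() for part in line.split("=", 1))
        match = _DECIMAL.fullmatch(value)
        if match:
            # Keep 4 digits after the point, as stated in the text.
            options[name] = Decimal(match.group(1)).quantize(_FOUR_DIGITS)
        elif re.fullmatch(r"-?\d+", value):
            options[name] = int(value)
        else:
            options[name] = value  # unknown form: keep as written
    return options


if __name__ == "__main__":
    print(parse_conf("MIN_IDENTITY = Decimal('0.9')\nMIN_LENGTH = 100"))
```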

The program applies the following configuration files (if they exist):

npge.conf in current directory;
npge.conf in the program's directory (directory where the executable lives);
/etc/npge.conf
~/.npge.conf
reads environment variables named like known options;
npge.conf in current directory (again);
applies command line options if passed.
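The list above reads most naturally as a layering scheme in which each source is applied on top of the previous ones. The sketch below models that with plain dictionaries. It assumes that a later source overrides an earlier one, which the list suggests but does not state explicitly; the function name effective_options and the sample values are hypothetical.

```python
def effective_options(*layers):
    """Merge option dicts in the order given; later layers win.

    Assumption: sources are applied in the listed order and a later
    source overrides values set by an earlier one.
    """
    result = {}
    for layer in layers:
        result.update(layer)
    return result


if __name__ == "__main__":
    compiled_in = {"MIN_IDENTITY": "0.9", "MIN_LENGTH": 100}
    local_conf = {"MIN_LENGTH": 200}          # npge.conf in current directory
    environment = {"MIN_IDENTITY": "0.95"}    # environment variables
    command_line = {"MIN_LENGTH": 300}        # command line options
    print(effective_options(compiled_in, local_conf, environment, command_line))
```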

Build nucleotide pangenome

$ npge MakePangenome

This command creates file pangenome/pangenome.bs. The file is in BlockSet format.

Check nucleotide pangenome (optional)

$ npge CheckPangenome

This command makes sure that nucleotide pangenome in file pangenome/pangenome.bs satisfies pangenome criteria.

The command prints whether the pangenome is OK and may print some comments about the pangenome.

This check is also performed during post-processing; the result is saved to file check/isgood.

Run post-processing of nucleotide pangenome

$ npge PostProcessing

This command produces many files, some of them are located in sub-folders.
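The command-line steps described so far (Prepare, Examine, MakePangenome, CheckPangenome, PostProcessing) can also be driven from a script. The following is a minimal sketch: the function name run_pipeline is hypothetical, it assumes npge is on PATH and that the working directory already contains genomes.tsv, and it stops at the first command that returns a non-zero exit code.

```python
import subprocess

# Sub-commands in the order used in this instruction.
STEPS = ["Prepare", "Examine", "MakePangenome", "CheckPangenome",
         "PostProcessing"]


def run_pipeline(workdir, run=subprocess.check_call):
    """Run every npge step inside workdir and return the commands run.

    run -- callable with the interface of subprocess.check_call;
           replaceable for dry runs or testing.
    """
    executed = []
    for step in STEPS:
        command = ["npge", step]
        run(command, cwd=workdir)  # raises if the command fails
        executed.append(command)
    return executed


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_pipeline(".", run=lambda command, cwd: print(cwd, " ".join(command)))
```

If you intend to edit npge.conf after the Examine step, run the steps separately rather than in one go.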

Files *.bs contain blocksets, *.bi contain tables of blocks' properties, *.ba contain blockset alignments (cells are fragments), and *.blocks contain blockset alignments (cells are blocks).

Columns of files *.bi:

name of the block;
number of fragments in the block;
length of the alignment of the block;
number of identical columns with no gaps;
number of identical columns with gaps;
number of nonidentical columns with no gaps;
number of nonidentical columns with gaps;
number of columns containing only gaps (must be 0);
identity (from 0.0 to 1.0);
percentage of 'G' and 'C' nucleotides in the block (from 0.0 to 1.0).

Files *.bi also contain numbers of occurrences of a block in a genome. Each genome adds one column to the table. To add similar columns with numbers of occurrences of a block in a sequence, add option --info-count-seqs=1 to command npge PostProcessing. File pangenome/pangenome-small.bi contains short version of pangenome/pangenome.bi, that lacks columns with occurrences in genomes.
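To make the column categories of *.bi files concrete, the sketch below classifies the columns of a small alignment. It is not the program's implementation: the function names are hypothetical, and it assumes that "identical" means all non-gap letters in a column are the same, that '-' is the gap character, and that GC content is computed over non-gap letters only.

```python
def column_stats(rows):
    """Count alignment columns by category for equal-length rows."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    stats = {"ident_nogap": 0, "ident_gap": 0,
             "noident_nogap": 0, "noident_gap": 0, "pure_gap": 0}
    for column in zip(*rows):
        letters = {c.upper() for c in column if c != "-"}
        has_gap = "-" in column
        if not letters:
            stats["pure_gap"] += 1      # only gaps (must be 0 in *.bi)
        elif len(letters) == 1:
            stats["ident_gap" if has_gap else "ident_nogap"] += 1
        else:
            stats["noident_gap" if has_gap else "noident_nogap"] += 1
    return stats


def gc_content(rows):
    """Fraction of 'G' and 'C' among all non-gap letters (0.0 to 1.0)."""
    letters = [c.upper() for row in rows for c in row if c != "-"]
    return sum(c in "GC" for c in letters) / len(letters)


if __name__ == "__main__":
    alignment = ["ACGT-AC",
                 "ACG-TAG",
                 "ATGT-A-"]
    print(column_stats(alignment))
    print(round(gc_content(alignment), 4))
```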

Files produced by npge MakePangenome and npge PostProcessing are as follows:

pangenome/pangenome.bs pangenome (main output of the program);
pangenome/fragments.tsv table with coordinates of all fragments in form: block, genome, chromosome, ac, start, stop, ori;
pangenome/pangenome.ba blockset alignment. Table file representing alignment in which "letters" are fragments of pangenome. This file is used by GUI viewer qnpge;
pangenome/pangenome.blocks blockset alignment. Table file representing alignment in which "letters" are block names. This file can be viewed in Excel. It is not used by qnpge;
global-blocks/blocks.bs global blocks (blockset). Global blocks are joined collinear s-blocks;
global-blocks/blocks.ba global blocks (blockset alignment of fragments). This file is used by GUI viewer qnpge;
global-blocks/blocks.blocks global blocks (blockset alignment of blocks).
global-blocks/blocks.gbi properties of global blocks as a table;
global-blocks/global-fragments.tsv properties of fragments of global blocks as a table;
pangenome/pangenome.hash hash of pangenome;
pangenome/pangenome.info summary of pangenome stats (average identity of blocks, etc);
extra-blocks/split.bs and extra-blocks/split.bi blocks split by diagnostic positions. These files are used by GUI viewer qnpge;
extra-blocks/low.bs and extra-blocks/low.bi Subblocks of low identity sliced from pangenome blocks. See explanation for more information about low identity subblocks. This file is used by GUI viewer qnpge;
directory check files related to pangenome checking;
- isgood result of the check that the pangenome satisfies the pangenome criteria;
- hits.blast output of BLAST all-against-all run on consensuses;
- filtered-hits.blast filtered output of BLAST all-against-all run on consensuses (self-hits, reverse hits, short hits and hits of low identity were removed);
- all-blast-hits.bs and all-blast-hits.bi all BLAST hits as blocksets;
- good-blast-hits.bs and good-blast-hits.bi BLAST hits satisfying pangenome criteria, which surpass overlapping blocks from the pangenome. If pangenome satisfies the criteria, there are no such blocks;
- non-internal-hits.bs and non-internal-hits.bi BLAST hits satisfying pangenome criteria, which don't surpass overlapping blocks from the pangenome;
- joined.bs and joined.bi joined subsequent blocks satisfying the criteria;
mutations mutations related files:
- mut.tsv table of all mutations (columns are block, fragment, start of mutation, stop of mutation, letter(s) in consensus, letter (or gap) in the fragment); see file src/tool/parse-mutations-file.py for an example of how to parse file mut.tsv;
- mutseq.fasta FASTA file with sequences composed of columns with mutations (plus 1 column to the right and to the left) of stable blocks;
- mutseq-with-blocks.bs same as previous + stable blocks mapped onto these sequences;
- consensuses.fasta consensus sequences of blocks;
genes analysis of genes:
- genes/partition-ungrouped.tsv map of genes onto blocks;
- genes/partition-grouped.tsv map of genes onto blocks, grouped by gene fragment;
- genes/good.bs groups of genes matching each other exactly according to s-blocks;
- genes/good-upstreams.bs upstream sequences of good genes (genes/good.bs);
trees tree related files:
- nj-global-tree.tre global tree constructed using Neighbour-Joining applied to genome distances;
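Of the files listed above, pangenome/fragments.tsv is the easiest to post-process, since its columns are named in the list: block, genome, chromosome, ac, start, stop, ori. The sketch below reads such rows into named tuples. It assumes tab-separated values, skips any row that does not have 7 fields with numeric start and stop (such as a header line, if the file has one), and keeps ori as a string because its encoding is not described here. The names Fragment and read_fragments, and the sample row, are invented for illustration.

```python
import csv
import io
from typing import NamedTuple


class Fragment(NamedTuple):
    block: str
    genome: str
    chromosome: str
    ac: str
    start: int
    stop: int
    ori: str   # kept as text; the encoding of orientation is an unknown here


def _is_int(text):
    return text.lstrip("-").isdigit()


def read_fragments(handle):
    """Read fragments from an open tab-separated file or file-like object."""
    fragments = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 7 or not (_is_int(row[4]) and _is_int(row[5])):
            continue  # header or malformed row
        block, genome, chromosome, ac, start, stop, ori = row
        fragments.append(Fragment(block, genome, chromosome, ac,
                                  int(start), int(stop), ori))
    return fragments


if __name__ == "__main__":
    # Invented sample content, for illustration only.
    sample = io.StringIO("block\tgenome\tchromosome\tac\tstart\tstop\tori\n"
                         "blockA\tBRUAO\tchr1\tCP003176\t0\t99\t1\n")
    for fragment in read_fragments(sample):
        print(fragment.block, fragment.genome, fragment.stop - fragment.start)
```

With a real file, pass open("pangenome/fragments.tsv", newline="") instead of the in-memory sample.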

How to view .tre files using FigTree: open a file with FigTree, set branch label to "Diagnostic positions" in pop-up window, go to "Branch Labels" section of left menu, enable the section's checkbox. Abstract distances between nodes are shown under branches. To show number of diagnostic positions between corresponding clades, select "Diagnostic positions" in drop-down list.

View results in graphical user interface

$ qnpge

Graphical User Interface of NPG-explorer

This command uses pangenome/pangenome.bs and some of files created by PostProcessing.

The program window is split into 3 parts:

top left is the table of blocks;
top right is the blockset alignment (may be absent if there is no file pangenome/pangenome.ba);
bottom is alignment of selected block.

Columns of blocks table:

number of fragments in a block;
length of a block;
identity of a block;
percentage of 'G' and 'C' nucleotides in the block;
number of genes overlapping with a block;
number of parts split from a block;
number of low similarity regions.

You can filter blocks by block name, gene name or sequence using the input field located above the blocks table. By default, the pattern is matched as a wildcard. ^ before the pattern and $ after the pattern correspond to the start and end of the name (as in regular expressions). To hide blocks of one fragment, click the checkbox "only blocks of >= 2 fragments". The blocks table can be sorted by any column.

The blockset alignment table shows the alignment of fragments on genomes. A chromosome can be selected using the drop-down list located above the blockset table. Each sequence is represented as a row of the blockset table. The name of a sequence and its orientation against the alignment are written in the first column. Fragments of a sequence are represented by cells of the blockset table. Fragments of one block are coloured similarly. Orientation of a fragment against the alignment is indicated by '<' and '>'.

When you navigate in the blocks table and the blockset alignment, the alignment of the corresponding block is shown in the bottom part of the program. The fragment name is shown to the left of the alignment itself. Background colors in the alignment correspond to nucleotide types. The name of the selected gene is shown in the read-only field located above the alignment. You can disable the representation of genes completely by unchecking the checkbox "show genes". Genes are coloured with a white foreground color. Genes on the reverse strand (relative to the fragment orientation) are underlined. Overlapping genes are coloured purple. Start codons are coloured black, stop codons are coloured gray. The consensus of the block is shown above the alignment. Identical columns without gaps are coloured black, identical columns with gaps are coloured gray, non-identical columns are white. Column numbers are shown above the consensus.
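The filter semantics described above (wildcard matching anywhere in the name unless ^ or $ anchors the pattern) can be sketched as follows. This is an illustration of the described behaviour, not the code of qnpge: the function name name_matches is hypothetical, and case-sensitive matching with shell-style wildcards (*, ?) is an assumption.

```python
import fnmatch


def name_matches(pattern, name):
    """Wildcard match of pattern against name.

    Without anchors the pattern may match anywhere inside the name;
    a leading ^ pins it to the start, a trailing $ pins it to the end.
    """
    anchored_start = pattern.startswith("^")
    if anchored_start:
        pattern = pattern[1:]
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # An unanchored side is allowed to be preceded/followed by anything.
    glob = ("" if anchored_start else "*") + pattern + ("" if anchored_end else "*")
    return fnmatch.fnmatchcase(name, glob)


if __name__ == "__main__":
    # Block names below are invented for illustration.
    names = ["blockA1", "blockA10", "xblockA1"]
    print([n for n in names if name_matches("blockA1", n)])
    print([n for n in names if name_matches("^blockA1$", n)])
```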

Column numbers of low similarity regions are coloured red. Low similarity regions represent parts of blocks with unreliable alignment. There are 3 possible reasons for the occurrence of low similarity regions:

these sequences are not related;
recombination;
deletion and insertion in the same position.

You can use the arrow keys to navigate through the alignment. The corresponding fragment is selected in the blockset alignment. Use keys "Home" and "End" to go to the first and last columns of the alignment respectively. If you "go away" from the alignment, the program switches to the corresponding block. You go to the next gene boundary if you press Ctrl + Arrow Right or Ctrl + Arrow Left. You go to the next low similarity region if you press Shift + Arrow Right or Shift + Arrow Left.

To change order of sequences in blockset alignment and block alignment, select some rows (you can use Ctrl to select multiple rows) and press Ctrl + Arrow Up or Ctrl + Arrow Down.

Requirements of a good pangenome

no overlapping blocks;
sequences are covered entirely by blocks (including 1-fragment blocks);
alignment is defined for each block of >= 2 fragments;
length of any block except minor blocks is greater than or equal to MIN_LENGTH;
identity of any FRAME_LENGTH consecutive columns of any but minor block is greater than or equal to MIN_IDENTITY;
first and last columns of blocks do not contain gaps or dangling letters (a few letters followed by long gaps);
identity of the MIN_END first and last columns of any but minor block is greater than or equal to MIN_IDENTITY;
BLAST run on consensuses finds no blocks which satisfy the above criteria and surpass overlapping blocks from the pangenome;
no consecutive blocks can be joined so that the resulting block satisfies the above criteria.
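The sliding-window criterion in the list above (every run of FRAME_LENGTH consecutive columns must reach MIN_IDENTITY) is stricter than a check of the block's overall identity. The sketch below illustrates the difference. It is a simplification under assumptions: identity is taken as the plain fraction of identical columns in a window, whereas the program may treat columns with gaps differently; blocks shorter than the frame are checked as a single window; the function name windows_pass is hypothetical.

```python
def windows_pass(identical, frame_length, min_identity):
    """Check that every window of frame_length columns has enough
    identical columns.

    identical -- sequence of 0/1 (or bool) flags, one per alignment column
    """
    n = len(identical)
    if n == 0:
        return True
    window = min(frame_length, n)  # short block: one window over all columns
    count = sum(identical[:window])
    if count / window < min_identity:
        return False
    # Slide the window one column at a time, updating the running count.
    for i in range(window, n):
        count += identical[i] - identical[i - window]
        if count / window < min_identity:
            return False
    return True


if __name__ == "__main__":
    good = [1, 1, 1, 0, 1, 1, 1, 1]
    bad = [1, 1, 0, 0, 1, 1, 1, 1]   # overall identity 0.75, first window 0.5
    print(windows_pass(good, frame_length=4, min_identity=0.75))
    print(windows_pass(bad, frame_length=4, min_identity=0.75))
```

The second example has an overall identity of 0.75, yet its first four columns contain only two identical ones, so it fails the window check.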

Blocks types:


Major blocks = non-minor blocks.

Build and Install

Main executables are command line tool src/tool/npge (or src/tool/npge.exe) and GUI tool src/gui/qnpge (src/gui/qnpge.exe).

To change compiled-in default settings, run ccmake . in build directory.

To generate config file, run npge -g npge.conf and change generated file npge.conf.

Requirements

C++ compiler (C++11 is not needed);
Build system: make, cmake;
Boost library (tested with version 1.42);
Lua (version 5.1 or 5.2) or LuaJIT;
LuaBind library;
ZLIB library;
BLAST legacy or BLAST plus (run-time requirement).

Warning. Make sure you use the same Lua which luabind was linked against. Otherwise it compiles but doesn't work: PANIC: unprotected error in call to Lua API (attempt to index a nil value)

Optional:

Qt library (tested with version 4.8) for GUI;
Readline and NCurses libraries for advanced Lua terminal;
Doxygen to build documentation;
Markdown builder (e.g. Pandoc) to make README.html;
a viewer for trees in Newick format (FigTree, MEGA), to view phylogenetic trees of genomes.

Linux

To build static Linux package in fresh Debian Wheezy, install curl and sudo and run curl -L https://git.io/vmB9P | sh

Install build requirements (on Debian):

$ ./linux/requirements.sh

Build the program as static executables (Qt 4 is not static!):

$ ./linux/build.sh

The program is built in the directory npge-build-linux.

Create distribution .tar.gz file: go into npge-build-linux and run:

$ ./linux/package.sh

How to build manually:

$ ./src/init_lua-npge.sh
$ mkdir build
$ cd build
$ cmake ..
$ make

Pass argument -DNPGE_STATIC_LINUX:BOOL=1 to cmake to get static executables (on Debian, Qt 4 is not static!).

Build README.html:

$ pandoc -s -o README.html README.md

Run tests:

$ make test

Windows

To build static Windows packages in fresh Debian Wheezy, install curl and sudo and run curl -L https://git.io/vmB9v | sh

Windows executables are cross-compiled from Linux using MinGW cross-compiler.

For 64-bit Windows you need to export MXE_TARGET variable:

$ export MXE_TARGET=x86_64-w64-mingw32.static