Document retrieved from a search engine cache. Original document address: http://mccmb.belozersky.msu.ru/2015/proceedings/PROGRAM.pdf
Modified: Mon Jul 13 13:06:20 2015
Indexed: Sat Apr 9 22:47:56 2016
MCCMB'2015

International Moscow Conference on Computational Molecular Biology MCCMB'2015
PROGRAM

Moscow, Russia July 16-19, 2015


Short Program

Thursday, 16 July 2015

9.00-10.00  Registration and morning coffee (MSU, Biological Faculty)
10.00-10.10  Opening (MSU, Biological Faculty, auditorium M1)

Session: Bacterial genomics (M1)
10.10-10.50  Emergence of Membrane Bioenergetics from Ancient Systems of Na+/K+ Homeostasis
    Armen Mulkidjanian
10.50-11.10  A universal signaling mechanism in bacterial chemoreceptors
    Igor Zhulin
11.10-11.30  Genomic analysis of the respiration of the microbiota of human intestine
    Dmitry Ravcheev
11.30-12.00  Coffee break

Session: Transcriptomics (M1)
12.00-12.40  Classifying transcriptional and genetic heterogeneity in single-cell measurements
    Peter Kharchenko
12.40-13.00  Network integration of parallel metabolomic-transcriptional data reveals novel metabolic modules regulating divergent macrophage polarization
    Maxim Artyomov
13.00-13.20  Differential gene expression by RNA-seq data in brain structures of laboratory animals with aggressive and tolerant behavior
    Yuriy Orlov
13.20-15.00  Lunch break

Session: Replication and chromatin (M1)
15.00-15.40  Inferring direction of replication fork and mechanism of DNA damage using sequencing data
    Maga Rowicka
15.40-16.00  APOBEC-induced mutations are strongly enriched on the lagging strand during replication in human cancers
    Vladimir Seplyarskiy
16.00-16.40  Diversity and Domestication in Oenological Yeasts
    David Sherman
16.40-18.40  Coffee break and Poster session
18.40-19.00  Buses depart for the Conference dinner
19.00  Dinner


Friday, 17 July 2015

09.00-10.00  Morning coffee

10.00-12.40  OMICS-2015 (Skolkovo satellite) (MSU, Biological Faculty, auditorium M1)
10.10  Introduction
    Yuri Nikolsky
10.15  Introduction
    Pakhomov or Meyer
10.20  Introduction of Jury
10.30-12.10  Presentation of projects: Projects 1-10
12.10-12.40  Coffee break

12.40-15.00  OMICS-2015 (Skolkovo satellite), cont'd (MSU, Biological Faculty, auditorium M1)
12.40-12.10 [sic]  Presentation of projects: Projects 11-20
14.20-14.40  Jury Counting
14.40-15.00  Announcement of winners, presentation of awards, conclusion

10.00-13.00  MoBiLe 10Y symposium (Leiden satellite) (MSU, Faculty of Bioengineering and Bioinformatics, auditorium 221)
10.00-10.10  Welcome
    A.E.Gorbalenya, A.Alexeevski
10.10-10.40  Fast evolution of a conserved residue of polyomaviruses defines a new mechanism of adaptation that operates by accelerated codon-constrained Val-Ala (COCO-VA) toggling within an intrinsically disordered protein region
    A.E.Gorbalenya
10.40-11.00  The impact of biological ageing on RNA processing
    I.Pulyakhina, V.A.Takhaveev, M. Vermaat, M.S. Gelfand, J.F.J. Laros, P.A.C. 't Hoen, BIOS consortium
11.00-11.20  Probing-directed structured elements detection in RNA sequences
    Svetlana Vinogradova
11.20-11.50  Mobilis in MoBiLe: Students in a Dynamic Research Field
    O.A. Mayboroda
11.50-12.10  Coffee break
12.10-12.40  Novel insights in the regulation of mRNA transcription, processing and translation through integration of mRNA sequencing data
    P.A.C. 't Hoen
12.40-13.00  Active chromatin regions are sufficient to define borders of topologically associated domains in D. melanogaster interphase chromosomes
    E.Khrameeva, S.Ulyanov, A.Gavrilov, Yu.Shevelyov, M.Gelfand, and S.Razin
13.00-14.00  Lunch break

14.40-15.50  MoBiLe 10Y symposium (Leiden satellite), cont'd (Faculty of Bioengineering and Bioinformatics, auditorium 221)
14.00-14.30  Alignment-free telomere length estimation from whole genome NGS data
    S.M.Kielbasa
14.30-14.50  A comparison of two methods for detection of exceptional words in genomic sequences of prokaryotes
    Ivan Rusinov
14.50-15.10  Peptide search engine approach for detection of translated mutations
    P.Sinitcyn, S.Tyanova, M.Mann and J.Cox
15.10-15.30  NPG-explorer: a new tool for nucleotide pangenome construction and analysis of closely related prokaryotic genomes
    Boris Nagaev
15.30-15.50  O2PLS as an integrative tool in systems oncology
    E.Nevedomskaya and H.Keun

Session: Medical bioinformatics (MSU, Biological Faculty, auditorium M1)
15.20-15.50  Building a Sustainable Bioinformatics Program Through Integrated Support
    Michael Tartakovsky
15.50-16.20  Network of the Country Tuberculosis Portals
    Alexander Rosenthal

Session: Cancer-1 (M1)
16.20-17.00  Deciphering gene interaction to explain tumor progression
    Emmanuel Barillot
17.00-17.30  Coffee break

Session: Cancer-2 (M1)
17.30-17.50  A model for scoring damaging mutations in the noncoding tumoral genome based on germline and tumor data
    Jia Li
17.50-18.10  Selection pressure on breast cancer somatic mutations revealed by bioinformatics sequence analysis
    Ivan Kulakovskiy
18.10-18.30  Searching For Essential Cancer Proteins: Analysis Of Hypomutated Genes In Skin Melanoma
    Mikhail Pyatnitskiy
18.30-19.10  Revealing mechanisms of cancer progression by pan-cancer deconvolution of tumoral transcriptomes
    Andrei Zinovyev


Saturday, 18 July 2015

09.00-10.00  Morning coffee

Session: Methods and algorithms (M1)
10.00-10.20  Analysis of variation in 3000 rice genomes project
    Tatiana Tatarinova
10.20-10.40  Distance-based profiling aids in evaluation of ageing-related phenomena
    Ancha Baranova
10.40-11.00  Scaffold assembly based on genome rearrangement analysis
    Max Alekseyev
11.00-11.20  A method for model comparison based on the parameter sensitivity measures
    Ekaterina Myasnikova
11.20-11.40  Method to predict the percentage of cell types in human blood
    Anna Igolkina
11.40-12.10  Coffee break

Session: Systems biology: Mammals and beyond (M1)
12.10-12.50  Mammalian Systems biology
    Alistar Forrest
12.50-13.30  Genomics of Lifespan Control
    Vadim Gladyshev
13.30-15.00  Lunch break

Session: Genomics of regulation (M1)
15.00-15.20  Computer analysis of genome co-localization of transcription factor binding sites based on ChIP-seq data
    Arthur Dergilev
15.20-15.40  Search for simple and composite auxin responsive elements in Arabidopsis thaliana genome
    Victoria Mironova
15.40-16.00  Antisense interactions of long noncoding RNAs in human cells
    Ivan Antonov

Session: Viruses - 1 (M1)
16.00-16.40  Challenges in virus genomics
    Manja Marz
16.40-17.10  Coffee break

Session: Viruses - 2 (M1)
17.10-17.30  Sequence and structural analysis of related proteins in distant viral species
    Olga Kalinina

Session: Proteins (M1)
17.30-17.50  Bioinformatic analysis of diverse protein superfamilies to design improved enzymes
    Dmitry Suplatov
17.50-18.10  Detecting the features of functional specificity in protein families based on the local sequence similarity
    Boris Sobolev
18.10-18.30  Determination of the size of folding nuclei of protofibrils from the concentration dependence of the rate and lag-time of their formation
    Oxana Galzitskaya
18.30-19.10  Assessing protein synthesis with ribosome profiling
    Pavel Baranov


Sunday, 19 July 2015

09.00-10.00  Morning coffee

Session: Genomics of anhydrobiosis (M1)
10.00-10.40  Anhydrobiosis in the sleeping chironomids: where are we now
    Takahiro Kikawada
10.40-11.00  Expression regulation of desiccation-resistance genes in Polypedilum vanderplanki
    Pavel Mazin
11.00-11.20  Molecular basics of different mechanisms of desiccation tolerance in Chironomidae midges
    Olga Kozlova
11.20-11.40  Adapting to extremes: linking metabolome and genome of an anhydrobiotic insects
    Elena Shagimardanova
11.40-12.10  Coffee break

Session: Toolkits for anhydrobiosis research (M1)
12.10-12.50  Single cell molecular toolkit for inducible resistance to complete desiccation
    Oleg Gusev
12.50-13.30  Genetic toolkit for investigation of anhydrobiosis: promoters and RNAi
    Richard Cornette
13.30-15.00  Lunch break

Session: Genome structure (M1)
15.00-15.20  Genome mapping revealed scaffold misassemblies and elevated gene shuffling on the X chromosome in malaria mosquitoes
    Igor Sharakhov
15.20-15.40  Detection of short size mutations and copy number alterations in ultra-deep targeted sequencing data
    Valentina Boeva
15.40-16.00  Genomic structural instability and homologous recombination deficiency in breast and ovarian cancers
    Tatiana Popova
16.00-16.20  Genome Track Analyzer: New tool for genome-wide study of correlations between distributed genome features
    Galina Kravatskaya
16.20-16.40  Tale on the transposons on chromatin landscape
    Vladimir Babenko
16.40-17.10  Coffee break

Session: Evolution (M1)
17.10-17.30  Assessing the impact of horizontal gene transfer on the evolution of prokaryotes
    Vladimir Makarenkov
17.30-17.50  Rare amino acid changes fixation drives divergence in Metazoa evolution
    Konstantin Gunbin
17.50-18.10  Evolution of TAG codon in Methanosarcina
    Margarita Meer
18.10-18.30  A model of protein evolution within local fitness landscape changing with time
    Dinara Usmanova
18.30-19.10  Chartering the local fitness landscape of the green fluorescent protein
    Fedor Kondrashov
19.10-19.20  Closing
19.30  Farewell party

Extended Program

Thursday, 16 July 2015

9.00-10.00  Registration and morning coffee (MSU, Biological Faculty)
10.00-10.10  Opening (MSU, Biological Faculty, auditorium M1)

Session: Bacterial genomics (M1)
10.10-10.50  Emergence of Membrane Bioenergetics from Ancient Systems of Na+/K+ Homeostasis
    Armen Mulkidjanian

It is well known that the cytoplasm of living cells generally contains more potassium ions than sodium ions. The prevalence of K+ ions is crucial for the activity of numerous (nearly) universal key enzymes, including those components of the translation system that even preceded the Last Universal Cellular Ancestor (LUCA). In modern prokaryotic cells, a [K+]/[Na+] ratio > 1.0 is maintained by ion-tight cellular membranes and an arsenal of ion pumps. It is unlikely that modern-type ion-tight membranes made of two-tail lipids, not to mention a plethora of ion-pumping machines, were present in the very first cells. It is more likely that the monovalent ion content of the cytoplasm of the first cells had to be equilibrated with the environment. The inhibitory effect of Na+ on many of these ubiquitous K+-dependent enzymes does not seem compatible with the evolution of the respective cellular systems and, generally, of the first cells in environments with high sodium levels. As early as 1926, Archibald Macallum suggested that the first cells might have emerged in K+-rich habitats. Several different, albeit complementary, geochemical scenarios have recently been proposed for the K+-rich environments of the primordial Earth. Marine and freshwater environments generally show a [K+]/[Na+] ratio less than unity. Therefore, to invade such environments while maintaining a cytoplasmic [K+]/[Na+] ratio over unity, primordial cells needed ion-tight membranes and means to extrude sodium ions. The foray into new, Na+-rich habitats was the likely driving force behind the evolution of diverse redox-, light-, chemically-, or osmotically-dependent sodium export pumps. By combining comparative structural and phylogenomic analyses, we try to reconstruct how an interplay between diverse, initially independent sodium export pumps could lead to the emergence of membrane bioenergetics.

10.50-11.10  A universal signaling mechanism in bacterial chemoreceptors
    Igor Zhulin

Bacterial chemoreceptors serve as a model system for understanding transmembrane signaling. However, the mechanisms by which conformational signals move within and between receptors and how they control kinase activity remain unknown. Using all-atom, microsecond-range molecular dynamics simulations on a special-purpose supercomputer, we show that the kinase-activating cytoplasmic tip of the chemoreceptor fluctuates between two stable conformations in a signal-dependent manner. A specific residue, Phe396, appears to serve as the conformational switch, because flipping of the stacked aromatic rings of an interacting F396-F396' pair in the receptor homodimer took place concomitantly with the signal-related conformational changes. Comparative genomic analysis reveals that F396 is the single most conserved residue in the entire chemoreceptor molecule: it is invariant in 99.8% of chemoreceptor sequences from all available genomes of bacteria and archaea. We conclude that, despite substantial differences in the signaling domain between diverse bacterial species, the signaling mechanism is universally conserved.

11.10-11.30  Genomic analysis of the respiration of the microbiota of human intestine
    Dmitry Ravcheev
11.30-12.00  Coffee break

Session: Transcriptomics (M1)
12.00-12.40  Classifying transcriptional and genetic heterogeneity in single-cell measurements
    Peter Kharchenko

Single-cell assays are making it possible to examine transcriptional states and other genome-wide properties of thousands of individual cells. The ability to directly assess cell heterogeneity is particularly critical in the context of cancer therapy, where the presence of phenotypically distinct subclonal populations fuels relapse and resistance to treatment. The transcriptional heterogeneity within such tumors and its impact on disease progression are poorly understood. Furthermore, the extent to which genetic and transcriptional subpopulations correspond to each other cannot currently be assessed. To investigate these questions, we have developed methods for analysis of single-cell RNA-seq data in concert with other genomic information. To characterize transcriptional subpopulations, we identify annotated or newly discovered gene sets that are linked to statistically significant heterogeneity within the measured collection of cells. To infer genotype information, we rely on probabilistic assessment of single nucleotide variants and copy number variation in individual cells, which can be used to distinguish genetically subclonal populations. We apply these methods to examine transcriptional and genetic heterogeneity in samples of multiple myeloma and other tumors.


12.40-13.00  Network integration of parallel metabolomic-transcriptional data reveals novel metabolic modules regulating divergent macrophage polarization
    Maxim Artyomov

We have developed an integrated high-throughput transcriptional-metabolic profiling and analysis pipeline, and applied it to characterize global rewiring during murine macrophage polarization to pro- and anti-inflammatory (M1 and M2) states. Network-based integration of metabolic and transcriptional RNA-seq data allowed us to mitigate problems specific to individual types of data and to obtain a global view of the metabolic changes during macrophage polarization. Metabolic profiling can be directly associated with a well-defined network of biochemical reactions, but it is not thorough: absence of a signal does not imply absence of the metabolite. On the other hand, transcriptional profiling catches all sufficiently expressed genes, and this information can be associated with metabolic reactions via enzymes. Thus, we compiled a network of reactions based on KEGG as a framework to integrate the metabolic and transcriptional profiling data. Next, we adapted the BioNet algorithm to weigh the nodes and edges in the network based on the p-value of differential expression (DE) between M1 and M2 conditions. Then we found a most connected subnetwork that contained as many positively scored and as few negatively scored nodes as possible. That led to a set of the most important interconnected reactions. As expected, this set contained well-known macrophage-related pathways such as glycolysis, the TCA cycle, etc. However, 1) it showed how these pathways were interacting and 2) it contained modules not described previously. In M2 macrophages we discovered novel glutamine/glutamate- and UDP-GlcNAc-associated modules, and validated their involvement using isotope labeling studies. The functional importance of these modules was further confirmed by glutamine deprivation and N-glycosylation inhibition experiments. In M1 macrophages we identified a metabolic break at Idh fragmenting the TCA cycle, and validated it using isotope labeling. The label distribution suggested the presence of a novel variant of the aspartate-argininosuccinate shunt. Consistently, inhibition of aspartate aminotransferase, a key enzyme of the shunt, hindered NO and IL6 production while promoting mitochondrial respiration. This systems approach provides a highly integrated picture of the physiological modules supporting macrophage polarization, identifying potential pharmacologic control points for both macrophage phenotypes.
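
The node-scoring and subnetwork-search steps described above can be illustrated with a small sketch. It is not the BioNet algorithm itself: the scoring rule (log10 of a significance threshold divided by the p-value), the greedy growth strategy, the function names and the toy reaction names and p-values are all assumptions made for illustration.

```python
import math


def node_scores(p_values, threshold=0.05):
    """Score each node: positive if p < threshold, negative otherwise.

    Assumption: score = log10(threshold / p), with p > 0. The published
    method derives its scores differently; this is a simplified stand-in.
    """
    return {node: math.log10(threshold / p) for node, p in p_values.items()}


def greedy_subnetwork(edges, scores):
    """Grow a connected node set from the best-scoring node, repeatedly adding
    the best-scoring neighbour while its score is positive. This is a
    heuristic, not an exact maximum-weight connected subgraph solver."""
    neighbours = {node: set() for node in scores}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seed = max(scores, key=scores.get)
    chosen = {seed}
    while True:
        frontier = set().union(*(neighbours[n] for n in chosen)) - chosen
        if not frontier:
            break
        best = max(frontier, key=scores.get)
        if scores[best] <= 0:
            break
        chosen.add(best)
    return chosen


# Toy example with hypothetical reaction names and p-values.
p_values = {"R_glycolysis": 1e-4, "R_tca": 1e-3, "R_glutamine": 1e-2, "R_other": 0.5}
edges = [("R_glycolysis", "R_tca"), ("R_tca", "R_glutamine"), ("R_glutamine", "R_other")]
module = greedy_subnetwork(edges, node_scores(p_values))
print(sorted(module))
```

The greedy search never crosses a negatively scored node, so it can miss modules that an exact solver would connect through a weakly negative intermediate; it is meant only to make the scoring idea concrete.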

13.00-13.20  Differential gene expression by RNA-seq data in brain structures of laboratory animals with aggressive and tolerant behavior
    Yuriy Orlov
13.20-15.00  Lunch break

Session: Replication and chromatin (M1)

15.00-15.40  Inferring direction of replication fork and mechanism of DNA damage using sequencing data
    Maga Rowicka

Double-stranded DNA breaks (DSBs) are a genotoxic form of DNA damage. The damage to both DNA strands precludes the straightforward use of the complementary strand as a template for repair, resulting in mutagenic lesions. Despite many studies on the mechanisms of DSB formation, our knowledge of them is very incomplete. A main reason for this is that, to date, DSB formation has been extensively studied only at specific loci but remains largely unexplored at the genome-wide level. We recently developed a method to label DSBs in situ followed by deep sequencing (BLESS), and used it to map DSBs in human cells with a resolution 2-3 orders of magnitude better than previously achieved. There are many factors inducing DSBs, including replication stress, oxidative stress and irradiation. Most of them cause two-ended DSBs (having two free ends of DNA); the only exception is replication stress, which usually induces one-ended DSBs (caused by replication fork stalling and collapse). We use this observation to infer DSBs resulting from replication stress and to analyze chromatin context and sequence features related to replication stress-induced DSBs. Moreover, we show how to reconstruct the direction of replication fork movement from the BLESS-Seq read pattern. We apply this concept to infer replication domain boundaries for several cell lines and conditions and to analyze how they change upon treatments and vary between cell lines. We also provide experimental verification for the proposed computational method and show that purely computational methods can predict >80% of experimentally detected DSBs.

15.40-16.00  APOBEC-induced mutations are strongly enriched on the lagging strand during replication in human cancers
    Vladimir Seplyarskiy

Mutagenesis induced by deaminases of the APOBEC family is prevalent in many cancers. A fraction of APOBEC mutations is clustered around DSBs; however, the vast majority of them are dispersed over the genome. Since APOBEC specifically mutates single-stranded DNA (ssDNA), we hypothesized that the lagging DNA strand, which exists in a single-stranded state during DNA replication, may be a frequent target for APOBEC mutations. Knowing the direction of replication fork progression in the human genome, we were able to predict for each genomic region which of the two DNA strands is lagging during replication. We observed that APOBEC mutations exhibit a strong 1.96-fold bias towards the lagging strand, suggesting that this is the major mechanism of generation of APOBEC mutations, explaining more than 1/3 of cases. Additionally, we report a 2.3-fold preference of APOBEC mutations for non-methylated cytosines over 5-methylcytosine, and a nearly complete absence of enrichment of APOBEC and non-APOBEC mutations at late replication time in patients with an APOBEC signature. This research provides novel insights into APOBEC mutagenesis and suggests mechanistic explanations for a considerable fraction of APOBEC-induced mutations.
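
A fold bias of the kind quoted above is a ratio of mutation densities on the two strands. A minimal sketch, assuming that mutation counts and counts of available target sites per strand are already known; the function name and all numbers are hypothetical and are not taken from the study.

```python
def strand_fold_bias(mut_lagging, sites_lagging, mut_leading, sites_leading):
    """Ratio of per-site mutation density on the lagging strand to that on
    the leading strand. Values above 1 indicate a lagging-strand bias."""
    if min(sites_lagging, sites_leading) <= 0 or mut_leading <= 0:
        raise ValueError("site counts and leading-strand mutations must be positive")
    density_lagging = mut_lagging / sites_lagging
    density_leading = mut_leading / sites_leading
    return density_lagging / density_leading


# Hypothetical counts chosen only to illustrate the calculation.
bias = strand_fold_bias(mut_lagging=980, sites_lagging=10_000,
                        mut_leading=500, sites_leading=10_000)
print(round(bias, 2))
```

Normalizing by the number of target sites on each strand matters whenever the two strands offer different numbers of mutable positions; with equal site counts the ratio reduces to a ratio of raw mutation counts.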

16.00-16.40  Diversity and Domestication in Oenological Yeasts
    David Sherman

16.40-18.40  Coffee break and Poster session
18.40-19.00  Buses depart for the Conference dinner
19.00  Dinner


Friday, 17 July 2015

09.00-10.00  Morning coffee

10.00-12.40  OMICS-2015 (Skolkovo satellite) (MSU, Biological Faculty, auditorium M1)
10.10  Introduction
    Yuri Nikolsky
10.15  Introduction
    Pakhomov or Meyer
10.20  Introduction of Jury
10.30-12.10  Presentation of projects: Projects 1-10
12.10-12.40  Coffee break

12.40-15.00  OMICS-2015 (Skolkovo satellite), cont'd (MSU, Biological Faculty, auditorium M1)
12.40-12.10 [sic]  Presentation of projects: Projects 11-20
14.20-14.40  Jury Counting
14.40-15.00  Announcement of winners, presentation of awards, conclusion

10.00-13.00  MoBiLe 10Y symposium (Leiden satellite) (MSU, Faculty of Bioengineering and Bioinformatics, auditorium 221)
10.00-10.10  Welcome
    A.E.Gorbalenya, A.Alexeevski


10.10-10.40  Fast evolution of a conserved residue of polyomaviruses defines a new mechanism of adaptation that operates by accelerated codon-constrained Val-Ala (COCO-VA) toggling within an intrinsically disordered protein region
    A.E.Gorbalenya

It is common knowledge that conserved residues evolve slowly. We challenge the generality of this central tenet of molecular biology by describing the fast evolution of a nucleotide position that is among the most conserved in the long overlap of de novo and ancestral open reading frames (ORFs) of a large subset of polyomaviruses. The de novo ORF is expressed through either the ALTO protein or the Middle T antigen (MT/ALTO), while the ancestral ORF encodes the N-terminal domain of the helicase-containing Large T (LT) antigen. In the latter domain, the conserved Cys codon of the LXCXE pRB-binding motif constrains codon evolution in the overlapping MT/ALTO ORF to a binary choice between Val and Ala codons, termed here codon-constrained Val-Ala (COCO-VA) toggling. We found the rate of COCO-VA toggling to approach the speciation rate and to be significantly accelerated compared to the baseline rate of chance substitution in a large monophyletic lineage of MT/ALTO-encoding viruses comprising dozens of species. We then extended this analysis to characterize the evolution of the COCO-VA site within a single polyomavirus species. To this end, we analyzed thirteen mostly newly sequenced genomes of Trichodysplasia spinulosa-associated polyomavirus (TSPyV), representing ~40% of reported cases of Trichodysplasia spinulosa disease in humans worldwide. Only very limited genome variation ( 0.6%) was found, with a total of four nonsynonymous substitutions (NSS). Three of these affected only MT/ALTO, with one NSS, fixed earliest in TSPyV evolution, involving the COCO-VA toggling. Importantly, the COCO-VA site is located in a short linear motif (SLiM) of an intrinsically disordered region, a typical characteristic of adaptive responders. These findings provide evidence that the COCO-VA toggling is under positive selection in TSPyV and many other polyomaviruses that form a monophyletic lineage and infect a wide range of hosts. Thus, the COCO-VA toggling plays a critical role in virus adaptation, which is unprecedented for conserved residues.

10.40-11.00  The impact of biological ageing on RNA processing
    I.Pulyakhina, V.A.Takhaveev, M. Vermaat, M.S. Gelfand, J.F.J. Laros, P.A.C. 't Hoen, BIOS consortium

Ageing of humans has been associated with large-scale changes in gene expression; however, the influence of biological ageing on RNA processing has not yet been extensively studied. In this work we utilized an unprecedented collection of transcriptomes to uncover changes in alternative splicing over the process of human ageing. Whole blood was collected from a large cohort of 626 individuals with a broad age distribution (20-80 years) and subjected to RNA sequencing. Analyzing these RNA-Seq data, we developed a statistical model to evaluate characteristics of alternative splicing, accounting for potential confounding effects of phenotypic traits and age-related shifts in the cell composition of blood. We discovered that the rates of exon skipping and intron retention increase significantly with age, and that the affected genes show no functional selectivity. GC content of the transcriptome was found to increase temporatily, and the changes in alternative splicing were recognized as contributing to that. We discovered that the usage of non-canonical donor splice sites increases with age; furthermore, we show that the number of acceptors paired with one donor significantly increases with age, leading to potential functional changes. Our findings indicate that the splicing machinery undergoes significant age-related changes. They lead to an increased incidence of alternative splicing events such as intron retention and exon skipping, and promote the use of novel splice sites with unconventional nucleotide motifs.

11.00-11.20  Probing-directed structured elements detection in RNA sequences
    Svetlana Vinogradova

Transcripts often harbor RNA elements that regulate cell processes co- or post-transcriptionally. The functions of many regulatory RNA elements depend on their structure, so it is important to determine the structure as well as to scan genomes for structured elements. The best way to do this is to use a comparative genomics approach and search for evolutionarily conserved structures. But a suitable set of homologous sequences with moderate sequence divergence is too often not available, due to the lack of related sequenced genomes or to a rugged fitness landscape resulting in extremely high or low sequence conservation of structured RNAs. In these cases, we have to deal with single RNA sequences. Functional RNAs are more stable than the genomic background, and we used this fact to develop the RNASurface algorithm, which detects putative structured elements in RNA sequences. The sizes of regulatory RNA elements vary from tens to hundreds of nucleotides, which limits computational approaches based on a sliding window. RNASurface does not restrict the search to elements of fixed size but rather detects structured RNAs of optimal lengths. Chemical probing of RNA is an alternative source of structural information: probing reactivities strongly correlate with local nucleotide flexibility. We incorporate probing data into the MFE calculation using a procedure called the 'soft constraint' approach. It is based on pseudo-energies that favour individual positions in the RNA structure being paired or unpaired. One important advantage of our approach is the ability to incorporate any type of experimental data (SHAPE, PARS, DMS, etc.): we deal with probabilities of nucleotides being paired or unpaired instead of relying on arbitrarily normalized data from experiments. Incorporation of RNA probing data into the computational pipeline increases the signal-to-noise ratio of structure prediction and detects more functional structures. However, our method is still dependent on the quality of probing data. Though at the moment high-quality genome-wide RNA probing data for various organisms are not available, we believe that global interrogation of RNA structure will assist computational strategies to better model RNA structure, predict RNA function and screen genomes for functional RNAs.
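
The soft-constraint idea can be sketched as a conversion from per-nucleotide pairing probabilities to pseudo-energies. The log-probability form used here, the temperature, and the function name are assumptions for illustration; the abstract does not state the exact formula used in RNASurface.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol * K)


def pseudo_energies(p_paired, temperature_k=310.15, eps=1e-6):
    """Return (paired, unpaired) pseudo-energies in kcal/mol for one position.

    Assumed form: E = -RT * ln(probability). A position that probing suggests
    is likely paired gets a small penalty for pairing and a large one for
    staying unpaired, and vice versa.
    """
    p = min(max(p_paired, eps), 1.0 - eps)  # clip to avoid log(0)
    rt = GAS_CONSTANT * temperature_k
    return -rt * math.log(p), -rt * math.log(1.0 - p)


# Hypothetical pairing probabilities for three positions.
for p in (0.1, 0.5, 0.9):
    paired, unpaired = pseudo_energies(p)
    print(f"p_paired={p:.1f}  E_paired={paired:.2f}  E_unpaired={unpaired:.2f}")
```

Working with probabilities rather than raw reactivities is what lets different probing chemistries feed the same energy term: each experiment only needs its own mapping from signal to pairing probability.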

11.20-11.50  Mobilis in MoBiLe: Students in a Dynamic Research Field
    O.A. Mayboroda

One of the most quoted definitions of bioinformatics describes it as a discipline which "encompasses almost all computer applications in biological sciences" (Attwood, 1999), but at the same time points out that the term "was originally coined in the mid-1980s for the analysis of the biological sequence data". Indeed, the analysis of biological sequence data dominated the field until analytical instrumentation such as mass spectrometry and NMR became an everyday reality of the biological and medical laboratory. Today, translation of the instrumental raw data into a "computationally efficient" format, and matching of tandem mass spectra to peptide sequences (derived from DNA sequences) or of 1D and 2D NMR data to metabolic libraries, is probably the most dynamic part of the field. Here we present the data processing solutions for proteomics and metabolomics implemented into routine analysis over recent years. An essential element of the selected workflows is the contribution of the MoBiLe program students to their development. Presented with small but intellectually challenging tasks, the students have often had to better define the problem as a first step towards solving it. Examples include a format converter for a novel algorithm matching tandem mass spectra to peptides that had to infer enzyme specificity, and a model for species identification that ended up simulating and comparing bottom-up proteomics experiments. This year, a pair of students (or "Dimitries") will combine anatomical and stage ontologies and controlled vocabularies with mass spectrometry, RNA-Seq or NMR data to produce visual molecular maps projecting the NMR, RNA-Seq or MS data onto representative anatomical drawings of our model systems. Finally, using an ongoing collaboration with the Department of Nephrology as an example, we show how the tools developed by our students can lead to clinically meaningful results.

11.50-12.10  Coffee break
12.10-12.40  Novel insights in the regulation of mRNA transcription, processing and translation through integration of mRNA sequencing data
    P.A.C. 't Hoen

To date, the human transcriptome is known to contain around 80,000 protein-coding transcripts, and the estimated number of proteins synthesized ranges from 250,000 to 1 million. All these transcripts and proteins are coded by fewer than 20,000 genes, suggesting extensive regulation at the transcriptional, post-transcriptional and translational levels. I will discuss how integration of data obtained from diverse RNA sequencing technologies (RNA-seq, deepCAGE, ribosome footprinting) improves our understanding of these regulatory mechanisms, and I will illustrate how these mechanisms jointly orchestrate the changes in protein demands during muscle differentiation. The individual regulatory layers appear to be tightly linked, with extensive cross-talk and feedback between them. To decipher the cross-talk between transcriptional and post-transcriptional regulation, we analysed PacBio® single-molecule long sequencing reads capturing full-length mRNA molecules. These data show that the vast number of potential combinations between alternative transcription start sites, alternatively spliced exons and alternative polyadenylation sites results in a relatively limited number of mRNA species, supporting the tight coupling between these processes. Further integration of RNA sequencing data will elucidate the true complexity of the transcriptome and its multilayered regulation. Resume: Peter-Bram 't Hoen is Associate Professor in Bioinformatics at Leiden University Medical Center. Since 2010, he has been responsible for all bioinformatics activities within the department of Human Genetics. He is leading a multidisciplinary team of researchers (molecular biologists and bioinformaticians) working on transcriptomics (RNA-seq) and proteomics data analysis, modeling of transcriptional networks, (cross-species) data integration, analysis of biological networks, and discovery of molecular biomarkers. He is an expert in RNA sequencing and coordinator of the yearly BioSB "Advanced RNA sequencing data analysis" course. He is also a member of the management team of BBMRI's Biobank-based integrative omics study (BIOS). His main research interest is the regulation of gene expression and the mechanisms controlling alternative transcription, splicing, polyadenylation, and translation.

12.40 13.00

Active chromatin regions are sufficient to define borders of topologically associated domains in D. melanogaster interphase chromosomes E.Khrameeva, S.Ulyanov, A.Gavrilov, Yu.Shevelyov, M.Gelfand, and S.Razin

In Drosophila, interphase chromosomes are organized in topologically associated domains (TADs) within which chromatin-chromatin interactions are frequent, while interactions across domain borders are rare. TAD positions on chromosomes appear to be conserved between cells of different lineages, and even between animal species. However, the molecular mechanisms underlying the partitioning of chromosomes into TADs are poorly understood. Insulator elements have been proposed to play a key role in the definition of TAD borders, but experimental evidence against this hypothesis has recently appeared. Here we used the Hi-C method to map TADs in four Drosophila cell lines of different origin. The cell lines share up to 80% of TAD positions, while cell type-specific TAD borders correlate with transcription changes between cell lines. TADs appear to be self-organizing condensed chromatin domains depleted in active chromatin marks. Active chromatin regions that cannot be organized in compact structures separate TADs and are sufficient to establish TAD borders without the contribution of insulator proteins such as Su(Hw) or CTCF.

13.00-14.00 Lunch break

14.40 15.50 MoBiLe 10Y symposium (Leiden satellite), cont'd (Faculty of Bioengineering and Bioinformatics, auditorium 221)

14.00 14.30

Alignment-free telomere length estimation from whole genome NGS data S.M.Kielbasa

Telomeres are repetitive structures present at each end of a chromatid. They play a role in the maintenance of genome integrity. Due to the nature of the chromosome replication process, the telomeres shorten at each replication cycle. Consequently, the average telomere length decreases over the lifetime of the organism, and it may be used as a marker of the organism's biological age. Here we present a method for accurate estimation of telomere lengths from unaligned whole genome sequencing reads. We developed the method based on a dataset provided by The Genome of the Netherlands (GoNL) project, which generated whole genome sequencing data for 754 samples of 248 Dutch families. For 381 of the samples telomere length measurements were available. These measurements were obtained without the use of next generation sequencing methods. Our method contains two components: a read classifier and a linear model. The read classifier is a fast function for detection of repetitive sequences (in particular the telomeric motif TTAGGG) in read sequences. We apply this function to all reads of a sample and then build a table of counts of reads with various repetitive motifs. Next, based on the read counts table and the available telomere length measurements, we train a linear predictor of telomere length. We demonstrate that the simplest possible predictor, which is based only on the frequency of reads with the telomeric motif TTAGGG, displays a strong sequencing batch bias. When frequencies of a few other repetitive motifs are incorporated into the model, its performance improves significantly. Finally, we compare our predictions with predictions obtained from the telseq algorithm. The telseq estimates show a strong effect of sequencing batch. Moreover, we demonstrate that our method delivers estimates more strongly associated with the individuals' age.
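The two components described above (a read classifier and a linear model) can be illustrated with a minimal sketch. It is not the authors' implementation: the motifs other than TTAGGG, the `MIN_REPEATS` threshold, and all function names are hypothetical choices made for illustration, and the linear model is fitted by plain ordinary least squares.

```python
from collections import Counter

# Hypothetical motif list: the telomeric repeat plus two illustrative variants.
MOTIFS = ("TTAGGG", "TGAGGG", "TTGGGG")
MIN_REPEATS = 4  # assumed number of consecutive copies needed to flag a read


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def classify_read(read, motifs=MOTIFS, min_repeats=MIN_REPEATS):
    """Return the first motif found as a tandem run on either strand, else None."""
    for motif in motifs:
        run = motif * min_repeats
        if run in read or revcomp(run) in read:
            return motif
    return None


def motif_frequencies(reads, motifs=MOTIFS):
    """Fraction of reads assigned to each motif (one feature vector per sample)."""
    counts = Counter(classify_read(r, motifs) for r in reads)
    return [counts[m] / len(reads) for m in motifs]


def fit_linear(X, y):
    """Ordinary least squares with intercept, solved via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, n):
            f = A[k][col] / A[col][col]
            for j in range(col, n):
                A[k][j] -= f * A[col][j]
            b[k] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


def predict(coef, x):
    """Predicted telomere length for one feature vector."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))
```

In this sketch each sample contributes one row of motif frequencies, and the measured telomere lengths of the training samples serve as the response; adding further motif columns corresponds to incorporating other repetitive motifs into the model.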

14.30 14.50

A comparison of two methods for detection of exceptional words in genomic sequences of prokaryotes Ivan Rusinov

An exceptional word is an oligonucleotide whose observed frequency in a genome differs notably from the expected one. Such words are good candidates for functional sites under evolutionary pressure. The maximum order Markov model (Mmax) is widely used to estimate the expected frequency of a short word in a genome. However, real DNA sequences are poorly described by such a model. Karlin et al. proposed another method that takes into account the observed frequencies of all subwords of a word, including degenerate ones, to estimate its expected frequency. We compared Karlin's method with the Mmax-based one in terms of detection of recognition sites of restriction-modification systems avoided in a prokaryotic genome. Restriction sites were chosen as target short words for the comparison because of the high specificity of restriction-modification systems. A significant difference in restriction site representation estimated with the two methods was shown. Thus, the method used has a significant impact on the results. We demonstrated that Karlin's method is more reliable for detection of exceptional words in prokaryotic genome sequences, probably because it uses the frequencies of all site subwords for the representation evaluation.
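For orientation, the maximum order Markov estimate mentioned above can be sketched as follows. The sketch uses the standard estimator, in which the expected count of a word w1...wk is N(w1...wk-1) · N(w2...wk) / N(w2...wk-1), where N denotes an observed count. Function names are hypothetical, only one strand is counted, and Karlin's method is not reproduced here.

```python
def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def expected_mmax(seq, word):
    """Expected count of word under the maximum order Markov model."""
    if len(word) < 3:
        raise ValueError("word must have at least 3 letters")
    middle = count_overlapping(seq, word[1:-1])
    if middle == 0:
        return 0.0
    prefix = count_overlapping(seq, word[:-1])
    suffix = count_overlapping(seq, word[1:])
    return prefix * suffix / middle


def obs_exp_ratio(seq, word):
    """Observed/expected ratio; values far below 1 suggest an avoided word."""
    expected = expected_mmax(seq, word)
    if expected == 0:
        return float("nan")
    return count_overlapping(seq, word) / expected
```

A ratio well below 1 for a restriction site would mark it as under-represented under this model; a real analysis would also count the reverse strand and assess statistical significance.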

14.50 15.10

Peptide search engine approach for detection of translated mutations P.Sinitcyn, S.Tyanova, M.Mann and J.Cox

15.10 15.30

NPG-explorer: a new tool for nucleotide pangenome construction and analysis of closely related prokaryotic genomes Boris Nagaev

Genomes of closely related bacteria have highly similar sequences of orthologous fragments but usually undergo multiple rearrangements, long deletions, insertions of mobile elements and occasionally horizontally transferred regions. We developed a new tool, Nucleotide PanGenome explorer (NPG-explorer), designed for alignment and analysis of a number of closely related input genomes. NPG-explorer constructs a nucleotide pangenome - a set of aligned blocks, each block consisting of orthologous fragments. The minimum block length (default 100 bp) and minimum identity (default 90%) are algorithm parameters. NPG-explorer iterates the block detection algorithm until the following criterion is satisfied: an all-against-all BLAST search of block consensuses detects no hits of appropriate size and identity. Each nucleotide from the input genomes belongs to exactly one block of the NPG (this is the reason for the NPG terminology). Blocks are classified into four categories. Stable blocks (named s-blocks) are composed of one fragment from each genome. Hemi-stable blocks (h-blocks) are represented by one fragment from each genome of a subset of genomes. Repeat-containing blocks (r-blocks) contain more than one fragment from at least one genome. Unique sequence blocks (u-blocks) contain only one fragment of length greater than a threshold. Minor blocks (m-blocks) are blocks of fragments of length less than a threshold. NPG-explorer also constructs a blockset of global and intermediate blocks. Global blocks consist of joined consecutive collinear s-blocks and the sequence fragments between them. Intermediate blocks consist of the sequence fragments between consecutive global blocks. In addition NPG-explorer provides: (1) Multiple alignments of input chromosomes represented by a sequence of block identifiers. These alignments allow detection of chromosomal rearrangements. (2) A file with consensus sequences of all blocks and a file with a description of all mutations with respect to the consensuses. Thus, all input genome sequences can be completely reconstructed from these two files.
(3) Phylogenetic trees of core blocks and of whole genomes. Core blocks are those that contain exactly one fragment of each genome. These trees are computed on the basis of diagnostic positions in block alignments. (4) All gene annotations, mapped onto blocks. These data are useful for detection and correction of mis-annotations, gene corruption, etc. Using NPG-explorer we constructed nucleotide pangenomes of five sets of genomes: 17 complete genomes of the Brucella genus (56 Mb in total), 39 partially completed genomes of the Brucella genus (129 Mb in total), 12 genomes of Yersinia pestis (55 Mb), 8 genomes of Rickettsia rickettsii (10 Mb), and 5 genomes of Burkholderia cenocepacia (38.5 Mb). In the Brucella pangenome there are 653 stable blocks covering 91.5% of the summed length of all genomes. Identity within the joined alignment of s-blocks is 99.2%, showing high sequence similarity of all genomes. The program detected 33 global blocks. The phylogenetic tree of genomes computed by NPG-explorer using diagnostic positions is in agreement with published data for 10 Brucella genomes. The program found a large translocation from the first to the second chromosome in Brucella suis ATCC 23445 and a large inversion in chromosome 2 of Brucella abortus, both described earlier. The NPG-visualization tool interactively presents a list of blocks, the alignment with mapped genes, and alignments of block identifiers. NPG-explorer is written in C++ and is licensed under the GNU GPL. A simple script language for invoking program modules is introduced.
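The block categories described in this abstract can be expressed as a small classification function. This is only an illustrative sketch in Python (NPG-explorer itself is written in C++): the function name, the representation of a block as a list of `(genome_id, length)` pairs, the order in which the rules are checked, and the reuse of 100 bp as the u/m length threshold are all assumptions.

```python
from collections import Counter


def classify_block(fragments, n_genomes, min_length=100):
    """Classify a block given as a list of (genome_id, fragment_length) pairs.

    Returns one of "s", "h", "r", "u", "m". The rule order is an assumption.
    """
    per_genome = Counter(genome for genome, _ in fragments)
    if all(length < min_length for _, length in fragments):
        return "m"  # minor: only short fragments
    if any(count > 1 for count in per_genome.values()):
        return "r"  # repeat-containing: some genome contributes several fragments
    if len(fragments) == 1:
        return "u"  # unique: a single sufficiently long fragment
    if len(per_genome) == n_genomes:
        return "s"  # stable: exactly one fragment from every genome
    return "h"      # hemi-stable: one fragment from each genome of a subset


blocks = {
    "b1": [("g1", 500), ("g2", 510), ("g3", 495)],
    "b2": [("g1", 300), ("g3", 310)],
    "b3": [("g1", 400), ("g1", 390), ("g2", 405)],
}
labels = {name: classify_block(frags, n_genomes=3) for name, frags in blocks.items()}
```

Under these assumptions, core blocks as defined above coincide with the blocks labelled "s".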

15.30 15.50

O2PLS as an integrative tool in systems oncology E.Nevedomskaya and H.Keun

Altered metabolism is a universal characteristic of cancer that is implicated in such clinically relevant phenotypes as metastasis and chemotherapy resistance. Regulation of metabolic reprogramming in the heterogeneous genomic context of cancer is poorly understood. Systematic integration of omics data can unravel the interconnectivity of multiple components of this regulation. We approached such integration through joint analysis of metabolic, gene expression and microRNA data. For this we employed a statistical integration method, O2PLS, to combine data from the well-characterized NCI-60 cancer cell line panel. O2PLS is a generalization of the OPLS approach, which combines orthogonal signal correction (OSC) and Partial Least Squares (PLS) analyses. OPLS separates variation in the data matrix X into the following parts: variation correlated to the response Y, variation systematically unrelated (orthogonal) to Y, and the residual variance. Such a separation allows the sources of variation to be examined. With the bidirectional O2PLS method we were able to focus on the correlations of interest between sets of multidimensional data and achieve improved interpretability of the results. With this work we demonstrate that O2PLS is a versatile tool for data integration through joint analysis of metabolomic, transcriptomic and microRNA data. We combined knowledge- and literature-based selection of molecules of interest (based on GWAS, metabolic reconstruction and target prediction) with rigorous cross-validation to identify correlations of interest between metabolites and microRNAs, as well as between metabolites and mRNAs. We identified microRNA modules associated with catabolic and anabolic processes, and defined NT5E as a novel regulator of cancer metabolism. We confirmed the observed correlations using other datasets and furthermore demonstrated the implication of NT5E in intrinsic and acquired resistance to chemotherapy in ovarian cancer and various cancer subtypes.
With this we present an integrative biology approach to the study of cancer cell molecular profiles ('systems oncology') that facilitates discovery of novel players in cancer metabolism, progression and therapy resistance.

Session: Medical bioinformatics (MSU, Biological Faculty, auditorium M1)

15.20 15.50

Building a Sustainable Bioinformatics Program Through Integrated Support Michael Tartakovsky

The rapid growth of advanced computational research methods being applied across the sciences increasingly demands support beyond the capabilities of the average information technology (IT) department. Nowhere is this more evident than in the field of bioinformatics. The necessary cyberinfrastructure - both hardware and people - must be highly specialized, but also diverse, to meet the needs of a broad range of users and applications. The Office of Cyber Infrastructure and Computational Biology (OCICB) at the National Institute of Allergy and Infectious Diseases (NIAID), part of the U.S. National Institutes of Health (NIH), coordinates IT resources and training for a staff of over 4000 people, including over 2300 research scientists and scientific support staff located in the US and abroad. The mission of OCICB is to strategically enhance the Institute's capabilities in clinical informatics and bioinformatics, and to ensure that NIAID researchers can access and fully utilize the most advanced bioinformatics tools available. To accomplish this, OCICB brings together multidisciplinary teams of engineers, developers, analysts, and specialists to provide a broad suite of scientific services and resources tailored to the NIAID research community. Highly trained, doctoral-level research scientists are embedded throughout OCICB, specializing in structural biology, biostatistics, phylogenetics, systems biology, and the many 'omics fields. By including these subject matter experts within day-to-day operations, OCICB ensures that the needs of the primary end users - NIAID research scientists - will be met. This collaborative approach has been critical to the success of OCICB. For example, it was the computational biologists at OCICB who advocated for the creation of a high-performance computing cluster when they recognized the potential impact of next generation sequencing when it first became commercially available in 2005. The NIAID cluster provides robust, reliable, cost-effective, and scalable infrastructure. As additional compute power becomes a necessity, just-in-time modular upgrades provide continuous improvement and up-to-date systems, as opposed to filling rack space with outdated, unused servers. Researchers were polled at the Institute level to determine unmet needs in using the data center. Based on the feedback provided, training series were created to address the gaps. Additionally, this feedback has helped OCICB focus on up-and-coming research areas. By reaching out to stakeholders in early stages and creating pilot programs, OCICB is able to get ahead of the curve and provide transformational tools, allowing organic growth to take place.

15.50 16.20

Network of the Country Tuberculosis Portals Alexander Rosenthal

Tuberculosis (TB) is a major global public health problem. The recent escalation in the occurrence of the disease has been complicated by the emergence and spread of multidrug-resistant tuberculosis (MDR TB) and extensively drug-resistant tuberculosis (XDR TB), as well as HIV/TB coinfection. The need for fast, precise diagnostics of resistant TB and for new efficient anti-TB drugs calls for an integrated approach and multicenter, multicountry collaborations. The ability of the worldwide community of TB researchers to understand the nature of the disease will be greatly improved by using a common database containing anonymized medical images, treatment information, lab work, clinical data, and bacterial genomes. The value of such a database would be further increased if it contained unique patient cohorts, molecular information on the coexistence of multiple Mycobacterium tuberculosis strains, and tools for genomic and bioimaging analysis. The aim of the Network of the Country Tuberculosis Portals initiative is to maintain a network of open access tuberculosis centers that use common database architectures, user interfaces, programmatic solutions, and medical and scientific nomenclature. This unified approach can facilitate adherence to treatment protocols, serve as a consistent repository of records and present a rich source for tuberculosis-related data mining and epidemiological studies.

Session: Cancer-1 (M1) 16.20 17.00 Deciphering gene interaction to explain tumor progression Emmanuel Barillot

17.00-17.30 Coffee break

Session: Cancer-2 (M1)

17.30 17.50

A model for scoring damaging mutations in the noncoding tumoral genome based on germline and tumor data Jia Li

Cancer driver mutations are somatic events that promote tumor growth or metastasis. Previous computational studies have largely focused on driver mutations located in protein-coding exons that change amino acid residues with damaging effects. However, non-coding RNA (ncRNA) genes and non-coding parts of coding genes (introns, UTRs) are now emerging as significant players in the regulation of gene expression and potentially in tumor progression. There is an urgent need for methods that can evaluate the effect of somatic mutations in such non-coding regions and prioritize mutations for further scrutiny. Here we develop two random forest models for predicting germline and somatic mutation constraints in any non-coding region. These models combine functional features from ENCODE and other genome surveys, using as response variables the mutational constraints provided by the 1000 Genomes Project (germline model) and by collections of tumor whole genome sequences (somatic model). We show that each model reflects a different set of constraints acting on the normal and tumor genome, and we identify the specific features (such as conserved elements and histone marks) that contribute most to these constraints. Furthermore, high-scoring regions defined by each model are enriched in known disease-related mutations, indicating that we can use the resulting scores as a proxy for damaging non-coding mutations. We combine both models to predict regions in ncRNAs and in introns/UTRs of protein-coding genes where mutations are most likely to be damaging. This system paves the way for the detection of non-coding driver genes and regulatory elements in cancer.

17.50 18.10

Selection pressure on breast cancer somatic mutations revealed by bioinformatics sequence analysis Ivan Kulakovskiy

Among the different variations of the human genome, single nucleotide variants (SNVs) are the most common. SNVs located in coding regions may directly affect the function of a particular protein through alterations of the protein sequence and, consequently, the structure. Nucleotide substitutions occurring in regulatory regions do not alter the protein but may change the expression of the corresponding genes. In particular, SNVs in promoters and enhancers may alter transcription factor (TF) DNA binding and thus affect the efficiency of transcription initiation. With the binding patterns of hundreds of human TFs known, it is finally possible to predict regulatory effects of mutations purely by sequence analysis in silico. In the past, SNVs were primarily studied in a population context as single-nucleotide polymorphisms (SNPs). High-throughput sequencing gave rise to fundamentally new data on somatic mutations, in particular, those emerging in cancer. Here we discuss a new version of PERFECTOS-APE (Vorontsov et al., 2015), the software to PrEdict Regulatory Functional Effect of SNVs by Approximate P-value Estimation. We applied PERFECTOS-APE to analyze somatic mutations detected in 21 breast cancer samples by Nik-Zainal et al., 2012. Using the HOCOMOCO (Kulakovskiy et al., 2013) collection of transcription factor binding patterns, we identified TFs whose binding sites were affected by somatic substitutions in breast cancer cells. Binding sites of several transcription factors were damaged by mutations significantly more often than expected by chance. At the same time, for dozens of transcription factors binding sites were protected from mutations, i.e. were affected by them significantly less often than expected by chance. We believe this is evidence for positive and negative selection of cancer somatic mutations in regulatory regions.

18.10 18.30

Searching For Essential Cancer Proteins: Analysis Of Hypomutated Genes In Skin Melanoma Mikhail Pyatnitskiy

We propose an approach to the detection of essential genes/proteins required for cancer cell survival. A gene is considered essential if a mutation with high impact on the function of the encoded protein causes death of the cancer cell. We draw an analogy between essential cancer proteins and Abraham Wald's well-known work on estimating the critical areas of a plane using data on the survivability of aircraft encountering enemy fire. Wald reasoned that the parts hit least on the returned planes are critical and should be protected more. Similarly, we propose that genes essential for the tumor cell should carry fewer high-impact mutations in cancer compared to polymorphisms found in normal cells. We used data on mutations from the Cancer Genome Atlas and polymorphisms found in healthy humans (from the 1000 Genomes Project) to predict 91 protein-coding genes essential for melanoma. These genes were selected according to several criteria including negative selection, expression in melanocytes and a decrease in the proportion of high-impact mutations in cancer compared with normal cells. Gene ontology analysis revealed enrichment of essential proteins related to the membrane and cell periphery. We speculate that this could be a sign of immune system-driven negative selection of cancer neo-antigens. Another finding is the overrepresentation of semaphorin receptors, which can mediate distinctive signaling cascades and are involved in various aspects of tumor development. The cytokine receptors CCR5 and CXCR1 were also identified as cancer essential proteins, and this is confirmed by other studies. Overall, our goal was to illustrate the idea of detecting proteins whose sequence integrity and functioning are important for cancer cell survival. Hopefully, this prediction of essential cancer proteins may point to new targets for anti-tumor therapies.

18.30 19.10

Revealing mechanisms of cancer progression by pan-cancer deconvolution of tumoral transcriptomes Andrei Zinovyev

Large-scale projects are generating massive amounts of molecular profiles for tumoural samples. It is a big challenge to establish a "catalogue" of signals that shape the tumoural transcriptomes in a cancer type-specific manner and of signals common to many cancer types, as well as to distinguish them from the commonly observed technological and other biases and from signals coming from the tumoural microenvironment. In other words, we need to decipher the tumoural transcriptome in order to focus on specific mechanisms that can be targeted in therapy. One of the most suitable methodologies for this decoding comes from the signal processing field and is connected to linear matrix factorization, such as the method of Independent Component Analysis (ICA). We analysed data on nine different cancers from 21 patient cohorts and 6671 tumours and identified their commonalities, as well as the cancer type-specific characteristics. By careful interpretation of ICA results, we managed to distinguish the signals coming from tumoural cells from those coming from the tumour microenvironment, and clearly identified signals associated with technological biases and with different treatments of the tumour tissue. New insights were obtained in bladder cancer. The projections of the tumours on the different components allow characterization and comparison of any predefined subgroups of tumours. We could thus distinguish for the first time FGFR3-mutated tumours and RAS-mutated tumours. The analysis of a bladder cancer-specific component led to the identification of PPARG as an oncogene controlling both differentiation and proliferation in bladder tumours, and this prediction was verified experimentally. We showed that the information captured in independent components is also reflected in anatomopathological staining microscopy images.

Saturday, 18 July 2015

09.00-10.00 Morning coffee

Session: Methods and algorithms (M1)

10.00 10.20

Analysis of variation in 3000 rice genomes project Tatiana Tatarinova

Rice is the staple food for half the world population, particularly for poor developing countries in Asia. Remarkably, rice has significant within-species genetic diversity. Traditional rice varieties encompass a huge range of potentially valuable genes. These can be used to develop superior varieties for farmers to take part in the uphill battle of feeding an ever-increasing world population (estimated to reach 9.6 billion by 2050). The genes linked to valuable traits can help breeders create new rice varieties that have improved yield potential, higher nutritional quality, better ability to grow in problem soils, and improved tolerance of pests, diseases, and the stresses, such as flood and drought, that will be inevitable with future climate change. Much of this diversity is conserved within the International Rice Genebank Collection (IRGC) at the International Rice Research Institute (IRRI). In the framework of the 3,000 Rice Genomes Project, IRRI and collaborators have completed the sequencing of 3,000 rice genomes of varieties and lines representing 89 countries. The 3,000 Rice Genomes Project is funded by the Bill and Melinda Gates Foundation and the Chinese Ministry of Science and Technology. The project's entire 13.4-terabyte dataset was released in 2014 in an open-access database, GigaDB, which instantly quadrupled the previous amount of publicly available rice sequence data. The dataset contains genome sequences (averaging 14X depth of coverage) derived from 3,000 accessions of rice with global representation of genetic and functional diversity. The availability of 3K rice genomes provided a unique opportunity to explore the variability of different functional regions of the genome. We focused our analysis on those regions that are most likely to be enriched in transcription factor binding sites, such as promoters and 5' and 3' UTRs. We have examined the distribution of SNPs, known transcription factor binding sites, and DNA methylation in those regions.
We observe increased sequence conservation in these regions and hypothesize that unusually conserved motifs in these regions have biological significance. We found the most conserved motifs and performed an enrichment analysis for these motifs in various biological processes. We applied our reAdmix tool to the analysis of 3000 rice genomes, using currently sequenced varieties of wild rice as a reference. We present a novel plantMix pipeline for analysis of domesticated species using their wild relatives.

10.20 10.40

Distance-based profiling aids in evaluation of ageing-related phenomena Ancha Baranova

In a typical biological assay performed in a high-throughput mode, either expression levels for individual genes or other quantifiable variables are assessed in parallel. These variables can be represented as dimensions of the information space that we study. In high dimensional space, the data become sparse. In other words, when a data set contains a large number of attributes, we are faced with a choice of either completely suppressing most of the data or losing the desired level of statistical significance for any possible finding. The problem outlined above is known as the "curse of dimensionality". There is a need to develop integrative approaches capable of combining data from multiple high-throughput experiments to increase sample size, or statistically sound and robust techniques to reduce the data to the most informative features. In our previous studies, we developed a novel approach based on the "distances" in the multidimensional space of gene expression values. As a proof of principle, we showed that this approach produces surprisingly good results in the separation of normal and affected samples both for analysis of human malignancies and for chronic progressive conditions like psoriasis. In the current work, we applied distance-based metrics to the problem of quantification of ageing and age-related phenotypes. Aging has been an intriguing field of study for biologists for decades. As cells experience stress and damage from internal and external factors, they normally progress toward cellular senescence, at which point they cease to replicate but acquire pro-inflammatory features. This process comes with significant changes in the gene expression profile (GEP) of the cell. Here we performed a systematic classification of gene expression profiles from 12 microarray datasets. Samples from multiple disorders and healthy controls that were taken from various tissues were included. The array data were grouped and analyzed by the age of the donor. The Pearson correlation coefficient and the Kolmogorov-Smirnov statistic were used to compare GEPs between different groups. In this way, we built a holistic marker taking into account the quantifiable expression levels of all genes assayed, rather than extracting top-ranked features as markers.
In our analysis, the cumulative gene expression pattern of an individual patient is considered as a whole and is represented as a data point in a multidimensional space formed by all gene expression features assayed in the given system. The degree of separation between samples indicates the drift of the test samples away from the cellular stable state in the process of cellular senescence. The classifiers showed clear separation between different age groups, as verified by k-fold cross-validation. The holistic marker was further compared with specific markers extracted based on the ranking of statistical significance. The performance of the classifiers was evaluated by the receiver operating characteristic (ROC) curve. As an example of the analysis, here we show linear distance plots for dataset GSE13330. In the respective experiment, human foreskin BJ fibroblasts were mock or bleomycin sulfate-treated (100 µg/ml, Sigma, St. Louis, MO) for 24 hrs, while replicatively senescent fibroblasts were obtained by continuous passage. After 72 hr serum starvation, RNA was collected and biotinylated cRNA was hybridized to Affymetrix Human Genome U133 Plus 2.0 GeneChips (Affymetrix, Santa Clara, CA) in the Washington University Microarray Facility. There were 4, 6, and 6 samples for Stress-Induced Prematurely Senescent (SIPS), Replicative Senescent (RS), and Young, respectively. Our distance-based marker demonstrates that the predictive power of global signatures is as good as that of specific markers, yet with better robustness and reproducibility. The classifiers may be used to identify the aging status of tissues and to verify whether disease-based aging models resemble the normal aging process.
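The idea of treating a whole expression profile as one point and measuring its drift from a reference state can be sketched as below. The abstract does not specify how the distance is computed; taking 1 minus the Pearson correlation to the centroid of a reference group is an assumption made for illustration, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def centroid(profiles):
    """Gene-wise mean of a group of expression profiles."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def correlation_distance(profile, reference):
    """Assumed distance: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(profile, reference)


def distances_to_group(samples, reference_group):
    """Distance of each sample to the centroid of the reference (e.g. young) group."""
    ref = centroid(reference_group)
    return [correlation_distance(s, ref) for s in samples]
```

Because every assayed gene contributes to the distance, no feature selection step is needed, which is the sense in which the marker is holistic.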

10.40 11.00

Scaffold assembly based on genome rearrangement analysis Max Alekseyev

Advances in DNA sequencing technology over the past decade have increased the volume of raw sequenced genomic data available for further assembly and analysis. While there exist many algorithms for assembly of sequenced genomic material, they often experience difficulties in constructing complete genomic sequences. Instead, they produce long genomic subsequences (scaffolds), which then become subject to scaffold assembly aimed at reconstructing their order along genome chromosomes. The balance between reliability and cost for scaffold assembly has not yet been reached, which motivates the search for new approaches to this problem. We present a new method for scaffold assembly based on the analysis of gene orders and genome rearrangements in multiple related genomes (some or even all of which may be fragmented). Evaluation of the proposed method on artificially fragmented mammalian genomes demonstrates its high reliability. We also apply our method to incomplete Anophelinae genomes, which exhibit high fragmentation, and further validate the assembly results with reference-based scaffolding. While the two methods demonstrate consistent results, the proposed method is able to identify more assembly points than the reference-based scaffolding.

11.00 11.20

A method for model comparison based on the parameter sensitivity measures Ekaterina Myasnikova

In modeling of complex biological systems one often faces a trade-off between over-simplification of the mechanisms underlying the modelled biological processes and over-parameterization of the model. In the former case the model may turn out to be unrealistic, while in the latter case the fitting to experimental data may lead to non-identifiable parameter estimates. Methods for the analysis of parameter sensitivity and identifiability may give a clue to the correct choice of the level of model detail. In our previous work (Myasnikova & Kozlov, 2014) we introduced quantitative measures of the model prediction power based on relative sensitivity to parameters. We propose a modified version of the method based on similar principles and designed to compare models of different complexity describing the same biological system. The idea of the method is to make sure that the added model complexity is practically reasonable by checking the sensitivity of the model to the additionally introduced subset of parameters and their identifiability. The method performance is demonstrated on the model of transcriptional control of the Drosophila melanogaster even-skipped gene published in (Janssens et al., 2006).
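As a generic illustration of relative sensitivity to parameters, the sketch below estimates the normalized derivative (p_i / f) · ∂f/∂p_i of a scalar model output by central finite differences. It is not the measure defined in (Myasnikova & Kozlov, 2014); the function name, the step size and the toy models are assumptions.

```python
def relative_sensitivities(model, params, rel_step=1e-6):
    """Normalized sensitivities (p_i / f) * df/dp_i at the point params.

    model: function mapping a list of parameters to a nonzero scalar output.
    """
    base = model(params)
    result = []
    for i, p in enumerate(params):
        h = abs(p) * rel_step or rel_step  # fall back to rel_step when p == 0
        up = list(params)
        down = list(params)
        up[i] = p + h
        down[i] = p - h
        derivative = (model(up) - model(down)) / (2 * h)
        result.append(derivative * p / base)
    return result


# Toy comparison: an extended model with one extra parameter that barely matters.
def simple_model(p):
    return p[0] ** 2 * p[1]


def extended_model(p):
    return p[0] ** 2 * p[1] + 1e-9 * p[2]


sens_simple = relative_sensitivities(simple_model, [2.0, 3.0])
sens_extended = relative_sensitivities(extended_model, [2.0, 3.0, 1.0])
```

In this toy case the sensitivity to the third parameter is close to zero, which is the kind of signal that would argue against keeping the extra parameter.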

11.20 11.40

Method to predict the percentage of cell types in human blood Anna Igolkina

11.40-12.10

Coffee break

Session: Systems biology: Mammals and beyond (M1)

12.10 12.50

Mammalian Systems biology Alistair Forrest

We are complex multicellular organisms composed of hundreds of different cell types.

The specialization of cell types and division of labour allows us to have coordinated complex functions such as responding to pathogens, movement and maintaining homeostasis. In the FANTOM5 project we have been interested in identifying the complete set of transcribed objects in the human genome and then predicting how they work together in the context of transcriptional regulatory networks (TRN). Each primary cell type runs a different version of the TRN based on the set of gene products it expresses. In addition, the FANTOM5 CAGE data reveal a wealth of cell-type-specific enhancers that are expressed in a very specific manner. Understanding the cell-type-specificity of these elements and promoters is key to building cell-type-specific TRNs. Lastly, we go beyond the TRNs and examine cell-cell signaling within a multicellular organism. By identifying the sets of protein ligands and receptors expressed in any given human cell type we have made the first draft cell-cell communication network map.

12.50 13.30

Genomics of Lifespan Control Vadim Gladyshev

Understanding the mechanisms that control lifespan is among the most challenging biological problems. Many complex human diseases are associated with aging, which is both the most significant risk factor and the process that drives the development of these diseases. It is clear that the aging process and the maximum lifespan of species can be regulated and adjusted. For instance, mammals are characterized by a >100-fold difference in lifespan, which can both increase and decrease during evolution. We employ this diversity in mammalian lifespan and the associated life-history traits to shed light on the mechanisms that regulate species lifespan. For this, we use methods of comparative genomics to examine the genomes of exceptionally long-lived species and carry out analysis of lifespan across a panel of mammals. We sequenced the genomes of several mammals with exceptional lifespan, including the naked mole rat, the Damaraland mole rat, and Brandt's bat, and identified genes that may contribute to their longevity. We also apply transcriptomics and metabolomics approaches to analyze the molecular basis of adaptations associated with longevity across mammals. These studies point to both lineage-specific and common processes involving various pathways. It is our hope that a better understanding of the molecular mechanisms of mammalian lifespan control will lead to a better understanding of human diseases of aging.

13.30-15.00

Lunch break

Session: Genomics of regulation (M1)

15.00 15.20

Computer analysis of genome co-localization of transcription factor binding sites based on ChIP-seq data Arthur Dergilev

The problem addressed is the study of transcription factor binding site (TFBS) colocalization in mammalian genomes using ChIP-seq data. ChIP-seq technology, which combines chromatin immunoprecipitation (ChIP) with high-throughput DNA sequencing, makes it possible to determine transcription factor binding sites at genome scale. The tasks that arise in analyzing genome-wide ChIP-seq data are to identify the coordinates of TFBS and to

compare their location with genomic annotation (relative location and distance to gene transcription start sites, promoter regions etc.). In addition to determining the location of binding sites for a single transcription factor, there is the problem of identifying clusters of sites for different transcription factors that co-occur or lie at short distances (100-200 nt) from each other on chromosomes, which suggests similar function and regulatory mechanisms. This requires programs that process large volumes of text data (bed, wig files) and identify intersections of genomic annotations (coordinates) for the respective model genomes. We developed a set of scripts for TFBS location analysis. To study site clusters, we used ChIP-seq data on the binding sites of 15 different transcription factors in the mouse genome. A computer program in C++ was developed to calculate the relative positions of TFBS and their clusters. Methods for detecting complex signals and patterns based on the "Discovery" algorithm (program GeneDiscovery), previously developed within the theory of data analysis (Data Mining, Knowledge Discovery) for signals in DNA segments, were used to analyze clusters of binding sites. We confirmed the separation of TFBS clusters in the mouse genome (embryonic stem cells) into classes represented by Oct4, Nanog, and Sox2 on the one hand and c-Myc on the other. The analysis was extended to the exact location of nucleotide motifs in ChIP-seq peaks relative to each other and to iterative correction of such motifs.
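The clustering step described above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' C++ program: the input layout (chromosome, position, factor name), the function name and the 200 nt default are assumptions for the example.

```python
from itertools import groupby


def cluster_sites(sites, max_gap=200):
    """Group binding sites whose neighbours lie within max_gap nt.

    `sites` is an iterable of (chromosome, position, factor) tuples.
    Only clusters containing at least two different factors are returned.
    """
    clusters = []
    ordered = sorted(sites, key=lambda s: (s[0], s[1]))
    for _, group in groupby(ordered, key=lambda s: s[0]):
        current = []
        for site in group:
            # Start a new cluster when the gap to the previous site is too large.
            if current and site[1] - current[-1][1] > max_gap:
                clusters.append(current)
                current = []
            current.append(site)
        clusters.append(current)
    return [c for c in clusters if len({s[2] for s in c}) >= 2]


if __name__ == "__main__":
    example = [
        ("chr1", 100, "Oct4"),
        ("chr1", 250, "Sox2"),
        ("chr1", 1000, "c-Myc"),
        ("chr2", 50, "Nanog"),
        ("chr2", 120, "Nanog"),
    ]
    for cluster in cluster_sites(example):
        print(cluster)
```

Sorting once and sweeping each chromosome keeps the work close to linear in the number of sites, which matters for genome-wide bed files.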

15.20 15.40

Search for simple and composite auxin responsive elements in Arabidopsis thaliana genome Victoria Mironova

The hormone auxin is a major regulator of plant growth and development. The influence of auxin on gene transcription is primarily mediated through Auxin Response Factors (ARFs). ARFs bind in target promoters to specific sites called AuxREs (Auxin Response Elements) with the TGTCNN (most frequently TGTCTC) consensus core sequence. While ChIP-seq data for most ARFs are still unavailable, prediction of potential AuxREs is restricted to consensus models that detect too many false positive sites. About half of the Arabidopsis thaliana genes have at least one TGTCTC in either orientation within the first 1000 nt of their promoter regions. While a single TGTCTC hexamer does not confer auxin inducibility (Ulmasov et al. 1997), this is provided by multimerized (Guilfoyle et al. 1998) or composite AuxREs (Ulmasov et al. 1995). In composite AuxREs, TGTCNN adjoins or overlaps with coupling elements (Ulmasov et al. 1995; Guilfoyle et al. 1998). We performed a bioinformatic analysis of the distribution of simple and composite AuxREs in the Arabidopsis thaliana genome. AuxREs were recognized by three different models: (1) the simple TGTCNN consensus, (2) TGTCNN pairs with a certain distance between them and (3) a combination of the oPWM and SiteGA tools (AuxREP&S) (Mironova et al., 2014). To test which model better predicts AuxREs associated with auxin response, we performed a meta-analysis of 23 publicly available microarray experiments with auxin treatments (Mironova et al., 2014). First, we created a list of auxin-regulated genes that significantly changed their expression (by more/less than 1.5/0.67-fold, p<0.05) in at least four microarrays. The threshold for the number of microarrays was set by the binomial trial estimate. The resulting list contained 1301 up-regulated and 1262 down-regulated genes. Second, the fractions of significantly up- or down-regulated genes with an AuxRE variant in their promoter were compared with that for all the genes tested in the experiment.
The statistical significance of the difference between the fractions was estimated by the t-test

for arcsine square-root transformed proportions. This analysis showed that all three models predicted AuxREs that were enriched in auxin-responsive genes, but the gene sets differed. For example, AuxREP&S sites, which were highly associated with auxin response, were predicted in about 10% of auxin-responsive genes and were associated only with up-regulation, whereas several variants of the TGTCNN consensus were significantly associated with auxin down-regulation. Additionally, we performed a context analysis of the flanks of experimentally proven AuxREs and found three distinct types of potential coupling motifs (Y-patch, AuxRE-like, and ABRE-like) (Mironova et al., 2014). A similar bioinformatic analysis of associations in a number of microarray datasets showed that composite elements with a specific orientation of the AuxRE and the coupling motif and a certain range of spacer lengths between them were associated with auxin responsiveness. The methodology proposed in this work is suggested for annotating cis-regulatory elements in cases where the cis-element is associated with a response to physiological and ecological factors.
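The simplest of the three models, the TGTCNN consensus in either orientation, and the comparison of arcsine square-root transformed proportions can be sketched as below. The function names are hypothetical, and the statistic shown is the standard normal-approximation form (variance of asin(sqrt(p)) taken as 1/(4n)), which may differ in detail from the t-test used by the authors.

```python
import re
from math import asin, sqrt

# Lookahead patterns so that overlapping hits are all reported.
FORWARD = re.compile(r"(?=(TGTC[ACGT]{2}))")
REVERSE = re.compile(r"(?=([ACGT]{2}GACA))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_auxre_cores(promoter):
    """Return sorted (start, strand, hexamer) hits of TGTCNN on both strands."""
    seq = promoter.upper()
    hits = [(m.start(), "+", m.group(1)) for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-", reverse_complement(m.group(1)))
             for m in REVERSE.finditer(seq)]
    return sorted(hits)


def arcsine_statistic(p1, n1, p2, n2):
    """Difference of arcsine square-root transformed proportions over its SE."""
    difference = asin(sqrt(p1)) - asin(sqrt(p2))
    return difference / sqrt(1 / (4 * n1) + 1 / (4 * n2))


if __name__ == "__main__":
    print(find_auxre_cores("AATGTCTCGGGAGACAAA"))
    print(arcsine_statistic(0.5, 100, 0.25, 100))
```

Reporting minus-strand hits as their reverse complement means every hit reads as TGTCNN, so variants such as TGTCTC can be counted directly regardless of strand.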

15.40 16.00

Antisense interactions of long noncoding RNAs in human cells Ivan Antonov

The hybridization of two RNA molecules is called antisense interaction. These interactions are usually based on long (>100 bp) highly complementary duplexes that correspond to transcripts produced from overlapping genes (cis-interactions) or are based on Alu repeats. It was also hypothesized that RNA-RNA hybridization can be based on several short antisense sites (trans-interaction). Recently it has been demonstrated by RNA pull-down assay that a cytoplasmic long noncoding RNA (lncRNA) is capable of binding hundreds of mRNAs in human cell lines. The identified transcripts did not have long antisense duplexes with the lncRNA but rather several short (< 30 bp each) duplexes, thus suggesting the possibility of trans-antisense interactions. To check this hypothesis we used thermodynamics-based tools to compute the energy of the putative lncRNA-mRNA interactions. We have shown that the energies for all the pull-down genes are an order of magnitude weaker than for functional cases of cis and Alu-based duplexes of similar total length. Moreover, the energies for the majority of these duplexes are comparable to the values observed in random simulations, suggesting that such pull-down transcripts are indirectly associated with the lncRNA and should not be considered RNA-RNA interactions. Nevertheless, in the two analyzed pull-down experiments we found 12 and 17 cases, respectively, of putative trans-antisense interactions: lncRNA-mRNA pairs with energies significantly stronger than in random simulations. We thus continued the search for functional trans-antisense duplexes, focusing on regulatory RNA-RNA interactions. Ab initio analysis performed for 71 lncRNAs expressed in the HEK293 cell line identified 12 potential cases of regulatory trans-antisense interactions that await experimental validation.

Session: Viruses - 1 (M1)

16.00 16.40

Challenges in virus genomics Manja Marz

Computer-assisted studies of structure, function, and evolution of viruses remain a

neglected area of research. The attention of bioinformaticians to this interesting and challenging field is far from commensurate with its medical and biotechnological importance. The purpose of this talk is to increase awareness among bioinformatics researchers about the pressing needs and unsolved problems of computational virology. I focus primarily on RNA viruses that pose problems to many standard bioinformatics analyses due to their compact genome organization, fast mutation rate, and low evolutionary conservation.

16.40-17.10

Coffee break

Session: Viruses - 2 (M1)

17.10 17.30

Sequence and structural analysis of related proteins in distant viral species Olga Kalinina

Unlike cellular organisms, viruses do not constitute a monophyletic group in which the phylogenetic history can be traced back to a common ancestor. The origin and relatedness of different virus families is currently a subject of active discussion. It is unclear whether viruses have evolved by reduction of many essential genes from cellular species, descend from mobile elements of other organisms, or precede cellular life and are ancient self-replicating units. Possibly, each of these hypotheses is true for a subset of viral families. The recent discovery of giant viruses revived this discussion with suggestions that a certain clade of them may represent a fourth domain of life. Analysis of evolutionary relationships between distant viral families presents particular difficulties, since the sequence similarity of viral proteins is rarely detectable outside the immediate viral family. We have performed an all-to-all sequence and structural comparison of viral proteins, and focused on cases where similarity is detected between proteins from viruses that use different types of nucleic acid to encode their genome. We can split the corresponding protein families into families with balanced and unbalanced distribution of viral genome types. For the former category, we recapitulate viral hallmark genes (i.e. genes characteristic only of viruses and present in diverse species) and other known wide-spread viral proteins, providing the first comprehensive analysis of these cases. The protein families of the latter category can often be characterized by horizontal gene transfer events. We could not detect any events of horizontal gene transfer between different viruses; however, we have identified several events of horizontal gene transfer from the host to an infecting virus. We have also identified proteins from several protein families that appear in very distant viruses, whose function is likely conserved but whose origin cannot be traced back to a single viral class, which hints at a much more complex network of kinship in the virus world than previously recognized.

Session: Proteins (M1)

17.30 17.50

Bioinformatic analysis of diverse protein superfamilies to design improved enzymes Dmitry Suplatov

17.50 18.10

Detecting the features of functional specificity in protein families based on the local sequence similarity Boris Sobolev

Functional specificity of different subgroups in a protein family is determined by particular amino acid residues. Commonly, such residues are identified by methods using Multiple Sequence Alignment (MSA). We propose the SPrOS method for estimating the specificity of sequence positions based on independent comparisons of sequence fragment pairs. It is better suited for locating significant positions shifted in the MSA and for analyzing intersecting classes. The method was tested on data representing various types of sequence-function relations. Using artificially generated sequences with introduced position-specific exchanges, highly accurate recognition of group-specific positions was shown. Application of SPrOS to the LacI/GalR protein family identified positions whose functional significance had been experimentally determined earlier. In the more complicated case of protein kinases classified by inhibitor specificity, SPrOS was able to predict group-specific positions with statistically significant estimates. When our results were mapped onto 3D structures, positions predicted with high significance were found in ligand-binding areas. In many cases evolutionarily coupled mutations significantly complicate recognition of the positions that actually determine group specificity. In the case of protein kinases we showed that excluding the close homologues of the test sequence overcomes this problem.

18.10 18.30

Determination of the size of folding nuclei of protofibrils from the concentration dependence of the rate and lag-time of their formation Oxana Galzitskaya

In this work we propose a kinetic model of amyloid protofibril formation that allows the size of the nuclei to be calculated using only kinetic data. In addition to the stage of primary nucleation, which is believed to be present in many protein aggregation processes, the model includes both linear growth of protofibrils (proceeding only by attachment of monomers to the ends) and exponential growth of protofibrils by growth from the surface, branching, and fragmentation with secondary nuclei. Theoretically, only exponential growth is compatible with the existence of a pronounced lag-period (which can take much more time than the growth of the aggregates themselves). According to our theory, some mechanisms of growth can be distinguished on the basis of kinetic data.

18.30 19.10

Assessing protein synthesis with ribosome profiling Pavel Baranov

We used ribosome profiling (ribo-seq) to assess the gene expression response of mammalian cells to various stresses, such as increased eIF2 phosphorylation (the key step in Integrated Stress Response), and Oxygen and Glucose Deprivation (OGD). It enabled us to delineate the rapid translational response affecting thousands of genes. The response frequently involves translation of short regulatory ORFs usually located in

the 5' leaders of mRNAs. We also observed translation of unannotated long ORFs that likely leads to the synthesis of novel protein products specific to stress conditions. To assist the research community in using ribo-seq data we are developing the RiboSeq.Org suite of tools (http://riboseq.org), which currently consists of the GWIPS-viz browser for the visualization of genomic alignments of ribosome footprints and RiboGalaxy, a Galaxy instance specifically tailored for the analysis of ribo-seq data. In addition we developed a simple computational approach for the characterization of ribo-seq datasets. This technique is resistant to irregular technical noise and aberrant footprint densities caused by ribosome pauses. Application of this approach to several ribo-seq datasets revealed the strong impact of sequencing biases and translation inhibitors on the distribution of aligned ribosome footprints as well as substantial non-biological variability between datasets obtained from different laboratories.

Sunday, 19 July 2015

09.00-10.00

Morning coffee

Session: Genomics of anhydrobiosis (M1)

10.00 10.40

Anhydrobiosis in the sleeping chironomids: where are we now Takahiro Kikawada

10.40 11.00

Expression regulation of desiccation-resistance genes in Polypedilum vanderplanki Pavel Mazin

In my talk, I will show how sophisticated analysis of RNA-Seq data can help to understand the molecular mechanisms of dehydration tolerance, using the example of Polypedilum vanderplanki, an insect that can survive almost complete water loss and revive within just three hours of re-hydration. Our results reveal that heat shock transcription factor (HSTF) is responsible for desiccation-induced transcription activation of many genes in P. vanderplanki, but not in the congeneric desiccation-sensitive P. nubifer. This is likely achieved by binding of HSTF to a doubled binding site in the promoter region of its own gene, which results in self-activation of HSTF in P. vanderplanki but not in P. nubifer, where the HSTF-binding motif in the promoter region of the HSTF gene is absent. While HSTF seems to be responsible for the activation of hundreds of genes under desiccation, these are just a tiny fraction of the genes that alter their expression in either direction under desiccation and/or re-hydration. For example, genes that encode heme-binding proteins, globins and cytochromes are significantly enriched among both desiccation-suppressed and desiccation-activated genes. Some of these genes are expressed only in animals that survived desiccation and are almost silent before dehydration. Transforming growth factor beta, genes that encode proteins involved in DNA repair, polyketide (chemicals involved in pheromone communication and defense) synthesis and many others are significantly enriched among genes activated soon after the start of re-hydration. Analysis of gene expression can provide some clues about desiccation-related regulation of these genes. For example, all four histone deacetylases encoded by the P. vanderplanki genome are differentially expressed under desiccation, as is a single histone acetylase. Genes that encode nuclear hormone receptors are significantly enriched among genes activated after three hours of re-hydration, which points to a possible role of steroid hormones in re-hydration-induced gene expression changes.

11.00 11.20

Molecular basics of different mechanisms of desiccation tolerance in Chironomidae midges Olga Kozlova

11.20 11.40

Adapting to extremes: linking metabolome and genome of an anhydrobiotic insects Elena Shagimardanova

11.40-12.10

Coffee break

Session: Toolkits for anhydrobiosis research (M1)

12.10 12.50

Single cell molecular toolkit for inducible resistance to complete desiccation Oleg Gusev

Larvae of the sleeping chironomid Polypedilum vanderplanki are the most complex organisms capable of tolerating complete desiccation. Upon desiccation, the larvae enter a reversible ametabolic state (anhydrobiosis). It was shown that during desiccation a non-reducing sugar (trehalose) replaces water in the cells, leading to "vitrification". This mechanism prevents damage to molecules, cell structures and organelles. It has been demonstrated that anhydrobiosis is a property of individual cells rather than a hormonally controlled process (reviewed in Cornette and Kikawada, 2010). One of the recent achievements is the establishment of a P. vanderplanki embryonic cell line capable of withstanding complete desiccation via inducible anhydrobiosis (Nakahara et al., 2010). Sequencing of the sleeping chironomid genome revealed several peculiarities in its structure associated with desiccation resistance (Gusev et al., 2014). It is suggested that anhydrobiotic clusters of genes (ARIds) that were not found in the genomes of other insects, including the closely related chironomid Polypedilum nubifer, are responsible for the formation of a "molecular shield" during dehydration. In the current project we aim to dissect the molecular background of inducible desiccation resistance in the cells by combining data from whole-genome cap analysis of gene expression (CAGE), transcriptomics and comparative proteomics (iTRAQ). The first stage of the analysis revealed that, in contrast to whole larvae, in which more than 15% of the total number of genes are altered by desiccation, inducible anhydrobiosis in the cell line is associated with differential expression of less than 1% of the total number of genes. We further found that only selected members of the ARIds gene islands are expressed and further up-regulated in response to preconditioning with trehalose and subsequent desiccation in the cell line. Taken together, the data suggest that the current approach is an effective tool for defining the minimum essential gene set needed for induction of anhydrobiosis in a stand-alone chironomid cell line, and that it would further be useful for an artificial anhydrobiosis methodology for other eukaryotic cell lines. In addition, tissue or organ specialization might be one of the explanations for the paralogization of anhydrobiosis-related genes in the sleeping chironomid.

12.50 13.30

Genetic toolkit for investigation of anhydrobiosis: promoters and RNAi Richard Cornette

13.30-15.00

Lunch break

Session: Genome structure (M1)

15.00 15.20

Genome mapping revealed scaffold misassemblies and elevated gene shuffling on the X chromosome in malaria mosquitoes Igor Sharakhov

15.20 15.40

Detection of short size mutations and copy number alterations in ultra-deep targeted sequencing data Valentina Boeva

The emergence of the amplicon sequencing technique, which followed whole exome sequencing, promises a revolution in cancer diagnostics and treatment. Amplicon sequencing consists of the PCR amplification of a limited number of the genomic regions of interest (amplicons) followed by high throughput sequencing. These genomic regions generally correspond to exons of "actionable" cancer-related genes: ALK, BRAF, MYCN, ERBB2, etc. Due to the exceedingly high read coverage of amplicon sequencing data, there is no methodological issue in the identification of clonal point mutations and small insertions or deletions (indels) in actionable genes targeted by amplicon sequencing. However, how to reliably detect copy number changes and identify subclonal mutations present in a very small proportion of tumor cells from amplicon sequencing data is still open to discussion. Here we provide a solution, ONCOCNV, to the challenging question of extracting CNAs from amplicon sequencing data by (i) defining a method to normalize read coverage with a small set of normal control samples and (ii) assigning statistical significance to putative CNAs resulting from the segmentation of normalized profiles. We also propose a method, TargetZoom, to detect subclonal mutations in amplicon sequencing data.
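To make step (i) concrete, the sketch below shows one very simple way to normalize amplicon read coverage against normal controls: per-amplicon read fractions of the tumor sample are divided by the mean fraction across controls and reported as log2 ratios. This is not the ONCOCNV procedure, whose details the abstract does not give; the function name and the input layout (amplicon name mapped to read count) are assumptions.

```python
from math import log2
from statistics import mean


def coverage_log_ratios(sample, controls):
    """Return log2(sample fraction / mean control fraction) per amplicon.

    `sample` and each element of `controls` map amplicon names to read
    counts. Counts are first converted to fractions of the library so
    that sequencing depth cancels out. Zero counts are not handled.
    """
    def fractions(counts):
        total = sum(counts.values())
        return {amplicon: n / total for amplicon, n in counts.items()}

    sample_fraction = fractions(sample)
    control_fractions = [fractions(c) for c in controls]
    return {
        amplicon: log2(value / mean(c[amplicon] for c in control_fractions))
        for amplicon, value in sample_fraction.items()
    }


if __name__ == "__main__":
    normals = [{"A": 100, "B": 100, "C": 200}, {"A": 50, "B": 50, "C": 100}]
    tumor = {"A": 300, "B": 100, "C": 200}
    print(coverage_log_ratios(tumor, normals))
```

Ratios near zero suggest a copy number close to that of the controls; in practice the normalized profile would still need segmentation and a significance test, as the abstract describes.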

15.40 16.00

Genomic structural instability and homologous recombination deficiency in breast and ovarian cancers Tatiana Popova

16.00 16.20

Genome Track Analyzer: New tool for genome-wide study of correlations between distributed genome features Galina Kravatskaya

A broad class of tasks in genetics and epigenetics can be reduced to the study of various features that are distributed over the genome (genome tracks). The rapid and efficient processing of the huge amount of data stored in genome-scale databases cannot be achieved without advanced software based on analytical methods. However, the strong inhomogeneity of genome tracks hampers the development of relevant statistics. We developed analytical criteria for the assessment of genome track inhomogeneity and of correlations between two genome tracks. We also developed a software package, Genome Track Analyzer, based on this theory. It contains the following tools applicable to genomic track investigations:
* Correlations between point-wise and stretch-wise genomic tracks
* Correlations between profiles (including expression and DNA-protein binding profiles)
* Correlations between point-wise and stretch-wise genomic tracks and expression profiles
* Statistical Kolmogorov-Smirnov and entropy tests for assessment of the distribution of genomic tracks over the chromosomes

The theory and software were tested on simulated data, and were applied to the study of correlations between CpG islands and transcription start sites in the Homo sapiens genome, between profiles of protein binding sites in chromosomes of Drosophila melanogaster, and between DNA double-strand breaks and histone marks in the Homo sapiens genome. Significant correlations between transcription start sites on the forward and the reverse strands were observed in the genomes of Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, Homo sapiens, and Danio rerio. The observed correlations may be related to the regulation of gene expression in eukaryotes. Genome Track Analyzer is freely available at http://ancorr.eimb.ru/
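As a small illustration of the last tool in the list, the sketch below computes the one-sample Kolmogorov-Smirnov statistic for a point-wise track against a uniform distribution along a chromosome. It is a generic textbook calculation, not code from Genome Track Analyzer, and the function name and arguments are hypothetical.

```python
def ks_uniform_statistic(positions, chrom_length):
    """Kolmogorov-Smirnov distance between a point-wise track and uniform.

    Positions are scaled to [0, 1]; the statistic is the largest gap
    between the empirical distribution function and the diagonal.
    """
    scaled = sorted(p / chrom_length for p in positions)
    n = len(scaled)
    distance = 0.0
    for i, x in enumerate(scaled):
        # Check the gap just after and just before each jump of the ECDF.
        distance = max(distance, (i + 1) / n - x, x - i / n)
    return distance


if __name__ == "__main__":
    evenly_spread = [125, 375, 625, 875]
    clustered = [10, 20, 30, 40]
    print(ks_uniform_statistic(evenly_spread, 1000))
    print(ks_uniform_statistic(clustered, 1000))
```

A large distance indicates an inhomogeneous track, such as features concentrated at one end of the chromosome; converting the distance to a p-value requires the Kolmogorov distribution and is omitted here.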

16.20 16.40

Tale on the transposons on chromatin landscape Vladimir Babenko

We categorized non-overlapping 100 kb segments of the human genome by their DNase Hypersensitive Site (DHS) counts, based on the data in (Sheffield et al., 2011). The counts fit a long-tailed Weibull distribution with a peak at around 14 DHSs per bin. The few (around 50) bins with fewer than 14 DHSs were mostly gene deserts, long introns, or some quite distinct gene clusters such as the ubiquitin peptidase family. Then we performed linear regression analysis between transposon counts categorized by family and the number of DHSs. We revealed two major classes of transposon families: those that prefer "silent" chromatin and those that tend, with high confidence, to reside in "open" chromatin bins. Further, we discovered that the number of Alu retroposons strongly correlates with the number of genes in the bin. Based on this observation, we worked out a method based on a non-linear Alu-gene correlation to infer some non-linear evolution events such as the emergence of tandemly repeated gene clusters. We also crossed the family-categorized transposons with the Txn table (transcription factor binding sites verified by ChIP-seq; genome.ucsc.edu) to elucidate their transposon-specific propagation, similar to (Jjingo et al., 2014). Further, we assessed the chromosome-wise bias of repeat families and found that most chromosome-specific expansions of repeat families (LINEs in the majority) are on the X chromosome. Some CTCF-related open chromatin LTR expansions were observed specifically on chromosome 19, in a way similar to B2 SINE in mouse (Lunyak et al., 2007). Overall, we report that the properties of transposon distribution and density within a genomic segment can disclose its specific evolutionary history and features.
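The binning and regression steps can be sketched as follows: feature positions are counted in non-overlapping 100 kb bins, and an ordinary least-squares line is fitted to per-bin counts of two feature types. The function names and the toy coordinates are illustrative assumptions, not the pipeline used in the study.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, as in the abstract


def bin_counts(positions, bin_size=BIN_SIZE):
    """Count features per non-overlapping bin; keys are bin indices."""
    return Counter(p // bin_size for p in positions)


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy coordinates on one chromosome (hypothetical values).
    dhs = bin_counts([5, 150_000, 160_000, 250_000, 260_000, 270_000])
    alu = bin_counts([10, 20, 30, 110_000, 120_000, 130_000, 140_000,
                      150_000, 210_000, 220_000, 230_000, 240_000,
                      250_000, 260_000, 270_000])
    bins = sorted(set(dhs) | set(alu))
    # Counter returns 0 for bins that hold no feature of a given type.
    print(least_squares([dhs[b] for b in bins], [alu[b] for b in bins]))
```

A positive slope for a transposon family would correspond to a preference for "open" chromatin bins and a negative slope to a preference for "silent" ones, in the sense used in the abstract.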

16.40-17.10

Coffee break

Session: Evolution (M1)

17.10 17.30

Assessing the impact of horizontal gene transfer on the evolution of prokaryotes Vladimir Makarenkov

Horizontal Gene Transfer (HGT) is one of the major evolutionary processes affecting the evolution of prokaryotic species. Two known types of horizontal gene transfer are complete and partial transfers. Partial HGT can be viewed as a complete HGT followed by

intragenic recombination and leading to the creation of a mosaic gene. The identification of the origins and the rates of horizontal gene transfers in the context of complete and partial HGT models, and for different phylogenetic families and ecological niches, is a very relevant and challenging problem. We will present a novel bioinformatics framework designed to estimate and compare the rates of complete and partial HGTs at different phylogenetic and ecological levels. Our results suggest that partial HGTs are almost twice as frequent as complete ones. We also determined that the majority of prokaryotic genes (each gene was represented by a multiple alignment of the corresponding alleles) have been affected multiple times by gene transfers during their evolutionary history: we found that 83% of the considered prokaryotic genes have been affected by at least one complete HGT and 96% by at least one complete or partial HGT.

17.30 17.50

Rare amino acid changes fixation drives divergence in Metazoa evolution Konstantin Gunbin

This report answers the following questions: What is the physical nature of the two contrasting groups of amino acid substitutions, atypical (statistically rare) and typical? What is the difference between protein sites fixing atypical substitutions and protein sites fixing any amino acid replacements? What is the difference between lineages fixing atypical replacements and lineages without these replacements? How are atypical amino acid substitutions distributed among protein functional groups? Is there a connection between the frequency of atypical replacements and genus birth in the fossil record? Which branches of the Metazoa tree are enriched with atypical amino acid replacements, and why does it matter?

17.50 18.10

Evolution of TAG codon in Methanosarcina Margarita Meer

18.10 18.30

A model of protein evolution within local fitness landscape changing with time Dinara Usmanova

Each amino acid in a protein interacts with others. Thus the fitness contribution of a specific amino acid at a particular site depends on the whole genetic background. This background changes over time, resulting in changes in allele fitness. In other words, selection acting against particular alleles is not constant. We developed methods for the analysis of long-term protein evolution which allow us to observe patterns of this changing selection. We then formulate a covarion-like model of protein evolution, which describes this process mathematically. The model tracks not only the evolution of the sequence but also the evolution of its local fitness landscape. In more detail, we allow the fitness contribution of a specific amino acid at a particular site to switch from being acceptable to being deleterious and vice versa. We calculated the rate of these switches for approximately 100 bacterial genes and 10000 vertebrate genes. It appears that the fitness landscape changes very fast: on average, 5 switches between allowed and blocked states occur on the same timeframe as a single amino acid substitution.

18.30 19.10

Chartering the local fitness landscape of the green fluorescent protein Fedor Kondrashov

The nature of the genotype to phenotype connection, the fitness landscape, and the extent to which it is shaped by the non-independent contribution of mutations, epistasis, remain poorly understood. My talk will focus on an assay of the native function, fluorescence, of tens of thousands of genotypes of the green fluorescent protein, including genotypes containing multiple missense mutations, allowing for the exploration of the local fitness landscape of an entire protein coding gene in unprecedented detail. We find that the impact of multiple missense mutations on fluorescence is influenced by epistasis, especially for mutations at functionally important sites with a weak individual effect on fluorescence. Furthermore, although the fitness landscape can be approximated with a relatively simple unidimensional function, it is also affected by multidimensional epistasis, such that a multidimensional representation of the genotype space leads to a more accurate prediction of the level of fluorescence for each genotype. The broad congruence of the estimates of the prevalence of epistasis from long-term evolution with our data suggests that our query of the shape of the local fitness landscape can be extrapolated to a larger scale. However, the local fitness landscape does not appear to be influenced by epistasis between amino acid residues with a direct interaction in the protein structure. This observation appears to contrast with the general conclusions about the importance of structural interactions in long-term evolution, suggesting that multidimensional epistatic interactions are rare in short-term evolution but accumulate with protein divergence.

19.10-19.20 Closing

19.30 Farewell party


Poster session

# | Name | Section | Poster title
1 | Shlikht Anatoliy | algorithms | Automated working place of bioinformatics
2 | Taranov Evgeny | algorithms | DegenPrimer: a software for in silico simulation of multiplex PCR with degenerate primers
3 | Golosova Olga | algorithms | NGS Data Analysis with Unipro UGENE
4 | Kalinina Anastasia | algorithms | The method for homologous recombination detection within bacterial species
5 | Klimchuk Olesya | algorithms | OLESA: Operon Loci Examination and Sorting Application
6 | Demidov German | algorithms | A Novel Statistical Algorithm to Detection of Large-scale Deletions in PCR-enriched Target Sequencing Data
7 | Demidov German | algorithms | Stochastic modeling of enhancer molecular configurations
8 | Fedonin Gennadiy | algorithms | Characterization of highly diverse viral populations by fast reference selection and accurate read mapping


9 | Flegontov Pavel | algorithms | A read mapper for investigation of U-insertion/deletion RNA editing
10 | Gerasimov Evgeny | algorithms | Human-guided genome assembly finishing software
11 | Lyubetsky Vassily | algorithms | A method of detecting local gene synteny rearrangement
12 | Poverennaia Irina | algorithms | Investigation of exon-intron structure multiple alignments
13 | Soldatov Ruslan | algorithms | Differential activity of polymerase ? associated with replication timing and gene bodies in humans: evidence from mutational signatures
14 | Nagaev Boris | algorithms | NPG-explorer, a tool for creating and exploring nucleotide pangenome for closely related prokaryotic genomes
15 | Zhuravleva Ekaterina | epigenetics and chromatin structure | Evaluation of the positional correlations between whole genome annotations: novel statistical approaches development, advancement of the GenometriCorr methodologies


16 | Stavrovskaya Elena | epigenetics and chromatin structure | StereoGene: a tool for fast correlation assessment and its application to the analysis of bivalent histone methylation
17 | Galitsyna Aleksandra | epigenetics and chromatin structure | Spatial configuration of the alpha-globin gene domain in three cell types of G. gallus
18 | Khrameeva Ekaterina | epigenetics and chromatin structure | Active chromatin regions are sufficient to define borders of topologically associated domains in D. melanogaster interphase chromosomes
19 | Kulakova Ekaterina | epigenetics and chromatin structure | Computer analysis of chromosome contacts obtained by ChIA-PET and Hi-C technologies
20 | Klink Galya | evolution | Analysis of prevalence of epistasis on the basis of huge phylogenies
21 | Potapova Nadezhda | evolution | Accumulation of mutations in nonsense alleles of Drosophila melanogaster
22 | Rusinov Ivan | evolution | Estimation of selection pressure on degenerate sequences in genomes: choice of method
23 | Terekhanova Nadezhda | evolution | Local variation of the mutation rate across the primate phylogeny


24 | Teterina Anastasia | evolution | The evolution of cod protein coding genes: intra- and interspecies levels
25 | Savitskaya Ekaterina | evolution | Autoimmune primed CRISPR adaptation in I-E and I-F systems: comparative analysis of new spacer selection mechanisms
26 | Novakovsky German | evolution | Phylogenomic analysis of the type I NADH:quinone oxidoreductase
27 | Bondareva Olga | evolution | Study of lactobacteria's genomes evolution
28 | Tarasov Oleg | evolution, taxonomy | Sequencing genomes of Saccharomyces cerevisiae strains belonging to the Peterhof Genetic Collection helps elucidate the origin of several widely used laboratory strains
29 | Troitsky Aleksey | evolution, taxonomy | Moss phylogeny reconstructed from 24 full mitogenome sequences using new "pangenome" based approach.
30 | Korvigo Ilia | evolution, taxonomy | The Evolutionary Space of bacterial 16S rRNA gene
31 | Baranova Mariia | medical and population genetics | Extremely high polymorphism level in fungi S. commune: the cause and the importance for population genomics


32 | Bai Haihua | medical genetics | Identification of the susceptibility gene loci associated with ischemic stroke in a Mongolian population in China
33 | Belenikin Maxim | medical genetics | Studying of epileptic encephalopathies using NimbleGen-based target panels
34 | Belenikin Maxim | medical genetics | Finding of compound heterozygous mutations in the ALDH7A1 gene. Clinical case
35 | Reznik Aleksandr | medical genetics | Evolutionary analysis of NPC1 improves accuracy of predicting disease causing missense mutations
36 | Sergeev Roman | medical genetics | Mutation analysis of M. tuberculosis nucleotide sequences from patients in Belarus
37 | Bizin Ilya | medical genetics, cancer | A bioinformatics pipeline for analysing germline mutations in human breast cancer by exome sequencing
38 | Milchevskaya Vladislava | medical genetics, cancer | Improved gene annotations for microarray based identifications of reporter metabolites in recurrent breast cancer


39 | Moshkovskii Sergei | medical genetics, cancer | Exome-based proteogenomics of human cancer cell lines
40 | Terskikh Anastasia | medical genetics, cancer | Analysis of mutational landscape of patients with chronic lymphocytic leukemia
41 | Popova Anfisa | metagenomics | BCVISS: a web application for analyzing mixed 16S rRNA gene chromatograms
42 | Dubinkina Veronika | metagenomics | Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis of human gut microbiota

43 | Garushyants Sofya | metagenomics | Comparative metagenomic profiling of two pilot-scale microbial fuel cells treating industrial wastewaters
44 | Kiseleva Larisa | metagenomics |
45 | Kovarsky Boris | metagenomics | Recent genomic changes in the human gut microbiome
46 | Shavkunov Konstantin | metagenomics | Bacteria revived from an ancient bison gut
47 | Kazakov Sergey | metagenomics | MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data
48 | Ivashchenko Anatoliy | miRNA | Binding sites of miRNAs with transcription factors' genes of Camelus ferus and Homo sapiens


49 | Ivashchenko Anatoliy | miRNA | Features of miR-574-5p and miR-574-3p binding sites in mRNA of target genes
50 | Prosvirov Kirill | miRNA | At least 6% of conserved miRNAs' sites are misaligned.
51 | Niyazova Raigul | miRNA, cancer | Interactions between miRNAs and mRNAs of apoptosis genes in lung cancer
52 | Niyazova Raigul | miRNA, cancer | The interaction of miRNAs with mRNAs of the cell cycle genes in lung cancer
53 | Hadarovich Anna | protein function | Quantitative comparison of functional properties in protein-protein complexes
54 | Petrov Artem | protein function | A novel Arg H52 and Tyr H33 conservative binding motif in antibodies: a correlation between sequence of immunoglobulins and their binding properties
55 | Alexandrov Anton | protein function, algorithms | Computational prediction of MHC class I tumor-specific antigens
56 | Argun Dmitriy | protein function, algorithms | Sequence analysis in short functionally important peptides by combination of bioinformatics, molecular dynamics and testing of biological activity


57 | Bogatyreva Natalya | protein structure | Methods for protein folding rate prediction
58 | Dudko Anna | protein structure | TOM-complex structure modeling
59 | Gushchina Irina | protein structure | Molecular model of tyrosyl-DNA phosphodiesterase 1 for a structure-based screening for its inhibitors
60 | Guzenko Dmytro | protein structure | Constrained Modelling of an Intermediate Filament Dimer
61 | Ivankov Dmitry | protein structure | Testing applicability of machine learning for protein folding rate prediction
62 | Milchevskiy Yury | protein structure | Local protein structure prediction based on physicochemical properties of amino acids
63 | Nyporko Alex | protein structure | The 8-oxo-7,8-dihydro-2'-dGTP behavior in active site of human DNA polymerase : structural investigation in silico
64 | Rogacheva Olga | protein structure | cAMP-induced conformational changes of Protein Kinase A Ia A-domain
65 | Scherbakov Kirill | protein structure | Docking method reveals binding patterns of -dioc acids by albumin


66 | Shalaeva Daria | protein structure | Modeling the role of positively charged moieties in hydrolysis of nucleoside triphosphates
67 | Tarnovskaya Svetlana | protein structure | Structural analysis of mutations associated with idiopathic restrictive cardiomyopathy in cytoskeletal and sarcomeric proteins
68 | Aksianov Evgeniy | protein structure, algorithms | Sequence alignment of non-superposable beta-sheets
69 | Bykov Alexander | regulation of transcription | Changing the transcriptional activity of genome regulatory loci by PCR-mutagenesis
70 | Khoroshkin Matvei | regulation of transcription | Transcriptional Regulation of the Carbohydrate Metabolism in the Bifidobacterium Genus
71 | Suvorova Inna | regulation of transcription | Reconstruction of GABA and taurine metabolic regulons, controlled by MocR-subfamily transcription factors
72 | Tutukina Maria | regulation of transcription | Revealing and comparing regulons of homologous transcription factors UxuR and ExuR in Escherichia coli.


73 | Zharov Ilya | regulation of transcription | Correlations of substitutions predict specific protein-DNA contacts in the MerR family of transcriptional factors
74 | Zalevsky Arthur | RNA and DNA structure | Unraveling CD spectra of G-quadruplexes
75 | Lomzov Alexander | RNA and DNA structure | Hybridization energy of native and modified DNA duplexes calculated using molecular dynamics
76 | Chervontseva Zoe | RNA structure | The evolution of 5' untranslated regions' structure in Bacilli and Clostridia genomes
77 | Moldovan Mikhail | RNA structure | Comparative genomics analysis of thiamine-pyrophosphate riboswitches in fungal genomes
78 | Baulin Eugene | RNA structure | Long-range stem-based RNA tertiary motifs
79 | Vasileva Aleksandra | RNA structure | Secondary structures in the coding regions of mRNAs: literature survey and comparison of prediction methods
80 | Vinogradova Svetlana | RNA structure | Probing-directed structured elements detection in RNA sequences


81 | Volkova Oxana | RNA structure | Estimation of translational importance of mammalian mRNA nucleotide sequence characteristics based on ribosomal profiling data
82 | Lavrekha Viktoriya | systems biology, modeling | Mathematical modeling of morphogenetic regulation of the meristem zone formation in the plant root
83 | Novikov Konstantin | systems biology, modeling | A new method for identification of molecular motor role in endocytosis
84 | Spirov Alexander | systems biology, modeling | Dynamic modeling of genes for spatial patterning in embryo development on the example of the Drosophila segmentation gene hunchback
85 | Dymova Arina | systems biology, networks | Combined sequence-based model of the Drosophila gap gene network
86 | Kogan Valeria | systems biology, networks | Critical dynamics of gene networks is behind ageing and Gompertz law
87 | Kondratova Maria | systems biology, networks | Atlas of Cancer Signaling Network: from intracellular networks to tumoral microenvironment


88 | Misko Vladislav | systems biology, networks | The construction of gene networks for Mycobacterium tuberculosis by analyzing next-generation sequencing data
89 | Mitra Chanchal | systems biology, networks | Modelling the Metabolic Pathways
90 | Sergushichev Alexey | systems biology, networks | Globally connected networks of GEO transcriptional profiles reveal hypothesis generation and drug repurposing potential
91 | Shagimardanova Elena | systems biology, networks | Comparative metabolomic profiling of desiccation tolerant midge
92 | Cherkasov Alexander | transcriptomics | Whole genome analysis of variety and expression of heat-shock protein encoding genes during desiccation stress in an anhydrobiotic midge Polypedilum vanderplanki
93 | Gazizova Guzel | transcriptomics | How to escape from muscle atrophy: whole-genome analysis of gene expression in edible dormouse (Glis glis) during immobilization
94 | Kuznetsova Svetlana | transcriptomics | Transcriptomic of the leech Ozobranchus jantseanus


95 | Naumenko Sergey | transcriptomics | Building the set of orthologous genes for 66 Gammaridae transcriptomes
96 | Nesmelov Alexander | transcriptomics | Antioxidant system of desiccation-tolerant insect Polypedilum vanderplanki
97 | Spitsina Anastasia | transcriptomics, algorithms | Computer tool for gene expression data processing and correlation analysis
98 | Garanina Irina | transcriptomics, splicing | Splicing sites evolution in primates prefrontal cortex
99 | Speshilov Gleb | transcriptomics, splicing | Comprehensive comparison of RNA-seq based methods for differential splicing analysis
100 | Vinogradov Dmitry | transcriptomics, splicing | Alternative splicing in hepatocellular carcinoma


Conference materials are available at http://mccmb.belozersky.msu.ru/2015

250 copies

© ITTP RAS, 2015
