The Student Council Symposium is a forum for students, post docs, and young researchers in the fields of Computational Biology and Bioinformatics. The 2nd Brazilian Student Council Symposium will be held at Colina Verde Hotel - São Pedro, SP on October 4th as a Satellite Meeting of the X-Meeting 2017 - 13th International Conference of the AB3C.
Registration for the 2nd Brazilian Student Council Symposium is free and available at: https://goo.gl/forms/l6JXZMMH8UhLgoYv1. If you have any questions, contact us via the Contact Page; we will be glad to help you. Abstract submission for oral talks is closed.
Title: Bioinformatics investigation of non-coding RNAs and transposable elements in plants Abstract: Non-coding RNAs (ncRNAs) are transcripts that do not encode proteins. There are several classes of ncRNAs, of which the most studied are microRNAs (miRNAs). Transposable elements (TEs) are the major genomic component of eukaryotic genomes. They can comprise more than 45% of human and animal genomes, and up to 90% of plant genomes. Our research group recently developed PlanTE-MIR DB, the first public database dedicated to the relationship between miRNAs and TEs in plants. In this repository, users can search, extract and analyze these overlapping features in 10 plant species. We now intend to evaluate the relationship of TEs with all ncRNA classes, generating a new version of PlanTE-MIR DB. New bioinformatics analyses will use public genomic data available at the Ensembl Plants portal, and results will be accessible on a user-friendly website. The workflow of this investigation has three steps: a) curate and intersect ncRNA and repetitive DNA features from the existing Ensembl annotation; b) perform de novo TE prediction in plant genomes and intersect it with the ncRNA annotation in order to find new potential overlaps; and c) compare newly discovered TEs against public ncRNA databases. Of the 44 genomes available at Ensembl Plants, 25 species have ncRNA annotation, and in 24 of them we found overlaps with TEs. The species with the most overlapping regions was Zea mays, with 3,105 hits, and the species with the fewest was Sorghum bicolor, with one hit. Finally, we intend to develop a new method to identify plant TEs using deep learning techniques. These computational analyses will give the scientific community a user-friendly way to work with this knowledge.
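The intersection in step a) can be illustrated with a minimal sketch. It assumes features are given as (chromosome, start, end, id) tuples with 1-based inclusive coordinates, as in GFF3; the function name and the toy features are hypothetical and are not taken from PlanTE-MIR DB or Ensembl.

```python
from collections import defaultdict

def overlaps(ncrnas, tes):
    """Return (ncRNA id, TE id) pairs whose intervals overlap on the same chromosome.

    Each feature is a tuple (chrom, start, end, feature_id) with 1-based,
    inclusive coordinates (an assumption made for this sketch).
    """
    # Group TEs by chromosome so each ncRNA is only compared with TEs nearby.
    tes_by_chrom = defaultdict(list)
    for chrom, start, end, te_id in tes:
        tes_by_chrom[chrom].append((start, end, te_id))

    hits = []
    for chrom, start, end, rna_id in ncrnas:
        for te_start, te_end, te_id in tes_by_chrom[chrom]:
            # Two closed intervals overlap when each starts before the other ends.
            if start <= te_end and te_start <= end:
                hits.append((rna_id, te_id))
    return hits

# Toy features, invented for illustration only.
ncrnas = [("1", 100, 180, "miR_a"), ("1", 500, 560, "miR_b"), ("2", 50, 90, "snoR_c")]
tes = [("1", 150, 400, "TE_1"), ("2", 200, 300, "TE_2")]
print(overlaps(ncrnas, tes))
```

This nested loop is quadratic per chromosome; for genome-scale annotations a sorted sweep or an interval tree would be the usual choice.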
Title: PFstats: An Open Tool for Evolutionary Protein Analysis Abstract: Introduction: PFstats is software developed for the extraction of useful information from protein multiple sequence alignments. By analyzing positional conservation and residue coevolution networks, the software allows the identification of structurally and functionally important amino acid groups and the discovery of probable functional subclasses. Furthermore, it contains tools for identifying the possible biological significance of these findings. Objectives: The goal of this project is to provide a computational tool with an interactive graphical user interface and data visualization tools to predict global and specific functional amino acid residues and to find functional subclasses in protein families. Material and Methods: The software was developed under a client-server architecture. The client was developed in C++/Qt, and on the server side a Java web service enables communication between the client and the UniProtKB, Pfam and PDB databases. PFstats includes methods for alignment filtering, residue conservation and coevolution analysis, automatic UniProtKB queries for residue-position annotation, amino acid alphabet reduction and many possible data visualizations. Results and Discussion: We have studied four protein family domains: lysozyme C/alpha-lactalbumin, phospholipases A2, nitrogen regulatory protein PII, and the DNA binding domain of the nuclear receptors IV. In all of them we found communities of residues related to catalytic and binding sites, communities of structural importance, such as a putative hydrophobic channel and secondary structures, and communities related to taxonomic specificity. Conclusions: PFstats is free and open source, distributed under the terms of the GPLv3 license. The software is available in GUI and terminal versions at http://www.biocomp.icb.ufmg.br/biocomp/software-and-databases/pfstats/.
We provide binaries for Windows and Linux (Debian), as well as compilation instructions for other systems, the source code and a manual.
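As an illustration of what positional conservation analysis of an alignment can look like, the sketch below scores each alignment column by Shannon entropy. This is one common conservation measure chosen for the example; the abstract does not state which measures PFstats implements, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Gap characters ('-') are ignored; 0.0 means a fully conserved column.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def conservation_profile(alignment):
    """Entropy per column for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]

# Toy alignment, invented for illustration only.
alignment = ["ACDE", "ACDF", "AC-G", "AGDH"]
scores = conservation_profile(alignment)
print([round(s, 3) for s in scores])
```

Low-entropy columns are candidates for conserved positions; coevolution analysis would additionally compare pairs of columns, which this sketch does not attempt.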
Title: Evolution of Bitopic Signal Transduction Proteins Abstract: Sensing environmental changes and relaying this information into the cell is very important for bacteria and other organisms. Bitopic signal transduction proteins are a diverse array of proteins that possess both exposed extracellular sensor domains and cytoplasmic domains connected by transmembrane regions. Changes in environmental conditions or the presence of external chemical stimuli are detected by the extracellular domains of these proteins and direct structural changes of their cytoplasmic portions. Structural rearrangements of the cytoplasmic domains result in intracellular activities that affect cellular behavior, such as protein post-translational modifications or synthesis of small molecules that act as secondary messengers. Our group has previously shown that XAC2382, a class I PBP protein of Xanthomonas citri, physically interacts with the periplasmic Cache domain of the bitopic GGDEF protein XAC2383, which is encoded by the neighboring downstream locus. Similar PBP domains were found to be encoded by genes located just upstream of other bitopic proteins with distinct combinations of Cache and cytoplasmic output domains, such as histidine kinases, methyltransferases and cyclic-AMP synthases. We now extend this analysis to a broader range of sensory domains and note that a sharply distinct, very distantly related class of PBPs, class II, can also be found in the same arrangement, suggesting that independent events led to the emergence of such distinct combinations of sensor and output domains. Using homology searches for PBPs, domain architecture and genomic context analysis, we demonstrate that distinct classes of PBP domains are often fused to output domains in proteins with varying levels of domain architectural complexity.
Histidine kinases, cyclic di-GMP synthases (GGDEF) and Sigma54 activators are often encoded by bitopic genes located downstream of PBP genes and, on some occasions, are present as a single gene that encodes a large fused polypeptide. Phylogenetic analysis of proteins harboring PBPs and the Cache domains of bitopic neighbors suggests that the more complex bitopic proteins originated from gene fusion events and possibly gave rise to new, simpler architectures through loss of internal domains. Importantly, the same pattern was also observed for periplasmic-binding proteins of class II, suggesting that this mode of evolution of new architectures, through fusion and subsequent loss of internal domains, could be a general feature of the evolution of bitopic signal transduction proteins. In order to evaluate this hypothesis, we will now consolidate our search strategies into a generic pipeline for comparative genomics and protein evolution and extend our analysis to other combinations of bitopic proteins and extracellular sensor domains. Our results will help us understand the relative impact of recombination and gene fusion on the evolution and shuffling of multidomain proteins.
Title: Computational gene expression environment by agent-based mRNA translation modeling Abstract: This work models mRNA translation using an agent-based representation of ribosome operation. The unique characteristic of this approach is its ability to quantify the protein synthesis process. We present the inhibitory effect of edeine on mRNA translation as well as the influence of termination rates. Translation termination anomalies are the central mechanism responsible for a number of human genetic diseases. In addition, the model demonstrates the evolution of long-range correlations in the timing of mRNA translation. These results allow for the quantification of the catalytic and toxic effects of EF-4 (LepA), an extremely conserved but poorly characterized protein. Besides quantifying the mRNA translation yield, the model allows for real-time observation of the entire process of amino acid chain formation and serves as an optimization tool for gene expression problems such as mRNA secondary structure formation. The calibrated model provided realistic timing of ribosome operation and of amino acid movement within the ribosome exit tunnel. This timing serves as the clock for amino acid chain folding and interaction with the ribosome exit tunnel. The flexibility and adaptability of the presented model and computer simulation allow for the reproduction of a number of behaviors observed in in vitro experiments on gene expression. The resulting computational environment gave rise to a new technology, which virtualizes cell-free protein expression kits.
Title: In-silico Structural Characterization of Variants Found in PCSK9 gene Identified in Familial Hypercholesterolemic Patients Abstract: Familial Hypercholesterolemia (FH) is a genetic disorder of lipoprotein metabolism, mainly caused by mutations in three genes: LDLR, APOB, and PCSK9. PCSK9 regulates low density lipoprotein (LDL) levels by binding to the LDL receptor (LDLR) and escorting it towards intracellular degradation compartments. Gain-of-function mutations in PCSK9 increase its proteolytic activity, reducing LDLR concentration and therefore resulting in high levels of LDL cholesterol in the plasma. Loss-of-function mutations lead to a higher concentration of LDLR, resulting in lower LDL cholesterol levels. The aim of the present project is an in silico and in vitro characterization of the effect of variants in the PCSK9 gene identified in FH patients. Forty-eight FH patients were sequenced using Next Generation Sequencing. The data were aligned to the reference genome using the Burrows-Wheeler Aligner (BWA), and variant calling was performed using the Genome Analysis Toolkit (GATK). Nine missense variants were then identified in the PCSK9 gene. Among them, four were chosen for further analysis because they were visible in the crystal structure and had a MAF below 5% in three databases. Crystal structures of wild type PCSK9 and LDLR were retrieved from the Protein Data Bank (PDB codes: 2P4E and 1N7D, respectively), and site-directed mutagenesis was performed using PyMOL v. 188.8.131.52 to generate the following PCSK9 variants: R237W, A443T, R469W and Q619P. Structural analysis of the molecular interactions of PCSK9 and its variants with LDLR was performed by protein-protein docking via ClusPro. The PCSK9-LDLR complexes were visualized using PyMOL v. 184.108.40.206. For R237W and R469W, we observed a possible conformational change that could increase the affinity of PCSK9 for LDLR when compared with the wild type.
In both cases, the arginine to tryptophan change allowed an interaction with an LDLR region characterized by a hydrophobic pocket. For A443T and Q619P no conformational changes were observed, and both variants showed interactions only with amino acids of PCSK9 itself, suggesting that these variants are probably neutral. R237W was already defined as a loss-of-function mutation by in vitro studies; however, no functional assays have been performed on R469W. As previous genetic association studies indicate that R469W is a gain-of-function mutation, and guided by our in silico result, an in vitro characterization will be conducted to further understand the possible pathogenicity of R469W.
Title: MARVEL: A pipeline for recovery and analysis of viral genomes from metagenomic shotgun sequencing data Abstract: The study of viral diversity in environmental samples has become increasingly important due to the recognition of the key roles played by these organisms in diverse ecosystems. Recent works provide evidence that viruses of bacteria (bacteriophages) are key players in the biogeochemical cycles of large ecosystems, such as oceans and forests. Viruses may also be determinant in the flux of genes among microbial populations and in the plasticity of microbial communities, helping these communities to deal with environmental stresses. Knowing the genomes of viruses that are present in diverse environments can thus help to improve our understanding of the microbial ecology and evolution of these environments. Here we describe the MARVEL pipeline for recovery and analysis of viral genomes from metagenomic shotgun sequencing data. The main steps in this pipeline are: sequence quality control, metagenome assembly, similarity searches against viral databases and a database of viral hallmark proteins, removal of false positives, and multisample contig binning. At the end, MARVEL generates an automatically curated set of contigs that correspond to draft and complete genomes of environmental viruses present in the analyzed sample. We have applied MARVEL to metagenomic datasets obtained in two environments (composting and a reservoir) of the São Paulo Zoo. We retrieved 37 viral genomes from reservoir samples and 36 viral genomes from composting. Most of these genomes have low or no similarity with viral genomes in public databases. These results therefore help shed light on the gigantic viral dark matter that exists on our planet. MARVEL can be applied to any shotgun metagenomic dataset for which Illumina reads are available. Funding for this research is provided by FAPESP and CAPES.
Title: Clustering of Euclidean Distance Matrices: an Alternative Method for Analysis of Protein Molecular Dynamics Abstract: Molecular dynamics (MD) simulation is a powerful technique used to study recurrent conformations and transition states and to predict physicochemical and geometric properties of proteins and other biomolecules, which is important for characterizing how these molecules perform their functions. However, analysis of MD simulations has been difficult because the method generates a very large number of conformations. Scientists have therefore created innovative techniques to reduce the number of MD structures without losing information about protein dynamics. Clustering algorithms have been applied to group similar structures from MD simulations, but the choice of the information to be clustered is still a challenge. In this work, we propose to use Euclidean distance matrices (EDM) computed from conformations as input data for clustering algorithms. We combined either no reduction (NR) or reduction of data dimensionality (MDS and isomap methods) with different clustering algorithms (kmeans, ward, mean-shift and affinity propagation). In order to evaluate the different clustering methods, we performed four production-phase MD simulations to provide the MD trajectory data. MD calculations were performed using two protein crystal structures (PDB IDs 1CLL and 1L2Y). For each protein, simulations were performed at 310 K and 510 K. MD simulations were run for 20 ns using the NAMD v2.12 program and the CHARMM27 force field. Data from the MD ensemble were collected every 2 ps, resulting in four sets of 10,000 conformations. From these sets, four subsets of 500 conformations were selected for clustering. In addition, we selected the last conformation of the equilibration step as a reference frame. For each conformation, we extracted the solvent-accessible surface area (SASA) and the EDM between protein atoms (backbone atoms, and Cα atoms only).
SASA and time were used as connectivity for the ward algorithm. In our analyses, tests using distances between backbone atoms did not differ from those using distances between Cα atoms, in both the 310 K and 510 K simulations. Regarding dimensionality reduction, clustering algorithms gave similar results with NR and reduced EDM for the 310 K and 1CLL-510K simulations. However, approaches using isomap had better silhouette values than the MDS and NR methods. For MD data in which proteins assume stable conformations (310 K), the mean-shift algorithm had the best results because of the smooth distribution of the data. For data with largely unstable conformations (510 K), however, methods that work better with more widely distributed data (kmeans and ward) had the best results. In conclusion, our results indicate that EDM could be a good alternative input for clustering MD conformations.
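A minimal sketch of how a conformation can be turned into an EDM-based feature vector is shown below. It assumes each conformation is a list of (x, y, z) atom coordinates; the function names and the toy coordinates are hypothetical, and the clustering step itself is left out. One well-known property of distance matrices, illustrated here, is that they do not change under rigid translation or rotation, so conformations can be compared without structural superposition.

```python
import math
from itertools import combinations

def edm_vector(coords):
    """Flatten the upper triangle of the Euclidean distance matrix of one conformation.

    coords is a list of (x, y, z) tuples, e.g. one tuple per C-alpha atom.
    """
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def conformation_distance(coords_a, coords_b):
    """Euclidean distance between the EDM vectors of two conformations."""
    return math.dist(edm_vector(coords_a), edm_vector(coords_b))

# Toy three-atom conformations, invented for illustration only.
conf_a = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
conf_b = [(5, 5, 5), (6, 5, 5), (6, 6, 5)]  # conf_a translated by (5, 5, 5)
conf_c = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]  # a stretched, linear arrangement

print(conformation_distance(conf_a, conf_b))  # identical internal geometry
print(conformation_distance(conf_a, conf_c))  # different internal geometry
```

The resulting vectors (one per conformation) are the kind of input that a clustering algorithm, optionally preceded by dimensionality reduction, could then group.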
Title: A new method based on structural signatures to propose mutations for β-glucosidase enzymes used in biofuel production Abstract: β-glucosidases (E.C. 220.127.116.11) are key enzymes in the second-generation biofuel production process. They act synergistically with endoglucanases and exoglucanases to convert cellulose from biomass into fermentable glucose used in biofuel production. However, it has been reported in the literature that the majority of known β-glucosidases are inhibited by high concentrations of glucose. Hence, the search for mutations that improve activity and glucose tolerance has intensified. In this study, we present a method to propose mutations for β-glucosidase enzymes that may improve their activity and tolerance to glucose inhibition. Our method is based on structural signatures: numerical representations of proteins extracted from the number of residue pairs. We hypothesized that proteins with similar structural signatures of catalytic pockets present similar characteristics. Hence, mutations that bring the structural signatures of non-tolerant β-glucosidases closer to those of enzymes classified in the literature as glucose-tolerant may improve the activity of these enzymes. We used Euclidean distance to calculate signature variations. If the signature variation was negative, the distance between signatures was reduced, so we considered the mutation beneficial. If the signature variation was positive, the distance between signatures increased, so we considered the mutation not beneficial. We collected 27 mutations in β-glucosidases from the literature and classified them as beneficial or not beneficial based on the experimental effects reported. Then, we calculated the signature variation for every mutation and compared the predicted result with the experimental result. We obtained a precision value of 0.89. In addition, we proposed 15 mutations for Bgl1B, a non-tolerant β-glucosidase extracted from a marine metagenome.
We found experimental data in the literature for three of these mutations: H228C, H228T and H228V. The experimental data demonstrate that these mutations improve activity even at high glucose concentrations. These results show that our method is effective at detecting mutations that increase the activity of β-glucosidases and that it can help to produce new mutant enzymes that may improve second-generation biofuel production.
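The decision rule described in this abstract can be sketched as follows. The sketch assumes that each signature is a plain numeric vector and that the reference is the signature of a glucose-tolerant enzyme; the vectors, function names and the use of the standard precision definition TP / (TP + FP) are assumptions made for illustration, since the abstract does not give these details.

```python
import math

def signature_variation(wild_type, mutant, reference):
    """Change in Euclidean distance to the reference signature caused by a mutation.

    Negative values mean the mutant signature moved closer to the reference.
    """
    return math.dist(mutant, reference) - math.dist(wild_type, reference)

def is_beneficial(variation):
    """Rule from the abstract: a negative variation is read as a beneficial mutation."""
    return variation < 0

def precision(predicted, observed):
    """Standard precision TP / (TP + FP); assumed, not defined in the abstract."""
    tp = sum(1 for p, o in zip(predicted, observed) if p and o)
    fp = sum(1 for p, o in zip(predicted, observed) if p and not o)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Toy signature vectors, invented for illustration only.
reference = [4.0, 2.0, 1.0]   # hypothetical glucose-tolerant enzyme
wild_type = [1.0, 2.0, 1.0]   # hypothetical non-tolerant enzyme
mutant_a = [2.0, 2.0, 1.0]    # moves towards the reference
mutant_b = [0.0, 2.0, 1.0]    # moves away from the reference

for name, mutant in [("mutant_a", mutant_a), ("mutant_b", mutant_b)]:
    variation = signature_variation(wild_type, mutant, reference)
    print(name, variation, is_beneficial(variation))
```

With labeled mutations from the literature, `precision` would compare the list of predictions against the list of experimentally observed outcomes.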