Poster abstracts

Poster number 22 submitted by Doaa Hassan Salem

Penguin: Predicting RNA Pseudouridine Sites in Nanopore Sequencing Data

Doaa Hassan (Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University), Daniel Acevedo (Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University), Swapna Vidhur Daulatabad (Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University), Quoseena Mir (Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University), Sarath Chandra Janga (Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University)

Abstract:
Pseudouridine is one of the most abundant RNA modifications; it occurs at uridine sites and is catalyzed by pseudouridine synthases. It plays an important role in many biological processes and is also relevant to drug development. Single-molecule sequencing techniques, such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies, enable direct detection of RNA modifications on the molecule being sequenced, but to our knowledge this technology has not yet been used to identify RNA pseudouridine sites. To address this gap, we introduce Penguin, a tool that integrates several machine learning models (i.e., predictors) to identify RNA pseudouridine sites in Nanopore direct RNA sequencing reads. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore device and from the corresponding basecalled k-mer. These features are used to train the predictors included in Penguin, which can then predict, in the testing phase, whether a signal is modified by the presence of a pseudouridine site. Penguin includes several predictors, among them a support vector machine (SVM), a random forest (RF), and a neural network (NN). Compared with state-of-the-art predictors of pseudouridine sites in the literature, results on two benchmark data sets show that Penguin identifies pseudouridine sites with higher accuracy: 93.38% in random-split testing and 92.61% in independent validation testing, both using the SVM. The tool is available on GitHub at https://github.com/daniel235/Penguin.
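
As a rough illustration of the feature-extraction step described above, the sketch below builds one feature vector from a basecalled k-mer and the raw current samples aligned to it. The feature set (one-hot k-mer encoding plus mean, standard deviation, median, and length of the signal segment) and all function names are assumptions made for illustration; they are not taken from the Penguin code base, whose actual features may differ.

```python
from statistics import mean, median, pstdev

BASES = "ACGU"  # RNA alphabet used for the one-hot encoding


def one_hot_kmer(kmer):
    """Encode a k-mer as a flat one-hot vector over A, C, G, U.

    Basecallers often report T instead of U, so T is mapped to U first.
    """
    vec = []
    for base in kmer.upper().replace("T", "U"):
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def signal_features(samples):
    """Summary statistics of the raw current samples for one k-mer event.

    Hypothetical feature set: mean, population std, median, sample count.
    """
    if not samples:
        raise ValueError("empty signal segment")
    return [mean(samples), pstdev(samples), median(samples), float(len(samples))]


def feature_vector(kmer, samples):
    """Concatenate k-mer and signal features into one input row for a classifier."""
    return one_hot_kmer(kmer) + signal_features(samples)


# Toy values only; real inputs would come from aligned Nanopore signal data.
example = feature_vector("GATCA", [80.0, 82.0, 81.0, 79.0])
print(len(example))
```

Rows built this way, labeled as modified or unmodified, could then be passed to any standard classifier such as an SVM, a random forest, or a neural network.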

References:
[1] Zhibin Lv, Jun Zhang, Hui Ding and Quan Zou. RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Frontiers in Bioengineering and Biotechnology, Volume 8, Article 134, February 2020.
[2] Kewei Liu, Wei Chen, and Hao Lin. XG-PseU: an eXtreme Gradient Boosting based method for identifying pseudouridine sites. Molecular Genetics and Genomics, 295, 13-21 (2020).
[3] M. Tahir, H. Tayara, and K.T. Chong. iPseU-CNN: identifying RNA pseudouridine sites using convolutional neural networks. Molecular Therapy - Nucleic Acids, 16, 463-470.
[4] J. J., Fang, T., Zhang, Z. Z., Huang, B., Zhu, X. L., and Xiong, Y. PseUI: pseudouridine sites identification based on RNA sequence information. BMC Bioinformatics, 19:11.

Keywords: RNA modification, Pseudouridine, Nanopore