RNAmountAlign: Efficient software for local, global, semiglobal pairwise and multiple RNA sequence/structure alignment


Authors: Amir H. Bayegan; Peter Clote
Author affiliation: Biology Department, Boston College, Chestnut Hill, MA, United States of America
Published in: PLoS ONE 15(1)
Category: Research Article
doi: 10.1371/journal.pone.0227177

Abstract

Alignment of structural RNAs is an important problem with a wide range of applications. Since function is often determined by molecular structure, RNA alignment programs should take into account both sequence and base-pairing information for structural homology identification. This paper describes C++ software, RNAmountAlign, for RNA sequence/structure alignment that runs in O(n^3) time and O(n^2) space for two sequences of length n; moreover, our software returns a p-value (transformable to expect value E) based on Karlin-Altschul statistics for local alignment, as well as parameter fitting for local and global alignment. Using incremental mountain height, a representation of structural information computable in cubic time, RNAmountAlign implements quadratic time pairwise local, global and global/semiglobal (query search) alignment using a weighted combination of sequence and structural similarity. RNAmountAlign is also capable of performing progressive multiple alignment. Benchmarking of RNAmountAlign against LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and MUSCLE shows that RNAmountAlign has reasonably good accuracy and faster run time, while supporting all alignment types. Additionally, our extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence, outperforms RSEARCH and sequence-only query scans and runs faster than FOLDALIGN query scan.
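To make the idea of a weighted combination of sequence and structural similarity concrete, the following Python sketch aligns two RNAs globally in quadratic time. It is an illustration under simplifying assumptions, not the RNAmountAlign implementation: the incremental mountain height is taken from a single dot-bracket structure (+1, -1, 0) rather than from a structural ensemble, gap costs are linear, and the function names, the `seq_weight` parameter, and the match/mismatch/gap values are all hypothetical. The last function uses the standard Poisson approximation relating a p-value to an expect value, p = 1 - exp(-E).

```python
import math


def incremental_mountain_height(dot_bracket):
    """Per-position height change for ONE structure (simplifying assumption):
    +1 where a base pair opens, -1 where it closes, 0 for an unpaired base."""
    step = {"(": 1.0, ")": -1.0, ".": 0.0}
    return [step[c] for c in dot_bracket]


def global_alignment_score(seq_a, inc_a, seq_b, inc_b, seq_weight=0.5,
                           match=1.0, mismatch=-1.0, gap=-2.0):
    """Needleman-Wunsch score with a weighted sequence/structure pair score.

    All scoring values here are hypothetical placeholders, chosen only
    to illustrate the weighted combination.
    """
    n, m = len(seq_a), len(seq_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            seq_sim = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Structural similarity: penalize differing height increments.
            str_sim = -abs(inc_a[i - 1] - inc_b[j - 1])
            pair = seq_weight * seq_sim + (1.0 - seq_weight) * str_sim
            dp[i][j] = max(dp[i - 1][j - 1] + pair,  # align i with j
                           dp[i - 1][j] + gap,       # gap in seq_b
                           dp[i][j - 1] + gap)       # gap in seq_a
    return dp[n][m]


def p_value_to_e_value(p):
    """Invert p = 1 - exp(-E), i.e. E = -ln(1 - p), for 0 <= p < 1."""
    return -math.log1p(-p)


if __name__ == "__main__":
    inc = incremental_mountain_height("(..)")
    print(global_alignment_score("GCAC", inc, "GCAC", inc))
    print(p_value_to_e_value(0.05))
```

Because the structural term depends only on position-wise values, the dynamic program fills an n-by-m table with constant work per cell, which is what keeps the alignment step quadratic once the structural profile has been computed.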

Keywords:

Multiple alignment calculation – RNA alignment – RNA folding – RNA structure – Sequence alignment – Sequence databases – Transfer RNA – RNA sequences



This article was published in the journal

PLOS One


2020, Issue 1