RNAmountAlign: Efficient software for local, global, semiglobal pairwise and multiple RNA sequence/structure alignment

Authors: Amir H. Bayegan ^aff001; Peter Clote ^aff001
Authors place of work: Biology Department, Boston College, Chestnut Hill, MA, United States of America ^aff001
Published in the journal: PLoS ONE 15(1)
Category: Research Article
doi: https://doi.org/10.1371/journal.pone.0227177

Summary

Alignment of structural RNAs is an important problem with a wide range of applications. Since function is often determined by molecular structure, RNA alignment programs should take into account both sequence and base-pairing information for structural homology identification. This paper describes C++ software, RNAmountAlign, for RNA sequence/structure alignment that runs in O(n³) time and O(n²) space for two sequences of length n; moreover, our software returns a p-value (transformable to expect value E) based on Karlin-Altschul statistics for local alignment, as well as parameter fitting for local and global alignment. Using incremental mountain height, a representation of structural information computable in cubic time, RNAmountAlign implements quadratic time pairwise local, global and global/semiglobal (query search) alignment using a weighted combination of sequence and structural similarity. RNAmountAlign is capable of performing progressive multiple alignment as well. Benchmarking of RNAmountAlign against LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and MUSCLE shows that RNAmountAlign has reasonably good accuracy and faster run time supporting all alignment types. Additionally, our extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence, outperforms RSEARCH and sequence-only query scans and runs faster than FOLDALIGN query scan.

Keywords:

Sequence alignment – Sequence databases – Multiple alignment calculation – Transfer RNA – RNA structure – RNA folding – RNA alignment – RNA sequences

Introduction

A number of different metrics exist for comparison of RNA secondary structures, including base pair distance (BP), string edit distance (SE) [1], mountain distance (MD) [2], tree edit distance (TE) [3], coarse tree edit distance (HTE) [4], morphological distance [5] and a few other metrics. In what appears to be the most comprehensive published comparison of various secondary structure metrics [6], it was shown that all of these distance measures are highly correlated with respect to Pearson correlation when computing distances between structures taken from the Boltzmann low-energy ensemble of secondary structures [7] for the same RNA sequence—so-called intra-ensemble correlation. In contrast, these distance measures have low Pearson correlation when computing distances between structures taken from Boltzmann ensembles of different RNA sequences of the same length—so-called inter-ensemble correlation. For instance, the intra-ensemble correlation between base pair distance (BP) and mountain distance (MD) is 0.822, while the corresponding inter-ensemble correlation drops to 0.210. Intra-ensemble correlation between string edit distance (SE) and the computationally more expensive tree edit distance (TE) is 0.975, while the corresponding intra-ensemble correlation drops to 0.590—see Table 1.

**Tab. 1. Pearson correlation between various secondary structure metrics.**

Due to poor inter-ensemble correlation of RNA secondary structure metrics, and the fact that most secondary structure pairwise alignment algorithms depend essentially on some form of base pair distance, string edit distance, or free energy of common secondary structure, we have developed the first RNA sequence/structure pairwise alignment algorithm that is based on (incremental ensemble) mountain distance. Our software, RNAmountAlign, uses this distance measure, since the Boltzmann ensemble of all secondary structures of a given RNA of length n can represented as a length n vector of real numbers, thus allowing an adaptation of fast sequence alignment methods. Depending on the command-line flag given, our software, RNAmountAlign can perform pairwise alignment, (Needleman-Wunsch global [8], Smith-Waterman local [9] or semiglobal [10] alignment) as well as progressive multiple alignment (global and local), computed using a guide tree as in CLUSTAL [11]. Expect values E for local alignments are computed using Karlin-Altschul extreme-value statistics [12, 13], suitably modified to account for our new sequence/structure similarity measure. Additionally, RNAmountAlign can determine p-values (hence E-values) by parameter fitting for the normal (ND), extreme value (EVD) and gamma (GD) distributions.

We benchmark the performance of RNAmountAlign on pairwise and multiple global sequence/structure alignment of RNAs against the widely used programs LARA, FOLDALIGN, DYNALIGN, LocARNA, STRAL and MXSCARNA. LARA (Lagrangian relaxed structural alignment) [14] formulates the problem of RNA (multiple) sequence/structure alignment as a problem in integer linear programming (ILP), then computes optimal or near-optimal solutions to this problem. The software FOLDALIGN [15–17], and DYNALIGN [18] are different O(n⁴) approximate implementations of Sankoff’s O(n⁶) optimal RNA sequence/structure alignment algorithm. FOLDALIGN sets limits on the maximum length of the alignment as well as the maximum distance between subsequences being aligned in order to reduce the time complexity of the Sankoff algorithm. DYNALIGN [18] implements pairwise RNA secondary structural alignment by determining the common structure to both sequences that has lowest free energy, using a positive (destabilizing) energy heuristic for gaps introduced, in addition to setting bounds on the distance between subsequences being aligned. In particular, the only contribution from nucleotide information in Dynalign is from the nucleotide-dependent free energy parameters for base stacking, dangles, etc. LocARNA (local alignment of RNA) [19, 20] is a heuristic implementation of PMcomp [21] which compares the base pairing probability matrices computed by McCaskill’s algorithm. Although the software is not maintained, STRAL [22] which is similar to our approach, uses up- and downstream base pairing probabilities as the structural information and combines them with sequence similarity in a weighted fashion. MXSCARNA [23] is a progressive multiple alignment algorithm based on the pairwise alignment algorithm of SCARNA [24]. In contrast to Sankoff-type methods, SCARNA is a heuristic algorithm that performs alignment based on the detection of fixed-length stem candidates that belong to the consensus secondary structure of given sequences.

LARA, mLocARNA (extension of LocARNA), FOLDALIGNM [16, 25] (extension of FOLDALIGN), Multilign [26, 27] (extension of DYNALIGN), STRAL and MXSCARNA support multiple alignment. LARA computes all pairwise sequence alignments and subsequently uses the T-Coffee package [28] to construct multiple alignments. Both FOLDALIGNM and mLocARNA implement progressive alignment of consensus base pairing probability matrices using a guide tree similar to the approach of PMmulti [21]. For a set of given sequences, Multilign uses DYNALIGN to compute the pairwise alignment of a single fixed index sequence to each other sequence in the set, and computes a consensus structure. In each pairwise alignment, only the index sequence base pairs found in previous computations are used. More iterations in the same manner with the same index sequence are then used to improve the structure prediction of other sequences. The number of pairwise alignments in Multilign is linear with respect to the number of sequences. STRAL and MXSCARNA perform multiple alignment in a fashion similar to CLASTALW [29]. Table 2 provides an overview of various features, to the best of our knowledge, supported by the software benchmarked in this paper.

In this paper we provide a very fast, comprehensive software package capable of pairwise/multiple local/global/semiglobal alignment with p-values and E-values for statistical significance. Moreover, due to its speed and relatively good accuracy, the software can be used for whole-genome scans for homologues of an RNA as query, a similar modality as in the software RSEARCH [30]. This type of whole-genome scan is in contrast to Infernal [31], which requires a multiple alignment of distinct RNAs to construct a covariance model for whole-genome searches. Searching for homologues from only a single sequence is potentially useful in the case of orphan RNAs, very recently discovered RNAs that do not belong to any known RNA family.

Materials and methods

Quadratic time alignment using affine gap cost is implemented in RNAmountAlign using the Gotoh method [32] with the following pseudocode, shown for the case of semiglobal alignment. In our query scan form of semiglobal alignment, if a = a₁, …, a_n is the query sequence and b = b₁, …, b_m is the current genomic window being searched for an occurrence of a, then there is no penalty for gaps occurring to the left of the nucleotide of b aligned with a₁ nor for gaps occurring to the right of the nucleotide of b aligned with a_m, although internal gaps are penalized. This is obtained by the following pseudocode, as also found in [33]. Let g(k) denote an affine cost for size k gap, defined by g(0) = 0 and g(k) = g_i + (k − 1) ⋅ g_e for positive gap initiation [resp. extension] costs g_i [resp. g_e]. For query a = a₁, …, a_n and target b = b₁, …, b_m, define (n + 1) × (m + 1) matrices M, P, Q as follows: M_i,0 = g(i) for all 1 ≤ i ≤ n, M_0,j = 0 for all 1 ≤ j ≤ m, while for positive i, j we have M_i,j = max (M_i−1,j−1 + sim (a_i, b_j), P_i,j, Q_i,j), where sim is a similarity measure formally defined later in Eq (17). For 1 ≤ i ≤ n, 1 ≤ j ≤ m, let P_0,j = 0 and P_i,j = max(M_i−1,j + g_i, P_i−1,j + g_e), and define Q_i,0 = 0 and Q_i,j = max(M_i,j−1 + g_i, Q_i,j−1 + g_e, 0). Determine the maximum semiglobal alignment score in row n, then perform backtracking to obtain an optimal semiglobal alignment.

Incremental ensemble mountain height

A secondary structure for a given RNA nucleotide sequence a = a₁, …, a_n is a set s of base pairs (i, j), where 1 ≤ i < j ≤ n, such that:

if (i, j) ∈ s then a_i, a_j form either a Watson-Crick (AU,UA,CG,GC) or wobble (GU,UG) base pair,
if (i, j) ∈ s then j − i > θ = 3 (a steric constraint requiring that there be at least θ = 3 unpaired bases between any two positions that are paired),
if (i, j) ∈ s then for all i′ ≠ i and j′ ≠ j, (i′, j) ∉ s and (i, j′) ∉ s (nonexistence of base triples),
if (i, j) ∈ s and (k, ℓ) ∈ s, then it is not the case that i < k < j < ℓ (nonexistence of pseudoknots).

Secondary structures can be depicted in several equivalent manners, but in this paper, we use the dot-bracket notation in which matching left and right parenthesis positions indicate base-paired nucleotides. For instance, the EMBL accession code, sequence and secondary structure of a 53 nt type III hammerhead ribozyme from peach latent mosaic viroid (PLMVd), obtained from the Rfam database [34], is given as follows

> AJ550901.1/282-334

12345678901234567890123456789012345678901234567890123

GAAGAGUCGCGCUAAGCGCACUGAUGAGUCUUUGAGAUAAGACGAAACUCUUC

.((((((.((((…))))……‥((((……‥))))…)))))).

Positions 1 and 53 (for instance) are unpaired, as indicated by a dot, while positions 2 and 52 are paired and form the outermost base pair (2, 52), positions 12, 16 are paired and base pair (12, 16) constitutes one of the two apical (hairpin) loops, while the other apical (hairpin) loop is closed by the base pair (31, 40), etc.

Given an RNA sequence a = a₁, …, a_n, the base pairing probabilities pi,ja are defined by

where Z is the partition function, Sa denotes the set of secondary structures of a, E(s) is the free energy of secondary structure s using energy parameters from [35]. Given a = a₁, …, a_n of length n, the mountain height [36] h_s(k) of a secondary structure s of a at position k is defined as the number of base pairs in s that lie between an external loop and k, formally given by

We follow [2, 36] in our definition of mountain height, and related notions of ensemble mountain height and distance, while [37] and Vienna RNA package [4] differ in an inessential manner by defining h_s(k) = |{(i, j) ∈ s: i < k}| − |{(i, j) ∈ s: j ≤ k}|. If we consider the secondary structure s, defined as a set of base pairs (i, j) where 1 ≤ i < j ≤ n, as a dot-bracket notation, then h_s(k) is simply the running count, where in scanning from left-to-right we add 1 to the count for each open parenthesis (, and subtract 1 from the count for each closed parenthesis ), encountered between 1, …, k.

The ensemble mountain height 〈h(k)〉 [37] for RNA sequence a = a₁, …, a_n at position k is defined as the average mountain height, where the average is taken over the Boltzmann ensemble of all low-energy structures s of sequence a. If base pairing probabilities p_i,j have been computed, then it follows that

and hence the incremental ensemble mountain height, which for values 1 < k ≤ n is defined by m_a(k) = 〈h(k)〉 − 〈h(k − 1)〉 can be computed by

It is clear that −1 ≤ m_a(k) ≤ 1, and that both ensemble mountain height and incremental ensemble mountain height can be computed in time that is quadratic in sequence length n, provided that base pairing probabilities p_i,j have been computed. Except for the cubic time taken by a C-library function call of export_bppm() from from Vienna RNA package [4], the software RNAmountAlign has quadratic time and space requirements. The following pairwise alignment of 72 nt tRNA AL671879.2 and 69 nt tRNA D16387.1, both taken from the BRAliBase 2.1 K2 database [38], was generated by the RNAmountAlign web server, with default parameters of gap initiation −3, gap extension −1 and structural weight γ = 1/2, which latter equally weights the contributions from sequence and structural similarity. The alignment produced by RNAmountAlign is identical to the reference alignment from BRAliBase 2.1 K2, although sequence identity is only 28% (twilight zone). The consensus structure, shown in S1 Fig, is produced by a call of RNAalifold from Vienna RNA Package, given the alignment produced by RNAmountAlign; Fig 1 is an alternative display of this alignment as a superimposition of the (gapped) ensemble mountain heights of tRNA AL671879.2 and tRNA D16387.1.

> AL671879.2

GGGGAUGUAGCUCAGUGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCCCCGGGUUCGAUCCCCGGCAUCUCCA

> D16387.1

GUUUCAUGAGUAUAGC---AGUACAUUCGGCUUCCAACCGAAAGGUUUUUGUAAACAACCAAAAAUGAAAUA

> RNAalifold consensus structure

(((((((‥((((…….)))).(((((…….)))))….((((((……))))))))))))).

Transforming distance into similarity

In [39], Seller’s (distance-based) global pairwise alignment algorithm [40] was rigorously shown to be equivalent to Needleman and Wunsch’s (similarity-based) global pairwise alignment algorithm [8]. Recalling that Seller’s alignment distance is defined as the minimum, taken over all alignments of the sum of distances d(x, y) between aligned nucleotides x, y plus the sum of (positive) weights w(k) for size k gaps, while Needleman-Wunsch alignment similarity is defined as the maximum, taken over all alignments of the sum of similarities s(x, y) between aligned nucleotides x, y plus the sum of (negative) gap weights g(k) for size k gaps, Smith and Waterman [39] show that by defining

and by taking the minimum distance, rather than maximum similarity, the Needleman-Wunsch algorithm is transformed into Seller’s algorithm. Though formulated here for RNA nucleotides, equivalence holds over arbitrary alphabets and similarity measures (e.g. BLOSUM62).

To improve the intuitive understanding of our structural distance measure STRSIM, we initially define a simple distance measure d₀ between dot-bracket symbols. For dot-bracket symbol x ∈ { (, •, ) }, define the sign function by

Now define the distance d₀(x, y) = |sign(x) − sign(y)| between dot-bracket symbols x, y ∈ { (, •, ) } by

Let A=(s1*⋯sN*t1*⋯tN*) denote a fixed alignment between two arbitrary secondary structures s, t of (possibly different) lengths n, m, where si*,ti*∈{(,•,),−}, the dash − denotes the gap symbol, and where a gap is never aligned above another gap—this follows the notational convention for representation of alignments in [41]. We define the structural alignment distance for A by summing d0(si*,ti*) over those positions i where neither character si*,ti* is a gap symbol, then adding w(k) for all size k gaps in A.

Using incremental ensemble mountain height from Eq (4), we can generalize structural alignment distance from the simple case of comparing two dot-bracket representations of secondary structures to the more representative case of comparing the low-energy Boltzmann ensemble of secondary structures for RNA sequence a to that of RNA sequence b. Given RNA sequences a = a₁, …, a_n and b = b₁, …, b_m, and given a fixed sequence alignment (a1*⋯aN*b1*⋯bN*) let A=(ma(1)*⋯ma(N)*mb(1)*⋯mb(N)*) denote the corresponding alignment between the incremental ensemble mountain heights m_a(1) ⋯ m_a(n) of a and and the incremental ensemble mountain heights m_b(1) ⋯ m_b(m) of b. Generalize structural distance d₀ defined in Eq (8) to d₁ defined by d₁(a_i, b_j) = |m_a(i) − m_b(j)|, where m_a(i) and m_b(j) are real numbers in the interval [−1, 1], defined by Eq 4. Define ensemble structural alignment distance for A by summing d₁(a_i, b_j) over all aligned positions i, j for which neither character is a gap symbol, then adding positive weight w(k) for all size k gaps. By Eqs (5) and (6), it follows that an equivalent ensemble structural similarity measure between two positions a_i, b_j, denoted STRSIM(a_i, b_j), is obtained by multiplying d₁ and w(k) by −1:

This equation will be used later, since our algorithm RNAmountAlign combines both sequence and ensemble structural similarity. Indeed, −|m_a(i) − m_b(j)| ∈ [−2, 0] with maximum value of 0 while RIBOSUM85-60, shown in Table 3, has similarity values in the interval [−1.86, 2.22]. In order to combine sequence with structural similarity, both ranges should be rendered comparable as shown in the next section.

**Tab. 3. RIBOSUM85-60 similarity scores.**

Pairwise alignment

In order to combine sequence and ensemble structural similarity, we determine a multiplicative scaling factor α_seq and an additive shift factor α_str such that the mean and standard deviation for the distribution of sequence similarity values from a RIBOSUM matrix [30] (after being multiplied by α_seq) are equal to the mean and standard deviation for the distribution of structural similarity values from STRSIM (after additive shift of α_str). The RIBOSUM85-60 nucleotide similarity matrix used in this paper is given in Table 3, and the distributions for RIBOSUM and STRSIM values are shown in Fig 2 for the 72 nt transfer RNA AL671879.2. Given query a = a₁, …, a_n [resp. target b = b₁, …, b_n], let p_A, p_C, p_G, p_U [resp. pA′,pC′,pG′,pU′] denote the nucleotide relative frequencies for a [resp. b], i.e. the proportion of occurrences of each nucleotide A,C,G,U in query a [resp. target b]. Define the mean μ_seq and standard deviation σ_seq of RIBOSUM nucleotide similarities by

For 72 nt tRNA query sequence AL671879.2, there are 12 A’s, 20 C’s, 24 G’s, and 16 U’s so nucleotide relative frequencies are approximately <i>p</i><sub><i>A</i></sub> ≈ 0.167, <i>p</i><sub><i>C</i></sub> ≈ 0.278, <i>p</i><sub><i>G</i></sub> ≈ 0.333, <i>p</i><sub><i>U</i></sub> ≈ 0.222, and for 69 nt tRNA target sequence D16498.1, there are 26 A’s, 12 C’s, 12 G’s, and 19 U’s so nucleotide relative frequencies are approximately <i>p</i><sub><i>A</i></sub> ≈ 0.377, <i>p</i><sub><i>C</i></sub> ≈ 0.174, <i>p</i><sub><i>G</i></sub> ≈ 0.174, <i>p</i><sub><i>U</i></sub> ≈ 0.275. — Fig. 2. For 72 nt tRNA query sequence AL671879.2, there are 12 A’s, 20 C’s, 24 G’s, and 16 U’s so nucleotide relative frequencies are approximately p_A ≈ 0.167, p_C ≈ 0.278, p_G ≈ 0.333, p_U ≈ 0.222, and for 69 nt tRNA target sequence D16498.1, there are 26 A’s, 12 C’s, 12 G’s, and 19 U’s so nucleotide relative frequencies are approximately p_A ≈ 0.377, p_C ≈ 0.174, p_G ≈ 0.174, p_U ≈ 0.275.

Compute the base pairing probabilities pi,ja of query sequence a for 1 ≤ i ≤ j ≤ n, and pi,jb of target sequence b for 1 ≤ i ≤ j ≤ m by a call to the matrix bppm, using the Vienna RNA Package C-library.

Define the expected left p(a and right p)a base pairing probabilities, and the expected unpaired probability p•a of query sequence a by the following

Analogously define the corresponding probabilities p(b, p)b, p•b for the target sequence b.

Setting s₀(x, y) = −d₀(x, y), where d₀(x, y) is defined in Eq (8), for given query [resp. target] base pairing probabilities p₍, p_•, p₎ [resp. p(′,p•′,p)′] of dot-bracket characters, it follows that the mean μ_str and standard deviation σ_str of structural similarities can be computed by

Now we compute a multiplicative factor α_seq and an additive shift term α_str, both dependent on frequencies p_A, p_C, p_G, p_U and p₍, p_•, p₎, such that the mean [resp. standard deviation] of nucleotide similarity multiplied by α_seq is equal to the mean [resp. standard deviation] of structural similarity after addition of shift term α_str:

Given the query RNA a = a₁, …, a_n and target RNA b = b₁, …, b_m with incremental ensemble mountain heights m_a(1) ⋯ m_a(m) of a, m_b(1) ⋯ m_b(m) of b, and user-defined weight 0 ≤ γ ≤ 1, our final similarity measure is defined by

where α_seq, α_str are computed by Eqs (15) and (16) depending on probabilities p_A, p_C, p_G, p_U [resp. pA′,pC′,pG′,pU′] and p₍, p_•, p₎ [resp. p(′,p•′,p)′] of the query [resp. target]. All benchmarking computations were carried out using γ = 1/2, although it is possible to use position-specific weight γ_i,j defined as the average probability that i is paired in a and j is paired in b.

Our structural similarity measure is closely related to that of STRAL, which we discovered only after completing a preliminary version of this paper. Let plia=∑j<ipj,ia and pria=∑j>ipi,ja be the probability that position i of sequence a is paired to a position on the left or right, respectively. The similarity measure used in STRAL is defined by

From Eqs (17) and (4) our measure can be defined as

Though clearly very related, RNAmountAlign was developed independently, without knowledge of STRAL, and offers a number of functionalities unavailable in STRAL, which latter appears to be no longer maintained. For instance, RNAmountAlign supports local and semiglobal alignment, and reports p-values and E-values; these features are not available in STRAL. Since we were unable to compile STRAL, our benchmarking results for STRAL use an adaptation of our code to support Eq (18). There are nevertheless some differences in how progressive alignment is implemented in STRAL that could affect run time.

To illustrate the method, suppose that the query [resp. target] sequence is the 72 nt tRNA AL671879.2 [resp. 69 nt tRNA D16498.1] with sequence GGGGAUGUAG CUCAGUGGUA GAGCGCAUGC UUCGCAUGUA UGAGGCCCCG GGUUCGAUCC CCGGCAUCUC CA. Then nucleotide query [resp. target] probabilities are (approximately) p_A ≈ 0.167, p_C ≈ 0.278, p_G ≈ 0.333, p_U ≈ 0.222. For the 69 nt tRNA target sequence D16498.1 with sequence GUUUCAUGAG UAUAGCAGUA CAUUCGGCUU CCAACCGAAA GGUUUUUGUA AACAACCAAA AAUGAAAUA the nucleotide relative frequencies are approximately pA′≈0.377, pC′≈0.174, pG′≈0.174, pU′≈0.275. From the base pairing probabilities returned by RNAfold -p [4], we determine that p₍ = 0.3035, p_• = 0.3930, p₎ = 0.3035 [resp. p(′=0.2835, p•′=0.433, p)′=0.2835]. Using these probabilities in Eqs (10)–(14), we determine that μ_seq = −0.9098, σ_seq = 1.5871 and μ_str = −0.8298, σ_str = 0.6967. By Eqs (15) and (16), we determine that RIBOSUM scaling factor α_seq = 0.4390 and α_str = 0.4305 (all values shown rounded to 4-decimal places). Using these values, the scaled RIBOSUM mean is now −0.39936, now equal to the shifted STRSIM mean, and both the scaled RIBOSUM standard deviation and shifted STRSIM standard deviation equal 0.6967. Since the mean and standard deviation of α_seq-scaled RIBOSUM values are identical with that of α_str-shifted STRSIM values, hence can be combined in Eq (17). Although sequence identity of the reference alignment of these tRNAs from BRAliBase 2.1 K2 is only 28%, the global alignment produced by RNAmountAlign is identical to that the reference alignment, using default parameters of gap initiation −3, gap extension −1, and structural weight γ = 1/2 in Eq (17).

Fig 2 depicts the distribution of RIBOSUM85-60 [resp. STRSIM] values in this case, both before and after application of scaling factor α_seq [resp. shift α_str]—recall that α_seq and α_str] depend on p_A, p_C, p_G, p_U, p₍, p_•, p₎ of tRNA AL671879.2 and pA′,pC′,pG′,pU′,p(′,p•′,p)′ of tRNA D16498.1.

Statistics for pairwise alignment

Karlin-Altschul statistics for local pairwise alignment

For a finite alphabet A and similarity measure s, suppose that the expected similarity ∑x,y∈Apxpy·s(x,y) is negative and that s(x, y) is positive for at least one choice of x, y. In the case of BLAST, amino acid and nucleotide similarity scores are integers, for which the Karlin-Altschul algorithm was developed [12]. In contrast, RNAmountAlign similarity scores are not integers (or more generally values in a lattice), because Eq (17) combines real-valued α_seq-scaled RIBOSUM nucleotide similarities with real-valued α_str-shifted STRSIM structural similarities, which depend on query [resp. target] probabilities p_A, p_C, p_G, p_U, p₍, p_•, p₎ [resp. pA′,pC′,pG′,pU′,p(′,p•′,p)′]. For that reason, we use Theorem 1 of Karlin, Dembo and Kawabata [13], reformulated using the notation of this paper, where the similarity score s(x, y) for RNA nucleotides x, y is defined by Eq (17).

Theorem 1 (Karlin, Dembo, Kawabata [13])

Given similarity measure s between nucleotides in alphabet A = {A, C, G, U}, let λ* be the unique positive root of E[es(x,y)]=∑x,y∈Apxpy′·eλs(x,y), and let random variable S_k denote the score of a length k gapless alignment. For large local alignment score z,

where M denotes maximal segment scores for local alignment of random RNA sequences a₁, …, a_n and b₁, …, b_m, and where

Fitting data to probability distributions

Data were fit to the normal distribution (ND) by the method of moments (i.e. mean and standard deviation were taken from data analysis). Data were fit to the extreme value distribution (EVD)

by an in-house implementation of maximum likelihood to determine λ, K, as described in supplementary information to [30]. Data were fit to the gamma distribution by using the function fitdistr(x, ‘gamma’) from the package MASS in the R programming language, which determines rate and shape parameters for the density function

with where α is the shape parameter, the rate is 1/λ, where λ is known as the scale parameter.

Multiple alignment

Suppose p_A, p_C, p_G, p_U are the nucleotide probabilities obtained after the concatenation of all sequences. Let p₍, p_•, p₎ be computed by individually folding each sequence and taking the arithmetic average of probabilities of (, • and ) over all sequences. The mean and standard deviation of sequence and structure similarity are computed similar to Eqs (10)–(14).

Sequence multiplicative scaling factor α_seq and the structure additive shift factor α_str are computed from these values using Eqs (15) and (16).

As in CLUSTAL [42] and the CLUSTAL Omega [43], our software RNAmountAlign implements progressive multiple alignment using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [44] and p. 360 of [41]. In UPGMA, one first defines a similarity matrix S, where S[i, j] is equal to (maximum) pairwise sequence similarity of sequences i and j. A rooted tree is then constructed by progressively creating a parent node of the two closest siblings. Parent nodes are profiles (PSSMs) that represent alignments of two or more sequences, hence can be treated as pseudo-sequences in a straightforward adaptation of pairwise alignment to the alignment of profiles. Let’s consider an alignment of N sequences A=(a11*⋯a1M*⋯aN1*⋯aNM*) composed of M columns. Let Ai={a1i*,a2i*,…,aNi*} denote column i of the alignment (for 1 ≤ i ≤ M). Suppose p(i, x), for x ∈ {A, C, G, U, −}, indicates the probability of occurrence of a nucleotide or gap at column i of alignment A. Then sequence similarity SEQSIM between two columns is defined by

where

The structural measure for a profile is computed from the incremental ensemble heights averaged over each column. Let m_A(i) denote the arithmetic average of incremental ensemble mountain height at column A_i

where maj*(i) is the incremental ensemble mountain height at position i of sequence aj* obtained from Eq (4). Here, let maj*(i)=0 if aji* is a gap. Structural similarity between two columns is defined by

Finally, the combined sequence/structure similarity is computed from

Benchmarking

Accuracy measures

Sensitivity, positive predictive value, and F1 score for pairwise alignments were computed as follows. Let A=(a1*⋯aN*b1*⋯bN*) denote an alignment, where a_i, b_i ∈ {A, C, G, U, −}, and the aligned sequences include may contain occurrences of the gap symbol ‘−’, provided that not both ai* and bi* are gap symbols. The number TP of true positives [resp. FP of false positives] is the number of alignment pairs (ai*,bi*) in the predicted alignment that belong to [resp. do not belong to] the reference alignment. The sensitivity (Sen) [resp. positive predictive value (PPV)] of a predicted alignment is TP divided by reference alignment length [resp. TP divided by predicted alignment length]. The F1 score is the harmonic mean of sensitivity and PPV, so F1=2(1/Sen)+(1/PPV). For the computation of Sen, PPV, and F1, we consider not only nucleotide pairs (a_i, b_j), but also include pairs of the form (a_i, —) [resp. (—, b_j)], which represent an insertion in the top sequence of alignment A (equivalently a deletion from the bottom sequence of A) [resp. deletion from the top sequence of A (equivalently an insertion in the bottom sequence of A)]. Suppose that the (toy) alignment A

is produced by an algorithm (such as RNAmountAlign, FOLDALIGN, MXSCARNA, etc.). Then a = a₁a₂a₃a₄a₅, a = ACGUA, b = b₁b₂b₃b₄, b = ACUA. The length of alignment A is N = 6, while a1*⋯a6*=ACG−UA and b1*⋯b6*=A−−CUA. If B is the (toy) alignment of the same sequences

taken from a reference database (such as BRAliBase 2.1), then sequences a, b are as above, the length of alignment B is M = 5, while c1*⋯c5*=ACGUA and d1*⋯d5*=AC−UA. The number TP of correctly aligned pairs is 4, since

Because TP = 4, rather than 3 (if we had instead counted only aligned nucleotide pairs), the computations of F1 score, Sen, and PPV are duly affected.

To compare predicted structures with consensus Rfam structures, we computed the Matthews correlation coefficient (MCC) [45] as follows:

where TP_str is the number of correctly predicted base pairs, FP_str is the number of incorrectly predicted base pairs, TN_str is the number of potential base pairs absent in both predicted and reference structures and FN_str is the number of base pairs in the reference structure that were not predicted.

In the case of local alignment, since the size of the reference alignment is unknown, only the predicted alignment length and PPV are reported. To compute the accuracy of multiple alignment, we used sum-of-pair-scores (SPS) [11], defined as the number of correctly nucleotide pairs in the alignment produced by a given algorithm, divided by the total number of aligned nucleotide pairs in the reference alignment. For completeness, and to contrast this with our definition of sensitivity, PPV and F1 score, we formally define SPS as follows. Suppose that A denotes a multiple alignment of the form A=(a1,1*⋯a1,M*⋯aN,1*⋯aN,M*). For 1 ≤ i, j ≤ M, 1 ≤ k ≤ N define p_i,j,k = 1 if the nucleotide ai,k* is aligned with the nucleotide aj,k* in both the reference and predicted alignments, and p_i,j,k = 0 otherwise. Sum-of-pairs score SPS is then the sum, taken over all i, j, k, of the p_i,j,k. Though SPS can be considered as the average sensitivity, taken over all sequence pairs in the alignment, this is not technically correct in our case, since our definition of sensitivity also counts pairs of the form (X, —) and (—, X), where X ∈ {A, C, G, U}, from the reference alignment.

To measure the conservation of secondary structures in alignments, structural conservation index (SCI) was computed using RNAalifold [46]. RNAalifold computes SCI as the ratio of the free energy of the alignment, computed by RNAalifold, with the average minimum free energy of individual structures in the alignment. SCI values close to 1 [resp. 0] indicate high [resp. low] structural conservation. All computations made with Vienna RNA Package used version 2.1.7 [4] using default Turner 2004 energy parameters [35]).

Dataset for global and local alignment comparison

For pairwise global alignment benchmarking in Table 4, Figs 3 and 4, and S2–S4 Figs all 8, 976 pairwise alignments in k2 from BRAliBase 2.1 database [38] were used. Note that BRAliBase 2.1 does not include consensus secondary structures for the reference alignments, which are required in the computation of MCC. However, since BRAliBase 2.1 alignments are obtained from Rfam 7.0, we searched for the exact occurrences BRAliBase 2.1 sequences in Rfam 7.0 and computed their Rfam consensus structure. Some BRAliBase alignments are manually curated and thus different from Rfam alignments for which consensus structure could not be obtained. This produced 7, 154 pairwise BRAliBase alignments with their consensus structures. For the computation of MCC in S5 Fig, this subset of reference alignments were used.

**Tab. 4. F1 scores for pairwise global alignment.**

(Left) F1 score and (Right) structural conservation index (SCI) for <i>pairwise global alignments</i> using RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and sequence-only(<i>γ</i> = 0). — Fig. 3. (Left) F1 score and (Right) structural conservation index (SCI) for *pairwise global alignments* using RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and sequence-only(γ = 0).

Run time of <i>pairwise global alignment</i> for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, and MXSCARNA. — **Fig. 4. Run time of *pairwise global alignment* for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, and MXSCARNA.**

For each pairwise alignment, a consensus structure was determined from the Rfam family consensus structure by removing gaps, removing base pairs (i, j) which are noncanonical or for which j − i ≤ θ = 3. These derived consensus structures were used as the reference structures in the computation of MCC. To be explicit, consider the following toy example of multiple alignment in Stockholm format—for illustrative purposes, pretend that this is an Rfam seed alignment of a small RNA family (recall that additional requirements must be met concerning number of sequences and average nucleotide length for an Rfam family to be selected).

# STOCKHOLM 1.0

positions 123456789012345678

Eg1 GGACAAGCAAUGCUUGCC

Eg2 GG-CAAGCA-UGCUUGAC

Eg3 GGACAAGCAAUGCUUGCC

#=GC SS_cons ≪-⋘≪___⋙⋙>

#=GC RF GGACAAGCAAUGCUUGCC

This multiple alignment would then produce the following consensus structures for the pairwise alignment of Eg1 with Eg2.

> Eg1

123456789012345678

GGACAAGCAAUGCUUGCC

((.(((((…)))))))

> Eg2

1234567890123456

GGCAAGCAUGCUUGAC

(.((((….)))).)

Note that the Rfam consensus indicates that positions 2 and 17 are paired in the multiple sequence alignment, hence the reference structure for Eg1 has the GC base pair (2, 17) as expected. However, the the corresponding positions in Eg2 are G,A. To avoid violating the requirement (1) in the definition of secondary structure, the reference structure for Eg2 has dots at positions 2 and 15. Note that the Rfam consensus indicates a hairpin loop at positions 8 and 12 of the multiple sequence alignment, so the consensus structure for Eg1 contains the base pair (8, 12). However, since Eg2 contains A-U, this hairpin would contain only two nucleotides A,U, which would violate condition (2) of the definition of secondary structure. For that reason, there is no base pair (7, 10) in the reference structure of Eg2.

For multiple global alignment benchmarking, we used k5 BRAliBase 2.1, which includes 2,405 reference alignments, each composed of 5 sequences. For pairwise local alignment benchmarking, 75 pairwise alignments having sequence identity ≤ 70% were randomly selected from each of 20 well-known families from the Rfam 12.0 database [34], many of which were considered in a previous study [47], yielding a total of 1500 alignments. Following [48], these alignments were trimmed on the left and right, so that both first and last aligned pairs of the alignment do not contain a gap symbol. For sequences a = a₁, …, a_n [resp. b = b₁, …, b_m] from each alignment, random sequences a′ [resp. b′] were generated with the same nucleotide frequencies, then a random position was chosen in a′ [resp. b′] in which to insert a [resp. b], thus resulting in a pair of sequences of lengths 4n and 4m. Finally, since sequence identity was at most 70%, the RIBOSUM70-25 similarity matrix was used in RNAmountAlign. Preparation of the benchmarking dataset for local alignment was analogous to the method used in multiple local alignment of [48].

Dataset for correlation of p-values for different distribution fits

A pool of 2220 sequences from the Rfam 12.0 database [34] was created as follows. One sequence was selected from each Rfam family having average sequence length at most 200 nt, with the property that the base pair distance between its minimum free energy (MFE) structure and the Rfam consensus structure was a minimum. For example, the 102 nt sequence with EMBL accession code AAOX01000028.142464-42565 was selected from family RF00167 since its MFE structure was identical to its Rfam consensus structure, while the 178 nt sequence with EMBL accession code AE016827.11325416-1325239 was selected from RF00168 since the base pair distance between its MFE structure and its Rfam consensus structure was 11—the base pair distance between MFE and consensus structure for all other sequences from RF00168 was greater than or equal to 11. Subsequently, for each of 500 randomly selected query sequences from the pool of 2220 sequences, 1000 random target sequences of length 400 nt were generated to have the same expected nucleotide frequency as that of the query. For each query and random target, five semiglobal alignments were created using gap initiation costs of g_i ∈ {−1, −2, −3, −4, −5} with gap extension cost g_e equal to one-third the gap initiation cost. For each alignment score x for query and random target, the p-value was computed as 1 − CDF(x) for ND, EVD and GD, where CDF(x) is the cumulative density function evaluated at x. Additionally, a heuristic p-value was determined by calculating the proportion of alignment scores for given query that exceed x.

Software version and hardware specs

For benchmarking, we used LocARNA (version 1.8.7), FOLDALIGN (version 2.5), FOLDALIGNM (version 1.0.1), LARA (version 1.3.2), DYNALIGN (from version 5.7 of RNAstructure), the sequence alignment algorithm MUSCLE (version 3.8.31), STRAL (in-house implementation due to unavailability of the software [22], and MXSCARNA (version 2.1). For genome query scan benchmarking, we used RSEARCH 1.1, FOLDALIGN (version 2.5), and RNAmountAlignScan. Recommended default parameters were used for each software, including RNAmountAlign from this paper. The commands used to run the software are given in S1, S2 and S3 Tables. Each software package was run on a cluster of identically configured Intel Xeon 2.66 GHz 4-core nodes with 16GB of memory, running CentOS Linux release 6.10.

In contrast to all other benchmarking work described in this paper, benchmarking tests for RNAmountAlignScan, RSEARCH, and FOLDALIGN in genome scanning mode, as described in Section Query Scan, were conducted on a 24-processor Intel Xeon CPU E5-2440 2.40 GHz system with 198GB memory.

Results

We benchmarked RNAmountAlign’s performance for pairwise and multiple alignments on BRAliBase K2 and K5 datasets, respectively.

Pairwise alignment

Fig 3, S2 and S3 Figs depict moving averages of pairwise global alignment sensitivity, positive predictive value (PPV) and F1-score for the software described in this paper, as well as for LocARNA, FOLDALIGN, LARA, DYNALIGN, STRAL, and MXSCARNA. For pairwise benchmarking, reference alignments of size 2, a.k.a. K2, were taken from the BRAliBase 2.1 database [38]. BRAliBase 2.1 K2 data are based on seed alignments of the Rfam 7.0 database, and consist of 8,976 alignments of RNA sequences from 36 Rfam families.

Moving averages (window size 11) of sensitivity, positive predictive value, and F1 score were computed as a function of sequence identity, where it should be noted that the number of pairwise alignments for different values of sequence identity can vary for the BRAliBase 2.1 data (e.g. there are only 35 pairwise alignments having sequence identity < 20%). In computing moving averages, each value represents the average over a symmetric window (k − 5, k + 5) of size 11 nt centered at the value from the x-axis. For 504 pairwise global alignments, FOLDALIGN produced “No global alignment was found. This can either be due to the pruning, or because no structural alignment exists” and for 29, DYNALIGN yielded no alignments. The only BRAliBase alignment that both tools failed was Hammerhead_1.apsi36.sci72.no1. In our benchmarking the sensitivity, PPV and F1 values for such failures of FOLDALIGN and DYNALIGN were all assigned the value of 0, which explains the weaker performance of these two methods in Fig 3 compared to [14], where such failures are simply excluded from the analysis. Moreover, we observed a larger number of failed alignments in FOLDALIGN 2.5 than in the previous work of [14]. Default parameters were used for all software. For our software RNAmountAlign, gap initiation cost was -3, gap extension -1, and sequence/structure weighting parameter γ was 0.5 (value obtained by optimizing on a small set of 300 random alignments from Rfam 12.0, not considered in training or testing set). The sequence-only alignment is computed from RNAmountAlign with the same gap penalties, but for γ = 0. While its accuracy is high, RNAmountAlign is faster by an order of magnitude than LocARNA, LARA, FOLDALIGN, and DYNALIGN—indeed, algorithmic time complexity of our method is O(n³) for two sequences of length n. Since STRAL could not be compiled on any of our systems, we implemented its algorithm by modifying RNAmountAlign and obtained results for STRAL’s default parameter settings. Therefore, the run time of STRAL is identical to RNAmountAlign but RNAmountAlign achieves higher F1 score, sensitivity and PPV. MXSCARNA and RNAmountAlign have similar average F1 scores. Indeed, a paired 2-tailed Wilcoxson signed-rank test of the difference between F1 scores, as computed by MXSCARNA and RNAmountAlign, for the 8,976 pairwise global alignments mentioned in Table 4. It follows that the (null) hypothesis H₀ cannot be rejected, where H₀ asserts that the difference is 0 between F1 scores of MXSCARNA and RNAmountAlign—see S4 Table. Note however that the software MXSCARNA has slower run time than RNAmountAlign. Both MXSCARNA and RNAmountAlign support global and local alignment; however, unlike MXSCARNA, RNAmountAlign also supports semiglobal alignment and reports p-values. The right panel of Fig 4 depicts actual run times of the fastest software, RNAmountAlign, with the next fastest software, MXSCARNA and LARA. Unlike the graph in the left panel, actual run times are shown, graphed as a function of sequence length, rather than logarithms of moving averages.

In addition, Table 4 displays average pairwise global alignment F1 scores for RNAmountAlign, LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, and MXSCARNA when benchmarked on 36 families from the BRaliBase K2 database comprising altogether 8,976 RNA sequences with average length of 249.33. Averaging over all sequences, the F1 scores for the programs just mentioned were respectively 0.8370, 0.7808, 0.8406, 0.7977, 0.6822, 0.8247, 0.8402; i.e. F1 score 0.8406 of LARA and 0.8402 of MXSCARNA slightly exceeded the F1 score 0.8370 of RNAmountAlign and 0.8247 of STRAL, while other methods trailed by several percentage points. S4 Table indicates the statistical significance of difference between all 8,976 F1 scores. Morevoer, S6 and S7 Tables display values for global alignment sensitivity and positive predictive value, benchmarked on the same data for the same programs—these results are similar to the F1 scores in Tables 2 and 4.

Although there appears to be no universally accepted criterion for quality of local alignments, Table 5 shows pairwise local alignment comparisons for the above-mentioned methods supporting local alignment: RNAmountAlign, FOLDALIGN, and LocARNA. We had intended to include SCARNA_LM [48] in the benchmarking of multiple local alignment software; however, SCARNA_LM no longer appears to be maintained (The SCARNA_LM web server is no longer functional, and the authors did not respond to our request for the source code of SCARNA_LM). Since the reference alignments for the local benchmarking dataset are not known, and sensitivity depends upon the length of the reference alignment, we only report local alignment length and positive predictive value. Abbreviating RNAmountAlign by MA, FOLDALIGN by FA, and LocARNA by LOC, Table 5 shows average run time in seconds of MA (2.30 ± 2.12), FA (625.53 ± 2554.61), LOC (5317.96 ± 8585.19), average alignment length of reference alignments (118.67 ± 47.86), MA (50.35 ± 42.33), FA (114.86 ± 125.33), LOC (556.82 ± 227.00), and average PPV scores MA (0.53 ± 0.42), FA (0.64 ± 0.36), LOC (0.03 ± 0.04). S5 Table presents p-values for a 2-tailed paired Wilcoxon signed-rank test whether the difference in positive predictive values fo 1,500 pairwise local alignments.

**Tab. 5. Pairwise local alignment benchmarking.**

Taken together, these results suggest that RNAmountAlign has comparable accuracy, but much faster run time, hence making it a potentially useful tool for genome processing applications. Here it should be stressed that all benchmarking results used equally weighted contributions of sequence and ensemble structural similarity; i.e. parameter γ = 1/2 when computing similarity by Eq (17).

Statistics for pairwise alignment

Fig 5 shows fits of the relative frequency histogram of alignment scores with the normal (ND), extreme value (EVD) and gamma (GD) distributions, where local, semiglobal and global alignment scores are shown in panels from left to right. The EVD provides the best fit for local alignment sequence-structure similarity scores, as expected by Karlin-Altschul theory [12, 13]. Moreover, Fig 6 shows a 96% correlation between (expect) E-values computed by our implementation of the Karlin-Altschul method, and E-values obtained by EVD fitting of local alignment scores. In contrast, the ND provides the best fit for semiglobal sequence/structure alignment similarity scores, at least for the sequence considered in Fig 5. This is not an isolated phenomenon, as shown in Fig 6, which depicts scatter plots, Pearson correlation values and sums of squared residuals (SSRs) when computing p-values for semiglobal alignment scores between Rfam sequences and random RNA. As explained earlier, a pool of 2220 sequences from the Rfam 12.0 database [34] was created by selecting one sequence of length at most 200 nt from each family, with the property that base pair distance between its minimum free energy (MFE) structure and the Rfam consensus structure was a minimum. Then 500 sequences were randomly selected from this pool, and for each of five gap initiation and extension costs g_i = −5, −4, −3, −2, −1 with ge=gi3. Taking each of the 500 sequences successively as query sequence and for each choice of parameters, 1000 random 400 nt RNAs were generated with the same expected nucleotide relative frequency as that of the query. For each alignment score z for query and random target, the p-value was computed as 1 minus the cumulative density function, 1 − CDF(z), for fitted normal (ND), extreme value (EVD) and gamma (GD) distributions, thus defining 1000 p-values. Additionally, a heuristic p-value was determined by calculating the proportion of alignment scores for given query that exceed z. For each set of 2.5 million (500 × 5 × 1000) p-values (heuristic, ND, EVD, GD), Pearson correlation values were computed and displayed in the upper triangular portion of Fig 6, with SSRs shown in parentheses. Note that residuals were computed for regression equation row = m ⋅ column + b, where column values constitute the independent variable. Assuming that heuristic p-values constitute the reference standard, it follows that p-values computed from the normal distribution correlate best with semiglobal alignment scores computed by RNAmountAlign. Earlier studies have suggested that protein global alignment similarity scores using PAM120, PAM250, BLOSUM50, and BLOSUM62 matrices appear to be fit best by the gamma distribution (GD) [49], and that semiglobal RNA sequence alignment similarity scores (with no contribution from structure) appear to be best fit by GD [50]. However, in our preliminary studies (not shown), it appears that the type of distribution (ND, EVD, GD) that best fits RNAmountAlign semiglobal alignment depends on the gap costs applied (indeed, for certain choices, EVD provides the best fit). Since there is no mathematical theory concerning alignment score distribution for global or semiglobal alignments, it must be up to the user to decide which distribution provides the most reasonable p-values.

Relative frequency histograms of alignment scores for <i>local</i> (left), <i>semiglobal</i> (middle) and <i>global</i> (right) alignments of random RNAs produced by RNAmountAlign. — **Fig. 5. Relative frequency histograms of alignment scores for *local* (left), *semiglobal* (middle) and *global* (right) alignments of random RNAs produced by RNAmountAlign.**

Multiple alignment

We benchmarked RNAmountAlign with the software LARA, mLocARNA, FOLDALIGNM, Multilign, MXSCARNA, and sequence only MUSCLE for multiple global K5 alignments in BRAliBase 2.1. STRAL is not included since the source code could not be compiled. Fig 7 indicates average SPS and SCI as a function of average pairwise sequence identity (APSI). We used the -sci flag of RNAalifold to compute SCI from the output of each software without reference to the reference alignment. Fig 7 indicates that SCI values for outputs from various alignment algorithms is higher than the SCI value from reference alignments, suggesting that the consensus structure obtained from sequence/structure alignment algorithms has a larger number of base pairs than the the consensus structure obtained from reference alignments (this phenomenon was also in [51]). Fig 7 indicates that RNAmountAlign produces SPS scores comparable to mLocARNA, LARA and MXSCARNA, and higher than Multilign and FOLDALIGNM while the SCI score obtained from RNAmountAlign are lower than other software. Averaging over all sequences, the SPS scores for RNAmountAlign, LARA, mLocARNA, FOLDALIGNM, Multilign, MXSCARNA, and MUSCLE were respectively: 0.81 ± 0.18, 0.85 ± 0.15, 0.86 ± 0.15, 0.74 ± 0.24, 0.63 ± 0.17, 0.84 ± 0.15, and 0.82 ± 0.17, while SCI scores are respectively: 0.84 ± 0.24, 0.92 ± 0.22, 0.96 ± 0.21, 0.88 ± 0.23, 0.96 ± 0.21, 0.91 ± 0.20, and 0.79 ± 0.26. SPS score for each reference alignments is 1 by definition, and average SCI score over all reference alignments is 0.79 ± 0.26. Fig 8 indicates software run time in seconds on a logarithmic scale. This figure clearly shows that RNAmountAlign has faster run time than all other structural alignment software in our benchmarking tests, thus confirming the earlier result from pairwise benchmarking.

Sum-of-pairs (SPS) score (left) and structural conservation index (SCI) (right) for <i>multiple global alignments</i> using RNAmountAlign, LARA, mLocARNA, FoldalignM, Multilign, MXSCARNA and MUSCLE. — Fig. 7. Sum-of-pairs (SPS) score (left) and structural conservation index (SCI) (right) for *multiple global alignments* using RNAmountAlign, LARA, mLocARNA, FoldalignM, Multilign, MXSCARNA and MUSCLE.

Run time of <i>multiple global alignment</i> for RNAmountAlign, mLocARNA, LARA, FoldalignM, Multilign, MXSCARNA and MUSCLE. — **Fig. 8. Run time of *multiple global alignment* for RNAmountAlign, mLocARNA, LARA, FoldalignM, Multilign, MXSCARNA and MUSCLE.**

Query scan

We developed an extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence. RNAmountAlignScan functions by computing optimal semiglobal alignments of the query to sliding windows of the target, then returns the aligned target segments sorted by p-value. For a query of length n, target genome of length m, window size w > n, and window step size of s, a total of ms many semiglobal alignments must be computed, making the total run time of RNAmountAlignScan O(msw3). In order to show the utility of RNAmountAlignScan, we compared it with RSEARCH [30], FOLDALIGN and RNAmountAlignScan sequence-only where only sequence homology is considered. RSEARCH takes a single query sequence with its secondary structure and performs local alignment with the target genome, in order to search for homologous RNA. For our comparison, we used the query sequence from Fig 6, a 71 nt tRNA from Rfam 12.0, selected with the property that the base pair distance between its MFE structure and its Rfam consensus structure is a minimum compared with other tRNAs from Rfam 12.0. As target genome, we used the E. coli K12 MG1655 genome having 4, 641, 652 nt. Since tRNAscan-SE [52] is generally considered to be the gold standard in tRNA prediction, we measured accuracy of the software by the amount of overlap between returned predictions and predicted tRNAs according to tRNAscan-SE. tRNAscan-SE found 49 full tRNA sequences on the forward strand and 37 on the reverse strand. Default parameters of each software package were used to search the query in the forward strand of E. coli genome (see S3 Table); in the case of RNAmountAlignScan this means gap initiation of −3, gap extension of −1, and sequence/structure weighting parameter γ = 0.5. RNAmountAlignScan was also used to scan the E. coli genome using sequence-only semiglobal alignments, in which gap parameters were unchanged, but with γ = 0. Genome scanning was performed using windows of 300 nt, with step size 200, thus ensuring an overlap of 100 nt between target segments—this resulted in a total of 23, 209 semiglobal alignments. We also tested FOLDALIGN, which provides an option to scan two sequences and return the best local alignment score for each pair of positions in the two sequences. The output from FOLDALIGN with this option then requires subsequent postprocessing using the LocateHits script included in FOLDALIGN in order to generate a list of non-overlapping local alignments. For the query tRNA described above, FOLDALIGN could not process the E. coli genome, and instead produced memory errors. There is a newly developed, specialized form of FOLDALIGN, called “Foldalign version 2.5.1_long_sequences” that we recently learned of from Jakob Havgaard through personal communication and can process long RNA sequences (of more than 30,000 nt) without such memory errors. Fig 9 indicates the precision-recall plot for the top 49 and 37 tRNA predictions on the forward and reverse strands, respectively. RSEARCH is not shown in the plot because it reported 89 hits, none of which had any overlap with tRNAs predicted by tRNAscan-SE and thus its sensitivity and PPV are equal to zero. Sequence length of the predictions returned by RSEARCH was 11.73 ± 0.58. For calculation of PPV as a function of sensitivity, for each of the three methods on each strand, we sorted the predictions in descending order by p-value. Given the i-th prediction, we determined the number of predictions preceding the current i-th prediction (including the i-th), and calculated the number of true positives in that collection then divided this number by 49 for positive or 37 for reverse strands to obtain sensitivity. True positives are predictions with overlap proportion of greater than 0.8 to any of the tRNAscan-SE sequences. The overlap proportion between predictions and tRNAscan-SE is defined as the ratio |A ∩ B|/|B|, where A is a tRNA predicted by each software and B is a tRNA detected by tRNAscan-SE, the latter assumed to be the gold standard. FOLDALIGN with area under the curve of 0.90 and 0.64 for forward and reverse strands outperforms RNAmountAlignScan with 0.27 and 0.28 and sequence-only with 0.08 and 0.16. However, only one the FOLDALIGN predictions on each strand had p-value less than 0.1 while RNAmountAlignScan reported significant p-values for all of the true positives Fig 9. Moreover, FOLDALIGN finished in 12.06 hours while run time of RNAmountAlign was 1.63.

Fig 10 depicts the sequence and MFE structure of the first false positive of RNAmountAlignScan; i.e. that prediction having statistically significant sequence and structural similarity to the query tRNA, but which shares no overlap with a tRNA predicted by tRNAscan-SE. This prediction has 42% sequence identity to the query tRNA, and its MFE forms a cloverleaf secondary structure. In contrast to all other benchmarking computations of this paper, due to the memory requirements for genome scanning mode, each program for query search was run on a 24-processor Intel Xeon CPU E5-2440 2.40 GHz system with 198GB memory. The commands to run the software are given in S3 Table. Run time for FOLDALIGN, RNAmountAlignScan, and RSEARCH were respectively 12.06, 1.63, and 0.62 hours and memory usage were 0.57, 20.61 and 0.49 Gigabytes. The number of random RNAs used for p-value computation in both RSEARCH and RNAmountAlignScan was 1000, and both used the RIBOSUM 85-60 matrix. The outputs of all the program are available at http://bioinformatics.bc.edu/clotelab/RNAmountAlign/.

The MFE secondary structure of an <i>E. coli</i> segment which is predicted to be a tRNA by RNAmountAlignScan and not by tRNAscan-SE. — **Fig. 10. The MFE secondary structure of an *E. coli* segment which is predicted to be a tRNA by RNAmountAlignScan and not by tRNAscan-SE.**

Conclusion

RNAmountAlign is a new C++ software package for RNA local, global, and semiglobal sequence/structure alignment, which provides accuracy comparable with that of a number of widely used programs, but provides much faster run time. RNAmountAlign additionally computes E-values for local alignments, using Karlin-Altschul statistics, as well as p-values for normal, extreme value and gamma distributions by parameter fitting.

Supporting information

S1 Table [tiff]
Commands used for benchmarking various software packages for global alignment.

S2 Table [tiff]
Commands used for benchmarking various software packages for local alignment.

S3 Table [tiff]
Commands used for benchmarking software packages for query scan.

S4 Table [tiff]
-values from two-tailed paired Wilcoxon signed rank test of all 8,976 F1 scores for pairwise global alignments indicated in of the main text.

S5 Table [tiff]
-values from two-tailed paired Wilcoxon signed rank test of all 1500 positive predictive values for pairwise local alignments indicated in of the main text.

S6 Table [seqid]
Average sensitivity scores (± one standard deviation) for of and four widely used RNA sequence/structure alignment algorithms on the benchmarking set of 8,976 pairwise alignments from the database [].

S7 Table [seqid]
Average positive predictive value (PPV) scores (± one standard deviation) for alignment of and four widely used RNA sequence/structure alignment algorithms on the benchmarking set of 8,976 pairwise alignments from the database [].

S8 Table [tiff]
Initial portion of a table that determines expected base pairing probabilities , , as a function of nucleotide probabilities , , , .

S1 Fig [tiff]
Consensus structure for the pairwise alignment indicated in of the main text.

S2 Fig [tiff]
F1-score for , , , , , , and sequence-only alignments( = 0) for .

S3 Fig [tiff]
Average sensitivity (Sen) and positive predictive value (PPV) for , , , , , , and sequence-only alignments ( = 0) for .

S4 Fig [tiff]
Average pairwise sensitivity (left) and positive predictive value (right) for using , , , and in the k5 database used for benchmarking.

S5 Fig [tiff]
Matthews Correlation Coefficient (MCC) for the quality of secondary structure prediction (see text) from each of , , , , , , and sequence-only alignments ( = 0) for .

S6 Fig [tiff]
F1-score for , , , , , , and sequence-only alignments( = 0) for , where any failure of the benchmarked program to output an assignment is simply ignored.

S7 Fig [tiff]
Average sensitivity (Sen) and positive predictive value (PPV) for , , , , , , and sequence-only alignments ( = 0) for , where any failure of the benchmarked program to output an assignment is simply ignored.

S8 Fig [tiff]
Illustration of a potential weakness of .

S1 Appendix [pdf]
Software tutorial.

Zdroje

1. Levenshtein VI. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady. 1966;10:707.

2. Moulton V, Zuker M, Steel M, Pointon R, Penny D. Metrics on RNA secondary structures. Journal of Computational Biology. 2000;7:277–292. doi: 10.1089/10665270050081522 10890402

3. Shapiro BA. An algorithm for comparing multiple RNA secondary structures. Comput Appl Biosci. 1988;4(3):387–393. doi: 10.1093/bioinformatics/4.3.387 2458170

4. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011;6:26. doi: 10.1186/1748-7188-6-26 22115189

5. Voss B, Meyer C, Giegerich R. Evaluating the predictability of conformational switching in RNA. Bioinformatics. 2004;20(10):1573–1582. doi: 10.1093/bioinformatics/bth129 14962925

6. Barsacchi M, Baù A, Bechini A. Extensive Assessment of Metrics on RNA Secondary Structures and Relative Ensembles. In: Proceedings of the 31st Annual ACM Symposium on Applied Computing. SAC’16. New York, NY, USA: ACM; 2016. p. 44–47. Available from: http://doi.acm.org/10.1145/2851613.2851868.

7. Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 2003;31(24):7280–7301. doi: 10.1093/nar/gkg938 14654704

8. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443–453. doi: 10.1016/0022-2836(70)90057-4 5420325

9. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5 7265238

10. Gusfield D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University; 1997.

11. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999;27(13):2682–2690. doi: 10.1093/nar/27.13.2682 10373585

12. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990;87(6):2264–2268. 2315319

13. Karlin S, Dembo A, Kawabata T. Statistical composition of high-scoring segments from molecular sequences. Annals of Statistics. 1990;18(2):571–581. doi: 10.1214/aos/1176347616

14. Bauer M, Klau GW, Reinert K. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics. 2007;8:271. doi: 10.1186/1471-2105-8-271 17662141

15. Havgaard JH, Lyngsø R, Stormo G, Gorodkin J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics. 2005;21(9). doi: 10.1093/bioinformatics/bti279 15657094

16. Havgaard J, Kaur S, Gorodkin J. Comparative ncRNA gene and structure prediction using Foldalign and FoldalignM. Curr Protoc Bioinformatics. 2012;0(O):O.

17. Sundfeld D, Havgaard JH, De Melo AC, Gorodkin J. Foldalign 2.5: multithreaded implementation for pairwise structural RNA alignment. Bioinformatics. 2016;32(8):1238–1240. doi: 10.1093/bioinformatics/btv748 26704597

18. Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 2002;317(2):191–203. doi: 10.1006/jmbi.2001.5351 11902836

19. Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol. 2007;3(4):e65. doi: 10.1371/journal.pcbi.0030065 17432929

20. Smith C, Heyne S, Richter AS, Will S, Backofen R. Freiburg RNA Tools: a web server integrating INTARNA, EXPARNA and LOCARNA. Nucleic Acids Res. 2010;38(Web):W373–W377. doi: 10.1093/nar/gkq316 20444875

21. Hofacker IL, Bernhart SH, Stadler PF. Alignment of RNA base pairing probability matrices. Bioinformatics. 2004;20(14):2222–2227. doi: 10.1093/bioinformatics/bth229 15073017

22. Dalli D, Wilm A, Mainz I, Steger G. STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics. 2006;22(13):1593–1599. doi: 10.1093/bioinformatics/btl142 16613908

23. Tabei Y, Kiryu H, Kin T, Asai K. A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics. 2008;9(1):33. doi: 10.1186/1471-2105-9-33 18215258

24. Tabei Y, Tsuda K, Kin T, Asai K. SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics. 2006;22(14):1723–1729. doi: 10.1093/bioinformatics/btl177 16690634

25. Torarinsson E, Havgaard JH, Gorodkin J. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23(8):926–932. doi: 10.1093/bioinformatics/btm049 17324941

26. Xu Z, Mathews DH. Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences. Bioinformatics. 2011;27(5):626–632. doi: 10.1093/bioinformatics/btq726 21193521

27. Xu ZZ, Mathews DH. Prediction of Secondary Structures Conserved in Multiple RNA Sequences. Methods Mol Biol. 2016;1490:35–50. doi: 10.1007/978-1-4939-6433-8_3 27665591

28. Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302(1):205–217. doi: 10.1006/jmbi.2000.4042 10964570

29. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–4680. doi: 10.1093/nar/22.22.4673 7984417

30. Klein RJ, Eddy SR. RSEARCH: Finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003;4:44. doi: 10.1186/1471-2105-4-44 14499004

31. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29(22):2933–2935. doi: 10.1093/bioinformatics/btt509 24008419

32. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982;162(3):705–708. doi: 10.1016/0022-2836(82)90398-9 7166760

33. Ferre F, Ponty Y, Lorenz WA, Clote P. DIAL: a web server for the pairwise alignment of two RNA three-dimensional structures using nucleotide, dihedral angle and base-pairing similarities. Nucleic Acids Res. 2007;35(Web):W659–W668. doi: 10.1093/nar/gkm334 17567620

34. Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 2015;43(Database):D130–D137. doi: 10.1093/nar/gku1063 25392425

35. Turner DH, Mathews DH. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 2010;38(Database):D280–D282. doi: 10.1093/nar/gkp892 19880381

36. Hogeweg P, Hesper B. Energy directed folding of RNA sequences. Nucleic Acids Res. 1984;12(1):67–74. doi: 10.1093/nar/12.1part1.67 6198625

37. Huynen MA, Perelson A, Vieira WA, Stadler PF. Base pairing probabilities in a complete HIV-1 RNA. J Comput Biol. 1996;3(2):253–274. doi: 10.1089/cmb.1996.3.253 8811486

38. Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005;33(8):2433–2439. doi: 10.1093/nar/gki541 15860779

39. Smith TF, Waterman MS. Comparison of biosequences. Advances in Applied Mathematics. 1981;2:482–489. doi: 10.1016/0196-8858(81)90046-4

40. Sellers PH. On the theory and computation of evolutionary distances. SIAM J Appl Math. 1974;26:787–793. doi: 10.1137/0126070

41. Waterman MS. Introduction to Computational Biology. Chapman and Hall/CRC; 1995.

42. Bashford D, Chothia C, Lesk AM. Determinants of a protein fold: Unique features of the globin amino acid sequences. Journal of Molecular Biology. 1987;196(1):199–216. https://doi.org/10.1016/0022-2836(87)90521-3. 3656444

43. Sievers F, Wilm A, Dineen D, Gibson JJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7(539):1–6.

44. Sneath PHA, Sokal RR. Numerical taxonomy. The principles and practice of numerical classification. San Francisco, W.H. Freeman and Company., USA: Taylor & Francis, Ltd.; 1973.

45. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)—Protein Structure. 1975;405(2):442–451. https://doi.org/10.1016/0005-2795(75)90109-9.

46. Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics. 2008;9:474. doi: 10.1186/1471-2105-9-474 19014431

47. Clote P, Ferre F, Kranakis E, Krizanc D. Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA. 2005;11(5):578–591. doi: 10.1261/rna.7220505 15840812

48. Tabei Y, Asai K. A local multiple alignment method for detection of non-coding RNA sequences. Bioinformatics. 2009;25(12):1498–1505. doi: 10.1093/bioinformatics/btp261 19376823

49. Pang H, Tang J, Chen SS, Tao S. Statistical distributions of optimal global alignment scores of random protein sequences. BMC Bioinformatics. 2005;6:257. doi: 10.1186/1471-2105-6-257 16225696

50. Hertel J, De Jong D, Marz M, Rose D, Tafer H, Tanzer A, et al. Non-coding RNA annotation of the genome of Trichoplax adhaerens. Nucleic Acids Res. 2009;37(5):1602–1615. doi: 10.1093/nar/gkn1084 19151082

51. Smith MA, Seemann SE, Quek XC, Mattick JS. DotAligner: identification and clustering of RNA structure motifs. Genome Biol. 2017;18(1):244. doi: 10.1186/s13059-017-1371-3 29284541

52. Lowe TM, Chan PP. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Research. 2016;44(W1):W54–W57. doi: 10.1093/nar/gkw413 27174935

53. Huynen M, Gutell R, Konings D. Assessing the reliability of RNA folding using statistical mechanics. Edited by Draper D. E. Journal of Molecular Biology. 1997;267(5):1104–1112. https://doi.org/10.1006/jmbi.1997.0889.

Článek ETAPOD: A forecast model for prediction of black pod disease outbreak in Nigeria

Článek Disparate effects of antibiotic-induced microbiome change and enhanced fitness in Daphnia magna

Článek Deliver on Your Own: Disrespectful Maternity Care in rural Kenya

Článek Quantification of microaerobic growth of Geobacter sulfurreducens

Článek Number of days required to estimate physical activity constructs objectively measured in different age groups: Findings from three Brazilian (Pelotas) population-based birth cohorts

Článek Identifying site- and stimulation-specific TMS-evoked EEG potentials using a quantitative cosine similarity metric

Článek Design and assessment of TRAP-CSP fusion antigens as effective malaria vaccines

Článek Phanerochaete chrysosporium strain B-22, a nematophagous fungus parasitizing Meloidogyne incognita

Článek Non-disclosure of tuberculosis diagnosis by patients to their household members in south western Uganda

Článek Patch testing in Lao medical students

Článek A competence-regulated toxin-antitoxin system in Haemophilus influenzae

Článek Exploring the mechanism of olfactory recognition in the initial stage by modeling the emission spectrum of electron transfer