basic concepts since bioinformatics is a borderline area: a computer scientist 
will find that biological notions are particularly useful and vice versa. 
The second part of the thesis will first analyse the software instruments 
applied in the effort to develop a procedure as fast and accurate as possible. 
The variety of programs and scripts involved will be described in detail as 
well as the general strategy based on the idea of a two-step searching 
method. Results achieved through the method will be presented and 
discussed in the main part of the section. 
An extensive bibliography is given for every major subject examined as 
well as web references to resources and databanks. 
Support material to this thesis can be retrieved from the web site 
http://bio.lundberg.gu.se/srpscan/paper.html [117]. 
  
Part A: Biological and 
bioinformatics background 
  
2. Biological background: noncoding RNA 
2.1 Flow of genetic information 
The genetic information stored in DNA is expressed through two 
essential processes: transcription of DNA into mRNA and translation of 
mRNA into proteins. 
Messenger RNA thus serves as a template for protein synthesis: 
triplets of bases in the mRNA code for amino acids in proteins following 
specific rules known as the “genetic code”. 
Other RNAs (transfer RNA, tRNA, and ribosomal RNA, rRNA) are 
involved in this process: tRNA plays a key role in the recognition of the 
triplets, and rRNA is the major component of the ribosome, with both a 
catalytic and a structural role in protein synthesis. All forms of cellular RNA 
are synthesized by RNA polymerases, which use DNA as a template; proteins 
are then synthesized by the ribosome using mRNA templates. 
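The two steps described above can be illustrated with a minimal Python sketch. It assumes that the DNA coding (sense) strand is given, so that the mRNA has the same sequence with U in place of T; the function names and the deliberately partial codon table are illustrative choices, not part of the procedure developed in this thesis.

```python
# Partial codon table: only the codons needed for this illustration.
# The full genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "M",                                      # methionine, start codon
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA triplet by triplet until a stop codon is reached.

    Codons missing from the partial table are reported as 'X'.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTTTGGCAAATAA")
print(mrna, translate(mrna))
```
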
2.2 Structure of nucleic acids 
Despite their importance in cellular function, nucleic acids have an 
unexpectedly simple structure. RNA and DNA are long polymers built from 
only four kinds of nucleotides, distinguished by their bases: adenine, 
guanine, cytosine and thymine (or uracil in RNA). 
The nucleotide structure can be broken up into two parts: the 
sugar-phosphate backbone and the base. All nucleotides share the 
sugar-phosphate backbone. Nucleotide polymers are formed by linking the 
monomer units together: the 3'-hydroxyl group on the sugar of one 
nucleotide reacts with the 5'-phosphate group of its neighbour, so that A, T 
(or U), G and C can be joined into a long chain. The base on each 
nucleotide is different, but the bases still show similarities: adenine (A) and 
guanine (G) are purines, with a two-ring structure, and differ only in the 
groups attached to the rings. Similarly, cytosine (C), thymine (T) and uracil 
(U) are pyrimidines, with a single-ring structure, and differ in their side 
groups. 
If two strands of nucleic acid are adjacent to one another, the bases along 
one strand can interact with complementary bases in the other strand. 
Adenine pairs with thymine through two hydrogen bonds, while cytosine 
pairs with guanine through three. 
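As a small illustration of these pairing rules, the following Python sketch derives the complementary strand of a DNA sequence and counts the hydrogen bonds in the resulting duplex. It assumes the usual antiparallel orientation of the two strands, so the complement is read in reverse; the function names are chosen here for illustration only.

```python
# Watson-Crick partners and the number of hydrogen bonds per pair.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand.upper()))


def duplex_hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds formed when the strand is fully paired."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


print(reverse_complement("ATGC"), duplex_hydrogen_bonds("ATGC"))
```

A GC-rich duplex therefore contains more hydrogen bonds than an AT-rich duplex of the same length.
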
2.2.1  DNA structure 
Cellular DNA consists of two strands that are complementary to each 
other. When correctly aligned, A pairs with T and G pairs with C: in 
solution, the two strands will usually find each other and form a double 
helix. This reaction is favourable because of the numerous hydrogen bonds 
that can be formed between the complementary bases. A DNA molecule 
can stretch for millions of base pairs, and the amount of DNA varies greatly 
between organisms. 
2.2.2 RNA structure 
RNA is similar in structure to DNA except that uracil takes the place of 
thymine and that the sugar is ribose, which carries an additional hydroxyl 
group at the 2' position. RNA in most cells is single-stranded but, if 
complementary base sequences are present in the RNA, it can fold back 
upon itself and base pair. This secondary structure of RNA often results in 
loops and stems that drastically affect the function of the molecule. RNAs 
with extensive secondary structure play important physiological roles in 
translation, in transcription and in DNA replication. 
  
2.3 Functions of RNA molecules 
A number of RNAs that do not function as mRNAs, transfer RNAs 
(tRNAs), or ribosomal RNAs (rRNAs) have been discovered. In the 
literature, the non-mRNAs have been referred to in many different ways: the 
term small RNAs (sRNAs) has been more common in bacteria, whereas the 
term noncoding RNAs (ncRNAs) has been preferred in eukaryotes. 
ncRNAs vary widely in size: from 21 to 25 nt for the large family of 
microRNAs (miRNAs) that modulate development in C. elegans, Drosophila, 
and mammals [2], through up to 200 nt for the sRNAs commonly found as 
translational regulators in bacterial cells [3], to more than 10,000 nt for 
RNAs implicated in gene silencing in higher eukaryotes [4]. The functions 
described for ncRNAs so far are extremely varied. 
The mechanisms of action of the characterized ncRNAs can be grouped 
into several general categories. In some ncRNAs, base pairing with another 
RNA or DNA molecule is central to function. The snoRNAs that direct RNA 
modification and the bacterial RNAs that modulate translation by forming 
base pairs with specific target mRNAs are examples of this category. 
Some ncRNAs resemble the structures of other nucleic acids: the 6S 
RNA structure is reminiscent of an open bacterial promoter and the tmRNA 
has characteristics of both tRNAs and mRNAs. 
Other ncRNAs, such as the RNase P RNA, have catalytic functions. 
Most ncRNAs are associated with proteins that augment their functions; 
however, some ncRNAs, such as the snRNAs and the SRP RNA, serve key 
structural roles in RNA-protein complexes. Several ncRNAs fit into more 
than one mechanistic category.  
The mechanisms of action for a number of ncRNAs are unknown, and it 
is probable that some ncRNAs act in ways that have not yet been 
established. 
  
Some investigators have suggested that many ncRNAs are relics of an 
RNA world in which RNA carried out all of the functions of a primitive cell. 
Nevertheless, given the versatility of RNA and the fact that the properties of 
RNA provide advantages over peptides for some mechanisms, it is likely 
that a number of ncRNAs have evolved more recently [5, 6]. 
2.4 Secondary structure of RNA 
DNA molecules are usually encountered as two complementary strands, 
forming a double helix over long stretches [7]. In contrast to DNA, RNA 
prevails as a single strand. Owing to small self-complementary regions, 
RNA commonly exhibits a complex secondary structure, consisting of 
relatively short double-helical segments alternating with single-stranded 
regions. 
Many complex tertiary interactions fold the RNA in its final three-
dimensional form. The folded RNA molecule is stabilized by a variety of 
interactions, the most important being hydrogen bonding of the bases and 
stacking. Some of the commonly found structural elements are illustrated in 
Fig. 2.1. 
2.4.1 Base interactions in RNA 
Canonical pairs. The bulk of secondary structure interactions is formed 
by the normal (Watson-Crick) type of base pairing, in which A and U are 
joined by two hydrogen bonds and G and C by three. 
Wobble pairs. The wobble hypothesis proposed that interactions other 
than the canonical pairings are possible between the first base of the 
anticodon and the third base of the codon. Of these 'wobble' interactions, the 
non-canonical pairing between G and U is often found in RNA secondary 
structure. It usually appears to play an incidental role: G-U pairs are seen at 
a variety of positions in tRNAs where homologous tRNAs usually have 
canonical pairs. This suggests that G-U pairs generally are not 
characteristic fixed features of RNA structure, but rather behave as 
canonical pairings. However, this is not always the case: in some 
instances the wobble base pair is highly conserved, and it has, for example, 
been shown to be a specific feature in the identity of alanine-tRNA [8]. 
Other non-canonical pairs. Non-canonical base pairing has mainly 
been demonstrated and studied in short DNA duplexes using X-ray 
diffraction [9], NMR [10] and thermodynamic studies [11]. However, 
non-canonical pairs have also been observed experimentally in RNA; for 
example, the U-C pair has been detected in an X-ray diffraction study by 
Holbrook et al. [12]. Several comparative and experimental results indicate 
that G:A pairs are not rare in RNA structure. 
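The three classes of base interaction described in this section can be summarized in a short Python sketch. The function name and the three labels are illustrative assumptions; the classification only reflects the identity of the two bases and says nothing about the geometry or stability of a particular pair.

```python
# Unordered base pairs, so that G-U and U-G are treated alike.
CANONICAL = {frozenset("AU"), frozenset("GC")}
WOBBLE = {frozenset("GU")}


def classify_pair(base1: str, base2: str) -> str:
    """Label an RNA base pair as canonical, wobble or non-canonical."""
    pair = frozenset((base1.upper(), base2.upper()))
    if pair in CANONICAL:
        return "canonical"
    if pair in WOBBLE:
        return "wobble"
    return "non-canonical"


for pair in [("A", "U"), ("G", "C"), ("G", "U"), ("U", "C"), ("G", "A")]:
    print(pair, classify_pair(*pair))
```
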
2.5 Elements of the RNA secondary structure 
2.5.1 Duplexes 
Duplex RNA consists of a right-handed double helix stabilized by 
hydrogen bonds between the bases on opposite complementary strands and 
by stacking between adjoining bases. X-ray diffraction studies of fibres and 
crystals [13] have shown that the helices are of the A-form. The A-form 
RNA helix has 11 bp per turn, as opposed to 10 bp per turn for the usual B-
form DNA helix. 
  
 
Fig. 2.1 Illustration of some of the structure elements found in rRNA. The RNA 
backbone is depicted as a thick line, whereas bases are shown as thin lines: a = 
hairpin, b = internal loop, c = bulge loop, d = junction, e = duplex (long range 
interaction), f = pseudoknot. Picture adapted from [7]. 
2.5.2 Single stranded regions 
Single stranded regions are formed by unpaired nucleotides. In the 
absence of tertiary interactions to constrain the single stranded regions, 
they are assumed to be roughly ordered by base stacking in a helical 
geometry. 
2.5.3 Hairpins 
A hairpin consists of a duplex bridged by a loop of unpaired nucleotides. 
The smallest possible loop in a hairpin was originally thought to be three 
nucleotides but there is growing evidence that in some sequences, two 
unpaired nucleotides suffice [14]. Thermodynamic studies of hairpins with 
loop sequences (U)n, (C)n and (A)n (n=3 to 9) showed that loops containing 
four or five nucleotides are the most stable [15]. However, the stability of a 
hairpin loop varies with loop sequence and size. Some tetraloops are 
particularly abundant in RNA structures, and two of these have been 
shown to form unusually stable hairpins: UUCG [16] and GAAA [17]. 
NMR studies of the hairpin GGAC(UUCG)GUCC demonstrated that 
interactions between loop bases and the sugar-phosphate backbone 
contribute to this unusual stability [14]. The backbone angles of the 
nucleotides in small hairpin loops differ significantly from A-form 
geometry. In longer loops, some of the nucleotides have been shown to 
stack in normal A-form geometry.  
2.5.4 Bulge loops 
Bulge loops are formed by unpaired nucleotides in one strand of a 
double-stranded region, where the other strand has contiguous base pairing. 
Single base bulges can intercalate into the helix or loop out of the helix 
depending on the temperature, the identity of the bulged nucleotide and the 
sequence of the surrounding duplex. Bulge loops can affect the long-range 
structure of RNA by creating a bend in a duplex. Bending has been detected 
by the altered mobility of RNAs containing bulge loops in non-denaturing 
gel electrophoresis. Distortions due to bulges may extend into the 
surrounding duplex region [18].  
2.5.5 Internal loops 
Internal loops are formed by nucleotides on both strands of a duplex that 
are not capable of forming Watson-Crick base pairs. Symmetrical internal 
loops contain an equal number of unpaired nucleotides in each strand; 
asymmetrical internal loops contain an unequal number.  
2.5.6 Junctions 
Junctions or multi-branched loops are formed where three or more 
duplexes come together, separated by single stranded stretches with a 
variable number of unpaired nucleotides. Different helical regions can stack 
coaxially in these junctions. The conformation of the unpaired nucleotides 
in the junctions has a great impact on the three-dimensional structure by 
orienting the stem regions that meet. They have also been implicated in the 
catalysis of specific reactions [19]. 
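The loop types described in sections 2.5.3 to 2.5.6 can be recognized automatically once a secondary structure is written down. The Python sketch below assumes the dot-bracket notation, which is not used elsewhere in this chapter: matching parentheses denote a base pair and a dot denotes an unpaired nucleotide. The function names and the extra label "stack" (a base pair directly stacked on the next pair of a duplex) are illustrative choices. Pseudoknots cannot be written in this simple notation and are not handled.

```python
def parse_pairs(structure: str) -> dict:
    """Map every paired position to its partner (0-based indices)."""
    stack, partner = [], {}
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return partner


def classify_loops(structure: str) -> list:
    """Return (i, j, kind) for the loop closed by each base pair i-j."""
    partner = parse_pairs(structure)
    loops = []
    for i in sorted(partner):
        j = partner[i]
        if i > j:
            continue  # visit each pair once, from its 5' side
        # Collect the base pairs directly enclosed by the pair (i, j).
        inner, k = [], i + 1
        while k < j:
            if k in partner:
                inner.append((k, partner[k]))
                k = partner[k] + 1  # skip over the enclosed helix
            else:
                k += 1
        if not inner:
            kind = "hairpin"
        elif len(inner) >= 2:
            kind = "junction"  # three or more helices meet
        else:
            p, q = inner[0]
            left, right = p - i - 1, j - q - 1  # unpaired bases per strand
            if left == 0 and right == 0:
                kind = "stack"
            elif left == 0 or right == 0:
                kind = "bulge"
            else:
                kind = "internal"
        loops.append((i, j, kind))
    return loops


print(classify_loops("((.((...))))"))
```

For the example string, the outermost pair closes a stacked pair, the second pair closes a bulge loop of one nucleotide, and the innermost pair closes a hairpin loop of three nucleotides.
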
2.6 Tertiary structure interactions 
Tertiary interactions can be defined in terms of chord crossing: the 
RNA sequence is drawn along a circle in the plane, and interactions between 
bases are depicted as straight lines (chords) connecting the bases. A 
secondary structure can be represented without any chords crossing; tertiary 
interactions occur when chords do cross. In some cases, however, it is 
difficult to discern which interaction is secondary and which is tertiary. 
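The chord-crossing criterion translates directly into a test on sequence positions: two chords i-j and k-l (with i < j and k < l) cross exactly when one endpoint of the second chord lies between the endpoints of the first and the other lies outside, that is when i < k < j < l or k < i < l < j. A minimal Python sketch, with an illustrative function name and base pairs given as tuples of 0-based positions, could look as follows.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return every couple of base pairs whose chords cross."""
    normalized = [tuple(sorted(pair)) for pair in pairs]
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        if i < k < j < l or k < i < l < j:
            crossings.append(((i, j), (k, l)))
    return crossings


nested = [(0, 11), (1, 10), (2, 9)]            # plain hairpin stem
knotted = [(0, 8), (1, 7), (4, 12), (5, 11)]   # loop bases pair outside the hairpin
print(crossing_pairs(nested))
print(crossing_pairs(knotted))
```

The nested stem yields no crossings, whereas in the second example each pair of the first helix crosses each pair of the second. The test itself does not say which of two crossing helices should be regarded as the tertiary one, in line with the ambiguity noted above.
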
2.6.1 Tertiary base pairing 
There are several examples of RNA molecules containing tertiary 
contacts between nucleotides in loop regions of the secondary structure, 
called loop-loop interactions. These interactions can show conformations 
uncommon in secondary structure, and even a parallel strand pair has been 
identified. 
A typical tertiary interaction is the pseudoknot. It is formed by the 
interaction of bases in a hairpin loop with bases just outside this hairpin 
structure. Pseudoknots have been found in an increasing number of 
biological systems [20]. 
2.6.2 Other tertiary interactions 
Intercalation of an individual base of one strand between two bases of an 
adjacent strand has been demonstrated in tRNA. Base triples also occur in 
tRNA. The third base of a base triple may bind to a Watson-Crick pair in 
either the major or the minor groove. It is stabilized by hydrogen bonding 
and stacking. Helix-helix interactions have also been observed in the crystal 
structure of some RNA duplexes [13]. 
  
2.7 Three-dimensional structure of RNA 
In solution, noncoding RNA molecules have a well-defined 
three-dimensional structure that is critical for their physiological function. 
The general architecture resembles that of proteins, with structured, rigid 
domains connected by less structured and more flexible stretches [21]. 
 
 Fig. 2.2 The crystal structure of yeast phenylalanine tRNA at 1.93 Å resolution 
(PDB entry 1EHZ) [22]. 
Fig. 2.2 provides an example of an RNA three-dimensional structure: 
yeast phenylalanine tRNA. Bases form both Watson-Crick pairs and 
non-Watson-Crick pairs, which stack to form stems. In tRNA, four stems 
stack pairwise to make the two arms of the L-shaped molecule. Larger 
RNAs, e.g. ribosomal RNA, can have much more complex structures. 
  
23S ribosomal RNA, like other RNAs known as ribozymes 
[23], has an active catalytic function. The catalytic activity of RNA has also 
encouraged the proposal that RNA predates DNA and proteins as the 
information storage and catalytic component of life. In this view, the 
presence of catalytically active RNA in the ribosome is readily explained by 
its ancient and crucial function. The first molecules performing the 
functions now carried out by the ribosome could have been RNAs alone, 
which acquired proteins during their evolution. The function of the 
acquired proteins might then be to tune the RNA structure and functions, or 
even to take over some of the functions [7].