PART 1 
 
GENERAL BACKGROUND 
 
In the following sections we'll describe the problems and the architecture to be simulated, the simulation environment, the performance metrics, and the applications whose behaviour we want to measure.
 
Cache Memory and Multiprocessor Systems
 
THE CACHE MEMORY: WHY, WHAT AND HOW 
 
Over the last decades, the trend in processor technologies has not been adequately supported by 
advances in memory industries: clearly there is a processor-memory performance gap that computer 
architects must try to close. This, plus the guideline that smaller hardware is more expensive but 
faster, led to a hierarchy based on memories of different speeds and sizes. The goal is to provide a
memory system with cost almost as low as the cheapest level of memory and speed almost as fast as 
the fastest level. Cache is the name generally given to the first level of the memory hierarchy 
encountered once the address leaves the CPU. 
 
The principle of locality is the second consideration behind the idea of caches: researchers discovered that nearly all code is extremely repetitive in nature. If anything repetitive can be stored in a small, very high-speed memory, then the effect of wait states can be limited to the less repetitive portions of the program, which can be stored in slower, less expensive memory. We distinguish between spatial and temporal locality. The first regards the phenomenon by which most computer code executes out of a small area repetitively; this space is not necessarily a single address range of main memory but may be spread around quite significantly. Temporal locality refers to the fact that instructions execute in close sequence with each other rather than being spread through time: a processor is much more likely to need a memory location which it accessed 10 cycles ago than one which it accessed 10,000 cycles ago (which is only about 0.06 ms on a 166 MHz Pentium!). Without locality in time and in space, caches would not be able to work.
For reasons of cost and speed, all the mechanisms that ensure that the currently useful parts of main memory are copied into the cache memory must be implemented in hardware.
 
As said before, our systems need a deep memory hierarchy; in order to understand the performance differences between the levels, we'll show the memory hierarchy of a common system with the relative sizes and access times. The size of each layer is of course limited by the price per byte.
 
• Registers; a few or a few tens of them; 2 ns
• Cache
♦ On-chip (SRAM); today's microprocessors have between 8 and 64 Kbytes; these caches are quite fast because they need no off-chip access, but they are slower than registers since it is not possible to read and write them simultaneously; in this case the size is also limited by the fact that the cache shares the chip with the processor
♦ Off-chip (economical SRAM); from 256 Kbytes to 16 Mbytes; about 10 ns
• Main Memory (DRAM); between 16 and 512 Mbytes; 70 ns
• Hard Magnetic Disks; from hundreds to thousands of Mbytes; tens of ms
• Magnetic Tape; virtually unlimited size; several seconds
 
The behaviour of the cache memory results from the collaboration of a number of components, illustrated in figure 1 and described here.
 
Figure 1: The components of a cache memory (CPU, cache controller, cache directory, cache data memory, address and data buffers, system bus)
The cache data memory is exactly the small and fast memory we were referring to before; it is used 
to store replicas of instructions and data that could be accessed more slowly from the main memory. 
The number of bytes (or memory words) it can contain is the cache size. 
The cache directory is a list of the main memory addresses of the data stored in corresponding 
locations of the cache data memory.  This means that with each cache location, not only is data 
stored, but so is an address, making the combined directory and data RAMs behave like a single, 
very wide memory. 
The bus buffers are now controlled in such a way that if the cache can supply a copy of the 
requested main memory location (we call that case cache hit), then the main memory is not allowed 
to put its data onto the CPU’s data pins. The opposite of a hit is a cache miss. 
Finally there is the cache controller, whose task is to tie the whole system together, using a certain 
policy. 
When an address is presented to the cache directory, a table is interrogated and, if a matching entry 
is found, another address is output, showing where the location exists in the data memory. In this 
way the CPU is fooled: from its perspective the data always come from the main memory, although 
at some times it arrives faster than other times.  It is important to notice also that the cache must 
intercept all of the CPU’s signals, both inputs and outputs, and determine whether these signals are 
to be routed to/from main memory or whether they should stay local to the cache. In a way, the 
cache is an insulating layer between the CPU and the outside world. 
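
As an illustration of how the directory and the data memory travel together, the following minimal C sketch (with invented sizes and field widths, not taken from any particular design) shows how a single cache entry could pair a tag and its status bits with a copy of one memory block.

#include <stdint.h>

#define BLOCK_SIZE 32    /* bytes per cache block (assumed value)    */
#define NUM_LINES  1024  /* number of entries in the cache (assumed) */

/* One entry of the cache: the directory part (tag plus status bits) and
   the data part (a replica of one main-memory block) sit side by side,
   so that directory and data RAM behave like a single, very wide memory. */
typedef struct {
    uint32_t tag;               /* upper address bits of the cached block       */
    uint8_t  valid;             /* 1 if the entry holds a real block            */
    uint8_t  dirty;             /* 1 if the block was modified (for write back) */
    uint8_t  data[BLOCK_SIZE];  /* copy of the main-memory block                */
} cache_line_t;

/* The cache data memory together with its directory. */
static cache_line_t cache[NUM_LINES];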
 
The goal of the cache is then to shorten the time that the CPU needs to wait to get data from the main memory. In order to have good results, the miss ratio, defined as the fraction of accesses that are not found in the cache, must be as low as possible. Moreover the miss penalty, that is the additional clock cycles needed to service a miss, must also be kept as low as possible. The hit time is the time to hit in the cache.
In particular, the formula to evaluate the cache performance is the following:

Average memory access time = Hit time + Miss rate × Miss penalty
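
As a small worked sketch of this formula, with numbers invented purely for illustration, the following C fragment computes the average memory access time of a hypothetical cache:

#include <stdio.h>

int main(void) {
    /* Hypothetical parameters, expressed in clock cycles. */
    double hit_time     = 1.0;   /* time to hit in the cache              */
    double miss_rate    = 0.05;  /* fraction of accesses that miss        */
    double miss_penalty = 40.0;  /* extra cycles needed to service a miss */

    /* Average memory access time = Hit time + Miss rate x Miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;

    printf("average memory access time: %.2f cycles\n", amat);  /* prints 3.00 */
    return 0;
}

With a 5% miss rate and a 40-cycle miss penalty the average access costs 3 cycles even though a hit takes only 1; this is why both the miss ratio and the miss penalty must be kept low.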
A problem that cache designers have to face is associativity, that is, where to place a given
memory location. If any main memory address has the full freedom to be replicated anywhere 
within the cache, then the cache is said to be fully associative. Instead when a block has only one 
place it can appear in the cache, we say that we have a direct mapped cache. These are the two 
extremes, but the most popular mapping algorithm in today’s cache design is set associative (within 
any set the placement is associative): the address output by the processor is first divided into the block address and the block offset. The block address is further split into the
equivalent of a prefix and a suffix at a location determined by the size and architecture of the cache. 
The lower address bits are called the set bits since they contain the set address. The remaining bits 
are compared against the directory entry for the selected set and are referred to as the tag bits. 
Figure 2 illustrates a possible address partition.  
If there are n blocks in a set, the cache placement is called n-way set associative; then direct 
mapped is simply one-way set associative and a fully associative cache with m blocks could be 
called m-way set associative.

Figure 2: How a memory address is subdivided into the block address (tag and set fields) and the block offset
The block offset field selects the desired data from the block, the set field selects the set, and the tag 
field is compared for a hit. It is common procedure to add a valid bit to the tag in the cache array to 
say whether or not the entry contains a valid address; if the bit is not set, there cannot be a match on 
this address. Moreover, all possible tags are searched in parallel because speed is critical. 
 
If the total cache size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the set field and increasing the size of the tag. The formula relating all these values is the following, while figure 3 shows the difference among the three types of mapping.

2^(set bits) = cache size / (block size × associativity)

In the example in the figure the cache has eight blocks and the memory has sixteen blocks; real caches, of course, contain hundreds of block frames and real memories contain millions of blocks.
Assume the address in question is 12. The three options for caches are shown left to right. In fully 
associative, block 12 from memory can go into any of the eight block frames of the cache. With 
direct mapped, block 12 can only be placed into block frame 4 (12 modulo 8). Set associative, 
which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). 
With two blocks per set, this means block 12 can be placed either in block 0 or block 1 in the cache. 
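
The address partition of figure 2 and the placement computed in this example can be checked with a few lines of C; the geometry below (eight 32-byte blocks, two ways) mirrors the example, but the specific sizes are otherwise arbitrary.

#include <stdio.h>

/* Example geometry, mirroring the figure: eight 32-byte blocks, 2-way. */
#define CACHE_SIZE    (8 * 32)
#define BLOCK_SIZE    32
#define ASSOCIATIVITY 2

#define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)     /* 8 block frames */
#define NUM_SETS   (NUM_BLOCKS / ASSOCIATIVITY)  /* 4 sets         */

int main(void) {
    unsigned block_address = 12;                       /* the block used in figure 3 */
    unsigned byte_address  = block_address * BLOCK_SIZE;

    /* Split the address into block offset, set index and tag. */
    unsigned block_offset = byte_address % BLOCK_SIZE;
    unsigned set          = (byte_address / BLOCK_SIZE) % NUM_SETS;  /* 12 mod 4 = 0 */
    unsigned tag          = (byte_address / BLOCK_SIZE) / NUM_SETS;

    /* In a direct mapped cache the "set" is a single block frame. */
    unsigned frame = block_address % NUM_BLOCKS;                     /* 12 mod 8 = 4 */

    printf("offset=%u set=%u tag=%u direct-mapped frame=%u\n",
           block_offset, set, tag, frame);
    return 0;
}

With these values the set field needs 2 bits, in agreement with the formula above: 2^2 = 256 / (32 × 2).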
 
 
 
 
Figure 3: The differences among the mapping algorithms (fully associative, direct mapped and 2-way set associative placement of memory block 12 in a cache with eight block frames)
Another decision that computer architects have to take in designing a cache memory system regards the replacement algorithm. In fact, every time the CPU can't find what it needs within the cache, it must perform a main memory access: a commonly used location may then be displaced by another commonly used location, causing what is called thrashing. Except for direct mapped caches, where there is no choice, the cache controller must select a block to be replaced with the desired data, and several algorithms have been proposed. The two primary strategies employed are Random and Least Recently Used (LRU): in the latter the block replaced is the one that has been unused for the longest time. Often it is not worth increasing the circuit complexity needed to implement LRU, and Random is chosen; at other times LRU is only approximated.
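
A minimal sketch, assuming a per-line time stamp (real controllers usually settle for cheaper approximations, as noted above), of how a victim could be selected within one set:

#include <stdio.h>
#include <stdint.h>

#define WAYS 4  /* associativity of the example cache (assumed) */

typedef struct {
    uint32_t tag;
    uint8_t  valid;
    uint32_t last_used;  /* pseudo time stamp of the most recent access */
} way_t;

/* Pick the victim within one set: prefer an invalid way, otherwise the
   least recently used one, i.e. the way with the smallest time stamp.  */
static int choose_victim(const way_t set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                                   /* free slot available          */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                                 /* older than current candidate */
    }
    return victim;
}

int main(void) {
    way_t set[WAYS] = { {0x10, 1, 5}, {0x20, 1, 2}, {0x30, 1, 9}, {0x40, 1, 7} };
    printf("victim: way %d\n", choose_victim(set));     /* way 1, the oldest */
    return 0;
}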
 
Usually reads dominate processor cache accesses, but a high-performance design cannot neglect the speed of writes, which normally take longer than reads. Another complexity is that the processor also specifies the size of the write, and only that portion of a block can be changed. In contrast, reads can access more bytes than necessary without harm. The write policies often distinguish cache designs. There are two basic options when writing to the cache:
• Write through: the information is written to both the block in the cache and to the block in the 
lower-level memory. 
• Write back: the information is written only to the block in the cache. The modified cache block 
is written to main memory only when it is replaced. 
To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is 
commonly used. This status bit indicates whether the block is dirty, that is modified, or clean. Both 
the strategies have their advantages, but since with write back some writes don’t go to memory, it 
uses less memory bandwidth and therefore it is the most attractive in multiprocessors. Write 
through, instead, helps in keeping the cache consistent with lower levels of the memory hierarchy. 
Moreover, since the data are not needed on a write, for both policies there are two common options on a write miss. With write allocate the behaviour is similar to a read miss: the block is loaded and then written. The other option (sometimes called no-write allocate) implies that the block is modified in the lower level and not loaded into the cache.
This decision, like the others, is strongly dependent on the applications for which the cache is to be used.
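
As a hedged illustration of how these options combine, the following C sketch handles a write under either policy; the helper functions are invented placeholders that only report what a real cache and memory model would do:

#include <stdbool.h>
#include <stdio.h>

/* Invented placeholder operations: a real simulator would index its cache
   and main-memory models here; these stubs only print what happens.      */
static bool hit_in_cache(unsigned addr) { (void)addr; return false; }
static void write_cache(unsigned addr)  { printf("write cache block of 0x%x\n", addr); }
static void write_memory(unsigned addr) { printf("write memory at 0x%x\n", addr); }
static void load_block(unsigned addr)   { printf("load block of 0x%x into cache\n", addr); }
static void mark_dirty(unsigned addr)   { (void)addr; /* set the dirty bit */ }

static void handle_write(unsigned addr, bool write_through, bool write_allocate) {
    if (hit_in_cache(addr)) {
        write_cache(addr);
        if (write_through) write_memory(addr);  /* propagate immediately            */
        else               mark_dirty(addr);    /* write back later, on replacement */
    } else if (write_allocate) {
        load_block(addr);                       /* behave like a read miss ...      */
        write_cache(addr);                      /* ... then modify the block        */
        if (write_through) write_memory(addr);
        else               mark_dirty(addr);
    } else {
        write_memory(addr);                     /* no-write allocate: bypass the cache */
    }
}

int main(void) {
    handle_write(0x100, false, true);   /* write back with write allocate       */
    handle_write(0x200, true,  false);  /* write through with no-write allocate */
    return 0;
}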
 
Having only one data cache is not a good idea, because the processor also needs instructions. 
Although a single cache could try to supply both, it can be a bottleneck. Separate caches are found 
in most recent processors: one cache is dedicated to instructions and another to data. This fact also 
offers the opportunity of optimising each cache separately: different capacities, block sizes, and 
associativities may lead to better performance. 
 
As shown above, speed is a critical factor in these systems: in particular, a commonly used technique to reduce the miss penalty is to build more cache levels, usually two. By adding another level of
cache between the original cache and the memory, the first-level cache can be small enough to 
match the clock cycle time of the fast CPU, while the second-level cache can be large enough to 
capture many accesses that would go to main memory, thereby lessening the effective miss penalty. 
With the new structure, the performance can be evaluated using the more general formula:

Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)

with

Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)

When planning to introduce the second level we have to be aware that the secondary cache will very probably be largely idle, while a primary cache is almost never idle. But sometimes this is a good price to pay in order to have a more efficient memory system.
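
As a purely illustrative application of these formulas (the numbers are invented, not measured): with Hit time(L1) = 1 cycle, Miss rate(L1) = 4%, Hit time(L2) = 10 cycles, Miss rate(L2) = 20% and Miss penalty(L2) = 100 cycles, we get Miss penalty(L1) = 10 + 0.20 × 100 = 30 cycles and an average memory access time of 1 + 0.04 × 30 = 2.2 cycles, instead of the 1 + 0.04 × 100 = 5 cycles we would pay without the second level.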
The crucial parameters to play with in secondary caches are the size of the blocks and the size of the 
cache itself. These considerations also concern whether all data in the first-level cache are always present in the second-level cache (the multilevel inclusion property). Inclusion is desirable because
consistency can be determined just by checking the second level, but it also involves some 
drawbacks. Multiprocessor systems always use the inclusion property. 
Summarising the second-level cache considerations, the essence of cache design is balancing fast 
hits and few misses: this leads to larger caches with higher associativity and larger blocks. 
 
Another important factor is the number of misses. To analyse it, misses are usually divided into three categories, and reducing one of them often increases another:
• Compulsory Misses: the very first access to a block cannot be in the cache, so the block must be 
brought into the cache. 
• Capacity Misses: if the cache cannot contain all the blocks needed during the execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.
• Conflict Misses: if the block placement strategy is set associative or direct mapped, conflict 
misses (in addition to compulsory and capacity misses) will occur because a block can be 
discarded and later retrieved if too many blocks map to its set. 
 
The goal in cache design is to optimise the performance for a given set of applications: architects pursue it through careful, quantitative analysis, trying to tune the factors listed above in the best fashion and perhaps using other techniques that reference [2] clearly explains.
 
In this section we referred mainly to [2] and to [3], in which the reader can find much more detail and related considerations.
 
 
 
THE CACHE COHERENCE PROBLEM 
 
As seen in the previous section, most modern processors use caches to hide the growing disparity 
between processor and memory speeds. On a uniprocessor, the effectiveness of a cache depends 
primarily on the hit rate, which in turn depends on such factors as cache and working set sizes, the 
amount of temporal and spatial locality in the reference stream, the degree of associativity in the 
cache, and the replacement and write policies. 
One reason for adding a cache to a system has not yet been mentioned: multiprocessors can take advantage of cache memories to reduce the traffic into their main memory and thus improve the effective bus bandwidth.
We will next see that the use of cache memories to reduce a processor's main memory bandwidth is a very important benefit and should become even more important in the near future with the growing acceptance of multiprocessor architectures as a means of dramatically improving system performance. Multiprocessor systems can get into trouble if the utilisation of the shared bus approaches 100%, and may perform worse than systems with fewer processors once that limit is reached.
 
Since microprocessors are likely to remain the dominant uniprocessor technology, the logical way 
to improve performance beyond a single processor is by connecting multiple microprocessors 
together. This is likely to be more cost-effective than designing a custom processor. 
The most prevalent form of parallel architecture is the multiprocessor of small or moderate scale 
that provides a global physical address space and symmetric access to all of main memory from any 
processor, often called a symmetric multiprocessor (SMP) or tightly coupled system. Every 
processor has its own cache, and all the processors and memory modules attach to the same 
interconnect, which is usually a shared bus (Figure 4). Task allocation is performed by software at 
run time, offering complete flexibility in the range of application programs that can be run on the 
machine, while the shared address space programming model is supported directly by hardware. 
SMPs dominate the server market and are becoming more common on the desktop. They are also 
important building blocks for larger-scale systems.

Figure 4: Basic structure of a shared-memory multiprocessor (several processors, each with one or more levels of cache, attached through a shared bus to main memory and the I/O system)
This kind of architecture is used where the required scalability is up to a few dozen processors. To 
support larger processor counts, memory must be distributed among the processors rather than 
centralised. Distributing the memory among the nodes has two major benefits. First, it is a cost-
effective way to scale the memory bandwidth, if most of the accesses are to the local memory in the 
node. Second, it reduces the latency for accesses to the local memory. The key disadvantage is that 
communicating data between processors becomes somewhat more complex and has higher latency. 
We won't go deeper into the discussion of this kind of system, often called distributed shared-memory (DSM) or non-uniform memory access (NUMA); the interested reader is invited to take a look at [2] and [29].
 
The reason why SMP systems need caches is not hard to understand. Each processor has its own cache; the more often the cache is accessed, the less often the processor uses the bus to access main memory. This helps to keep the bus from saturating; that is, utilisation is lowered to the point that the bus remains available in a timely manner to the processors that need it. The amount
of bus bandwidth required by a cached processor is strongly related to the cache’s miss rate, so 
designers of multiple-processor systems focus much more attention on miss rates than would 
designers of single-processor systems. A designer of a single-processor system might hesitate to 
spend extra money to improve the hit ratio of a cache design from 96% to 98%. In a multiple-
processor system, however, the difference between a hit rate of 96% and a hit rate of 98% translates 
to a difference in miss rates of 4% and 2%, or double. In other words, one design would require only half the bus bandwidth of the other, so roughly twice as many processors could be used in the system before the effects of saturation were noticed. Moreover, caches play an essential role in reducing the
average data access time as seen by the processors. 
 
A critical challenge introduced by the addition of caches to multiple-processor systems is cache coherence. The problem arises when copies of the same memory block are present in the caches of one or more processors; if a processor writes to and hence modifies that memory block, then, unless special action is taken, the other processors will continue to access the old, stale copy of the block that is in their caches. The example in table 1 helps to illustrate the problem.
 
Time   Event                    Cache contents   Cache contents   Memory contents
                                for CPU A        for CPU B        for location X
0                                                                 1
1      CPU A reads X            1                                 1
2      CPU B reads X            1                1                1
3      CPU A stores 0 into X    0                1                0

Table 1: Example of a coherency problem
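
The scenario of table 1 can be mimicked in a few lines of C, where each "cache" is just a local copy and the write is not propagated to the other copy (a deliberately simplified sketch, not a model of real hardware):

#include <stdio.h>

int main(void) {
    int memory_X  = 1;         /* location X in main memory                 */

    int cache_A_X = memory_X;  /* time 1: CPU A reads X into its cache      */
    int cache_B_X = memory_X;  /* time 2: CPU B reads X into its cache      */

    cache_A_X = 0;             /* time 3: CPU A stores 0 into X; with a     */
    memory_X  = 0;             /* write-through cache the memory is updated */

    /* CPU B still sees the old value: its cached copy has become stale. */
    printf("A sees %d, B sees %d, memory holds %d\n",
           cache_A_X, cache_B_X, memory_X);
    return 0;
}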
 
Cache coherence problems arise even in uniprocessors when I/O operations occur, because data, using DMA, are moved between memory and peripheral components without involving the processor. Since I/O operations are much less frequent than memory operations, several coarse solutions have
been adopted in uniprocessors. In multiprocessors, reading and writing of shared variables by 
different processors is expected to be a frequent event since it is the way that multiple processes 
belonging to a parallel application communicate with each other.  
Fortunately, the techniques and support used to solve the multiprocessor cache coherence problem 
also solve the I/O coherence problem. Essentially all microprocessors today provide support for 
multiprocessor cache coherence. The operating system itself benefits greatly from transparent, 
hardware-supported coherence of its data structures.  
 
Informally, we could say that a memory system is coherent if any read of a data item returns the 
most recently written value of that data item.  
More formally, a multiprocessor memory system is coherent if the results of any execution of a 
program are such that, for each location, it is possible to construct a hypothetical serial order of all 
operations to the location (i.e. put all reads/writes issued by all processes into a total order) that is 
consistent with the results of the execution and in which 
1. operations issued by any particular process occur in the order in which they were issued to the 
memory system by that process, and 
2. the value returned by each read operation is the value written by the last write to that location in 
the serial order 
Two properties are implicit in the definition of coherence: write propagation means that writes 
become visible to other processes; write serialisation means that all writes to a location (from the 
same or different processes) are seen in the same order by all processes. 
For a more detailed discussion about the coherence issue, we suggest [1]. 
 
Small-scale multiprocessors solve the problem with a hardware solution, by introducing a protocol 
to keep the caches coherent. These protocols are called cache-coherence protocols (CCP) and they work by tracking the sharing state of each data block. There are two classes of protocols in use, which differ in the technique used to track the sharing status:
• Directory based: The sharing status of a block of physical memory is kept in just one location, 
called the directory. 
• Snooping: Every cache that has a copy of the data from a block of physical memory also has a 
copy of the sharing status of the block, and no centralised state is kept. The caches are usually 
on a shared-memory bus, and all cache controllers monitor or snoop on the bus to determine 
whether or not they have a copy of a block that is requested on the bus. 
Snooping protocols became popular with multiprocessors using microprocessors and caches 
attached to a single shared memory because these protocols can use a pre-existing physical 
connection to interrogate the status of the caches. 
Directory-based protocols, which won't be further analysed here, are well described in [2].
There are two ways to maintain the coherence requirement. One method is to ensure that a 
processor has exclusive access to a data item before it writes that item. This style of protocol is 
called a write invalidate protocol because it invalidates other copies on a write. The alternative is to 
update all the cached copies of a data item when that item is written. This type of protocol is called 
a write update or write broadcast protocol. Because bus and memory bandwidth is usually the 
commodity most in demand in a bus-based multiprocessor, invalidation has become the protocol of 
choice for almost all implementations. 
 
The key to implementing an invalidate protocol in a small-scale machine is the use of the bus to 
perform invalidates. To perform an invalidate the processor simply acquires bus access and 
broadcasts the address to be invalidated on the bus. All processors continuously snoop on the bus 
watching the addresses. The processors check whether the address on the bus is in their cache. If so, the corresponding block in the cache is invalidated.
In addition to invalidating outstanding copies of a cache block that is being written into, we also 
need to locate a data item when a cache miss occurs. As said before, since write-back caches 
generate lower requirements for memory bandwidth, they are greatly preferable in a multiprocessor, 
despite the slight increase in complexity.  
For a write-back cache the problem of finding the most recent data value is quite hard, since it can 
be in a cache rather than in memory. Fortunately, write-back caches can use the same snooping 
scheme both for cache misses and for writes. Each processor snoops every address placed on the 
bus. If a processor finds it has a dirty copy of the requested cache block, it provides that cache block 
in response to the read request and causes the main memory access to be aborted. 
 
At this point, what we need to do is introduce an example protocol. We'll discuss the simplest; many others have since been proposed, based on the features of this one but changing some characteristics depending mainly on the applications they were created for.
The protocol is named MSI, after the three states it uses; it is a "write invalidate" protocol with a "write-back after first write" memory-write policy.
A bus based cache coherence protocol is usually implemented by incorporating a finite state 
controller in each node. This controller responds to requests from the processor and from the bus, 
changing the state of the selected cache block, as well as using the bus to access data or to 
invalidate it. 
The finite state machine associated with the protocol is shown in figure 5.

Figure 5: The finite state machine associated with the cache coherence protocol MSI (states M, S and I; transitions labelled with processor action/bus transaction, such as PrRd/BusRd, PrWr/BusRdX, BusRd/Flush, BusRdX/Flush)
Cache block states:
 
1. MODIFIED (M): also called DIRTY, only this cache has a valid copy of the block, and the copy 
in the memory is stale; read/write 
2. SHARED (S): the block is present in an unmodified state in this cache, main memory is up-to-
date, and zero or more other caches may also have an up-to-date (shared) copy; read only 
3. INVALID (I): the cache entry does not contain a valid copy of the block
 
Before a shared or invalid block can be written and placed in the modified state, all the other 
potential copies must be invalidated via a read-exclusive bus transaction. 
 
When the processor tries to read or write a memory block that doesn’t exist in the cache, a block 
currently in the cache will have to be replaced by the newly requested block, and if the existing 
block is in the modified state, its contents will have to be written back to main memory. 
 
Bus transactions: 
 
• Bus Read (BusRd): This transaction is generated by a Processor Read (PrRd) that misses in the
cache, and the processor expects a data response as a result. The cache controller puts the 
address on the bus and asks for a copy that it does not intend to modify. The memory system 
(possibly another cache) supplies the data. 
• Bus Read Exclusive (BusRdX): This transaction is generated by a Processor Write (PrWr) to a
block that is either not in the cache or is in the cache but not in the modified state. The cache 
controller puts the address on the bus and asks for an exclusive copy that it intends to modify. 
The memory system (possibly another cache) supplies the data. All other caches are invalidated. 
Once the cache obtains the exclusive copy, the write can be performed in the cache. This 
transaction serves to order the write as well as cause the invalidations and hence ensure that the 
write becomes visible to others (write propagation). The processor may require an 
acknowledgement as a result of this transaction. 
• Bus Write Back (BusWB): This transaction is generated by a cache controller on a write back; the processor does not know about it and does not expect a response. The cache controller puts the address and the contents of the memory block on the bus. The main memory is updated with the latest contents.
 
The new action needed to support write-back protocols is that, in addition to changing the state of 
cached blocks, a cache controller can intervene in an observed bus transaction and flush the 
contents of the referenced block from its cache onto the bus rather than allowing the memory to 
supply the data. The cache controller can also initiate bus transactions, supply data for write backs, 
or pick up data supplied by the memory system. 
 
State transitions: 
 
• A processor read to a block that is invalid (or not present) causes a BusRd transaction to service the miss.
♦ The newly loaded block is promoted, moved up in the state diagram, from invalid to the shared state in the requesting cache, whether or not any other cache holds a copy.
♦ Any other caches with the block in the shared state observe the BusRd but take no special action, allowing main memory to respond with the data.
♦ If a cache has the block in the modified state (there can only be one) and it observes a BusRd transaction on the bus, then it must get involved in the transaction since the copy in main memory is stale. This cache flushes the data onto the bus, in lieu of memory, and demotes its copy of the block to the shared state. The memory and the requesting cache both pick up the block.
 
• Writing into an invalid block is a write miss, which is serviced by first loading the entire block and then modifying the desired bytes within it. The write miss generates a read-exclusive bus transaction.
♦ The block of data returned by the read exclusive is promoted to the modified state, and the desired bytes are then written into it.
♦ All other cached copies of the block are invalidated, thereby granting the requesting cache exclusive ownership of the block.
♦ If another cache later requests exclusive access, then in response to its BusRdX transaction this block will be invalidated (demoted to the invalid state) after flushing the exclusive copy to the bus.
 
• Writing into a shared block is treated essentially like a write miss, using a read-exclusive bus transaction to acquire exclusive ownership.
♦ The data that comes back in the read exclusive can be ignored in this case, unlike when writing to an invalid or not-present block, since the block is already in the cache. The block in the requesting cache transitions to the modified state.
 
• Writes to a block while it is in the modified state, and reads to a block not in the invalid state, generate no bus transactions.
 
• A replacement of a block from a cache logically demotes the block to invalid (not present) by removing it from the cache. A replacement therefore causes the state machines for two blocks to change state in the cache.
♦ If the block being replaced was in the modified state, the replacement transition from M to I generates a write-back transaction. No special action is taken by the other caches on this transaction.
♦ If the block being replaced was in the shared or invalid state, then the replacement itself does not cause any transaction on the bus.
 
With write-back protocols, a block can be written many times before the memory is actually 
updated. 
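
The transitions just described can be collected into a compact sketch of the per-block finite state controller. The following C fragment is only an illustrative summary of the MSI behaviour presented above (the state and event names follow figure 5; it is not the controller implemented in the simulator):

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } msi_event_t;

/* Return the next state of one cache block; *bus_op receives the bus
   transaction the controller must issue ("-" when none is needed).   */
static msi_state_t msi_next(msi_state_t s, msi_event_t e, const char **bus_op) {
    *bus_op = "-";
    switch (s) {
    case INVALID:
        if (e == PR_RD) { *bus_op = "BusRd";  return SHARED;   }  /* read miss  */
        if (e == PR_WR) { *bus_op = "BusRdX"; return MODIFIED; }  /* write miss */
        return INVALID;                    /* bus traffic for an uncached block       */
    case SHARED:
        if (e == PR_WR) { *bus_op = "BusRdX"; return MODIFIED; }  /* upgrade          */
        if (e == BUS_RDX) return INVALID;  /* another cache wants exclusive ownership */
        return SHARED;                     /* PrRd hit and BusRd need no action       */
    case MODIFIED:
        if (e == BUS_RD)  { *bus_op = "Flush"; return SHARED;  }  /* supply data      */
        if (e == BUS_RDX) { *bus_op = "Flush"; return INVALID; }  /* supply and yield */
        return MODIFIED;                   /* processor hits need no bus action       */
    }
    return s;
}

int main(void) {
    const char *op;
    msi_state_t s = INVALID;
    s = msi_next(s, PR_RD,  &op); printf("PrRd : state %d, bus %s\n", s, op);  /* S, BusRd  */
    s = msi_next(s, PR_WR,  &op); printf("PrWr : state %d, bus %s\n", s, op);  /* M, BusRdX */
    s = msi_next(s, BUS_RD, &op); printf("BusRd: state %d, bus %s\n", s, op);  /* S, Flush  */
    return 0;
}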
 
A read may obtain data not from memory but rather from a writer’s cache, and in fact it may be this 
read rather than a replacement that causes memory to be updated. 
Write hits do not appear on the bus, so the concept of a write being performed with respect to other 
processors is a little different. In fact, to say that a write is being performed means that the write is 
being “made visible”. A write to a shared or invalid block is made visible by the bus read-exclusive 
transaction it triggers. The writer will “observe” the data in its cache after this transaction. The write 
will be made visible to other processors by the invalidations that the read exclusive generates, and 
those processors will experience a cache miss before actually observing the value written. Write hits 
to a modified block are visible to other processors but again are observed by them only after a miss 
through a bus transaction. 
Thus, in the MSI protocol, the write to a nonmodified block is performed or made visible when the 
BusRdX transaction occurs, and the write to a modified block is made visible when the block is updated in the writer's cache.
 
A problem arises with our MSI if we consider a sequential application running on a multiprocessor. 
Such multiprogrammed use in fact constitutes the most common workload on small-scale 
multiprocessors. When the process reads in and modifies a data item, in the MSI protocol two bus 
transactions are generated even though there are never any sharers. The first is a BusRd that gets the memory block in the S state, and the second is a BusRdX that converts the block from the S to the M state. By
adding a state that indicates that the block is the only (exclusive) copy but is not modified and by 
loading the block in this state, we can save the latter transaction since the state indicates that no 
other processor is caching the block.  
The solution to this problem led to the birth of a new protocol, MESI, which in some situations can be optimised in further ways; other protocols have since been created. A full treatment of a number of them can be found in [3].
 
We believe that in these few pages we presented all the background needed to understand the environment in which we'll be working and the problems we have to face. However, much more can be found in the literature: [1], [2] and [3] in particular present very interesting analyses and discussions.
 
 
 
Performance Evaluation Methodology 
 
There are three possible approaches for performance evaluation of computer architecture: 
simulation, building analytical models, or prototype construction. Prototype construction is time-
consuming and expensive, and therefore something that is generally avoided until a late stage of 
product development. Analytical methods have been explored primarily because of their speed, but 
usually they are less accurate than simulations and do not allow for the possibility of trying different 
software scheduling and execution schemes. 
Therefore the preferred method for performance evaluation today is the use of simulators, since 
they are more flexible and cost-efficient than building prototypes and more accurate than other 
known methods. Moreover simulation allows for detailed statistics collection, providing a better 
understanding of the trade-offs involved and facilitating further performance tuning. In order to get an idea of the factors involved in building a multiprocessor system, in our case its cache system, it can be useful to build a less complex simulator that is nevertheless sufficiently detailed to include the important hardware- and software-related issues.
 
The simulation methodologies can be classified into three groups according to how the workload is 
simulated. Program-driven simulators use real programs as their workload, either as unmodified 
binaries or in a format specific to the simulator. Distribution-driven workloads make use of random 
variables of, for instance, memory references, locality, and program behaviour; typically the 
stochastic parameters are based on statistics from a reference system.  
Our work instead is a trace-driven simulator: this category uses traces of input events, such as memory references, captured on an existing system, but does not execute any code itself.
 
A simulator with program-driven load mimics the target system closely. Real application programs 
are executed on top of the simulator. The advantage is flexibility in target system models and 
accurate results for the applications used in the simulations. A very important weakness is that such 
a simulator tends to be computationally expensive and it is not always practical to construct one. 
The construction of such a simulator is often extremely complicated, due to the need to recreate the 
entire execution environment; this is especially true for complex program systems including system 
software. A particular type of simulator using program-driven workload is an execution-driven 
simulator, where the code is executed on a real processor and the memory system is simulated. The 
execution is interrupted at predefined interaction points (such as accesses to shared data or message passing, depending on the target architecture), and the simulator takes over. The advantage of this
approach is speed, but it is considerably less flexible. 
 
A distribution-driven workload is based on statistics on system behaviour. Simulation is fast and it 
is easy to tweak system parameters in order to see what would happen in various application 
scenarios. The difficulty lies in obtaining accurate parameters and in capturing in the model how these parameters interact.
 
A simulator with a trace-driven workload uses as input memory reference traces generated by programs running on an existing system.
A primary advantage of this method, shared with distribution-driven and analytical models, is that no code execution is performed, and thus the complexity of correctly modelling an instruction set, and of all the devices needed to create the execution environment necessary to run real programs, is avoided.
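
A trace-driven simulator of this kind essentially boils down to a loop over trace records. A minimal sketch of such a loop, assuming a simple text trace format of one "R/W address" pair per line (the actual trace format used in this work may differ), is the following:

#include <stdio.h>

/* One memory reference read from the trace. */
typedef struct {
    char          type;  /* 'R' for a read, 'W' for a write */
    unsigned long addr;  /* referenced memory address       */
} trace_ref_t;

/* Placeholder for the cache model: here the references are only counted. */
static unsigned long n_reads, n_writes;

static void simulate_access(const trace_ref_t *r) {
    if (r->type == 'R') n_reads++;
    else                n_writes++;
    /* a real simulator would look the address up in its cache model here */
}

int main(void) {
    trace_ref_t r;
    /* The trace is read from standard input, one "R/W hex-address" pair per line. */
    while (scanf(" %c %lx", &r.type, &r.addr) == 2)
        simulate_access(&r);

    printf("reads: %lu  writes: %lu\n", n_reads, n_writes);
    return 0;
}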
 
The major drawback is that the reliability of the results is heavily dependent on the traces. 
Everything works well as long as the target system is fairly similar to the system the traces were 
captured from. Using traces from a uniprocessor on a multiprocessor system simulation is more 
difficult as, for example, execution order will be different on the target system.  
Furthermore, the traces need to be large to obtain as high a probability of valid results as possible. [25] claims that it is possible to concatenate different traces under certain circumstances. The criterion to be satisfied is that the number of new references is almost constant over time and across traces; this requirement is not often met, and it is then quite difficult to have large enough data to work with.
 
However, trace-driven simulation is the most widely used method to evaluate caches, even if it demands large amounts of storage and computer time. Several techniques have been proposed to improve the simulations: [13] is fully dedicated to this topic, and [1] also dedicates a chapter to it.
 
 
 
 
 
Performance Metrics 
 
The final task of the simulator is to collect statistics about the cache behaviour and to relate them to the parameters characterising the system. In this fashion we can identify the bottlenecks and determine trade-offs between the different configurations and policies. All the parameters we are interested in are non-time related, which means we do not collect timing information; the extension, however, should be rather easy, using analytical models and statistics about the different delays encountered.
 
Our main concern when evaluating performance is the traffic on the bus, which is the factor that most limits an efficient utilisation of multiprocessors.
 
Of course the amount of traffic is primarily determined by the miss ratio. In order to see how the different components behave, we distinguish between reads and writes, and between data and instructions. Moreover, it is useful to understand how helpful the cache hierarchy is: we therefore want to display how many accesses succeed in the first level and how many in the second.
 
We are also interested in how the different policies affect performance. For the most common types of cache coherence protocols we identified four categories of bus accesses: read misses, invalidations (including write misses), flushes and writebacks. The first three mainly concern the coherency traffic, while the last deals with the replacement of blocks.
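
The four categories just listed map naturally onto a small set of counters. The following sketch shows how such statistics could be accumulated and reported; the structure and field names are ours, chosen only for illustration, and the numbers in the example are invented:

#include <stdio.h>

/* Counters for the bus traffic generated by the cache system. */
typedef struct {
    unsigned long read_misses;    /* misses that fetch a block over the bus      */
    unsigned long invalidations;  /* BusRdX transactions, including write misses */
    unsigned long flushes;        /* dirty blocks supplied by a snooping cache   */
    unsigned long writebacks;     /* dirty blocks written back on replacement    */
    unsigned long accesses;       /* total processor references seen             */
} bus_stats_t;

static void report(const bus_stats_t *s) {
    unsigned long bus_ops = s->read_misses + s->invalidations +
                            s->flushes + s->writebacks;
    printf("bus transactions: %lu (%.2f per 100 accesses)\n", bus_ops,
           s->accesses ? 100.0 * bus_ops / s->accesses : 0.0);
    printf("  read misses  : %lu\n", s->read_misses);
    printf("  invalidations: %lu\n", s->invalidations);
    printf("  flushes      : %lu\n", s->flushes);
    printf("  writebacks   : %lu\n", s->writebacks);
}

int main(void) {
    bus_stats_t s = { 1200, 300, 150, 400, 100000 };  /* example values only */
    report(&s);
    return 0;
}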
 
By analysing and comparing the collected data we are able to tune better the parameters determining the configuration of the cache system and the adopted strategies. For example, a high number of writebacks when using LRU means that either the associativity has to be increased or it is not worth introducing the complexity of the replacement algorithm.
Similar conclusions can be drawn by considering the other output values.