DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
UNIVERSITY OF BRITISH COLUMBIA
CPEN 211 Introduction to Microcomputers, Fall 2018
Lab 11: Caches, Performance Counters, and Floating-Point
The hand-in deadline is 9:59 PM the evening before your lab section, the week of Nov 26 to Nov 30.
1 Introduction
The ARM processor in your DE1-SoC has the Cortex-A9 microarchitecture. The specifications for the
Cortex-A9 include an 8-stage pipeline, an L1 instruction cache, a separate L1 data cache, and a unified L2
cache. In this lab we explore factors that impact program performance with a focus on the L1 data cache.
1.1 Caches inside the DE1-SoC
Both L1 caches hold 32KB and are 4-way set associative with 32-byte blocks and pseudo-random replacement.
Using the initialization code provided for this lab, addresses between 0x00000000 and 0x3FFFFFFF
are configured to be cached. Addresses larger than 0x3FFFFFFF are configured to “bypass the cache”,
meaning accesses to these addresses are not cached in L1 or L2 caches.
Why bypass a cache? One reason we want accesses to certain addresses not to be cached is that these
addresses correspond to registers in I/O peripherals. For example, consider what would happen if, when
software on the DE1-SoC reads SW_BASE at address 0xFF200040, the value read from the control
register were allowed to be cached in the L1 data cache: the first LDR instruction to read from address
0xFF200040 would cause a cache block to be allocated in the L1 data cache. This cache block would
contain the value of the switches at the time this first LDR instruction was executed. Now, if the settings of
the switches change after the first LDR executes but that cache block remains in the cache, subsequent LDR
instructions reading from address 0xFF200040 will read the old or “stale” value for the switch settings
stored in the cache. Thus, it will seem to the software like the switches have not changed even though they have.
Without an understanding of caches such behavior would be very surprising and hard to explain.
In addition, the initialization code provided for this lab configures the L1 data cache so that store instructions
(e.g., STR and FSTD) are handled as “write back with write allocate”. By write allocate we mean
that if the cache block accessed by the store was not in the L1 data cache then it will be brought into the
cache, possibly evicting another block. By write-back we mean that if a cache block in the L1 is written to
by a store instruction then only the copy of the block in the L1 is modified; the lower levels of the memory
hierarchy are updated later, when the modified (“dirty”) block is evicted from the L1.
1.2 Performance Counters
How can you increase the performance of a software program? One common approach is to “profile” the
program to identify which lines of code it spends the most time executing. Standard developer tools such as
Visual Studio include such basic profiling capabilities [1]. Using this form of profiling you can identify where
making “algorithmic” changes, such as using a hash table instead of a linked list, is worth the effort.
To obtain the highest performance it is also necessary to know about how a program interacts with the
microarchitecture of the CPU. One of the most important questions is “does the program incur many cache
misses?” The software that runs in datacenters, such as those operated by Google, Facebook, Amazon,
Microsoft and others, typically suffers many cache misses. Google reports that “Half of cycles [in their
datacenters] are spent stalled on caches” [2]. Most modern microprocessors include special counter registers
that can be configured to measure how often events such as cache misses occur. Special profiling tools such
as Intel’s VTune [3] can use these hardware performance counters. Hardware counters can also be used for
runtime optimization of software and to enable the operating system to select lower power modes.
[1] https://msdn.microsoft.com/en-CA/library/ms182372.aspx
[2] Kanev et al., Profiling a warehouse-scale computer, ACM/IEEE Int’l Symp. on Computer Architecture, 2015.
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
The Cortex-A9 processor supports six hardware counters. Each counter can be configured to track one
of 58 different events. In this lab you will use these counters to measure clock cycles, load instructions
executed, and L1 data cache misses (caused either by loads or stores). You will use these three counters
to analyze the performance as you make changes to programs. These performance counters are a standard
feature of the ARMv7 architecture and are implemented as part of “coprocessor 15” (CP15). CP15 also
includes functionality for controlling the caches and virtual memory hardware. For this lab we provide
you ARM assembly code to enable the L1 data cache and L2 unified cache using CP15. (The L2 cache
is accessed when a load or store instruction does not find a matching cache block in the L1 data cache.)
Enabling the data caches on the Cortex-A9 also requires enabling virtual memory. So, the code we provide
for Lab 11 (pagetable.s) also does this for you using a one-level page table with 1MB pages (called
“sections” in ARM terminology). You do not need to know how virtual memory works to complete this lab.
However, for those who are interested, Bonus #1 and Bonus #2 ask you to modify pagetable.s.
In Part 1 of this lab you run an example assembly program that helps illustrate how to access the
performance counters. In Part 2, you write a matrix-multiply function using floating-point instructions and
study its cache behavior using the performance counters. In Part 3, you modify your matrix-multiply to
improve cache performance.
To enable the caches on your DE1-SoC, your assembly needs to call the function CONFIG_VIRTUAL_MEMORY
defined inside pagetable.s. After virtual memory is enabled, the Altera Monitor Program
will not correctly download changes to your code without first power cycling the DE1-SoC. To save
time during debugging (e.g., in Part 2 and 3) enable virtual memory only after you get your code
working. Also, note that resetting the ARM processor through the Altera Monitor Program does not “flush”
the contents of the caches. Thus, you will need to power cycle your DE1-SoC each time you want to
make a new performance measurement.
The ARM coprocessor model was briefly described in Slide Set #13. The Cortex-A9 contains a Performance
Monitor Unit (PMU) inside of Coprocessor 15. While there are 58 different events that can be
tracked on the Cortex-A9, the PMU contains only six performance counters with which to track them. These
are called PMXEVCNTR0 through PMXEVCNTR5, which we will abbreviate to PMN0 through PMN5.
These counters are controlled through several additional registers inside the PMU.
The specific PMU registers you will need to use in this lab are listed in Table 1. Recall that the MCR,
or “move to coprocessor from an ARM register”, instruction moves a value to the coprocessor (i.e., PMU)
from an ARM register (R0-R14). The MRC, or “move to ARM register from a coprocessor” instruction
copies a value from a coprocessor (i.e., PMU) into an ARM register. Certain registers in the PMU are used
to configure the performance counter hardware before using the counters PMN0 through PMN5 to actually
count hardware events. The relationship between the different PMU registers, the hardware events and the
performance counter registers is partly illustrated in Figure 1. The operation of this hardware is described
below. You will measure the three events listed in Table 2. The other 55 possible events can be found in
ARM documents that are available on ARM’s website [4].
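As a preview of the steps below, here is a sketch of configuring and reading one counter in ARM assembly. The register encodings are taken from the ARMv7-A documentation, not from this handout; verify them against Table 1 before use, and note that the PMCR write below clobbers the other PMCR control bits:

```
        @ Sketch: select PMN0, program it to count L1 D-cache refills
        @ (event 0x03), enable counting, then read the count back later.
        MOV   R0, #0
        MCR   p15, 0, R0, c9, c12, 5    @ PMSELR := 0 (select PMN0)
        MOV   R0, #0x03
        MCR   p15, 0, R0, c9, c13, 1    @ PMXEVTYPER := event 0x03
        MOV   R0, #1
        MCR   p15, 0, R0, c9, c12, 1    @ PMCNTENSET: enable PMN0
        MCR   p15, 0, R0, c9, c12, 0    @ PMCR: set E bit, start counters
        @ ... code being measured ...
        MRC   p15, 0, R1, c9, c13, 2    @ R1 := PMXEVCNTR (PMN0's count)
```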
To use one of the performance counters you need to complete the following steps:
1. Select counter PMNx by putting the value x in a regular register (e.g., R0-R12) and then executing
the ARM code shown in Table 1 for “Set PMSELR” (replacing