DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
UNIVERSITY OF BRITISH COLUMBIA
CPEN 211 Introduction to Microcomputers, Fall 2018
Lab 11: Caches, Performance Counters, and Floating-Point
The hand-in deadline is 9:59 PM the evening before your lab section, the week of Nov 26 to Nov 30.
1 Introduction
The ARM processor in your DE1-SoC has the Cortex-A9 microarchitecture. The specifications for the
Cortex-A9 include an 8-stage pipeline, an L1 instruction cache, a separate L1 data cache, and a unified L2
cache. In this lab we explore factors that impact program performance with a focus on the L1 data cache.
1.1 Caches inside the DE1-SoC
Both L1 caches hold 32KB and are 4-way set associative with 32-byte blocks and pseudo-random replacement.
Using the initialization code provided for this lab, addresses between 0x00000000 and 0x3FFFFFFF
are configured to be cached. Addresses larger than 0x3FFFFFFF are configured to “bypass the cache”,
meaning accesses to these addresses are not cached in L1 or L2 caches.
Why bypass a cache? One reason we want accesses to certain addresses not to be cached is that these
addresses correspond to registers in I/O peripherals. For example, consider what would happen if, when
software on the DE1-SoC reads SW_BASE at address 0xFF200040, the value read from the control
register were allowed to be cached in the L1 data cache: the first LDR instruction to read from address
0xFF200040 would cause a cache block to be allocated in the L1 data cache. This cache block would
contain the value of the switches at the time this first LDR instruction was executed. Now, if the settings of
the switches change after the first LDR executes but that cache block remains in the cache, subsequent LDR
instructions reading from address 0xFF200040 will read the old or “stale” value for the switch settings
stored in the cache. Thus, it will seem to the software like the switches have not changed even though they have.
Without an understanding of caches such behavior would be very surprising and hard to explain.
In addition, the initialization code provided for this lab configures the L1 data cache so that store instructions
(e.g., STR and FSTD) are handled as “write back with write allocate”. By write allocate we mean
that if the cache block accessed by the store was not in the L1 data cache then it will be brought into the
cache, possibly evicting another block. By write-back we mean that if a cache block in the L1 is written to
by a store instruction then only the copy of the block in the L1 is modified; the lower levels of the memory
hierarchy are updated later, when the modified (“dirty”) block is evicted from the L1.
1.2 Performance Counters
How can you increase the performance of a software program? One common approach is to “profile” the
program to identify which lines of code it spends the most time executing. Standard developer tools such as
Visual Studio include such basic profiling capabilities [1]. Using this form of profiling you can identify where
making “algorithmic” changes, such as using a hash table instead of a linked list, is worth the effort.
To obtain the highest performance it is also necessary to know about how a program interacts with the
microarchitecture of the CPU. One of the most important questions is “does the program incur many cache
misses?” The software that runs in datacenters, such as those operated by Google, Facebook, Amazon,
Microsoft and others, typically suffers many cache misses. Google reports that “Half of cycles [in their
datacenters] are spent stalled on caches” [2]. Most modern microprocessors include special counter registers
that can be configured to measure how often events such as cache misses occur. Special profiling tools such
as Intel’s VTune [3] can use these hardware performance counters. Hardware counters can also be used for
runtime optimization of software and to enable the operating system to select lower power modes.
[1] https://msdn.microsoft.com/en-CA/library/ms182372.aspx
[2] Kanev et al., Profiling a warehouse-scale computer, ACM/IEEE Int’l Symp. on Computer Architecture, 2015.
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
The Cortex-A9 processor supports six hardware counters. Each counter can be configured to track one
of 58 different events. In this lab you will use these counters to measure clock cycles, load instructions
executed, and L1 data cache misses (caused either by loads or stores). You will use these three counters
to analyze the performance as you make changes to programs. These performance counters are a standard
feature of the ARMv7 architecture and are implemented as part of “coprocessor 15” (CP15). CP15 also
includes functionality for controlling the caches and virtual memory hardware. For this lab we provide
you ARM assembly code to enable the L1 data cache and L2 unified cache using CP15. (The L2 cache
is accessed when a load or store instruction does not find a matching cache block in the L1 data cache.)
Enabling the data caches on the Cortex-A9 also requires enabling virtual memory. So, the code we provide
for Lab 11 (pagetable.s) also does this for you using a one-level page table with 1MB pages (called
“sections” in ARM terminology). You do not need to know how virtual memory works to complete this lab.
However, for those who are interested, Bonus #1 and Bonus #2 ask you to modify pagetable.s.
In Part 1 of this lab you run an example assembly program that helps illustrate how to access the
performance counters. In Part 2, you write a matrix-multiply function using floating-point instructions and
study its cache behavior using the performance counters. In Part 3, you modify your matrix-multiply to
improve cache performance.
To enable the caches on your DE1-SoC, your assembly needs to call the function CONFIG_VIRTUAL_MEMORY
defined inside pagetable.s. After virtual memory is enabled, the Altera Monitor Program
will not correctly download changes to your code without first power cycling the DE1-SoC. To save
time during debugging (e.g., in Part 2 and 3) enable virtual memory only after you get your code
working. Also, note that resetting the ARM processor through the Altera Monitor Program does not “flush”
the contents of the caches. Thus, you will need to power cycle your DE1-SoC each time you want to
make a new performance measurement.
The ARM coprocessor model was briefly described in Slide Set #13. The Cortex-A9 contains a Performance
Monitor Unit (PMU) inside of Coprocessor 15. While there are 58 different events that can be
tracked on the Cortex-A9, the PMU contains only six performance counters with which to track them. These
are called PMXEVCNTR0 through PMXEVCNTR5, which we will abbreviate to PMN0 through PMN5.
These counters are controlled through several additional registers inside the PMU.
The specific PMU registers you will need to use in this lab are listed in Table 1. Recall that the MCR,
or “move to coprocessor from an ARM register”, instruction moves a value to the coprocessor (i.e., PMU)
from an ARM register (R0-R14). The MRC, or “move to ARM register from a coprocessor” instruction
copies a value from a coprocessor (i.e., PMU) into an ARM register. Certain registers in the PMU are used
to configure the performance counter hardware before using the counters PMN0 through PMN5 to actually
count hardware events. The relationship between the different PMU registers, the hardware events and the
performance counter registers is partly illustrated in Figure 1. The operation of this hardware is described
below. You will measure the three events listed in Table 2. The other 55 possible events can be found in
ARM documents that are available on ARM’s website [4].
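As a preview of the steps below, here is a sketch of configuring and reading one counter in ARM assembly. The register encodings are taken from the ARMv7-A documentation, not from this handout; verify them against Table 1 before use, and note that the PMCR write below clobbers the other PMCR control bits:

```
        @ Sketch: select PMN0, program it to count L1 D-cache refills
        @ (event 0x03), enable counting, then read the count back later.
        MOV   R0, #0
        MCR   p15, 0, R0, c9, c12, 5    @ PMSELR := 0 (select PMN0)
        MOV   R0, #0x03
        MCR   p15, 0, R0, c9, c13, 1    @ PMXEVTYPER := event 0x03
        MOV   R0, #1
        MCR   p15, 0, R0, c9, c12, 1    @ PMCNTENSET: enable PMN0
        MCR   p15, 0, R0, c9, c12, 0    @ PMCR: set E bit, start counters
        @ ... code being measured ...
        MRC   p15, 0, R1, c9, c13, 2    @ R1 := PMXEVCNTR (PMN0's count)
```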
To use one of the performance counters you need to complete the following steps:
1. Select counter PMNx by putting the value x in a regular register (e.g., R0-R12) and then executing
the ARM code shown in Table 1 for “Set PMSELR” (replacing