MIPS Assembly
CISC vs. RISC CPU Architecture
The x86 family of CPUs (including the x86-64 we’ve been using) are considered Complex Instruction Set Computer architectures – CISC. The CISC-style of instruction sets originated with early mainframe computers which were primarily programmed in assembly, rather than in some high-level language. Thus, the designers of those CPUs had the goal of making programming for them easier for programmers, so they approached the problem of designing the instruction set in the same way one would design a full programming language. I.e., include things to make the programmer’s job easier.
Hence, in x86 we have a wide variety of instructions; most instructions can
access memory; in x86-64, memory operands can use any combination of registers
as base/index. As we’ve seen with string instructions and their rep
prefixes,
some instructions can even perform their own internal looping.
Early mainframe instruction sets went even further, with specialized instructions for performing matrix math, financial calculations, etc., all available for the assembly language programmer to use.
While this rich instruction set makes the assembly language programmer’s job
easier, it makes the CPU designer’s job much more difficult. Different
instructions have different complexities, which means that different
instructions take different cycle counts to run. (The speed of a CPU, in
Ghz, is the number of cycles per second, but different instructions take
different numbers of cycles.) See Agner Fog’s instruction tables for x86 which list cycle counts for all the different x86
instructions. The latency ranges from 1 for simple mov
instructions to upwards
100 for complex floating-point instructions. Varying instruction times in turn
makes pipelining difficult. And, of course, all those fancy instructions
require more circuits on the CPU die, which requires more power and generates
more heat.
The alternative is the Reduced Instruction Set Computer (RISC) architecture. Somewhat confusing is that RISC is also the name of a company, which makes RISC CPUs. Here we use RISC as a general term for any CPU which is designed around a smaller, more orthogonal core of simple instructions. Having only simple instructions means that most (all?) instructions can have similar cycle counts, which makes pipelining easier. Smaller instructions can be implemented in less silicon, which leads to lower power draw and less heat production.
Of course, the downside to RISC is that the programmer must now use 10 simple instructions where on CISC one complex instruction would do the job. For example, consider the problem of incrementing a value in memory. On x86, this is just
inc dword [addr]
However, on MIPS, this becomes something like
lod r1, dword [addr]
add r1, r1, 1
sto dword [addr], r1
The hope of RISC is that the simpler instructions can be executed faster, so that the increase in instruction count will be outweighed by the efficiency gains of the simpler instructions.
Examples of RISC architectures include SPARC, PowerPC from IBM, ARM, Alpha, and MIPS which we’ll look at here. Common features of RISC architectures are lots of registers, to make up for limited instructions which access memory.
MIPS Assembly
We’re specifically looking at MIPS32, the 32-bit variant of MIPS. We’ve already seen a bit of this in the context of the MIPS pipeline; the design of the pipeline (with its single MEM-ory access stage) influences the design of the instruction set: there are only two instructions capable of accessing memory, and they don’t do anything else.
Registers
MIPS32 has 32 general purpose registers; all are 32-bits (dwords) in size.
In examples I’ve referred to these
just by their numbers: r1, r2, ...
but in proper MIPS assembly
they have names:
Register | Number | Usage |
$zero | $0 | Always = 0 |
$at | $1 | Assembler temporary |
$v0,$v1 | $2,$3 | Function return; expression evaluation |
$a0-$a3 | $4-$7 | Function arguments |
$t0-$t7 | $8-$15 | Temporaries |
$t8,$t9 | $25,$25 | Temporaries |
$s0-$s7 | $16-$23 | Saved temporaries |
$k0,$k1 | $26,$27 | Reserved for OS |
$gp | $28 | Global pointer |
$sp | $29 | Stack pointer |
$fp | $30 | Frame pointer |
$ra | $31 | Return address |
You can refer to a register either by its “name” or its numeric form in assembly language.
Of these, the saved temporaries, $gp, $sp
, and $fp
are callee-preserved;
all the rest are caller-preserved.
Some registers are special:
$zero
is always equal to 0. If you read from it, you get 0. You can write to it, but doing so does nothing (i.e., it’s value does not change).$at
is used by pseudo instructions. Because MIPS is such a limited instruction set, many “common” instructions are actually implemented as assembler macros. If a macro needs a temporary (‘scratch’) register, it uses$at
.$sp
is the MIPS equivalent ofrsp
on x86-64; it’s the stack pointer. As on x86, the stack on MIPS grows down.$fp
is the MIPS equivalent ofrbp
, it is the stack frame pointer. On MIPS it contains the stack pointer of the calling function. On MIPS, functions typically setup the stack pointer once at the beginning of the function (allocating enough stack space for local variables and any registers that need to be saved due to function calls), rather than pushing/popping throughout the function as is more common on x86.$gp
is setup to point to an area of memory reserved for global variables. This makes accessing global variables faster.
We’ll talk about function calling convention on MIPS later.
Structure of a MIPS program
As on x86, a MIPS program is broken into sections, with the two most
important sections being the text section (containing code) and the
data section (containing … you know). These are introduced in a MIPS
assembly file with the .text
and .data
directives (all MIPS assembly
directives start with a period). For example, here’s Hello World in MIPS
assembly:
###
### hello.asm
### Hello, world! in MIPS assembly
###
.data
msg: .asciiz "Hello, world!"
.text
li $v0, 4
la $a0, msg
syscall
li $v0, 10
syscall
The MIPS simulator we’re using, MARS, supports a number of useful syscalls, including two that we’re using here:
Syscall code 4: Print nul-terminated string. The address of the string is placed in
$a0
.Syscall code 10: Terminate program
The syscall code itself is always placed in register $v0$
. The arguments to
the syscall are placed in $a0,$a1,$a2
as needed.
The .asciiz
directive stores an ASCII string, adding the terminated nul
character at the end. (The .ascii
directive does the same, but doesn’t add
the terminating nul.)
The instructions we’re using here are
li $v0, 4
: Load Immediate, loads the value 4 into$v0
. This is actually a pseudo instruction; the assembler translates this into the (real) instructionaddi $v0, $zero, 4 # $v0 = 0 + 4
la $a0, msg
: Load Address, loads the addressmsg
into$a0
.syscall
: Executes a syscall.
MIPS Instructions
We’ll try to group instructions and pseduo-instructions together as appropriate.
Note that since all registers are 32-bits, when a smaller value is loaded into
a register it must be either sign-extended or zero-extended. The default is
signed; instructions whose names end with u
do zero-extension.
Memory operands: No specific syntax is used for addresses vs. immediates; the instruction distinguishes between the two. For example,
lb $1, addr # Load byte from `addr` into $1
la $1, addr # Load `addr` as an immediate into $1
As opposed to x86’s combination of displacement, base register, index register, and scale in memory operands, MIPS can only do the addition of an immediate displacement, plus a register. For example, to access the elements of an array, index given by $1:
lb $1, arr($1)
The syntax address($1)
means to compute the address address + $1
and then
access that memory location.
# Immediates
# -----------------------------------------------------------------------------
li r₁, imm # Load immediate (16-bit, signed) (addi r₁, $zero, imm)
# Arithmetic: addition and subtraction
# -----------------------------------------------------------------------------
add r₁, r₂, r₃ # r₁ = r₂ + r₃, signed (overflow is an error)
addu r₁, r₂, r₃ # r₁ = r₂ + r₃, unsigned
addi r₁, r₂, imm # r₁ = r₂ + imm, signed
addiu r₁, r₂, imm # r₁ = r₂ + imm, unsigned
sub r₁, r₂, r₃ # r₁ = r₂ - r₃, signed (overload is an error)
subi r₁, r₂, imm # r₁ = r₂ - imm (signed)
subiu r₁, r₂, imm # r₁ = r₂ - imm (unsigned)
# There are no increment, decrement instructions; use these instead:
add r₁, r₁, 1 # r₁ = r₁ + 1
sub r₁, r₁, 1 # r₁ = r₁ - 1
# Memory access: loading values
# -----------------------------------------------------------------------------
lb r₁, 100(r₂) # Load byte from addr (r₂ + 100) into r₁, sign-extension
lbu r₁, 100(r₂) # Load byte from addr (r₂ + 100) into r₁, zero-extension
lh r₁, 100(r₂) # Load 16-bit from addr (r₂ + 100) into r₁, sign-extension
lhu r₁, 100(r₂) # Load 16-bit from addr (r₂ + 100) into r₁, zero-extension
lw r₁, 100(r₂) # Load 32-bit from addr (r₂ + 100) into r₁
la r₁, label # Load address (internally, this is addiu)
li r₁, imm # Load immediate (internally, this is addi r₁, $zero, imm)
# Memory access: storing values
# -----------------------------------------------------------------------------
sb r₁, addr # Store low byte of r₁ into memory addr.
sw r₁, addr # Store 32-bit value from r₁ into memory addr.
# Store 32-bit value from $N into addr, and then
sd $N, addr # another 32-bit value from $(N+1) into addr+4
# I.e., "store doubleword" (64-bit value)
NOTE: MIPS calls a 32-bit value a “word”, as opposed to a dword as we’d call it.
NOTE: The store instruction (sw
, etc.) have the destination in the second
operand, which is the opposite of every other MIPS instruction (and x86,
where the destination is always first).
Multiplication and division
Multiplication and division on MIPS are interesting because they take more
than one instruction to complete; this is because multiplication/division take
longer than a single EX pipeline stage to complete, so their processing must
be split over two instructions. Two special registers, HI
and LO
are used
just for multiplication and division.
mthi r₁ # Copy r₁ into HI
mtlo r₂ # Copy r₁ into LO
mfhi r₁ # Copy HI into r₁
mflo r₁ # Copy LO into r₁
# Compute r₂ × r₃ (result is 64-bits)
mul r₁, r₂, r₃ # Store low 32-bits into LO and r₁
# Store high 32-bits into HI
# Compute r₂ × imm (result is 64-bits)
mul r₁, r₂, imm # Store low 32-bits into LO and r₁
# Store high 32-bits into HI
# Compute r₂ × r₃, unsigned (result is 64-bits)
mulu r₁, r₂, r₃ # Store low 32-bits into LO and r₁
# Store high 32-bits into HI
# Compute r₂ × imm, unsigned (result is 64-bits)
mulu r₁, r₂, imm # Store low 32-bits into LO and r₁
# Store high 32-bits into HI
# Compute r₁ × r₂ (result is 64-bits, signed)
mult r₁, r₂ # Store low 32-bits into LO
# Store high 32-bits into HI
# Compute r₁ × r₂ (result is 64-bits, unsigned)
multu r₁, r₂ # Store low 32-bits into LO
# Store high 32-bits into HI
# Compute r₁ divided by r₂ (signed)
div r₁, r₂ # Store r₁ / r₂ into LO
# Store r₁ % r₂ into HI
# Compute r₂ divided by r₃ (signed)
div r₁, r₂, r₃ # Store r₂ / r₃ into r₁
# (Remainder is discarded)
# Compute r₂ divided by imm (signed)
div r₁, r₂, imm # Store r₂ / imm into r₁
# (Remainder is discarded)
# Compute r₁ divided by r₂ (unsigned)
divu r₁, r₂ # Store r₁ / r₂ into LO
# Store r₁ % r₂ into HI
# Compute r₂ divided by r₃ (unsigned)
divu r₁, r₂, r₃ # Store r₂ / r₃ into r₁
# (Remainder is discarded)
# Compute r₂ divided by imm (unsigned)
divu r₁, r₂, imm # Store r₂ / imm into r₁
# (Remainder is discarded)