MIPS Assembly

CISC vs. RISC CPU Architecture

The x86 family of CPUs (including the x86-64 we’ve been using) are considered Complex Instruction Set Computer architectures – CISC. The CISC-style of instruction sets originated with early mainframe computers which were primarily programmed in assembly, rather than in some high-level language. Thus, the designers of those CPUs had the goal of making programming for them easier for programmers, so they approached the problem of designing the instruction set in the same way one would design a full programming language. I.e., include things to make the programmer’s job easier.

Hence, in x86 we have a wide variety of instructions; most instructions can access memory; in x86-64, memory operands can use any combination of registers as base/index. As we’ve seen with string instructions and their rep prefixes, some instructions can even perform their own internal looping.

Early mainframe instruction sets went even further, with specialized instructions for performing matrix math, financial calculations, etc., all available for the assembly language programmer to use.

While this rich instruction set makes the assembly language programmer’s job easier, it makes the CPU designer’s job much more difficult. Different instructions have different complexities, which means that different instructions take different cycle counts to run. (The speed of a CPU, in Ghz, is the number of cycles per second, but different instructions take different numbers of cycles.) See Agner Fog’s instruction tables for x86 which list cycle counts for all the different x86 instructions. The latency ranges from 1 for simple mov instructions to upwards 100 for complex floating-point instructions. Varying instruction times in turn makes pipelining difficult. And, of course, all those fancy instructions require more circuits on the CPU die, which requires more power and generates more heat.

The alternative is the Reduced Instruction Set Computer (RISC) architecture. Somewhat confusing is that RISC is also the name of a company, which makes RISC CPUs. Here we use RISC as a general term for any CPU which is designed around a smaller, more orthogonal core of simple instructions. Having only simple instructions means that most (all?) instructions can have similar cycle counts, which makes pipelining easier. Smaller instructions can be implemented in less silicon, which leads to lower power draw and less heat production.

Of course, the downside to RISC is that the programmer must now use 10 simple instructions where on CISC one complex instruction would do the job. For example, consider the problem of incrementing a value in memory. On x86, this is just

inc dword [addr]

However, on MIPS, this becomes something like

lod r1, dword [addr]
add r1, r1, 1
sto dword [addr], r1

The hope of RISC is that the simpler instructions can be executed faster, so that the increase in instruction count will be outweighed by the efficiency gains of the simpler instructions.

Examples of RISC architectures include SPARC, PowerPC from IBM, ARM, Alpha, and MIPS which we’ll look at here. Common features of RISC architectures are lots of registers, to make up for limited instructions which access memory.

MIPS Assembly

We’re specifically looking at MIPS32, the 32-bit variant of MIPS. We’ve already seen a bit of this in the context of the MIPS pipeline; the design of the pipeline (with its single MEM-ory access stage) influences the design of the instruction set: there are only two instructions capable of accessing memory, and they don’t do anything else.

Registers

MIPS32 has 32 general purpose registers; all are 32-bits (dwords) in size. In examples I’ve referred to these just by their numbers: r1, r2, ... but in proper MIPS assembly they have names:

	Register	Number	Usage
$zero	$0	Always = 0
$at	$1	Assembler temporary
$v0,$v1	$2,$3	Function return; expression evaluation
$a0-$a3	$4-$7	Function arguments
$t0-$t7	$8-$15	Temporaries
$t8,$t9	$25,$25	Temporaries
$s0-$s7	$16-$23	Saved temporaries
$k0,$k1	$26,$27	Reserved for OS
$gp	$28	Global pointer
$sp	$29	Stack pointer
$fp	$30	Frame pointer
$ra	$31	Return address

You can refer to a register either by its “name” or its numeric form in assembly language.

Of these, the saved temporaries, $gp, $sp, and $fp are callee-preserved; all the rest are caller-preserved.

Some registers are special:

$zero is always equal to 0. If you read from it, you get 0. You can write to it, but doing so does nothing (i.e., it’s value does not change).
$at is used by pseudo instructions. Because MIPS is such a limited instruction set, many “common” instructions are actually implemented as assembler macros. If a macro needs a temporary (‘scratch’) register, it uses $at.
$sp is the MIPS equivalent of rsp on x86-64; it’s the stack pointer. As on x86, the stack on MIPS grows down.
$fp is the MIPS equivalent of rbp, it is the stack frame pointer. On MIPS it contains the stack pointer of the calling function. On MIPS, functions typically setup the stack pointer once at the beginning of the function (allocating enough stack space for local variables and any registers that need to be saved due to function calls), rather than pushing/popping throughout the function as is more common on x86.
$gp is setup to point to an area of memory reserved for global variables. This makes accessing global variables faster.

We’ll talk about function calling convention on MIPS later.

Structure of a MIPS program

As on x86, a MIPS program is broken into sections, with the two most important sections being the text section (containing code) and the data section (containing … you know). These are introduced in a MIPS assembly file with the .text and .data directives (all MIPS assembly directives start with a period). For example, here’s Hello World in MIPS assembly:

###
### hello.asm
### Hello, world! in MIPS assembly
###

.data

msg:    .asciiz "Hello, world!"

.text

li $v0, 4
la $a0, msg
syscall

li $v0, 10
syscall

The MIPS simulator we’re using, MARS, supports a number of useful syscalls, including two that we’re using here:

Syscall code 4: Print nul-terminated string. The address of the string is placed in $a0.
Syscall code 10: Terminate program

The syscall code itself is always placed in register $v0$ . The arguments to the syscall are placed in $a0,$a1,$a2 as needed.

The .asciiz directive stores an ASCII string, adding the terminated nul character at the end. (The .ascii directive does the same, but doesn’t add the terminating nul.)

The instructions we’re using here are

li $v0, 4: Load Immediate, loads the value 4 into $v0. This is actually a pseudo instruction; the assembler translates this into the (real) instruction
```
 addi $v0, $zero, 4     # $v0 = 0 + 4
```
la $a0, msg: Load Address, loads the address msg into $a0.
syscall: Executes a syscall.

MIPS Instructions

We’ll try to group instructions and pseduo-instructions together as appropriate.

Note that since all registers are 32-bits, when a smaller value is loaded into a register it must be either sign-extended or zero-extended. The default is signed; instructions whose names end with u do zero-extension.

Memory operands: No specific syntax is used for addresses vs. immediates; the instruction distinguishes between the two. For example,

lb $1, addr         # Load byte from `addr` into $1
la $1, addr         # Load `addr` as an immediate into $1

As opposed to x86’s combination of displacement, base register, index register, and scale in memory operands, MIPS can only do the addition of an immediate displacement, plus a register. For example, to access the elements of an array, index given by $1:

lb $1, arr($1)

The syntax address($1) means to compute the address address + $1 and then access that memory location.

# Immediates
# -----------------------------------------------------------------------------
li    r₁, imm         # Load immediate (16-bit, signed) (addi r₁, $zero, imm)       

# Arithmetic: addition and subtraction
# -----------------------------------------------------------------------------
add   r₁, r₂, r₃      # r₁ = r₂ + r₃, signed (overflow is an error)
addu  r₁, r₂, r₃      # r₁ = r₂ + r₃, unsigned

addi  r₁, r₂, imm     # r₁ = r₂ + imm, signed
addiu r₁, r₂, imm     # r₁ = r₂ + imm, unsigned

sub   r₁, r₂, r₃      # r₁ = r₂ - r₃, signed (overload is an error)
subi  r₁, r₂, imm     # r₁ = r₂ - imm (signed)
subiu r₁, r₂, imm     # r₁ = r₂ - imm (unsigned)

# There are no increment, decrement instructions; use these instead:
add   r₁, r₁, 1       # r₁ = r₁ + 1
sub   r₁, r₁, 1       # r₁ = r₁ - 1

# Memory access: loading values
# -----------------------------------------------------------------------------
lb    r₁, 100(r₂)     # Load byte from addr (r₂ + 100) into r₁, sign-extension
lbu   r₁, 100(r₂)     # Load byte from addr (r₂ + 100) into r₁, zero-extension
lh    r₁, 100(r₂)     # Load 16-bit from addr (r₂ + 100) into r₁, sign-extension
lhu   r₁, 100(r₂)     # Load 16-bit from addr (r₂ + 100) into r₁, zero-extension
lw    r₁, 100(r₂)     # Load 32-bit from addr (r₂ + 100) into r₁

la    r₁, label       # Load address (internally, this is addiu)
li    r₁, imm         # Load immediate (internally, this is addi r₁, $zero, imm)

# Memory access: storing values
# -----------------------------------------------------------------------------

sb    r₁, addr        # Store low byte of r₁ into memory addr.
sw    r₁, addr        # Store 32-bit value from r₁ into memory addr.

                      # Store 32-bit value from $N into addr, and then
sd    $N, addr        # another 32-bit value from $(N+1) into addr+4
                      # I.e., "store doubleword" (64-bit value)

NOTE: MIPS calls a 32-bit value a “word”, as opposed to a dword as we’d call it.

NOTE: The store instruction (sw, etc.) have the destination in the second operand, which is the opposite of every other MIPS instruction (and x86, where the destination is always first).

Multiplication and division

Multiplication and division on MIPS are interesting because they take more than one instruction to complete; this is because multiplication/division take longer than a single EX pipeline stage to complete, so their processing must be split over two instructions. Two special registers, HI and LO are used just for multiplication and division.

mthi r₁             # Copy r₁ into HI
mtlo r₂             # Copy r₁ into LO
mfhi r₁             # Copy HI into r₁
mflo r₁             # Copy LO into r₁

                    # Compute r₂ × r₃ (result is 64-bits)
mul  r₁, r₂, r₃     # Store low 32-bits into LO and r₁
                    # Store high 32-bits into HI 

                    # Compute r₂ × imm (result is 64-bits)
mul  r₁, r₂, imm    # Store low 32-bits into LO and r₁
                    # Store high 32-bits into HI 

                    # Compute r₂ × r₃, unsigned (result is 64-bits)
mulu  r₁, r₂, r₃    # Store low 32-bits into LO and r₁
                    # Store high 32-bits into HI 

                    # Compute r₂ × imm, unsigned (result is 64-bits)
mulu  r₁, r₂, imm   # Store low 32-bits into LO and r₁
                    # Store high 32-bits into HI 

                    # Compute r₁ × r₂ (result is 64-bits, signed)
mult r₁, r₂         # Store low 32-bits into LO
                    # Store high 32-bits into HI

                    # Compute r₁ × r₂ (result is 64-bits, unsigned)
multu r₁, r₂        # Store low 32-bits into LO
                    # Store high 32-bits into HI   

                    # Compute r₁ divided by r₂ (signed)
div r₁, r₂          # Store r₁ / r₂ into LO
                    # Store r₁ % r₂ into HI   

                    # Compute r₂ divided by r₃ (signed)
div r₁, r₂, r₃      # Store r₂ / r₃ into r₁
                    # (Remainder is discarded)   

                    # Compute r₂ divided by imm (signed)
div r₁, r₂, imm     # Store r₂ / imm into r₁
                    # (Remainder is discarded)   

                    # Compute r₁ divided by r₂ (unsigned)
divu r₁, r₂         # Store r₁ / r₂ into LO
                    # Store r₁ % r₂ into HI  

                    # Compute r₂ divided by r₃ (unsigned)
divu r₁, r₂, r₃     # Store r₂ / r₃ into r₁
                    # (Remainder is discarded)   

                    # Compute r₂ divided by imm (unsigned)
divu r₁, r₂, imm    # Store r₂ / imm into r₁
                    # (Remainder is discarded)