Function calling conventions
To review, when we call a function, we have to choose some registers to use for arguments, at least one to use for return value, some to be caller-saved (available for temporary use by the function) and some to be callee-saved. Our choices for these were selected so as to align with the standard Unix C ABI calling convention, so with a bit more work, our functions will be compatible with the C standard library.
Register | Use |
---|---|
rax |
Return value |
rbx |
Callee-preserved |
rcx |
4th argument |
rdx |
3rd argument |
rsi |
2nd argument |
rdi |
1st argument |
rbp |
Callee-preserved |
rsp |
Stack pointer |
r8 |
5th argument |
r9 |
6th argument |
r10 |
Temporary (caller-preserved) |
r11 |
Temporary (caller-preserved) |
r12 -r15 |
Callee-preserved |
Note that all argument and return value registers count as caller-preserved.
I.e., if your function is using rcx
and you want to call a function, you
must preserve rcx
even if the function doesn’t use it as an argument.
The general process for calling a function is thus
...
push r10 ; Push any caller-saved registers in use
call func
pop r10 ; Restore after return
Similarly, within a function, we have a preamble which saves any callee-saved registers, and a prologue which restores them:
func:
push r12 ; Push any callee-saved registers in use
...
pop r12 ; Restore them before return
ret
As we will see, making our functions C-compatible will require a few changes to the function preamble/prologue to setup the stack correctly.
Stack operations
To manually add and remove elements to/from the stack, we use the push
and
pop
operations:
push op
– Pushes the operandop
onto the stack. I.e., decrementsrsp
by the size ofop
and then performsmov [rsp] op
.op
cannot be a 64-bit immediate, but it can be a smaller immediate (8, 16, or 32-bits). Note that you cannot push a byte-sized register or memory operand, but you can push a byte-sized immediate.pop op
– Pops the top value of the stack intoop
. I.e., performsmov op [rsp]
and then incrementsrsp
by the size ofop
.op
cannot be an immediate (obviously), and must be a word or larger.
Note that in x86-64, the stack starts at the “end” of memory and grows backwards,
so that pushing decrements rsp
and popping increments
it. Thus, to look
“down” the stack we will access rsp +
some distance.
Note that you are free to modify the stack without using these instructions. For example, if you wish to “pop” the top 8 bytes off the stack but don’t care about their values, you can simply do
add rsp, 8
to move rsp
up by 8.
The fact that the stack starts at the end of a process’s address space
and grows backwards means that if you look at rsp
at any point, it’s probably
huge (on the order of terabytes). This doesn’t mean that your process can actually
use that much memory; as we’ll see later, the system has a way of not allocating
parts of memory which are unused.
Sorting in Assembly
To review we’re going to develop a couple of 133 algorithms into their assembly equivalents. Here is the preamble we’re working with:
section .data
; Load 1M of random data from random.dat
data: incbin "random.dat"
DATLEN: equ $-data
section .text
global _start
_start:
...
(Rather than going to the trouble of using the open
/read
syscalls, we
just compile the random data directly into the executable!)
Checking for sorted-ness
We’re going to implement a sorting algorithm, but we need a way to check and
see if it is working correctly. So we’re going to write a function
is_sorted
which will set rax
to 1 if the input array is sorted, and 0
if it is not. It will receive the address of the array as rdi
and the
length as rsi
. We will consider the array to consist of signed qwords (8 bytes
per array entry).
In C/C++, we could check such an array with something like
bool is_sorted(long* arr, size_t length) {
size_t i = 0;
while(i < length-1) {
if(arr[i] > arr[i+1])
return false;
++i;
}
return true;
}
There are three situations where we have non-sequential execution:
If the
while
condition is false, we jump to the finalreturn
.At the end of the loop, we jump (unconditionally) back to the beginning.
If the
if
condition is false, we jump to thei++
at the end of the loop.
In assembly, we have something like this:
is_sorted:
; rdi = address of array
; rsi = length (in bytes)
; Returns: rax = 1 if sorted
sub rsi, 8 ; Last element
mov rax, 0
.while:
cmp rax, rsi
je .done
mov r8, qword [rax + rdi]
mov r9, qword [rax + rdi + 8]
cmp r8, r9
jle .continue
mov rax, 0
ret
.continue:
add rax, 8
jmp .while
.done:
mov rax, 1
ret
We’re using an extended form of the memory operand to simplify the array lookups. In particular, memory operands can consist of a sum of
A constant address
Up to two registers, where one is called the base register and the other the index register and
The index register can be multiplied by 1, 2, 4, or 8.
Since there’s no restriction on which registers can be bases vs. indexes, in practice, one of the registers can be multiplied and the other cannot.
There’s one difference between the assembly and the C/C++ code: the length in assembly is given in the number of bytes, not the number of array elements. It’s fairly common in assembly routines for lengths to be given in bytes, regardless of the actual size of the elements.
Insertion sort
We’re going to implement insertion sort. In C++ insertion sort looks like this:
void insertion_sort(long* arr, unsigned long size) {
unsigned long i = 1, j;
while(i < size) {
j = i;
while(j > 0) {
if(arr[j] >= arr[j-1])
break;
else
std::swap(arr[j], arr[j-1]);
--j;
}
++i;
}
}
(I’ve written this using while
loops so that it will be easier to translate
into assembly; the natural way to write it in C/C++ would be to use for
loops.)
In assembly, a while
loop has the structure:
.while:
cmp ...
jcc .end_while
...
jmp .while
.end_while
In keeping with our calling convention for functions, our function will
receive the address of the array in rdi
, and the size (unsigned) in rsi
.
insertion_sort:
; rdi = addr of array
; rsi = length of array
mov rax, 1 ; rax = outer loop index, unsigned
.while_i:
cmp rax, rsi
je .done
mov rbx, rax ; rbx = inner loop index
.while_j:
cmp rbx, 0
je .done_i
mov r8, qword [rdi + 8*rbx]
mov r9, qword [rdi + 8*rbx - 8]
cmp r8,r9
jge .done_i ; break
; swap
mov qword [rdi + 8*rbx], r9
mov qword [rdi + 8*rbx - 8], r8
dec rbx
jmp .while_j
.done_i:
inc rax
jmp .while_i
.done
ret
We could simplify the comparison of the two values in the inner loop by using
the string comparison operation cmpsb
which allows a memory-memory comparison
between byte [rsi]
and byte [rdi]
.
Binary search
Implementing binary search allows us to make use of another conditional
operation: conditional moves. Conditional jumps are somewhat expensive,
as the CPU cannot pre-load the instruction after the jump, until it has
processed the conditional jump itself. A conditional move is a mov
with a
condition code attached; the move only happens if the condition is true.
The conditional move instruction is cmov**
where **
is one of the
condition codes.
The restrictions on conditional moves’ operands are
The first operand must be a register
The second can be a register or memory, but not an immediate
Both must be the same size
Byte-sized conditional moves are not allowed.
For reference, the implementation of a binary search in C++ is
unsigned long binary_search(long* arr, unsigned long len, long target) {
unsigned long low = 0, high = len-1, mid;
while(low < high) {
mid = (high - low) / 2 + low;
if(target < arr[mid])
high = mid-1;
else if(target == arr[mid])
return mid;
else
low = mid+1;
}
if(arr[low] == target)
return low;
else
return -1; // 0xffffffff...
}
(Normally we would write the condition as low <= high
and skip the final
if
, however, if low
and high
are unsigned, then if high == 0
and we
do high - 1
, we get high = 0xfffff...
, and so the condition is never
false.)
binary_search:
; rdi = address of array
; rsi = length of array
; dl = target to search for
mov rax, 0 ; rax = low = 0
mov rbx, rsi ; rbx = high = len-1
dec rbx;
.while:
cmp rax, rbx
jae .not_found ; Stop if rax > rbx
; rcx = mid = (high - low)/2 + low
mov rcx, rbx ; rcx = high
sub rcx, rax ; rcx -= low
shr rcx, 1 ; rcx /= 2
add rcx, rax ; rcx += low
mov r14, rcx ; r14 = mid+1
inc r14
mov r15, rcx ; r15 = mid-1
dec r15
cmp rdx, qword [rdi + 8*rcx] ; compare target, arr[mid]
cmovg rax, r14 ; target > arr [mid] ===> low = mid+1
cmovl rbx, r15 ; target < arr [mid] ===> high = mid-1
cmove rax, rcx ; target == arr[mid] ===> rax = mid (index)
je .return ; target == arr[mid] ===> return
jmp .while
.not_found:
cmp qword [rdi + 8*rax], rdx
je .return
mov rax, -1
.return:
ret
The section of code
mov r14, rcx ; r14 = mid+1
inc r14
mov r15, rcx ; r15 = mid-1
dec r15
might seem counterintuitive. How could it possibly be faster to compute
mid-1
and mid+1
, when it’s possible that neither will be needed? Due to
instruction-level parallelism, the processor may actually be able to compute
both of these at the same time, overlapping both with the computation of
of cmp dl, byte [rcx]
. Sometimes, it can actually be faster to give the
compiler more work to do, so long as the work can be parallelized.
The final program, incorporating all these routines, is as follows:
;;;;
;;;; sort_search.s
;;;; Sorting and searching in assembly.
;;;;
section .data
data: incbin "random.dat"
DATLEN: equ ($-data) / 8
sorted: dq 1,9,2,8,3,7,4,6,5
SORTLEN: equ ($-sorted)/8
section .text
;;; _start
;;; Program entry point
global _start
_start:
mov rdi, data
mov rsi, DATLEN
call insertion_sort
mov rdi, data
mov rsi, DATLEN
mov rdx, 10000
call linear_search
mov r10, rax ; Save index
mov rdi, data
mov rsi, DATLEN
mov rdx, 10000
call binary_search
mov r8, 0
mov r9, 1
cmp rax, r10
cmove rdi, r8
cmovne rdi, r9
; Exit with sorted-ness in exit code
mov rax, 60
syscall
;;; is_sorted
;;; Returns true if an array is sorted.
;;;
;;; rdi = address
;;; rsi = length
;;; Returns: rax = 1 if sorted, 0 otherwise.
is_sorted:
; rdi = address of array
; rsi = length
; Returns: rax = 1 if sorted
dec rsi
mov rax, 0
.while:
cmp rax, rsi
je .done
mov r8, qword [8*rax + rdi]
mov r9, qword [8*rax + rdi + 8]
cmp r8, r9
jle .continue
mov rax, 0
ret
.continue:
inc rax
jmp .while
.done:
mov rax, 1
ret
;;; insertion_sort
;;; Sorts and array of qwords
;;; rdi = address
;;; rsi = length
;;; Returns nothing
insertion_sort:
; rdi = addr of array
; rsi = length of array
mov rax, 1 ; rax = outer loop index, unsigned
.while_i:
cmp rax, rsi
je .done
mov rbx, rax ; rbx = inner loop index
.while_j:
cmp rbx, 0
je .done_i
mov r8, qword [rdi + 8*rbx]
mov r9, qword [rdi + 8*rbx - 8]
cmp r8,r9
jge .done_i ; break
; swap
mov qword [rdi + 8*rbx], r9
mov qword [rdi + 8*rbx - 8], r8
dec rbx
jmp .while_j
.done_i:
inc rax
jmp .while_i
.done
ret
;;; binary_search
;;; Does a binary search over an array for a target value
;;; rdi = address
;;; rsi = length
;;; rdx = target
;;; Returns: index of target or -1 (0xfffff...) if not found
binary_search:
; rdi = address of array
; rsi = length of array
; dl = target to search for
mov rax, 0 ; rax = low = 0
mov rbx, rsi ; rbx = high = len-1
dec rbx;
.while:
cmp rax, rbx
jae .not_found ; Stop if rax > rbx
; rcx = mid = (high - low)/2 + low
mov rcx, rbx ; rcx = high
sub rcx, rax ; rcx -= low
shr rcx, 1 ; rcx /= 2
add rcx, rax ; rcx += low
mov r14, rcx ; r14 = mid+1
inc r14
mov r15, rcx ; r15 = mid-1
dec r15
cmp rdx, qword [rdi + 8*rcx] ; compare target, arr[mid]
cmovg rax, r14 ; target > arr [mid] ===> low = mid+1
cmovl rbx, r15 ; target < arr [mid] ===> high = mid-1
cmove rax, rcx ; target == arr[mid] ===> rax = mid (index)
je .return ; target == arr[mid] ===> return
jmp .while
.not_found:
cmp qword [rdi + 8*rax], rdx
je .return
mov rax, -1
.return:
ret
;;; linear_search
;;; Does a linear search in an array.
;;; rdi = address
;;; rsi = length
;;; rdx = target
;;; Returns rax = index of target or -1 (0xffff...) if not found
linear_search:
; rdi = address of array
; rsi = length of array
; dl = target to search for
; Returns: rax = index of target, or -1 (0xffff...) if not found
mov rax, 0 ; Index
.while:
cmp rax, rsi
je .not_found
cmp rdx, qword [rdi + 8*rax]
jne .continue ; Not equal: next iteration
ret ; Equal, return current rax
.continue:
inc rax
jmp .while
.not_found:
mov rax, -1
ret
This does an insertion sort, then does a linear search, followed by a binary search (for the same target), and exits with code 0 (success) if both searches returned the same index.
The above file is available on the server as
/usr/local/class/cs241/sort_search.s
.
Bubble sort
Implementing Bubblesort is left as an exercise, but for reference, the C/C++ implementation is
void bubble_sort(long* arr, unsigned long len)
{
unsigned long i = len-1, j;
while(i > 0) {
j = 0;
bool swapped = false;
while(j < i) {
if(arr[j] < arr[j+1]) {
std::swap(arr[j], arr[j+1]);
swapped = true;
}
++j;
}
if(!swapped)
break;
--i;
}
}
C-compatible Functions
As we’ve seen, functions in assembly operate at a much lower level than in C or C++. We’ll have to learn how to manually manage the stack space used by our functions.
call
and ret
The basic instructions for implementing function calls are call
and ret
:
call Func
pushes the IP registerrip
onto the stack and then jumps to the labelFunc
.ret
pops the top of the stack intorip
and then jumps torip
. Becauserip
always points to the next instruction, this will jump to the instruction immediately after thecall
that brought us into this function.
Thus, at the most basic level, the stack is used to keep track of where to return to. This allows us to have function f call function g, and then g calls h, and when h/g return, we have a trail of “breadcrumbs” (return addresses) on the stack, telling us how to get back to f.
This is not all the stack is used for, however. Because we have a limited number of registers, we also use the stack for:
Temporary storage for register values we want to save and then restore later.
Passing parameters, when a function has more parameters than will fit in the six registers designated for this.
Returning values, when the return value is larger than will fit in
rdx:rax
.Local variables, when there are more local variables than available registers (or when we need to use the address of a local variable; registers do not have addresses).
Stack direction
The stack, contrary to the way we normally draw it, starts at the end of
a program’s memory space and grows backwards, towards 0. Thus, the stack
operations push
and pop
have the following effects:
push x
: subtract the size ofx
in bytes fromrsp
, and thenmov [rsp], x
.pop x
:mov x, [rsp]
and then add the size of x torsp
.
It’s important to bear in mind that the size of the operand to push
/pop
affects how the stack pointer moves. If you push a word, rsp
will be
decremented by 2.
You can push
/pop
registers, memory operands, or immediates, but the minimum
size is a word. You cannot push/pop individual bytes.
Stack alignment
The System-V x86-64 function call specification requires that the top of the
stack
(i.e., the address in rsp
) always be aligned to a multiple of 16 bytes before any call
instruction.
This means that the value of rsp
must always
be expressible as
call
a function. Put another way, if you
divide rsp
by 16, you should always get a remainder of 0.
When we call
a function, the call
instruction itself pushes the return
address onto the stack, making the stack alignment now +8 (i.e., rsp
,
when divided by 16, gives a remainder of 8).
We didn’t have to worry about this as long as we weren’t
interacting with the rest of the system, but as soon as we try to write
main
instead of _start
, we’ll need to follow these rules.
Hence, the first thing that every function should do is realign the stack pointer, either by
Subtracting 8 from
rsp
:sub rsp, 8
(remember that the stack grows down in terms of addresses!)Pushing
rbp
onto the stack (see below):push rbp
Before returning, a function should undo these actions, so that when ret
pops the rip
, the stack will again be correctly aligned.
Similarly, if you are pushing arguments onto the stack (see next section),
you will have to ensure that the final rsp
immediately before the call is
properly aligned, otherwise the call
will segfault, because the compiler
has generated aligned-moves in the assembly of the function.
Leaf functions
A “leaf function” is one that
Takes all of its parameters in registers
Returns its result in registers
Does not call any other functions
Does not push anything onto the stack
I.e., a leaf function does not use the stack, either directly or indirectly.
Hence, a leaf function is allowed to not align the stack pointer, and also
to skip setting up rbp
(see below). This is fine because a leaf function is
never going to call another function or use the stack.
Passing arguments
Traditionally, arguments were passed by pushing them onto the stack, in the reverse order of that in which they are passed (i.e., last argument pushed first). This incurs significant overhead on calling functions, as functions must access memory in order to read in their arguments. The x86-64 application binary interface (which defines the function calling convention and many other standard assembly-level “interfaces”) modifies this so that
The first six (non-floating point) arguments are passed via registers, specifically the registers
rdi
,rsi
,rdx
,rcx
,r8
andr9
. If an argument is less than 64-bits then it should use the appropriate portion of its respective register (e.g.,edi
for 32-bits).The first eight floating-point parameters are passed via the floating point registers
xmm0
throughxmm7
.Any remaining arguments are pushed onto the stack in reverse order.
Arguments larger than 64-bits are generally passed either on the stack, or in memory, as an address. We’ll cover passing/returning structures later, but see the System-V x86-64 ABI in the meantime for all the gory details.
E.g., suppose we have a function
int f(long x, float y, char* z);
The registers used by this function in call/return are
rdi
– contains the value ofx
(signed qword)xmm0
– contains the value ofy
rsi
– contains the value ofz
(address, qword)eax
– contains the return value (dword)
Note that because any stack-based arguments are pushed by the calling function,
they will appear on the stack below the rip
value pushed by the call
instruction. This might be somewhat surprising if you were thinking of the
called function’s stack frame (see below) as “starting” at the pushed rip
.
The stack frame includes, first, any registers that the function needs to
preserve (caller-preserved), followed by any stack-based local variables.
Also note that, because a function pushes any extra arguments before rip
,
it is not possible for the called function to pop them off; hence, if your
function pushes any arguments onto the stack, it should pop them off, or just
increment rsp
appropriately, after the
function returns, so as to leave the stack in the same state.
As mentioned above, before executing call
the stack pointer must be aligned
to a multiple of 16 + 8, so if you are calling a function which has any
stack-based arguments, you should add up the total size of these arguments and
then, if necessary, adjust the rsp
(by subtraction) so that it is aligned.
The reason for requiring stack alignment is that it enables the use of the
more-efficient aligned move instructions. These move values into/out of
the vector registers xmm0,1,...
, but they require that the address of a
memory operand be aligned to a multiple of 8 bytes. Attempts to do an aligned
move from an unaligned address triggers a segfault. (E.g., if you call a
C library function without aligning the stack pointer first, your program
will segfault at the point of the call.)
Returning results
The return result should be placed in rax
, or xmm0
if floating point.
Technically,
rax
and rdx
can be used to return either two values (not supported in
C/C++) or a single 128-bit value, as rdx:rax
.
Larger return values are typically “returned” by passing in a pointer to
memory as an implicit first argument (i.e., in rdi
, pushing all remaining
arguments down in the list), modifying the memory within the function,
and then returning the same pointer in rax
. Often, the calling function
will allocate space for the structure in its stack frame, but this is not
easily accessible to the called function, hence the pointer-argument.
Returning from a function with local variables/callee-saved regs.
Remember that the ret
instruction simply pops the stack into rip
and then
jumps there. Hence, if a function used the stack for any of its local variables,
it must ensure that the top of the stack contains the calling function’s
rip
before executing ret
. Typically this is done not by pop
-ing, but
just by incrementing the stack pointer rsp
. Similarly, if you had to adjust
rsp
to align it, you will have to de-align before you start popping.
If rip
is not on the top of the
stack when ret
is executed, whatever 8 bytes are on top of the stack will be
treated as a 64-bit address and jumped to, most likely jumping into random memory
and crashing your program.
Register usage
Because all functions must use the same set of registers, there is competition for who gets to use which register: is the value in a given register guaranteed to be preserved by a function call, or is the function allowed to use the register without bothering to save its value? Both have certain advantages:
If a function must preserve the value of a register (callee-preserved) it makes it much easier to call functions; you don’t have to worry about saving your registers before every
call
, and then restoring them afterwards (think about what we have to do withrcx
when doing asyscall
).If a function is allowed to clobber the value of a register (caller-preserved) it makes it easier and faster to write functions: functions don’t have to do any extra work to save/restore registers, so they execute faster and use less of the stack.
When calling a function, some registers are used for passing arguments (see next section), some are reserved for the use of the called function (caller-preserved), and some are guaranteed/required to be saved by the called function (callee-preserved). You can rely on these registers containing the same values before and after a function call.
Of course, this means that
when you are writing a function, you have to take care to preserve the values
in these registers if you are using them, and restore them before you ret
.
The best way to do this is to push
them onto the stack at the beginning
of your function, and then pop
them off immediately before you ret
.
(Of course, if any of your function’s arguments are on the stack – see the
next section – then you will have to pop those before pushing the callee-saved
registers.)
Register | Use |
---|---|
rax |
Return value |
rbx |
Callee-preserved |
rcx |
4th argument |
rdx |
3rd argument |
rsi |
2nd argument |
rdi |
1st argument |
rbp |
Callee-preserved |
rsp |
Stack pointer |
r8 |
5th argument |
r9 |
6th argument |
r10 |
Temporary (caller-preserved) |
r11 |
Temporary (caller-preserved) |
r12 -r15 |
Callee-preserved |
The argument and “temporary” registers are not callee saved, and must be saved by the calling function if their values should be preserved. This means that within a function we can uses these registers freely, without the need to save/restore their values. The registers labeled “callee-saved” must be saved by a function if it needs to modify them, and then restored to their original values before return.
The usual way to save the callee-saved registers (only the ones you use in your function need to be saved, of course), is to push them onto the stack. Because this is done inside the called function, at the beginning of the function in what’s called the preamble, these registers’ values will appear on the stack above the pushed rip.
Floating-point arguments
Although we haven’t covered floating-point yet, that uses registers as well, at least for the first 8 f-p arguments:
The first 8 floating-point arguments are passed in registers
xmm0
-xmm7
.Remaining f-p arguments are passed on the stack.
For variadic functions (those which take an unlimited number of arguments), set the
al
register to the number of floating-point arguments passed in xmm registers.
Stack frame
The portion of the stack that is “created” when a function is called is called
its stack frame. At a minimum, this must include the return address
(i.e., the rip
that the ret
instruction will jump to), but may include
additional information:
Arguments, if there are more than 6
Values of callee-saved registers
Local variables (if there are too many to fit in registers)
Many small functions need nothing more than the return address; these functions have a fixed-size stack frame, just big enough to hold the address (i.e., 8 bytes).
Traditionally, the rbp
(base pointer) register was used to point to (sort of) the
beginning of a function stack frame; its value points to the stack element
just above the pushed return address. If we want to use rbp
this way, then we
must preserve it over function calls, so the value on the stack that rbp
points
to is in fact the calling function’s rbp
. Thus, the stack is laid out like
Addr | Contents |
---|---|
… | |
—- | |
caller-saved registers… | |
rbp+n |
… |
rbp+16 |
stack-based arguments… |
rbp+8 |
rip (return address) |
rbp |
caller’s rbp (<-- rbp ) |
rbp-8 |
callee saved regs and local vars |
rbp-m |
… |
rsp |
top of stack (<-- rsp ) |
… | Red area |
rsp-128 |
This allows stack-based arguments and local variables to be accessed via
offset from rbp
. E.g., the first stack-passed argument is located at
rbp + 16
. (Because arguments are pushed by the caller, they come
before rip
on the stack, which is implicitly pushed by call
.) Of course,
the size of any stack-passed arguments affects where, above rbp
, they will
be located.
Before a function returns, it simply sets rsp
to rbp
(thus “popping”
everything on the stack other than rbp
and rip
), pops into rbp
(restoring the caller’s rbp
) and the ret
urns.
Used thus, rbp
doesn’t exactly point to the beginning of the stack frame,
but it points to the element just above rip
, which is close enough. (The
passed arguments on the stack, having been pushed by the calling function,
are not part of the called function’s stack frame; they will be cleaned up
by the calling function.)
Pushing/popping the rbp
also takes care of aligning the stack pointer, as the
rbp
is 8 bytes, exactly the offset we would need to manually subtract to
realign rsp
. Again, if you don’t push rbp
, you must manually align rsp
by subtracting 8 from it. If you are pushing callee-saved registers or
reserving space for local variables, you may need to subtract more or less in
order to correctly align the stack pointer. It’s up to you to ensure that rsp
is correctly aligned.
If you know where something is on the stack
in relation to rsp
(or, more likely, rbp
), you can use it directly:
mov rax, [rbp + 16] ; First pushed argument
Similarly, if you’ve reserved some space above rbp
for local variables, you
can access them as
mov rax, [rbp - 8] ; First local variable (assuming no called-saved regs)
The 128 bytes above the top of the stack (i.e., below rsp
) are called the
“red area”. They can be used for temporary storage, although if you call any
functions it may be overwritten. This allows “semi-leaf” functions (functions
which do not call any other functions, but still want to use some stack space).
enter
and leave
The enter
and leave
instructions perform this stack setup/teardown for you.
enter n
pushesrbp
onto the stack and thenmov rbp, rsp
and then decrementsrsp
by n, effectively reserving n bytes on the stack for local variables or callee-saved registers.leave
doesmov rsp, rbp
and thenpop
s the oldrbp
. (Note thatrsp
must be adjusted prior toleave
to ignore any local variable space!)
enter
and leave
are useful when we have functions which have local
variables on the stack. However, as we’ve seen, we usually have enough
registers so that this is not necessary, so the convenience of having
instructions to setup/teardown the stack is not really necessary. I won’t use
these instructions in any of the functions I write.
Calling standard library functions
If you link with the C standard library, then the entry point of your program
becomes main
, not _start
; the exit code should be returned in rax
(and
main
should end with ret
). (All C functions are stored in the executable
file with an underscore added to their names.)
To call C functions from assembly, it’s as simple as
extern printf
...
mov rdi, msg ; Address of nul terminated string
call printf
Thus, a “Hello, world” program using the standard library becomes just
section .data
msg: db "Hello, world!", 10, 0
section .text
extern printf
global main
main:
push rbp
mov rbp, rsp
mov rdi, msg
call printf
; Return value if any in rax
pop rbp
mov rax, 0 ; "return 0;"
ret
(If you forget to push rbp
or add rsp, 8
at the beginning, you’ll get a
segfault when you try to call printf
.)
To assembly a function that uses the standard library, you can either use the
asm
script provided, or do it manually, via
yasm -g dwarf2 -f elf64 -o program.o program.s -l program.lst
gcc -o program program.o
Only the link step is different; the assemble command-line is the same as before. As you can imagine, this will make things much easier going forward!
Function template
The following is a template for a function which preserves the callee-saved
registers and uses rbp
to point to the beginning of its stack frame:
;; func_name(arguments...)
;; What the function does
;;
;; Arguments:
;; arg1 -- rdi
;; arg2 -- si
;; ... etc
;; arg7 -- qword, stack [rbp + 16]
;;
;; Return value: eax
;;
global func_name
func_name:
;; Preamble
push rbp ; Save calling function's rbp
mov rbp, rsp ; rbp points to our stack frame
push r12 ; Push any callee-saved registers you use
; If you need space for stack-based local variables, you can reserve it with
; sub rsp, amount
; Realign rsp to 16*rsp + 8
;; Function body
...
;; Epilogue
; Remove any alignment added
; If you reserved any space for stack-based local variables, you should
; restore it with
; add rsp, amount
pop r12 ; Restore callee-saved registers
pop rbp ; Restore caller's rbp
ret
Recursive functions
Note that a function written using the above template can be recursive without
any additional work. As we would expect, the stack is used to keep track of
all the different function invocations. Here is a recursive function which
computes sum(x) = x + (x-1) + ... + 2 + 1 + 0
:
global sum
sum:
;; Preamble
push rbp ; Save calling function's rbp
mov rbp, rsp ; rbp points to our stack frame
push rbx ; Push any callee-saved registers you use
sub rsp, 8 ; Re-align rsp
cmp rdi, 0 ; If 1st arg == 0, goto base case
je base_case
; Recursive case
mov rbx, rdi ; Save argument
dec rdi
call sum ; Call sum(x-1)
add rax, rbx ; Add, return in rax
add rsp, 8
pop rbx ; Restore callee-saved registers
pop rbp ; Restore caller's rbp
ret
base_case:
mov rax, 0 ; Return 0
add rsp, 8
pop rbx ; Restore callee-saved registers
pop rbp ; Restore caller's rbp
ret
Note that we have to clean up the stack before any ret
is issued, so if
your function has multiple exit points, they must all have the same
cleanup code. You might want to have a single exit point at the end of your
function, and then just jmp
to it from every other exit point, for
simplicity.
Rewriting insertion sort as a proper function
Here we will rewrite our insertion_sort
as a well-behaved function.
The registers we use in the implementation are
rdi, rsi
as arguments. This is fine, we don’t have to save their values or anything.rax,rbx
as loop counters.rax
is fine, because that is supposed to be the return value, and our function doesn’t return anything, butrbx
is callee-preserved, so it’s our responsibility to save it and restore it before weret
.r8
andr9
as temporary values used in swapping. These are normally used as arguments, so there’s no guarantee that the callee will preserve them, so we’re fine there.
;;; insertion_sort
;;; Sorts and array of qwords
;;; rdi = address
;;; rsi = length
;;; Returns nothing
insertion_sort:
; rdi = addr of array
; rsi = length of array
push rbp ; Stack base pointer
mov rbp, rsp ; rbp points to our stack frame
push rbx ; Save rbx
sub rsp, 8 ; Re-align rsp
mov rax, 1 ; rax = outer loop index, unsigned
.while_i:
cmp rax, rsi
je .done
mov rbx, rax ; rbx = inner loop index
.while_j:
cmp rbx, 0
je .done_i
mov r8, qword [rdi + 8*rbx]
mov r9, qword [rdi + 8*rbx - 8]
cmp r8,r9
jge .done_i ; break
; swap
mov qword [rdi + 8*rbx], r9
mov qword [rdi + 8*rbx - 8], r8
dec rbx
jmp .while_j
.done_i:
inc rax
jmp .while_i
.done:
add rsp, 8 ; De-align rsp
pop rbx
pop rbp
ret
Interoperating with C-functions
The rules given above describe the “calling convention” used by C functions. (C++ functions use the same calling conventions, but the names of C++ functions are “mangled” to include their argument and return types.) This means that by using the above rules, we can
Call functions written in C, including C-standard-library functions.
Call assembly functions from C/C++ code.
In order to call a C function, we have to do three things:
Write the entry point of our program as
main
instead of_start
, declared asglobal
. Setup the stack correctly on function entry, tear it down on exit, and return the exit code of the program throughrax
(i.e., no need to use theSYS_EXIT
syscall).Declare any C functions we want to call in our assembly file as
extern
. E.g.,extern printf
.Link with the C library. The
asm
script detects whether your assembly file uses_start
as its entry point, ormain
, and will automatically link with the C standard library in the latter case, so you don’t need to worry about it.
Here’s an example: The C function puts(char* s)
writes a nul-terminated
string to the standard output stream, followed by a newline. I.e., it does
what the SYS_WRITE
syscall does, except that we don’t need to give it the
length (just make sure we end the string with a 0-byte), and it will automatically
add a newline. To print the “Hello, world!” message using puts
we do
section .data
msg: db "Hello, world!", 0
; Note: no length, no 10 (newline) byte
section .text
extern puts
global main
main:
; Setup stack
push rbp
mov rbp, rsp
; Set registers to arguments.
mov rdi, msg
call puts
; Cleanup stack, return 0
pop rbp
mov rax, 0
ret
One very useful C function is printf
, which serves as a replacement for cout
.
It takes at least one argument, a format string (char*
, nul-terminated),
which contains placeholders, starting with the %
character. The remaining
parameters are “spliced in” during printing by replacing their respective
placeholders. For example,
printf("%d, %d, %d\n", 1, 2, 3);
would print
1, 2, 3
followed by a newline. %d
is the placeholder for a signed 32-bit integer value.
The useful placeholders are
Placeholder | Description | Argument type |
---|---|---|
%c |
Single character | int (not char !) |
%s |
Nul-terminated string | char* |
%d |
Signed integer | int (dword) |
%ld |
Signed integer | long (qword) |
%hd |
Signed integer | short (word) |
%hhd |
Signed integer | char (byte) |
%u |
Unsigned integer | unsigned int |
%lu , %hu , … |
Unsigned integers | As for %d |
%f |
Floating-point value | double (not float !) |
As printf
takes any number of arguments, it is a variadic function, which
adds an extra rule to how its arguments are passed:
- If there are any floating-point arguments passed in the
xmm
registers, then setal
to the number of these arguments.
A counterpart to printf
is scanf
which takes a similar formatting string
except that it reads input from stdin, according to the formats given. The
remaining arguments must be pointers to variables which the inputs will be
written into. E.g., if we do
int x, y, z;
scanf("%d %d %d", &x, &y, &z);
and type in 1 2 3
then variable x
will contain 1, y
will contain 2, and
z
will contain 3.
With scanf
, it is vitally important to use the right format specifier! E.g.,
using %d
(dword) when you are intending to read in a qword will cause
problems, typically because scanf
will only write to the low dword, leaving
the high 4 bytes unchanged.