Creating Trivial RISC-V Executables Using Assembly Code
This post shows an approach for generating (relatively) lightweight executable files. We use the ELF format, since it lets us use existing binutils tools (such as readelf or nm), but if ELF is too bulky or inflexible for your needs, it is not too difficult to come up with your own executable format. In that case, it is probably easiest to first generate ELF files, extract the post-relocation bytes from the ELF sections, and then pack those bytes into your desired format, although the extra step will lengthen the time it takes to produce the executable artifacts.
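As a rough illustration of that extract-and-pack step, here is a minimal Python sketch, assuming a little-endian ELF64 input such as the files produced later in this post. The helper names read_sections and pack_flat, and the flat layout itself (a 64-bit address, a 64-bit length, then the bytes), are hypothetical and only meant for illustration.

```python
import struct

ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, little-endian
SECTION_HEADER = struct.Struct("<IIQQQQIIQQ")    # Elf64_Shdr, little-endian
SHT_NOBITS = 8  # sections such as .bss occupy no bytes in the file


def read_sections(elf: bytes) -> dict:
    """Map each section name to a (load address, raw bytes) pair."""
    fields = ELF_HEADER.unpack_from(elf, 0)
    ident, shoff = fields[0], fields[6]
    shentsize, shnum, shstrndx = fields[11], fields[12], fields[13]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")

    headers = [SECTION_HEADER.unpack_from(elf, shoff + i * shentsize)
               for i in range(shnum)]

    # Section names are stored in the section-header string table.
    strtab_off, strtab_size = headers[shstrndx][4], headers[shstrndx][5]
    strtab = elf[strtab_off:strtab_off + strtab_size]

    sections = {}
    for name_off, sh_type, _flags, addr, offset, size, *_ in headers:
        name = strtab[name_off:strtab.index(b"\0", name_off)].decode()
        data = b"" if sh_type == SHT_NOBITS else elf[offset:offset + size]
        sections[name] = (addr, data)
    return sections


def pack_flat(sections: dict, names=(".text", ".data")) -> bytes:
    """Hypothetical flat format: address, length, then bytes, per section."""
    out = bytearray()
    for name in names:
        addr, data = sections[name]
        out += struct.pack("<QQ", addr, len(data)) + data
    return bytes(out)
```

With a linked file on disk, something like pack_flat(read_sections(open("program-1", "rb").read())) would produce the packed image, assuming the file contains both .text and .data sections.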
Before we automate the process of generating executable code, we are going to look at how to do this process manually, for the following reasons:
- to check whether our approach even works,
- to define a clear goal for our compiler,
- to assist in debugging programs (that is, to manually create the smallest reproducible example of the problem), and finally,
- to better understand the behavior of an instruction, or a combination of instructions, by executing them and observing their side effects (especially when the specification is ambiguous).
However, we are soon going to run into a circular dependency. Specifically, several parts of setting up code for execution depend heavily on the execution platform, so we cannot write a complete, functioning program without first knowing a little about the execution platform. In this post, we are going to break this dependency by generating executables that are not ready to be executed yet. Once we learn more about the simulation platform in the next post, we will revisit our process of generating executable files.
A Starter ELF File
At the very least, to start creating elementary ELF files, we need an assembly source file. For now, we’ll start with one that runs in an infinite loop (recall that signaling program termination is entirely dependent on the execution platform, which we will dive into in a later post). Our barebones assembly source is:
.text
.global _start
_start:
j 0 # Jump to self, thus looping indefinitely
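To see which bytes this single instruction should turn into, the Python sketch below encodes it by hand. It relies on two facts from the RISC-V unprivileged specification: j is a pseudo-instruction for jal with x0 as the destination register, and jal uses the J-type immediate layout. The helper name encode_jal is hypothetical and not part of any toolchain.

```python
def encode_jal(rd: int, offset: int) -> int:
    """Encode `jal rd, offset` as a 32-bit instruction word."""
    if offset % 2 != 0 or not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("offset must be even and fit in 21 signed bits")
    imm = offset & 0x1FFFFF  # two's complement, 21 bits
    return (
        (((imm >> 20) & 0x1) << 31)      # imm[20]
        | (((imm >> 1) & 0x3FF) << 21)   # imm[10:1]
        | (((imm >> 11) & 0x1) << 20)    # imm[11]
        | (((imm >> 12) & 0xFF) << 12)   # imm[19:12]
        | (rd << 7)                      # destination register
        | 0x6F                           # JAL opcode
    )


# A jump with offset 0 targets its own address.
print(hex(encode_jal(0, 0)))   # 0x6f
print(hex(encode_jal(0, -4)))  # 0xffdff06f
```

The first word is what a disassembler should show for the .text section if the assembler treats the operand as a relative offset of zero.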
Compiling this source file into an ELF executable file is trivial (clang --target=riscv64 -c -nostdlib -o program-0 program.S), but this not only hides several parts of the ELF file generation, it also adds some (small) unnecessary bits to the ELF file that we probably don’t care about. So let’s use an alternative approach that skips some of the defaults to illustrate various parts of the process.
First, we can compile the assembly source into an object file (clang --target=riscv64 -march=rv64i -c program.S -o program.o). Although it is optional, we explicitly specify -march=rv64i to target a processor that only supports the base (integer) ISA. Now that we have the object file, we can produce the ELF file, but instead of using the default linker script (which specifies the placement of instructions and data in memory), let’s supply our own, with only the sections we care about.
ENTRY (_start)
SECTIONS
{
/* specify the sections that are in the assembly source */
.data : { *(.data) }
.text : { *(.text) }
/* include certain sections that are required for ELF processing */
.shstrtab : { *(.shstrtab) }
.strtab : { *(.strtab) }
.symtab : { *(.symtab) }
/* throw away all other sections */
/DISCARD/ : { *(*) }
}
This linker script trims away sections that we don’t care about. When we invoke the linker with the object file and the above linker script (ld.lld --nmagic --script=program.ld -o program-1 program.o), we get a smaller ELF file. Here’s what I see on my machine:
> ls -lh program-*
-rwxr-xr-x 1 user user 1.3K Apr 2 19:17 program-0*
-rwxr-xr-x 1 user user 904 Apr 2 19:17 program-1*
The --nmagic option is key to keeping the binary size small. Since we anticipate only a few sections in the ELF file, having these sections at page-aligned offsets doesn’t help us much, so we use --nmagic to pack all loadable sections as closely as possible in the ELF file.
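One way to check how the loadable segments are laid out in a file is to look at its program headers (readelf -l prints them). The Python sketch below does the same by hand, assuming a little-endian ELF64 file; load_segments is a hypothetical helper name.

```python
import struct

ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, little-endian
PROGRAM_HEADER = struct.Struct("<IIQQQQQQ")      # Elf64_Phdr, little-endian
PT_LOAD = 1  # segment type for loadable segments


def load_segments(elf: bytes) -> list:
    """Return (file offset, virtual address, file size, alignment) per PT_LOAD."""
    fields = ELF_HEADER.unpack_from(elf, 0)
    ident, phoff, phentsize, phnum = fields[0], fields[5], fields[9], fields[10]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")

    segments = []
    for i in range(phnum):
        (p_type, _flags, offset, vaddr,
         _paddr, filesz, _memsz, align) = PROGRAM_HEADER.unpack_from(
            elf, phoff + i * phentsize)
        if p_type == PT_LOAD:
            segments.append((offset, vaddr, filesz, align))
    return segments
```

Calling it on the same object file linked with and without --nmagic lets you compare the file offsets and alignments of the loadable segments directly.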
I am sure that there are more techniques for trimming the ELF file further, but this seems good enough given that we didn’t have to expend a lot of effort to get to 904 bytes.
ELF File with Real Computation
One of the main reasons why writing non-trivial programs in assembly by hand becomes complicated is manual register allocation. Specifically, we not only need to reason about register usage in the face of potentially complex control flow, but we also need to re-analyze the register usage each time we add or remove lines from the assembly code. For instance, here is a RISC-V assembly program that computes the sum of the values in each row of a 4x7 matrix.
.set ROWS, 4
.set COLS, 7
.text
.global _start
_start:
la t0, input # source pointer
la t1, output # result pointer
li t3, ROWS # outer loop counter
outer:
li t2, 0 # result
li t4, COLS # inner loop counter
inner:
lw t5, 0(t0) # read from source pointer into t5
add t2, t2, t5 # accumulate result for a single row in t2
addi t0, t0, 4 # bump source pointer by four bytes
addi t4, t4, -1 # decrement inner loop counter
bnez t4, inner
sw t2, 0(t1) # store the row sum through the result pointer
addi t1, t1, 4 # bump result pointer by four bytes
addi t3, t3, -1 # decrement outer loop counter
bnez t3, outer
j 0 # infinite loop as a proxy for termination logic
.data
input:
# use assembler syntax to create a 4x7 matrix
.set outer_idx, 0
.rept ROWS
.set inner_idx, 0
.rept COLS
.word outer_idx + inner_idx
.set inner_idx, inner_idx + 1
.endr
.set outer_idx, outer_idx + 1
.endr
output:
# initialize the result with zeros
.rept ROWS
.word 0
.endr
Even in this trivial example, correctly deciding which register(s) to use in each instruction is difficult. If we need to update the code in the inner loop, we would then need to recompute the liveness of each register. And all this is before we even need to spill registers and keep track of spill slots!
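When editing assembly like this by hand, it helps to know what the output array should contain once the loops finish. The Python sketch below is a small reference model of the program above, not a simulator: it rebuilds the input matrix the same way the nested .rept blocks do and computes the expected row sums. The function names are hypothetical.

```python
ROWS, COLS = 4, 7
MASK32 = 0xFFFFFFFF  # lw/sw operate on 32-bit words


def build_input(rows=ROWS, cols=COLS):
    """Flat list of words, mirroring the nested .rept blocks in .data."""
    return [(r + c) & MASK32 for r in range(rows) for c in range(cols)]


def row_sums(words, rows=ROWS, cols=COLS):
    """Expected contents of the output array, one 32-bit word per row."""
    return [sum(words[r * cols:(r + 1) * cols]) & MASK32 for r in range(rows)]


print(row_sums(build_input()))  # [21, 28, 35, 42]
```

Comparing these four words against the memory at the output label is a quick way to catch a register that was clobbered by an edit.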
If the vector extension is available to us, the same row-sum assembly becomes a lot shorter, with fewer registers to reason about, but it is still easy to make mistakes.
.set ROWS, 4
.set COLS, 7
.text
.global _start
_start:
li t0, 0x600
csrrc t1, mstatus, t0
or t1, t1, t0
csrw mstatus, t1 # set MSTATUS.VS to enable vector instructions
# note: at e32/m1, VLMAX = VLEN/32, so vl reaches COLS (7) only if VLEN >= 256
vsetivli zero, COLS, e32, m1, tu, ma
la t0, input # source pointer
la t1, output # result pointer
li t2, ROWS # outer loop counter
vmv.s.x v4, zero # v4[0] = 0
loop:
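vle32.v v0, 0(t0) # load one row (COLS elements) into v0
vredsum.vs v8, v0, v4 # v8[0] = v4[0] + sum of the elements in v0
vmv.x.s t3, v8 # copy v8[0] into t3
sw t3, 0(t1) # store the row sum through the result pointer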
addi t0, t0, COLS * 4 # bump source pointer to the next row
addi t1, t1, 4 # bump result pointer to the next element
addi t2, t2, -1 # decrement outer loop counter
bnez t2, loop
j 0 # infinite loop as a proxy for termination
.data
input:
# use assembler syntax to create a 4x7 matrix
.set outer_idx, 0
.rept ROWS
.set inner_idx, 0
.rept COLS
.word outer_idx + inner_idx
.set inner_idx, inner_idx + 1
.endr
.set outer_idx, outer_idx + 1
.endr
output:
# initialize the result with zeros
.rept ROWS
.word 0
.endr
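One mistake that is easy to make in the vector version is assuming that a whole row always fits in one vector register. Per the RISC-V vector specification, VLMAX = LMUL * VLEN / SEW, and vsetivli never sets vl above VLMAX. The Python sketch below checks which configurations can hold a row of COLS = 7 elements; the helper names are hypothetical, and only integer LMUL values are modeled.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int = 1) -> int:
    """VLMAX = LMUL * VLEN / SEW (integer LMUL only in this sketch)."""
    return (vlen_bits * lmul) // sew_bits


def row_fits(vlen_bits: int, cols: int = 7, sew_bits: int = 32, lmul: int = 1) -> bool:
    """True if vsetivli can grant vl == cols for this configuration."""
    return cols <= vlmax(vlen_bits, sew_bits, lmul)


print(row_fits(128))          # False: VLMAX is 4 at e32/m1
print(row_fits(256))          # True:  VLMAX is 8 at e32/m1
print(row_fits(128, lmul=2))  # True:  VLMAX is 8 at e32/m2
```

On an implementation where the row does not fit, each vle32.v would load only part of the row, so the stored sums would be too small.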
But all of this becomes a lot easier to do if we use C or C++ with vector intrinsics, since doing so lets us focus on the instructions, while leaving the tedious part of register allocation to the LLVM backend. That’s coming up in the next post.