RISC-V ISA Peculiarities
At work, I am helping write a compiler for a RISC-V processor. Coming from a background where I had to routinely read and write x64 assembly, I find that the RISC-V Instruction Set Architecture (ISA) makes it much easier to write assembly code by hand. This post records some preliminary information about the ISA before diving into a couple of peculiarities.
Base ISA and Extensions
Perhaps the primary choice you will need to be aware of when compiling for your RISC-V processor is its word size, which is 32 bits for RV32 processors and 64 bits for RV64 processors. There is work in progress to define RV128, but since it’s not official yet, we will limit ourselves to RV32 and RV64. The word size determines the size of both addresses and data. When it comes to instructions, RV32’s instructions are limited to operating on at most 32 bits, but RV64 extends RV32’s instruction set with a few more instructions that let you operate on 64 bits. In other words, you can use instructions from RV32 on RV64 processors, but not the other way around. Both RV32 and RV64 support 32 registers, but their width, of course, varies between 32 bits and 64 bits. When you invoke the LLVM backend, you will need to make this choice early by specifying the target triple as either riscv32 or riscv64.
Orthogonal to the choice between RV32 and RV64, the RISC-V ISA also offers choices in the form of Extensions, which are essentially new instructions grouped together according to functionality. All extensions are optional (and thus left as a choice for the designer of the processor) except for the extension supporting integer instructions. Here, the designer has a choice: either support the full 32 registers, thus designing an RV32I or RV64I processor, or only support 16 registers, thus designing an RV32E or RV64E processor.
Depending on which other extensions are enabled (say, M for multiply and divide instructions, A for atomic instructions, F for single-precision floating-point instructions, etc.), RISC-V processors are conventionally described by tacking on all the extensions that they support, usually in a particular sequence; so RV64IMAFC describes a 64-bit RISC-V processor that supports the integer instructions, multiplication and division instructions, atomic instructions, floating-point instructions, and compressed instructions. Sometimes, you will see the extension “G” and no references to “I”; that’s because “G” is just shorthand for “IMAFD” (i.e. integer, multiplication and division, atomic, and single- and double-precision floating-point instructions).
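As a toy illustration of the naming convention, the “G” shorthand can be expanded mechanically. This is a hypothetical helper of my own, not part of any real tool, and it ignores the fact that G formally also implies the Zicsr and Zifencei extensions:

```python
# Hypothetical helper: expand the "G" shorthand in a RISC-V ISA naming
# string into its constituent extensions. For simplicity, we ignore
# Zicsr/Zifencei, which "G" formally implies as well.
def expand_isa_string(isa: str) -> str:
    return isa.replace("G", "IMAFD")

print(expand_isa_string("RV64G"))   # RV64IMAFD
print(expand_isa_string("RV64GC"))  # RV64IMAFDC
```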
The list of RISC-V ISA extensions is constantly evolving, making it hard to enumerate all possible extensions, and although small compared to other ISAs, the set of instructions that a processor can support is still formidable. A few instructions nevertheless deserve special attention, since we will run into them frequently in the code generated by our compiler.
Address Computation Instructions
Load and store instructions are bound to occur often in whichever program you compile, and consequently, so are the instructions that compute the addresses used by those loads and stores. Since our focus will be on generating lean assembly code, these address-computation instructions are equally important. They will also come into play in our subsequent discussion of so-called Code Models, Relocation, and Position-Independent Code when generating assembly code.
All RISC-V load and store instructions require that the address be made available in one of the integer registers. Depending on the exact value of the address, it may be cheaper to use a combination of the auipc and addi instructions versus a combination of the lui, addi, and slli instructions, but in certain circumstances we may need additional, secondary loads (just for address computation), as explained below.
The lui instruction accepts a 20-bit immediate value, which is first shifted left by 12 bits (to obtain a 32-bit value) before being sign-extended and finally written to the destination register. Crucially, the instruction writes a sign-extended 32-bit value to the destination register, whether it’s an RV32 or an RV64 processor. Since the value computed using the lui instruction can range from −2³¹ to +2³¹, if we use the value as an address, we can load from or store to only the first 2GB or the last 2GB of memory. As a quick side note, if we specify the “medlow” code model to the RISC-V LLVM backend, it will generate lui instructions for address computation.
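The semantics above can be modeled in a few lines of Python. This is a sketch for illustration only; the helper names are my own:

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def lui(imm20: int) -> int:
    """Model lui: shift the 20-bit immediate left by 12, sign-extend the 32-bit result."""
    assert 0 <= imm20 < (1 << 20)
    return sign_extend(imm20 << 12, 32)

print(hex(lui(0x12345)))  # 0x12345000
print(lui(0x80000))       # -2147483648, i.e. the bottom of the +/-2GB range
```

Note how every value lui can produce is a multiple of 4096; the low 12 bits are expected to be filled in by a subsequent addi or by the immediate field of the load or store itself.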
The auipc instruction is similar to the lui instruction: the auipc instruction also accepts a 20-bit immediate value, and it also first shifts the value left by 12 bits before sign-extending it, but instead of writing this 32-bit value directly, the instruction first adds it to the program counter before writing the result to the destination register. Similar to the lui instruction, auipc writes a 32-bit value, but since this value is now relative to the program counter, if we use this value as an address, we can load from or store to an address that is within ±2GB of the program counter. The RISC-V LLVM backend emits auipc for address computation if we specify “medany” as the code model.
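Similarly, auipc can be sketched in Python. Again, this is an illustrative model, and the pc values are made up:

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def auipc(pc: int, imm20: int) -> int:
    """Model auipc: add the sign-extended, shifted immediate to the program counter."""
    assert 0 <= imm20 < (1 << 20)
    return pc + sign_extend(imm20 << 12, 32)

pc = 0x10000
print(hex(auipc(pc, 0x00001)))  # 0x11000: one 4KB page forward
print(hex(auipc(pc, 0xFFFFF)))  # 0xf000: one 4KB page backward
```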
Put another way, auipc lets us compute addresses that are ±2GB away from the program counter, whereas lui lets us compute addresses that are ±2GB away from memory address zero. So if your total addressable memory is less than 4GB, just use the lui instruction (or the “medlow” code model). If you are confident that all your memory references will be within 2GB of the referencing instruction, use auipc (or the “medany” code model). For RV32 processors, we’re all set. But what if we’re compiling for an RV64 processor and our addresses are more than 2GB away from the instruction? For that, we need either a Global Offset Table or a Literal Pool, combined with an additional load instruction.
The key idea behind referencing addresses that don’t fit into the above use cases is to create a table of the addresses that we want to load from and store to, but to place this table within 2GB of the referencing instruction(s). This way, we can use the auipc instruction to get the address of the table entry, before loading the address stored in the table entry. Of course, this implies that the text section (which contains the address-referencing instructions) should be less than 2GB in size, in addition to the table containing fewer than 256K entries, but that’s generally the case anyway. If not, we would need to split both the text section and the table and splice the text-section splits and the table splits together, which seems practically impossible if we are using the RISC-V LLVM backend, although it does seem possible (but painful) if we decided to write and use our own code for object-file generation.
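The scheme can be simulated with a toy memory model. All addresses below are made up for illustration; the point is the two-step access pattern — one load to fetch the address from the nearby table, and one to fetch the data:

```python
# Toy simulation of the address-table scheme. The code at `pc` cannot
# reach `far_addr` directly with auipc (it is more than 2GB away), but
# it can reach a nearby table entry, load the full 64-bit address from
# it (the "secondary" load), and then load the data itself.
memory = {}

pc = 0x8000_0000
far_addr = 0x40_0000_0000        # more than 2GB away from pc
memory[far_addr] = 42            # the data we ultimately want

table_entry = pc + 0x10000       # table entry placed within 2GB of the code
memory[table_entry] = far_addr   # the table holds the full address

# In assembly this would be an auipc followed by two loads.
loaded_addr = memory[table_entry]  # secondary load: fetch the address
value = memory[loaded_addr]        # final load: fetch the data
print(value)                       # 42
```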
The RISC-V LLVM backend uses this idea of creating a table and issuing secondary loads in both the “medany” code model with position-independent code (PIC) and the “large” code model. The two seem similar (the large code model calls the table of addresses the Literal Pool, whereas the medany code model with PIC calls it the Global Offset Table), but the medany code model with PIC seems preferable to the large code model because the former enables the LLVM backend to distinguish between local symbols (for example, those in the .rodata section, which are expected to live within 2GB of the text section and thus don’t need secondary loads) and non-local symbols (which are likely more than 2GB away from the text section, thus requiring secondary loads). So through careful section assignment and placement, the medany code model with PIC will probably emit more efficient code than the large code model.
The vset{i}vl{i} Instructions
To emit efficient vector code in a later post, it’s crucial to understand the semantics, versatility, and elegance of the vsetvl, vsetvli, and vsetivli instructions, which form the backbone of the vector extension. These three instructions are minor variants of each other, so for brevity, we will refer to them collectively as the vsetvl instruction.
We will use x86 vector instructions for comparison to illustrate the vsetvl instructions. On x86 machines that support SSE instructions, an add operation between two vectors can be represented using one of eight instructions: paddb, paddw, paddd, paddq, addps, addpd, addss, and addsd. The opcode in each instruction encodes the data type (whether integer or float), the width of each element (8, 16, 32, or 64 bits), and the number of elements that we want to operate on. If you now intend to emit AVX instructions, there are an additional six instructions that you need to know about: vpaddb, vpaddw, vpaddd, vpaddq, vaddps, and vaddpd. AVX-512 adds yet another set of instructions for representing the add operation. Clearly, vector code generation for x86 machines gets quite unwieldy even for basic operations.
In contrast, the RISC-V vector compute operations only specify the operation (for example, vadd.vv for integer addition or vfadd.vv for floating-point addition), whereas the control-plane information (specifically, the element width, the number of elements, and how many registers to use) is decoupled from the instruction itself and instead expressed using the vsetvl instructions, which effectively record this control-plane information in RISC-V Control and Status Registers (CSRs). Typically, writing to CSRs is very expensive, but given the expected high frequency of the vsetvl instructions, this instruction is (as I recollect) executed speculatively, thus reducing its performance impact. The information written by the vsetvl instructions is then used by all subsequent vector computation instructions until a new vsetvl instruction is executed.
Perhaps the most elegant part of the vsetvl instructions is the specification of vl, or vector length. Unlike x86 vector instructions, which implicitly encode the number of vector elements each instruction operates on, RISC-V vector instructions can work with an arbitrary number of vector elements, bounded by the physical vector register size. For instance, if the physical vector register size (also known as VLEN) is 2048 bits and the element width is 32 bits, the vector length (or vl) as specified by the programmer can be anywhere between 1 and 2048/32 = 64. Perhaps obviously, the vector length does not need to be a power of 2 either. This greatly simplifies code generation for vectorized loops, since, unlike on x86 processors, we don’t need to emit peeled loops at the beginning or the end of the vectorized loop to handle cases where the trip count is not a perfect multiple of the vector length.
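The resulting loop structure can be sketched in Python. This is a simplified model: it grants vl = min(remaining, VLMAX), ignoring the flexibility the specification gives hardware to grant less, and VLEN is an example value:

```python
VLEN = 2048           # vector register width in bits (example value)
SEW = 32              # selected element width in bits
VLMAX = VLEN // SEW   # 64 elements fit in one vector register

def vsetvli(avl: int) -> int:
    """Model the vl granted for a requested (application) vector length."""
    return min(avl, VLMAX)

def vector_add(a, b):
    """Strip-mined vector add: no peeled prologue or epilogue loop needed."""
    out = []
    i = 0
    while i < len(a):
        vl = vsetvli(len(a) - i)  # the last strip simply gets a smaller vl
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out

print(vector_add(list(range(100)), list(range(100)))[:3])  # [0, 2, 4]
```

A trip count of 100 with VLMAX = 64 is handled as one strip of 64 elements followed by one of 36, with no scalar cleanup loop.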
To keep the code portable, the programmer may not want to assume a specific VLEN, in which case the programmer specifies a destination register into which the hardware writes the actual vl (which is either the requested vl or VLEN divided by the element width, whichever is lower). If the programmer knows VLEN, then she can ignore the actual vl (since it can be computed offline) and use zero or x0 as the destination operand, so that she doesn’t need to allocate a register for a value that will not be referenced in the future. In a later post in which we emit vector instructions, with the goal of generating high-performance assembly, we will assume that VLEN is a known constant, although it is not difficult to adapt the code to work with an arbitrary VLEN.
The second most elegant bit related to the vsetvl instructions is the ability to group multiple consecutive vector registers into one large so-called pseudo-register, letting us use fewer vector instructions by increasing the number of elements we can operate on. Specifically, the length multiplier (or LMUL) can be set to 1 (the default), 2, 4, or 8. For instance, when LMUL=8, instead of 32 vector registers each of size VLEN bits, the programmer now has access to 32/8 = 4 vector registers, each of size 8*VLEN bits. LMUL can also be fractional (specifically, ½, ¼, or ⅛), which limits the addressable register size to half, a quarter, or one-eighth of VLEN, respectively, but fractional LMULs (like integral LMULs) do not increase the number of vector registers.
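The register-grouping arithmetic can be spelled out as follows. This is a sketch of my own; the VLEN value is illustrative:

```python
VLEN = 2048  # vector register width in bits (example value)

def register_file(lmul):
    """Return (usable register groups, bits per group) for a given LMUL."""
    if lmul >= 1:
        # Integral LMUL: 32 registers fuse into 32/LMUL wider groups.
        return 32 // int(lmul), int(lmul) * VLEN
    # Fractional LMUL: still 32 registers, but only a fraction of each is used.
    return 32, int(VLEN * lmul)

print(register_file(8))    # (4, 16384): fewer, wider groups
print(register_file(1))    # (32, 2048): the default
print(register_file(0.5))  # (32, 1024): same count, narrower use
```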
Despite its utility in reducing the number of executed vector instructions, the real reason for the existence of LMUL is to efficiently handle loops containing mixed-precision operations. For instance, if a loop operates on multiple streams of 16-bit elements and 8-bit elements, it can be helpful to set a fractional LMUL for the 8-bit vector operations (depending on the desired vl) and twice that LMUL for the 16-bit vector operations to reduce the number of vector register spills.
Finally, there are a few policy bits that can be set using the vsetvl instructions to specify how to handle unused vector lanes, which may arise from vl being smaller than VLEN divided by the element width (resulting in tail values) or from masked vector instructions (resulting in masked values). In either case, we can specify that the values should either remain undisturbed (that is, unmodified) or that we don’t care (that is, they could be overwritten with all 1s). Note that the specification explicitly avoids saying that these values could be overwritten with all zeros.
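The tail policy, for instance, can be illustrated with a small model. The values are made up, and I model only the “overwrite with all 1s” option that the agnostic policy permits:

```python
VLMAX = 8              # elements per register at the chosen width (example)
old = [7] * VLMAX      # previous contents of the destination register
result = [1, 2, 3, 4]  # a vector op produced only vl = 4 elements
vl = len(result)

# Tail-undisturbed: elements past vl keep their old values.
tail_undisturbed = result + old[vl:]
# Tail-agnostic: elements past vl may be overwritten with all 1s
# (-1 in two's complement) -- but, per the spec, never with all zeros.
tail_agnostic = result + [-1] * (VLMAX - vl)

print(tail_undisturbed)  # [1, 2, 3, 4, 7, 7, 7, 7]
print(tail_agnostic)     # [1, 2, 3, 4, -1, -1, -1, -1]
```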