RISC-V ISA Peculiarities
At work, I am helping write a compiler for a RISC-V processor. Coming from a background where I had to routinely read and write x64 assembly, the RISC-V Instruction Set Architecture (ISA) makes it much easier to write assembly code by hand. This post records some preliminary information about the ISA before diving into a couple of peculiarities.
Base ISA and Extensions
Perhaps the primary choice you will need to be aware of when compiling for your RISC-V processor is its word size, which is 32 bits for RV32 processors and 64 bits for RV64 processors. There is work in progress to define RV128, but since it's not official yet, we will limit ourselves to RV32 and RV64. The word size determines the sizes of both addresses and data. When it comes to instructions, RV32's instructions are limited to operating on at most 32 bits, but RV64 extends RV32's instructions with a few more that let you operate on 64 bits. In other words, you can use instructions from RV32 on RV64 processors, but not the other way around. Both RV32 and RV64 support 32 registers, but their width, of course, varies between 32 bits and 64 bits. When you invoke the LLVM backend, you will need to make this choice early by specifying the target triple as either riscv32 or riscv64.
Orthogonal to the choice between RV32 and RV64, the RISC-V ISA also offers choices in the form of Extensions, which are essentially new instructions grouped together by functionality. All extensions are optional (and thus left as a choice for the designer of the processor) except for the extension supporting integer instructions. Here, the designer has a choice: either support the full 32 registers, thus designing an RV32I or RV64I processor, or only support 16 registers, thus designing an RV32E or RV64E processor.
Depending on which other extensions are enabled (say, M for multiply and divide instructions, A for atomic instructions, F for single-precision floating-point instructions, etc.), RISC-V processors are conventionally described by tacking on all the extensions that they support, usually in a particular sequence; so RV64IMAFC describes a 64-bit RISC-V processor that supports the integer, multiplication and division, atomic, single-precision floating-point, and compressed instructions. Sometimes, you will see the extension “G” and no reference to “I”; that’s because “G” is just shorthand for “IMAFD” (i.e. integer, multiplication and division, atomic, and single- and double-precision floating-point instructions).
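The naming shorthand can be expanded mechanically. The Python sketch below is a hypothetical helper (not part of any toolchain, and ignoring newer sub-extensions such as Zicsr that some specification versions fold into “G”) that decodes an ISA string of single-letter extensions:

```python
def expand_isa(isa: str) -> list[str]:
    # Split an ISA string like "rv64gc" into its base and extensions,
    # treating "g" as shorthand for the "imafd" bundle.
    isa = isa.lower()
    assert isa.startswith(("rv32", "rv64")), "only RV32/RV64 handled here"
    base, exts = isa[:4], isa[4:]
    expanded: list[str] = []
    for ext in exts:
        expanded.extend("imafd" if ext == "g" else ext)
    return [base] + expanded

assert expand_isa("rv64gc") == ["rv64", "i", "m", "a", "f", "d", "c"]
```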
The list of RISC-V ISA extensions is constantly evolving, so it’s hard to enumerate all possible extensions, and although the ISA is small compared to others, there is a formidable list of instructions that a processor can support. Still, a few instructions deserve special attention, since we will run into them frequently in the code generated by our compiler.
Address Computation Instructions
Load and store instructions are bound to occur often in whichever program you compile, and consequently, so are the instructions that compute the addresses they use, making the latter equally important since our focus will be on generating lean assembly code. These instructions will also come into play in our subsequent discussion of so-called Code Models, Relocation, and Position-Independent Code when generating assembly code.
All RISC-V load and store instructions require that the address be made available in one of the integer registers. Depending on the exact value of the address, it may be cheaper to use a combination of the auipc and addi instructions versus a combination of the lui, addi, and slli instructions, but in certain circumstances we may need additional, secondary loads (just for address computation), as explained below.
The lui instruction accepts a 20-bit immediate value, which is first shifted left by 12 bits (to obtain a 32-bit value) before being sign-extended and finally written to the destination register. Crucially, the instruction writes a 32-bit value to the destination register, whether it’s an RV32 or RV64 processor. Since the value computed using the lui instruction can range from -2^31 to +2^31, if we use the value as an address, we can load from or store to only the first 2GB or the last 2GB of memory. As a quick side note, if we specify the “medlow” code model to the RISC-V LLVM backend, it will generate lui instructions for address computation.
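To make the sign-extension behavior concrete, here is a small Python model of the lui semantics described above (a sketch, not production code):

```python
def lui(imm20: int) -> int:
    # lui places the 20-bit immediate in bits 31..12, zeroes bits 11..0,
    # and sign-extends bit 31, so the same value results on RV32 and RV64.
    assert 0 <= imm20 < (1 << 20)
    value = imm20 << 12
    if value & 0x8000_0000:
        value -= 1 << 32
    return value

# The highest immediate with bit 31 clear reaches the top of the low 2GB...
assert lui(0x7FFFF) == 0x7FFF_F000
# ...while setting bit 31 yields negative values, i.e. the "last 2GB".
assert lui(0x80000) == -0x8000_0000
```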
The auipc instruction is similar to the lui instruction: the auipc instruction also accepts a 20-bit immediate value, and it also first shifts the value left by 12 bits before sign-extending it, but instead of writing this 32-bit value directly, the instruction first adds it to the program counter before writing it to the destination register. Similar to the lui instruction, auipc writes a 32-bit value, but since this value is now relative to the program counter, if we use this value as an address, we can load from or store to an address that is within ±2GB of the program counter. The RISC-V LLVM backend emits auipc for address computation if we specify “medany” as the code model.
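The pc-relative computation can be modeled the same way. The sketch below (hypothetical helpers) shows how an address within ±2GB of the pc is typically materialized: auipc supplies the high 20 bits, and a signed 12-bit low part lands in an addi or a load/store offset. The +0x800 rounding compensates for the low part being sign-extended:

```python
def sign_extend(value: int, bits: int) -> int:
    # Interpret the low `bits` bits of value as a signed quantity.
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

def auipc(pc: int, imm20: int) -> int:
    # auipc shifts the 20-bit immediate left by 12, sign-extends the
    # 32-bit result, and adds it to the program counter.
    return pc + sign_extend(imm20 << 12, 32)

def split_offset(offset: int):
    # Split a pc-relative offset into an auipc immediate and a signed
    # 12-bit low part; rounding by 0x800 makes the low part fit.
    assert -(1 << 31) <= offset < (1 << 31), "not within ±2GB of pc"
    hi = ((offset + 0x800) >> 12) & 0xFFFFF
    lo = sign_extend(offset, 12)
    return hi, lo

pc, target = 0x0001_0000, 0x0001_2345
hi, lo = split_offset(target - pc)
assert auipc(pc, hi) + lo == target  # auipc t0, hi ; addi t0, t0, lo
```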
Put another way, auipc lets us compute addresses that are ±2GB away from the program counter, whereas lui lets us compute addresses that are ±2GB away from memory address zero. So if your total addressable memory is less than 4GB, just use the lui instruction (or the “medlow” code model). If you are confident that all your memory references will be within 2GB of the referencing instruction, use auipc (or the “medany” code model). For RV32 processors, we’re all set. But what if we’re compiling for an RV64 processor and our addresses are more than 2GB away from the instruction? For that, we need either a Global Offset Table or a Literal Pool, combined with an additional load instruction.
The key idea behind referencing addresses that don’t fit into the above use cases is to create a table of the addresses that we want to load from and store to, but to place this table within 2GB of the referencing instruction(s). This way, we can use the auipc instruction to get the address of the table entry, before loading the address stored in the table entry. Of course, this implies that the text section (which contains the address-referencing instructions) should be less than 2GB in size, in addition to the table containing fewer than 256M eight-byte entries (2GB divided by 8 bytes per entry), but that’s generally the case anyway. If not, we would need to split both the text section and the table and interleave the pieces, which seems practically impossible if we are using the RISC-V LLVM backend, although it does seem possible (but painful) if we decided to write and use our own code for object file generation.
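The reachability test driving this decision is simple. The sketch below (with made-up example addresses) checks whether a symbol can be addressed directly with auipc or must go through a nearby address table via a secondary load:

```python
REACH = 1 << 31  # auipc covers pc ± 2GB

def directly_reachable(pc: int, addr: int) -> bool:
    # True if a single auipc (plus a 12-bit low part) can form addr.
    return -REACH <= addr - pc < REACH

# Hypothetical layout: text near the bottom of memory, one symbol far
# above 4GB, and the address table placed just after the text section.
pc = 0x0001_0000
far_symbol = 0x1_2000_0000
table_entry = 0x0010_0000

assert not directly_reachable(pc, far_symbol)  # needs a secondary load
assert directly_reachable(pc, table_entry)     # auipc gets the entry's
                                               # address; ld fetches the
                                               # far symbol's address
```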
The RISC-V LLVM backend uses this idea of creating a table and issuing secondary loads in both the “medany” code model with position-independent code (PIC) and the “large” code model. Both seem similar (the large code model calls the table of addresses the Literal Pool, whereas the medany code model with PIC calls it the Global Offset Table), but the medany code model with PIC seems preferable to the large code model because the former enables the LLVM backend to distinguish between local symbols (for example, those in the .rodata section, which are expected to live within 2GB of the text section and thus don’t need secondary loads) and non-local symbols (which are likely more than 2GB away from the text section, thus requiring secondary loads). So through careful section assignment and placement, the medany code model with PIC will probably emit more efficient code than the large code model.
The vset{i}vl{i} Instructions
To emit efficient vector code in a later post, it’s crucial to understand the semantics, versatility, and elegance of the vsetvl, vsetvli, and vsetivli instructions, which form the backbone of the vector extension. These three instructions are minor variants of each other, so for brevity, we will refer to them collectively as the vsetvl instruction.
We will use x86 vector instructions for comparison to illustrate the vsetvl instructions. On x86 machines that support SSE instructions, an add operation between two vectors can be represented using one of eight instructions: paddb, paddw, paddd, paddq, addps, addpd, addss, and addsd. The opcode in each instruction encodes the data type (whether integer or float), the width of the element (8, 16, 32, or 64 bits), as well as the number of elements that we want to operate on. If you now intend to emit AVX instructions, there are an additional six instructions that you need to know about: vpaddb, vpaddw, vpaddd, vpaddq, vaddps, and vaddpd. AVX-512 adds yet another set of instructions for representing the add operation. Clearly, vector code generation for x86 machines gets quite unwieldy even for basic operations.
In contrast, the RISC-V vector compute operations only specify the operation (for example, vadd.vv for integer addition or vfadd.vv for floating-point addition), whereas the control plane information (specifically, the element width, the number of elements, and how many registers to use) is decoupled from the instruction itself and instead expressed using the vsetvl instructions, which effectively record this control plane information in RISC-V Control and Status Registers (CSRs). Typically, writing to CSRs is very expensive, but given the expected high frequency of the vsetvl instructions, this instruction is (as I recollect) executed speculatively, thus reducing its performance impact. The information written by the vsetvl instructions is then used by all subsequent vector computation instructions until a new vsetvl instruction is executed.
Perhaps the most elegant part of the vsetvl instructions is the specification of vl, or vector length. Unlike x86 vector instructions, which implicitly encode the number of vector elements each instruction operates on, RISC-V vector instructions can work with an arbitrary number of vector elements bounded by the physical vector register size. For instance, if the physical vector register size (also known as VLEN) is 2048 bits and the element width is 32, the vector length (or vl) as specified by the programmer can be anywhere between 1 and 2048/32 = 64. Perhaps obviously, the vector length does not need to be a power of 2 either. This greatly simplifies code generation for vectorized loops, since, unlike on x86 processors, we don’t need to emit peeled loops at the beginning or the end of the vectorized loop to handle cases where the trip count is not a perfect multiple of the number of elements processed per iteration.
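A strip-mined loop built on this rule needs no peeled prologue or epilogue. The Python sketch below models the (simplified) grant rule, vl = min(requested, VLEN/SEW), with a hypothetical VLEN of 2048 bits:

```python
VLEN, SEW = 2048, 32           # hypothetical register width, element width

def vsetvli(avl: int) -> int:
    # Simplified grant rule: at most VLMAX = VLEN/SEW elements at a time.
    return min(avl, VLEN // SEW)

def vector_add(a, b):
    n, i, out = len(a), 0, [0] * len(a)
    while n > 0:
        vl = vsetvli(n)        # conceptually: vsetvli t0, aN, e32
        for j in range(vl):    # conceptually: one vadd.vv over vl elements
            out[i + j] = a[i + j] + b[i + j]
        i, n = i + vl, n - vl
    return out

# A trip count of 100 (not a multiple of VLMAX = 64) needs no peeling:
# the loop simply runs with vl = 64, then vl = 36.
a, b = list(range(100)), [1] * 100
assert vector_add(a, b) == [x + 1 for x in a]
```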
To keep the code portable, the programmer may not want to assume a specific VLEN, in which case the programmer specifies a destination register to which the architecture writes the actual vl (which is either the requested vl or VLEN divided by the element width, whichever is lower). If the programmer knows VLEN, then she can ignore the actual vl (since it can be computed offline) and use zero or x0 as the destination operand, so that she doesn’t need to allocate a register for a value that will not be referenced in the future. In a later post in which we emit vector instructions, with the goal of generating high-performance assembly, we will assume that VLEN is a known constant, although it is not difficult to adapt the code to work with any arbitrary VLEN.
The second most elegant bit related to the vsetvl instructions is the ability to group multiple consecutive vector registers into one large so-called pseudo-register, thus being able to use fewer vector instructions while increasing the number of elements we can operate on. Specifically, the length multiplier (or LMUL) can be set to 1 (which is the default), 2, 4, or 8. For instance, when LMUL=8, instead of 32 vector registers each of size VLEN bits, the programmer now has access to 32/8 = 4 vector registers, each of size 8*VLEN bits. LMUL can also be fractional (specifically, ½, ¼, or ⅛), which limits the usable register size to half, a quarter, or one-eighth of VLEN, respectively, but fractional LMULs (like integral LMULs) do not increase the number of vector registers.
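The register-count arithmetic can be summarized in a few lines. This sketch assumes the 32 architectural vector registers and treats LMUL as a plain number:

```python
NUM_VREGS = 32

def usable_regs(lmul) -> int:
    # Integral LMUL groups registers (fewer, wider register names);
    # fractional LMUL narrows the usable width without changing the count.
    return NUM_VREGS // int(lmul) if lmul >= 1 else NUM_VREGS

def group_bits(vlen: int, lmul) -> int:
    # Effective width of one register (or register group) in bits.
    return int(vlen * lmul)

assert usable_regs(8) == 4                 # 32/8 groups...
assert group_bits(2048, 8) == 16384        # ...each 8*VLEN bits wide
assert usable_regs(1/2) == 32              # still 32 registers...
assert group_bits(2048, 1/2) == 1024       # ...but only VLEN/2 bits usable
```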
Despite its utility in reducing the number of executed vector instructions, the real reason for the existence of LMUL is to efficiently handle loops containing mixed-precision operations. For instance, if a loop operates on multiple streams of 16-bit elements and 8-bit elements, it can be helpful to set a fractional LMUL for the 8-bit vector operations (depending on the desired vl) and twice that LMUL for the 16-bit vector operations to reduce the number of vector register spills.
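The arithmetic behind this pairing follows from VLMAX = LMUL * VLEN / SEW. With a hypothetical VLEN of 2048 bits, halving LMUL for the half-width elements keeps the element count identical across both streams:

```python
VLEN = 2048  # hypothetical vector register width in bits

def vlmax(sew: int, lmul) -> int:
    # Maximum element count for a given element width and multiplier.
    return int(lmul * VLEN) // sew

# An 8-bit stream at LMUL=1/2 and a 16-bit stream at LMUL=1 march
# through the loop in lockstep, processing the same vl per iteration.
assert vlmax(sew=8, lmul=1/2) == 128
assert vlmax(sew=16, lmul=1) == 128
```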
Finally, there are a few policy bits that can be set using the vsetvl instructions to specify how to handle unused vector lanes, which may arise from vl being smaller than VLEN divided by the element width (resulting in tail values) or from masked vector instructions (resulting in masked values). In either case, we can specify that the values should either remain undisturbed (that is, unmodified) or that we don’t care (that is, they could be overwritten with all 1s). Note that the specification explicitly avoids saying that these values could be overwritten with all zeros.