home

RISC-V ISA Peculiarities

At work, I am helping write a compiler for a RISC-V processor. Coming from a background where I had to routinely read and write x64 assembly, the RISC-V Instruction Set Architecture (ISA) makes it much easier to write assembly code by hand. This post records some preliminary information about the ISA before diving into a couple of peculiarities.

Base ISA and Extensions

Perhaps the primary choice you will need to be aware of when compiling for your RISC-V processor is its word size: 32 bits for RV32 processors and 64 bits for RV64 processors. There is work in progress to define RV128, but since it's not official yet, we will limit ourselves to RV32 and RV64. The word size determines the size of both addresses and data. When it comes to instructions, RV32's instructions are limited to operating on at most 32 bits, while RV64 extends RV32's instruction set with a few more instructions that operate on 64 bits. In other words, you can use instructions from RV32 on RV64 processors, but not the other way around. Both RV32 and RV64 provide 32 registers, but their width, of course, is 32 bits or 64 bits respectively. When you invoke the LLVM backend, you will need to make this choice early by specifying the target triple as either riscv32 or riscv64.

Orthogonal to the choice between RV32 and RV64, the RISC-V ISA also offers choices in the form of Extensions, which are essentially new instructions grouped together according to their functionality. All extensions are optional (and thus left as a choice for the designer of the processor) except for the base integer instructions. Even here, the designer has a choice: either support the full 32 registers, yielding an RV32I or RV64I processor, or support only 16 registers, yielding an RV32E or RV64E processor.

Depending on which other extensions are enabled (say, M for multiply and divide instructions, A for atomic instructions, F for single-precision floating-point instructions, etc.), RISC-V processors are conventionally described by tacking on all the extensions that they support, usually in a particular sequence; so RV64IMAFC describes a 64-bit RISC-V processor that supports the integer instructions, multiplication and division instructions, atomic instructions, single-precision floating-point instructions, and compressed instructions. Sometimes, you will see the extension "G" and no reference to "I"; that's because "G" is just shorthand for "IMAFD" (i.e. integer, multiplication and division, atomic, and single- and double-precision floating-point instructions), along with the Zicsr and Zifencei extensions in newer versions of the specification.
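To make the naming convention concrete, here is a tiny, hypothetical helper (not part of any RISC-V toolchain) that expands the "G" shorthand in an ISA string:

```python
# Hypothetical helper, purely illustrative: expand the "G" shorthand in a
# RISC-V ISA string into its constituent single-letter extensions.
def expand_isa_string(isa: str) -> str:
    # "G" abbreviates the IMAFD combination (plus Zicsr/Zifencei in newer
    # spec versions, which have no single-letter names and are omitted here).
    return isa.upper().replace("G", "IMAFD")

print(expand_isa_string("rv64gc"))  # RV64IMAFDC
```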

The list of RISC-V ISA extensions is constantly evolving, so it's hard to enumerate every possible extension, and although the ISA is small compared to other ISAs, the full set of instructions a processor can support is formidable. Still, a few instructions deserve special attention, since we will run into them frequently in the code generated by our compiler.

Address Computation Instructions

Load and store instructions are bound to occur often in whichever program you compile, and consequently, so are the instructions that compute the addresses those loads and stores use; since our focus is on generating lean assembly code, the address computation instructions are equally important. They will also come into play in our subsequent discussion of so-called Code Models, Relocation, and Position-Independent Code.

All RISC-V load and store instructions require that the address be available in one of the integer registers. Depending on the exact value of the address, it may be cheaper to use a combination of the auipc and addi instructions versus a combination of the lui, addi, and slli instructions, but in certain circumstances, explained below, we may need additional, secondary loads just for address computation.

The lui instruction accepts a 20-bit immediate value, which is first shifted left by 12 bits (to obtain a 32-bit value) before being sign-extended and finally written to the destination register. Crucially, the instruction writes a sign-extended 32-bit value to the destination register, whether it's an RV32 or RV64 processor. Since the value computed using the lui instruction ranges from -2^31 to just under +2^31, if we use the value as an address, we can load from or store to only the first 2GB or the last 2GB of memory. As a quick side note, if we specify the "medlow" code model to the RISC-V LLVM backend, it will generate lui instructions for address computation.

The auipc instruction is similar to the lui instruction: it also accepts a 20-bit immediate value, and it also first shifts the value left by 12 bits before sign-extending it, but instead of writing this 32-bit value directly, the instruction first adds it to the program counter and writes the sum to the destination register. Because the value auipc computes is relative to the program counter, if we use this value as an address, we can load from or store to an address that is within ±2GB of the program counter. The RISC-V LLVM backend emits auipc for address computation if we specify "medany" as the code model.
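To pin down the semantics, here is a sketch of the two upper-immediate computations, written in Python rather than assembly purely to model the arithmetic:

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def lui(imm20: int) -> int:
    # lui: place the 20-bit immediate in bits 31:12, then sign-extend.
    return sign_extend((imm20 & 0xFFFFF) << 12, 32)

def auipc(pc: int, imm20: int) -> int:
    # auipc: the same upper-immediate value, but added to the program counter.
    return pc + lui(imm20)

# The largest positive immediate reaches just below +2GB...
print(hex(lui(0x7FFFF)))  # 0x7ffff000
# ...while immediates with bit 19 set yield negative values, i.e. the
# "last 2GB" of the address space once interpreted as an address.
print(lui(0x80000))       # -2147483648
```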

Put another way, auipc lets us compute addresses that are within ±2GB of the program counter, whereas lui lets us compute addresses that are within ±2GB of address zero. So if your total addressable memory is at most 4GB (as on RV32), just use the lui instruction (or the "medlow" code model). If you are confident that all your memory references will be within 2GB of the referencing instruction, use auipc (or the "medany" code model). For RV32 processors, we're all set. But what if we're compiling for an RV64 processor and our addresses are more than 2GB away from the instruction? For that, we need either a Global Offset Table or a Literal Pool, combined with an additional load instruction.

The key idea behind referencing addresses that don't fit the above use cases is to create a table of the addresses that we want to load from and store to, and to place this table within 2GB of the referencing instruction(s). This way, we can use the auipc instruction to get the address of the table entry, and then load the address stored in that entry. Of course, this implies that the text section (which contains the address-referencing instructions) and the table must together fit within the ±2GB reach of auipc, but that's generally the case anyway. If not, we would need to split both the text section and the table and interleave the pieces, which seems practically impossible if we are using the RISC-V LLVM backend, although it does seem possible (but painful) if we decided to write and use our own code for object file generation.
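The decision the compiler makes can be sketched as follows; this is an illustrative model, not actual backend code, and the emitted pseudo-assembly strings (using the real %pcrel_hi/%pcrel_lo assembler relocations) gloss over the hi/lo immediate split:

```python
# Sketch of the addressing decision described above, assuming RV64 with a
# table of 8-byte address entries placed near the code. Names are made up.
REACH = 2**31  # auipc plus a 12-bit load/store offset spans roughly +/-2GB

def directly_reachable(pc: int, target: int) -> bool:
    return -REACH <= target - pc < REACH

def emit_access(pc: int, target: int, table: list[int]) -> list[str]:
    """Return a pseudo-assembly sequence that loads from `target`."""
    if directly_reachable(pc, target):
        # One auipc covers the pc-relative distance; the load's 12-bit
        # immediate supplies the low bits.
        return ["auipc t0, %pcrel_hi(target)",
                "ld   a0, %pcrel_lo(target)(t0)"]
    # Otherwise, park the full 64-bit address in a nearby table entry,
    # load that entry first (the secondary load), then load through it.
    table.append(target)
    return ["auipc t0, %pcrel_hi(entry)",
            "ld   t0, %pcrel_lo(entry)(t0)",
            "ld   a0, 0(t0)"]

pool: list[int] = []
near = emit_access(0x1000, 0x2000, pool)          # 2 instructions, pool empty
far = emit_access(0x1000, 0x1_0000_0000, pool)    # 3 instructions, 1 pool entry
```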

The RISC-V LLVM backend uses this idea of creating a table and issuing secondary loads in both the "medany" code model with position-independent code (PIC) and the "large" code model. Both seem similar (the large code model calls the table of addresses the Literal Pool, whereas the medany code model with PIC calls it the Global Offset Table), but the medany code model with PIC seems preferable to the large code model because the former lets the LLVM backend distinguish between local symbols (for example, those in the .rodata section, which are expected to live within 2GB of the text section and thus don't need secondary loads) and non-local symbols (which are likely more than 2GB from the text section and thus require secondary loads). So, through careful section assignment and placement, the medany code model with PIC will probably emit more efficient code than the large code model.

The vset{i}vl{i} Instructions

To emit efficient vector code in a later post, it’s crucial to understand the semantics, versatility, and elegance of the vsetvl, vsetvli, and vsetivli instructions, which form the backbone of the vector extension. These three instructions are minor variants of each other, so for brevity, we will refer to them collectively as the vsetvl instruction.

We will use x86 vector instructions for comparison to illustrate the vsetvl instructions. On x86 machines that support SSE instructions, an add operation between two vectors can be represented using one of eight instructions: paddb, paddw, paddd, paddq, addps, addpd, addss, and addsd. The opcode of each instruction encodes the data type (integer or floating-point), the width of each element (8, 16, 32, or 64 bits), and the number of elements that we want to operate on. If you intend to emit AVX instructions, there are an additional six instructions to know about: vpaddb, vpaddw, vpaddd, vpaddq, vaddps, and vaddpd. AVX-512 adds yet another set of instructions for representing the add operation. Clearly, vector code generation for x86 machines gets quite unwieldy even for basic operations.

In contrast, the RISC-V vector compute instructions only specify the operation (for example, vadd.vv for integer addition or vfadd.vv for floating-point addition), whereas the control plane information (specifically, the element width, the number of elements, and how many registers to use) is decoupled from the instruction itself and instead expressed using the vsetvl instructions, which record this control plane information in RISC-V Control and Status Registers (CSRs). Typically, writing to CSRs is very expensive, but given the expected high frequency of the vsetvl instructions, this instruction is (as I recollect) executed speculatively, reducing its performance impact. The information written by the vsetvl instructions is used by all subsequent vector computation instructions until a new vsetvl instruction is executed.
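The decoupling can be illustrated with a toy model (this is not how real hardware is implemented; it only mimics the programming model): the state is written once, and the compute "instruction" reads it implicitly instead of encoding the element width in its opcode.

```python
# Toy model of the vector control plane: vsetvli-style state is written once,
# and every subsequent compute operation consults it implicitly.
class VectorState:
    def __init__(self):
        self.sew = 8  # selected element width, in bits
        self.vl = 0   # number of elements subsequent instructions touch

state = VectorState()

def vsetvli(avl: int, sew: int, vlen: int = 128) -> int:
    # Record the element width and clamp the requested length (LMUL ignored
    # here for simplicity); return the granted vl, as the real instruction does.
    state.sew = sew
    state.vl = min(avl, vlen // sew)
    return state.vl

def vadd_vv(a: list[int], b: list[int]) -> list[int]:
    # One opcode for every element width: width and count come from state.
    mask = (1 << state.sew) - 1
    return [(x + y) & mask for x, y in zip(a[:state.vl], b[:state.vl])]

vsetvli(avl=4, sew=8)
print(vadd_vv([250, 1, 2, 3], [10, 1, 1, 1]))  # [4, 2, 3, 4]: 8-bit wraparound
```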

Perhaps the most elegant part of the vsetvl instructions is the specification of vl, or vector length. Unlike x86 vector instructions, which implicitly encode the number of vector elements each instruction operates on, RISC-V vector instructions can work with an arbitrary number of elements, bounded by the physical vector register size. For instance, if the physical vector register size (also known as VLEN) is 2048 bits and the element width is 32 bits, the vector length (or vl) specified by the programmer can be anywhere between 1 and 2048/32 = 64. Perhaps obviously, the vector length does not need to be a power of 2 either. This greatly simplifies code generation for vectorized loops, since, unlike on x86 processors, we don't need to emit peeled loops at the beginning or end of the vectorized loop to handle trip counts that are not a perfect multiple of the vector length.

To keep the code portable, the programmer may not want to assume a specific VLEN; in that case, the programmer specifies a destination register into which the architecture writes the actual vl (which is either the requested vl or VLEN divided by the element width, whichever is lower). If the programmer knows VLEN, then she can ignore the actual vl (since it can be computed offline) and use zero, i.e. x0, as the destination operand, so that she doesn't need to allocate a register for a value that will never be referenced. In a later post in which we emit vector instructions with the goal of generating high-performance assembly, we will assume that VLEN is a known constant, although it is not difficult to adapt the code to work with an arbitrary VLEN.
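The granted vl is what makes strip-mined loops so clean: the last iteration simply receives a shorter vl, with no scalar prologue or epilogue. A sketch, modeling the VLEN = 2048, 32-bit-element example from above:

```python
# Strip-mining sketch: process n elements in chunks whose size the (modeled)
# vsetvli result dictates; the final short chunk needs no special handling.
VLEN, SEW = 2048, 32           # bits; VLMAX = 2048/32 = 64 elements

def vsetvli(avl: int) -> int:
    # The "actual vl" the hardware hands back: the request, clamped to VLMAX.
    return min(avl, VLEN // SEW)

def vector_add(a: list[int], b: list[int]) -> list[int]:
    out, i = [], 0
    while i < len(a):
        vl = vsetvli(len(a) - i)  # the last iteration gets a short vl
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out

# 100 elements = one full chunk of 64 followed by a short chunk of 36.
print(vector_add(list(range(100)), [1] * 100)[:5])  # [1, 2, 3, 4, 5]
```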

The second most elegant bit related to the vsetvl instructions is the ability to group multiple consecutive vector registers into one large so-called pseudo-register, letting us use fewer vector instructions by increasing the number of elements each one operates on. Specifically, the length multiplier (or LMUL) can be set to 1 (the default), 2, 4, or 8. For instance, when LMUL=8, instead of 32 vector registers each of size VLEN bits, the programmer has access to 32/8 = 4 vector registers, each of size 8*VLEN bits. LMUL can also be fractional (specifically, 1/2, 1/4, or 1/8), which limits the usable register size to half, a quarter, or one-eighth of VLEN, respectively; but fractional LMULs do not increase the number of vector registers (you still have 32).
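The bookkeeping above is just arithmetic, which this small sketch spells out (pure illustration, with a made-up helper name):

```python
# LMUL bookkeeping from the paragraph above: how many registers are usable,
# and how wide each (pseudo-)register is, for a given length multiplier.
from fractions import Fraction

VLEN = 2048       # physical register width in bits (illustrative)
NUM_REGS = 32

def register_file(lmul: Fraction) -> tuple[int, int]:
    """Return (register count, usable bits per register) for this LMUL."""
    if lmul >= 1:
        # Integral LMUL groups registers: fewer, wider pseudo-registers.
        return NUM_REGS // int(lmul), VLEN * int(lmul)
    # Fractional LMUL keeps all 32 registers but uses only a slice of each.
    return NUM_REGS, int(VLEN * lmul)

print(register_file(Fraction(8)))     # (4, 16384): 32/8 regs of 8*VLEN bits
print(register_file(Fraction(1, 2)))  # (32, 1024): 32 regs, half of VLEN usable
```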

Despite its utility in reducing the number of executed vector instructions, the real reason for LMUL's existence is to efficiently handle loops containing mixed-precision operations. For instance, if a loop operates on multiple streams of 16-bit and 8-bit elements, it can be helpful to set a fractional LMUL for the 8-bit vector operations (depending on the desired vl) and twice that LMUL for the 16-bit vector operations, reducing the number of vector register spills.
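The arithmetic behind that pairing: to keep the same vl across element widths, LMUL must scale proportionally with the element width. A quick check of the numbers, with an illustrative VLEN:

```python
# Mixed-precision bookkeeping: the maximum vl a given (element width, LMUL)
# pair supports. Scaling LMUL with the element width keeps vl equal across
# the 8-bit and 16-bit streams of the loop.
VLEN = 128  # bits, illustrative

def vlmax(sew: int, lmul: float) -> int:
    return int(VLEN * lmul) // sew

# LMUL=1/2 for the 8-bit stream and LMUL=1 for the 16-bit stream give both
# streams the same number of elements per iteration:
print(vlmax(8, 0.5), vlmax(16, 1))  # 8 8
```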

Finally, there are a few policy bits that can be set using the vsetvl instructions to specify how to handle unused vector lanes, which arise either from vl being smaller than VLEN divided by the element width (the tail elements) or from masked vector instructions (the masked-off elements). In either case, we can specify that the values should remain undisturbed (that is, unmodified) or that we don't care (that is, they may be overwritten with all 1s). Note that the specification explicitly avoids saying that these values could be overwritten with all zeros.

back