Contributing support for a Wasm instruction to Winch

A month ago I landed a pull request to Winch, Wasmtime's baseline (non-optimizing) compiler, to add support for a number of WebAssembly instructions. Shopify is working on Winch to make Wasm compilation faster. I learned a number of important parts of the codebase and how to test the changes I made which I'll outline below.

Adding support for a WebAssembly instruction means that when Winch encounters a given instruction, it emits zero or more native machine code instructions to perform the operation while ensuring the WebAssembly stack is in the correct state and that CPU registers are appropriately allocated or freed.

This is a somewhat involved process so here's a quick summary of the changes involved:

Adding the instruction to the visitor and calling through to the MacroAssembler and possibly CodeGenContext
Adding a method to the MacroAssembler which invokes one or more methods on the Assembler if there isn't an appropriate method already defined on the MacroAssembler
Potentially adding a helper method to the CodeGenContext to ensure that operands for the instruction have been popped from the stack and into registers before emitting machine instructions to act on the operands
Adding the methods to the assembler that emit x86_64 machine instructions
Adding a few tests for the new instruction and inspecting the generated code
Adding the new instructions to the list of supported instructions for fuzzing
Executing the fuzzing for at least a few minutes (and potentially investigating and fixing failures)

This is a high level diagram that shows the relationship between the various pieces I'll be discussing:

To add support for an instruction, I started by changing the visitor in Winch’s codegen module. The visitor is an implementation of the visitor design pattern, to walk the parsed instructions of a Wasm function. The wasmparser crate enables parsing of WebAssembly modules. WebAssembly modules consist of an ordered list of sections, for example, the type, code, or import sections. For the purposes of adding support for an instruction, we focus on how the code section is parsed. The code section contains a collection of code entries. Part of code entries are functions. Part of a function is an expression representing the function body. And expressions are a series of zero or more instructions followed by an opcode that indicates the expression has ended. When parsing an expression, wasmparser can be configured to call an implementation of its VisitOperator trait. The visitor is Winch’s implementation of the VisitOperator trait. The VisitOperator trait has visit methods that correspond to each WebAssembly instruction. Adding support for an instruction involves adding an implementation for the appropriate visit method in the VisitOperator implementation. The visitor delegates to the CodeGenContext and the MacroAssembler.

A change to the visitor looks like:

This change adds support for the i32.clz Wasm instruction which counts the number of leading zeros in the value on the top of the stack.

The CodeGenContext coordinates the register allocator, the value stack, and the current function's frame. The values on the value stack may be constants, a register, an index to a WebAssembly local, or an offset in linear memory. The register allocator is responsible for keeping track of which registers are currently allocated and can allocate or free registers. It can also spill values if a register is unavailable and is requested. A spill involves:

Moving a value from the register into an offset in linear memory
Updating references to that register on the value stack to point to the offset in memory for where that value is now stored

This frees the register while preserving the value on the stack. However, since spilling results in additional operations, we want to avoid using more registers than necessary when performing an operation to reduce the chance that we need to spill. The CodeGenContext has methods to ensure the correct number of operands have been popped off the stack and are either passed as an immediate, if possible, or loaded into a register. Immediates are constants used in a machine instruction. For example, in the instruction add eax, 10, the 10 is an immediate. Code in the CodeGenContext is independent of the instruction set architecture (ISA), while code that is not ISA independent is put in the MacroAssembler.

The MacroAssembler is an interface independent of an ISA to emit machine code instructions that correspond to WebAssembly instructions. x86_64 (Intel and AMD CPUs) and AArch64 (Apple Silicon) are examples of ISAs. Adding support for the i32.and Wasm instruction involves adding an and instruction to the MacroAssembler. There is an implementation of the MacroAssembler for each ISA that Winch supports. This allows different ISAs to emit different machine instructions for the same WebAssembly instruction. The MacroAssembler delegates to an ISA-specific assembler to emit the actual machine instructions.

In simple cases where a WebAssembly instruction can be implemented by a single machine code instruction, and the operands for that machine code instruction do not need to be placed in particular registers, the MacroAssembler can simply invoke an equivalent method on the assembler. In more complicated cases, the MacroAssembler may need to place operands in specific registers. For example, when implementing shift left (that is, shl), if the operand for the number of bits to shift is not passed as an immediate, it is required to be placed in the cl register. This requires additional code in the MacroAssembler. Additionally, the MacroAssembler may need to make a series of calls to the assembler to emit many instructions. For example, when implementing Wasm's eq instruction, an x86_64 cmp instruction must be emitted, and then an instruction must be emitted to copy the value of the zero flag in the status register to a destination register.

In even more complicated cases, the MacroAssembler may need to check if the CPU it is compiling machine code for supports a given instruction and substitute a different set of instructions if that support is missing. For example, Wasm's clz (count leading zeros) instruction can be implemented by using x86_64's lzcnt (leading zeros count) instruction, but some x86_64 CPUs do not support lzcnt, so we check if that support is available and have the MacroAssembler instruct the assembler to emit a series of instructions that output the same result if support is not available.

An example of a change to the MacroAssembler would be:

The clz instruction counts the number leading zeros in the binary representation of the operand. We use x86_64's lzcnt (leading zero count) instruction if it's available on the CPU we're compiling for. Otherwise, we fallback to an implementation that uses bsr (bit scan reverse), which finds the index of the most significant bit set. The bsr instruction sets a zero flag bit in the status register if its operand is 0 and clears the zero flag bit otherwise. The setne instruction copies 0 into the scratch register if the zero flag bit is 0, and copies 1 otherwise. Assembler methods ending in _ir indicate they act on an immediate and a register, and ones that end in _rr indicate they act on two registers. I also had to adjust how I was performing the subtraction because there is no way to subtract a register from an immediate value.

The assembler exposes methods that align with the underlying machine instruction set. It’s built on top of Cranelift’s machine code emission. Cranelift is a low-level code generator (that is, code that has minimal abstractions from machine code) used by Wasmtime. There may also be methods that are specialized to taking a pair of registers or an immediate and a register. To determine what machine instructions are available, I used a combination of ChatGPT, Google, and Cranelift's Codegen's x86_64 ISLE definitions. Cranelift's ISLE is an S-expression based domain specific language for lowering instructions to machine code. The definitions are run through code generation to output Rust code that the assembler consumes. The definitions have useful source level comments that provide additional context on what the defined instruction does. I'd start with trying to cmd+f through the ISLE definitions to see what I can find related to a given operation (for example, or). If I couldn't find something obvious, I'd ask ChatGPT how it would implement a description of a given WebAssembly instruction in x86_64 assembly language. I'd then look up the instructions ChatGPT suggested on Google to verify the accuracy of the suggestions. In some cases, the x86_64 assembly code ChatGPT suggested was subtly incorrect. Then I'd try to locate an entry in the ISLE definitions that matched the suggested x86_64 assembly instruction. Then I'd write the assembler method or methods to invoke the generated code for that ISLE definition to emit that machine instruction with the correct arguments.

An example of what the ISLE definition would look like for bsr or lzcnt is:

The UnaryRmR instruction is what I would reference and I would pass Bsr or Lzcnt as the op to perform.

An example of a change in the assembler would be:

This change adds methods to emit the lzcnt instruction. Inst::UnaryRmR references generated code from Cranelift's ISLE definitions above.

Testing new changes is also quite different than with typical software development. Automated testing takes two forms, filetests and fuzzing. Filetests use a specially formatted WebAssembly Text format file (a .wat file) where the target architecture and CPU flags are specified as comments, then a WebAssembly module, and finally the machine code that we expect to be generated. Because these tests just generate and check code and do not execute it, they can be run on AArch64 machines like our development Macbooks. Normally, I write the first two parts and then run the tests and check that the machine code output looks how I'm expecting it to look. Specifically for x86_64, I check that the registers have the correct width (registers that start with e for 32-bit and registers that start with r for 64-bit), the expected x86_64 instructions are emitted, and that immediates are used if possible. I write a set of tests that use constants, function parameters, locals, variants where I expect an immediate to be emitted and one where I expect a register, and 32 and 64 bit versions of the tests. These tests are also helpful when I'm refactoring to see if it changes the machine code being emitted. These filetests are not capable of asserting if the machine instructions that are emitted will result in the correct output. For that we use differential fuzzing.

An example of a filetest looks like:

The idea behind differential fuzzing is it randomly generates an infinite series of WebAssembly modules, and it will invoke a function within each module it generates with randomly selected inputs. It will run a particular input on a particular Wasm module using Winch and using a different WebAssembly engine. It will then compare the results and crash if they are different. This enables testing a much wider variety of inputs compared to integration testing with fixed inputs. It definitely helped me uncover a few issues that I otherwise would have missed. The downside is that the differential fuzzing for testing x86_64 instructions has to be performed on a machine with an x86_64 processor. To fuzz a Winch instruction, that instruction needs to be added to a winch_supports_module match statement in the differential fuzz target. The reason for this is Winch does not support all Wasm instructions and a generated module will fail if it uses an instruction that isn't supported yet, so we don't fuzz modules that contain unsupported instructions. This match statement is used to indicate that we do support the instructions listed.

After ensuring that the filetests' outputs looked reasonable and the differential fuzzing ran for at least a few minutes, I opened a draft PR on my personal fork of Wasmtime with a merge into my personal fork (not upstream) so I could collect internal feedback on my changes. After iterating on that feedback, I rebased down to one commit and opened a PR on the public Wasmtime repository.

Jeff Charles is a Senior Developer on Shopify's Wasm Foundations team. You can find him on GitHub as @jeffcharles or on LinkedIn at Jeff Charles.