Announcing go-lua

Today, we’re excited to release go-lua as an Open Source project. Go-lua is an implementation of the Lua programming language written purely in Go. We use go-lua as the core execution engine of our load generation tool. This post outlines its creation, provides examples, and describes some challenges encountered along the way.

Motivation

Shopify has built a succession of load generation tools in Ruby over the years. These tools simulate heavy traffic on Shopify, and allow us to describe the shape of that traffic by defining a series of steps a merchant's customers would take when visiting their site. We call these "flows". A flow can include computation, such as browsing the storefront, adding items to a cart, or extracting data from a HTTP response. We execute many copies of a flow in parallel across many machines to simulate high load scenarios like flash sales.

At the end of 2013, our existing load generation tool could no longer scale large enough to sufficiently stress the platform, so we undertook a rewrite in Go. We chose Go for its scalability. We needed something that would be able to handle a high level of concurrency, and we also liked that Go apps can be easily distributed as a single binary.

A key requirement for the project was that we needed to be able to write and execute new flows without building and deploying new releases of the tool. The tool’s previous implementation had had this feature, with flows being written in a web interface - we also wanted to improve this feature to allow us to write flows as scripts.

Rather than design a custom language to describe flows, we chose to leverage an existing lightweight scripting language. We selected Lua because of its relatively simple implementation and our increasing use of the language through Redis and nginx.

Several Go bindings exist for the C implementation of Lua, including the many forks of golua and the effort to port Lua 5.1 to the Go toolchain. Coupling two language runtimes, each with their own garbage collected heaps and with incompatible concurrency mechanisms seemed unwise for a production load generation tool. Gopher-lua is a recent attempt to port the Lua 5.1 implementation to Go, but it’s a young project with an evolving API, and it wasn’t yet available when we put go-lua into production in May 2014.

Implementing a scripting language purely in Go offers several advantages over binding a C implementation. First, we can leverage Go’s garbage collector and use Go’s heap for the embedded language’s objects. This gives us benefits like zero-overhead sharing of strings between the two languages, as an example. Second, the overhead of creating an execution context for the embedded language is minimal - little more than the lua.State struct, a couple of maps, and a small stack. This is important in a load generation process with thousands of goroutines, each with its own Lua state. Third, concurrency and error handling semantics are easier to map between the two languages.

We implemented go-lua by manually porting the Lua 5.2 compiler and virtual machine from C to Go (Russ Cox’s c2go tool could have simplified this work, had it been available). We started with the “undump” deserializer, which helped define the primary data structures and allowed us to start executing Lua code quickly by loading binary files produced by Lua’s luac compiler. Once we could successfully load and execute Lua binaries, we ported the compiler and tested it by structural comparison with the deserialized binaries produced by luac. The core libraries were written as we needed them, to expand the set of Lua tests we could run, and for our load generation tool.

Examples

Example code in this section is taken from our load generation tool. We describe steps in a flow using Lua’s table.field = syntax, which is equivalent to table[“field”] =:

Lua tables have an undefined iteration order, but we want named steps to execute in the order they’re defined. Lua’s metamethods allow us to build an object with table-like properties, recording the step names as they’re registered:

A similar approach allows us to make the config table read-only:

We built a wrapper library for several Go standard library APIs, called goluago. Amongst other things, this library exposes Go’s regular expressions to Lua. First we allow the library to be loaded with local re = require("goluago/regexp"):

When we compile a regular expression with re.compile(pattern), the result is a table of Go functions that close over the compiled Go regular expression:

As the above example show, we’ve chosen to retain Lua’s familiar stack API, with some changes to match Go conventions. The Lua C API is divided into a core set of functions & types, a debug interface, and an auxiliary library. The core set of functions accept a Lua state object as the initial parameter, and we’ve chosen to implement those functions as methods on *lua.State. The debug interface and auxiliary library are implemented as ordinary functions in the lua package. Go conventions drop the Get prefix from getter functions like GetUserData and use camel case instead of names like checkinteger.

Challenges

Like many language implementations, the core of go-lua is a bytecode interpreter. The C implementation which we ported to Go has a loop containing a switch statement, with a case for each bytecode instruction. Good C compilers translate large, dense switch statements to jump tables. The gc Go compiler instead translates a switch statement like:

To a binary search like:

This significantly limits the performance of the interpreter's inner loop. Go doesn't offer alternatives like gcc's computed gotos. The closest we can come to a jump table in Go is a function table, with each bytecode instruction implemented in a separate function:

In C, using function calls to implement bytecode dispatch would be significantly slower than switch or jump table dispatch. Early experiments showed up to 25% performance gains from using function table dispatch In Go. Extensive manual inlining in the switch dispatched version eliminated this advantage. We implemented both switch and function table dispatch, and can enable either approach at compile time.

The gc Go compiler performs limited function call inlining. Inlining depends on the complexity of the callee function. In particular, function-call nodes in the callee function appear to have infinite cost, as do calls to panic and most calls into the language runtime (e.g. interface assertions). Translating simple C macros to Go functions offers equivalent performance. Anything that wraps a function call, however, can introduce degradations. Assertions are a simple example. An assertion function like:

is intended to simplify checks like:

The gc Go compiler will not inline this call to apiCheck because of the call to panic.

A similar example is extracting arguments from bytecode instructions, which was initially written as:

Rewriting these yielded noticeable performance gains:

The gc Go compiler compiles type switches as cascading if-else if tests, with repeated calls to a runtime type assertion function. An early implementation of the function call stack in go-lua used two implementations of an interface to differentiate between calls to Lua functions and calls to Go functions:

This results in a lot of Go runtime calls for type tests in several hot code paths. We instead replaced the stack of interfaces with a stack of callInfo structs, with pointers to the two variants:

Benchmarks

Benchmark results shown here are taken from a Mid 2012 MacBook Pro Retina with a 2.6 GHz Core i7 CPU running OS X 10.10.2, go 1.4.2 and Lua 5.2.2.

The Fibonacci function can be written a few different ways to evaluate different performance characteristics of a language interpreter. The simplest way is as a recursive function:

This exercises the call stack implementation. When computing fib(35), go-lua is about 6x slower than the C Lua interpreter. Gopher-lua is about 20% faster than go-lua. Much of the performance difference between go-lua and gopher-lua comes from the inclusion of debug hooks in go-lua. The remainder is due to the call stack implementation - go-lua heap-allocates Lua stack frames with a separately allocated variant struct, as outlined above. Although it caches recently used stack frames, it is outperformed by the simpler statically allocated call stacks in gopher-lua.

The recursive Fibonacci function can be transformed into a tail-recursive variant:

The Lua interpreter detects and optimizes tail calls. This exhibits similar relative performance between the 3 interpreters, though gopher-lua edges ahead a little due to its simpler stack model and reduced bookkeeping.

Finally, we can write an explicitly iterative implementation:

This exercises more of the bytecode interpreter’s inner loop. Here we see the performance impact of Go’s switch implementation. Both go-lua and gopher-lua are an order of magnitude slower than the C Lua interpreter.

Limitations

Go-lua aims for compatibility with Lua 5.2. Most of the core libraries are implemented. Notable exceptions include coroutines, string pattern matching and string.dump.

The core VM and compiler has been ported and tested. The compiler is able to correctly process all Lua source files from the Lua test suite. The VM has been tested to correctly execute over a third of the Lua test cases.

Weak reference tables are not and will not be supported. go-lua uses the Go heap for Lua objects, and Go does not support weak references.