Adding the V8 CPU Profiler to v8go

V8 is Google’s open source high-performance JavaScript and WebAssembly engine written in C++. v8go is a library written in Go and C++ allowing users to execute JavaScript from Go using V8 isolates. Using Cgo bindings allows us to run JavaScript in Go at native performance.

The v8go library, developed by Roger Chapman, aims to provide an idiomatic way for Go developers to interface with V8. As it turns out, this can be tricky. For the past few months, I’ve been contributing to v8go to expose functionality in V8. In particular, I’ve been adding support to expose the V8 CPU Profiler

From the start, I wanted this new API to be:

  • easy for the library's Go users to reason about
  • easy to extend for other profiler functionality eventually
  • aligned closely with the V8 API
  • as performant as possible.

The point about performance is especially interesting. I theorized that my first iteration of the implementation was less performant than a proposed alternative. Without benchmarking them, I proceeded to rewrite. That second implementation was merged, and I moved on with my life. So when I was like "Hey! I should write a post about the PR and benchmark the results" only to actually see the benchmarks and reconsider everything.

If you’re interested in API development, Go/Cgo/C++ performance or the importance of good benchmarks, this is a story for you.

Backing Up to the Starting Line: What Was My Goal?

The goal of adding the V8 CPU Profiler to v8go was so users of the library could measure the performance of any JavaScript being executed in a given V8 context. Besides providing insight on the code being executed, the profiler returns information about the JavaScript engine itself including garbage collection cycles, compilation and recompilation, and code optimization. While virtual machines and the like can run web applications incredibly fast, code should still be performant, and it helps to have data to understand when it's not. 

If we have access to a CPU profiler, we can ask it to start profiling before we start executing any code. The profiler samples the CPU stack frames at a preconfigured interval until it's told to stop. Sufficient sampling helps show the hot code paths whether that be in the source code or in the JavaScript engine. Once the profiler has stopped, a CPU profile is returned. The profile comes in the form of a top-down call tree composed of nodes. To walk the tree, you get the root node and then follow its children all the way down.

Here’s an example of some JavaScript code we can profile:

Using v8go, we start by creating the V8 isolate, context, and CPU profiler. Before running the above code, the profiler is told to start profiling:

After the code has finished running, the profiling is stopped and the CPU profile returned. A simplified profile in a top-down view for this code looks like:

Each of these lines corresponds to a node in the profile tree. Each node comes with plenty of details including:

  • name of the function (empty for anonymous functions)
  • id of the script where the function is located
  • name of the script where the function originates
  • number of the line where the function originates
  • number of the column where the function originates
  • whether the script where the function originates is flagged as being shared cross-origin
  • count of samples where the function was currently executing
  • child nodes of this node
  • parent node of this node
  • and more found in the v8-profiler.h file.

For the purposes of v8go, we don’t need to have opinions about how the profile should be formatted, printed, or used since this can vary. Some may even turn the profile into a flame graph. It’s more important to focus on the developer experience of trying to generate a profile in a performant and idiomatic way.

Evolving the API Implementation

Given the focus on performance and an idiomatic-to-Go API, the PR went through a few different iterations. These iterations can be categorized into two distinct rounds: the first where the profile was lazily loaded and the second where the profile was eagerly loaded. Let’s start with lazy loading.

Round 1: Lazy Loading

The initial approach I took aligned v8go with V8's API as closely as possible. This meant introducing a Go struct for each V8 class we needed and their respective functions (that is, CPUProfiler, CPUProfile, and CPUProfileNode).

This is the Go code that causes the profiler to stop profiling and return a pointer to the CPU profile:

This is the corresponding C++ code that translates the request in Go to V8's C++:

With access to the profile in Go, we can now get the top-down root node:

The root node exercises this C++ code to access the profiler pointer and its corresponding GetTopDownRoot() method:

With the top-down root node, we can now traverse the tree. Each call to get a child, for instance, is its own Cgo call as shown here:

The Cgo call exercises this C++ code to access the profile node pointer and its corresponding GetChild() method:

The main differentiator of this approach is that to get any information about the profile and its nodes, we have to make a separate Cgo call. For a very large tree, this makes at least kN more Cgo calls where k is the number of properties queried, and N is the number of nodes. The value for k will only increase as we expose more properties on each node.

How Go and C Talk to Each Other

At this point, I should explain more clearly how v8go works. v8go uses Cgo to bridge the gap between Go and V8's C code. Cgo allows Go programs to interoperate with C libraries: calls can be made from Go to C and vice versa.

Upon some research about Cgo's performance, you'll find Sean Allen’s Gophercon 2018 talk where he made the following recommendation:

“Batch your CGO calls. You should know this going into it, since it can fundamentally affect your design. Additionally once you cross the boundary, try to do as much on the other side as you can. So for go => “C” do as much as you can in a single “C” call. Similarly for “C” => go do as much as you can in a single go call. Even more so since the overhead is much higher.”

Similarly, you’ll find Dave Cheney’s excellent “cgo is not go” that explains the implications of using cgo: 

“C doesn’t know anything about Go’s calling convention or growable stacks, so a call down to C code must record all the details of the goroutine stack, switch to the C stack, and run C code which has no knowledge of how it was invoked, or the larger Go runtime in charge of the program.

The take-away is that the transition between the C and Go world is non trivial, and it will never be free from overhead.”

When we talk about “overhead,” the actual cost can vary by machine but some benchmarks another contributor v8go (Dylan Thacker-Smith) ran show an overhead of about 54 nanoseconds per operation (ns/op) for Go to C calls and 149 ns/op for C to Go calls:

Given this information, the concern for the lazy loading is justified: when a user needs to traverse the tree, they’ll make many more Cgo calls, incurring the overhead cost each time. After reviewing the PR, Dylan made the suggestion of: building the entire profile graph in C code and then passing a single pointer back to Go so Go could rebuild the same graph using Go data structures loaded with all the information that can then be passed to the user. This dramatically reduces the number of Cgo calls. This brings us to round #2.

Round 2: Eager Loading

To build out a profile for visualization, users will need access to most if not all of the nodes of the profile. We also know that for performance, I want to limit the number of C calls that have to be made in order to do so. So, we move the heavy-lifting of getting the entire call graph inside of our C++ function StopProfiling so that the pointer we return to the Go code is to the call graph fully loaded with all the nodes and their properties. Our go CPUProfile and CPUProfileNode objects will match V8’s API in that they have the same getters, but now, internally, they just return the values from the structs private fields instead of reaching back to the C++ code.

This is what the StopProfiling function in C++ does now: once the profiler returns the profile, the function can traverse the graph starting at the root node and build out the C data structures so that a single pointer to the profile can be returned to the Go code that can traverse the graph to build corresponding Go data structures.

The corresponding function in Go, StopProfiling, uses Cgo to call the above C function (CPUProfilerStopProfiling) to get the pointer to our C struct CPUProfile. By traversing the tree, we can build the Go data structures so the CPU profile is completely accessible from the Go side:

With this eager loading, the rest of the Go calls to get profile and node data is as simple as returning the values from the private fields on the struct.

Round 3 (Maybe?): Lazy or Eager Loading

There’s the potential for a variation where both of the above implementations are options. This means allowing users to decide where they want to lazily or eagerly load everything on the profile. It’s another reason why, in the final implementation of the PR, the getters were kept instead of just making all of the Node and Profile fields public. With the getters and private fields, we can change what’s happening under the hood based on how the user wants the profile to load.

Speed is Everything, So Which One's Faster?

Comparing lazy and eager loading required a test that executed some JavaScript program with a decently sized tree so we could exercise a number of Cgo calls on many nodes. We would measure if there was a performance gain by building the tree eagerly in C and returning that complete call graph as a pointer back to Go.

For quite a while, I ran benchmarks using the JavaScript code from earlier. From those tests, I found that:

  1. When lazy loading the tree, the average duration to build it is ~20 microseconds.
  2. When eagerly loading the tree, the average duration to build it is ~25 microseconds.

It's safe to say these results were unexpected. As it turns out, the theorized behavior of the eager approach wasn’t more optimal than lazy loading, in fact, it was the opposite. It relied on more Cgo calls for this tree size. 

However, because these results were unexpected, I decided to try a much larger tree using the Hydrogen starter template. From testing this, I found that:

  1. When lazy loading the tree, the average duration to build it is ~90 microseconds.
  2. When eagerly loading the tree, the average duration to build it is ~60 microseconds.

These results aligned better with our understanding of the performance implications of making numerous Cgo calls. It seems that, for a tiny tree, the cost of traversing it three times (twice to eagerly load information and once to print it) doesn’t cost less than the single walk to print it that includes numerous Cgo calls. The true cost only shows itself on a much larger tree where the benefit of the upfront graph traversal cost greatly benefits the eventual walkthrough of a large tree to be printed. If I hadn’t tried a different sized input, I would never have seen that the value of eager loading eventually shows itself. If I drew the respective approaches of growth lines on a graph, it would look something like:

Simple graph with time to build profile on the y axis and size of javascript on x axis. 2 lines indicating eager and lazy are plotted on the graph with lazy being higher

Looking Back at the Finish line

As a long time Go developer, there’s plenty of things I take for granted about memory management and performance. Working on the v8go library has forced me to learn about Cgo and C++ in such a way that I can understand where the performance bottlenecks might be, how to experiment around them, and how to find ways to optimize for them. Specifically contributing the functionality of CPU profiling to the library reminded me that:

  1. I should benchmark code when performance is critical rather than just going with my (or another’s) gut. It absolutely takes time to flesh out a sufficient alternative code path to do fair benchmarking, but chances are there are discoveries made along the way. 
  2. Designing a benchmark matters. If the variables in the benchmark aren’t reflective of the average use case, then the benchmarks are unlikely to be useful and may even be confusing.

Thank you to Cat Cai, Oliver Fuerst, and Dylan Thacker-Smith for reviewing, clarifying, and generally just correcting me when I'm wrong.

About the Author:

Genevieve is a Staff Developer at Shopify, currently working on Oxygen.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.