Sam Saffron AMA: Performance and Monitoring with Ruby

Sam Saffron is a co-founder of Discourse and the creator of the mini_profiler, memory_profiler, mini_mime and mini_racer gems. He has written extensively about various performance topics on samsaffron.com and is dedicated to ensuring Discourse keeps running fast.

Sam visited Shopify in Ottawa and talked to us about Discourse’s approach to Ruby performance and monitoring. He also participated in an AMA and answered the top-voted questions submitted by Shopifolk, which we are sharing here.

Ruby has a bad reputation when it comes to performance. What do you think are the actual problems? And do you think the community is on the right track to fix this reputation?

Sam Saffron: I think there are a lot of members of the community that are very keen to improve performance. And this runs all the way from above. DHH is also very interested in improving performance of Ruby.

I think the big problem that we have is resources and focus. A lot of times, I can feel that as a community we’re not necessarily focusing on the right thing. It’s very tempting, in performance, just to look at a micro benchmark. It’s easy to take a micro benchmark and make something 20 times faster, but in the big scheme of things you may not be fixing the right thing, so it doesn’t make a big difference.

I think one area that Ruby can get better at is finding the actual real production bottlenecks that people are seeing out there, and working towards solving them. And when I think about performance for us at Discourse, the biggest pain is memory, not CPU. When looking at adoption of Discourse, a lot of it depends on people being able to run it on very cheap servers, and they’re very constrained on memory. It makes a huge difference to adoption for us whether we can run on a 512MB system versus 1024MB. We see these memory issues in our hosting as well: our CPUs are usually doing okay, but memory is where we have issues. I wish the community would focus more on memory.

Just to summarize, I wish we looked at what big pain points consumers in the ecosystem are having and just set the agenda based on that. The other thing would be to spend more time on memory.

Are there any Ruby features or patterns that you generally avoid for performance reasons?

Sam Saffron: That’s an interesting question. Well, I’ll avoid ActiveRecord sometimes if I have something performance sensitive. For example, when I think of a user flow that I’m working on, it could be one that the user will visit once a month, or it could be an extremely busy route like the topic page. If I’m working on the topic page, which is a performance-sensitive area, then I may opt to skip ActiveRecord and just use MiniSql.

As for using Ruby patterns, I don’t go and write while loops just because I hate blocks and I know that blocks are a little bit slower. I like how wonderful Ruby looks and how wonderful it reads. So, I won’t be like, “Oh, yeah, I have to write C in Ruby now because I don’t want to use blocks anywhere.” I think there’s a balancing act with patterns, and I’ll only stray from them for two reasons. One is clarity. If the code will be clearer without using some of these sophisticated patterns, I’ll just go for clear and dumb versus fancy, sophisticated and pretty. I prefer clear and dumb. An example of that is I hate using /unless/. It’s a pet peeve that I have: I won’t use the /unless/ keyword because I find it harder to comprehend what the code means. The second is performance. Only rarely, where the performance hit absolutely matters, will I do that.
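To make the blocks-versus-while-loops trade-off concrete, here is a minimal sketch using Ruby’s standard Benchmark module. The method names and the array size are illustrative assumptions, not anything from Discourse. This is exactly the kind of micro benchmark Sam warns about: the numbers vary by Ruby version and machine, and a difference here says little about a real page.

```ruby
require "benchmark"

# Sum an array with an idiomatic block-based iterator.
def sum_with_block(numbers)
  total = 0
  numbers.each { |n| total += n }
  total
end

# Sum the same array with a manual while loop (the "C in Ruby" style).
def sum_with_while(numbers)
  total = 0
  i = 0
  while i < numbers.length
    total += numbers[i]
    i += 1
  end
  total
end

numbers = (1..100_000).to_a

# Both versions return the same result; only the timing may differ.
Benchmark.bm(12) do |bm|
  bm.report("each block") { 50.times { sum_with_block(numbers) } }
  bm.report("while loop") { 50.times { sum_with_while(numbers) } }
end
```

Whatever the timings show on a given machine, the block version is the one that reads like Ruby, which is the point Sam is making.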

Sam Saffron presenting at Shopify in Ottawa

What is the right moment to shift focus on the performance of a product, rather than on other features? Do you have any tripwires or metrics in place?

Sam Saffron: We’re constantly thinking about performance at Discourse. We’ve always got the monitoring in place and we’re always looking at our graphs to see how things are going. I don’t think performance is something that you forget about for two years then go back and say, “Yeah, we’ll do a round of performance now.” I think there should be a culture of performance instilled day-to-day, and you should always be considering it. It doesn’t mean performance is the only thing you should be thinking about, but it should be in the back of your mind as something constant that you are working on.

There’s a balancing act. You want to ship new features, but as long as performance is something the team is constantly thinking about, then I think it’s safe. I would never consider shipping a new feature that is very slow just because I want to get the feature out there. I prefer to have the feature both correct and fast before shipping it.

What was one of the most difficult performance bugs you’ve found? How did you stay focused and motivated?

Sam Saffron: The thing that keeps me focused is having very clear goals. It’s important when you’re dealing with performance issues. You have a graph, it’s going a certain shape, and you want to change the shape of it. That’s your goal. You forget about everything else and it’s about taking that graph from this shape to that shape. When you can break a problem down from something that is impossible into something that is practical and easy to reason about, it’s at that point, you can attack these problems.

Particular war stories are hard; there’s nothing that screams out at me as the worst bug we’ve had. I guess memory leaks have traditionally been some of the hardest problems we’ve faced. Back in the old days we used TheRubyRacer, and it had a leak in the interop layer between Ruby and V8. It was a nightmare to find, because you’d have these processes that just keep climbing, and you don’t know what’s responsible for it. It’s something random that you’re doing, but how do you get to it? So we looked at that graph and started removing parts of the app, and when you remove half of the app, the graph is suddenly stable. So you put the other half of the app back in and slowly bisect it until you find the problem area and start resolving it. Luckily, these days the tooling for debugging memory leaks is far more advanced, making it much easier to deal with issues like this.
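The bisection approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: the component names, the LEAKY_COMPONENT constant and the memory_keeps_climbing? helper are hypothetical stand-ins, and the sketch assumes exactly one component is responsible. In practice the "check" step is a long-running process and a memory graph, not a method call.

```ruby
# Hypothetical bisection helper: given a list of components and a block that
# reports whether memory still climbs with only the given components enabled,
# narrow the list down to a single suspect. Assumes exactly one component leaks.
def bisect_leak(components)
  suspects = components.dup
  while suspects.size > 1
    first_half = suspects.first(suspects.size / 2)
    # Keep the half that still leaks; otherwise the leak is in the other half.
    suspects = yield(first_half) ? first_half : suspects - first_half
  end
  suspects.first
end

# Stand-in for "run the app with only these parts enabled and watch the graph".
LEAKY_COMPONENT = :js_interop

def memory_keeps_climbing?(enabled)
  enabled.include?(LEAKY_COMPONENT)
end

components = %i[search email js_interop uploads notifications]
culprit = bisect_leak(components) { |enabled| memory_keeps_climbing?(enabled) }
puts "Suspect: #{culprit}"
```

Each round halves the search space, so even a large app needs only a handful of (slow) observation runs.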

Do you employ any kind of performance budgeting in your products and/or libraries? If you do, what metrics do you monitor and how do you decide on a budget?

Sam Saffron: Well, one constant budget I have is that any new dependency in our Gemfile has to be approved by me, and people have to justify its use. So I think dependencies are a big part of a performance budget, in that it’s easy to add dependencies, but to remove them later is very hard. I need to make sure that every new dependency we add fits within a performance budget and that we agree we absolutely need it.

I’m constantly thinking about our performance budget. We’ve got the budget on boot. I’m very proud of the way that I can boot Rails console in under two seconds on my laptop. So boot budget is important to me, especially for dev work. If I want to just open a Rails console, I just do it. I don’t have to think that I’m going to have to wait 20 seconds for this thing to boot up. I might as well go and browse the web.

We’ve also got a constant budget for the high-profile pages. We can’t afford any regression there. So, one thing that we’re looking at adding is alerts. If the query count on a topic page is now sitting on a median of 60 queries to SQL and it goes up to 120, I want to get an alert saying, “There are 120 queries on this page, and there used to be 60 only.” It’ll open an alert topic on Discourse, so somebody will have a look at that. So I definitely do want to get into more alerting that says, “Look, something happened at this point, look at it.”
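As a rough sketch of what such a query-count alert could look like, the class below compares the median query count of recent requests against an agreed baseline. The QueryBudget class, its tolerance parameter and the sample counts are all hypothetical; this is not how Discourse implements it. In a Rails app the per-request counts could plausibly be collected by subscribing to the "sql.active_record" event through ActiveSupport::Notifications, but that wiring is left out here to keep the example self-contained.

```ruby
# Hypothetical budget check: compares the median query count of recent
# requests against an agreed baseline and reports a regression.
class QueryBudget
  def initialize(baseline_median:, tolerance: 1.5)
    @baseline_median = baseline_median
    @tolerance = tolerance # alert when the median exceeds baseline * tolerance
  end

  # Median of a non-empty list of per-request query counts.
  def median(counts)
    sorted = counts.sort
    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end

  # Returns an alert message, or nil when the page is within budget.
  def check(page, counts)
    current = median(counts)
    return nil if current <= @baseline_median * @tolerance

    "There are #{current} queries on #{page}, and there used to be #{@baseline_median} only."
  end
end

budget = QueryBudget.new(baseline_median: 60)
puts budget.check("the topic page", [118, 120, 122]) || "within budget"
```

Using the median rather than the mean keeps a single unusually heavy request from triggering the alert.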

What’s your take on the different Ruby runtimes out there? Is MRI still the “go to one” for a new project? If so, what do you think are the other ones missing to become real contenders?

Sam Saffron: We’ve always wanted Discourse to work on a wide array of platforms. That’s been a goal because when we started it was just about pure adoption. We didn’t care if people were paying us or not paying us, we just wanted the software to be adopted. So if it can run on JRuby, all power to JRuby, since it makes adoption easier. The unfortunate thing that happened over the years is that we have never been able to run Discourse on JRuby. There have been attempts out there, but we are not quite there. Being able to host V8 in Java in JRuby is very, very hard. A lot of what we do is married to the C implementation. It’s extremely hard to move to another world. I want there to be diversity, but unfortunately the only option we have at the moment is MRI, and I don’t see any other feasible options popping up in the next couple of years.

Matz (Yukihiro Matsumoto) is saying that he wants Ruby 3 to be three times faster. Are you following the Ruby 3 development? Do you think they are going in the right direction?

Sam Saffron: I think there’s definitely a culture of performance at CRuby. There are a lot of improvements happening patch after patch where they are shaving this bit off and that bit off. CRuby itself is tracking well, but whether it’ll get three times faster or not, I don’t know. Where it gets complicated is that the ecosystem itself is tracking its own trajectory. There’s one trajectory for the engine, and another trajectory for the ecosystem.

If you look at things like Active Record, it’s not tracking three times faster for the next version of Rails, unfortunately. And that’s where all our pain is at the moment. When you look at what CRuby is doing, the goal is not making Active Record three times faster because it’s not a goal that is even practical for them to take on. So, they’re just dealing with little micro benchmarks that may help this situation or they may not help the situation, we don’t know.

Overall, do I think MRI is tracking well? Yes, MRI is tracking well, but I think we need to put a lot more focus on the ecosystem if we want the ecosystem to be 3x faster.

Is there any performance tooling that you think MRI is missing right now?

Sam Saffron: Yes. I’d say memory profiling is the big tooling piece that is missing. We have a bunch of tooling, for example, you can get full heap dumps. But the issue is how are you going to analyze it? The tooling for analysis is woeful, to say the least. If you compare Ruby on Rails to what they have in Java or .NET, we’re worlds behind. In Java and .NET, when it comes to tooling for looking at memory, you can get back traces from where something is allocated. In MRI, at best, you can get a call site of where something was allocated, you can’t get the full backtrace of where it was allocated. Having the full backtrace gives you significantly more tools to figure out and pinpoint what it is.
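To illustrate the "call site only" limitation, here is a small sketch using the objspace extension that ships with MRI. The build_widget and traced_allocation method names are made up for the example. With allocation tracing on, MRI records the file and line where an object was allocated, which here points inside build_widget, but it does not record who called build_widget.

```ruby
require "objspace"

# The allocation happens on the line inside this method.
def build_widget
  Object.new
end

# Allocate an object with tracing enabled and report where MRI says it
# was created. Only the immediate call site is available, not the chain
# of callers that led to it.
def traced_allocation
  info = nil
  ObjectSpace.trace_object_allocations do
    obj = build_widget
    info = {
      file: ObjectSpace.allocation_sourcefile(obj),
      line: ObjectSpace.allocation_sourceline(obj)
    }
  end
  info
end

info = traced_allocation
puts "allocated at #{info[:file]}:#{info[:line]}"
```

If build_widget were called from a dozen different places, every object would report the same file and line, which is why a full allocation backtrace would make pinpointing the real source so much easier.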

So, I’d say there are some bits missing of raw information that you could opt in for, that would be very handy. And a lot of tooling around visualizing and analyzing what is going on, especially when it comes to the world between managed and unmanaged because it’s very murky.

People look at a process and the process is consuming one gig of memory, and they want to know why. If you were able, at Shopify for example, to have that picture of why immediately, you might say, well, maybe killing Unicorn workers is not what we need because all the memory looks like this and it’s coming from here. Maybe we just rewrite this little component and we don’t have to kill these Unicorns anymore because we’ve handled the root cause. I think that area is missing.

Intrigued about scaling using Ruby? Shopify is hiring and we’d love to hear from you. Please take a look at our open positions on the Engineering career page.