Renato Athaydes Personal Website


How to write really slow Rust code

How I tried to port Lisp code to Rust and managed to get a much slower program... and how to fix that!
Written on 31 Jul 2021, 10:50 AM


Photo by Sam Moqadam on Unsplash

I recently published a blog post that, as I had expected (actually, hoped for, as that would attract people to contribute to the “study”), generated quite a bit of controversy on the Internet!

The post was about an old study by Lutz Prechelt comparing Java to C/C++, as well as a few follow-up papers that added other languages to the comparison, including Common Lisp and a few scripting languages.

I decided to try and see whether the results of those papers, whose studies were run 21 years ago, still stand or whether things have changed completely since then.

I couldn’t get a bunch of students, let alone professional programmers, to write programs for my “study” in multiple languages, so I did the next best thing and ported the arguably most common approach to the problem (based on code by Peter Norvig, no less) to other languages: Java, Rust, and later also Julia and Dart.

The results were shocking: Rust was one of the slowest languages! It had the lowest memory usage, but also the highest LOC (lines-of-code) if I included only the implementations of the same algorithm.

Today, I want to appease angry Rust fans and other concerned parties by showing exactly why my Rust code was so slow, and how with a few small fixes, it can outperform the other languages in the “competition”, at least in performance (and as I found out, for certain kinds of inputs only).

Thanks to the Rust fans who took a lot of their free time to show me just why my Rust code was so slow and how to fix it!

The original Common Lisp port to Rust

Let’s look at the original Common Lisp code before we start looking at the slow Rust code.

Here’s the essence of Peter Norvig’s Lisp solution to Prechelt’s phone encoding problem (comments removed for brevity):

(defun print-translations (num digits &optional (start 0) (words nil))
  (if (>= start (length digits))
      (format t "~a:~{ ~a~}~%" num (reverse words))
      (let ((found-word nil)
            (n 1)) ; leading zero problem
        (loop for i from start below (length digits) do
              (setf n (+ (* 10 n) (nth-digit digits i)))
              (loop for word in (gethash n *dict*) do
                 (setf found-word t)
                 (print-translations num digits (+ 1 i) (cons word words))))
        (when (and (not found-word) (not (numberp (first words))))
           (print-translations num digits (+ start 1)
                               (cons (nth-digit digits start) words))))))

You can see the full code here.

It is an interesting and brief solution, and one that I wouldn’t have come up with myself. What I did come up with was based on a Trie data structure (referred to as Java1 throughout this article, meaning the first Java implementation, not JDK 1.0), but that’s another story.

Anyway, porting this to other languages seems pretty easy and fun, so that’s what I did. My purpose was to check how much difference a language really makes, assuming you can solve the problem using the same algorithm.

My Thesis

The thesis (which I perhaps didn’t make very clear in my previous article) was that once you discount the choice of algorithm (which can sometimes be the hard part), language differences wouldn’t be as large as they appear to be when you compare programs doing widely different things to achieve the same result.

The original study actually had convincing evidence that programmers tend to come up with different algorithms depending on the language: programmers of lower-level languages tend to use lower-level constructs even when their language would allow a higher-level one. This, my theory went, was what really caused lower-level languages to have higher LOC counts and better performance.

Once you control for that, both LOC and performance should get closer, making the choice of language matter much less (well, as long as you can get yourself to use the right level of abstraction for the problem in your chosen language).

Slow Rust, and how to fix it

Going back to the Rust code… I tried to keep it as close as possible to the original Lisp code in order to have an apples-to-apples comparison. So I considered it very important to use big numbers to “key” existing words, just as the Lisp code does.

A lot of people criticized that decision because they “knew” that BigInt is slow in many languages, and Lisp is just really good at that.

Well, as we’ll see soon, that’s NOT the reason the Rust code was slow.

Here’s part of the Rust code that I wrote, including the conversion from word to BigUInt that I used:

use lazy_static::lazy_static;
use num_bigint::{BigUint, ToBigUint};
use std::collections::HashMap;

type Dictionary = HashMap<BigUint, Vec<String>>;

lazy_static! {
    static ref ONE: BigUint = 1.to_biguint().unwrap();
    static ref TEN: BigUint = 10.to_biguint().unwrap();
}

fn word_to_number(word: &str) -> BigUint {
    let mut n = ONE.clone();
    for ch in word.chars() {
        if ch.is_alphabetic() {
            n = &n * (&*TEN) + &char_to_digit(ch);
        }
    }
    n
}

I used lazy_static because I wanted a global constant for ONE and TEN in order to avoid allocating an instance of BigUint for every single iteration of the loop shown in word_to_number.

Not only is that horribly ugly, it’s completely unnecessary and futile. I didn’t know that at the time, of course. This seemed like the way to do this from my previous experience in Rust (i.e. need non-constant global state, use lazy_static).

Alright, but here’s the right way to do it, thanks to @philipc:

fn word_to_number(word: &str) -> BigUint {
    let mut n = 1u8.into();
    for digit in word.chars().filter_map(char_to_digit) {
        n *= 10u8;
        n += digit;
    }
    n
}

Seeing this blew my mind. Notice that all operations above still use BigUint, just as in the original!
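For the above to compose with filter_map, the char_to_digit helper (not shown in the post) has to return an Option. A sketch of what it might look like, using the letter-to-digit mapping from Prechelt’s problem statement (my reconstruction, not the verbatim code from the repository):

```rust
/// Maps a letter to its phone-pad digit per Prechelt's problem statement:
/// e->0, jnq->1, rwx->2, dsy->3, ft->4, am->5, civ->6, bku->7, lop->8, ghz->9.
/// Returns None for non-letters, so it composes with `filter_map`.
fn char_to_digit(ch: char) -> Option<u8> {
    match ch.to_ascii_lowercase() {
        'e' => Some(0),
        'j' | 'n' | 'q' => Some(1),
        'r' | 'w' | 'x' => Some(2),
        'd' | 's' | 'y' => Some(3),
        'f' | 't' => Some(4),
        'a' | 'm' => Some(5),
        'c' | 'i' | 'v' => Some(6),
        'b' | 'k' | 'u' => Some(7),
        'l' | 'o' | 'p' => Some(8),
        'g' | 'h' | 'z' => Some(9),
        _ => None,
    }
}
```

Returning `Option<u8>` also lets the quote characters that appear in the dictionary words fall through naturally, instead of requiring a separate `is_alphabetic` check.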

Similarly, in the print_translations function, whereas I had written something like this:

    let mut n = ONE.clone();
    let mut found_word = false;
    for i in start..digits.len() {
        n = &n * (&*TEN) + &nth_digit(digits, i);
    ...

Here’s what I should have written:

    let mut n = 1u8.into();
    let mut found_word = false;
    for i in start..digits.len() {
        n *= 10u8;
        n += nth_digit(digits, i);
    ...

This also requires a small change to char_to_digit; check the full diff if you’re interested.

With just this readability improvement, the code actually runs faster already (but not as fast as CL)!!

Rust improvements VS Lisp

That goes to show that sometimes, better code is also more performant code.

The next step is to avoid inefficient copies of lists (or Rust Vec) for each partial solution, which I had done in a misguided attempt to keep the code simple.

Every time the algorithm finds a new word for a potential solution, it needs to recurse with that word accumulated into the current partial solution. It needs to do that for potentially several words at the same level, so we can’t just mutate the list and forget about it. That’s why I used Common Lisp’s approach, approximately:

(cons word words)

This will return a new list that contains all elements of words, plus word, without mutating words.

As you might know, Lisp is built around lists, so doing this is highly efficient in Lisp. But the equivalent Rust code is not.

This is what I had done:

let mut partial_solution = words.clone();
partial_solution.push(word);
// recurse passing the current `partial_solution`

This copies the list, then appends the new word before recursing. That means we can repeat this operation for each word as the original partial_solution is preserved. Also, words can stay immutable, and as functional programmers know, immutability is good!

As many commenters pointed out, this is silly from a performance point of view. Once you recurse, this very same code is going to run again (or not), and in any case, we can just pop the word from the original list once the recursion completes to achieve the exact same result, with no copy at all!

Functional programming with immutable structures may be great, but it requires “smart” implementations of immutable data structures, and just cloning a Vec is not that!
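For contrast, here is what a “smart” structure-sharing list looks like: a minimal cons-list sketch in Rust using Rc (my own illustration, not part of any of the solutions discussed here). Prepending is O(1) because the new node shares the old tail instead of copying it, which is exactly why Lisp’s cons is cheap:

```rust
use std::rc::Rc;

/// A cons-style list: prepending shares the existing tail instead of
/// copying it, which is why Lisp's (cons word words) is cheap.
enum List<T> {
    Nil,
    Cons(T, Rc<List<T>>),
}

/// O(1) prepend, like Lisp's cons: the old list is untouched and still usable.
fn cons<T>(head: T, tail: &Rc<List<T>>) -> Rc<List<T>> {
    Rc::new(List::Cons(head, Rc::clone(tail)))
}

/// Collects the list into a Vec, head first (for inspection only).
fn to_vec<T: Clone>(mut list: &Rc<List<T>>) -> Vec<T> {
    let mut out = Vec::new();
    while let List::Cons(head, tail) = list.as_ref() {
        out.push(head.clone());
        list = tail;
    }
    out
}
```

That said, for this problem we don’t need persistence at all, because each partial solution is abandoned as soon as the recursion for it returns.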

Here’s what we can do in this case:

words.push(word);
// recurse passing the current `partial_solution`
words.pop();

Done, right? Wrong… we now have to deal with the borrow checker telling us that we can’t just get a brand-new instance of String and put it into our Vec<&String> because the Vec is not the owner of any of its strings.

I’ve seen this many times in Rust: I write a large chunk of code being a little careless about lifetimes, just to then have to change the types of my variables and functions everywhere in my program to avoid having to clone() things everywhere to make it work.

This means that, yes, writing efficient code in Rust requires a lot of forethought. In this case, our tiny modification to use a mutable Vec actually requires us to only deal with strings whose lifetimes match or exceed that of our Vec.

To achieve that, we need to do two things.

First, add an explicit lifetime to the functions operating on the Vec, going from:

fn print_translations(
    num: &str,
    digits: &Vec<char>,
    start: usize,
    words: Vec<&String>,
    dict: &Dictionary,
) -> io::Result<()>

To:

fn print_translations<'dict>(
    num: &str,
    digits: &Vec<char>,
    start: usize,
    words: &mut Vec<&'dict str>,
    dict: &'dict Dictionary,
) -> io::Result<()>

Now, we have the problem that not all words we insert into words come from dict: some words will be the replacement digits that the problem allows us to use when a word can’t be found.

Right now, the code won’t compile on this line:

let digit = nth_digit(digits, start).to_string();
words.push(&digit);

As digit here has the wrong lifetime.

...
62 |         words.push(&digit);
   |         -----------^^^^^^-
   |         |          |
   |         |          borrowed value does not live long enough
   |         argument requires that `digit` is borrowed for `'dict`

To fix that, we must change nth_digit to give us a String that lives long enough!

But how can we do that? Well, in this case it’s actually very simple: there are only 10 possible digits, so we can have a static list containing them all! nth_digit will then always return something with 'static lifetime, which should work as that’s definitely going to live longer than 'dict.

The implementation of nth_digit goes from this monstrosity:

fn nth_digit(digits: &Vec<char>, i: usize) -> BigUint {
    let ch = digits.get(i).expect("index out of bounds");
    ((*ch as usize) - ('0' as usize)).to_biguint().unwrap()
}

To:

fn nth_digit(digits: &[char], i: usize) -> u8 {
    digits[i] as u8 - b'0'
}

That still works because n += nth_digit(digits, i); compiles even when n is a BigUint: the num-bigint crate implements the arithmetic operators for small primitive operands like u8!

And when we need to insert a new digit in words:

static DIGITS: [&str; 10] = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"];

...

    let digit = nth_digit(digits, start);
    words.push(DIGITS[usize::from(digit)]);

Here, a Rust novice like me (I have been using Rust on and off for over two years now, but it really takes some time or a lot of dedication to stop being a novice in Rust) might say that this is getting a bit too complicated compared with either Java or Lisp; lifetimes can get really tricky. However, in this case, it seems that someone with good Rust expertise would’ve seen this coming and taken care of it from the beginning?!

Well, I would love to run a study with lots of Rust programmers to see how that actually goes in practice :). Maybe one day.

How are we doing performance-wise now, you may ask!

Check it out (full diff from v2 to v3):

Rust improvements V3

That’s right! The Rust code now matches the Common Lisp code in performance almost exactly!

Big Rust stylistic sins

Besides the bad performance, my original code had a lot of stylistic sins, apparently.

For example, I shouldn’t have done this:

fn print_solution(num: &str, words: &Vec<&str>)

Do this if you want to avoid being shamed:

fn print_solution(num: &str, words: &[&str])

It might be tempting to use if let Ok() sometimes, but don’t do it:

let words = read_lines(words_file)?;
for line in words {
    if let Ok(word) = line {
...

You might have just forgotten that this will actually ignore IO errors!

The right way:

for word in read_lines(words_file)?.flatten() {
    ...
}
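Why does flatten do the right thing here? Result implements IntoIterator (Ok yields its value, Err yields nothing), so flattening an iterator of Results keeps the Ok values and skips the Err ones — behaviorally the same as the if let loop, just terser. A small stdlib-only illustration of the idiom (my own example, not from the post’s code):

```rust
use std::io;

/// Keeps the Ok lines and silently skips the Err ones, because
/// Result implements IntoIterator: Ok yields one item, Err yields none.
fn ok_lines(lines: impl IntoIterator<Item = io::Result<String>>) -> Vec<String> {
    lines.into_iter().flatten().collect()
}
```

Note that errors are still being ignored, exactly as before; the win is brevity, not error handling.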

Fancy, right?! This is one of those idioms that if you don’t know it yet, you feel like an idiot for having written that horrible code before.

Rust has so many of these :D.

Here’s another one: reading command-line arguments in order, allowing for default values.

My terrible approach (it actually took me half an hour to decide how to do this):

let mut args: Vec<_> = args().skip(1).collect();
let words_file: String = if !args.is_empty() { args.remove(0) } else { "tests/words.txt".into() };
let input_file: String = if !args.is_empty() { args.remove(0) } else { "tests/numbers.txt".into() };

This is noobie code… here’s what pros would do:

let mut args = args().skip(1);
let words_file = args.next().unwrap_or_else(|| "tests/words.txt".into());
let input_file = args.next().unwrap_or_else(|| "tests/numbers.txt".into());

Apparently, you don’t need to collect the args into a Vec, then remove them to take ownership! By calling next() on the iterator, you’re actually consuming the items.

Because std::env::Args implements Iterator<Item = String>, not Iterator<Item = &str> or something, consuming it gives you ownership of the args (each String is freshly created from the OS-provided argument when you call next()).

Who would’ve thought!?

Unfortunately, these stylistic improvements don’t give us much extra performance; the previous chart still reflects the current performance of the code well.

In any case, here’s the full diff with the stylistic improvements.

At this point, the Rust code is basically @philipc’s, not mine. See his fork for a few other similar versions of the code.

Going the extra mile

At this point, the Rust developers might have patted themselves on the back and gone home to have a nice time with the family, internally celebrating the small victory.

However, Rust still has more performance to give, so why stop there?

@bugaevc came up with an implementation that gets rid of BigUint and uses a Vec<u8> instead (full diff).

As a small sample of his implementation, here’s the word_to_number function, which still looks quite neat.

fn word_to_number(word: &str) -> Vec<u8> {
    word.chars()
        .filter(char::is_ascii_alphabetic)
        .map(char_to_digit)
        .map(|d| d + b'0')
        .collect()
}

It turns out that this change allows Rust to go through the phone numbers even faster!
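To give an idea of the surrounding types in this variant, the dictionary would now be keyed by the digit bytes themselves instead of big integers. This is my reconstruction of the shape of the code (names and details are assumptions; check the linked diff for the actual implementation):

```rust
use std::collections::HashMap;

/// Phone-pad letter mapping from Prechelt's problem statement.
/// Only called on ASCII letters, hence the unreachable fallback.
fn char_to_digit(ch: char) -> u8 {
    match ch.to_ascii_lowercase() {
        'e' => 0,
        'j' | 'n' | 'q' => 1,
        'r' | 'w' | 'x' => 2,
        'd' | 's' | 'y' => 3,
        'f' | 't' => 4,
        'a' | 'm' => 5,
        'c' | 'i' | 'v' => 6,
        'b' | 'k' | 'u' => 7,
        'l' | 'o' | 'p' => 8,
        'g' | 'h' | 'z' => 9,
        _ => unreachable!("not a letter: {}", ch),
    }
}

/// @bugaevc's encoding idea: a word becomes its digit string as ASCII bytes.
fn word_to_number(word: &str) -> Vec<u8> {
    word.chars()
        .filter(char::is_ascii_alphabetic)
        .map(char_to_digit)
        .map(|d| d + b'0')
        .collect()
}

/// The dictionary keyed by byte vectors instead of BigUint.
type Dictionary = HashMap<Vec<u8>, Vec<String>>;

/// Groups words by their digit-byte encoding.
fn load_dictionary(words: impl IntoIterator<Item = String>) -> Dictionary {
    let mut dict = Dictionary::new();
    for word in words {
        dict.entry(word_to_number(&word)).or_default().push(word);
    }
    dict
}
```

Hashing and comparing short byte vectors avoids all the bignum arithmetic, which is plausibly where the extra speed comes from.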

Here are the final results, including an improved Java version (henceforth Java3) that uses almost exactly the same performance improvements as the improved Rust code:

Notice that this chart includes only the final BigUInt version from the previous section and the new version. I leave it up to you to consider whether @bugaevc’s solution is still comparable with the original Common Lisp, given that we haven’t also squeezed out whatever last drop of performance Common Lisp might have to give.

Rust improvements (v4) VS Lisp

As you can see, if you really need that last millisecond of performance, Rust can give you that!

Java seems to have fallen far behind now (at least it performs a lot better than the original, performance-bug-ridden Rust implementation).

What about a larger dictionary and longer runs

As a bonus, I thought it would be a good idea to use a much larger dictionary, because that makes the runtimes really struggle as the combinatorial space of solutions explodes!

This might allow Java to redeem itself, at least partly, and show whether its JIT can make the miracles Java people (like me!) say it can, given enough time and information about hot code paths.

I found a ridiculously large English words list containing 466k words!

Perfect to make our programs sweat. I edited the benchmark config to use that in an extra run.

After some experimentation, that proved to be too much as even finding a couple of phone numbers would take minutes… so I had to adjust the settings until the programs ran for a reasonable amount of time, but not too long (I settled for around one minute for the fastest implementation).

Using a very small number of phone numbers, but allowing each number to be up to 50 digits long, caused the variability between runs to become extreme because the numbers were generated randomly on each benchmark run (if a particular run happened to contain a few long numbers, it would take forever to find all solutions).

I adjusted for that by lowering the upper bound on phone number length and increasing the number of phone numbers, which made runs more stable and the benchmark reproducible.

The settings I ended up with were:

Running the benchmarks a few times showed that there was variance, but not in the relative performance between the languages.

What the longer runs show is very different from what we had seen before!

Longer run 1

The group on the left is the first run, and on the right, the second run (using the same settings).

I also ran this last benchmark with the BigInt final solution from @philipc, and surprisingly, it took almost exactly as long as the final Rust implementation (32.93s for the former, 32.42s for the latter). Who said Rust’s num crate was slow?

The implementations are all very sensitive to the benchmark settings: increasing the max phone number length, for example, from 12 to 16, has a huge impact on runtime (from around 20 seconds to several minutes).

However, the amount of phone numbers seems large enough that every time I ran the benchmarks, I got approximately the same relative performance difference between the languages.

With that said, Java seems to take the lead (using the implementation that replaces BigInt with a ByteBuffer, like the Rust version did, not the Trie!) on any runs that take at least 20 seconds, and its lead keeps increasing after that. As predicted, the JIT kicks in and makes Java fly past other languages, even with my enterprisey-looking code.

Could something “hidden” be affecting the results?!

The impact of stdout

Rust people speculated that printing to stdout might become the bottleneck at some point… with the number of solutions we’re now getting, that could certainly be true!

So I took one of the suggestions by @bugaevc to buffer writes to stdout. I did the same in Java, of course! This has now become a serious Java VS Rust race, somehow.

Removing writes to stdout would, in my opinion, make the exercise pointless, as the objective is clearly to be able to show the user the results. Also, the benchmark_runner throws away the process’ stdout, so there is no delay due to the stdout consumer blocking for too long, which I think is enough to make the benchmark results valid when using a buffer.
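In Rust, that buffering amounts to wrapping the locked stdout handle in a BufWriter. Here is a sketch of the technique (the sample lines are from the problem statement’s example output; the actual patch is in the linked forks):

```rust
use std::io::{self, BufWriter, Write};

/// Writes each solution line through any writer; callers decide on buffering.
fn write_solutions<W: Write>(out: &mut W, solutions: &[&str]) -> io::Result<()> {
    for solution in solutions {
        writeln!(out, "{}", solution)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // BufWriter batches many small writes into one syscall per buffer
    // (8 KB by default), instead of one syscall per line.
    let mut out = BufWriter::new(stdout.lock());
    write_solutions(&mut out, &["5624-82: mir Tor", "5624-82: Mix Tor"])?;
    out.flush() // also flushed on drop, but an explicit flush surfaces IO errors
}
```

Taking a generic `W: Write` keeps the printing logic testable against an in-memory buffer while production code passes the buffered stdout handle.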

Running this last benchmark showed that both Rust and Java now can run in just a dozen seconds or so! Amazing!!!

Does that invalidate my current results? Well, no… re-running all benchmarks shows that, with the exception of this last result, all previous runs still show roughly the same thing (i.e. stdout was not a bottleneck until now), with results becoming more and more skewed with the larger inputs (not quite enough to change results substantially).

On my first run after buffering stdout writes, Rust again seemed to become much faster than Java. However, the programs were running for only a few seconds now that printouts became faster, not enough for the JVM to optimise away stuff. Sure enough, increasing the input sizes once again to make programs run for at least a minute (which is how long I was patient enough to wait for results) shows that the JVM catches up, eventually.

Both languages’ default buffer sizes were used; it seems both chose an 8 KB buffer.

I decided it was necessary to see if there was a trend here, so I had to go back to line charts, but with longer runs varying not the number of inputs, but the length of the single phone number used as input. This way, we know how long each program can find all solutions for just one phone number (the number of solutions grows very quickly with the length of the phone number).

I wrote a new benchmark script to do that, and here’s what happened:

Phone length VS runtime

This shows that Rust starts a lot faster, as expected, and that the JVM eventually catches up!

I stopped at a phone length of 29 because each run was already exceeding my limit of 1-minute per run… but as you can see, at the end it’s unclear if Java is finally able to overtake Rust or not (technically, it did by a very small margin in the last two runs)!

For that reason, I thought one final run with a length 30 was really needed to decide the verdict, even if that took a much longer time :D!

Here’s the final run with a phone number of length 30:

Phone length 30 VS runtime

Unbelievable, right? One of the Java implementations still manages to run faster than Rust.

Are we convinced that this is real yet?

Of course not!

What if we swap the Rust HashMap implementation, since, as people pointed out, the standard one uses a DoS-resistant cryptographic hash function (SipHash) while Java’s HashMap doesn’t?!

Are we also allowed to swap Java’s HashMap with a much faster implementation like this or this?

What if we just throw away the solutions instead of printing them? What if we parallelize the solutions?

What if we compile the Java code to native using GraalVM?

People have suggested all sorts of things, but I have to draw the line somewhere.

I might make some people angry that I didn’t use this or that library, but my arbitrary line here will be that adding libraries to the mix won’t be allowed, because it makes language comparisons even harder than they already are and, to be quite honest, I am already exhausted trying to reach a consensus.

Can we all just agree that Java and Rust are both amazingly fast?!

EDIT (2021-08-08): Some of the questions above are answered in Part 2 of this blog post, in which I went into more detail about why the modifications made above were necessary, and why Java managed to run faster than Rust even after optimisations (after buffering writes to stdout, the number of solutions the program needed to print was so huge that it was still doing mostly IO - after reducing the number of solutions being printed, the results became quite different).

Conclusion

🏁

Writing code in Rust doesn’t mean your code will be blazing fast. It’s easy to make mistakes and get quite slow performance. As this blog post has shown, you might even have to sweat quite a lot to beat Common Lisp and Java.

This blog post descended into a crazy race between Java and Rust to see which one could go faster… That was never my intention: all I wanted was to show how to fix my bad Rust code, but to know whether the performance was good enough, I had to use the other languages for comparison, and things quickly turned into a pointless race to solve a problem no one cares about, just so we could figure out which language is really the fastest.

The nice thing is that, in the end, the performance of the two languages that received the most attention, Java and Rust, was so close that my point, that language choice doesn’t matter as much as people think, came out significantly strengthened.

Sometimes, it doesn’t matter which language you use or how smart your programmers are, your code will run so fast that you won’t ever even think about other languages than what you’re comfortable with.

Other times, almost always I would say, you need to pay attention to the algorithm you’re using, but can still stay in the comfort of your favourite language.

Finally, there are times when every cycle matters. In those cases, you might think it’s safe to pick a particular language that’s normally expected to be the fastest in a certain domain, but until you actually measure it, it’s nearly impossible to know.

What do you think?

Is Rust the fastest, as everyone knows it ought to be, and the fact that for this particular problem the JVM managed to come ahead when it really mattered an incredible fluke?

Or do you think that for long-running processes where throughput is more important than latency, a language like Java (or Kotlin, Scala, Groovy, and Clojure) that has a JIT compiler fixing at runtime all the inefficiencies your bad programmers introduced to the code is something indispensable?

I don’t know… all I know is that if I need to write a CLI, I will write it in Rust as it is a lovely language once you get past the initial blocks, but if I need a server running forever, I might just stick with Java (and Kotlin to get some of the cool stuff that Rust offers, like null safety) for now and know that my bad code will run fast anyway!

What about memory and LOC?!

Oh yeah, almost forgot (not that anyone seems to care, speed is what the people want)!

Lines of code (LOC) are shown below before the buffering of stdout was applied to the Rust and Java implementations, as the modification was not applied on the other languages.

LOC final

You care about memory?? People didn’t seem to care at all from the feedback I got, but if you must, have a look at the raw data.

The fastest solution, Java3, is quite hungry (several hundred MB), Lisp manages to stay under 100MB, and Rust is really well behaved at only around 25MB in all benchmarks.

Just one last note to the Julia programmers out there (and Dart! See the Dart solution): Julia may have won in LOC, but the current implementation, and some others by actual Julia programmers, are still well behind Java/Rust/CL! Can you make it go faster? Let me know!