Renato Athaydes Personal Website

Sharing knowledge for a better world

How to write really slow Rust code - Part 2

A deeper look at what makes Rust slow and why exactly it couldn't beat Java's speed in larger problems
Written on 08 Aug 2021, 06:28 PM (Last updated on 09 Aug 2021, 09:06 AM)


Photo by Sam Moqadam on Unsplash

This blog post is a follow-up to How to write really slow Rust code, which itself was a follow-up to a paper analysis that discussed how the choice of programming language affects developer productivity and program efficiency.

I have received a lot of feedback after these posts, especially the latest one, so I would like to clarify a few points that I think were not made clear enough, and to respond to some criticism that I think was very unfair.

I will also go further and show even more ways of making Rust code faster!

Why was Rust slow, again?

As my previous blog post showed, nearly half of the performance difference between Rust and Common Lisp was due to improper use of BigUint values (from the num-bigint crate, as Rust does not natively support big numbers).

I had failed to see that BigUint is able to operate on other number types, which avoids having to allocate too many of them.

That means I didn’t need to do this:

lazy_static! {
    static ref ONE: BigUint = 1.to_biguint().unwrap();
    static ref TEN: BigUint = 10.to_biguint().unwrap();
}

fn word_to_number(word: &str) -> BigUint {
    let mut n = ONE.clone();
    for ch in word.chars() {
        if ch.is_alphabetic() {
            n = &n * (&*TEN) + &char_to_digit(ch);
        }
    }
    n
}

Notice that static ref ONE: BigUint = 1.to_biguint().unwrap() was written that way because BigUint does not implement From<i32>, and I was too distracted by the rest of the problems I was having to notice that it does, of course, implement From for all unsigned number types, so this would've worked: static ref ONE: BigUint = 1u8.into();

And I could have written this instead (char_to_digit was also modified to return Option<u8> instead of u32):

fn word_to_number(word: &str) -> BigUint {
    let mut n = 1u8.into();
    for digit in word.chars().filter_map(char_to_digit) {
        n *= 10u8;
        n += digit;
    }
    n
}

It wasn’t very clear in the previous blog post why this made a big difference, so let’s look into it in more detail.

Avoiding allocating too many BigUints

In the example above, we always need to allocate at least one new BigUint instance to initialize n (which we want to return), but that only happens once per word, so it’s not a big problem.

However, the multiplication in the loop happens for every single character in the word that can be converted to a digit.

Even though I had made an attempt at NOT allocating a new BigUint inside that loop by writing (&*TEN), something went wrong and that code was still allocating too many BigUints! But was it really because of &*TEN?

To figure out what’s happening, we need to understand how dereferencing works in Rust and what exactly the lazy_static macro does.

At first sight, doing &*s may look pointless because *s means dereference s and &s means reference s, so you might think you just get back s by doing that. That doesn’t always hold in Rust.

The reason is that several Rust operators are overloadable, including Deref (the trait associated with the dereference operator). Hence, if a type implements Deref, you must know what type the implementation returns to know the type of *s. Also, types implementing Deref are subject to implicit deref coercion, which is what allows, for example, passing &String to methods expecting &str (String derefs to str), something we relied on in the previous blog post.

The &*s construct even has a name, reborrow. For a nice explanation of why Deref was invented, see the accepted answer to the question Why does Option<&String> not coerce to Option<&str>?
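A minimal illustration of reborrowing and deref coercion, using the familiar String/str pair:

```rust
fn takes_str(_s: &str) {}

fn main() {
    let s = String::from("hello");
    // *s has type str (String implements Deref<Target = str>),
    // so &*s is a &str, not a &String: this is a "reborrow".
    let r: &str = &*s;
    assert_eq!(r, "hello");
    // Deref coercion: a &String coerces to &str automatically
    // when passed where a &str is expected.
    takes_str(&s);
}
```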

Now we can look at what the lazy_static macro does. From the docs:

For a given static ref NAME: TYPE = EXPR;, the macro generates a unique type that implements Deref<TYPE> and stores it in a static with name NAME.

On first deref, EXPR gets evaluated and stored internally, such that all further derefs can return a reference to the same object.

This is very useful if you need to store immutable, non-const global data in your program because with “normal” Rust, that’s basically impossible.

This makes it clear why, when using lazy_static, the only way to access the data is by de-referencing it: the macro makes sure to implement Deref such that the unique type it generates for each value derefs to the declared type.
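To make that concrete, here is a hand-rolled sketch of the same idea using only the standard library's std::sync::OnceLock (the real macro generates more machinery, of course): the static is a unique zero-sized type, and only its Deref implementation gives access to the lazily initialized value.

```rust
use std::ops::Deref;
use std::sync::OnceLock;

// Stand-in for the unique type lazy_static generates for `TEN`.
struct LazyTen;

static TEN_CELL: OnceLock<u64> = OnceLock::new();

impl Deref for LazyTen {
    type Target = u64;
    fn deref(&self) -> &u64 {
        // Initialized on first deref; all later derefs return
        // a reference to the same stored value.
        TEN_CELL.get_or_init(|| 10)
    }
}

static TEN: LazyTen = LazyTen;

fn main() {
    assert_eq!(*TEN, 10);
    let r: &u64 = &*TEN; // reborrow: deref, then reference the target
    assert_eq!(*r, 10);
}
```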

From the description, it looks like my implementation should have worked as intended, after all, it says that the value is initialized once, then the same object is returned every time.

But it doesn’t because of something else I didn’t fully understand.

Let’s use the count_alloc function from the alloc_counter crate to keep track of how many allocations the code is doing.

How many allocations do you think the following code has?

lazy_static! {
    static ref TEN: BigUint = 10u8.into();
}

fn main() {
    let ((allocs, re_allocs, de_allocs), v) = count_alloc(|| {
        (&*TEN, &*TEN, &*TEN, &*TEN)
    });
    println!("Allocations: {}, Re-allocations: {}, De-allocations: {}, Value: {:?}",
             allocs, re_allocs, de_allocs, v);
}

If lazy_static is doing its job, it should be only one. And that’s what we see:

Allocations: 1, Re-allocations: 0, De-allocations: 0, Value: (10, 10, 10, 10)

It works: the initial value for TEN is allocated once, re-used and never dropped.

How about if we change it as below?

lazy_static! {
    static ref TEN: BigUint = 10u8.into();
}

fn main() {
    let ((allocs, re_allocs, de_allocs), v) = count_alloc(|| {
        (&*TEN * &*TEN, &*TEN * &*TEN, &*TEN * &*TEN)
    });
    println!("Allocations: {}, Re-allocations: {}, De-allocations: {}, Value: {:?}",
             allocs, re_allocs, de_allocs, v);
}

The answer may be surprising:

Allocations: 4, Re-allocations: 0, De-allocations: 0, Value: (100, 100, 100)

Can you see why?

It’s tricky if you don’t have the right knowledge.

The multiplication operator (as opposed to the deref unary operator that uses the same character) is also overloadable in Rust via the Mul trait:

pub trait Mul<Rhs = Self> {
    type Output;
    fn mul(self, rhs: Rhs) -> Self::Output;
}

BigUint implements Mul for all number types and itself (both ref and value):

impl Mul<u8> for BigUint;
impl<'a, 'b> Mul<&'b u8> for &'a BigUint;
impl Mul<u16> for BigUint;
impl<'a, 'b> Mul<&'b u16> for &'a BigUint;
...
impl Mul<BigUint> for BigUint;
impl<'a> Mul<BigUint> for &'a BigUint;

In all cases type Output is BigUint, which means that when you multiply a BigUint by anything what you get back is always a new BigUint instance. In other words, the * multiplication operator allocates when used on BigUint or &BigUint.

I’m sure some people find that pretty obvious (after all, you can’t mutate the BigUint when your reference is immutable, and you can’t magically create and return a new reference with the same lifetime as the original one either), but when I am working on a task that’s already taking most of my brainpower, I think it’s easy to miss.
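To see why Output must be an owned value, here is a toy stand-in for BigUint (the type and names are mine, not num-bigint's): the heap-backed value cannot be mutated through a shared reference, so multiplying two references has no choice but to allocate and return a fresh instance.

```rust
use std::ops::Mul;

// Toy stand-in for BigUint: the value lives on the heap (Box),
// just as BigUint's digits live in a heap-allocated vector.
#[derive(Debug, PartialEq)]
struct Num(Box<u64>);

// Mirrors num-bigint's `impl Mul<&'b BigUint> for &'a BigUint`.
impl Mul for &Num {
    type Output = Num; // owned: every `*` allocates a fresh Num
    fn mul(self, rhs: &Num) -> Num {
        Num(Box::new(*self.0 * *rhs.0))
    }
}

fn main() {
    let ten = Num(Box::new(10));
    // Both operands are only borrowed, so the result must be a
    // brand-new, separately allocated value.
    let hundred = &ten * &ten;
    assert_eq!(hundred, Num(Box::new(100)));
}
```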

Now we know exactly why there are 4 allocations: one to initialize TEN on the first deref, plus one for each of the three multiplications, since each * returns a brand-new BigUint.

Nothing is dropped because the three results are returned by the lambda and are still in memory, and the initialized TEN stays in memory until the program ends because it’s static.

This shows that using lazy_static was not my actual problem!

My problem was using multiplication and addition (for the same reasons) directly on BigUint.

We can try to avoid one allocation for each operation by using *= (MulAssign) and += (AddAssign):

fn main() {
    let ((allocs, re_allocs, de_allocs), v) = count_alloc(|| {
        let mut n: BigUint = TEN.clone();
        n *= &*TEN;
        n *= &*TEN;
        n += &*TEN;
        n += &*TEN;
        n += &*TEN;
        n
    });
    println!("Allocations: {}, Re-allocations: {}, De-allocations: {}, Value: {:?}",
             allocs, re_allocs, de_allocs, v);
}

Result:

Allocations: 4, Re-allocations: 0, De-allocations: 2, Value: 1030

Unfortunately, we’re still making more allocations than necessary. It appears that each multiplication by &*TEN adds one allocation and one de-allocation. Even multiplying by &*ONE causes an extra allocation each time.

The additions, as expected, do not seem to allocate, even when performing 10,000 additions by TEN it still did not allocate anything else.

This turns out to be a performance bug in the num-bigint crate which has been recently fixed!

If the improved num-bigint crate version from the master branch is used, then the exact same code above prints this instead:

Allocations: 2, Re-allocations: 0, De-allocations: 0, Value: 1030

Nice!

As I was using the latest released version at the time, 0.4.0 (as of writing, still the latest), my algorithm was extremely slow even with the small correction discussed above that should have avoided the extra allocations.

After updating the Rust project to get this improvement, the Rust code’s performance improves nearly 2-fold and nearly matches Common Lisp:

Rust after num-bigint update VS Lisp

Notice that there are no source code changes between the two Rust versions above, just a library update.

Avoid copying Vecs if possible

The next performance mistake I had made was trying to use an immutable Vec<&String> to hold the partial solutions to the problem, which forced me to clone the vector every time a new potential word was found.

I was heavily criticized for doing this, with some people accusing me of not knowing even basic CS101.

I want to try to explain better now why I don’t think that is such a big newbie mistake.

First of all, as the previous blog post showed, cloning the vector allowed me to avoid a big “refactor” of the code base, because passing a mutable vector down the recursion would require the use of lifetimes, which can sometimes make problems a lot harder than they need to be.

You may ask why that’s the case. Here’s the reason.

The owner of the words that we can use to create a new solution is the dictionary which is loaded at startup. The type of the dictionary is:

type Dictionary = HashMap<BigUint, Vec<String>>;

This means the dictionary owns the Strings (if it didn’t, we would’ve used something like HashMap<BigUint, Vec<&str>> and something else would need to become the owner of the Strings, but there’s no good alternative here).

The partial solutions should include some of those Strings because solutions can only use words that exist in the dictionary, hence the natural type for partial solutions is Vec<&String> (i.e. a partial solution is a vector that borrows words from the dictionary).

Whoever it was who was spamming discussions around the Internet saying that I should’ve used Vec<String>, and that I was just disingenuously trying to denigrate Rust by claiming it is complex when in fact it’s all very simple: please change this code without cloning every single word and show us how it’s done.

Cloning a small vector of pointers (just a few items in most iterations, and it won’t clone the actual Strings!) is much cheaper than cloning every word (which copies the actual array of bytes representing each String).

This silly example shows what would happen if I’d tried to use Vec<String> instead, as some people confidently suggested on some forums on the Internet:

use std::collections::HashMap;

type Dictionary = HashMap<u32, Vec<String>>;

fn main() {
    let dict = create_dict();
    let mut solutions: Vec<String> = Vec::new();
    if let Some(words) = dict.get(&1u32) {
        // Compiler ERROR: expected struct `String`, found `&String`
        solutions.push(words.first().unwrap());
    }
}

fn create_dict() -> Dictionary {
    todo!()
}

Let’s not forget that solutions may also contain replacement digits that do NOT come from the dictionary. We can only push references into a vector if the reference’s lifetime is the same or longer than that of the vector it is being pushed to.

fn main() {
    let mut vector: Vec<&String> = Vec::new();
    let s1 = "s1".to_string();
    // s2 only lives until the end of this block
    {
        let s2 = "s2".to_string();
        vector.push(&s2);
    }
    // Compiler ERROR: `s2` does not live long enough
    println!("{:?}", vector);
}

My original code worked because I cloned the partial solutions vector instead of the words, then created a new instance of a String containing the replacement digit (so that both the vector and the new String had the exact same lifetime):

let mut partial_solution = words.clone();
let digit = nth_digit(digits, start).to_string();
partial_solution.push(&digit);
// recurse

Again, some people ridiculed me for doing this, claiming that this causes my algorithm to be quadratic.

That’s completely incorrect and you can only claim that if you really do not understand the problem being solved.

We only attempt to add a new word or digit to a partial solution when we actually find one that can be used! We don’t do this for every single word in the dictionary as only a tiny fraction of words will match at each iteration of the loop.

Also, the total size of the partial_solution vector in this case has a maximum bound of 50 because that’s the maximum length of each input. On average, though, as most words are quite a bit longer than 1, partial_solution should contain only as many items as the number of digits in the phone number divided by the average size of dictionary words.

In practice, that turns out to be only a couple of words at any given time. In other words, cloning that vector (which means cloning only the pointers to the Strings, not the Strings themselves) is very cheap and done rarely.
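This is easy to demonstrate with a small stdlib-only sketch: cloning a Vec<&String> copies only the references, and both vectors end up pointing at the very same String allocations.

```rust
fn main() {
    let dictionary = vec!["call".to_string(), "me".to_string()];
    // A partial solution borrows words from the dictionary.
    let partial: Vec<&String> = vec![&dictionary[0], &dictionary[1]];
    // Cloning it copies two pointers, not the strings' bytes.
    let cloned = partial.clone();
    // Both vectors point at the very same String allocations.
    assert!(std::ptr::eq(partial[0], cloned[0]));
    assert!(std::ptr::eq(partial[1], cloned[1]));
}
```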

In my previous blog post, the modification to stop doing that caused a chain reaction of other changes because of the lifetime changes, so the performance impact of doing this was lost among the impact of unrelated changes. However, Rust forbids us from using a mutable list here unless we go ahead and make all those same changes again.

For this reason, I decided to measure the impact of NOT doing this by going the other way around: make the other impactful change that was necessary without actually changing lifetimes and making the list mutable.

What was the other change? Avoiding creating a new String for each replacement digit, using a &'static str instead (because the only possible values of digits are 0 to 9).
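A sketch of that idea (the names below are mine, not the original code's): since there are only ten possible digits, each one can be a &'static str baked into the binary, so no allocation ever happens for a replacement digit.

```rust
// All ten digit strings are known at compile time.
static DIGIT_STRS: [&str; 10] = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"];

// Hypothetical helper: returns a 'static borrow, never allocates.
fn digit_str(d: u8) -> &'static str {
    DIGIT_STRS[d as usize]
}

fn main() {
    assert_eq!(digit_str(7), "7");
    // The very same static string is returned every time.
    assert!(std::ptr::eq(digit_str(7), digit_str(7)));
}
```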

It’s a tiny change, but its impact can be seen when we compare it with the previous version:

Rust after stopping allocating each digit VS Lisp and previous version

Not much, but we’re moving in the right direction.

Finally, let’s make the list mutable and introduce explicit lifetimes, as in the previous blog post, to see if all that cloning of the partial solutions really had such horrible effect as some people seemed to believe.

Rust after stopping cloning Vector VS Lisp and previous version

It’s a pretty good difference, but does it look like an algorithm going from quadratic to linear?

I say: hardly.

Moving the goal posts?

At this point, the Rust code is almost identical to the one in the previous blog post up until the Going the extra mile section.

The next sections discussed a different benchmark where I used a larger dictionary and shorter phone number lengths, which had a dramatic impact on the performance of the different programs.

As I mentioned in that blog post, the objective was to make the problem more challenging so that it ran for at least one minute. I didn’t do that because I wanted to make Rust look bad, but because I wanted to see if the trends I was observing for the smaller runs (notice that the longest run in the previous benchmark was a mere one second) would continue to hold for bigger problems, which is where speed actually matters.

Of course, I also wanted to know if once the JVM runtime had the time to “warm up”, it could catch up.

When the data came in and the JVM took a clear lead, I thought that was a fascinating result that was well worth sharing. It was also an extraordinary result, and extraordinary results require extraordinary evidence, hence this blog post.

Has Common Lisp been forgotten?

I want to clarify that I didn’t include Common Lisp in that last section purely because I spent all the time I could gather trying to make accurate measurements and equivalent modifications to the Java and Rust programs.

My lack of knowledge of Common Lisp meant that I would’ve had to spend far too much time to perform the same level of optimisations using that language, which is why I stopped including it in the longer benchmarks.

Luckily, however, the Common Lisp community has joined the debate and given me lots of suggestions on how to make the Common Lisp code run faster!

I will try to find time to write another blog post dedicated to Common Lisp as from my preliminary results, the performance you can get out of Common Lisp is just astonishing! Stay tuned.

Are we still bound by stdout?

The last criticism I would like to address is one first raised by @bugaevc in the discussion on his Pull Request, that nearly all the time spent by the optimised Rust implementation was on printing the solutions. My counter-argument was that I had seen no evidence of that in the flamegraphs I’d looked at, and that even if Rust was faster at finding solutions but slower at printing them, for the end user, what matters is the total time the application can present them with results.

However, this is an interesting question anyway because it can illuminate how we can avoid writing slow Rust programs, which is the objective of this post.

My previous blog post had clearly shown that printing out was not a big problem for the smaller benchmarks, but for the longer one used in the final sections, there was some evidence (see the above-mentioned discussion) that this was true.

Simply not printing out any solutions at all was just not acceptable to me because modern compilers are incredibly good at not doing any work if no one can observe the results anyway.

In particular, claims that Rust could actually find all solutions in seconds rather than minutes struck me as a clear sign that this is what was happening.

How can we fix that problem?

Well, there are several ways, but the one I decided to follow was to drastically reduce the number of possible solutions while not simplifying the problem to something that would allow optimisers to avoid having to first calculate them all.

I came up with the following modifications to the original algorithm:

This is very simple to implement, reducing the chances that I’ll mess up something important, while likely reducing the number of solutions to print considerably.

Code after these changes:

I understand that the new constraints make very little practical sense. The only objective with introducing them is to ensure that the number of solutions that get printed is very small without altering considerably the original programs.

Taking up from the end of the previous blog post, where we were trying to find all solutions for a single phone number of lengths 5, 10, 15, 20, 25, and finally 30… this is what the programs print to stderr when I run them again (showing only relevant output, both Java and Rust find exactly the same numbers of solutions, as expected):

Benchmarking...
Found solutions: 3, rejected: 60
Found solutions: 27, rejected: 19071
Found solutions: 0, rejected: 8460
Found solutions: 0, rejected: 1759680
Found solutions: 0, rejected: 18070560
Found solutions: 0, rejected: 21928320
Found solutions: 0, rejected: 36547200
Found solutions: 0, rejected: 201009600
Found solutions: 0, rejected: 1260878400
Cleaning up

Notice that from now on, the Rust code being benchmarked is different from the one discussed in the previous sections. It is the one from the Going the extra mile section, which uses Vec<u8> as keys, avoiding BigUint arithmetic. That change gives Rust a 45% speed increase!
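For readers who skipped the previous post, here is a rough sketch of that idea (the helper names and the keypad mapping below are my own illustration, not the original code): a word is encoded as the raw sequence of digit bytes, so dictionary lookups need no big-number arithmetic at all.

```rust
use std::collections::HashMap;

// Hypothetical phone-pad mapping, for illustration only.
fn char_to_digit(ch: char) -> Option<u8> {
    match ch.to_ascii_lowercase() {
        'a'..='c' => Some(2),
        'd'..='f' => Some(3),
        'g'..='i' => Some(4),
        'j'..='l' => Some(5),
        'm'..='o' => Some(6),
        'p'..='s' => Some(7),
        't'..='v' => Some(8),
        'w'..='z' => Some(9),
        _ => None,
    }
}

// Encode a word as plain digit bytes instead of a BigUint.
fn word_to_digits(word: &str) -> Vec<u8> {
    word.chars().filter_map(char_to_digit).collect()
}

fn main() {
    let mut dict: HashMap<Vec<u8>, Vec<String>> = HashMap::new();
    dict.entry(word_to_digits("an")).or_default().push("an".to_string());
    assert_eq!(word_to_digits("an"), vec![2, 6]);
    assert!(dict.contains_key(&vec![2u8, 6]));
}
```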

It never actually finds any solution matching my new requirements after the phone number length goes past 10, but this allows us to see just how many solutions it would find and print in the final benchmarks I’d previously run.

The number of total solutions goes from 201,009,600 using length 29 to 1,260,878,400 using length 30!

More importantly, program runtimes go down by a lot when they don’t need to print as many solutions. With the maximum phone number length it goes from 10min 16.96sec for the fastest program, which was Java3, to 28.92sec, which was now achieved by the Rust program.

This is what it looks like on a chart:

Rust VS Java printing no solutions

It might be easier to read it using a logarithmic scale:

Rust VS Java printing no solutions - log scale

The Java1 program was removed as it didn’t really add much to the conversation: we already know it’s slower than both Java3 and Rust v4, and it uses a completely different algorithm.

Given the above, I must admit that this can only be the case if Rust is much faster than Java at finding solutions (the rate varies between 1.5 and 3.7 for this example), but Java is better at printing out solutions!

To find out if this could be true, I decided to perform two extra experiments.

First, I would run benchmarks with an improved dictionary that spans the whole alphabet (in the previous run the dictionary was cropped to 100,000 words from the beginning, removing lots of words at the “bottom”) and using many phone numbers with a shorter length rather than a single one with increasing length. That should increase the space of possible solutions and allow the programs to actually find at least a few printable solutions, while making the programs again run for at least one minute (one of my goals for assessing “bigger problems”).

I would then revert the programs to start printing ALL solutions again. Because the number of solutions should be a lot higher in this setup, Java should start running faster again and more visibly if the advantage it had before was simply in printing out solutions faster!

More solutions, fewer printouts

This is the basic script I had used previously to generate the English words dictionary:

curl https://raw.githubusercontent.com/dwyl/english-words/master/words.txt --output tmp.txt
tail -n 100000 tmp.txt > words.txt
rm tmp.txt

To allow solutions to span the whole alphabet, I replaced that with this:

curl https://raw.githubusercontent.com/dwyl/english-words/master/words.txt --output tmp.txt

# the dictionary has 466k words, so keeping every 5th word gets us close to 100k
cat tmp.txt | awk "NR % 5 == 0 && NR < 500001" > words.txt
rm tmp.txt

The dictionary ends up having 93,310 words.

Here’s what we see when we run the programs as before, printing out only a few select solutions:

Solutions Printed vs Number of phone numbers

The number of solutions actually printed now grows very slowly to a maximum of just a few thousand, almost entirely removing IO from the equation.

The number of total solutions (including “rejected” ones that are not printed), however, became ridiculously large, ranging from 350 million for 1,000 phone numbers, to 1.5 billion for 5,000.

How long does it take for Java and Rust to find all of those solutions?

Solutions VS single phone number length

Both are incredibly fast, but the Rust program is now very solidly the fastest by a factor of 2, with no sign of that rate decreasing with larger inputs.

The longest Java run shown above took 1min 32.5sec, while the longest Rust run took 48.4 sec.

More solutions, even more printouts

The only question remaining is whether my previous blog post’s final benchmarks were distorted by how Java, despite finding solutions slower, was indeed better at just printing them out.

The previous results show that we will only need to go up to around a thousand phone numbers to start seeing very significant stdout overhead. By trial and error, I ended up using inputs of between 100 and 1,000 phone numbers:

Java VS Rust - Printing out again

As expected, we see the same pattern from the previous blog post, with Java taking over the lead with larger inputs, if just by a very small margin (up to 6% faster in this run).

The largest input size shown above is 1,000 phone numbers, which was the smallest in the previous section.

We can see that the run time goes from 10sec without printing many solutions, to over 1min 20sec printing all solutions, again indicating that programs were severely struggling to print out all solutions in the previous blog post.

Why didn’t the Rust flamegraphs and Java profiler show more clearly the impact of printing solutions? It seems that the problem is that nearly all time is spent on invocations of print_translations. As that function is called recursively, that’s basically all we see in the stack traces. If you know how to “filter out” recursive calls in order to make profiling recursive algorithms more easily, do let me know and I will update this section with a note on that!

Bonus: Rust VS Rust using a faster hash

Finally, one more question that I’d left unaddressed in my previous blog post was about the Rust HashMap problem.

Some critics pointed out that Rust’s standard library HashMap uses a DoS-resistant hashing function, unlike most other languages’ (Java is apparently DoS-resistant as well, but thanks to turning overloaded buckets into balanced trees internally rather than using a slower hash).

Anyway, this section is about Rust only!

How much impact does swapping the hashing function to ahash have on our latest benchmarks, which avoid IO issues and search a huge space of solutions (so hashing performance should matter a lot)?

Rust VS Rust - ahash

Big difference! On average, 46% faster.
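For reference, swapping the hasher is a type-level change, because HashMap is generic over its BuildHasher. The sketch below uses a toy FNV-1a hasher so it stays stdlib-only; ahash plugs in through exactly the same mechanism (e.g. HashMap<K, V, ahash::RandomState>).

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy FNV-1a hasher: fast, but NOT DoS-resistant, unlike the
// default SipHash. For illustration of the plumbing only.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
}

// Same shape as using ahash: only the hasher type changes.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut m: FastMap<Vec<u8>, &str> = FastMap::default();
    m.insert(vec![1, 2, 3], "abc");
    assert_eq!(m.get(&vec![1, 2, 3]), Some(&"abc"));
}
```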

Conclusion

🏁

I hope that you found this post useful and that it adequately addresses the flaws and misunderstandings in my previous ones.

I continue to stand by many of the findings in my previous blog posts. For example, that:

However, the findings in this blog post make me reconsider a few, while reinforcing other points:

Thanks for reading and stay tuned for performance improvements in other languages based on the findings of this series of posts!

PS. A rant

😡

Being judged on the Internet can be an uncomfortable experience. However, all of us who like to write blog posts about things we find interesting need to be aware that everything we say can affect the decisions and thoughts of many people out in the real world, hence being judged is inevitable and I know that.

We do have the responsibility to review what we write carefully to make sure we don’t publish misleading content or things that may affect others negatively, even if we get absolutely nothing in return (as in my case: I don’t run ads, newsletters or anything like that), because that’s just the right thing to do.

However, I think that participants of Internet discussions need to be aware that sometimes people can make honest mistakes. Critics can be very unfair and make baseless claims using a tone that may make them seem like experts.

I think that in the case of my previous blog post, both things happened. I was unjustly accused of intentionally trying to badmouth Rust and of not knowing basic computer science concepts, while other criticism, such as not controlling for the effect of IO sufficiently, was clearly valid.

In reality, I am a big Rust fan, but not one that will blindly defend it (or any other technology that I enjoy using) and pretend that it has absolutely no problems at all and that all other languages suck in comparison. I always try to make rational decisions based mostly on cold data (and just a little on my own taste, I’m human after all), and that’s what my blog posts are about: gathering data and making comparisons that will be useful to me and, hopefully, to others who read my posts.

End of rant. Thanks for reading.