How to write really slow Rust code - Part 2
This blog post is a follow-up to How to write really slow Rust code, which was itself a follow-up to a paper analysis discussing how the choice of programming language affects developer productivity and program efficiency.
I have received a lot of feedback after these posts, especially the latest one, so I would like to clarify a few points that I don’t think were made clear enough, and respond to some criticism that I think was very unfair.
I will also go further and show even more ways of making Rust code faster!
Why was Rust slow, again?
As my previous blog post showed, nearly half of the performance difference between Rust and Common Lisp was due to improper use of BigUint values (from the num-bigint crate, as Rust does not natively support big numbers). I had failed to see that BigUint can operate directly on other number types, which avoids having to allocate so many of them.
That means I didn’t need to do this:
lazy_static! {
    static ref ONE: BigUint = 1.to_biguint().unwrap();
    static ref TEN: BigUint = 10.to_biguint().unwrap();
}

fn word_to_number(word: &str) -> BigUint {
    let mut n = ONE.clone();
    for ch in word.chars() {
        if ch.is_alphabetic() {
            n = &n * (&*TEN) + &char_to_digit(ch);
        }
    }
    n
}
Notice that static ref ONE: BigUint = 1.to_biguint().unwrap() was done because BigUint does not implement From<i32>, and I was too distracted by the rest of the problems I was having to notice that it does implement From for all unsigned number types, so this would’ve worked: static ref ONE: BigUint = 1u8.into();
And I could do this instead (char_to_digit was also modified to return Option<u8> instead of u32):
fn word_to_number(word: &str) -> BigUint {
    let mut n = 1u8.into();
    for digit in word.chars().filter_map(char_to_digit) {
        n *= 10u8;
        n += digit;
    }
    n
}
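In case you’re wondering, char_to_digit could be implemented roughly like this (a sketch, not my exact code), using the letter-to-digit mapping from the original phone-encoding problem:

fn char_to_digit(ch: char) -> Option<u8> {
    // the classic phone-encoding mapping: each letter encodes one digit,
    // anything else (e.g. quotes and dashes) encodes nothing
    match ch.to_ascii_lowercase() {
        'e' => Some(0),
        'j' | 'n' | 'q' => Some(1),
        'r' | 'w' | 'x' => Some(2),
        'd' | 's' | 'y' => Some(3),
        'f' | 't' => Some(4),
        'a' | 'm' => Some(5),
        'c' | 'i' | 'v' => Some(6),
        'b' | 'k' | 'u' => Some(7),
        'l' | 'o' | 'p' => Some(8),
        'g' | 'h' | 'z' => Some(9),
        _ => None,
    }
}

Returning Option<u8> is what lets the caller use filter_map to skip non-alphabetic characters, as the loop above does.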
It wasn’t very clear in the previous blog post why this made a big difference, so let’s look into it in more detail.
Avoiding allocating too many BigUints
In the example above, we always need to allocate at least one new BigUint instance to initialize n (which we want to return), but that only happens once per word, so it’s not a big problem.
However, the multiplication in the loop happens for every single character in the word that can be converted to a digit.
Even though I had made an attempt at NOT allocating a new BigUint inside that loop by writing (&*TEN), something went wrong and the code was still allocating too many BigUints! But was it really because of &*TEN?
To figure out what’s happening, we need to understand how dereferencing works in Rust and what exactly the lazy_static macro does.
At first sight, doing &*s may look pointless: *s means dereference s, and &s means reference s, so you might think you just get s back. That doesn’t always hold in Rust.
The reason is that several Rust operators are overloadable, including Deref (the trait associated with the dereference operator). Hence, if a type implements Deref, you must know what type the implementation returns to know the type of *s. Types implementing Deref are also subject to implicit deref coercion, which is what allows, for example, passing a &String to methods expecting a &str (String derefs to str), something we did in the previous blog post.
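A tiny example of deref coercion at work (a toy, just to illustrate the mechanism):

fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    let owned = String::from("hello");
    // &String is accepted where &str is expected because
    // String implements Deref<Target = str>
    println!("{}", shout(&owned));
}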
The &*s construct even has a name: reborrow. For a nice explanation of why Deref was invented, see the accepted answer to the question Why does Option<&String> not coerce to Option<&str>?
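To see why a reborrow is not a no-op, consider a smart pointer (another illustrative toy):

fn main() {
    let boxed: Box<String> = Box::new(String::from("hi"));
    // *boxed derefs Box<String> to String, and & borrows the result,
    // so &*boxed is a &String, not a Box<String>
    let s: &String = &*boxed;
    println!("{}", s);
}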
Now we can look at what the lazy_static macro does. From the docs:
For a given static ref NAME: TYPE = EXPR;, the macro generates a unique type that implements Deref<TYPE> and stores it in a static with name NAME. On first deref, EXPR gets evaluated and stored internally, such that all further derefs can return a reference to the same object.
This is very useful if you need to store immutable, non-const global data in your program because with “normal” Rust, that’s basically impossible.
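Conceptually, it’s the same “initialize once, on first access” pattern you could hand-roll with the once_cell crate (a sketch; the code lazy_static actually generates is different):

use num_bigint::BigUint;
use once_cell::sync::Lazy;

// the closure only runs on the first access; later accesses
// return a reference to the same stored value
static TEN: Lazy<BigUint> = Lazy::new(|| 10u8.into());

fn main() {
    println!("{}", &*TEN);
}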
This makes it clear why, when using lazy_static, the only way to access the data is by dereferencing it: the macro makes sure to implement Deref such that the unique type it generates for each value derefs to the declared type.

From the description, it looks like my implementation should have worked as intended: after all, the value is initialized once, then the same object is returned every time. But it doesn’t, because of something else I didn’t fully understand.
Let’s use the alloc_counter crate to keep track of how many allocations the code is doing. How many allocations do you think the following code makes?
lazy_static! {
    static ref TEN: BigUint = 10u8.into();
}

fn main() {
    let ((allocs, re_allocs, de_allocs), v) = count_alloc(|| {
        (&*TEN, &*TEN, &*TEN, &*TEN)
    });
    println!("Allocations: {}, Re-allocations: {}, De-allocations: {}, Value: {:?}",
        allocs, re_allocs, de_allocs, v);
}
If lazy_static is doing its job, it should be only one. And that’s what we see:
Allocations: 1, Re-allocations: 0, De-allocations: 0, Value: (10, 10, 10, 10)
It works: the initial value for TEN is allocated once, re-used and never dropped.
How about if we change it as below?
lazy_static! {
    static ref TEN: BigUint = 10u8.into();
}

fn main() {
    let ((allocs, re_allocs, de_allocs), v) = count_alloc(|| {
        (&*TEN * &*TEN, &*TEN * &*TEN, &*TEN * &*TEN)
    });
    println!("Allocations: {}, Re-allocations: {}, De-allocations: {}, Value: {:?}",
        allocs, re_allocs, de_allocs, v);
}
The answer may be surprising:
Allocations: 4, Re-allocations: 0, De-allocations: 0, Value: (100, 100, 100)
Can you see why?
It’s tricky if you don’t have the right knowledge.
The multiplication operator (not to be confused with the unary deref operator, which uses the same character) is also overloadable in Rust, via the Mul trait:
pub trait Mul<Rhs = Self> {
    type Output;

    fn mul(self, rhs: Rhs) -> Self::Output;
}
BigUint implements Mul for all number types and for itself (both by reference and by value):
impl Mul<u8> for BigUint;
impl<'a, 'b> Mul<&'b u8> for &'a BigUint;
impl Mul<u16> for BigUint;
impl<'a, 'b> Mul<&'b u16> for &'a BigUint;
...
impl Mul<BigUint> for BigUint;
impl<'a> Mul<BigUint> for &'a BigUint;
In all cases, type Output is BigUint, which means that when you multiply a BigUint by anything, what you get back is always a new BigUint instance. In other words, the * multiplication operator allocates when used on BigUint or &BigUint.
I’m sure some people think that’s pretty obvious (after all, you can’t mutate the BigUint when your reference is immutable, and you can’t magically create and return a new reference that has the same lifetime as the original one either), but when I’m working on a task that’s already taking most of my brainpower, it’s easy to miss.
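As a concrete illustration (a toy example, not code from the project):

use num_bigint::BigUint;

fn main() {
    let a: BigUint = 6u8.into();
    let b: BigUint = 7u8.into();
    // both operands are borrowed, yet the result is a brand-new
    // BigUint, so the multiplication must allocate to store it
    let c = &a * &b;
    println!("{} * {} = {}", a, b, c);
}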
Now we know exactly why there are 4 allocations:

- 1 allocation to initialize TEN.
- 1 allocation for each of the 3 multiplications.

Nothing is dropped because the three results are returned by the lambda and are still in memory, and the initialized TEN stays in memory until the program ends because it’s static.
This shows that using lazy_static was not my actual problem! My problem was using multiplication and addition (which allocates for the same reasons) directly on BigUint.
We can try to avoid one allocation per operation by using *= (MulAssign) and += (AddAssign):
fn main() {
    let ((allocs, re_allocs, de_allocs), v) = count_alloc(|| {
        let mut n: BigUint = TEN.clone();
        n *= &*TEN;
        n *= &*TEN;
        n += &*TEN;
        n += &*TEN;
        n += &*TEN;
        n
    });
    println!("Allocations: {}, Re-allocations: {}, De-allocations: {}, Value: {:?}",
        allocs, re_allocs, de_allocs, v);
}
Result:
Allocations: 4, Re-allocations: 0, De-allocations: 2, Value: 1030
Unfortunately, we’re still making more allocations than necessary. It appears that each multiplication by &*TEN adds one allocation and one de-allocation; even multiplying by &*ONE causes an extra allocation each time. The additions, as expected, do not seem to allocate: even when performing 10,000 additions of TEN, nothing else was allocated.
This turns out to be a performance bug in the num-bigint crate, which was recently fixed!
If the improved num-bigint version from the master branch is used, the exact same code prints this instead:
Allocations: 2, Re-allocations: 0, De-allocations: 0, Value: 1030
Nice!
As I was using the latest released version at the time, 0.4.0 (as of writing, still the latest), my algorithm was extremely slow even with the small correction discussed above, which should have avoided the extra allocations.
After updating the Rust project to get this improvement, the Rust code’s performance improves nearly 2-fold and nearly matches Common Lisp:
Notice that there are no source code changes between the two Rust versions above, just a library update.
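If you want to try the unreleased fix yourself, pointing Cargo at the crate’s git repository is one way to do it (a sketch; pinning an exact rev would make the benchmark more reproducible):

[dependencies]
# assumes the fix is on the default branch of the upstream repository
num-bigint = { git = "https://github.com/rust-num/num-bigint" }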
Avoid copying Vecs if possible
The next performance mistake I had made was trying to use an immutable Vec<&String> to hold the partial solutions to the problem, which forced me to clone the vector every time a new potential word was found.
I was heavily criticized for doing this, with some people accusing me of not even knowing basic CS101. I want to explain now why I don’t think it is such a big newbie mistake.
First of all, as the previous blog post showed, cloning the vector allowed me to avoid a big “refactor” of the code base: passing a mutable vector down the recursion would require the use of explicit lifetimes, which can sometimes make problems a lot harder than they need to be.
You may ask why that’s the case. Here’s the reason.
The owner of the words we can use to create a new solution is the dictionary, which is loaded at startup. The type of the dictionary is:
type Dictionary = HashMap<BigUint, Vec<String>>;
This means the dictionary owns the Strings (if it didn’t, we would’ve used something like HashMap<BigUint, Vec<&str>> and something else would need to own the Strings, but there’s no good alternative here). The partial solutions should include some of those Strings, because solutions can only use words that exist in the dictionary; hence the natural type for a partial solution is Vec<&String> (i.e. a vector that borrows words from the dictionary).
This silly example shows what would happen if I’d tried to use Vec<String> instead, as some people on some Internet forums confidently suggested I should’ve done.

Whoever it was who was spamming all discussions around the Internet saying that I should’ve used Vec<String>, and that I was just disingenuously trying to denigrate Rust by claiming it is complex when in fact it’s all very simple: please change this code without having to clone every single word, and show us how it’s done.

Remember that cloning a small vector of pointers (just a few items in most iterations, and it won’t clone the actual Strings!) is much cheaper than cloning every word (which copies the actual array of bytes representing each String).
use std::collections::HashMap;

type Dictionary = HashMap<u32, Vec<String>>;

fn main() {
    let dict = create_dict();
    let mut solutions: Vec<String> = Vec::new();
    if let Some(words) = dict.get(&1u32) {
        // Compiler ERROR: expected struct `String`, found `&String`
        solutions.push(words.first().unwrap());
    }
}

fn create_dict() -> Dictionary {
    todo!()
}
Let’s not forget that solutions may also contain replacement digits that do NOT come from the dictionary. We can only push a reference into a vector if the reference lives at least as long as the vector itself, as this example shows:
fn main() {
    let mut vector: Vec<&String> = Vec::new();
    let s1 = "s1".to_string();
    vector.push(&s1);
    // s2 only lives until the end of this block
    {
        let s2 = "s2".to_string();
        // Compiler ERROR: `s2` does not live long enough
        vector.push(&s2);
    }
    println!("{:?}", vector);
}
My original code worked because I cloned the partial solutions vector instead of the words, then created a new instance of a String containing the replacement digit (so that both the vector and the new String had the exact same lifetime):
let mut partial_solution = words.clone();
let digit = nth_digit(digits, start).to_string();
partial_solution.push(&digit);
// recurse
Again, some people ridiculed me for doing this, claiming that it makes my algorithm quadratic. That’s completely incorrect, and you could only claim that if you did not understand the problem being solved. We only attempt to add a new word or digit to a partial solution when we actually find one that can be used! We don’t do this for every single word in the dictionary, as only a tiny fraction of words will match at each iteration of the loop.
Also, the size of the partial_solution vector is bounded at 50, because that’s the maximum length of each input. On average, as most words are quite a bit longer than one character, partial_solution should contain only about as many items as the number of digits in the phone number divided by the average length of dictionary words. In practice, that turns out to be just a couple of items per matching word. In other words, cloning that vector (which clones only the pointers to the Strings, not the Strings themselves) is very cheap and done rarely.
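A quick sanity check of that claim (a toy example):

fn main() {
    let word = String::from("hello");
    let partial: Vec<&String> = vec![&word];
    // cloning copies one pointer per element, never the String's bytes
    let cloned = partial.clone();
    assert!(std::ptr::eq(partial[0], cloned[0]));
    println!("both vectors share the same String: {}", cloned[0]);
}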
In my previous blog post, the modification to stop doing this caused a chain reaction of other changes because of the lifetimes involved, so its performance impact was lost among the impact of unrelated changes. Rust, however, forbids using a mutable list here unless we go ahead and make all those same changes again. For this reason, I decided to measure the impact of NOT cloning by going the other way around: making the other impactful change on its own, without touching lifetimes or making the list mutable.
What was the other change? Avoiding creating a new String for each replacement digit by using a &'static str instead (possible because the only digit values are 0 to 9).
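A minimal sketch of that change (the names are my own): since a replacement digit can only be 0 to 9, every possible string is known at compile time and no String needs to be allocated per digit.

// all ten possible digit strings, baked into the binary
const DIGIT_STRS: [&str; 10] = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"];

fn digit_str(digit: u8) -> &'static str {
    DIGIT_STRS[digit as usize]
}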
It’s a tiny change, but its impact can be seen when we compare it with the previous version:
Not much, but we’re moving in the right direction.
Finally, let’s make the list mutable and introduce explicit lifetimes, as in the previous blog post, to see if all that cloning of the partial solutions really had the horrible effect some people seemed to believe.

It’s a pretty good difference, but does it look like an algorithm going from quadratic to linear? I say: hardly.
Moving the goal posts?
At this point, the Rust code is almost identical to the one in the previous blog post, up until the Going the extra mile section.
The next sections discussed a different benchmark where I used a larger dictionary and shorter phone number lengths, which had a dramatic impact on the performance of the different programs.
As I mentioned in that blog post, the objective was to make the problem challenging enough that it ran for at least one minute. I didn’t do that to make Rust look bad, but because I wanted to see if the trends I was observing in the smaller runs (notice that the longest run in the previous benchmark was a mere one second) would continue to hold for bigger problems, which is where speed actually matters.
Of course, I also wanted to know if once the JVM runtime had the time to “warm up”, it could catch up.
When the data came in and the JVM took a clear lead, I thought that was a fascinating result that was well worth sharing. It was also an extraordinary result, and extraordinary results require extraordinary evidence, hence this blog post.
Has Common Lisp been forgotten?
I want to clarify that I didn’t include Common Lisp in that last section purely because I spent all the time I had trying to make accurate measurements and equivalent modifications to the Java and Rust programs.
My lack of knowledge of Common Lisp meant that I would’ve had to spend far too much time to perform the same level of optimisations using that language, which is why I stopped including it in the longer benchmarks.
Luckily, however, the Common Lisp community has joined the debate and given me lots of suggestions on how to make the Common Lisp code run faster!
I will try to find time to write another blog post dedicated to Common Lisp because, from my preliminary results, the performance you can get out of it is just astonishing! Stay tuned.
Are we still bound by stdout?
The last criticism I would like to address is one first raised by @bugaevc in the discussion on his Pull Request: that nearly all the time spent by the optimised Rust implementation was on printing the solutions. My counter-argument was that I had seen no evidence of that in the flamegraphs I’d looked at, and that even if Rust was faster at finding solutions but slower at printing them, what matters for the end user is the total time the application takes to present its results.
However, this is an interesting question anyway because it can illuminate how we can avoid writing slow Rust programs, which is the objective of this post.
My previous blog post had clearly shown that printing out was not a big problem for the smaller benchmarks, but for the longer one used in the final sections, there was some evidence (see the above-mentioned discussion) that this was true.
Simply not printing out any solutions at all was just not acceptable to me because modern compilers are incredibly good at not doing any work if no one can observe the results anyway.
In particular, claims that Rust could actually find all solutions in seconds rather than minutes struck me as a clear sign that this is what was happening.
How can we fix that problem?
Well, there are several ways, but the one I decided to follow was to drastically reduce the number of solutions that get printed, while not simplifying the problem to something that would let optimisers avoid computing them all in the first place.
I came up with the following modifications to the original algorithm:
- printed solutions must contain only interleaved words and digits.
- all words in printed solutions must be of the exact same length.
- the program must print both the number of printed and rejected solutions.
This is very simple to implement, reducing the chances that I’ll mess up something important, while likely reducing the number of solutions to print considerably.
Code after these changes:
I understand that the new constraints make very little practical sense. The only objective in introducing them is to ensure that the number of solutions that get printed is very small, without considerably altering the original programs.
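To illustrate the idea, here’s a minimal sketch of what such a filter could look like (the function and its name are my own, not the actual code from either program):

fn is_printable(solution: &[&str]) -> bool {
    let mut word_len: Option<usize> = None;
    let mut prev_was_digit: Option<bool> = None;
    for part in solution {
        let is_digit = part.len() == 1 && part.chars().all(|c| c.is_ascii_digit());
        // words and digits must be interleaved: reject two of a kind in a row
        if prev_was_digit == Some(is_digit) {
            return false;
        }
        prev_was_digit = Some(is_digit);
        if !is_digit {
            match word_len {
                None => word_len = Some(part.len()),
                // all words must have the exact same length
                Some(len) if len != part.len() => return false,
                _ => {}
            }
        }
    }
    true
}

fn main() {
    assert!(is_printable(&["mot", "1", "bar"]));
    assert!(!is_printable(&["mot", "bar"])); // two words in a row
}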
Taking up from the end of the previous blog post, where we were trying to find all solutions for a single phone number of lengths 5, 10, 15, 20, 25 and finally 30, this is what the programs print to stderr when I run them again (showing only relevant output; both Java and Rust find exactly the same numbers of solutions, as expected):
Benchmarking...
Found solutions: 3, rejected: 60
Found solutions: 27, rejected: 19071
Found solutions: 0, rejected: 8460
Found solutions: 0, rejected: 1759680
Found solutions: 0, rejected: 18070560
Found solutions: 0, rejected: 21928320
Found solutions: 0, rejected: 36547200
Found solutions: 0, rejected: 201009600
Found solutions: 0, rejected: 1260878400
Cleaning up
Notice that from now on, the Rust code being benchmarked is different from the one discussed in the previous sections. It is the one from the Going the extra mile section, which uses Vec<u8> as keys, avoiding BigUint arithmetic. That change gives Rust a 45% speed increase!
It never actually finds any solution matching my new requirements after the phone length goes past 10, but this allows us to see just how many solutions it would find and print in the final benchmarks I’d previously run. The number of total solutions goes from 201,009,600 at length 29 to 1,260,878,400 at length 30!
More importantly, program runtimes go down by a lot when they don’t need to print as many solutions. With the maximum phone number length, the total time goes from 10min 16.96sec for the fastest program, which was Java3, down to 28.92sec, now achieved by the Rust program.
This is what it looks like on a chart:
It might be easier to read it using a logarithmic scale:
The Java1 program was removed because it didn’t add much to the conversation: we already know it’s slower than both Java3 and Rust v4, and it uses a completely different algorithm.
Given the above, I must admit that this can only be the case if Rust is much faster than Java at finding solutions (the ratio varies between 1.5 and 3.7 in this example), but Java is better at printing them out!
To find out if this could be true, I decided to perform two extra experiments.
First, I would run benchmarks with an improved dictionary that spans the whole alphabet (in the previous runs, the dictionary was cropped to the last 100,000 words of the file, dropping lots of words from the “top” of the alphabet) and use many phone numbers of a shorter length rather than a single one of increasing length. That should increase the space of possible solutions and allow the programs to actually find at least a few printable solutions, while making them again run for at least one minute (one of my goals for assessing “bigger problems”).
I would then revert the programs to printing ALL solutions again. Because the number of solutions should be a lot higher in this setup, Java should pull ahead again, and more visibly so, if the advantage it had before was simply in printing out solutions faster!
More solutions, fewer printouts
This is the basic script I had used previously to generate the English words dictionary:
curl https://raw.githubusercontent.com/dwyl/english-words/master/words.txt --output tmp.txt
tail -n 100000 tmp.txt > words.txt
rm tmp.txt
To allow solutions to span the whole alphabet, I replaced that with this:
curl https://raw.githubusercontent.com/dwyl/english-words/master/words.txt --output tmp.txt
# the dictionary has 466k words, so keeping every 5th one gets us close to 100k
awk 'NR % 5 == 0 && NR < 500001' tmp.txt > words.txt
rm tmp.txt
The dictionary ends up having 93,310 words.
Here’s what we see when we run the programs as before, printing out only a few select solutions:
The number of solutions actually printed now grows very slowly to a maximum of just a few thousand, almost entirely removing IO from the equation.
The number of total solutions (including “rejected” ones that are not printed), however, became ridiculously large, ranging from 350 million for 1,000 phone numbers, to 1.5 billion for 5,000.
How long does it take for Java and Rust to find all of those solutions?
Both are incredibly fast, but the Rust program is now very solidly the fastest, by a factor of 2, with no sign of that ratio decreasing for larger inputs.
The longest Java run shown above took 1min 32.5sec, while the longest Rust run took 48.4sec.
More solutions, even more printouts
The only question remaining is whether my previous blog post’s final benchmarks were distorted by how Java, despite finding solutions slower, was indeed better at just printing them out.
The previous results show that we only need to go up to around a thousand phone numbers to start seeing very significant stdout overhead. By trial and error, I ended up using inputs of between 100 and 1,000 phone numbers:
As expected, we see the same pattern from the previous blog post, with Java taking the lead for larger inputs, if only by a very small margin (up to 6% faster in this run).
The largest input size shown above is 1,000 phone numbers, which was the smallest in the previous section.
We can see that the run time goes from 10sec when printing only a few solutions to over 1min 20sec when printing all of them, again indicating that the programs were severely struggling to print all solutions in the previous blog post.
Why didn’t the Rust flamegraphs and the Java profiler show the impact of printing solutions more clearly? It seems the problem is that nearly all time is spent on invocations of print_translations. As that function is called recursively, that’s basically all we see in the stack traces. If you know how to “filter out” recursive calls in order to make profiling recursive algorithms easier, do let me know and I will update this section with a note on that!
Bonus: Rust vs Rust using a faster hash
Finally, one more question that I’d left unaddressed in my previous blog post was about Rust’s HashMap.
Some critics pointed out that Rust’s standard library HashMap uses a DoS-resistant hashing function by default, unlike most other languages, so it pays a performance cost up front (Java is apparently DoS-resistant as well, but it achieves that by turning heavily colliding buckets into balanced trees internally rather than by using a slower hash).
Anyway, this section is about Rust only! How much impact does swapping the hashing function to ahash make in our latest benchmarks, which avoid the IO issues and search a huge space of solutions (so hashing performance should matter a lot)?
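For reference, this is roughly what the swap looks like (the type alias and constructor are assumptions about how the program is structured):

use std::collections::HashMap;
use ahash::RandomState;

// same map shape as before, but hashed with ahash instead of SipHash
type Dictionary = HashMap<Vec<u8>, Vec<String>, RandomState>;

fn new_dictionary() -> Dictionary {
    HashMap::with_hasher(RandomState::new())
}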
Big difference! On average, 46% faster.
Conclusion
I hope that you found this post useful and that it adequately addresses the flaws and misunderstandings in my previous ones.
I continue to stand by many of the findings in my previous blog posts. For example, that:
- Rust requires more attention than many other languages to avoid performance pitfalls.
- Java’s JIT can optimise hot paths just enough to make Java programs run competitively fast even when badly written.
- Java and Common Lisp programs can run as fast as Rust programs in certain cases, especially before optimisations are made.
- Choice of algorithm is much more important than choice of language.
- The effect of the programming language on a program’s speed (and on a programmer’s productivity) is often exaggerated in practice.
However, the findings in this blog post make me reconsider a few of those points, while reinforcing others:
- Optimised Rust code should run 2 to 3 times faster than optimised Java code (without going to extremes).
- A Rust program that runs slower than equivalent programs in other languages usually indicates performance bugs in the Rust program or in the libraries it uses.
- Java seems to handle IO very well, to the point that IO-bound tasks may run faster in Java than in Rust.
- Performance can rarely be guessed; the only way to know what’s faster is to measure.
Thanks for reading and stay tuned for performance improvements in other languages based on the findings of this series of posts!
PS. A rant
Being judged on the Internet can be an uncomfortable experience. However, all of us who like to write blog posts about things we find interesting need to be aware that everything we say can affect the decisions and thoughts of many people out in the real world, hence being judged is inevitable, and I know that.
We do have the responsibility to review what we write carefully, to make sure we don’t publish misleading content or things that may affect others negatively, even if we get absolutely nothing in return (as is my case: I don’t run ads, newsletters or anything like that), because that’s just the right thing to do.
However, I think that participants of Internet discussions need to be aware that sometimes people can make honest mistakes. Critics can be very unfair and make baseless claims using a tone that may make them seem like experts.
I think that in the case of my previous blog post, both things happened. I was unjustly accused of intentionally trying to badmouth Rust and of not knowing basic computer science concepts, while other criticism, such as not controlling for the effect of IO sufficiently, was clearly valid.
In reality, I am a big Rust fan, but not one that will blindly defend it (or any other technology that I enjoy using) and pretend that it has absolutely no problems at all and that all other languages suck in comparison. I always try to make rational decisions based mostly on cold data (and just a little on my own taste, I’m human after all), and that’s what my blog posts are about: gathering data and making comparisons that will be useful to me and, hopefully, to others who read my posts.
End of rant. Thanks for reading.