Revenge of Lisp (Part 2⁄2)
Note: This is a follow-up blog post to Revenge of Lisp - Learning Common Lisp to beat Java and Rust on a phone encoding problem. Please read that blog post first for context.
In the previous Part of this blog post, I introduced basic Common Lisp to non-Lispers and discussed the phone number encoding problem, as well as Peter Norvig’s quick implementation of a solution in CL - which are the basis for this post.
We also saw that the CL code had very competitive performance, running faster than Java and Rust for the most part, while consuming much less memory than Java, and not very far from Rust’s nearly optimal memory usage.
However, the Java and Rust implementations were, as CL’s, written without much thought given to performance, after all the description of the original study which introduced the problem asked participants to focus on correctness and readability, not performance.
Besides that, there were some problems with the benchmark, as it was originally meant only for a quick comparison between the languages, not as a formal proof of relative performance. In any case, if we’re going to mention performance, I acknowledge that some improvements are required in order to avoid misleading the reader, so I spent quite some time thinking about it, and designing a few changes to both how the benchmarks are run and to how programs should be implemented in order to get a more accurate picture of how each language really performs.
In this post, I will discuss these improvements, then focus on the Common Lisp implementation and the available techniques for making it run faster. As with my previous Rust blog posts, I will show how each modification affects the performance of the code. The objective is to get the fastest implementation possible, and having optimised Java and Rust implementations to compare against is a motivator to keep going until there really isn’t anything else that can be tried!
Do you think Common Lisp can run as fast as well-optimised Rust? Read on if you want to find out.
Improving the benchmark
Note: feel free to skip this section if you’re only interested in learning how to make CL run faster!
In order to get a more accurate picture of how Java, Rust and CL perform relatively to each other, at least for things that can be measured by the implementations being used, I have made a few improvements to the benchmark, based on lots of feedback from readers and just thinking about what it is that we want to measure.
One thing that I, at least, find very important, is knowing how the language performs on a task from a cold start (i.e. time measured from process start, without warmup). Besides being more realistic if that’s how users are expected to run the application, which is definitely the case here, it is much harder to compare different language implementations otherwise.
However, it’s also interesting to know how things may change for tasks that need to run for much longer, because that’s when good performance matters the most. A good solution for a hard problem may run in hours or even minutes rather than days! In cases like this, languages like Java, with its advanced JIT compiler, may have the edge as the native code being executed can be automatically optimised on-the-fly based on actual runtime information, something not available to native binaries like those compiled from Rust.
For this reason, I decided to continue starting the benchmark using relatively small input files, but to also introduce a larger dictionary file and much longer phone number files.
As the Rust blog posts revealed, however, when using larger dictionaries and phone number files, most of the time
is spent on printing solutions rather than searching for them. We don’t want to measure just how fast the programs
can print to stdout
! To fix that, I decided to change all programs to support a new argument with one of the
values: print
or count
. With print
, the programs should behave as they are currently. With count
, instead
of printing all solutions, they should only print how many solutions were found. This removes all IO overhead.
German and English dictionaries
The programs search for words matching phone numbers’s encodings from a dictionary file. This dictionary was a German one in the original study, but I added also a larger English dictionary in the Rust blog posts. There was an issue, however, when using the English dictionary: all programs became extremely slow unless I only used short phone numbers as input. I was baffled by this but kind of worked around this problem in the previous blog posts by just using shorter numbers (or even just a single phone number, as even a long number can make the programrun for a lot of time). However, for this article, I decided to investigate further.
The problem was quite unexpected! To find out what was wrong, I had to do a little analysis to find out what’s different between the dictionaries.
First of all, the dictionaries are not extremely different in size, the English dictionary is just around 1⁄5 larger (words.txt
is English, dictionary.txt
is German - should’ve used better names!):
â–¶ du -h dictionary.txt
920K dictionary.txt
â–¶ du -h words.txt
952K words.txt
â–¶ wc -l dictionary.txt
73113 dictionary.txt
â–¶ wc -l words.txt
93310 words.txt
I wrote a little CL script to extract more details as I thought that, maybe, there was a skew towards certain letters which could cause the search space to grow larger somehow.
This little script counts the occurrence of each letter in the dictionaries:
(defglobal *counts* (make-hash-table :size 36))
(defun count-chars (word)
"Counts each different alphabetical character in the given word"
(loop for c across word
when (alpha-char-p c) do
(incf (gethash (char-downcase c) *counts* 0))))
(defun count-chars-in-file (file)
(with-open-file (in file)
(loop for word = (read-line in nil) while word do
(count-chars word))))
(defun print-counts ()
"Print the counts for each char"
(loop for c from (char-code #\a) to (char-code #\z) do
(let* ((ch (code-char c))
(count (gethash (char-downcase (code-char c)) *counts* 0)))
(format t "~a: ~a~%" ch count))))
(let ((args (rest sb-ext:*posix-argv*)))
(if (= 1 (length args))
(progn
(count-chars-in-file (first args))
(print-counts))
(error "expected one argument")))
I loved writing these scripts in Common Lisp, CL might just become my new favourite scripting language!
Plotting the resulting data, it’s clear this is very unlikely to be a problem, as both dictionaries have essentially the same distribution, or pretty close (English seems to like y
much more than German):
I wrote another, very similar script, to count the length of words to see if the distribution differs significantly between English and German.
Here’s the result:
As there are more English words than German words in the dictionaries I am using, the blue bars are higher, as expected, but we can also see that German words tend to be longer… but there’s a single, huge outlier in the English dictionary with a length of 45 characters! Here’s the culprit:
pneumonoultramicroscopicsilicovolcanoconiosis
I thought that this word could be causing the slow-down, so I removed it from the dictionary… and that made no difference at all, so that was a red-herring.
The dictionaries seem to have similarly sized words (German being slightly longer), and similar distribution of
characters, so from the algorithm point-of-view, they are almost the same! Still, the difference in runtime
using them is enormous. Here’s an example run using the fastest Rust implementation to count solutions,
using truncated German/English dictionaries of the same length, 30,000
words:
â–¶ time src/rust/phone_encoder/target/release/phone_encoder count words-30_000.txt input.txt
101435638
10.34s user 0.04s system 99% cpu 10.440 total
â–¶ time src/rust/phone_encoder/target/release/phone_encoder count dict-30_000.txt input.txt
27
0.05s user 0.01s system 90% cpu 0.067 total
This reveals the root problem (if not the cause): even using only the same amount of words,
101,435,638
solutions are found in English, but only 27
in German!!!
Correspondingly, the English dictionary run takes 10.34s
VS 0.05s
for the German one: a difference of over 200 times!
Perhaps, the fact that German words tend to be longer makes it harder for the algorithm to find solutions, which in turn makes the programs search a exponentially smaller solution space?!
To test the theory, I wrote yet another CL script that transformed the English dictionary by adding a few random characters (up to 8), with a 50-50 chance, to all words in order to skew the English dictionary towards longer words, making it more similar to the German one.
Here’s the updated distribution of words, with English words artificially enlongated:
After running everything again, truncating the dictionaries to 30,000
words, the difference is remarkable:
â–¶ time src/rust/phone_encoder/target/release/phone_encoder count dict-30_000.txt input.txt
27
0.03s user 0.01s system 79% cpu 0.047 total
â–¶ time src/rust/phone_encoder/target/release/phone_encoder count words-30_000-longer-words.txt input.txt
17746
0.06s user 0.01s system 94% cpu 0.076 total
Mystery solved! The English dictionary still results in far more solutions being found! But at least now the difference in time is much smaller, falling on the same order of magnitude.
We definitely don’t want to use the print
option with the English dictionary, as that would just cause the
programs to spend all their time printing to stdout
! We’ll also have to remove words from the English
dictionary because that’s preferrable to artificially turning it into the German dictionary with longer words
that do not exist. I hadn’t foreseen that a different language would have such a dramatic effect on speed, but
now that it’s clear it does, it’s a good idea to adjust accordingly.
Therefore, I decided to write one final CL script to remove half the words from the dictionary (every second line was removed), which I ran twice to keep only every fourth
word, bringing its size down to 23,328
words. That may not sound like much,
but here’s how many solutions are still found with only 1,000
phone numbers as input:
time src/rust/phone_encoder/target/release/phone_encoder count words-quarter.txt phones_1000.txt
183397116
16.41s user 0.05s system 99% cpu 16.518 total
Because the program now spends pretty much all its time searching for solutions, rather than just printing them, it is actually useful that it finds so many solutions and takes a longer time to run for benchmarking purposes.
New benchmark design
In summary, the new benchmark will still use the German dictionary for the print
runs, but the English
dictionary for the count
runs (which has nearly zero printing overhead).
Also, because we want the programs to run for longer, I increased the input sizes to
1,000
, 100,000
, 500,000
and 1,000,000
for print
. The new count
runs will use only 1,000
and 2,000
phone numbers, as the English dictionary causes a ridiculously large amount of solutions to be
found, as we’ve just seen in the previous sub-section. The print
runs may have non-negligible IO overhead,
but the count
runs obviously don’t because they always write only a single line: the number of solutions found.
The following implementations will be considered:
- Main.java - my initial Trie-based implementation in Java, with a few simple optimisations.
- Main2.java - the Java implementation closest to the initial CL one, but using a byte-array as key instead of
BigInteger
(this optimisation was explained in the Rust posts). - main.rs - the fastest Rust implementation, which uses
Vec<u8>
as keys instead of numbers. It also uses a fast hashing function,ahash
. - main.lisp - Peter Norvig’s CL implementation, which I will be trying to optimise in later sections.
The Java programs will be executed with the following JVM options to avoid the JVM just using as much memory as it can get away with:
-Xms20M -Xmx100M
The Common Lisp code could be compiled to either the FASL binary format or directly to a native binary quite easily. Either seems to have a small, but noticeable, improvement effect on the CL performance when compared to running from source. Running from FASL or from a native binary seems to not matter for performance measurements, so I chose to compile the source to binary by writing a build.lisp script.
Commit with all program and benchmark changes
Visualizing benchmark results
In previous blog posts, I was manually copying the CSV-like results printed by the benchmark into a Google Docs spreadsheet and manually creating charts to visualize the results. That was tedious and error-prone, and the charts were of pretty poor quality. For this reason, I spent some time writing a chart generator that can be incorporated into the benchmark. I couldn’t find a good CL library to use for that, but I did find one written in Rust: the Plotters crate.
Using that, it was pretty easy to create a Rust program that parses the CSV file printed by my improved benchmark.sh script and create nicer SVG charts automatically when the benchmark is executed.
Here’s the updated results with the alterations mentioned above, and using the latest compilers for each language (Java openJDK-17 2021-09-14, Rust 1.55.0, SBCL 2.1.9):
The bold lines are time
and are associated with the left y-axis,
the thin lines are memory
and use the right y-axis. For both time and memory, lower is better!
Raw Data
I had to use a logarithmic left y-axis (time) scale because otherwise the times for the shorter runs become impossible to see. That means that small differences in the higher areas of the chart mean a lot more than in the lower areas.
As you can see, even with all the optimisations made in Java and Rust, CL still runs fastest for the smallest run,
implying that it can load the dictionary data faster than anything
else! Think about it: it loaded over 70,000
words into a hash-table, then printed the encodings for 1,000
phone numbers in exactly 72ms
, faster than Rust’s 85ms
and much better than Java’s best time, 210ms
!
But on longer runs, it struggles and falls behind… still, it manages to perform better, or very similarly, to the
Java’s Main2
implementation, which is the one that is closer algorithmically to CL’s. For this reason,
I think it’s fair to say CL can perform similarly to Java even without using any optimisations!
CL’s longest run was 79.7sec
, against 65.2sec
for Java’s Main2
.
The Rust implementation is generally the fastest, but Java’s Main
catches up on the longer runs, once again showing that its JIT compiler is very powerful.
Because the Rust code implements a similar algorihm to Java’s Main2
, NOT Main
, we should not conclude that Java can beat Rust in speed! To the contrary, Rust seems to actually run much faster than Java when using the same algorithm! However, with a better algorithm, Java shows that it can still catch up anyway. It would be interesting to port the Trie
implementation to Rust to see just how fast it can get, but I’ll leave that to others.
CL appears to use less memory than Java, even when we explicitly limit the Java heap to 100MB, but still much more than Rust, which used only 25MB even with the larger dictionary and is clearly the best language if low memory usage is a priority.
Optimising Common Lisp
I gathered a lot of feedback from people who read my previous blog posts with suggestions for how to optmise the CL code.
However, my initial attempt to make the code faster was based on my own understanding of other languages (Java, Rust) and the previous optimisations that worked in those languages… and that turned out to not be very useful in CL, as we’ll see!
To measure the impact of each optimisation attempt I made to the CL code, I made a few small changes to the benchmark so that the results plot only shows the previously compiled version of the CL code against the current one. As the difference tends to be small for each attempt, I also made the time y-axis linear instead of logarithmic.
Mutating the words
list instead of cons
-ing
In both Java and Rust, this proved to be a very effective and easy optimisation. Just mutate the words
list
instead of creating a new one on each recursion.
In Rust, that meant doing this:
words.push(WordOrDigit::Word(word));
find_translations(results_opt, num, rest_of_digits, words, dict, writer)?;
words.pop();
In CL, we can do the same thing… first, we initialize words
to an array rather than nil
:
(words (make-array 8 :fill-pointer 0 :element-type 'string :initial-element ""))
Then, we mutate the array (or vector
, equivalently) in place, instead of using cons
:
(vector-push-extend word words)
(find-translations handler num digits (+ 1 i) words)
(vector-pop words)
Almost exactly the same as in Rust. However, unlike with Rust, this change does not really help in CL:
The ligh-yellow line is the original CL implementation running as an executable. Not showing the other languages until significant CL improvements are made. Raw Data
Clearly, this implementation made things worse, not better (so I reverted it before the next attempt). CL is not Rust, lesson learned!
cons
is perhaps THE fundamental Lisp construct, so it must be highly optimised. Thinking we can beat it with a
vector
seems to have been a mistake.
Using CL tools for profiling
I decided I had much to learn before continuing, so I started reading as much as I could about Lisp. One of the best resources I found was The CL Cookbook - Performance Tuning and Tips.
From that, I discovered time
, which can be used to run simple microbenchmarks and know much more quickly if
a small change like the one we’ve just attempted might work or not.
Here’s a REPL session using that to answer whether using a mutable vector might be faster than cons
:
Easy to see that cons
is much faster than vector-push-extend
. I love how CL’s time
shows not only the time,
but how many bytes were consed - i.e. allocated - and even how many processor cycles elapsed. Really awesome.
We’ll come back to the other nifty tools CL has available, but for now, I decided it was time to listen to the experts and use the optimisations that were suggested to me on the Lisp Reddit.
Using defglobal
instead of defvar
This is a really easy one! Just change a word… from:
(defvar *dict* nil)
To:
(defglobal *dict* nil)
Could this change anything? Well, it does, but so little it’s not possible to see in the chart, so I won’t even
bother including it here… you can check the chart for yourself, and the raw data. The shortest run took just about the same time, while the longest run improved from 46.97sec
to 46.39sec
. We’ll keep that!
Declaring the compiler optimisation level
CL has a very blurry distinction between compile time and runtime, as with Lisp in general. Things that you would think you need to pass as compiler arguments in other languages are just code in CL.
Here’s how we can tell the compiler to aggressively optimise the code and remove some safety checks:
(declaim (optimize (speed 3) (debug 0) (safety 0)))
declaim
goes right inside your
Lisp file! However, once this is added, the file compiles with a lot of warnings… many of which look like this:
; file: /Users/renato/programming/projects/prechelt-phone-number-encoding/src/lisp/main.lisp
; in: DEFUN MAIN
; (LOAD-DICTIONARY DICT DICT-SIZE)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::LOAD-DICTIONARY
; (PRINT-TRANSLATIONS NUM (REMOVE-IF-NOT #'DIGIT-CHAR-P NUM))
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::PRINT-TRANSLATIONS
Looks like we need to declare the functions in the right order for the compiler to be able to optimise things.
So, besides adding the declaim
instruction to the file, I also sorted the functions so that they are all
declared before being used.
That fixed the above warnings, (there are many other to be fixed, still, but we’ll get back to them later) and running the benchmark again shows a bigger, but still tiny improvement in performance:
The ligh-yellow line is the previous CL implementation. Raw Data
Using the profiler
Before proceeding, I decided to have a look at the profiling tools available with CL.
CL has a profile
function which allows us to request profiling each function of interest…
but SLIME has a nice command, slime-profile-package
to just profile everything in a package, which is more convenient to use when you don’t know where the bottlenecks might be.
I had a go with that in the REPL:
I aborted the first try as it was taking too long… but still, we can see that there are two main functions for us to spend our effort optimising:
nth-digit
with39,421,577
calls!find-translations
with36,676,533
calls.
Unfortunately, the functions are all a bit too fast, it seems, as all their time/call are shown as 0.000000
.
But we can use the time
function to time function calls individually. That might help check the effect of optimisations later, so let’s start with the easier one, nth-digit
:
This should at least give us a rough idea of how long it takes to run the function as it currently is.
Using type annotations
When compiling with speed 3
, the compiler emits a warning for the nth-digit
function:
main.lisp:15:17:
note:
unable to
optimize
due to type uncertainty:
The first argument is a STRING, not a SIMPLE-STRING.
note:
unable to
avoid runtime dispatch on array element type
due to type uncertainty:
The first argument is a STRING, not a SIMPLE-ARRAY.
It seems that it cannot optimise due to the lack of a type annotation, so let’s try adding one with the
declaim
instruction:
For details about CL’s type system, check the CL Cookbook - Type System
(declaim (ftype (function (simple-string (integer 0 50)) (integer 48 57)) nth-digit))
(defun nth-digit (digits i)
"The i-th element of a character string of digits, as an integer 0 to 9."
(- (char-code (char digits i)) #.(char-code #\0)))
Notice how ftype
specifies a function
type of the form function (arg-types) return-type
.
simple-string
is a specialized one-dimensional simple array whose elements are of type character or a subtype of type character.
integer
specializes
rational
, real
, number
and the supertype of all objects, t
… without any arguments, an integer has no
magnitude limit, but using the syntax integer [lower-limit [upper-limit]]
we can specify both lower and upper limits, hence type (integer 48 57)
only allows values between 48
and 57
.
Initially, I erroneously declared the return type as base-char
,
when it should have been an integer
. The compiler gave me a nice warning like this:
Derived type of ((- (CHAR-CODE (CHAR DIGITS I)) 48)) is
(VALUES (INTEGER -48 1114063) &OPTIONAL),
conflicting with the declared function return type
(VALUES BASE-CHAR &REST T)
It almost got it right, but as we know this function will only be called with digits, i.e. 0
to 9
characters,
and their char-code
s are 48
to 57
, we can tell the compile exactly that and hope it’ll use that for
optimising our code as good as it can. We could improve the readability of the code by using a reader macro, as
we saw in Part 1: #.(char-code #\0)
instead of 48
.
Re-compiling the file now gives no warning for this function! Let’s time it again:
Previously, we had seen the time varying a bit, but it was around 0.000003 seconds
, with 2,500 processor cycles
approximately. On the 100-times run, it took just a little longer, using around 10,000 processor cycles
.
After the types were added, we see perhaps slightly shorter times with fewer processor cycles, specially on the
100-run, which used only 5,198 processor cycles
. I think that’s a win, but as it’s such a small function,
not by a lot.
Let’s run again the actual benchmark to see if that helped.
The ligh-yellow line is the previous CL implementation. Raw Data
Finally, we have a more substantial improvement, and it took only a one-line change!
If we can get that by adding types to a single, tiny function, maybe typing all functions would make things a lot faster?!
We can always give it a go.
I typed all functions and some of the local variables, enough to get rid of almost all compiler warnings (we’ll look at the remaining ones soon), and here’s the results:
The ligh-yellow line is the previous CL implementation. Raw Data
This is a really great improvement!
But the final few compiler warnings look concerning because they involve the arithmetics we’re doing to calculate the keys, and that’s our hottest code path.
More specifically, on this line (this code appears in two places in the current source):
(setf n (+ (* 10 n) (nth-digit digits i)))
The compiler complains about this:
main.lisp:95:43:
note:
forced to do GENERIC-+ (cost 10)
unable to do inline fixnum arithmetic (cost 2) because:
The first argument is a UNSIGNED-BYTE, not a FIXNUM.
The result is a (VALUES UNSIGNED-BYTE &OPTIONAL), not a (VALUES FIXNUM
&REST T).
unable to do inline (signed-byte 64) arithmetic (cost 4) because:
The first argument is a UNSIGNED-BYTE, not a (SIGNED-BYTE 64).
The result is a (VALUES UNSIGNED-BYTE &OPTIONAL), not a (VALUES
(SIGNED-BYTE 64)
&REST T).
etc.
main.lisp:95:46:
note:
forced to do GENERIC-* (cost 30)
...
Notice that the cost looks really high… it seems the compiler would prefer to use a non-generic fixnum arithmetic operation, but it couldn’t.
Luckily, a CL wizzard, @bpecsek, had shown me how to implement this function so that the compiler could optmise it properly!
Here’s the original function (with the type annotation we’ve just added):
(declaim (ftype (function (simple-string) integer) word->number))
(defun word->number (word)
"Translate a word (string) into a phone number, according to the rules."
(let ((n 1)) ; leading zero problem
(declare (type (integer 0) n))
(loop for i from 0 below (length word)
for ch = (char word i) do
(when (alpha-char-p ch) (setf n (+ (* 10 n) (char->digit ch)))))
n))
After some CL magic, it looks like this:
(declaim (ftype (function (simple-string) fixnum) fast-word->number))
(defun fast-word->number (word)
"Translate a word (string) into a phone number, according to the rules."
(loop with n = 1 ; leading zero problem
for i from 0 below (length word)
for ch of-type base-char = (char word i)
do (when (alpha-char-p ch)
(setf n (logand #.(1- (ash 1 (integer-length most-positive-fixnum)))
(+ (* 10 n) (char->digit ch)))))
finally (return n)))
Wondeful, right? You need to prove to the compiler (by using logand
) that the values fit into a fixnum
for it to believe you.
To make sure the compiler does transform this into something better, we can use the disassemble
function and see the actual Assembly code the compiler came up with for each version!!
I’m not really an expert in Assembly, but even I can see that the first version was indeed using the generic
+
and *
, but the fast version doesn’t… and to confirm the fast version is actually faster, I timed the
functions as well and it seems the new version is twice as fast.
I also updated the arithmetic code inside find-translations
according to @bpecsek’s version and re-compiled it.
Now, the compiler did not emit a single warning!! I ran the benchmark again:
The ligh-yellow line is the previous CL implementation. Raw Data
Not huge, but significant improvement.
There’s only one problem: if you paid attention, you might have noticed that this version of the code is kind of cheating…
It will compute the wrong key for any word or number longer than 18 characters
((integer-length most-positive-fixnum)
is 62
, hence we take only the first 61 bits of the key and ignore the
rest… 61 bits can only hold 18 characters safely). The tests don’t fail because it seems to be highly unlikely it will find a case where this problem causes the program to find an incorrect answer.
We can fix that by checking for the length of the word being encoded and automatically switching to the slower version of the key updating function only if necessary. Hopefully, doing that doesn’t completely revert the small improvement this change has given.
Here are the two version of the key updating function:
(defun update-key-fast (n digit)
(logand #.(1- (ash 1 (integer-length most-positive-fixnum)))
(+ (* 10 n) digit)))
(defun update-key (n digit)
(+ (* 10 n) digit))
Sure enough, trying to choose the right function in a naive way (i.e. in the inner loop) did make the program much
slower! I had to try to keep the conditional check outside the innermost loop for the change to have a smaller impact. In word->number
, it was simple: just choose which function to use depending on the length of the word, before doing anything else:
(defun word->number (word)
(let ((update-n (if (< (length word) 19) #'update-key-fast #'update-key)))
(loop with n = 1 ; leading zero problem
for i from 0 below (length word)
for ch of-type base-char = (char word i)
do (when (alpha-char-p ch)
(setf n (funcall update-n n (char->digit ch))))
finally (return n))))
We assign the right function to a local variable, update-n
, then use it later without having to check anything.
Inside find-translations
, it was a bit trickier… but I was able to use the same strategy:
(loop with found-word = nil
with n = 1 ; leading zero problem
with len = (length digits)
with update-n = (if (< (+ start len) 19) #'update-key-fast #'update-key)
for i from start below len
do (setf n (funcall update-n n (nth-digit digits i)))
(loop for word in (gethash n *dict*)
...
Unfortunately, the program still became noticably slower, but CL has another trick up its sleeve: if the loss in performance is mostly due to calling our new extra function, we can ask the compiler to try to inline it, as we’ll see in the next section.
Inlining functions
I noticed when looking at the Assembly code that the char->digit
function was not being inlined into
word->number
. Well, we can tell the compiler to do it, so I did that (also for the tiny nth-digit
function, along with update-key
and update-key-fast
for good measure):
(declaim (inline nth-digit char->digit update-key update-key-fast))
Will that really help?! Let’s see:
The ligh-yellow line is the previous CL implementation (before the fix for large keys). Raw Data
We can see the new implementation is just slightly slower, but it’s fixing the calculation of keys for large words, so that we have a completely correct implementation again.
Hence, this is going to be the final CL implementation in this post. If you can get this even faster, send me a message (see the footer for my contact details) and I’ll make sure to update this post with your changes.
Other attempts that didn’t work
Even though these attempts did not help in my case, it’s possible knowing about them might help others, so I decided to mention several other things I tried (it should also pre-empt a few know-it-all commenters’s objections, I suppose).
Block compilation
CL supports block compilation.
By enabling that, the CL compiler is able to perform more whole-program optimisations. Enabling it is as easy as adding one line to your code:
(setq *block-compile-default* t)
It didn’t really seem to help in my case, unfortunately.
I also tried compiling the CL code using the :root-structures
option of save-lisp-and-die
(list of the main entry points in any newly loaded systems), but again, it didn’t help.
I guess that on larger programs these things might make a difference, so it’s worth knowing about them.
Purify
The SBCL Manual has a section called Efficiency Hacks! Perfect for me to find something I can use to make the code even faster…
It mentions a SBCL-specific function called sb-ext:purify
, which causes SBCL first to collect all garbage, then to mark all uncollected objects as permanent, never again attempting to collect them as garbage.
This looks perfect for us to call just after the load-dictionary
function completes!
Except this function is not found at runtime… if anyone knows what that’s about, do let me know!
I noticed that there’s a :purify
option on save-lisp-and-die
, but that probably wouldn’t help much… unless we load the dictionary at compile-time?!?
Open file as ASCII
This one seems obvious, as the problem mentions that all text files are encoded as ASCII, so we can tell CL as follows:
(with-open-file (in nums :external-format :US-ASCII)
...)
However, if anything, time
reports this is a little bit slower than before!
In any case, the default for SBCL is UTF-8
, and US-ASCII
is a sub-set of UTF-8
, hence the program is correct as it is.
Remove default function arguments
The find-translations
function is called a lot, as we saw earlier… maybe removing the default arguments from it
might make it faster to dispatch?!
It’s worth a try. But no. It did not improve, it actually ran somewhat slower! Is that even possible?!
Remove global variables
The *dict*
is a global variable, as you may recall from the original implementation. In some languages,
access to global variables is slower than local ones, so I tried making it local.
It made exactly zero difference.
Use bit-vector for keys
CL has an interesting type called bit-vector
.
I tried using bit-vector as keys instead of using numbers, but unfortunately that was much slower.
Perhaps I made a basic mistake (I was still early in my attempts when I did that) in my implementation. If you find anything obvious, reach out to me and I’ll mention it here.
Use string for keys
I even tried using string for keys. Again, much, much slower.
Using strings as keys requires the hash-table
test-function to be set to equal
instead of eql
, which is probably what makes it so slow… maybe there’s a trick to intern strings or something like that which could make this fast?!
Add upper-case chars to the ecase
in char->digit
In char->digit
, we use the char-downcase
function to avoid listing both upper- and lower-case characters.
We could remove that, so that the implementation would look like this:
(defun char->digit (ch)
(ecase ch
((#\e #\E) 0)
((#\j #\J #\n #\N #\q #\Q) 1)
...
I did try this. The performance was exactly the same as the previous version.
Final benchmark
Finally, it’s time to see how CL is doing against Java and Rust!
The moment of truth!
You can check out branch fastest-implementations-print-or-count to see all implementations.
Final Common Lisp implementation against Rust and Java’s fastest implementations. Raw Data
CL managed to overtake the similar, optimised Java implementation by a good margin (longest run was 35.2sec
for CL against 51.2sec
for Java)! But, unfortunately, it seems it’s still a good way behind the fastest Rust and Java versions for the larger problems, despite the fact that we were able to close the gap considerably.
However, let’s not forget that the faster Java implementation uses a Trie (i.e. completely different algorithm), and that Rust is a system-language that can get pretty close to the theoretical maximum performance both speedwise and memorywise. So I don’t see this result as being bad for Common Lisp. It’s actually incredible, if you ask me, that it can beat the Java version with which it can be most directly compared against, given the enormous amounts of resources that get poured into the JVM, compared with the comparatively tiny work going on in SBCL.
Last, but not least, notice how Common Lisp is the fastest language of all, beating even Rust, on the smaller runs (and notice that this is no hello-world, it loads over 70,000
words into a hash-table, then encodes 1,000
phone numbers using those words - all of that in a mere 59ms
, well ahead of Rust, somehow, which needs 89ms
!).
Conclusion
Common Lisp is a great language to develop code interactively, with its famous REPL and debugger. While it doesn’t have the powerful IDEs that Java has, its alternative, SLIME, is a joy to work with and just as helpful as any IDE I’ve ever used.
As importantly, CL has incredible performance, specially when we remember it’s essentially a dynamic language! It does have an interesting type system, though, that can be used not just to document code, but to speed it up significantly, as we’ve seen.
I will definitely consider using Lisp for future projects, specially where performance and interactivity are important considerations.
If you also enjoyed learning about Common Lisp, here’s a few exciting Common Lisp projects that you might keep an eye on:
- ASDF (build tool for Common Lisp).
- Nyxt Browser (emacs-inspired browser for power users).
- Coalton (a language with a Haskell-like type system embedded in Common Lisp).
- Clasp (Common Lisp on LLVM, C++ interoperability).
- Quicklisp (the most used Common Lisp Package Manager).
- Woo (one of the fastest web servers).
- Clack (web application framework based on Woo).
- VSCode Alive (a modern IDE for Common Lisp).
- lparallel (parallel programming in Common Lisp).
- Bordeaux-Threads (multi-threading primitives).
- rtg-math (real-time graphics algorithms).
- ulisp (Lisp for micro-controllers).
- ECL (Embeddable Common Lisp).
- common-list.net/libraries (common-lisp.net list of libraries organized by subject).