I’ve been learning Elixir over the past month. I was happy to discover the Stream
module, which implements lazy enumerations. Lazy enumerations intrigued me even before Ruby introduced them in 2.0, and they were one of the appeals of learning Clojure.
Consider this Elixir code:
File.stream!("/usr/share/dict/words")
|> Enum.map(&strip_and_reverse/1)
|> Enum.each(&IO.puts/1)
This code reads the file line by line, then maps a strip_and_reverse
function over every line (which strips the trailing newline and reverses the word on that line), and finally puts every reversed word onto standard output.¹ Since I’m using functions from Enum
, this code executes eagerly. Enum.map
and Enum.each
each wait until the previous step has finished before starting. So a full intermediate enumerable is built before both of those calls, and that can be costly for 235,886 words.
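The post uses strip_and_reverse without ever showing it. A minimal version, assuming it only trims the newline and reverses the line (Words is a module name I made up; the original implementation isn’t shown), might look like this:

```elixir
# A guess at strip_and_reverse: drop the trailing newline,
# then reverse the characters of the word.
defmodule Words do
  def strip_and_reverse(line) do
    line
    |> String.trim_trailing("\n")
    |> String.reverse()
  end
end

Words.strip_and_reverse("hello\n")
# => "olleh"
```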
The streaming/lazy version looks pretty similar:
File.stream!("/usr/share/dict/words")
|> Stream.map(&strip_and_reverse/1)
|> Stream.each(&IO.puts/1)
|> Stream.run
The Stream.run
at the end of the pipeline is needed to trigger the computation. The Stream
functions will wait until their result is actually demanded, and Stream.run
demands the whole result. However, the Stream
functions are so lazy that they work on only one line at a time. One line is read from the file, then it’s stripped and reversed, then that result is put to the standard output. Then the pipeline is triggered for the second line. And so it goes.
Intermediate enumerations of any sort are never needed, and the code will never have all 235,886 words in memory at any time. At least in theory.
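The difference in evaluation order is easy to see with side effects on a tiny list (a toy example of mine, not from the post):

```elixir
# Eager: Enum.map finishes all three elements before Enum.each starts.
[1, 2, 3]
|> Enum.map(fn x -> IO.puts("map #{x}"); x end)
|> Enum.each(fn x -> IO.puts("each #{x}") end)
# prints map 1, map 2, map 3, then each 1, each 2, each 3

# Lazy: each element flows through the whole pipeline before the next is read.
[1, 2, 3]
|> Stream.map(fn x -> IO.puts("map #{x}"); x end)
|> Stream.each(fn x -> IO.puts("each #{x}") end)
|> Stream.run()
# prints map 1, each 1, map 2, each 2, map 3, each 3
```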
I wrote two more variations: one with an eager map
and a lazy each
, the other with a lazy map
and an eager each
.
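Those two mixed pipelines presumably looked something like this (my reconstruction, not the post’s actual script; strip_and_reverse as before):

```elixir
# Eager map, lazy each: Enum.map builds the full list of reversed
# words, which is then streamed one at a time through Stream.each.
File.stream!("/usr/share/dict/words")
|> Enum.map(&strip_and_reverse/1)
|> Stream.each(&IO.puts/1)
|> Stream.run()

# Lazy map, eager each: lines are mapped on demand, and Enum.each
# drives the whole stream to completion itself (no Stream.run needed).
File.stream!("/usr/share/dict/words")
|> Stream.map(&strip_and_reverse/1)
|> Enum.each(&IO.puts/1)
```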
I expected the stream version to kick butt, but no:
Version | time (s) |
---|---|
all stream | 4.408 |
eager map | 4.171 |
eager puts | 4.322 |
all eager | 3.986 |
Not only does the all-eager version win, it wins by a pretty significant margin. Not a butt kicking, but still impressive. I don’t know if I have any good excuses for this, but one thing to consider is the I/O itself. Despite my hate for IO.puts
, I was reluctant to remove it since it actually did something, but without it, the times are much closer:
Version | time (s) |
---|---|
all stream | 1.471 |
eager map | 1.443 |
eager puts | 1.470 |
all eager | 1.467 |
Those look too close to really matter, but I really wanted the stream version to kick some butt.
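For reference, the timing harness isn’t shown in this post; one simple way to take timings like these in Elixir is :timer.tc from Erlang’s standard library (a sketch with a made-up workload, not the post’s script):

```elixir
# :timer.tc runs a function and returns {microseconds, result}.
{micros, :ok} =
  :timer.tc(fn ->
    1..100_000
    |> Stream.map(&(&1 + 1))
    |> Stream.run()
  end)

IO.puts("took #{micros / 1_000_000} s")
```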
The other benefit from a stream is that the amount of work done is governed by the last operation. Consider these two blocks of code:
# eager
File.stream!("/usr/share/dict/words")
|> Enum.map(&strip_and_reverse/1)
|> Enum.take(20)
|> Enum.each(&IO.puts/1)
# stream
File.stream!("/usr/share/dict/words")
|> Stream.map(&strip_and_reverse/1)
|> Stream.take(20)
|> Stream.each(&IO.puts/1)
|> Stream.run
Both of them will only print the first 20 words from the file. However, the eager version works strictly left to right (or top to bottom, depending on how you read the code), so the entire stream of words from the file is stripped and reversed before it picks the first twenty to print. The lazy version is driven right to left: the take(20)
controls the map
and asks it for only 20 stripped-and-reversed words; in turn, the map
asks the file for just 20 lines. Prepare yourself:
Version | time (s) |
---|---|
eager first 20 | 1.390 |
stream first 20 | 0.001 |
Butt kicking achieved.
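That demand-driven behavior is easiest to see with an infinite stream, which an eager pipeline could never finish (another toy example of mine, not from the post):

```elixir
# Stream.iterate generates 1, 2, 3, ... forever; Enum.take/2 pulls
# only five elements through the map, so only five squares are computed.
Stream.iterate(1, &(&1 + 1))
|> Stream.map(&(&1 * &1))
|> Enum.take(5)
# => [1, 4, 9, 16, 25]
```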
I’d like to try out the variations I listed above with increasing data sizes and algorithm complexity. I’m particularly curious what effect concurrency and the parallelism of multiple cores would have on this problem, though Elixir/Erlang favor server-level concurrency, not data-level. But that all sounds good for a future blog post.
The script I used for my experiments is available if you’d like to play around with it. I’d appreciate any feedback if I did anything that was really stupid. Or even mildly stupid.
¹ I am embarrassed by the IO.puts
in this code, but not much. ↩