Haskell, Ruby and Infinity

(This is a snapshot of my old weblog. New posts and selected republished essays can be found at raganwald.com.)

Monday, February 26, 2007

Languages like Haskell support lazy evaluation. In principle, you only compute what you actually need, everything else just goes away. If you usually need everything you compute, this may seem like a frill: elegant, even interesting, but having little practical importance. I find it is very much like Tail Call Optimization: if you don’t have it, you code around it. Often it makes no difference, but from time to time there is a case where your code will be clearer and more maintainable if you express yourself succinctly and let the compiler do the work of making it efficient.

This post is a switch from mumbling about Y Combinators and using trivial cases to explain interesting ideas: I’m going to show some actual code I’m writing. I apologise in advance if this adds so much background that it obscures the message about lazy evaluation: extremely simplistic examples sometimes work against the argument because it is too easy to think of other ways to accomplish the needed results.

I don’t apologise at all for the unpolished nature of the code. This isn’t a textbook. And besides:

Anybody can say you can’t write. Let no one say you don’t.

—Ken Rand, courtesy of Chalain

I’ve been hacking some naïve code to cluster data sets.

In computer programming, a hacker is a software designer and programmer who builds elegant, beautiful programs and systems… For some, “hacker” has a negative connotation and refers to a person who “hacks” or uses kludges to accomplish programming tasks that are ugly, inelegant, and inefficient.

—Wikipedia

The clustering algorithm requires a very large, fixed set of curves.¹ I wrote the initial curve generation by building a gigantic list of parameter tuples, and then processing the list into records. Once the “search space” grew beyond a trivial size, the program began to eat enormous amounts of memory. The problem was that I was trying to write the generation code as clearly as possible, and that created a massive thicket of objects that all resided in memory simultaneously.

The curves being generated are paths composed of cubic beziérs. Each segment on the path requires specifications for four different control points.

I rejected the idea of a rewrite into looping, imperative form, I wanted to separate the “list comprehension” code from the “make it run in less than a gigabyte” code. Instead, I chose to use lazy evaluation: the principle is that although the code defines a huge data structure that takes up gigabytes, we only actually evaluate things as we need them, and we discard them when we are done, so the total memory footprint for the lazy form is about the same as the memory footprint for an imperative form.

The easiest way to perform the refactoring—besides a rewrite in Haskell—was to switch all of the arrays I used for a lazy list data structure. A lazy list is a linked list composed of head and tail tuples (known as cons cells). Where a normal linked list holds an element and the rest of the list, a lazy list holds an element and a function for computing the rest of the list (SICP calls them “streams.”)

If a modicum of care is taken not to pin objects down, you can build fantastically large lazy lists and process them at will. In fact, lazy lists can be infinitely large if you provide the appropriate function for generating the list

For that reason, you should never append one lazy list onto another lazy list. Consider odd = LazyList.unfoldr(1) { |n| n + 2 } and even = LazyList.unfoldr(0) { |n| n + 2 }. If you append odd onto even, the resulting list in theory has both even and odd numbers. But in practice you can never reach the odd numbers because there is an infinite quantity of even numbers at the front of the list.

Instead, merge them together to produce a lazy union of the two lists. Merge interleaves alternating elements from each list, so the resulting lazy list contains all the elements of each list, and if you take a finite sample, such as the first 1,000 elements, you get 500 of each. even.merge(odd) => (0 1 2 3 ...). (Other interesting operations that work with infinite lists include cons, pairwise and product.)

Any sufficiently complicated Lisp or Ruby program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Haskell.

My current version of a LazyList class in Ruby is here. As soon as I switched from the “eager” code using arrays to the “lazy” code using lazy lists, memory use went down and performance went up. Not as much as a switch to imperative (and not even close to writing a better clustering algorithm), but enough that I can move forward.

Here is an example of a function using arrays to enumerate choices (distribute(2, 1..4) -> [[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]]):

def distribute choices, range
  spots = (range.end - range.begin) + 1
  return [] if choices <= 0 || choices > spots
  remaining_choices = choices - 1
  if remaining_choices == 0
    return range.to_a.map { |spot| [spot] }
  else
    (range.begin..(range.end - remaining_choices)).to_a.inject([]) { |acc, first_spot| acc + (distribute(remaining_choices, (first_spot + 1)..range.end).map { |remaining_distribution| remaining_distribution.unshift(first_spot) }) }
  end 
end

And rewritten with lazy lists:

def distribute choices, range
  spots = (range.end - range.begin) + 1
  return LazyList.new() if choices <= 0 || choices > spots
  remaining_choices = choices - 1
  if remaining_choices == 0
    LazyList.from_range(range).map { |spot| LazyList.from_element(spot) }
  else
    LazyList.from_range(range.begin..(range.end - remaining_choices)).mapr(:merge) do |first_spot|
      distribute(remaining_choices, (first_spot + 1)..range.end).map do |remaining_distribution| 
        LazyList.cons(first_spot, remaining_distribution) 
      end
    end
  end 
end

It’s almost identical. LazyLists replace arrays, mapr replaces accumulating a list with inject, and cons replaces unshift, but otherwise it’s the same code. Mission accomplished: we’ve changed the behaviour without having to change the way we express our algorithms. Had we wanted to, we could have supported enough array syntax that Lazy Lists look exactly like arrays, but that isn’t necessary.

And when running the entire generator, the memory footprint is dramatically lower. For example, this small routine quoted above no longer generates an array of arrays: it generates a structure that can generate the lists when needed. Switching the data structure changes the evaluation behaviour of the generator code: it gets us 80% of having an imperative structure without having to tie the code to the implementation.

This is exactly what separation of concerns is all about: green code here, yellow code there.

Lazy evaluation is the red pill

It’s great to make a change to the code’s behaviour without changing the way we think about the solution to our problem. But you know what? It isn’t interesting. In fact, if lazy optimization was something we needed on a regular basis, we ought to switch to a language that does it for us. Implementation details ought to be left to libraries, compilers, and virtual machines. That’s what they’re for.

A language that doesn’t affect the way you think about programming, is not worth knowing.

—Alan Perlis

Now that we hold in our hands a tool for making infinitely long lists, the question to ask is, “how does that affect the way we think about generating sample curves?”

The first thing that comes to mind is to ask why the set of sample curves we generate is finite. It isn’t really finite, there are an infinite number of curves. (And not just any infinity, it’s Aleph One, the children’s book One Two Three… Infinity explained that forty years ago.)

I’m not a Haskell user (yet), but The Haskell School of Expression: Learning Functional Programming through Multimedia has received rave reviews and comes with solid recommendations. It’s on my wish list if you’re feeling generous!

What we really want is a finite sample of that infinite set of curves. Now let’s zoom in and think about a single parameter, the y value for one of the control points on a curve. Let’s say it’s a rational number between 0.0 and 1.0 inclusive. It’s relatively easy to generate a list of rationals in that range. Say we generate 0.0000...00001 (with jillions of zeroes elided: we want the smallest number our computer can represent), followed by 0.0000...00002 (the second smallest), and so on.

If we leave the computer running for a few months or years, we should get all of the numbers between 0.0 and 1.0 that our computer can represent. That shouldn’t be hard. But even taking into account the finite limitations on a computer’s representation, that list is infinitely long.

Useful samples

We need to take a finite sample of all of the infinite possibilities in that list.

When working with an infinite list, what we want is that the front part of the list, the part we can access before the heat death of the Universe, contains a useful sample. If we were sampling a list like (0.0000...00001 0.0000...00002 0.0000...00003 0.0000...00004 ... ) for control points, the first values are not useful at all. We’d just get a line along the x access.

There are various strategies for generating a more useful list. We could create an infinite list of random values. I’ll outline another method below. But the take-away idea is this: we need lists where the most useful values are first. Working with infinite lists encourage us to think of infinitely sized sets of samples, ordered by usefulness.

Let’s say that when we generate y, we devise a list like (0.0 1.0 0.5 0.75 0.25 ... ). We’re successively bisecting the intervals (which is why the library method is called binary_search). We can stop at any time and we have decent distribution. (And we can pipe dream that in the future, we can train our generator to favour points more likely to be useful to us.) That is much more useful!

Here’s some Lazy List code that does just that: given two seeds and an optional bisection block, it generates an infinite list of values:

def binary_search seed1, seed2, &block
  bisection_block = block || lambda { |x, y| (x-y)/2 + y }
  LazyList.cons(
    seed1,
    LazyList.new(
      seed2,
      delay {
        bisect(seed1, seed2, &bisection_block)
      }
    )
  )
end

# infinitely bisects a space, exclusive of the seed values

def bisect seed1, seed2, &block
  return EMPTY if seed1 == seed2
  half = block.call(seed1, seed2)
  LazyList.new(
    half,
    delay {
      merge(
        bisect(seed1, half, &block),
        bisect(half, seed2, &block)
      )
    }
  )
end

And indeed, LazyList.binary_search(0.0, 1.0) -> (0.0 1.0 0.5 0.75 0.25 ... ).

The Grand Hotel

Satan, Cantor, and Infinity and Other Mind-boggling Puzzles is a five-star introduction to Cantor’s work on infinity, including a special treat: a completely different proof that Aleph One is greater than Aleph Zero based on games, very much in the style of Conway’s Surreal Numbers. Currently out of print, but please get yourself a used copy from a bookseller: you won’t be disappointed!

What happens when we try to combine two infinite lists? For example, we want a list of (x, y) tuples, generated as the Cartesian Product of our binary search list with itself. If we use a typical breadth-first or depth-first algorithm, we’re sunk. We get ( (0.0 0.0) (1.0 0.0) (0.5 0.0) (0.75 0.0) (0.25 0.0)... ). That has a nice distribution along one axis but still useless. What we need is a way of combining two infinite lists such that they give us a nice useful sample when we take a finite number of elements off the front.

Here’s a table describing how the ‘breadth-first’ algorithm for generating the product of two lists works. Consider two lists of three elements each. In the table below, columns represent one list and rows the other. The numbers represent the order that the product of the lists is generated:

Breadth-first mapping from elements to integers
1	2	3
4	5	6
7	8	9

This works fine for finite lists, but as we saw above it is useless for infinite lists. But looking at it in table form is useful. The product of two lists is tabular, the problem with a traditional algorithm is the order that we select elements from the table, not that the elements are tabular. Just as we solved the problem of how to sample y magnitudes by changing the order in which we selected rationals between 0.0 and 1.0, what we need to do with Cartesian Products, what we need to do is select the products in an order that provides useful results.

As it happens, there’s an order that generates useful finite samples when dealing the product of two infinite lists: Instead of generating rows and appending them to each other, it generates diagonals and merges them with each other. The first diagonal is 1, 5, 9. The second is 4, 8. The third is 2, 6. And so on. The advantage of this algorithm is that if given two infinite lists, it starts in the upper left-hand corner and supplies values, working its way right and down.²

Here’re the first four values of the product of two binary searches with each other: [{:x=>0.0, :y=>0.0}, {:x=>0.0, :y=>1.0}, {:x=>1.0, :y=>0.0}, {:x=>0.0, :y=>0.5}]. And the next four: [{:x=>1.0, :y=>1.0}, {:x=>1.0, :y=>0.5}, {:x=>0.5, :y=>0.0}, {:x=>0.0, :y=>0.25}]. And the next eight:

[{:x=>0.5, :y=>0.5}, {:x=>0.5, :y=>0.25}, {:x=>0.5, :y=>1.0}, {:x=>1.0, :y=>0.25}, {:x=>0.25, :y=>0.25}, {:x=>0.25, :y=>0.75}, {:x=>0.25, :y=>0.0}, {:x=>0.0, :y=>0.75}]

.

We can take as many samples as we want, and more sample give us more “resolution.” We now have the tools we need to get rid of all of the limits and finite combinations and focus on describing what we’re trying to generate without intermingling a lot of generating code.

And I’m thinking about eternity
Some kind of ecstasy got a hold on me

—Bruce Cockburn, Wondering Where The Lions Are

Now that we can generate an infinite list of useful magnitudes, plus we can combine infinite lists in a useful way, we can define an infinite list of sample curves. Here’s the exact definition of an infinite list of partial beziérs (there are only three control points because the origin of the beziér is always the terminus of the previous beziér in the curve we are generating):

def sample_cubic_beziers
  magnitudes = LazyList.binary_search(0.0, 1.0)
  control_points = LazyList.cartesian_product(magnitudes, magnitudes) do |x, y|
    { :x => x, :y => y }
  end
  LazyList.cartesian_product(control_points, control_points, control_points) do |p1, p2, p3|
    {
      :p1 => p1,
      :p2 => p2,
      :p3 => p3
    }
  end
end

That’s really simple, and really easy to understand.³ The domain-specific code that defines a cubic Beziér is concise and readable. And the infrastructure code that handles combinatorics is shunted away out of sight where it doesn’t clutter up our thinking.

So what have we seen?

Lazy evaluation solved a performance problem without needing an extensive rewrite of our generation code;

Taking the red pill, infinite lists allowed us to radically simplify the original code.

Are there some opportunities to steal an idea like lazy evaluation from other languages and use them to simplify your code?

The data is a set of (x,y) tuples where x is time and y is a magnitude. The effort remaining in a sprint backlog is an example of this kind of data set. I need a function that computes the distance between any two data sets. I already have a function that computes the distance between a curve through the xy space and a data set, so if I build a lot of curves and then take the distance between each curve and each data set, I can find the distance between pairs of data sets by searching for the curve with the lowest total distance.

This operation is On², but that’s why we invented distributed processing. Also, this operation is done infrequently relative to operations making use of the clusters, something like Google’s indexing being far less frequent than searching Google. And of course, when my MacBook overheats and burns its way through my desk, I can come back and replace the brute force operation with something more intelligent.

Well, all this requires generating the sample curves. I have code that makes curves out of Beziérs, provided I supply coördinates for the control points. the curves are actually ActiveRecord models. So all I needed was some simple code that generates all of the possible combinations and then write them to the database. It’s the kind of thing candidates write in job interviews where employers actually care about their stone cutting skills.

This problem is congruent to the proof that the set of all points on a plane is the same size as the set of all points in a line.

And it could be simpler yet: a “list comprehensions” DSL for lazy lists ~~is on the wish list~~ would make things far more readable.

Labels: lispy, ruby

¶ 1:49 PM

Comments on “Haskell, Ruby and Infinity”:

That's a really nice way to search a parameter space!

In code with lots of nested list generators, the trick is to make sure that all the intermediate lexical scopes can be garbage-collected. And in a really tricky example (such as yours, here), that can sometimes be challenging.

For an example of what can go wrong, see Data Mining the Yeast Genome in a Lazy Functional Language. They implemented a very sophisticated algorithm in days, and then spent a couple of weeks getting their heap size under control.

The usual solution in Haskell is to define a small number of well-understood list-walking functions, and define everything in terms of them. If, for example, you can write everything in terms of 'fold', you can automatically rule out various pathological behaviors.

# posted by

emk : 3:01 PM

if lazy optimization was something we needed on a regular basis, we ought to switch to a language that does it for us.

Interesting to see that link to ye olde Miranda, also note "CLEAN is by default a lazy language"

# posted by

Isaac Gouy : 4:00 PM

I used lazy techniques in the Notes Formula Language runtime as well. During formula startup, it does a semantic analysis of the runtime trees and figures out which expressions and subexpressions (tree nodes) are referentially transparent.

It then does partial evaluations of the trees, executing only branches that have side effects and lazily executes all referentially transparent subbranches.

The cool thing is expressions that evaluate and assign to a variable can instead be partially executed and the remaining referentially transparent computations are done only when the variable is later referenced. Also the engine can figure out "constant" expressions using the same mechanism, and cache the expression results for subsequent evaluations.

This might sound rather complicated, but the algorithm is simple to understand and the formula can be analyzed in O(N) time (N being the number of nodes in the parsed formula tree), so it incurs very little overhead. In the Notes formula engine, the analysis happens during the building of the runtime trees.

I understand most SQL interpreters do something similar (as I now work for MySQL, I should know more about all that soon).

# posted by

Damien : 2:42 PM

# posted by

Damien : 2:43 PM

<< Home