From Abstraction to Zipf
Alpha…Damien Katz raised an interesting question the other day: he wondered whether elements of computer programs followed a
Zipf distribution. In other words, he wondered whether the most frequently used item occurred a lot more than the next most frequent, which occurs a lot more than the third most frequent, and so forth.
That’s an interesting question. Is it true? And if it is, what does that tell us about composing programs?
Looking at computer programs, some things seem to follow Zipf’s law and others don’t. In C++ programs, loop indices called
i
probably are an order of magnitude more common than those called
j
. Unless you use
Apps Hungarian, of course. But are
for
loops that much more common than
if
statements? That might vary from place to place within a single program, but overall they might be roughly equal in frequency.
If we step up a level and look at idioms, my gut tells me that there is a strong Zipf distribution. Although each program will differ, some programs might make heavy use of lists, others of maps. Some might do a lot with closures, others with objects. These are fairly generic, of course.
If we have a look at the more domain-specific items in a program, we really see a Zipf distribution. Some things pop up again and again and again in a program, others are rare.
Why Abstract?1The most important tool we have for composing programs is
abstraction. When we create an abstraction for something, we get one and a half wins (a
sesqui-win).
The full win is that we get to take a common element and place its responsibility in once place. For example, if your language gives you a
map
function or method, there is one place for the code that applies a function to each element of a collection and produces a collection of the results. Unlike a “pattern,” you don’t have to re-implement map every time you use it, you just use it. The centralization of responsibility in a single place is very powerful way to
separate concerns.
The half win is that we don’t need to understand its implementation, we only need to understand its interface. I honestly don’t know whether Ruby’s built-in
map
method applies itself to a collection one at a time from the start to the end or in some other order. I suspect it’s from start to end for ordered collections, but I don’t know and I don’t have to know.
This is only half a win, because for anything too complicated to explain on a PowerPoint slide where you are promising the newest silver bullet, abstractions have a nasty habit of leaking.
Leaky abstractions force you to understand the implementation to use them.
There are some drawbacks to abstractions. As noted, sometimes the abstraction leaks. In that case, you often have to look up what an abstraction does. If there was no abstraction and the implementation was right there in the middle of your code, you wouldn’t have to look it up. So when you see:
class BlogPost << ActiveRecord::Base
has_many :comments
# ...
end
You obviously to need to know that
has_many
creates methods like
comments
,
comment_ids=
and
comments<<
automagically for you. In fact, that is kind of the point of
has_many
. But if you are doing anything non-trivial, you also need to know whether
has_many
actually
defined comments
,
comment_ids=
and
comments<<
or whether it has merely modified
BlogPost
to
handle those messages. In that sense, the abstraction leaks.
You get much the same thing if you build a Java class hierarchy that is umpteen levels deep festooned with abstract base classes, factories, and dependency injection. In one sense you get a really tight abstraction. In another sense, you have a big leak: you need to know its internals to get any real work done.
Another drawback to abstractions fall out of their strength. If you haven’t seen a particular abstraction before, you don’t know what it does and how it works. Contrary to popular hype, self-documenting names are not enough. If you are a Rails user, raise your hand if you can tell me what all of the various options for
has_many
are and what they do. You have to learn each abstraction’s interface, just as you had to learn what a
for
loop does for you.
The win (not having to know how it works) is also a loss (not knowing what it does).
An abstraction is an abstraction is an abstractionI think most people agree that named subroutines (or functions, or procedures) are an excellent abstraction. First-class functions and anonymous functions seem to be an acquired taste (just one taste, and your programs will quickly acquire them). Objects—in the strong sense of polymorphic programs—are well-regarded abstractions.
Macros, DSLs, and other means of writing your own mini-language provoke the wildest claims of productivity and equally baseless fears of unreadable, fragile, “magic” code built out of gossamer dreams.
What these all have in common is that these functional abstractions live happily within the syntax provided by the host programming language. They abstract function without abstracting syntax. For the most part, non-syntactic abstraction is uncontroversial.
There seems to be debate over syntactic abstraction. Macros, DSLs, and other means of writing your own mini-language provoke the wildest claims of productivity and equally baseless fears of unreadable, fragile, “magic” code built out of gossamer dreams. The basic “problem” with syntactic abstractions like
has_many
in Rails or
let
in Scheme is that syntactic abstractions force you to learn their interface.
They are right in your face, especially if they are very brief. Consider:
BlogPost.find_or_create_by_name('From Abstraction to Zipf').comments << Comment.new('bleh!', 'anonymous')
Don’t bother searching for the
find_or_create_by_name
method or the
comments<<
method in the code base.
All the same, syntactic abstractions are
just like functional abstractions. You get some wins, and you pay some costs in understanding and potential “leakiness.”
Zipf to the rescueSo should we zealously abstract everything we can? Or should we conservatively avoid “magic” like syntactic abstractions?
Every abstraction has a fixed learning cost. That cost is amortized over all of the instances we’ll encounter. So you need to learn how
has_many
works just once, and then you benefit every time you read it or write it. This is why abstractions built into a programming language are big wins: you amortize their cost over every program written in the language. Isn’t that why Java and C# look so much like C++ which looks so much like C? They were offering abstractions that have been paid up for years.
Zipf’s law tells us that each application only needs a very few new kinds of abstractions to get most of the benefits. There are only a few low-hanging fruits, but each is plump and juicy.
An abstraction built into a framework like
Rinda
or
MapReduce
is amortized over fewer programs. A domain-specific abstraction for a single organization is amortized over a very few programs, and an abstraction built into one program is only amortized over instances in the one code base. Unsurprisingly, opportunities for abstraction follow the Zipf distribution.
Of course, that distribution is self-similar: the abstractions
within a program follow the same distribution as the abstractions
within an organization. So even if you aren’t inventing a new programming language, Zipf can help.
Quite simply, it pays to aggressively abstract the few things that are encountered the most frequently. In a CRUD web application, schema relationships like
belongs_to
are incredibly frequent. There’s a big win from creating a syntactic abstraction, even if the learning curve looks daunting to newcomers. Creating a domain-specific language for database schema changes is also a big win.
Note that we aren’t saying that
has_one
,
belongs_to
, and
has_many
appear more than 50% of the time. They may be quite infrequent. The point is that they are far more frequent than something else you could abstract, and the other thing will take just as much time to learn to read but will pay off far less often.
But should you create a DSL for
list comprehensions in your CRUD application or just use the language’s built-in collections? I would say, you would need an application that uses an awful lot of lists before a syntactic abstraction for lists is a win.
It might be, but the nice thing is, it probably won’t be close: Zipf’s law tells us that each application only needs a very few new kinds of abstractions to get most of the benefits. There are only a few low-hanging fruits, but each is plump and juicy.
…OmegaI think it is a win to use aggressive abstraction techniques like metaprogramming, but only if you restrain yourself to abstracting the very few things that appear most frequently in your code base. The things that are used less frequently should be abstracted using less aggressive techniques.
The resulting program may not be magically 50% shorter than an unabstracted program, because the “long tail” of idioms and patterns are simply not worth abstracting away even though we have the tools for doing so.
You can probably tell whether a program has the proper level of abstraction just by checking its distribution.
If you find a big abstraction framework for one thing while frequently used idioms are expressed using patterns instead of abstractions, you are probably looking at an improperly abstracted program. You have to wade through verbiage to read everything, and the one time you don't need to abstract the code away, you have to chase definitions through indirection.
And in a well-written program, I would expect that the things that occur most frequently are expressed at the highest and most powerful levels of abstraction, while things that happen infrequently are expressed in a more direct and explicit way. Such a program would be easy to read throughout.
- Why Abstract? is a delightful and wonderfully brief book that explains the appeal of abstract art in a simple and direct way. It’s one of the treasures on my bookshelf and I urge you to find a copy through a used book seller.
- Suprised? There is no footnote two. But if there was, it would be to mention that Paul Graham’s incredible book On LISP: Advanced Techniques for Common LISP is freely available online. This is a must read for anyone interested in composing syntactic abstractions in any language.
Labels: lispy, ruby