(This is a snapshot of my old weblog. New posts and selected republished essays can be found at raganwald.com.)

Friday, April 13, 2007
  From Abstraction to Zipf


Damien Katz raised an interesting question the other day: he wondered whether elements of computer programs followed a Zipf distribution. In other words, he wondered whether the most frequently used item occurred a lot more than the next most frequent, which occurs a lot more than the third most frequent, and so forth.

That’s an interesting question. Is it true? And if it is, what does that tell us about composing programs?

Looking at computer programs, some things seem to follow Zipf’s law and others don’t. In C++ programs, loop indices called i probably are an order of magnitude more common than those called j. Unless you use Apps Hungarian, of course. But are for loops that much more common than if statements? That might vary from place to place within a single program, but overall they might be roughly equal in frequency.

If we step up a level and look at idioms, my gut tells me that there is a strong Zipf distribution. Although each program will differ, some programs might make heavy use of lists, others of maps. Some might do a lot with closures, others with objects. These are fairly generic, of course.

If we have a look at the more domain-specific items in a program, we really see a Zipf distribution. Some things pop up again and again and again in a program, others are rare.

Why Abstract?1

The most important tool we have for composing programs is abstraction. When we create an abstraction for something, we get one and a half wins (a sesqui-win).

The Ruby Way is the perfect second Ruby book for serious programmers. The Ruby Way contains more than four hundred examples explaining how to implement advanced ideas like metaprogramming.

The full win is that we get to take a common element and place its responsibility in once place. For example, if your language gives you a map function or method, there is one place for the code that applies a function to each element of a collection and produces a collection of the results. Unlike a “pattern,” you don’t have to re-implement map every time you use it, you just use it. The centralization of responsibility in a single place is very powerful way to separate concerns.

The half win is that we don’t need to understand its implementation, we only need to understand its interface. I honestly don’t know whether Ruby’s built-in map method applies itself to a collection one at a time from the start to the end or in some other order. I suspect it’s from start to end for ordered collections, but I don’t know and I don’t have to know.

This is only half a win, because for anything too complicated to explain on a PowerPoint slide where you are promising the newest silver bullet, abstractions have a nasty habit of leaking. Leaky abstractions force you to understand the implementation to use them.

There are some drawbacks to abstractions. As noted, sometimes the abstraction leaks. In that case, you often have to look up what an abstraction does. If there was no abstraction and the implementation was right there in the middle of your code, you wouldn’t have to look it up. So when you see:

class BlogPost << ActiveRecord::Base
has_many :comments
# ...

You obviously to need to know that has_many creates methods like comments, comment_ids= and comments<< automagically for you. In fact, that is kind of the point of has_many. But if you are doing anything non-trivial, you also need to know whether has_many actually defined comments, comment_ids= and comments<< or whether it has merely modified BlogPost to handle those messages. In that sense, the abstraction leaks.

You get much the same thing if you build a Java class hierarchy that is umpteen levels deep festooned with abstract base classes, factories, and dependency injection. In one sense you get a really tight abstraction. In another sense, you have a big leak: you need to know its internals to get any real work done.

Another drawback to abstractions fall out of their strength. If you haven’t seen a particular abstraction before, you don’t know what it does and how it works. Contrary to popular hype, self-documenting names are not enough. If you are a Rails user, raise your hand if you can tell me what all of the various options for has_many are and what they do. You have to learn each abstraction’s interface, just as you had to learn what a for loop does for you.

The win (not having to know how it works) is also a loss (not knowing what it does).

An abstraction is an abstraction is an abstraction

I think most people agree that named subroutines (or functions, or procedures) are an excellent abstraction. First-class functions and anonymous functions seem to be an acquired taste (just one taste, and your programs will quickly acquire them). Objects—in the strong sense of polymorphic programs—are well-regarded abstractions.

Macros, DSLs, and other means of writing your own mini-language provoke the wildest claims of productivity and equally baseless fears of unreadable, fragile, “magic” code built out of gossamer dreams.

What these all have in common is that these functional abstractions live happily within the syntax provided by the host programming language. They abstract function without abstracting syntax. For the most part, non-syntactic abstraction is uncontroversial.

There seems to be debate over syntactic abstraction. Macros, DSLs, and other means of writing your own mini-language provoke the wildest claims of productivity and equally baseless fears of unreadable, fragile, “magic” code built out of gossamer dreams. The basic “problem” with syntactic abstractions like has_many in Rails or let in Scheme is that syntactic abstractions force you to learn their interface.

They are right in your face, especially if they are very brief. Consider:

BlogPost.find_or_create_by_name('From Abstraction to Zipf').comments << Comment.new('bleh!', 'anonymous')

Don’t bother searching for the find_or_create_by_name method or the comments<< method in the code base.

All the same, syntactic abstractions are just like functional abstractions. You get some wins, and you pay some costs in understanding and potential “leakiness.”

Zipf to the rescue

So should we zealously abstract everything we can? Or should we conservatively avoid “magic” like syntactic abstractions?

Every abstraction has a fixed learning cost. That cost is amortized over all of the instances we’ll encounter. So you need to learn how has_many works just once, and then you benefit every time you read it or write it. This is why abstractions built into a programming language are big wins: you amortize their cost over every program written in the language. Isn’t that why Java and C# look so much like C++ which looks so much like C? They were offering abstractions that have been paid up for years.

Zipf’s law tells us that each application only needs a very few new kinds of abstractions to get most of the benefits. There are only a few low-hanging fruits, but each is plump and juicy.

An abstraction built into a framework like Rinda or MapReduce is amortized over fewer programs. A domain-specific abstraction for a single organization is amortized over a very few programs, and an abstraction built into one program is only amortized over instances in the one code base. Unsurprisingly, opportunities for abstraction follow the Zipf distribution.

The Art of the Metaobject Protocol is one of the most important books ever written on the subject of building abstractions with objects. If the idea that there is more than one way to structure objects and classes seems surprising, then learning about the Meta-Object protocol will teach you what your language has kept secret.

Of course, that distribution is self-similar: the abstractions within a program follow the same distribution as the abstractions within an organization. So even if you aren’t inventing a new programming language, Zipf can help.

Quite simply, it pays to aggressively abstract the few things that are encountered the most frequently. In a CRUD web application, schema relationships like belongs_to are incredibly frequent. There’s a big win from creating a syntactic abstraction, even if the learning curve looks daunting to newcomers. Creating a domain-specific language for database schema changes is also a big win.

Note that we aren’t saying that has_one, belongs_to, and has_many appear more than 50% of the time. They may be quite infrequent. The point is that they are far more frequent than something else you could abstract, and the other thing will take just as much time to learn to read but will pay off far less often.

But should you create a DSL for list comprehensions in your CRUD application or just use the language’s built-in collections? I would say, you would need an application that uses an awful lot of lists before a syntactic abstraction for lists is a win.

It might be, but the nice thing is, it probably won’t be close: Zipf’s law tells us that each application only needs a very few new kinds of abstractions to get most of the benefits. There are only a few low-hanging fruits, but each is plump and juicy.


I think it is a win to use aggressive abstraction techniques like metaprogramming, but only if you restrain yourself to abstracting the very few things that appear most frequently in your code base. The things that are used less frequently should be abstracted using less aggressive techniques.

The resulting program may not be magically 50% shorter than an unabstracted program, because the “long tail” of idioms and patterns are simply not worth abstracting away even though we have the tools for doing so.

You can probably tell whether a program has the proper level of abstraction just by checking its distribution.

If you find a big abstraction framework for one thing while frequently used idioms are expressed using patterns instead of abstractions, you are probably looking at an improperly abstracted program. You have to wade through verbiage to read everything, and the one time you don't need to abstract the code away, you have to chase definitions through indirection.

And in a well-written program, I would expect that the things that occur most frequently are expressed at the highest and most powerful levels of abstraction, while things that happen infrequently are expressed in a more direct and explicit way. Such a program would be easy to read throughout.

  1. Why Abstract? is a delightful and wonderfully brief book that explains the appeal of abstract art in a simple and direct way. It’s one of the treasures on my bookshelf and I urge you to find a copy through a used book seller.

  2. Suprised? There is no footnote two. But if there was, it would be to mention that Paul Graham’s incredible book On LISP: Advanced Techniques for Common LISP is freely available online. This is a must read for anyone interested in composing syntactic abstractions in any language.

Labels: ,


Comments on “From Abstraction to Zipf:
From the pov of self-education, I think it really pays off big to go through a serious[*] program and do everything right, even the things at the far end of the long tail. This is because doing so makes the whole program a lot less noisy, and this makes it possible to see the latent structure, and learn what abstractions you need at the large-scale design level.

What's especially fun is that when you start, there are often multiple, incompatible ways to do everything right. Sometimes cleaning everything up reveals a way to restructure things to satisfy multiple conceptions of the right thing. This is very satisfying. You can also learn that the reason that there were multiple ways to clean everything up is because there were actually multiple designs in your program, and untangling them helps you understand the underlying engineering tradeoffs more deeply.

[*] A program is "serious" if it has intellectual mass, but it doesn't have to have lots and lots of LOC. I like toy compilers for this purpose; they require minimal algorithmic trickery and there's a lot of theory to help structure their design.
A bunch of aspects of both code and running systems have been shown to follow the Zipf distribution (or close relatives). My own studies involve asking how one aspect (the distribution of complexity in a codebase) varies depending the design techniques applied, in particular TDD. This seems to be well modelled by a Zipf distribution.

And it does look as if the more TDD and refactoring programmers do the "steeper" the resulting distribution of complexity, ie lesser complexity gets more frequent and greater complexity less frequent.

This generally isn't a surprise to people who do a lot of TDD, but it's nice to be able to use a simple tool like the Zipf distribution to quantify people's intution about design.

Zipf's explanation for this distribution of word frequencies in natural language has to do with speakers and listeners seeking to minimize their effort when comunicating, I expect a similar explanation to apply with programming too.

One note of caution, though, Zipf (or any distribution) is descriptive not prescriptive: the existiance of a Zipfian distribution of abstractions does not "[tell] us that each application only needs a very few new kinds of abstractions to get most of the benefits" without knowing anything else about the application.

At best, we might expect to show that a small set of abstractions are overwhelmingly most frequently used of all the abstractions used in a large corpus of applications, which certainly would be a huge clue to the abstractions we'd likely need in the next application we write.

And in fact, we know that they do, and what they are. They are (as you suggest) the ones that persist in the libraries that we use. This might go some way towards explaining why the number of really new fundamental programming techniques, algorithms, data structures, etc found each year seems to have been declining precipitously since the early days of the industry
Neel, Keith: thanks so much for the insights!

<< Home
Reg Braithwaite

Recent Writing
Homoiconic Technical Writing / raganwald.posterous.com

What I‘ve Learned From Failure / Kestrels, Quirky Birds, and Hopeless Egocentricity

rewrite_rails / andand / unfold.rb / string_to_proc.rb / dsl_and_let.rb / comprehension.rb / lazy_lists.rb

IS-STRICTLY-EQUIVALENT-TO-A / Spaghetti-Western Coding / Golf is a good program spoiled / Programming conventions as signals / Not all functions should be object methods

The Not So Big Software Design / Writing programs for people to read / Why Why Functional Programming Matters Matters / But Y would I want to do a thing like this?

The single most important thing you must do to improve your programming career / The Naïve Approach to Hiring People / No Disrespect / Take control of your interview / Three tips for getting a job through a recruiter / My favourite interview question

Exception Handling in Software Development / What if powerful languages and idioms only work for small teams? / Bricks / Which theory fits the evidence? / Still failing, still learning / What I’ve learned from failure

The unary ampersand in Ruby / (1..100).inject(&:+) / The challenge of teaching yourself a programming language / The significance of the meta-circular interpreter / Block-Structured Javascript / Haskell, Ruby and Infinity / Closures and Higher-Order Functions

Why Apple is more expensive than Amazon / Why we are the biggest obstacles to our own growth / Is software the documentation of business process mistakes? / We have lost control of the apparatus / What I’ve Learned From Sales I, II, III

The Narcissism of Small Code Differences / Billy Martin’s Technique for Managing his Manager / Three stories about The Tao / Programming Language Stories / Why You Need a Degree to Work For BigCo

06/04 / 07/04 / 08/04 / 09/04 / 10/04 / 11/04 / 12/04 / 01/05 / 02/05 / 03/05 / 04/05 / 06/05 / 07/05 / 08/05 / 09/05 / 10/05 / 11/05 / 01/06 / 02/06 / 03/06 / 04/06 / 05/06 / 06/06 / 07/06 / 08/06 / 09/06 / 10/06 / 11/06 / 12/06 / 01/07 / 02/07 / 03/07 / 04/07 / 05/07 / 06/07 / 07/07 / 08/07 / 09/07 / 10/07 / 11/07 / 12/07 / 01/08 / 02/08 / 03/08 / 04/08 / 05/08 / 06/08 / 07/08 /