Can your type checking system do this?

(This is a snapshot of my old weblog. New posts and selected republished essays can be found at raganwald.com.)

Monday, July 09, 2007

I tend to work with dynamic languages. This is a personal preference at the moment. I have liked statically typed languages with powerful type checking systems in the past, and my mind is wide open to using them again in the future. Unfortunately, when you mention “static typing,” most people think you’re talking about Java. There is more to static typing than Java, a lot more. For example, Brian over at Enfranchised Mind pointed out that languages like OCaml have a static type checking system powerful enough to enforce constraints like:

All possible variants of a type must be dealt with—for example, that a list has no elements or at least one element must both be handled.

The read_async function only takes file handles opened with the async flag set, while the close function can take file handles with the async flag set or not set.

The buffer returned from read_async can not be accessed until the asynchronous read is completed.

A Postgresql cursor can only be created and accessed with a transaction.

He’s absolutely right. Serious programming languages can do all of that.

Joel Spolsky once explained why he likes using Apps Hungarian naming conventions: it helps avoid problems with various SQL and Javascript injection vulnerabilities. The idea is that you want to make it obvious when you’re violating certain constraints on the use of string values in your application. And Tom Moertel responded by showing how Haskell’s static type checking can completely eliminate the problem. Once again, a powerful type checking system can be used to enforce an important constraint.

My own pet peeve problem is the dreaded NullPointerException. Why, I whinge, must the compiler demand that I type boilerplate like:

final Map<String, String> dictionary = new HashMap<String, String> () {{
  // ...
}}

But it will not do a simple little thing for me in return like check all of my code paths to ensure that I never pass a parameter that might be null to a method that assumes the parameter can never be null? Wouldn’t it be nice if the compiler would deign to help me with these things?

What do all of these things have in common? They are semantic constraints. Constraints on the meaning of a program. The kind of type checking you find in popular programming languages that is based on interfaces, on the physical type of your values and variables, that kind of type checking is a step up from syntax, but not much further. I would call it grammar checking, checking that your “sentences” have the right parts of speech in the right places. There is no attempt to check that your sentences are meaningful.

the value of semantic validation

Powerful type systems (type systems like those found in Ocaml and Haskell are often called “expressive” type systems), can be put to work enforcing a degree of semantic correctness. (There are other ways to do it: there are static code analysis tools, for example you can buy tools that check C/C++ code for memory leaks and buffer overflow vulnerabilities. And if you use a language with Design by Contract features, you have another tool in your toolbox.)

The Little MLer introduces ML (and Ocaml) through a series of entertaining and straightforward exercises leading up to the construction of the Y Combinator.

ML and OCaml introduce powerful strong typing and type inference. Both are great languages to learn: you will stretch your understanding of defining types and writing correct programs.

There is an argument that testing can replace type checking. Quite honestly, I believe that extensive automated testing almost completely replaces the need for “grammar checking” in the software I write.

Why? Because checking that you don’t send the wrong message to an object, or call a function with the wrong number of arguments is something that seems to get found when my test suite reaches a certain tiiping point in code path coverage. So far I have always found these things very quickly.

That being said, Theory P warns me that there are some I haven’t found, of course, but I have some faith that the result of such an error will not be catastrophic. With strongly typed languages like Lisp, this kind of failure is gracefully handled. We aren’t talking about a buffer overflow that can turn exploited into a privilege escalation.

The nice thing about grammar errors is that you don’t have to try very hard to catch them. If you have reasonable code coverage, you’ll find them. So you concentrate on your test cases, and while you are testing your logic you will validate your grammar.

So that’s why dynamic languages in conjunction with automated test suites are working for me. YMMV, so let’s not argue about whether it would work for you or not, because I want to move ahead and say that I don’t believe that extensive automated testing has quite the same value for semantic checking.

I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation. We still make syntax errors, to be sure; but they are fuzz compared with the conceptual errors in most systems.

—Frederick P. Brooks, Jr.

Semantic errors, you see, are not covered “for free” while you exercise your program logic. If you want to rely on testing to find them, you must become very good at thinking up edge cases where things can go wrong. You must try situations where things fail and you get nulls. You must try to create race conditions in concurrent code. You must try really bad data, like setting up cases where an employee’s hire date is after their termination date, or there is a cycle in the graph of reports and managers, or someone has eight bosses.

You have to try empty lists, lists with one item, all of the cases. Grammar checkers only check that you are passing a list, you need semantic checking to ensure you have coded for every case.

what to do

Now, you do not get semantic checking for free if you start writing code in Ocaml or Haskell. Why? Because you will be a Real Programmer, someone who can write Fortran in any language.

A language that doesn’t affect the way you think about programming, is not worth knowing.

—Alan Perlis, Epigrams in Programming

To take advantage of a language like Ocaml, you must actually think through the semantics of your program and think of ways to make the type system work for you. You must “think in types” instead of thinking in values. This is a valuable exercise, because it fundamentally changes the way you view programming an application.

I’m not a regular Haskell user (yet), but The Haskell School of Expression: Learning Functional Programming through Multimedia has received rave reviews and comes with solid recommendations. It’s no longer on my wish list: I picked it up and it has been enjoyable and educational.

In my experience, this thinking through the semantic constraints of your application may be the most valuable result of the exercise. You see this when people write a very database-centric application. They think about the entities and the relationships. They write constraints and triggers. There’s a great value in arguing all these cases out, in drawing the arrows on the white board and marking them with little question marks, plus signs, asterixes, and ones. It makes you think things through.

I was once asked in an interview to name a feature from C++ I wanted to use in Java. I didn’t name Templates (this was before generics), or Multiple Inheritance. I named the const keyword. I am a huge fan of const correctness, of using the C++ compiler to enforce this one little semantic constraint. It’s an enormous burden, because you literally have to go through your entire program, usage by usage, to make sure you are fully const correct and have correctly declared everything. It would be far easier if C++ had const inference.

But const correctness is valuable precisely because it forces you to think very hard about a semantic constraint: when should something change, and when should it be considered immutable? The act of bringing a code base into const correctness forces you to think about what the code does to its data, not just whether it has “spelling errors.”

I can see the same value in using a language like Ocaml or Haskell to enforce semantic constraints. Sure, the compiler will catch errors for you. But the act of designing your application’s types to enforce your application’s constraints will be a powerful force driving you to deeper understanding of your domain.

If your current programming language can’t do this, I encourage you to study ML, Ocaml, Haskell, or one of the other fine languages with powerful typing systems.

Update (April, 2008): 7 Actually Useful Things You Didn’t Know Static Typing Could Do: An Introduction for the Dynamic Language Enthusiast

¶ 11:30 PM

Comments on “Can your type checking system do this?”:

Joel's example is atrocious. If all the code did was echo out "Hello, $name" to the user that submitted $name (and to no one else) then there is no reason to prevent the user from inputing javascript.

Secondly, why would he design the app in such a way that he has to juggle unsafe and safe data? That's over-complicated and will lead to mistakes. Anyone who has used Hungarian Notation knows that it's only a matter of time before it will lie to you.

A far better solution is to sanitize user input immediately. No contrived naming conventions required. No juggling of unsafe and safe data.

# posted by

Sean : 1:50 AM

Would you call it "graceful" to shut down Arianne 5's inertial reference system?

# posted by

Chung-chieh Shan : 2:18 AM

> Would you call it "graceful" to shut down Arianne 5's inertial reference system?

Except Flight 501 crash happened because the Ada interrupt handler had been disabled for performance reasons (and only there in the whole code, which is even more stupid). The Ada runtime would have found it at runtime and could've handled it more gracefully (the check was fully disabled at compile time since they knew they were coercing a 64bits FP value to a 16 bits signed integer), it's not really the kind of stuff a type system could've catched afaik

# posted by

Masklinn : 3:20 AM

> A far better solution is to sanitize user input immediately. No contrived naming conventions required. No juggling of unsafe and safe data.

Sean, there's a problem with the sanitize-immediately approach: until you use a string, you don't know whether it is safe or unsafe for that particular use, nor do you know how to sanitize it properly. That is, "safeness" is a not a property of a string but rather a relationship between a string and a context in which it is used. Thus you can't determine safeness until you how and where a string will be used.

A string representing user input, for example, might be considered safe for writing to a file (where no string interpretation takes place) but unsafe for inserting into an HTML or URL template (where parts of the string could be misinterpreted as HTML or URL markup). Note that you need to use different sanitization methods for HTML and URL markup, so it would be difficult to pre-sanitize a string that might be used in either context.

# posted by

Tom Moertel : 3:55 AM

Of course you should sanitize strings, but not when they are input and instead when they are used. Like this:

public interface StringSanitizer {
public String sanitize(String str);
}

and in the code:

String toUse = context.getStringSanitizer().sanitize(input);

it's as simple as that.

Of course there is the problem of determining whether that particular String was untrusted, or not. I recommend wrapping the input String into an UntrustedString object and passing that around, and modifying Sanitizer interface to deal with UntrustedStrings.

Or, if you use AspectJ, you can just taint Strings when they are input and force sanitization when it's needed. I don't like that kind of magic though

# posted by

albert bulawa : 4:44 AM

To everyone writing to say:

There is no need for static typing to solve problem X, you should simply _____ whenever you want to _____.

The point is not that static typing will magically do this for you, but rather that static typing will catch you should you forget to follow your discipline.

I suspect we're back to Theory D vs Theory P: Theory D says we simply make a plan (sanitize strings when we read them, for example), and follow it, without exception.

Theory P says we will forget, or we will not understand what we were thinking, or... or... and we need a process for correcting our transgressions.

The point of my post is that for a certain class of such issues, the ones I'm calling "grammar" issues, like what methods we call on an object, I am (at the moment) very comfortable making a plan, following it, and using automated test suites to check my grammar.

For another class of issues, semantic issues, I have found the static type checking provided by powerful type systems to be very useful.

# posted by

Reginald Braithwaite : 6:42 AM

Joel's example is atrocious. If all the code did was echo out "Hello, $name" to the user that submitted $name (and to no one else) then there is no reason to prevent the user from inputing javascript.

I think you have accidentally introduced a Strawman.

Joel's company makes FogBugz, a program with several features that involve user X submitting data that user Y sees, like bug reports and the discussion forums.

So I think he has definite need for this kind of safety.

And of course, the XSS example extends to XML injection vulnerabilities. What if you are writing a banking application that transfers money uing XML as a data interchange? an XML injection vulnerability could be very serious.

And don;t get me started on SQL injection vulnerabilities...

# posted by

Reginald Braithwaite : 6:47 AM

String toUse = context.getStringSanitizer().sanitize(input);

it's as simple as that.

Until you forget to do it once, and lo and behold, SQL Injection/XSS vulnerability until you have a security audit or someone actually takes advantage of your hole.

Which is why raganwald posted this in the first place: when your type system can tell you that you're using strings in an unsafe manner, even if you forget you'll be reminded because your project can't compile.

The point is that tools such as a good type system or a string tainting framework is a safety net (and I guess it could even lead to automatic sanitizing of strings on output)...

# posted by

Masklinn : 7:17 AM

Masklinn: So you can wrap your input Strings in UntrustedString container and pass _that_ to the Sanitizer. Then, by correct design of class relations, you won't either be able to "forget" because UntrustedString is NOT a String. Maybe you even cannot unwrap it without sanitizing because you forced yourself to code like this:

.... [input is UntrustedString]

String toUse = input.sanitize(context.getStringSanitizer())

or, use AspectJ to taint Strings at input and an advice to sanitize when it's needed.

# posted by

albert bulawa : 7:41 AM

I think you have a typo:

> I was once asked in an interview
> to name a feature from C++ I
> wanted to use in Java... It
> would be far easier if C++
> had const inference.

That should be "if Java had const inference", right?

# posted by

Bill Mill : 8:58 AM

I totally agree on the "thinking experience". I am playing nowadays with Scala and really feel that in order to get most benefits of the language, you have to "think in it".

I am also wondering how much the type constraints are actually saving on the assertions I would have to do in my test cases. I would like to see a complete study comparing 2 programs, one with dynamic typing and the other with static typing, with respect to the amount of necessary unit testing. For instance, David Pollak, the author of the lift web framework in Scala claims a reduction of 25% of the loc for unit testing, compared to equivalent testing for a rails application.

As far as I am concerned, I found that complementing "classical" unit tests with a tool such as ScalaCheck (inspired from Haskell's QuickCheck - thanks to Tom Moertel for writing about it!) was also very effective to convey the intention of some "semantic" tests and produce test hundreds of relevant cases at once [If you're a Ruby fan, you can check out RushCheck, the Ruby variant].

Once again, it's a nice thought experiment, to think about your code in terms of "properties" (and then you get tests -almost- for free!).

# posted by

Eric : 8:58 AM

I actually meant I would have liked C++ to have const inference. I guess I mean I would have liked some sort of system where I didn't have to type the keyword "const" everywhere, but the compiler would look for inconsistencies just as ML finds type inconsistencies without needing manifest type declarations everywhere.

The flip side of that argument is that if the compiler did the checking without declarations, I wouldn't have been forced to wade through the code, thinking about it carefully. It's a good exercise... once...

# posted by

Reginald Braithwaite : 9:04 AM

Masklinn: That's why you use a good templating library. Almost all of them will let you set a default output filter that's applied to every placeholder on the page, and let you change that filter on a per-page/per-placeholder basis to accomodate different outputs (HTML, XML, e-mail, URL-in-HTML, SQL file, etc). As long as you *only* exchange data with the outside world through templates, the escaping is handled for you.

Strong static typing has its place, but it's much easier and more reliable to set some policy about where escaping occurs and do it reliably at interface boundaries. Static typing is more useful if you need to computationally manipulate data of different formats and keep track of which data is in what format. For example, manipulating financial instruments in different currencies. Much programming is just putting the damn data on the damn screen, though, and you don't need any fancy type systems for that.

# posted by

Jonathan Tang : 11:39 AM

Johnathan:

Strong static typing has its place, but it's much easier and more reliable to set some policy about where escaping occurs and do it reliably at interface boundaries.

If that's easier for you, sure. I don't think it's easy, but for me there is a trade-off between various language features, and at the moment it works like this:

Clients insist on Java for some kinds of work. Easy trade off: money vs. code security. They pay for the extra time I take to write the code, so that's their call to make.

For my own projects, dynamically typed languages have offered other benefits that--to me--outweighed the benefits of static typing for catching string semantic errors.

But for me, there is a benefit to the kind of type checking that can prevent string semantics errors, it's just a question of priorities. If I was using a language that offered checking I would use checking.

Static typing is more useful if you need to computationally manipulate data of different formats and keep track of which data is in what format. For example, manipulating financial instruments in different currencies.

Speaking of "office space"...

Umm... I'm going to have to, uh, go ahead and disagree with you there.

I think you're describing "grammar checking" again. I can do that in Java today. Instead of using a String for SSNs, for example, I can create an SSN class. It's a pain to deal with multiplying hectares by acres, but I can prevent simple errors.

The "semantic checking" I'm talking about is not mixing up gallons with litres. the simplest case I can think of is finding fencepost errors.

If I write a routine that manipulates lists, can the compiler detect that I haven't dealt with the case of the empty list? This is trivial in an expressive type system, it's maybe the first page of "The Little MLer," but try twisting Java into checking that for you.

Just kidding about Office Space, but please take a moment to think about the difference between grammar and semantics... maybe I am not doing a good job of explaining the difference.

# posted by

Reginald Braithwaite : 11:53 AM

Nice post -- I haven't really started sorting out this sort of advantage yet, but I'm really liking some other aspects of F#, which I've started to use in production code.

I've started thinking in types to some degree -- and wondering how I've lived this long without discriminated unions / algebraic data types. They're simply brilliant for writing structural DSLs for use in your code.

# posted by

AN : 12:44 PM

In one of those happy coincidences, two days ago I started The Little MLer and today I read your article.

I've just experienced first-hand a little taste of the power of these type systems in catching errors. In the Little MLer, an excercise is to add Steak to a another dish. So, there's this datatype:

datatype meza =
Shrimp
| Calamari
| Escargot
| Hummus

and the steak meal:

datatype main =
Steak
| Ravioli
| Chicken
| Eggplant

Bothe of these when evaluated give:

- [opening /tmp/sml241HY8]
datatype meza = Calamari | Escargot | Hummus | Shrimp
val it = () : unit
- [opening /tmp/sml241X5e]
datatype main = Chicken | Eggplant | Ravioli | Steak
val it = () : unit

and the add_steak function:

fun add_a_steak (Shrimp)
= (Shrimp,Steak)
| add_a_steak(Calamari)
= (Calamari, Steak)
| add_a_steak(Escargots)
= (Escargots, Steak)
| add_a_steak(Hummus)
= (Hummus, Steak)

and now, the SML compiler borks (and, by the way, this is a REPL, like in Lisp or Python, when I say "compiler"):

- [opening /tmp/sml241kDl]
/tmp/sml241kDl:1.5-8.22 Error: match redundant
Shrimp => ...
Calamari => ...
Escargots => ...
--> Hummus => ...

uncaught exception Error
raised at: ../compiler/TopLevel/interact/evalloop.sml:63.48-63.56
../compiler/TopLevel/interact/evalloop.sml:44.55
../compiler/TopLevel/interact/evalloop.sml:291.17-291.20

Now, I don't know what the error is but, upon inspection, I noticed I type "Escargot" in the the datatype, instead of "Escargots". I correct that and re-evaluate the datatype and the function:

- [opening /tmp/sml241xNr]
datatype meza = Calamari | Escargots | Hummus | Shrimp
val it = () : unit

- [Opening /tmp/sml241-Xx]
val add_a_steak = fn : meza -> meza * main
val it = () : unit

And there you have it: add_a_steak is correct now.

I just got syntax-checking *for free*!

# posted by

jt : 2:02 PM

"But it will not do a simple little thing for me in return like check all of my code paths to ensure that I never pass a parameter that might be null to a method that assumes the parameter can never be null?"

I would be interested to see, how the compiler can do this with reflection or dependency injection or across frameworks.

Null warnings in IDEs could be better though.

Peace
-stephan
--
Stephan Schmidt :: stephan@reposita.org
Reposita Open Source - Monitor your software development
http://www.reposita.org
Blog at http://stephan.reposita.org - No signal. No noise.

# posted by

Stephan : 10:35 AM

<< Home