What does Barbara Liskov have to say about Equality in Java?

(This is a snapshot of my old weblog. New posts and selected republished essays can be found at raganwald.com.)

Sunday, April 15, 2007

Erick Reid posted some code showing his standard implementations of equals and hashCode in Java (updated link). His implementations draw heavily on Joshua Bloch’s principles. I think he’s done a great job in implementing the Java standard implementation of equality. On the other hand, I’m a little disquieted by the way equality works in Java.

Here’s a shorter version of his equality method for a class, TempBean, that has members _foo, and _bar:


public boolean equals(Object object) {
  if (object != this) {
    if (object != null && getClass().equals(object.getClass())) {
      final TempBean other = (TempBean) object;
      return (_foo == null ? other._foo == null : _foo.equals(other._foo)) &&
          (_bar == null ? other._bar == null : _bar.equals(other._bar));
    }
    return false;
  }
  return true;
}

The interesting part for me is the test for class equality: getClass().equals(object.getClass()). Compare and contrast to:


public boolean equals(Object object) {
  if (object != this) {
    if (object != null && object instanceof TempBean) {
      final TempBean other = (TempBean) object;
      return (_foo == null ? other._foo == null : _foo.equals(other._foo)) &&
          (_bar == null ? other._bar == null : _bar.equals(other._bar));
    }
    return false;
  }
  return true;
}

This works when object is a TempBean. That’s almost the same as the first code. It’s different when object is an instance of TempBean but not a TempBean.

For example:


void doSomethingSeussian (final TempBean thingOne) {
    final TempBean thingTwo = TempBean.getThingFromBox(/* ... */);

    if (thingTwo.equals(thingOne)) {
        // ...
    }
}

thingTwo is obviously a TempBean. Or is it? What does the factory method getThingFromBox do? And what if thingOne is an instance of a subclass of TempBean? What if it has exactly the same _foo and _bar as thingTwo? Why isn’t it equals?

Does this sound farfetched? What if it doesn’t even have any different members, it only varies in some subtle implementation way? For example, what if thingOne is a ThreadSafeTempBean that guarantees thread safety in a way that is otherwise transparent to clients? Or a MockTempBean that is used for unit tests?

All equalities are equal, but some are more equal than others

If you program in Java, you probably already know that == is not the same thing as equals. One measures identity , whether two objects are the exact same identical object in memory at the same location. (Unless they are those dratted primitives, in which case it measures value equality). Identity is almost useless in Java, it breaks all over the place, and that’s why it’s rightfully rare.

Value equality is where the interesting problem lies. There are two kinds of value equalities: physical equality (“these two are identical copies of the same object in every respect”) and logical equality (“for my purposes, these two are equivalent”).

A Little Java, a Few Patterns is brought to you by the authors of The Little Schemer, a book that teaches recursion and first-class functions, as well as The Little MLer, a book that teaches you how to design programs with a powerful strong, static type system.

So it is no surprise to discover that their book about Java eschews the same old, same old Hello World approach in favour of teaching deep ideas about Object-Oriented design and Pattern-Oriented design using Java as the language of instruction.

The basic problem here is that equals captures physical equality, which ignores the Liskov Substitution Principle, one of the fundamental principles in OO design. It is leaking the implementation, just like a leaky abstraction.

(I think a lot of the problem starts when people conflate interfaces with classes. There’s a big difference between saying “this is the interface TempBean, and here is how to tell if any two TempBeans are equals to each other” and saying “this is an implementation of a TempBean, and if you want to know whether it is the exact same thing as another implementation, here is how to do it.”

The former ought to be very common, and the latter quite rare in OO programs.)

It isn’t easy to fix the physical equality problem. If you use the second form of code, you get an equals method that is not symmetric (thanks, Marc!). In the examples above, thingOne.eqauls(thingTwo) is not always the same as thingTwo.equals(thingOne)! And if you’re implementing a HashMap, which kind of equality do you want to use for deciding whether a key is already in the collection?

This last question is another way of pointing out that equality is not an easy thing to define if you mix the notion of polymorphism into a language. Do you always want logical equality? I can easily think of code that would break horribly if you made a HashMap that respected the Liskov Substitution Principle for keys. And I can also think of code where I want a map or set that respects the Liskov substitution principle.

What can we do about this?

To make equality adhere to the Liskov Substitution Method in a language like Java, you can’t make it an object method. One of the other cherished notions in OO design is that every object ought to know everything about its behaviour. But this objects can’t know how much detail a client wants in making the comparison.

The piece of code that ought to know how to compare two TempBeans is the TempBean class (which is also the TempBean interface in Java, but Java doesn’t allow code in interfaces). We’re off to a good start, the code for equals is in the TempBean class. But where we went wrong is making it an instance method. What happens if it’s a static method?


public static boolean equivalent(TeampBean left, TempBean right) {
  if (left != right) {
    if (left != null && right != null) {
      return (left._foo == null ? right._foo == null : left._foo.equals(right._foo)) &&
          (left._bar == null ? right._bar == null :left. _bar.equals(right._bar));
    }
    return false;
  }
  return true;
}

Now you can write code like:


void doSomethingSeussian (final TempBean thingOne) {
    final TempBean thingTwo = TempBean.getThingFromBox(/* ... */);

    if (TempBean.equivalent(thingTwo, thingOne)) {
        // ...
    }
}

Without worrying about whether either variable holds a TempBean, a ThreadSafeTempBean, or a MockTempBean. It’s true that sometimes you care very deeply about the difference. But most of the time, you probably do not, and using an equals method that does care is wrong.

Labels: java

¶ 9:40 PM

Comments on “What does Barbara Liskov have to say about Equality in Java?”:

The whole LSP thing in Java is interesting. I've just written something about it and how it relates to duck typing in Java.

http://www.kirit.com/Walking%2C%20talking%20and%20quacking%20in%20Java

In truth the only way to fully reconcile any non-identity notion of equality with LSP is to have the terms of equality determined by the client code, i.e. where the equality test is used.

If identity is broken as you say it is (and I don't know enough Java to really know) then Java is conflating value and object semantics. This of course is why the distinction between object and value semantics is important and why solving the equality test problem in Java looks to be so hard.

Any test that is devised within an object using value semantics where the object itself logically follows object semantics is doomed to fail for a lot of uses.

The best way out of it would be to fix the use of identity in Java, but that might just not be possible.

# posted by

Kirit Sælensminde : 1:32 AM

Although it doesn't do much good in Java proper, this is one of the things I found most pleasant when I was still coding in Nice (the language, not the city). Multiple dispatch provides a very elegant solution to this problem, because you can define equals(TempBean, TempBean) and then SubclassOfTempBean can override not only equals (SubclassOfTempBean, TempBean) but also equals (TempBean, SubclassOfTempBean), so that when the more specific class is present on the right hand argument of equals, the overridden method is used.

There are a number of cases where this helps a lot.

# posted by

David R. MacIver : 3:39 AM

That equals method you quote, while certainly fine, should be generated by your IDE; they do that sort of thing automatically and can ensure that the accompanying hashCode() is in lockstep.

Speaking of hashCode, you forgot something rather critical:

Any attempt to offload Equivalency Testing to a separate codebase (much like java.util.Comparator offloads ordering of objects to a separate implementation, so that you can sort or create TreeSets/TreeMaps (maps/sets that sort themselves as you add items to it) according to an ordering of your own liking) must by neccessity include custom hashCode() code, or you will never be able to use this equivalency test for hashmaps or hashsets. That'd be a rather enormous and unhappy restriction.

I certainly thing it would be prudent to add an Equivalent class that is analogout to Comparator, and have versions of HashSet, HashMap, and various 'contains' methods in the Collections API that use it, but this Equivalent interface must force the programming to supply both an equals and a hashing algorithm, or its mostly useless.

How do other languages 'solve' this problem?

# posted by

Reinier Zwitserloot : 5:49 AM

"How do other languages 'solve' this problem?"

Most other languages don't have the problem to start with. You can find it though in other languages that confuse value and object semantics.

Javascript is an extreme example of this and is one reason why it is hard to use for anything really complex. Smalltalk suffers to a lesser extent because it maintains object identity, but often makes true value semantics hard.

It is much easier to handle values in an system that maintains identity than it is to handle identity in one that loses it.

I don't really know what Reg means when he says identity in Java is broken or in what way. I'd like to see a deeper explanation of what he means.

The usual way to deal with it though is to use an extra level of indirection - introduce smart pointers that can be copied and dealt with as values but can find the unique object when required. You then get effective identity back again. My guess is that if you were to analyse Java O/RM solutions this is the approach that they take (assuming that it is a problem of identity confusion).

Of course if the problem is actually that the class designs are broken in that they confuse object and value semantics within the object ecology then there is little hope of avoiding confusion in the use of those objects.

# posted by

Kirit Sælensminde : 8:16 AM

Kirit:

Maybe I should choose my words more carefully! I may revise that paragraph after I've finished my espresso.

What I meant was, identity is almost never what you want in Java. I consider it a design flaw that the easiest thing to use, the == operator, is of little use to everyday programmers.

It's very hard to use identity in non-trivial programs. As you point out, you need to be very careful that when you construct two objects that represent the same "thing," that you make sure they are represented using the same object.

For example, the Java compiler tries to do this with String constants, so that when you write "Hello World" in two places you get the same string.

But will "Hello " + "World" be identical to "Hello World"? If not, why not?

Horribly broken? Perhaps I overstated it. But it looks very much to me like identity in Java is an obscure concept of great interest to people building infrastructure (like ORMs) and almost off-limits to everyday programming.

# posted by

Reginald Braithwaite : 9:01 AM

Back when I took a class that used Liskov's textbook, we were taught that Java's class system doesn't fully support the LSP, because it didn't handle contravariance properly. That is: According to the LSP, if C2 is a subclass of C1 and D2 is a subclass of D1, and there's a method in D1 that returns an object of class C2, then a method with the same name in D2 can return an object of class C1. But Java required the return type of the method in the subclass to be exactly the same as the return type of the method with the same name in the superclass.

Has this changed with Java generics? Can you declare "<T extends C2> method();" in D1 and then declare "<T extends C1> method();" in D2?

# posted by

Seth Gordon : 9:44 AM

OK Reg. I understand what's going on a lot better now. Never discount the likely hood of me misreading something :-)

As somebody who has built a large and complex O/RM and web framework in C++ (not as stupid as it at first seems) object identity is something that I've had to think about very hard so people using the system don't have to.

This seems to be something that C++ does get right.

std::string str( "Hello world" );
assert( str == "Hello world" );
assert( str == std::string( "Hello" ) + " world" );

This fails horribly if you insist on using C idioms in C++, but so long as you are writing modern C++ everything works pretty well.

The reason is that the difference between the value and object semantics is explicit in the syntax.

I've been working off and on to write an article exploring these things in more depth - and they do need quite some depth to be understood fully. It's been taking a while, but this Java stuff is useful extra input.

# posted by

Kirit Sælensminde : 9:52 AM

As Keith pointed out over on PRC, Scheme has four different predicates for object equality/equivalence:

Scheme's equivalence predicates

# posted by

Reginald Braithwaite : 10:30 AM

Seth:

As of Java 1.5 return types of overridden methods are covariant. i.e. you can declare the return type of the overridden version to be any subtype of the return type of the method it overrides.

Perhaps slightly unfortunately, although the return types are covariant the argument types are not contravariant. You can't broaden the argument type of the overridden method (or rather, you can, but it will be a separate overload rather than overriding the parent).

reginald:

It's mostly irrelevant to your point, but "hello " + "world" actually will be == to "hello world", because the compiler collapses string literals together when you concatenate them.

On the other hand, String is a final class, so doesn't really run into the equals problem anyway.

# posted by

David R. MacIver : 10:52 AM

Oh, incidentally, I disagree that == is rarely what you want.

For mutable objects, value equality is a very dangerous thing to use. Reference equality makes much more sense. Consequently, I often would choose not to override equals from the default implementation for my objects unless they were immutable.

# posted by

David R. MacIver : 10:54 AM

Did I mix up covariance and contravariance again? Crap.

# posted by

Seth Gordon : 10:55 AM

Crap, I forgot that compiler optimization. I guess I should revise my comment to throw in some indirection like a variable.

Actually, this kind of strengthens my point, I think.

You shouldn't have to know in your head that "Hello " + "World" is == "Hello World" but that something more complicated like "Hello World".substring(0) is only == "Hello World" with a certain level of optimization turned on.

# posted by

Reginald Braithwaite : 11:00 AM

Would you be very upset with me if I pointed out that the String implementation includes a special case that actually means that myString.substring(0) == myString? :-)

(I agree this doesn't in any way negatively impact your point and in fact strengthens it. I'm just being pedantic).

# posted by

David R. MacIver : 11:09 AM

No, it doesn't upset me. Thanks goodness this isn't one of those evil "Wheel of Fortune" job interviews!

I accept that I haven't been paying close attention to the exact semantics of string identity in the JVM.

So I may not have been able to cook up a really simple example (my next attempt would have been "Idempotent Hello World".substring(11)...).

But perhaps someone can set me straight on the general case: is it the case that two strings that print the same are always identical?

Or just the case that the compiler happens to reuse the same string constants where it can, possibly for memory optimization purposes, but the language makes no such guarantee and therefore you are not safe using == for strings even though you are safe using == for primitives like ints.

Which brings up another question: if == always works for ints, des it always work for Integers?

# posted by

Reginald Braithwaite : 11:23 AM

Equality predicates is one area where I really wish more languages would use three-valued logic. It would be nice if these functions could distinguish between "definitely equal", "definitely not equal", and "heck if I know".

# posted by

Seth Gordon : 11:26 AM

The language makes very few guarantees about == for String. It just happens to do a lot of clever reuse of strings in order to minimise instantiation of String objects.

What is guaranteed is that for any representation of a String there is a canonical interned String. Two strings which have the same value will have == interned strings.

So "hello world" == myHelloWorld.intern() but there is no guarantee that it == myHelloWorld. In situations where you want to use == to compare Strings for performance reasons you should intern() all your strings. Some XML parsers do this. You need to be very careful when doing this though because it will chew through your memory like anything.

The situation is even more awful for Integers. :)

If your Integer objects lie between -128 and 127 then two Integer objects representing the same value will usually be the same object. Outside of this range no guarantees are made.

In particular, Integer.valueOf(128) != Integer.valueOf(128).

However 128 == Integer.valueOf(128). This is because Integers get unboxed to ints rather for operators, rather than the other way around. (Which is why myInteger == 0 will throw a NPE if myInteger is null).

# posted by

David R. MacIver : 11:48 AM

Thanks Raganwald, you have justified my complete avoidance of equals(): There are too many possible interpretations of "equals" to allow equals() in the core libraries.

Yes, my code is full of "a.compareTo(b)==0"
Yes, I have my own collections library, but I have had them since them since 1.1 days.

Like you suggest, I make my own comparison methods to define what I mean by "equals". I sometimes provide an intern() method to map "equal" objects to a canonical representative for identity comparisons. This allows me to use == more than most.

On another note: Just say "Hello " + "World" != "Hello World" without compiler optimizations.

David R. MacIver: Multiple dispatch does not solve the problem any better than single dispatch does. As you can see from erick reid's code, he is dispatching on the parameters too. I will admit language support for multiple dispatch is cleaner, but it does not solve the confusion of "equals".

Furthermore, contravariant parameter overloading, in an OO environment, is WRONG! :) http://www.arcavia.com/kyle/Analysis/Covariant%20Specialization.html

# posted by

Kyle Lahnakoski : 12:15 PM

Reminder: comments can include links, but it isn't automatic. You have to use an HTML "a" tag.

Walking, talking and quacking in Java

Covariant Specialization

# posted by

Reginald Braithwaite : 12:21 PM

There's still the confusion over what equals should mean, but it eliminates the implementation difficulties.

The issue is that if I define a class Foo with equals I can define a default implementation for equals.

The problem is that I don't know at this time what subclasses are going to do to this class. It may be perfectly valid for them to continue using the equals implementation I provide, or it may not. There is *no* way to know this in advance.

With multiple dispatch when you subclass you can modify the behaviour of the original equals when applied to an argument of my subclass, so can mitigate this problem greatly.

Essentially there are two issues here:

a) Making value equality work properly in an environment with subtyping is hard, particularly when also worrying about reference equality.

b) Single dispatch makes it very hard to provide a decent implementation of value equality which interacts well with subtyping.

Multiple dispatch eliminates b) while doing nothing to alleviate a).

I didn't really follow the argument for why contravariant parameter subtyping is wrong in an OO environment. (Java programmer or not, I still think like a functional programmer most of the time). I'll read through it in more detail later.

P.S. I swear the captcha gets longer with each post I make...

# posted by

David R. MacIver : 12:32 PM

Just to clarify:

"Hello" != new String("Hello");
Integer.valueOf(0) != new Integer(0);

Because "Hello" and Integer.valueOf(0) return interned instances, whereas new String() and new Integer() return new instances, therefore they have different identity.

# posted by

Matthew : 1:55 PM

@Reginald:

This is always false:

new String("asdf") == new String("asdf")

as is

new String("asdf") == "asdf"

Likewise in the case of Integers: if you new them both, or even one of them, then == can never be true.

If you don't use new, then it matters how they were actually created behind the scenes -- specifically, whether objects are ever reused or a request for an object always results in a new object.

For example, if you look inside java.lang.Integer at the following method:

public static Integer valueOf(int i);

you can see that the class maintains a cache of integer objects for values from -128 to +127. If you call Integer.valueOf(121) twice, you get the same object (== on objects would be true), but if you call Integer.valueOf(129) twice, you get two different objects.

In all these cases, you should be using .equals for comparisons, so whether it's the same object or not doesn't matter.

# posted by

j.k. : 8:47 PM

> I certainly thing it would be prudent to add an Equivalent class that is analogout to Comparator, and have versions of HashSet, HashMap, and various 'contains' methods in the Collections API that use it, but this Equivalent interface must force the programming to supply both an equals and a hashing algorithm, or its mostly useless.

This has been discussed here: http://forums.java.net/jive/thread.jspa?messageID=104566. It would be a good candidate for a contribution: http://jdk-collaboration.dev.java.net/

Matthias

# posted by

Matthias : 8:14 AM

A couple of people have suggested that your IDE ought to handle equals and hashCode for you.

If the IDE can, why can't the language just do it? Seems to me like all objects should automatically override equals with the "obvious" definition automatically, and not even let you override it, forcing you to use a more descriptive name if your equals is not equality.

Hash tables and such could always take predicates to do the hashing and equality checks that would default to calling the (now compiler-generated) equals and hashCode methods.

I suppose, though, that caches and such would need custom equals. How about letting you disable equals for a class (at compile-time), in which case members of that class wouldn't be taken into account by the automatically generated equals and hashCode?

# posted by

me22 : 11:08 AM

A couple of people have suggested that your IDE ought to handle equals and hashCode for you.

If the IDE can, why can't the language just do it?

Of course, Java does do it, however it does it in such a way that the default implementation in the Object class is unworkable for most purposes.

Now that being said, you may want to build your own definition, perhaps an ObjectWithSensibleEqualitySemantics that has a good defaulty equals and hashCode implementation.

Only now, you cannot use it much because Java prohibits multiple inheritance or mixins.

For all of the complaints about C++ MI being "too hard to use properly," MI works incredibly well when you design a program to have small classes that each have one specific responsibility, such as providing just the default equality and hash code methods.

See also Ruby's mixins.

FWIW, I strongly dislike the idea (joke!) of the IDE doing it. This is called scaffolding or static code generation. So what happens when you modify a class to add or remove members. Does your IDE remember to update the equality and hashcode methods?

This has to be baked into the language or do something at run time with reflection to have a ghost of a chance of working properly.

In the post, I suggested TempBean.equivalent(...). If you like Java Generics, you might prefer:

EqualityLibrary.equivalent(TempBean.class, foo, bar).

# posted by

Reginald Braithwaite : 11:18 AM

Small suggested correction: when you say that equality is non-reflexive in Java, the example you give demonstrates that it is (in some situations) non-symmetrical.

For those who need a refresher, reflexive means x == x, whereas symmetrical means x == y implies that y == x.

# posted by

Marc Hamann : 10:26 PM

Reginald,

you're contradicting yourself. In your post you explain very well why object equality depends on the client: "But this objects can’t know how much detail a client wants in making the comparison."

In your last comment, however, you want a mechanism to mix-in an equality method into objects.

We should instead go for equivalence objects (like Comparator, just with hashCode), usable in HashMap, and the system should provide a useful implementation of those _per class_:

interface Equivalence<T> {
public boolean eq(T one, T two);
public int hash(T instance);
}

class Class<T> {
public final Equivalence<T> EQ; // builtin, same as the IDE
// generates today
};

So we can do SomeBean.class.EQ.eq(bean1, bean2).

or do a

new HashMap<SomeBean, String>(SomeBean.class.EQ);

# posted by

Matthias : 8:23 AM

No contradiction!

When I say that mixins or MI plus reflection would be helpful for defining useful default semantics, what makes you think we are talking about instance methods?

Oh right... in addition to not allowing mixing in of implementation semantics, Java also does not allow any kind of class-level implementation inheritance: Static methods are really global procedures with some name space dressing.

# posted by

Reginald Braithwaite : 9:27 AM

i started a new blog and have reposted the utility code this posting references: http://erickreid.blogspot.com/2008/06/sourcecode-generation-of-good-equals.html

# posted by

erickreid : 11:10 PM

<< Home