Learning Java By EvolutionThis tutorial is in a draft state and was last changed at Friday April 9 13:38:37 BST 1999 . Although it is consistent, and all the programming work done, the detailed explanantions start to peter out around step five. Caveat surfer! |
A good way to become familiar with a programming language is
to roll up your sleeves and try to do something practical with
it. In that spirit, this tutorial tries to teach you some of the
features of Java's brand of object-oriented programming by making
various attempts at solving a particular problem: how to
construct and process trees of numbers. This tutorial presents
something of a pastiche when it comes to programming paradigms:
we start of from a purely procedural point of view and from there
develop an object-oriented solution with functional aspects. Some
of the issues covered are language-specific (how can this be
coded in Java?), others are paradigm-specific (what is the
correct object-oriented approach?).
The straight-forward way to represent a tree in C is to define a data structure as follows:
struct treenode {
int value;
struct treenode *left, *right;
}
The usual way to interpret this structure is that a null left and right values would signify a terminal node. A creator function would be written
struct treenode *tree(int v, struct treenode *l, struct treenode *r){
struct treenode *temp=malloc(sizeof struct treenode);
temp->value=v;
temp->left=l;
temp->right=r;
return temp;
}
so that the above tree could be created with the following expression:
tree(0, tree(0, tree(3, NULL, NULL), tree(4, NULL, NULL)), tree(5, NULL, NULL));
and then a treesumming functon written that traversed the data structure and performed the appropriate calculation on the leaf nodes.
int treesum(struct treenode *tree){
int result;
result=tree->value;
if(tree->left!=NULL)result+=treesum(tree->left);
if(tree->right!=NULL)result+=treesum(tree->right);
return result;
}
The source for the full program can be seen here. It lacks many features of a polished C program, the malloc()-ed storage is never free()-ed for example, but it demonstrates a procedural attempt at an initial solution which is the basis for our first step into Java.
You can go through a process of translating the code as simply as possible; this eventually leads to the code seen here which is a direct translation and makes almost no use of any object-oriented features to help solve the original problem.
Two classes are declared,
Classes are the only organising structure that Java provides the programmer and fulfil the same role as both C's structs (which gather together related data items) and modules (which gather together top-level functions and variables). A Java program is any class which defines a main() method, so the SummedTree program is implemented by the SummedTree class.
The TreeNode class implements the C version's treenode struct, containing an int and two other TreeNodes. Notice that the C version requires pointers to the contained structures (as the compiler does not allow recursive structure definitions) whereas the Java version has TreeNode objects inside a TreeNode. In fact this because all Java object variables (instantiations of classes) really store a pointer (or reference) to the actual object, and so the definition is not so dissimilar to what you may be used to in C. Although Java never makes pointers visible to the programmer, many pointer-style semantics are retained so that Java's pass-by-value semantics for function arguments means functions can modify arguments which are objects (TreeNode, String, arrays) but not arguments which are fundamental types (int, boolean, double, char etc).
The first thing to note is that in an object oriented language, the data and the behaviour of the data are encapsulated together. In other words the functions which handle creating and initialising a node as well as the functions which handle processing a node should be stored together with the node data. This is revolutionary for C programers who cannot define functions inside structs but normal for C++ programmers. And while we're taking this (possibly) radical step, we will put the TreeNode class into its own file, separate from the SummedTree application class file.
In effect, all we have done is to move the functions (or methods as they are known in Java) from the program that used the data structure into the data structure definition itself. Object-oriented program allows us to go a step farther, and make the methods an integral part of the objects themselves so that they use the values embedded in the objects' data. As an example, the treesum method could give any particular tree the ability to sum its own values in the expression mytree.treesum(). This contrasts with the C philosophy in which a global method is passed a particular data item to work on using an expression like treesum(mytree).
Since TreeNodes are now separate objects with their own independent behaviour, we can turn the tree creation routine into a constructor that will be used to construct each new TreeNode. The function will be called every time a new TreeNode is created. This just requires the name of the function to be changed to be the same as the name of the class. The function should return no value and indeed have no return type in the function prototype.
See the steps required for the definition and use of constructors that go to make up version 1.2.
A tree either contains a single integer or another subtree, but never both. We can enforce this requirement by defining two constructors, one for each case. (version 1.3 of the TreeNode class). This helps to reduce the amount of unnecessary information that has to be specified every time a new tree is made, changing the old tree expression
TreeNode mytree=new TreeNode(0, new TreeNode(0, new TreeNode(3, null, null), new TreeNode(4, null, null)), new TreeNode(5, null, null));to the slightly more compact
TreeNode mytree=new TreeNode(new TreeNode(new TreeNode(3), new TreeNode(4)), new TreeNode(5));in version 1.3 of the main program.
Note that we now have two functions which are differentiated only by the number and types of their arguments. This is a technique known as overloading, and can be used for all functions, not just constructors as in this case.
We have now defined objects of a single type (TreeNode) which can be constructed according to two cases. However they are consructed, the two separate cases also dominate their processing, i.e. subtrees are only to be processed if they are present and values must only be regarded if subtrees are not present. It actually makes a lot of sense to completely split these two cases into two separate types: a class of Nodes and a separate class of Leafs. A Leaf object contains a data value, whereas a Node object contains two other Node or Leaf objects.
Although the TreeNode class is split into two separate classes, both of these share an idea of treeness, so that a tree can consist of a Node containing two Leafs, but not a Node containing a String and an int. This concept of treeness is captured as a part of the class hierarchy by an interface called Tree. The purpose of an interface is to declare the 'abstract features' of a type without giving any details of how these features are to be implemented. In this case, the interface simply declares that an object of type Tree should be able to sum its components, i.e. it should contain a sumOfLeaves() method. Any class which contains a sumOfLeaves() method can be declared to be of type Tree.
Both the Node and Leaf classes fulfil the criterion for treeness, and declare themselves to be concrete implementations of the Tree interface. Separating Tree implementation into two separate data types allows the code for each behaviour to be clearly isolated and can be seen as particularly helpful in later versions. This isolation of cases by type is one of the principles elaborated in "Programming by Numbers - A Programming Method for Complete Novices" by Hugh Glaser and Pieter Hartel, and in fact this example is directly lifted from that paper.
You can follow the steps for producing the Leaf and Node classes from the previous version, or you can inspect the current version. Note that the process of separating cases has not noticably changed the usage of the Tree api by the main program, but it has significantly simplified the implementation. The downside is that two classes must now be co-ordinated to achieve the effect that one had previously.
The innovation of the last step required the Node and Leaf implementations to define the summing all nodes behaviour. This is terribly task specific: why should a tree know about adding up? How would one normally approach this in Java?
A typical approach is to reduce any particular data structure to a simple Enumeration, i.e. allow any data structure to have its contents iterated over by simple loop. The following cliche is frequently invoked
Enumeration e=myReallyComplicatedDataStructure.enumerate(); while(e.hasMoreElements()) process(e.nextElement());
Because the Enumeration API is so simple it leads to very simple code structures which can be applied to many situations.
Here the Tree abstraction imposes the requirement that implementing subclasses (Nodes and Leafs) should be able to enumerate their components. No longer a pure interface, it has become an abstract class, a class specification which defines some methods and imposes a requirement that subclasses define others. This particular abstract class requires that its subclasses know how to enumerate their own components (via the enumerate method), but provides a ready-made method (nodes) which creates an enumeration object from the enumerated leaf members.
The net result is that the ability to traverse the specific data structure is held by the implementing subclasses (Leaf and Node), the ability to construct an Enumeration object is held by the abstract Tree class, and the task-specific ability to perform arithmetic on nodes is held by the user of the Tree class.
Note that using an Enumeration does involve a couple of complications for the main program. For a start, the enumeration stores Leafs, not ints, and so the processing loop must extract the int from each Leaf using the .value field. Also, being a general purpose storage facility, an Enumeration is defined to store any kind of Object, not just Leafs. Although e.nextElement() does indeed return an Object which happens to be a Leaf, the static type checking does not realise this and so a cast must be invoked. The full expression which returns the next int value from the enumeration therefore looks like this:
((Leaf)e.nextElement()).value
This recognises that the Tree definition is actually suitable for a more general DataStructure. The Tree abstract class becomes an abstract class for DataStructures, with specific classes implementing BinaryTrees, PerfectBinaryTrees and Lists. These take over from the old Node class, and mainly serve to implement different constructors and enumerators to maintain the semantics of their particular data structures. Leaf still plays a very active part!
Source available here.
Tries to solve the problem: how can I have a null tree or a tree with only one branch? It defines a null class as another specific implementation of a DataStructure which implements its own peculiar constructors and enumerators. Chalk another success up for the type system!
Source available here.
Here is a different attempt to hide the traversal semantics of the data structure. Instead of mapping them onto a known, simple structure (here an Enumeration, in Lisp a list, in Scheme a Stream) which must still be traversed, albeit in a trivial fashion, extend the DataStructure with the concept of accumulators. This allows the Test driver to abandon all form of iteration, instead passing the closest thing that Java can manage to a lambda expression to the data structure itself. Because functions are not first class objects in Java, the lambda expression is written as an anonymous class which contains a single method f() which is the accumulation function. Anonymous subclasses can be created which override the accumulation function: the default is to simply add the next value onto the (zero-based) accumulator.
Source available here.
Trivial extension to the previous version which adds String conversion facilities to the DataStructure(s) to allow them to be simply output.
Source available here.
If you want to define mapping over a data structure you must be able to create new structures. The only way to do this is to use 'new' and constructors. make the type system sing for its supper here. The DataStructure class defines an abstract map method which is implemented by every data structure.
This has raised a problem which requires specific code to disambiguate for the compiler type definitions which would be obvious at runtime. Sigh. ie the simple expression
new PureBinaryTree(left.map(m), right.map(m))
is disallowed, because the type of each map cannot be predetermined.
Source available here.
The previous solution compensated for compile-time type ambiguity by disambiguating the run-time types in the source code. This was done by the 'user' of the API which was frankly horrid. This solution moved the disambiguation closer inside the PureBinaryTree API instead, by creating a generic two-object constructor which explicitly checked its argument types.
Really this has just moved the requirement of a horrid work-around from one place to another, putting the burden of proof on the constructor instead of the user of the constructor. In fact this should have been born by the type system, but what the heck. This second solution has in fact made the PureBinaryTree data structure less 'safe' since a tree can be constructed with arbitrary objects, with no warning. Ick. I guess I could have thrown an Exception, but that would make the example less readable. In fact I chose to replace bad construction arguments by Null objects.
Source available here.
Solutions for overcoming this yukkiness are presented in step 11. In the meantime...
This version just reworks the accumulator facility along the same lines as the mapping facility, ie it is delegated to the subclasses and no longer uses the fiction of an Enumeration.
Source available here.
Steps 9 and 9.5 showed that it is difficult to arrange for the type system to be able to conclusively check DataStructure expressions at compile time. Both solutions required run-time code checking of the dynamic types. To get around this limitation we can (thanks to the inspiration of Luc Moreau) define a new type which is a union of the types that can appear in a particular data structure. In particular the PureBinaryTreeable interface is supported only by the PureBinaryTree, Leaf and Null classes. This allows a single PureBinaryTree constructor to be written in such a way that it will allow only the right types of objects to be used.
The cost? A proliferation of intermingled types. We have a "supreme" DataStructure class which is specialised into lots of specific data structure classes (here PureBinaryTree, BinaryTree and List as well as Null and Leaf). We have now created a "master" DataStructureable interface which is (trivially) extended into specific data structure interfaces (here PureBinaryTreeable, BinaryTreeable and Listable). The "supreme" class implements the "master" interface, and the specific data structure classes implement the specific data structure interfaces.
The added value comes from the fact that Null and Leaf (acceptable as components of all data structures) implement all the data structure interfaces and so each data structure implementation class (e.g. PureBinaryTree) requires only one constructor that can accept ONLY data structures of acceptable classes.
The one down side is that the map function needs to explicitly cast its return values (which have to be declared as DataStructures in order to concrete-ify the abstract superclass). OK, so that's not particularly disabling, but it is irritating.
Source available here.
To eliminate the nasty proliferation of types, classes and interfaces this variant just defines separate Null and Leaf types specific to each data structure. Is this a bit of a retrograde step? Discuss!
Source available here.
You'll notice that I've kept away from Design Patterns. Let's make up a factory! Let's have a Visitor or two....
This document tracks a dynamic process of discovery on behalf of the author! If you want to read authoritative words about designing and using data structures in Java, try the following documents from the Java Developer Connection
The first is a general approach to defining data structures and includes many of the lessons learned in this document. The remainder are about the new data structure framework included in JDK 1.2 and above. I can recommend it!
Les Carr