Thursday, August 05, 2010

Notes on Software Design, Chapter 8. A simple property: Distance (part 1)

You're writing a function in your preferred language. At some point, you look at your code and you see that a portion just does not belong there. You move it outside, creating another function. Asked why, you may answer that by doing so you made it reusable; that you want to respect the Single Responsibility Principle; that you want to increase cohesion; that code just looks "cleaner" and "easier to read" that way; and so on. I guess it sounds rather familiar.

By moving the code outside the function, you also increased its distance from the code that is left inside. This is probably not so familiar: distance is not a textbook property of software. However, the notion of distance is extremely important. Fortunately, distance is a very simple concept. It has an immediate intuitive meaning, and although it's still rather informal, we can already use it to articulate some non-trivial reasoning.

I'm trying to keep this post reasonably short, so I'll cover only the artifact side of the story here. I'll talk about the run-time world next time.

A Concept of Distance
Consider this simple function (it's C#, but that's obviously irrelevant):

int sum( int[] a )
{
    int s = 0;

    foreach( int v in a )
        s += v ;

    return s;
}

I used two blank lines to split the function into three smaller portions: initialization - computation - return. I didn't use comments - the code is quite readable as it is, and you know the adage: if you see a comment, make a function. It would be unnatural (and with dubious benefits) to split that function into sub-functions. Still, I wanted to highlight the three tiny yet distinct procedural portions (centers) within that function, so I used empty lines. I guess most of you have done the same at one point or another, perhaps on a larger scale.

In other words, I wanted to show that some statements were conceptually closer than others, and I did that by increasing the physical distance between the groups in the artifact space. The same works beyond procedural statements: I have seen people "grouping" variable declarations in the same way, to show that some variables sort of "lump together" without creating a structure or a class.

A Measure of Distance
Given two pieces of information P1, P2, encoded in some artifacts A1, A2, we can define their distance D( P1, P2 ) using an ordinal scale, that is, a totally ordered set:

P1 and P2 appear in the same statement - that's minimum distance
P1 and P2 appear in the same group of consecutive statements
P1 and P2 appear in the same function
… etc

For the full scale, including the data counterpart, see my Summary at the Physics of Software website.

Note that Distance is a relative property. You cannot take a particular piece of information and identify its distance (as you could do, for instance, with mass). You need two.
Also, the ordinal scale is rather limiting: you can do no math with it. It would be nice to turn it into a meaningful interval or ratio scale, but I'm not there yet.
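
A minimal sketch of what an ordinal scale allows, assuming we encode only the three levels listed above (the rest of the scale is left out here, just as in the list). The names Distance, DistanceScale and IsCloser are hypothetical, chosen for illustration:

```csharp
// Hypothetical encoding of the first three levels of the scale.
// The remaining levels are deliberately omitted.
public enum Distance
{
    SameStatement,       // minimum distance
    SameStatementGroup,  // same group of consecutive statements
    SameFunction
}

public static class DistanceScale
{
    // On an ordinal scale, comparing two values is the only meaningful operation.
    // C# would also let us subtract two enum values, but the resulting number
    // carries no meaning: the scale says nothing about "how much" farther.
    public static bool IsCloser(Distance x, Distance y)
    {
        return x < y;
    }
}
```

We can say that SameStatement is closer than SameFunction, but not that one distance is "twice" the other.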

Is Distance useful?
As in most theories, individual concepts may seem rather moot, but once you have enough concepts you can solve interesting problems or gain better understanding of complex phenomena. Right now, I haven't introduced the concept of tangling yet, so Distance may seem rather moot by itself. Still, we can temporarily use a vague notion of coupling to explore the value of distance. It will get better in the [near] future, trust me :-).

Consider a few consecutive statements inside a function. It's ok if they share intimate knowledge. The three segments in sum are rather strongly coupled, to the point that it's ineffective to split them into sub-functions, but that doesn't bother me much. It's fine to be tightly coupled at small distance. As we'll see, it's more than fine: it's expected.

Functions within a class are still close together, but farther apart. Again, it's ok if they share some knowledge. Ideally, that knowledge is embodied in the class invariant, but private functions are commonly tied to their callers in a rather strong way. They often assume they are called in specific states (that could be captured in elaborate preconditions), and the caller is responsible for guaranteeing such preconditions. Sequences of calls are also expected to happen in specific orders, so that preconditions are met. Again, that doesn't bother me much. That's why the class exists in the first place: to provide a place where I can group together "closely related" functions and data.
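
Here is a small sketch of that kind of intra-class coupling. The class RunningAverage and its members are hypothetical, made up for illustration: the private function relies on a precondition it does not check, and the public caller is responsible for guaranteeing it.

```csharp
using System;

// Hypothetical example: names and design are illustrative only.
public class RunningAverage
{
    // Class invariant: count >= 0, and total is the sum of all values added so far.
    private double total = 0;
    private int count = 0;

    public void Add(double value)
    {
        total += value;
        count++;
    }

    public double Average()
    {
        if (count == 0)
            throw new InvalidOperationException("No values added yet.");

        // The caller has just guaranteed Mean's precondition.
        return Mean();
    }

    // Precondition (assumed, not checked here): count > 0.
    private double Mean()
    {
        return total / count;
    }
}
```

Mean and Average share intimate knowledge about the state of the object, which is acceptable at this distance: both live in the same class.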

Distinct classes are even more distant. Ideally, they won't share much. In practice, classes inside the same component often end up having some acquaintance with each other. For instance, widgets inside a widget library may work well together, but may not work at all with widgets inside a different library. Still, they're distant enough to be used individually.

We expect components / services to be lightly coupled. They can share some high-level contract, but that should be all.

Applications shouldn't be coupled at all – any coupling should appear at a lower level (components).

The logical consequence here is that coupling must decrease as distance increases. There is more to this statement than is immediately obvious. The real meaning is:
a) large distance requires low coupling
b) small distance requires high coupling

When I explain the concept, most people immediately think of (a) and ignore (b). Yet (b) is very important, because it says:
1) if coupling with the surroundings is not strong enough, you should move that portion elsewhere.
2) the code should go where the coupling is stronger (that is, if code is attracted elsewhere, consider moving it elsewhere :-)). That's basically why feature envy is considered a bad smell – the code is in the wrong place.
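
A rough sketch of point (2), using hypothetical Rectangle and Report classes invented for this example. The area computation touches only Rectangle's data, so it is attracted there:

```csharp
// Hypothetical types, for illustration only.
public class Rectangle
{
    public double Width { get; }
    public double Height { get; }

    public Rectangle(double width, double height)
    {
        Width = width;
        Height = height;
    }

    // After the move: the computation lives next to the data it is coupled with.
    public double Area()
    {
        return Width * Height;
    }
}

public class Report
{
    // Before the move, this class contained something like:
    //     double AreaOf(Rectangle r) { return r.Width * r.Height; }
    // which used only Rectangle's data and nothing of Report's own (feature envy).
    public string Describe(Rectangle r)
    {
        return "area = " + r.Area();
    }
}
```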

Cohesion as an emergent property
Cohesion has always been a more elusive concept than coupling. Looking at the literature, you'll find dozens of different definitions and metrics for cohesion (early works like Myers' Composite/Structured Design used to call it "strength"). I've struggled with the concept for a while, because it didn't fit too well with other parts of my theory, but then I realized that cohesion is not a property per se.

Cohesion is a byproduct of attraction and distance: an artifact is cohesive if its constituents are at the right distance, considering the forces of attraction and rejection acting upon that artifact. If the resulting attraction is too strong or too weak, parts of that artifact want to move either down or up in the distance hierarchy, or into another site at the same level.

Attraction is too weak: the forces keeping that code together are not strong enough to warrant the short distance at which we placed the code. For instance, a long function with well-identified segments sharing little data. We can take that sequence of statements and move it up in the hierarchy - forming a new function.
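
As a sketch of the "too weak" case, consider a hypothetical Stats class (all names are invented for this example). The two segments of SummaryBefore share nothing but the input array, so each can move up in the hierarchy and become a function of its own:

```csharp
// Hypothetical example; both versions assume a non-empty array.
public static class Stats
{
    // Before: one function, two segments that share only the input.
    public static string SummaryBefore(int[] a)
    {
        int min = a[0];
        foreach (int v in a)
            if (v < min) min = v;

        int max = a[0];
        foreach (int v in a)
            if (v > max) max = v;

        return min + ".." + max;
    }

    // After: each weakly attracted segment became its own function.
    public static string SummaryAfter(int[] a)
    {
        return Min(a) + ".." + Max(a);
    }

    private static int Min(int[] a)
    {
        int min = a[0];
        foreach (int v in a)
            if (v < min) min = v;
        return min;
    }

    private static int Max(int[] a)
    {
        int max = a[0];
        foreach (int v in a)
            if (v > max) max = v;
        return max;
    }
}
```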

Attraction is too strong: for instance, we put code in different classes, but those classes are intimately connected. The easiest thing is to demote one class to a set of functions (down in the hierarchy) and merge those functions with the other class. But perhaps the entire shape is wrong, at odds with the forcefield. Perhaps new abstractions (centers) must be found, and functions, or even statements, moved into new places.

This is closing the circle, so to speak. Good software is in a state of equilibrium: attraction and rejection are balanced with proper distance between elements.

Note: I'm talking about attraction and rejection, but I have yet to present most attractive / repulsive forces. Still, somehow I hope most of you can grasp the concepts anyway.

An Alexandrian look at the notion of distance
I've quoted Christopher Alexander several times in an earlier discussion on the concept of form. Now, you may know that Alexander's most recent theory is explained in 4 tomes (which I haven't deeply read yet) collectively known as "The Nature of Order". A few people have tried to relate some of his concepts with the software world, but so far the results have been rather unimpressive (I'm probably biased in my judgment :-).

On my side, I see a very strong connection between the concept of equilibrium as an interplay between distance and the artifact hierarchy and the Alexandrian concept of levels of scale: “A balanced range of sizes is pleasing and beautiful”.
Which is not to say that you should have long functions, average functions, small functions :-). I would translate that notion into the software world as: a balanced use of the artifact hierarchy is pleasing and beautiful. That is:
Don't use long functions: use multiple functions in a class instead.
Don't use long classes: use multiple classes in a component instead.
Don't create huge components: use multiple components inside an [application/service/program] instead.

This is routinely ignored (which, I think, contributes to the scale-free nature of most source code) but it's also the very reason why those concepts were introduced in the first place! Actually, we are probably still missing a few levels in the hierarchy, as required for instance to describe systems of systems.

Gravity, efficiency, and the run-time distance
Remember gravity? Gravity (in the artifact world) provides a path of least resistance for the programmer: just add stuff where there is other vaguely related stuff. Gravity works to minimize distance, but in a kind of piecemeal, local-minimum way. It's easy to get trapped in a local minimum. The minimum is local when we add code that is not tightly connected with the surroundings, so that other forces at play (not yet discussed) will reject it.

When you point out incoherent, long functions, quite a few programmers bring in "efficiency" as an excuse (the other most common excuse being that it's easier to follow your code when you can just read it sequentially, which is another way to say "I don't understand abstraction" :-).
Now, efficiency is a run-time concept, and I haven't explained the corresponding concept in my theory yet. Still, using again the informal notion of efficiency we all have, we can already see that efficiency [in the run-time world] tends to decrease as distance [in the artifact world] increases. For instance, moving lines into another function requires passing parameters around. This is a first-cut, rough explanation of the well-known trade-off between run-time efficiency and artifact quality (maintainability, readability, reusability).
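
To make the parameter-passing remark concrete, here is a sketch based on the sum function from earlier in the post. SumSplit and Accumulate are hypothetical names, and splitting such a tiny function is not a recommendation (as said above, it would be unnatural); the point is only to show what crosses the new boundary. A compiler or JIT may well inline such a small call, so no actual cost is claimed or measured here.

```csharp
public static class Sums
{
    // Original shape: all statements at small distance, sharing s and v directly.
    public static int Sum(int[] a)
    {
        int s = 0;

        foreach (int v in a)
            s += v;

        return s;
    }

    // Hypothetical variant: the accumulation step moved into its own function.
    // The shared state (s and v) must now be passed in, and the result passed back.
    public static int SumSplit(int[] a)
    {
        int s = 0;

        foreach (int v in a)
            s = Accumulate(s, v);

        return s;
    }

    private static int Accumulate(int s, int v)
    {
        return s + v;
    }
}
```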

Coming soon:
the concept of distance in the run-time world, and distance-preserving transformations
efficiency in the physics of software
tangling
not sure yet : ), but probably isolation and density

5 comments:

cyrille said...

Hi Carlo,

Your investigations and posts around your "Physics of Software" are totally fascinating and exciting to read.

You're not the first to try that, but your approach looks to me the most relevant and sound I've ever seen by far.

I like the jokes too! I'm eager to read more, hoping for a full system of inter-related concepts that we can -hopefully- reason on.
Cheers,

Carlo Pescio said...

Thanks Cyrille,
really appreciated!

I hope/expect to complete at least 2 chapters this month. I've got a lot of half-baked material, and although I want to ponder on tangling a little more, I too am eager to share a few new ideas with you guys : )

Unknown said...

Hi Carlo,
a comment only to have fun :)
It seems that you are not the only one trying to use gravity to model other concepts:

http://www.phdcomics.com/comics/archive.php?comicid=1354

cheers,
Eros

Carlo Pescio said...

you know, the 3D drawing of the potential well looks remarkably similar to something I drew some time ago while thinking about the forcefield:

http://www.eptacom.net/blog/ffpic1.jpg

(took me a while to dig it up :-)

Romano Scuri said...

It could also be only a style issue.

In your first posts there was no trace of separating lines between one paragraph and the next. Bold was missing completely.

The same goes for code. Learning to write with better style benefits whoever reads it.

After all, the program would work just the same even if it were structured differently.