Wednesday, January 19, 2011

Can we really control the quality/cost trade-off?

An old rule in software project management (also known as the Project triangle is that you can have a system that:
- has low development cost (Cheap)
- has short development time (Fast)
- has good quality (Good)
except you can only choose two :-). That seem to suggest that if you want to go fast, and don't want to spend more, you can only compromise on quality. Actually, you can also compromise on feature set, but in many real-world cases, that happens only after cost, time and quality have already been compromised.

So, say that you want to compromise on quality. Ok, I know, most engineers don't want to compromise on quality; it's almost like compromising on ethics. Still, we actually do, more often that we'd like to, so let's face the monster. We can:

- Compromise on external quality, or quality as perceived by the user.
Ideally, this would be simply defined as MTBF (Mean Time Between Failures), although for many users, the perceived "quality" is more akin to the "user experience". In practice, it's quite hard to predict MTBF and say "ok, we won't do error checking here because that will save us X development days, and only decrease MTBF by Y". Most companies don't even measure MTBF. Those who do, use it just as a post-facto quality goal, that is, they fix bugs until they reach the target MTBF. We just don't know enough, in the software engineering community, to actually plan the MTBF of our product.

- Compromise on internal quality, or quality as perceived by developers.
Say goodbye to clarity of names, separation of concerns, little or no duplication of logic, extendibility in the right places, low coupling, etc. In some cases, poor internal quality may resurface as poor external quality (say, there is no error handling in place, so the code just crashes on unexpected input), but in many cases, the code can be rather crappy but appear to work fine.

- Compromise on process, or quality as perceived by the Software Engineering Institute Police :-).
What usually happens is that people start bypassing any activity that may seem to slow them down, like committing changes into a version control system :-) or being thorough about testing. Again, some issues may resurface as external quality problems (we don't test, so the user gets the thrill of discovering our bugs), but users won't notice if you cut corners in many process areas (well, they won't notice short-term, at least).

Of course, a "no compromises" mindset is dangerous anyway, as it may lead to an excess of ancillary activities not directly related to providing value (to the project and to users). So, ideally, we would choose to work at the best point in the quality/time and/or quality/cost continuum (the "best point" would depend on a number of factors: more on this later). Assuming, however, that there is indeed a continuum. Experience taught me the opposite: we can only move in discrete steps, and those are few and rather far apart.

Quality and Cost
Suppose I'm buying a large quantity of forks, wholesale. Choices are almost limitless, but overall the real choice is between:

- the plastic kind (throwaway)

it works, sort of, but you can't use it for all kinds of food, you can't use it while cooking, and so on.

- the "cheap steel" kind.

it's surely more resistant than plastic: you can actually use it. Still, the tines are irregular and feel uncomfortable in your mouth. Weight isn't properly balanced. And I wouldn't bet on that steel actually being stainless.

- the good stainless steel kind.

Properly shaped, properly balanced, good tasting :-), truly stainless, easy to clean up. You may or may not have decorations like that: price doesn't change so much (more on this below).

- the gold-plated kind.

it shows that you got money, or bad taste, or both :-). Is not really better than the good stainless fork, and must be handled more carefully. It's a real example of "worse is better".

Of course, there are other kind of forks, like those with resin or wood handles (which ain't really cheaper than the equivalent steel kind, and much worse on maintenance), or the sturdy plastic kind for babies (which leverage parental anxiety and charge a premium). So, the quality/cost landscape does not get much richer than that. Which brings us to prices. I did a little research (not really exhaustive – it would take forever :-), but overall it seems like, normalizing on the price of the plastic kind, we have this price scale:

Plastic: 1
Cheap steel: 10
Stainless steel: 20
Gold-plated: 80

The range is huge (1-80). Still, we have a big jump from throwaway to cheap (10x), a relatively small jump from cheap to good (2x), and another significant jump between good and gold-plated (4x). I think software is not really different. Except that a 2x jump is usually considered huge :-).

Note that I'm focusing on building cost, which is the significant factor on market price for a commodity like forks. Once you consider operating costs, the stainless steel is the obvious winner.

The Quality/Cost landscape for Software
Again, I'll consider only building costs here, and leave maintenance and operation costs aside for a moment.

If you build throwaway software, you can get really cheap. You don't care about error handling (which is a huge time sink). You copy&paste code all over. You proceed by trial and error, and sometimes patch code without understanding what is broken, until it seems to work. You keep no track of what you're doing. You use function names like foo and bar. Etc. After all, you just want to run that code once, and throw it away (well, that's the theory, anyway :-).
We may think that there is no place at all for that kind of code, but actually, end-user programming often falls into that case. I've also seen system administration scripts, lot of scientific code, and quite a bit of legacy business code that would easily fall into this category. They were written as throwaways, or by someone who only had experience writing throwaways, and somehow they didn't get thrown away.

The next step is very far apart: if you want your code to be any good, you have to spend much more. It's like the plastic/"cheap steel" jump. Just put decent error handling in place, and you'll spend way more. Do a reasonable amount of systematic testing (automated or not) and you're gonna spend more. Aim for your code to be reasonably readable, and you'll have to think more, go back and refactor as you learn more, etc. At this stage, internal quality is still rather low; code is not really stainless. External quality is acceptable. It doesn't feel good, but it gets [some] work done. A lot of "industrial" code is actually at this level. Again, maintenance may not be cheap, but I'm only considering building costs here; a 5x to 10x jump compared to throwaway code is quite reasonable.

The next step is when you invest more in -ilities, as well as in external quality through more systematic and deeper testing. You're building something that you and your users like. It's truly stainless. It feels good. However, it's hard to be "just a little stainless". It's hard to be "somehow modular" or "sort of thread-safe". Either you're on one side, or you're on the other. If you want to be on the right side, you're gonna spend more. Note that just like with fork, long-term cost is lower for good, stainless code, but as far as building cost is concerned, a 2X increase compared to "cheap steel" code is not surprising.

Gold-plating is obvious: grandiose design that serves no purpose (and that would appear as gratuitous complexity to a good designer), coding standards that add no value (like requiring a comment on each and every parameter of every function), etc. I've seen systems spending a lot of energy trying to be as scalable as Amazon, even though they only had to serve a few hundred concurrent clients. Goldplated systems spend more on building, and may also incur higher maintenance costs afterward (be careful not to scratch the gold :-). Of course, it takes experience to recognize goldplating. To someone that has never designed a system for testability, interfaces for mocking may seem like a waste. To an experienced designer, they tell a different story.

But software is... soft!
As usual, a metaphor can be stretched only so far. Software quality can be easily altered over time. Plastic can't be transformed into gold (if you can, give me a call :-), but we can release a bug-ridden version and fix issues later. We may incur intentional technical debt and pay it back in the next release. This could be the right strategy in the right market at the right time, but that's not always the case.

Indeed, although start-ups, customer discovery, innovative products, etc seem to get all the attention lately, if you happen to work in a mature market you may want to focus on early-on quality. Some customers are known to be forgiving about quality issues, in exchange for early access to innovative products. It's simply not the general case.

Modular Quality?
Theoretically, we could invest more in some "critical" components and less in others. After all, we don't need the same quality (external and internal) in every single portion. We can accept quirks here and there. Due to the fractal nature of software, this can be extended to the entire hierarchy of artifacts. It's surely possible. However, it takes a lot of maturity to develop on the edge of quality. Having to constantly reason about the quality/cost trade off can easily cost you more than you're saving: I'd rather spend that reasoning to create quality code.

Once again, however, we may have a plan to start with plastic and grow into stainless steel over time. We must be aware that it is easier said than done. I've seen it many times over my long career. Improving software quality is just another case of moving our software in the decision space. We decided to make software out of plastic early on, and now we want to move it to another place, say the cheap steel spot. Some work is required. Work is proportional to the distance (which is large – see above), but also to the opposing force. Whatever form that force takes, a significant factor is always the mass to be moved, and here lies the key. It's hard to move a monolithic mass of code into another place (in the decision space). It's easier if you can move it a module at time.

Within a small modular unit, you can easily improve local quality. Change a switch/case into polymorphism. Add an interface to decouple two concrete classes. It's easy and it's cheap, because mass is small. Say, however, that you got your modular structure wrong: an important concept has not been identified, and is now scattered all around. You have to tweak a lot of code. While doing so, you'll find that you're not just fighting the mass you have to move: you have to fight the gravitational attraction of the surrounding code. That may reveal more scattered concepts. It's a damn lot of work, possibly more than it's worth. Software is not necessarily soft.

So, can we really move software around? Yes, in the small. Yes, if you got the modular structure right, and you're only cleaning up within that modular structure. Once again, the fractal nature of software can be our best friend or our worst enemy. Get the right partitioning between applications / services, and you only have to work inside. Get the right partitioning between components, and you only have to work inside. And so on. It's fine to have a plastic component, inasmuch as the interface is right. Get the modular structure wrong, and sooner or later you'll need to make wide-range changes, or accept a "cheap steel" quality. If we got it wrong at a small scale, it's no big deal. At a large scale, it can be a massacre.

Once again, modularity is the key. Not just "any" modularity, but a modularity aligned with the underlying forcefield. If you get that right, you can build plastic components and evolve them to stainless steel quality over time. Been there, done that. Get your modular structure wrong, and you'll lack locality of action. In other words: if you plan on cutting corners inside a module, you have to invest at the interface level. That's simpler if you design top-down than if you're in the "emergent design" camp, because the module will be underengineered inside, and we can't expect a clean interface to emerge from underengineered code.

Sunday, January 02, 2011

Notes on Software Design, Chapter 13: On Change

We have a long history of manipulating physical materials. Cavemen built primitive tools out of rocks and sticks. I would speculate :-) that they didn't know much about physics, yet through observation, serendipity and good old trial and error they managed to survive pretty well, or we wouldn't be here today talking about software.

For centuries, we only had empirical observations and haphazard theories about the real nature of the physical world. The Greek Classical Element were Earth, Water, Air, Fire, and Aether, which is not exactly the way we look at materials today. Yet the Greeks managed to build large and incredibly beautiful temples.

Of course, today we know a quite a bit more. Take any kind of physical material, like a copper wire. In the real world, you can:

- apply thermal stress to the material, e.g. by warming it. Different materials react differently.

- apply mechanical stress to the material, e.g. by pulling it on both ends. Different materials react differently.

- apply electromagnetic forces, e.g. by moving the wire inside a magnetic field. Different materials react differently.

- etc.

To get just a little more into specifics, if we restrict ourselves to mechanical stress, and focus on the so-called static loading, we end up with just a few kind of stress: tension, compression, bending, torsion, and shear (see here and also here).

Although people have built large, beautiful structures without any formal knowledge of material science, it is hard to deny the huge progress made possible by a deeper understanding of what is actually going on with materials. Knowledge of forces and reactions allows us to choose the best material, the best shape, and even to create new, better materials.

The software world
Although we can build large, even beautiful systems, it is hard to deny that we are still in the dark ages. We have vague principles that usually can't be formally defined. We have different materials, like statically typed and dynamically typed languages, yet we don't have a sound theory of forces, so we end up with the usual evangelist trying to convert the world to its material of choice. It's kinda funny, because it's just as sensible as having a Brick Evangelist trying to convince you to use bricks instead of glass for your windows ("it's more robust!"), and a Glass Evangelist trying to convince you to build your entire house out of glass ("you'll get more light!").

Indeed, a key realization in my exploration on the nature of software was that we completely lack the notion of force. Sure, we use a bunch of terms like "flexibility", "heavy load", "fast", "fragile", etc., and they all seem to be somehow inspired by physics, but in the end it's just vernacular usage without any kind of theory behind.

As I realized that, I began looking for a small, general yet useful set of "software forces". Nothing too fancy. Tension, compression, bending, torsion, and shear are not fancy, but are incredibly useful for physical materials (no, I wasn't looking for similar concepts in software, but for something just as useful).

Initially, I focused my attention on artifacts, because design is very much about form, and form is made visible through artifacts. I got many ideas and explored several avenues, but most seemed more like ad hoc concepts or turned out to be dead ends. Eventually, I realized we already knew the fundamental forces, though we never used the term "force" to characterize them.

Interestingly, those forces are better known in the run-time world, and are usually restricted to data. Upon closer scrutiny, it turned out they also apply to procedural knowledge, and to the artifact world as well. Also, it turned out that those forces were intimately connected with the concept of entanglement, and could actually shed some light on entanglement itself. More than that, they could be used to find entangled information, in practice, no more philosophy, thank you :-).

So, let's take the easy road and see what we can do with run-time knowledge expressed through data, and build from there.

It's almost too simple
Think about a simple data-oriented application, like a book library. What can you do with a Book record? It's rather obvious: create, read, update, delete. That's it. CRUD. (I'm still thinking about the need for an identity-preserving Move outside the database realm, but let's focus on CRUD right now). CRUD for run-time data is trivial, and is a well-known concept.

CRUD in the artifact world is also relatively simple and intuitive:

Create: we create a new artifact, being it a component, a class, a function.

Read: this got me thinking for a while. In the end, my best shot is also the simplest: we (the contingent intepreter) read the function, as an act of understanding. More on the idea of contingent/essential interpreter in a rather whimsical :-) presentation by Richard Gabriel, that I already suggested a few years ago.

Update: we change an artifact; for instance, we change the source code of some function.

Delete: we remove an artifact; for instance, we remove a member function from a class.

Note that given the fractal nature of software, some operations are recursively identical (by deleting a component we delete all the sub-artifacts), while others change nature as we move into the fractal hierarchy (by creating a new class inside a component we're also updating the component). This is true in the run-time world of data as well.

Extending CRUD to run-time procedural knowledge is not necessarily trivial. Can we talk about CRUD for functions (as run-time, not compile-time, concepts)? We usually don't, but that's because the most common material and tools (programming languages) we're using are largely based on the idea that procedural knowledge is created once (through artifacts), then compiled and executed.

However, interpreted languages always had the possibility of creating, updating and deleting functions at run-time. Some compiled languages allow dynamic function creation (like C#, through Reflection.Emit) but not update/delete. The ability to update a function through reflection would be an important step toward AOP-like possibilities, but most languages have no support for that. This is fine, though: liquids can't be compressed, but that didn't prevent us from defining the concept of compression. I'm looking for concepts that apply in both worlds (artifacts and run-time) but not necessarily to all materials (languages, etc.). So, to summarize:

- Create, Update, Delete retain the usual meaning also for the run-time concept of functions. It means we're literally creating, updating and deleting a function at run-time. The same applies to the (run-time) concept of class (not object). We can Create, Update, Delete classes at run-time as well (given the appropriate language/material).

- Read, in the artifact world, is the human act of reading, or better, is the contingent interpreter (us) giving meaning to (interpreting) the function. At run-time, the equivalent notion is execution, with the essential interpreter (the machine) giving meaning to the function. Therefore, reading a function in the run-time world simply means executing the function. Executing a function may also update some data, but this is fine: the function is read, data are updated; there is no contradiction.

Where do we go from here?
I'm trying to keep this post short, so I'll necessarily leave a lot of stuff for next time. However, I really want to anticipate something. In my previous post I said:

Two clusters of information are entangled when performing a change on one immediately requires a change on the other.

I can now state that better as:

Two clusters of information are entangled when a C/R/U/D on one immediately requires a C/R/U/D on the other.

Yes, the two definitions are not equivalent, as Read is not a Change. The second definition, however, is better, for reasons we'll explore in the next few posts. One reason is that it provides guidance on the kind of "change" you need to consider. Just restrict yourself to CRUD, and you'll be fine. As we'll see next, we can usually group CD and RU, and in fact I've defined two concepts representing those groupings.

Entanglement through CD and/or RU is not necessarily a bad thing. It means that the entangled clusters are attracting each other, and, inasmuch as that specific force is concerned, rejecting others. In other words, they want to form a single cluster, possibly a composite cluster in the next hierarchical level (like two functions attracting themselves and creating a class). Of course, it might also be the case that the two clusters should be merged into a single cluster.

We'll see more about different form of CRUD-induced entanglement in the next few posts. We'll also see how many concepts (from references to database normalization) are actually just ways to deal with the many forms of entanglement, or to move entanglement from the artifact space to the run-time space. Right now, I'll use my remaining time to explain polymorphism through the entanglement perspective.

Polymorphism explained
Consider the usual idea of multiple shapes (circle, triangle, rectangle, etc.) that can be drawn on a screen. In the pre-object era (say, in ANSI C) we could have defined a ShapeType enum, a Shape struct with a ShapeType field and possibly a ShapeData field, where ShapeData would have probably been a union (with all fields for all different shapes). Draw would have been implemented as a switch/case over the ShapeType field.

We have U/U entanglement between ShapeType and ShapeData, and between ShapeType and Draw. U/U means that whenever we update ShapeType we need to update ShapeData as well. I add the Spiral type, so I need to add the relevant fields in the ShapeData union, and add logic to Draw. We also have U/U entanglement between ShapeData and Draw: if we change the way we store shape data, we have to change the Draw function as it depends on those data. We could live with that, but is not pretty.

What if we have another function that depends on ShapeType, like Area? Well, we need another switch-case, and Area will have the same kind of U/U entanglement with ShapeType and ShapeData as Draw. It is that entanglement that is bringing all those concepts together. ShapeType, ShapeData, Draw and Area want to form a cluster.

Note that without polymorphism (language-supported or simulated through function pointers) the code mass is organized into a potentially degenerative form: every function (Draw, Area, etc) is U/U-entangled with ShapeType, and is therefore attracting new procedural knowledge related to each and every ShapeType value. The forcefield will make those functions bigger and bigger, unless of course ShapeType is no longer extended.

Object-oriented polymorphism "works" because it changes the gravitational centers. Each concrete class (Circle, Triangle, etc) becomes a gravitational center for procedural and structural knowledge entangled with a single (ex)ShapeType value. In fact, ShapeType no longer needs to exist at all. There is still U/U-entanglement between the (ex)ShapeData portion related to Circle and the procedural knowledge in Circle.Draw and Circle.Surface (in a Java/C#ish syntax). But this is fine, because they have now been brought together into a new cluster (the Circle class) that then rejected the other artifacts (the rest of the ShapeData fields, the other functions). As I said in my previous post, entanglement between distant element is bad.

Note that we don't really change the mass of code. We just organize it differently, around different gravitational centers. If you draw functions as tall boxes in the switch-case form, you can see that classes are just "wide" boxes cutting through functions. The area of the boxes (the amount of code) does not change.

Object-oriented polymorphism in a statically-typed language also requires [interface] inheritance. This brings in a new form of U/U-entanglement, this time between the interface and each and every concrete class. Whenever we change the interface, we have to change all the concrete classes.

Interestingly, some of those functions might be entangled in the run-time world: they tend to be called together (R/R-entanglement in the run-time space). That would suggest yet another clustering (in the artifact space), like separating all the drawing operations from the geometrical operations in different interfaces.

Note that the polymorphic solution is not necessarily better: it's just a matter of probability. If the set of functions is unstable, plain polymorphism is not really a good answer. We may want to explore a more complex solution based on the Visitor pattern, or some other clustering of artifacts. Understanding entanglement for Visitor is indeed an interesting exercise.

Coming soon (I hope :-)
CD and RU-entanglement (with their real names :-), paving our way to the concept of isolation, or shielding, or whatever I'm gonna settle for. Oh guys, by the way: happy new year!