Friday, September 02, 2011

Notes on Software Design, Chapter 15: Run-Time Entanglement

So, here I am, back to my unpopular :-) series on the Physics of Software. It's time to explore the run-time side of entanglement, or more exactly, to begin exploring it, as I'm trying to write shorter posts, hoping to increase frequency as well.
A lot of time has gone by, so here is a recap from Chapter 12: two clusters of information are entangled when performing a change on one immediately requires a change on the other (to maintain overall consistency). If you're new to this, reading chapters 12, 13 and 14 (which deal with the artifact side of entanglement) won't hurt.

Moving to the run-time side, some forms of entanglement are obvious (like redundant data). Some are not. Some are rather narrow in scope. Some are far-reaching. I'll cover quite a few cases in this post, although I'm sure it's not an exhaustive list (yet).

  • Redundancy
This is the most obvious form of run-time entanglement: duplicated information.
Redundancy occurs even without your intervention: swap-file vs. RAM contents, 3rd/2nd/1st-level caches vs. RAM and vs. caches in other cores/processors, etc.; at this level, entanglement satisfaction is usually called coherence.
In other cases, redundancy is an explicit design decision: database replicas, mem-cached contents, etc. At a smaller scale, we also have redundancy whenever we store the same value in two distinct variables. Redundancy means our system is always on the verge of violating the underlying information entanglement. That's why caches are notoriously hard to "get right" in heavily concurrent environments without massive locking (which is a form of friction). This is closely related to the CAP theorem, which I discussed some months ago, but I'll come back to it in the future.
Just like we often try (through hard-won experience) to minimize redundancy in artifacts (the "keep it DRY" principle), we have learnt to minimize redundancy in data as well (through normalization, see below), but also to somehow control redundancy through design patterns like the Observer. I'll get back to patterns in the context of entanglement in a future post. A tiny sketch of redundancy-as-entanglement follows below.
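
As a minimal sketch (hypothetical names, C++, not taken from any real system): the same price is stored twice, once in the authoritative map and once in a lookup cache, and the code must keep the two copies in sync by hand.

#include <map>
#include <string>

// Hypothetical example: the same information ("price") is stored twice.
// The two copies are entangled: changing one without the other silently
// violates consistency.
struct Catalog {
    std::map<std::string, double> price;        // authoritative copy
    std::map<std::string, double> priceCache;   // redundant copy, kept for fast lookup

    void setPrice(const std::string& item, double p) {
        price[item] = p;
        priceCache[item] = p;   // entanglement satisfied "by hand"; forget this
                                // line and the system ends up in an inconsistent state
    }
};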

  • Referential Integrity
This is perhaps less obvious, but easy to understand. Referential integrity, as known within the relational database culture, requires (quoting Wikipedia) that any field in a table that is declared a foreign key can contain only values from a parent table's primary key or a candidate key. That means you cannot remove some records unless you update others, or that you cannot create some records unless others are already in place. That's a form of entanglement, and now we can start to grasp another concept: the idea of a transaction is born out of the need to allow intermediate micro-states that do not respect the entanglement constraint, while keeping those micro-states invisible. As far as the external observer is concerned, time is quantized, and transactions are the ticks.
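
To make the transaction analogy concrete, here is a minimal in-memory sketch (hypothetical C++ names, not a real database API): each order refers to a customer by id, so removing a customer and removing its orders must appear, to the outside, as a single indivisible step.

#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory analogue of a foreign key: each order refers to a
// customer by id. The records are entangled: a customer cannot be removed
// while orders still point to it.
struct MiniDb {
    std::map<int, std::string> customers;       // customerId -> name
    std::vector<std::pair<int, int>> orders;    // (orderId, customerId)

    // The whole method plays the role of a transaction: during its execution
    // the entanglement constraint is temporarily violated, but the external
    // observer only sees the initial and final states.
    void removeCustomer(int customerId) {
        orders.erase(
            std::remove_if(orders.begin(), orders.end(),
                [=](const std::pair<int, int>& o) { return o.second == customerId; }),
            orders.end());
        customers.erase(customerId);
    }
};
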
Interestingly, database normalization is all (just? :-) about imposing a specific form on data entanglement. Consider a database that is not in second normal form; I'll use the same example as the Wikipedia page, for simplicity. Several records contain the same information (work location). If we change the work location for Jones, we have to update three records. This is entanglement through redundancy, of course. Normalization does not remove entanglement. What it does is impose a particular form on it: through foreign keys. That makes it easier to deal with entanglement within the relational paradigm, but it is not without consequences for modularity (which would require an entire post to explore). In a sense, this "standardization" of the entanglement form could be seen as a form of dampening. Views, too, are a practical dampening tool (in capable hands). More on this in a future post.

  • Class Invariant
This, I must confess, was not at all obvious to me at first, which is weird in hindsight. A class invariant is a constraint on the internal state (data) of any object of that class. That means, of course, that those data cannot change independently; therefore, they are entangled by definition. More precisely, a class invariant can often be stated as a set of predicates, and each predicate reveals some form of entanglement between two or more data members.
Curiously enough (or perhaps not :-), the invariant must hold at the beginning/end of externally observable methods, though it can be broken during the actual execution of those methods. That is to say, public methods play the role of transactions, making time quantized once again.
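
A minimal sketch (hypothetical class): the invariant is a predicate entangling two data members; it is temporarily broken inside a public method and restored before the method returns.

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: "count == items.size()" is the class invariant.
// The two data members are entangled: neither can change independently.
class Buffer {
    std::vector<int> items;
    std::size_t count = 0;   // deliberately redundant, to make the invariant visible

    bool invariant() const { return count == items.size(); }

public:
    void push(int value) {
        assert(invariant());      // holds on entry
        items.push_back(value);   // invariant momentarily broken here...
        ++count;                  // ...and restored here
        assert(invariant());      // holds on exit: the method acts as a transaction
    }
};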

  • Object Lifetime
Although this can be seen as a special case of class invariant (just like a cascade delete can be ascribed to referential integrity), I'd like to make it explicit, for reasons we'll understand better in a future post. Some objects cannot outlive others. Some objects cannot be created without others. Reusing the terminology of chapter 13, this is a D-D or C-C form of entanglement. The same form of entanglement can be found underneath several DB normalization concepts and, thanks to the RT/Artifact dualism, will turn out to be a useful concept in the artifact world as well.
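
In a language like C++, lifetime entanglement can even be made structural rather than enforced by convention; a hypothetical sketch:

#include <memory>

struct Engine { /* ... */ };

// Hypothetical example: a Car cannot be created without an Engine (C-C),
// and the Engine cannot outlive the Car that owns it (D-D).
class Car {
    std::unique_ptr<Engine> engine;
public:
    Car() : engine(std::make_unique<Engine>()) {}   // created together
    // when the Car is destroyed, its Engine is destroyed as well
};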

Toward a concept of tangling through procedural knowledge
All of the above, I guess, is pretty obvious in hindsight. It borrows heavily from the existing literature, which has always considered data as something that can be subject to constraints. The risk here is to fall into the "information as data" tar pit. Information is not data. Information includes procedural knowledge (see also my old posts Lost ... and Found? for more on the concept of information and Alexandrian centers). Is there a concept of entanglement for procedural knowledge?

Consider this simple spreadsheet:

       A      B
1     10     20
2    200    240

That may look like plain data, but behind the curtain we have two formulas (procedural knowledge):

A2=A1*B1
B2=A2*1.2

As shown above, the system is stable, balanced, or as I will call it, in a steady state. Change a value in A1 or B1, however, and procedural knowledge will kick in. After a short interval (during which the system is unbalanced, or as I'll call it, in a transient state), balance will be restored, and all data entangled through procedural knowledge will be updated.
A spreadsheet, of course, is the quintessential dataflow system, so it's also the natural link between knowledge-as-data and procedural knowledge. It should be rather obvious, however, that similar reasoning can be applied to any kind of event-driven or reactive system. Such a system sits idle (steady state) until something happens (new information entering the system and moving it to a transient state). That unleashes a chain of reactions, until balance is restored.
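
Here is a minimal, hypothetical C++ miniature of the spreadsheet above: the formulas are the procedural knowledge, and the recompute step is the transient state during which balance is restored.

// Hypothetical miniature of the spreadsheet above. The formulas are the
// procedural knowledge; recompute() is the transient state during which
// the entangled values are brought back into balance.
class Sheet {
    double a1 = 10, b1 = 20;
    double a2 = 0, b2 = 0;

    void recompute() {
        a2 = a1 * b1;     // A2 = A1 * B1
        b2 = a2 * 1.2;    // B2 = A2 * 1.2
    }

public:
    Sheet() { recompute(); }                          // start in a steady state
    void setA1(double v) { a1 = v; recompute(); }     // new information unsettles the system...
    void setB1(double v) { b1 = v; recompute(); }     // ...and the formulas restore balance
    double A2() const { return a2; }
    double B2() const { return b2; }
};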

Baby steps: now, what was the old-school view of objects (before the watering down took place)? Objects were like tiny little machines, communicating through messages. Right. So invoking a method on an object is not really different from sending the object a message. We're still upsetting its steady state, and the procedural knowledge inside the method is restoring balance (preserving the invariant, etc.).

So, what about the traditional, top-down, imperative batch system? It's actually quite straightforward now: invoking a function (even the infamous main function) means pushing some information into the system. The system reacts by moving information around, creating new information, etc., until the output is ready (the new steady state). The glue between all that information is the procedural knowledge stored inside the system. Procedural knowledge is basically encoding the entanglement path for dynamic information.

Say that again?
This is beginning to look too much like philosophy, so let's move to code once again. In any C-like language, given this code:
int a2 = a1 * b1;
int b2 = a2 * 1.2;
int a3 = a1 + b1;
we're expecting the so-called control flow to move forward, initializing a2, then b2, then a3. In practice, the compiler could reorder the statements and calculate a3 first, or in between, and even the processor could execute the resulting instructions out of order (see the concept of "observable behavior" in the C and C++ standards). What we are actually saying is that a2 can be calculated given a1 and b1, and how; that b2 can be calculated given a2, and how; and that a3 can be calculated given a1 and b1, and how. The "can be calculated given ..." part is an abstract interpretation of the code. That abstract interpretation provides the entanglement graph for run-time entities. The "how" part, as it happens, is more germane to function than to form and, in this context, is somehow irrelevant.
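
Spelled out, the entanglement graph for the snippet above is just the "can be calculated given" relation, with the "how" dropped:

a2 can be calculated given { a1, b1 }
b2 can be calculated given { a2 }
a3 can be calculated given { a1, b1 }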

What about the control-flow primitives, like iteration, choice, call, etc.? Iteration and choice are nothing special: at run time, a specific branch will be taken, a specific number of times, resulting in a specific run-time entanglement between run-time entities. Call (procedure call, function call) is somewhat different, because it reveals the fractal nature of software: whatever happens inside the function is hidden (except via side effects) from the caller. The function will carry out its job, that is, it will restore the fine-grained input-output balance upon invocation.

This is the general idea of procedural knowledge: given an input, control-flow goes on, until the program reaches a steady state, where output has been produced. As noted above, due to the fractal nature of information, this process takes place at different levels. A component instance may still be in a transient state while some object contained in that instance is already back to a steady state.

In this sense, a procedure (as an artifact) is how we teach a machine to calculate an output given an input, through the stepwise creation and destruction of entangled run-time information. Phew :-).

Next steps
The beauty of the Run-Time/Artifact dualism is that whatever we learn on one side, we can often apply to the other (with the added benefit that many things are easier to "get" on one side or the other).
Here, for instance, I've introduced the concept of steady and transient state for run-time information. The same reasoning can be applied to the artifact side. Whenever you create, update or delete an artifact A1, if there is any other artifact that is entangled with A1, the system goes into an unbalanced, transient state. You have to apply work (energy) to bring it back to the steady state, where all the entangled information is properly updated (balanced). A nice quote from Leonardo da Vinci could apply here: "Motion is created by the destruction of balance" (unfortunately, I can't find an authoritative source for the quote :-). Motion in the information space is created by unsettling the system (moving to a transient state) until entanglement is finally satisfied again, and a steady state is reached. This is as true in the artifact space as in the run-time space.
Overall, this has been more of a trampoline chapter, where a few notions are introduced for the benefit of forthcoming chapters. Here is a short list of topics that I'll try to explore at some point in the future:

- The truth about multiplicity, frequency, and attraction/rejection in the run-time world.
- As above, in the artifact world.
- Moving entanglement between run-time and artifacts (we do this a lot while we design).
- The Entanglement Diagram.
- A few examples from well-known Design Patterns.
- Cross-cutting concerns and Delocalized plans: relationships with entanglement.
- More on distance, entanglement, probability of failure, and the CAP theorem.
- Dampening (isolation). This will open a new macro-chapter in the Physics of Software series.

If you read this far, you should follow me on Twitter, or perhaps even share this post (I dare you :-)

2 comments:

Cyrille Martraire said...

Carlo,

Interesting read as usual! The redundancy part immediately reminds me of Degree Of Freedom analysis, which I used to blog on, with influence from posts by Michael L. Perry: http://cyrille.martraire.com/2009/12/degrees-of-freedom-analysis
Clearly we'd like to minimize, or at least be aware of, any extra degrees of freedom, and of the potential inconsistencies that could result from them.

On invariants, Domain-Driven Design strongly suggests that looking for invariants is a strong guide to finding the right concepts, as explained by Alberto Brandolini when teaching the course. This often leads to derived values instead of redundant values, which would be "extra" (useless and potentially harmful) degrees of freedom.

Going further, the DDD concept of aggregates as a bigger boundary of consistency is an alternate way to discuss "entanglement", with all the consequences in transactions, foreign keys, the Delete-Delete rule, etc.


I find the idea of degrees of freedom more intuitive (taught at school) than entanglement which is evocative of quantum weirdness, frightening!

These ideas are taking shape, I'm eager that all that gets consolidated soon into some consistent and affordable form. Ideally I'd prefer that existing vocabulary and concepts be reused whenever possible, but I understand that you'd like to borrow from the metaphor of physics to help reasoning. At the same time these metaphors exclude pure CS people with not enough exposure to physics, just like my beloved concept of DoF excludes people with no background in mechanical engineering...

Anyhow such physics of software, whatever it's called, could become more popular if it could help modernize our set of principles (SOLID etc.) now that functional programming ideas and alternate architectural styles (CQRS, Event Sourcing...) are getting traction.

Cheers, (whenever you come to Paris we could share a beer)

Carlo Pescio said...

Hey there Cyrille,
great comment, like always! I didn't know about Michael's work, very interesting...

I'm always keeping an eye on terminology. In most cases, whenever I'm not reusing a familiar term from Computer Science, it's because I can't find an exact correspondence, and I'm trying not to muddle existing terminology. I understand that taking inspiration from physics may alienate some, but in practice, I've found this to be most consistent with my view of software as a material to be shaped. Another reason for me to shun the existing terminology is that it often comes with a good/evil attachment (coupling is bad, separation of concerns is good, etc.), which hinders the idea that design is about balancing forces.

You're right, however: using concepts from the physics of software to review existing principles (and patterns, I would add) is probably an excellent way to make it a bit more popular. For instance, I'll soon introduce the idea of R/R and U/U entanglement in the run-time space, and basically say that things that are read together and/or updated together are attracted. This is easily explained by efficiency.
Interestingly, the dual statement in the artifacts space is that artifacts which are reused together, or updated together, are also attracted. This is the basis for the well-known "reuse release equivalence principle" (http://c2.com/cgi/wiki?ReuseReleaseEquivalencePrinciple).

Along the same lines, my investigation of forces and materials has led me to formulate an alternative Open/Closed principle, that is not limited to classes, but extends to generics, functions, etc. I actually have a ton of notes, and now need some time to write everything down. I'll probably try other media as well: I was thinking of a short video to discuss the forcefield around some well-known pattern.

I'll trade the beer for some gateaux (I used to speak French, but that was like 30 years ago; I barely remember a few words now :-)