Friday, September 02, 2011

Notes on Software Design, Chapter 15: Run-Time Entanglement

So, here I am, back to my unpopular :-) series on the Physics of Software. It's time to explore the run-time side of entanglement, or more precisely, to begin exploring it, as I'm trying to write shorter posts, hoping to increase frequency as well.
A lot of time has gone by, so here is a recap from Chapter 12: two clusters of information are entangled when performing a change on one immediately requires a change on the other (to maintain overall consistency). If you're new to this, reading chapters 12, 13, and 14 (dealing with the artifact side of entanglement) won't hurt.

Moving to the run-time side, some forms of entanglement are obvious (like redundant data). Some are not. Some are rather narrow in scope. Some are far-reaching. I'll cover quite a few cases in this post, although I'm sure it's not an exhaustive list (yet).

  • Redundancy
This is the most obvious form of run-time entanglement: duplicated information.
Redundancy occurs even without your intervention: swap file vs. RAM contents, 3rd/2nd/1st level caches vs. RAM and vs. the caches in other cores/processors, etc.; at this level, entanglement satisfaction is usually called coherence.
In other cases, redundancy is an explicit design decision: database replicas, mem-cached contents, etc. At a smaller scale, we also have redundancy whenever we store the same value in two distinct variables. Redundancy means our system is always on the verge of violating the underlying information entanglement. That's why caches are notoriously hard to "get right" in heavily concurrent environments without massive locking (which is a form of friction). This is strictly related to the CAP theorem, which I discussed months ago; I'll come back to it in the future.
Just as we often try (through hard-won experience) to minimize redundancy in artifacts (the "keep it DRY" principle), we have learned to minimize redundancy in data as well (through normalization, see below), but also to somehow control redundancy through design patterns like Observer. I'll get back to patterns in the context of entanglement in a future post; meanwhile, a small code sketch of the two-variables case follows.
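To make the "same value in two distinct variables" case concrete, here is a minimal C++ sketch (Sensor, Dashboard and updateReading are names I just made up for illustration, not taken from any real system): the cached copy is entangled with its source, and every change has to touch both, or the entanglement constraint is violated.

#include <iostream>

// Two copies of the same information: the sensor holds the "source of truth",
// the dashboard keeps a redundant copy for fast access. The two values are
// entangled: changing one without the other violates the constraint.
struct Sensor {
    double reading = 0.0;
};

struct Dashboard {
    double cachedReading = 0.0;   // redundant copy of Sensor::reading
};

// Entanglement satisfaction ("coherence") is maintained by hand:
// every change to the source requires a corresponding change to the copy.
void updateReading(Sensor& s, Dashboard& d, double newValue) {
    s.reading = newValue;
    d.cachedReading = newValue;   // forget this line and the system stays unbalanced
}

int main() {
    Sensor sensor;
    Dashboard dashboard;
    updateReading(sensor, dashboard, 21.5);
    std::cout << sensor.reading << " " << dashboard.cachedReading << "\n";   // 21.5 21.5
}

Note that coherence here rests entirely on convention: nothing in the language prevents someone from assigning sensor.reading directly and leaving the dashboard stale.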

  • Referential Integrity
This is perhaps less obvious, but easy to understand. Referential integrity, as known within the relational database culture, requires (quoting Wikipedia) that any field in a table that is declared a foreign key can contain only values from a parent table's primary key or a candidate key. That means you cannot remove some records unless you update others, and you cannot create some records unless some other record is already in place. That's a form of entanglement, and now we can start to grasp another concept: the idea of a transaction is born out of the need to allow intermediate micro-states that do not respect the entanglement constraint, while keeping those micro-states invisible. As far as the external observer is concerned, time is quantized, and transactions are the ticks.
Interestingly, database normalization is all (just? :-) about imposing a specific form on data entanglement. Consider a database that is not in second normal form; I'll use the same example as the Wikipedia page for simplicity. Several records contain the same information (work location). If we change the work location for Jones, we have to update three records. This is entanglement through redundancy, of course. Normalization does not remove entanglement. What it does is impose a particular form on it, through foreign keys. That makes it easier to deal with entanglement within the relational paradigm, but it is not without consequences for modularity (which would require an entire post to explore). In a sense, this "standardization" of entanglement form could be seen as a form of dampening. Views, too, are a practical dampening tool (in capable hands). More on this in a future post; a small code-level sketch of the integrity constraint follows.
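Referential integrity is usually stated in relational terms, but the same constraint is easy to mimic in plain code. Here is a small, hypothetical C++ sketch (Employee, WorkOrder and Db are invented names, not a real schema): a work order cannot be created unless the referenced employee exists, and an employee cannot be removed while work orders still point to it.

#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// A toy in-memory "database" with a foreign-key-like constraint:
// WorkOrder::employeeId must always refer to an existing Employee.
struct Employee  { std::string name; };
struct WorkOrder { int employeeId; std::string description; };

struct Db {
    std::map<int, Employee> employees;
    std::vector<WorkOrder> workOrders;

    // C-C entanglement: creating a work order requires the referenced employee.
    void addWorkOrder(int employeeId, const std::string& what) {
        if (employees.find(employeeId) == employees.end())
            throw std::runtime_error("no such employee");   // constraint would be violated
        workOrders.push_back({employeeId, what});
    }

    // D-D entanglement: removing an employee requires dealing with its work orders
    // first (here by refusing; a cascade delete would be the other option).
    void removeEmployee(int employeeId) {
        for (const auto& wo : workOrders)
            if (wo.employeeId == employeeId)
                throw std::runtime_error("work orders still reference this employee");
        employees.erase(employeeId);
    }
};

int main() {
    Db db;
    db.employees[1] = { "Jones" };
    db.addWorkOrder(1, "some task");
    // db.removeEmployee(1);   // would throw: a work order still references Jones
}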

  • Class Invariant
This, I must confess, was not obvious at all to recognize at first, which is weird in hindsight. A class invariant is a constraint on the internal state (data) of any object of that class. That means, of course, that those data cannot change independently. Therefore, they are entangled by definition. More precisely, a class invariant can often be stated as a set of predicates, and each predicate reveals some form of entanglement between two or more data members.
Curiously enough (or perhaps not :-), the invariant must hold at the beginning and at the end of externally observable methods, though it can be broken during the actual execution of those methods. That is to say, public methods play the role of transactions, making time quantized once again (a small sketch follows).
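Here is a minimal C++ sketch (the RunningTotal class is made up for illustration): the invariant "total equals the sum of values" entangles the two data members; it holds whenever a public method is entered or exited, but is momentarily broken inside add().

#include <cassert>
#include <numeric>
#include <vector>

// Invariant: 'total' must always equal the sum of 'values'.
// That predicate entangles the two data members.
class RunningTotal {
    std::vector<int> values;
    int total = 0;   // redundant, entangled with 'values'

    bool invariant() const {
        return total == std::accumulate(values.begin(), values.end(), 0);
    }

public:
    // The invariant holds on entry and on exit, but between the two
    // statements below it is momentarily broken: the public method
    // plays the role of a transaction.
    void add(int v) {
        assert(invariant());
        values.push_back(v);   // invariant broken here...
        total += v;            // ...and restored here
        assert(invariant());
    }

    int sum() const { return total; }
};

int main() {
    RunningTotal t;
    t.add(10);
    t.add(20);
    return t.sum() == 30 ? 0 : 1;
}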

  • Object Lifetime
Although this can be seen as a special case of class invariant (just as a cascade delete is ascribed to referential integrity), I'd like to make this explicit, for reasons we'll understand better in a future post. Some objects cannot outlive others. Some objects cannot be created without others. Reusing the terminology of chapter 13, this is a D-D or C-C form of entanglement. The same form of entanglement can be found underneath several DB normalization concepts, and, thanks to the RT/Artifact dualism, it will turn out to be a useful concept in the artifact world as well (a short sketch follows).
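Again a small, hypothetical C++ sketch (Account, Statement and Invoice are invented for illustration): a Statement cannot be created without an Account (C-C entanglement), and the lines of an Invoice cannot outlive the Invoice that owns them (D-D entanglement).

#include <string>
#include <vector>

struct Account { std::string owner; };

// C-C entanglement: a Statement cannot be created without an existing Account,
// because it holds a (non-owning) reference to it.
class Statement {
    const Account& account;
public:
    explicit Statement(const Account& a) : account(a) {}
    const std::string& owner() const { return account.owner; }
};

// D-D entanglement: the lines of an Invoice cannot outlive the Invoice,
// because the Invoice owns them and destroys them together with itself.
class Invoice {
    struct Line { std::string item; double amount; };
    std::vector<Line> lines;
public:
    void addLine(const std::string& item, double amount) {
        lines.push_back({item, amount});
    }
};

int main() {
    Account account{ "Jones" };
    Statement statement(account);   // cannot exist without 'account'
    Invoice invoice;
    invoice.addLine("consulting", 100.0);
}   // 'invoice' and its lines die together; 'statement' must not outlive 'account'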

Toward a concept of tangling through procedural knowledge
All of the above, I guess, is pretty obvious in hindsight. It borrows heavily from the existing literature, which has always considered data as something that can be subject to constraints. The risk here is to fall into the "information as data" tar pit. Information is not data. Information includes procedural knowledge (see also my old posts Lost ... and Found? for more on the concept of Information and Alexandrian centers). Is there a concept of entanglement for procedural knowledge?

Consider this simple spreadsheet:

        A      B
1      10     20
2     200    240

That may look like plain data, but behind the curtain we have two formulas (procedural knowledge):

A2=A1*B1
B2=A2*1.2

In the table above, the system is stable, balanced, or, as I will call it, in a steady state. Change a value in A1 or B1, however, and procedural knowledge will kick in. After a short interval (during which the system is unbalanced, or, as I'll call it, in a transient state), balance will be restored, and all the data entangled through procedural knowledge will be updated.
A spreadsheet, of course, is the quintessential dataflow system, so it's also the natural link between knowledge-as-data and procedural knowledge. It should be rather obvious, however, that similar reasoning can be applied to any kind of event-driven or reactive system. That kind of system sits idle (steady state) until something happens (new information entering the system and moving it to a transient state). That unleashes a chain of reactions, until balance is restored.
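As an illustration only (this is a toy, not how a real spreadsheet engine works), here is a minimal C++ sketch of the dataflow idea; Cell, recalc and set are invented names, and the propagation is a naive recursive walk with no cycle detection. Setting an input cell moves the system into a transient state; once all dependent formulas have been re-evaluated, the steady state is reached again.

#include <functional>
#include <iostream>
#include <vector>

// A cell holds a value, an optional formula (procedural knowledge),
// and the list of cells entangled with it (its dependents).
struct Cell {
    double value = 0.0;
    std::function<double()> formula;   // empty for plain input cells
    std::vector<Cell*> dependents;
};

void recalc(Cell& c) {
    if (c.formula) c.value = c.formula();
    for (Cell* d : c.dependents) recalc(*d);   // propagate until steady state
}

void set(Cell& c, double v) {
    c.value = v;                               // new information enters the system
    for (Cell* d : c.dependents) recalc(*d);   // transient state until this completes
}

int main() {
    Cell a1, b1, a2, b2;
    a2.formula = [&] { return a1.value * b1.value; };
    b2.formula = [&] { return a2.value * 1.2; };
    a1.dependents = { &a2 };
    b1.dependents = { &a2 };
    a2.dependents = { &b2 };

    set(a1, 10);
    set(b1, 20);
    std::cout << a2.value << " " << b2.value << "\n";   // 200 240
}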

Baby steps: now, what was the old-school view of objects (before the watering down took place)? Objects were like tiny little machines, communicating through messages. Right. So invoking a method on an object is not really different from sending the object a message. We're still upsetting its steady state, and the procedural knowledge inside the method restores balance (preserving the invariant, etc.).

So, what about the traditional, top-down, imperative batch system? It's actually quite straightforward now: invoking a function (even the infamous main function) means pushing some information inside the system. The system reacts by moving information around, creating new information, etc, until the output is ready (the new steady state). The glue between all that information is the procedural knowledge stored inside the system. Procedural knowledge is basically encoding the entanglement path for dynamic information.

Say that again?
This is beginning to look too much like philosophy, so let's move to code once again. In any C-like language, given this code:
int a2 = a1 * b1;
int b2 = a2 * 1.2;
int a3 = a1 + b1;
we're expecting the so-called control flow to move forward, initializing a2, then b2, then a3. In practice, the compiler could reorder the statements and calculate a3 first, or in between, and even the processor could execute the corresponding instructions "out of order" (see the concept of "observable behavior" in the C and C++ standards). What we are actually saying is that a2 can be calculated given a1 and b1, and how; that b2 can be calculated given a2, and how; and that a3 can be calculated given a1 and b1, and how. The "can be calculated given ..." part is an abstract interpretation of the code. That abstract interpretation provides the entanglement graph for run-time entities. The "how" part, as it happens, is more germane to the function than to the form and, in this context, is somewhat irrelevant.
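Just as an illustration (this is not an official notation of the series), the "can be calculated given ..." reading of those three statements could be written down explicitly, dropping the "how" part entirely:

#include <map>
#include <set>
#include <string>

// The entanglement graph extracted from the three statements above:
// each run-time entity is mapped to the entities it can be calculated from.
// The "how" (multiplication vs. addition) is deliberately left out.
const std::map<std::string, std::set<std::string>> entanglementGraph = {
    { "a2", { "a1", "b1" } },
    { "b2", { "a2" } },
    { "a3", { "a1", "b1" } },
};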

What about the control-flow primitives, like iteration, choice, call, etc? Iteration and choice are nothing special: at run-time, a specific branch will be taken, for a specific number of times, resulting in a specific run-time entanglement between run-time entities. Call (procedure call, function call) is somewhat different, because it reveals the fractal nature of software: whatever happens inside the function is hidden (except via side effects) from the caller. The function will carry out its job, that is, will restore the fine-grained input-output balance upon invocation.

This is the general idea of procedural knowledge: given an input, control-flow goes on, until the program reaches a steady state, where output has been produced. As noted above, due to the fractal nature of information, this process takes place at different levels. A component instance may still be in a transient state while some object contained in that instance is already back to a steady state.

In this sense, a procedure (as an artifact) is how we teach a machine to calculate an output given an input, through the stepwise creation and destruction of entangled run-time information. Phew :-).

Next steps
The beauty of the Run-Time/Artifact dualism is that whatever we learn on one side, we can often apply on the other side (with the added benefit that many things are easier to "get" on one side or the other).
Here, for instance, I've introduced the concept of steady and transient state for run-time information. The same reasoning can be applied to the artifact side. Whenever you create, update or delete an artifact A1, if there is any other artifact that is entangled with A1, the system goes into an unbalanced, transient state. You have to apply work (energy) to bring it back to the steady state, where all the entangled information is properly updated (balanced). A nice quote from Leonardo da Vinci could apply here: "Motion is created by the destruction of balance" (unfortunately, I can't find an authoritative source for the quote :-). Motion in the information space is created by unsettling the system (moving to a transient state) until entanglement is finally satisfied again, and a steady state is reached. This is as true in the artifact space as in the run-time space.
Overall, this has been more of a trampoline chapter, where a few notions are introduced for the benefit of forthcoming chapters. Here is a short list of topics that I'll try to explore at some point in the future:

- The truth about multiplicity, frequency, and attraction/rejection in the run-time world.
- As above, in the artifact world.
- Moving entanglement between run-time and artifacts (we do this a lot while we design).
- The Entanglement Diagram.
- A few examples from well-known Design Patterns.
- Cross-cutting concerns and Delocalized plans: relationships with entanglement.
- More on distance, entanglement, probability of failure, and the CAP theorem.
- Dampening (isolation). This will open a new macro-chapter in the Physics of Software series.

If you read this far, you should follow me on Twitter, or perhaps even share this post (I dare you :-)