Future Software Design

Timothy Revell in “Bugged Out” (NewScientist, 5 December 2015, paywall) reports on new designs for software fault tolerance:

For a growing number of researchers, it’s time to admit defeat. If we can’t beat bugs, we need to learn to live with them – switching from damage prevention to damage limitation. Making computers more resilient to things going wrong could mean an end to computer crashes altogether – buggy code or not. “The idea here is immortal software,” says Martin Rinard at the Massachusetts Institute of Technology.

I didn’t find the article particularly convincing, but it’s interesting. On Rinard’s work:

All this means that companies may know that their software contains glitches, but it is too expensive and time-consuming to attempt to find them. Some companies get round this by offering rewards to users who report bugs once software is released (see “Bug bounties“). But increasingly, researchers are shifting their attention away from removing bugs to simply removing their effects. A bug might lead to a software crash, but it is often the crash itself that causes problems. To address this, Rinard has developed a technique called failure-oblivious computing, which aims to avoid programs crashing at all costs.

When a computer program crashes, it has usually encountered an error that it doesn’t know how to handle. In such situations, Rinard thinks the program should just do the easiest thing it can. This might not be the correct solution and might even cause the software to do something wrong, but the result is often better than a full-scale crash.
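To make the “do the easiest thing” idea concrete, here is a minimal sketch in Python of what failure-oblivious behaviour might look like at the application level. Rinard’s actual work operates at the compiler level, manufacturing values for bad memory accesses in C programs; the decorator below, and its name, are just my own illustration of swallowing an error and continuing with a made-up default.

```python
# A sketch of failure-oblivious behaviour: swallow the error, substitute a
# manufactured default, and keep running. Names here are illustrative, not
# Rinard's API.
import functools

def failure_oblivious(default):
    """Decorator: never let `func` crash; return `default` on any error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:   # deliberately broad: crashing is the one forbidden outcome
                print(f"ignoring {exc!r}, continuing with {default!r}")
                return default
        return wrapper
    return decorator

@failure_oblivious(default=0)
def parse_reading(text):
    return int(text)                   # malformed input would normally raise here

print(parse_reading("42"))    # 42
print(parse_reading("4 2"))   # prints the warning, then 0; the program lives on
```

As the article itself admits, the manufactured value may well be wrong; the gamble is that being wrong is usually cheaper than crashing outright.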

The article makes an example of the Ariane 5 launch failure of two decades ago, in which an unhandled arithmetic overflow cost the ESA a cargo worth $370 million. Rinard continues:

“Here’s the kicker,” says Rinard. “That number was never actually used. Any number whatsoever could have been used instead and the rocket still flies.” If Ariane 5 had been equipped with failure-oblivious computing, it would have been a successful launch.

Which strikes me as irrelevant cherry picking. Is this how all launch failures occur? Can they be characterised this way? I’ll hazard a guess: NO. And if that problem had not been present, another one not susceptible to this approach might have taken the flight down.

Without apparent irony, the idea of using automated software generation to eliminate such bugs is brought up and discarded; such an approach would have removed the Ariane 5 problem. Automated software generation encompasses a lot of subjects, from the most basic of assemblers to advanced compilers to application generators (generally, each built on top of the prior), and I tend to think of them as proven knowledge, even (in a narrowly construed context) wisdom, encapsulated and consulted when certain well-understood implementation problems are encountered. The extension covered here is called program synthesis:

Kwiatkowska suggests that we get software to do things for us by writing programs that write programs. The idea, known as program synthesis, is that programmers describe what they want their code to do in precise but relatively simple terms and then have that code automatically generated. To ensure the program that generates the program is itself bug-free would require NASA’s level of effort [to prove their launch code], but this would only have to be done once. Kwiatkowska and others have shown that the technique works for small pieces of code, but it will be some time before whole systems can be built in this way.
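For flavour, here is a toy illustration in Python of the enumerative end of program synthesis: the “spec” is a handful of input/output examples, and the synthesizer searches pipelines of known primitives until one satisfies all of them. The primitives and the example spec are invented purely for illustration; real synthesis tools, including the verified kind Kwiatkowska describes, are far more sophisticated.

```python
# Toy enumerative synthesis: find a composition of primitives that matches
# all the given input/output examples. Primitives and spec are made up.
from itertools import product

PRIMITIVES = {
    "double":    lambda x: x * 2,
    "increment": lambda x: x + 1,
    "square":    lambda x: x * x,
    "negate":    lambda x: -x,
}

def synthesize(examples, max_depth=3):
    """Return a pipeline of primitive names consistent with every example, or None."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def run(x, names=names):
                for name in names:
                    x = PRIMITIVES[name](x)
                return x
            if all(run(i) == o for i, o in examples):
                return " -> ".join(names)
    return None

# "Spec": f(2) == 25 and f(3) == 49, i.e. the square of 2x + 1
print(synthesize([(2, 25), (3, 49)]))   # double -> increment -> square
```

Even this toy shows the appeal: the human supplies intent, the generator supplies the code, and any heavyweight proof effort can be concentrated on the generator itself.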

More puzzling is the complete omission of formal methods from consideration. These are mathematically derived methodologies for verifying that software (and hardware) does what’s desired without crashing. Cut from the article? Mr. Revell is listed as a mathematician, so hopefully he’s aware of the field. I do have to wonder whether the methods are just too onerous to use with complex systems – I have not kept up with the field.

But perhaps the most dismaying lack in this article is the failure to consider the knock-on effects of implementing such systems. Will engineers then ease off, thinking this semi-magical program that monitors their program will save their ass thirty seconds into launch? While I applaud taking the least-used fork in the road, and I think these folks are doing some interesting work, I’m not sure it’d ever be healthy to deliver such systems and bring them into long-term use. Let me pull one more quote:

Emery Berger at the University of Massachusetts Amherst is taking the opposite tack. He is deliberately injecting a little randomness to get software to crash less.

He’s targeting bugs that crash a program in the same way each time. These are sometimes known as “bohrbugs” after the physicist Niels Bohr, whose model of the atom has electrons that orbit a nucleus in a very predictable fashion. For users, bohrbugs are the worst. If you keep doing the same thing on your computer, you will keep getting the same result. Perhaps viewing a particular picture always causes your computer to freeze, or pasting some text always causes your text editor to crash.

But there is another type of bug, known as the “heisenbug”. Heisenbugs seem to change when you attempt to observe them and are less predictable than bohrbugs, like the particles in quantum mechanics described by physicist Werner Heisenberg. This means that if you try to reproduce a bug, it often miraculously disappears. You might have lost some work, but at least the same crash is unlikely to happen twice.

Berger’s system, DieHard, turns bohrbugs into heisenbugs automatically. This means that if you hit a problem, the next time you try, DieHard will randomly select a slightly different way of running the software that will often avoid the bug. “By making it so that things become a bit more like a roll of a dice, the chances of you having the program work correctly increase,” says Berger.
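The quote doesn’t spell out the mechanism, but DieHard is, among other things, a randomizing memory allocator for C/C++ programs. Here is a rough Python simulation of the intuition, with made-up sizes and names: spread objects across an over-provisioned heap at random, and an off-by-one write that would always corrupt a neighbour under a packed, deterministic allocator now only does so occasionally.

```python
# Toy model of allocation randomization: a one-slot overflow only sometimes
# lands on live data when placement is random. Sizes and names are invented.
import random

HEAP_SLOTS = 64          # deliberately larger than the program actually needs
LIVE_OBJECTS = 8

def run_once():
    heap = [None] * HEAP_SLOTS
    # Randomized placement: each live object goes to a random free slot.
    slots = random.sample(range(HEAP_SLOTS - 1), LIVE_OBJECTS)
    for s in slots:
        heap[s] = "live"
    # The "bohrbug": an off-by-one write one slot past the first object.
    victim = slots[0] + 1
    return heap[victim] == "live"        # True means live data got corrupted

trials = 10_000
failures = sum(run_once() for _ in range(trials))
print(f"corrupted a live object in {failures / trials:.1%} of runs")
# A packed, first-fit allocator would place the objects adjacently, so the
# same overflow would hit live data on every single run.
```

That is the bohrbug-to-heisenbug conversion the article describes: the bug is still there, but whether it bites on any given run is now a roll of the dice.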

So a program works in one system state but not another. This sounds like a nightmare to me for fixing the problem, since changing a system state may involve changing many variables, only one of which may be causing the problem. The cost of tracking down bugs in THAT class might be higher than they anticipate – and waving it off as only failing occasionally may not be acceptable to the customer.

And that lets me transition to motivations, which are not explicated very well but seem to be commercial, given the costs cited. While commercial interests provide the funding for software engineering, since the engineering provides so many benefits, this also strikes me as giving obeisance to a business requirement that has never sat well with engineering departments: schedules. Asking engineers who are doing, essentially, research while developing new products to provide a schedule is often an exercise in futility that business finds nearly as frustrating as engineers do. While the idea is to deliver stable systems to users, I worry that this is really just another way for businesses to get more product out the door without concern for quality: money, money, money.
