A Differential Diagnosis of Software Failure

Imagine you were dropped into New York City and hired to find a person. New York City has over 8 million people; how would you go about your search? This person could be anyone, anywhere in the city.

Presumably you could ask your curious employer some questions. What would you ask first to draw yourself closer to a conclusion? Would you ask for the person’s name? That information would surely be helpful later on, especially for validating a match; but imagine it being your starting point. Say the name is Mark. How many Marks are in New York City? More than a few, I would guess, and probably fairly evenly distributed across the city. Starting from the name alone amounts to a brute-force search, which is certainly not ideal.

Erhm, back to the question. How about location? Our employer clearly doesn’t know where Mark is, so we can’t ask that question – he would question our sanity and our competency as detectives. What we can ask are questions that employ deductive reasoning: where is this person probably not? Questions like “Where does he work?”, “What does he do?”, “Where was he last seen?” These attempt to limit the problem space in a probabilistic fashion. If we know Mark works in a butcher shop, he certainly could have quit his job to join a vegetarian enthusiasts group (do these exist?); but for the sake of reducing the problem, we can take that possibility off the table, at least until Occam’s Razor fails us.

What am I even trying to say here? What does this have to do with software?

I apply similar mental models when debugging an issue in a software system, and I’m not the only one. Certainly there are parts of a software system that do not fit the detective analogy, but others fit just fine. The parts that do fit are a deductive reasoning approach and the probabilistic reduction of the problem space. Whew, that’s a lot of words for a simple concept – when we are chasing an error in a computer system, we do not know where the error is; we only have a good idea of where it probably is not.

For any reasonably sized system, we cannot consider everything. We can’t do it. There is too much to consider. We have to start with the most likely modules and expand outward as needed. If the wrong characters are displayed on screen, yet our buffer-flushing code was untouched by the breaking release, it is more likely that we have data corruption than a display driver bug. This is a simple example, and an arguably obvious point, but it can be missed in the panic to fix a bug. We cannot brute-force search for bug sources in our computer systems. This is much akin to our decision not to find Mark by interviewing people passing by on the street. It is also akin to our decision, for the time being, to avoid searching vegetarian venues. It is certainly possible for a display driver bug to exist in our previous situation, and it is certainly possible for Mark to harbor deep-seated resentment of his employment situation; but it is unlikely.

Where the approach to finding Mark and the approach to finding a computer error start to diverge is in the difference between the two systems. Much to the pleasure of private investigators, human beings tend to be predictable. An especially hard problem arises when particularly poorly written software, the interaction of competing systems, or a pile-up of errors results in a Complex System*. A simple example is that the state of one module can affect the state produced by another module. To determine which module is the offender, as with any multivariable problem, we can control for the state of one module and observe how it affects the output of the other. If we hold module A fixed and the error still appears for every state of A while module B produces its output, we know module A is not the offender, and we can continue to limit our problem space (see the sketch below).
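To make that concrete, here is a minimal sketch, in Java, of what “controlling for” one module’s state can look like in a throwaway test harness. Everything in it is invented for illustration – the module interface, the states, and the corruption check are placeholders, not code from any particular system.

```java
// A minimal sketch of "controlling for" one module's state.
// All names here are hypothetical, made up for the illustration.
import java.util.List;

public class IsolateOffender {

    // Hypothetical producer: module B consumes module A's state and yields output.
    interface ModuleB {
        String produce(String stateOfA);
    }

    static boolean looksCorrupted(String output) {
        // Stand-in check for "our error" -- replace with the real symptom.
        return output.contains("\uFFFD");
    }

    public static void main(String[] args) {
        ModuleB b = state -> state + "\uFFFD"; // toy implementation that always misbehaves

        // Hold module A to each of its possible states, one at a time.
        List<String> statesOfA = List.of("empty", "partial", "full");

        boolean errorForEveryState = statesOfA.stream()
                .allMatch(state -> looksCorrupted(b.produce(state)));

        // If the error appears no matter what A's state is, A's state is not what
        // drives the failure, so the search narrows to module B.
        System.out.println(errorForEveryState
                ? "Error independent of A's state -> suspect module B"
                : "Error depends on A's state -> keep A in the problem space");
    }
}
```

If the error only shows up for particular states of A, module A stays in the problem space; if it shows up regardless, A is exonerated and the search space shrinks by one variable.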

Here is a real-world example of limiting a problem:

A few months ago, a software product I was working on was reported to occasionally start up in a dysfunctional state – certain modules would consistently fail to start. My first stab was tackling the problem of “occasionally”: I needed to know the parameters of “occasionally.” It turns out there are slight variations in how users like to “turn on” and “turn off” the software. Some would shut down the software within the VM and keep the VM running. Others would halt the VM and let our init scripts shut down the software. Some would even destroy everything and start fresh every day – fresh install, fresh VM – especially since all our environments are disposable and easily recreated.

Whenever the software was stopped using Vagrant (we use Vagrant to manage local VMs, by the way), the issue could be reproduced. In addition, even the oldest versions of the dysfunctional modules produced this startup issue, yet the issue had not been reported until our adoption of Vagrant. I needed to know whether the issue was limited to something within the JVM and the JVM’s interaction with the virtualized environment, or whether it could only be reproduced within the virtual environment. It turns out that when running the latest offending release on bare metal, the issue could not be reproduced. At this point I knew the issue was unlikely to be related to the “main” JVM component of the software, or to any particular release; it was more likely a delta between running the software on bare metal and running it in our Puppet-managed virtual environments.

Hmm, the disk is probably corrupted somehow. Improper shutdown? What about Vagrant vs. OpenStack? We have local development environments in a Vagrant/VirtualBox context, and upper environments in an OpenStack deployment. Is Vagrant doing an ACPI shutdown, or could it be doing a hard power-off, unlike OpenStack’s machine management? However unlikely, it was easy to test by turning the software off and on in an upper engineering environment. This produced the same error. At that point I knew it was something within our system code and common OS environment, with the problem space boundary stopping at the JVM.

Does the issue reproduce with a SIGKILL versus a SIGINT and manual startup? Reproducing it with SIGKILL would point to an improper-shutdown issue, and we can also try this on bare metal. It reproduces with SIGKILL, but not SIGINT. How is our init script starting and stopping the software? It sends a SIGINT in the code, but I don’t see any logs indicating action. Oh wow, our Puppet-provisioned start script is not setting a process lock in /var/lock – better fix that. Fixed. The issue persists, but at least we now see the init script called in the logs. Well, I think we are still on the right track; let’s keep looking. Holy cow, our rc stop priority is WAY too low; jexec stops before our software. Fixing that resolved the issue.
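One detail worth calling out from the SIGKILL/SIGINT experiment: a JVM only runs its shutdown hooks on signals it can catch, such as SIGINT or SIGTERM; a SIGKILL (or a hard power-off) skips them entirely, so any cleanup they perform – releasing lock files, flushing state – never happens. The small sketch below makes the difference visible; the lock-file path and class name are made up for the demo, not taken from our software.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ShutdownDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical lock file, standing in for something like a /var/lock entry.
        Path lock = Path.of("/tmp/shutdown-demo.lock");
        Files.deleteIfExists(lock);
        Files.createFile(lock);

        // Shutdown hooks run on SIGINT (Ctrl-C) and SIGTERM, but never on SIGKILL,
        // so a hard kill leaves the stale lock behind for the next startup to trip over.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                Files.deleteIfExists(lock);
                System.out.println("Clean shutdown: lock released");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }));

        System.out.println("Running as pid " + ProcessHandle.current().pid()
                + "; compare `kill -INT <pid>` with `kill -KILL <pid>`");
        Thread.sleep(Long.MAX_VALUE); // simulate a long-running service
    }
}
```

Stop it with SIGINT and the lock disappears; kill -9 it and the lock survives – exactly the kind of leftover state that produces “occasionally” broken startups.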

You can see in this example the importance of limiting the problem. Finding and fixing the root issue would have been nearly impossible if we had kept the JVM itself in scope.

 

*If you are interested in complexity theory and systems complexity, I’d recommend reading more about the Cynefin Framework for reasoning about systems.