To understand software complexity, we first have to look at the attributes of the software itself that make it complex. Obviously, a program that has a bigger goal (say, a browser or a video game) or more numerous goals (an OS, or a multi-purpose web server like nginx) is going to be more complex than one that is more focused, like a desktop movie player or a simple development web server. A program that needs to run permanently (an OS, a device driver, a service) will also be inherently more complex than one that only needs to run intermittently (video games, mobile and desktop applications) or only needs to process data in batches (command-line tools). The reason is simple: as your execution lifetime increases, you’ll need to deal with unexpected errors and handle them without compromising the execution of the program.
Error handling strategies
But first, let’s try to pin down what we mean by errors. A program has a goal, and in the course of its execution it will try to make progress towards that goal: for example, a web browser trying to display the document at an address will first have to fetch the document, which involves opening sockets, managing HTTP state, etc. If everything goes according to plan, we have the “happy path”, and we can think of any other execution path as belonging to the set of error paths. An error is then simply a divergence from the ideal path: a state that we have reached and that prevents us from progressing towards our goal. (Note that we are in an error state only when we diverge from the ideal path; something that is an error in one case, like an EOF condition on a socket, can be expected in other circumstances, like the same EOF condition after having successfully read the bytes from a configuration file.)
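To make the EOF remark concrete, here is a minimal Python sketch. The names `UnexpectedEOF`, `read_exact` and `read_all` are made up for illustration, and in-memory streams stand in for a real socket or file. The same end-of-stream condition is an error in one function and the normal exit in the other:

```python
import io


class UnexpectedEOF(Exception):
    """Raised when a stream ends before the expected data has arrived."""


def read_exact(stream, size):
    """Read exactly `size` bytes; an early EOF here diverges from the happy path."""
    data = stream.read(size)
    if len(data) < size:
        raise UnexpectedEOF(f"wanted {size} bytes, got {len(data)}")
    return data


def read_all(stream, chunk_size=4):
    """Read until EOF; here EOF is simply how the loop is supposed to end."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF: expected, so we just stop reading
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

Calling `read_all(io.BytesIO(b"key=value\n"))` returns the whole content, while `read_exact(io.BytesIO(b"abc"), 8)` raises `UnexpectedEOF`, even though both hit the end of the stream.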
In practice, we’ll have mechanisms at the code level to handle errors (return codes, exceptions, monadic error handling…), which we’ll use to implement a strategy to overcome (or not) the errors and make progress towards our goal.
The strategies we’ll adopt to handle errors are really dependent on the degree to which we want to handle errors:
No error handling: we just panic whenever we encounter an error, or we continue with some default value that lets us ignore the error. This is typically what a prototype will use. If you transform the panics into slightly more useful error messages, you also get the typical strategy of a command-line utility. This is fairly low cost, and your code will mostly consist of progressing towards the goal, but it is not resilient at all. Prototypes in particular are only cheap because they use this strategy.
Limited error handling: we typically choose a few subsystems that we want to be fairly resilient, protect against a well-chosen subset of errors within these subsystems, and define strategies for progressing towards the goal when those errors occur. Most software uses this strategy, with of course a varying number of subsystems and varying degrees of resilience within them. For example, games will typically protect against a variety of network errors, including disconnection, and some of the strategies they use can involve a large amount of complexity (like pausing the game on other clients, pending reconnection of the disconnected client). This is necessary because the underlying assumption is that networks are unreliable, so errors should be expected. On the other hand, the very same games will usually expect the file system (and thus the underlying storage media, like SSDs or hard drives) to be reliable, so they won’t be resilient in the face of an unexpected disk error. There are various ways we can recover from errors: ignoring the error and restarting from a known deterministic state (for example, using default values when a configuration file can’t be read for some reason), retrying the problematic code, possibly with different parameters, or simply punting the error condition back up the call stack, where it might be better handled at a higher level.
This strategy is costlier than the first one. To stay with the example of games, we know that network code in particular can be very tricky to get right. In general, code that handles error conditions and provides recovery can easily outnumber ideal-path code by an order of magnitude.
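The recovery options mentioned above (falling back to a known state, retrying, and punting to the caller) could look roughly like this in Python. This is only a sketch: `DEFAULT_CONFIG`, `load_config` and `retry` are hypothetical names, the configuration is assumed to be JSON, and `ConnectionError` stands in for whatever network failure a real program would see.

```python
import json

# Known deterministic state to fall back to (illustrative values).
DEFAULT_CONFIG = {"volume": 5, "fullscreen": False}


def load_config(path):
    """Ignore a read or parse failure and restart from the default values."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):  # missing or unreadable file, or invalid content
        return dict(DEFAULT_CONFIG)


def retry(operation, attempts=3):
    """Retry `operation` a few times; if it keeps failing, punt to the caller."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: let the error go up the call stack
```

Even in this tiny form, most of the lines are about what happens when things go wrong, which is the cost the text describes.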
A lot of networking and distributed systems use this kind of approach, and some particularly clever implementations use a lower-cost strategy that could be summarized by the motto “die and retry”. The idea is to make the software resilient enough that, when it encounters an error it does not know how to recover from (or for which writing the handling code would just be too costly), it simply kills the execution at the process level and restarts. This approach was made famous by Erlang (and, I believe, pioneered by LISPs), but you can find variants of it in different languages and at different scales. (You can restart the function or the process, but also the whole subsystem! That is, provided the restart is also cheap enough.) The trade-off here is that you make your code simpler by not handling errors, and you pay for that simplicity by running your code on a platform that makes it easy to die without corrupting the permanent state, and to retry cheaply, with performance characteristics good enough that a retry is fast enough for your requirements.
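As a rough illustration of “die and retry” at the smallest scale (restarting a function rather than a process), here is a Python sketch. `supervise` and `make_worker` are hypothetical names, and this is not how Erlang implements supervision; it only shows the shape of the idea: the worker handles nothing itself, and a tiny supervisor throws the failed run away and starts again from a fresh state.

```python
def supervise(make_worker, max_restarts=3):
    """Run a worker; on any failure, discard it and start over from scratch."""
    restarts = 0
    while True:
        worker = make_worker()  # fresh state: nothing survives the failed run
        try:
            return worker()
        except Exception:
            if restarts == max_restarts:
                raise  # restart budget exhausted: let the caller decide
            restarts += 1
```

The sketch only works if building a fresh worker is cheap and a failed run leaves no corrupted permanent state behind, which are exactly the two conditions the text puts on the platform.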
Thorough error handling: this is the realm of safety-critical software, like embedded software for avionics and aerospace, but also (in an ideal world) OSes, web browsers, device drivers, servers, etc. In this strategy, the error handling code will outnumber the ideal-path code almost everywhere. There will be code for handling recovery at each level at which the error may be recoverable, and this will be coupled with well-defined semantics, states, and test suites. Because writing code is expensive, you’ll want to limit the potential sources of errors, so you’ll also generally limit your use of the programming language to a safer, simpler subset. You’ll also want to have the best coverage possible for various metrics, and you’ll complement testing with a range of static and dynamic analysis tools. You may sometimes also define higher-level strategies, like having concurrent, independently developed implementations, to give an example from avionics.
Obviously, this strategy is extremely costly, because of the overwhelming complexity of handling the combinatorial explosion of potential errors that divert you from the ideal path. To give a simplified model, if your ideal path to a goal involves calling n functions sequentially, and each of these n functions can either succeed or fail, then there’s only one possible path to success, but there are 2**n - 1 paths that will fail to achieve the goal, only a subset of which might be recoverable. This combinatorial explosion also explains why making complex, highly reliable software is so expensive.
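The count in this simplified model is easy to check by brute force. The Python sketch below (the function name `count_paths` is made up) enumerates every success/failure combination for n calls, treating each call as an independent success or failure as the model above does:

```python
from itertools import product


def count_paths(n):
    """Return (successful paths, failing paths) for n sequential calls."""
    outcomes = list(product((True, False), repeat=n))
    happy = sum(1 for outcome in outcomes if all(outcome))
    return happy, len(outcomes) - happy


for n in (1, 3, 10):
    happy, failing = count_paths(n)
    print(n, happy, failing)
# prints:
# 1 1 1
# 3 1 7
# 10 1 1023
```

With only ten calls there is already one path to success against 1023 ways to miss the goal.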
So what’s the takeaway from all this?
A few things: first, error handling is responsible for a large part of the complexity of the software we write, the other big part being the complexity of the “business logic” itself.
Second, there’s usually a small number of ways the software can succeed and, by complement, a large number of ways it can fail. This imbalance is why exhaustive error handling is expensive, and why we need strategies to manage the amount of error handling we do. The best errors are those that can’t arise, usually because there is no state, or because the state is encoded within the type system, if your language allows that. Other ways to manage it include handling large numbers of error cases with generic code, containing errors at the subsystem boundaries, and choosing our battles: that is, using exhaustive and costly techniques where it matters, and simply punting (by panicking or resetting to a known state) wherever it doesn’t.
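To illustrate what containing errors at a subsystem boundary with generic code might look like, here is a small Python sketch. `Outcome` and `at_boundary` are hypothetical names, and catching every `Exception` is a deliberate simplification for the example: one generic handler at the boundary replaces many specific ones inside the subsystem.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Outcome(Generic[T]):
    """What crosses the boundary: either a value or the error that stopped us."""
    value: Optional[T] = None
    error: Optional[Exception] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def at_boundary(subsystem_call: Callable[[], T]) -> Outcome[T]:
    """Run a subsystem entry point and contain whatever it raises."""
    try:
        return Outcome(value=subsystem_call())
    except Exception as exc:  # generic handling: no per-error code here
        return Outcome(error=exc)
```

The caller then only has two cases to deal with, `outcome.ok` or not, whatever went wrong inside the subsystem.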
The next question that naturally comes to mind is: how good are the tools that our languages give us to handle errors? I’ll explore this question in a future series of posts on error codes, exceptions, and other methods.