Saturday, November 16, 2019

Good Programming: Error Handling

Error handling is one of the holy grails of software development and it's easy to see why if you consider that bad error handling can bring down an entire system.
Think I'm exaggerating? I work on an eCommerce solution which fails completely if a product page has no valid image. That means the customer gets a 500 error page instead of the opportunity to buy a product because someone forgot to upload an image.

I've even had discussions with developers who neglect error handling and say: "Well, if this happens, then it's *supposed* to blow up". What "this happens" means varies from context to context but, in my example above, it means "a product has no image" so let's crash.

Now, I'd like to believe that this is oftentimes an oversight rather than a conscious decision. The programmer focuses on the Ideal Path and simply neglects to consider (or is never told) what should happen when the Ideal Path gets bumpy.

Modern coding practices such as chaining feed into this. It looks absolutely fabulous to chain methods together in code... until a single method returns something unexpected and our favourite error, the NullPointerException, brings down the system.

    database()
      .read('something')
      .call('something', user.property()) // Boom! user turns out to be null and the whole chain collapses
      .thenDoThis('never happens');

Of course you can use chaining responsibly, but you can't always stop the code you call from behaving badly. Even if you wrote the code you are calling, it may misbehave or someone may change it; you never know when something UNEXPECTED will occur and tear down your work of art.
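
To make that concrete, here's a minimal TypeScript sketch of a more pessimistic version of the chain above; the database() and user objects are the same hypothetical ones from the example, and the log object is made up too:

    // Same hypothetical API as the chained example above.
    declare function database(): {
      read(name: string): { call(name: string, value: string): { thenDoThis(msg: string): void } };
    };
    declare const user: { property(): string } | undefined;
    declare const log: { warn(message: string): void };

    // Pessimistic version: guard the shaky value before starting the chain.
    const property = user?.property();   // undefined instead of a crash if user is missing
    if (property === undefined) {
      log.warn('No user property available, skipping the call');
    } else {
      database()
        .read('something')
        .call('something', property)
        .thenDoThis('actually happens this time');
    }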

The solution? Expect the unexpected! Be pessimistic. Like any good engineer.

In my first job out of Uni I worked for a small bio-startup in Roslin, Scotland. Yes, Dolly the Sheep was pretty much sharing office space with us. My boss, a no-nonsense Yorkshireman, kept asking me why so much of my code was error handling. I replied that you never know what might happen, and he replied: "I pay you to code business logic and what I'm getting is 80% error handling and paranoid disaster mitigation." Well, not in those terms exactly: he probably told me to "get yer finger ot and staaat deliverin' t'goods afoor A thump you".

What is the goal of error or exception handling? IMHO it's to be transparent to the user and the operator of the system about what just happened and, ideally, what can be done.

In simple terms it means telling the user that what they tried to do did not work and how to proceed or work around the problem. An example is showing a message like "Your file could not be saved, please try again!". This is obviously not very helpful if the problem persists, but if the problem is intermittent (a network outage to Google Drive) then it may just work the second or third time around.
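
In code that could look something like this minimal TypeScript sketch, where saveToDrive and showMessage are hypothetical placeholders rather than a real API:

    // Hypothetical placeholders for the real save call and the UI message helper.
    declare function saveToDrive(doc: string): Promise<void>;
    declare function showMessage(text: string): void;

    async function saveDocument(doc: string): Promise<void> {
      try {
        await saveToDrive(doc);   // may fail on an intermittent network outage
      } catch (err) {
        // Tell the user what happened and how they can proceed.
        showMessage('Your file could not be saved, please try again!');
      }
    }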

Now that we've shown the user what's going on, we need a way of recording this incident so that the operator or owner of the software product is also aware of the outage and can take appropriate action. We may log error NET307 to record a network save issue.
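
The operator-facing half might be sketched like this; the shape of the logger is an assumption, any real logging library would do:

    // Assumed shape of an operator-facing logger.
    interface OperatorLog {
      error(code: string, message: string, context?: Record<string, unknown>): void;
    }
    declare const log: OperatorLog;

    // Record the incident with a stable code so the operator can search for it and count it.
    log.error('NET307', 'Network save failed', { retryShown: true });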

In the example above there may be no action the operator or owner can take: Google Drive cannot help it if a user's WiFi is dodgy, can they? Or can they? If the product owner is aware that 50% of users run into NET307 each week, they may work towards mitigating the pain of not being able to save a document as the user's train enters a tunnel and network access disappears.

Thus sound logging, monitoring and reporting of the error feeds back into the product development cycle to produce new features such as offline saving: the document is saved to a local storage mechanism if the network is not available and later synced up to the online service when the network becomes available again.
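
A minimal sketch of that fallback, again with hypothetical placeholder functions:

    // Hypothetical placeholders: the online save, a local store and a sync queue.
    declare function saveToDrive(doc: string): Promise<void>;
    declare function saveLocally(doc: string): Promise<void>;
    declare function queueForSync(doc: string): void;

    async function saveWithFallback(doc: string): Promise<void> {
      try {
        await saveToDrive(doc);      // normal path: save to the online service
      } catch (err) {
        await saveLocally(doc);      // network gone: keep the document safe locally
        queueForSync(doc);           // and sync it up when the network comes back
      }
    }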

Examples of bad exception handling abound: users seeing stack traces or misleading messages about what just happened, or operators seeing the wrong stack trace or line number (yes, I'm looking at you, PowerShell). Proper error handling is about the developer understanding what just happened in their code and translating this into a user-friendly message, an operator-friendly log and, ideally, an owner-friendly metric. In this way the target audience is best served.

I am tired of hearing that "it works" when a crappy piece of software runs under some (ideal) conditions. It's like saying that half a barrel floats: indeed it does but you wouldn't want to cross the Atlantic in it. Your software may run in ideal conditions but the world is not ideal.

This becomes more apparent the more connected the world becomes and the more distributed our systems are. If your software fails completely (e.g. the Node.js process crashes and your service is dead) because of an HTTP timeout due to a slow network, then you need to think about better error handling.
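
For example, here's a sketch of an HTTP call that survives a slow or dead network instead of killing the process; it assumes a runtime with fetch and AbortSignal.timeout (Node 18+ or a modern browser), and the function name and error texts are made up:

    // Returns the parsed response, or undefined if the price service is slow or down.
    async function fetchPrices(url: string): Promise<unknown | undefined> {
      try {
        const response = await fetch(url, { signal: AbortSignal.timeout(5000) });
        if (!response.ok) {
          console.error(`Price service answered ${response.status}`);
          return undefined;
        }
        return await response.json();
      } catch (err) {
        // Timeouts and network errors land here instead of crashing the process.
        console.error('Price service unreachable', err);
        return undefined;
      }
    }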

Some people say: "But my module/component is only responsible for X and not for the network. I assume the network is working or my component is useless."

This is a valid argument: it does not make sense for each software component to try to manage the failure of the dependencies of all its dependencies. I do think, however, that each component should manage the failure of its own dependencies. This can mean catching the exceptions of components you use and either passing the exception through or enriching it with additional meaning for your context.
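
A small sketch of the "enrich and pass on" variant; readCustomer is a hypothetical dependency and the Error cause option assumes an ES2022 runtime:

    // Hypothetical low-level dependency that may throw.
    declare function readCustomer(id: string): Promise<unknown>;

    async function loadCustomerForCheckout(id: string): Promise<unknown> {
      try {
        return await readCustomer(id);
      } catch (err) {
        // Don't swallow the failure: pass it on, enriched with meaning for this context.
        throw new Error(`Could not load customer ${id} for the checkout page`, { cause: err });
      }
    }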

Throwing exceptions was once state of the art, but I have come to question the approach in recent years. Working for many years with ABAP, the programming language used within SAP, I and others built up layers of classes which all used exceptions harmoniously. The neat thing was that it pushed error handling to the bottom of our code (the CATCH block). Our code could proceed along the happy Ideal Path and readability improved.
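
Translated very roughly into TypeScript (the order functions, the log object and the ORD100 code are assumptions, not our ABAP classes), the shape was something like this:

    // Hypothetical steps of a business process, each of which may throw.
    interface Order { id: string; }
    declare function readOrder(id: string): Promise<Order>;
    declare function priceOrder(order: Order): Promise<Order>;
    declare function bookOrder(order: Order): Promise<void>;
    declare const log: { error(code: string, message: string): void };

    async function processOrder(id: string): Promise<void> {
      try {
        // The happy Ideal Path reads straight through...
        const order = await readOrder(id);
        const priced = await priceOrder(order);
        await bookOrder(priced);
      } catch (err) {
        // ...and the error handling sits at the bottom, out of the way.
        log.error('ORD100', `Order ${id} could not be processed: ${String(err)}`);
      }
    }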

The challenges were that we sometimes had to wrap legacy, old-fashioned functions and methods which did not use exceptions and, on the other hand, that legacy code which didn't work with exceptions had to TRY...CATCH our fancy-schmancy code, which was ugly.

When I moved to a new company I decided to drop exceptions completely and take a different approach: logging and simple return codes. In this approach all code logs errors and warnings consistently (using a global singleton log object) and then exits when it encounters an error. Routines that execute successfully return a true value ('X' in ABAP means true); in all other cases they return nothing ('' == false).

ABAP has a really neat keyword called CHECK which exits the current block or function if the expression that follows it is not true. So code looks like this:

    CHECK obj->method1().
    CHECK function_call.
    CHECK some=>static_method( 'foo' ).
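
For readers who don't speak ABAP, here's a rough TypeScript analogue of the same log-and-return pattern; the singleton log and the step functions are assumptions, not the real code:

    // Global singleton log, analogous to the ABAP log object described above.
    const log = {
      error(code: string, message: string): void {
        console.error(`${code}: ${message}`);
      },
    };

    // Low-level steps log on failure and return false; true means "carry on".
    function readInput(input?: string): boolean {
      if (!input) {
        log.error('INP001', 'No input supplied');
        return false;
      }
      return true;
    }
    function validateInput(): boolean { return true; }   // stand-ins for real work
    function postDocument(): boolean { return true; }

    // The caller mimics ABAP's CHECK: bail out as soon as a step reports failure.
    function run(input?: string): boolean {
      if (!readInput(input)) return false;
      if (!validateInput()) return false;
      if (!postDocument()) return false;
      return true;
    }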

The net result is that we completely avoid any error handling in intermediate classes and only need to consider two approaches:

  • Low-level functions should log errors and exit
  • Application-level programs should read the error log and inform the user

There are some downsides (like managing a global log object) but, in general, I'm quite happy with the results. We save all messages which occur in all our programs (even SUCCESS or INFO messages) and can even enable tracing by saving DEBUG messages. These messages are sent to ElasticSearch and can be analysed in Kibana, meaning we get consolidated logging and metrics "for free".

In summary, I highly recommend considering all stakeholders (users, operators and owners) when developing, and being pessimistic about whether our dependencies will behave or not. Log as much as possible at the correct level and show the user what happened and how to proceed.

A final word on error codes: a unique number is a good thing to have, but it can be difficult to manage across many applications. I prefer a kind of hybrid code which is somewhat human-readable: NETFAIL707 is more useful than 0x003030301.
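
One possible shape for such hybrid codes, sketched in TypeScript; the second code and both messages are made up for illustration:

    // A short, greppable prefix plus a number, kept in one central place.
    const ErrorCodes = {
      NETFAIL707: 'Could not reach the storage service',
      DBFAIL101: 'Database write rejected',
    } as const;

    type ErrorCode = keyof typeof ErrorCodes;   // 'NETFAIL707' | 'DBFAIL101'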
