Whether your software is helping doctors make medical decisions, alerting sleepy drivers, or just selling dog food, there are consequences for failure. It could mean missed revenue for your company, loss of customers' confidence, or worse.

Having spent much of my career working on safety-critical and mission-critical software such as Electronic Health Records, clinical trial software, and high-volume e-commerce infrastructure, I have learned that most incidents are preventable with the right set of strategies and values.

The following is by no means a comprehensive list, but a foundation upon which to build. Some of these may seem obvious, but I believe that their impact on reliability is underestimated, particularly how they reinforce each other.

1. Reduce Technical Debt

Technical debt is the kiss of death for the reliability of your application, not to mention your budget, timeline, and agility.

Technical debt accumulates when a software system or data model no longer makes sense, or has become difficult to reason about. Areas of friction and outdated dependencies count too.

Your code or data model may be lying to you: its names and structures no longer match what the system actually does. This creates unnecessary complexity and confusion, and will often lead to mistakes (bugs). Even worse, it can become a vicious cycle in which engineers are forced to take further shortcuts to meet deadlines.

Some amount of tech debt is unavoidable. It's an organic part of the software development process. As you learn what you need to build and how to build it, the code needs to be updated to reflect that new understanding.

2. Keep it Simple

When NASA designs a spacecraft, there are generally two basic approaches to ensuring reliability. One is redundancy, i.e., backup computers, backup power, etc. When redundancy isn't possible, they employ simplicity.

The ascent stage engine on the Apollo Lunar Module, the rocket engine that lifted astronauts off the lunar surface and back to their mothership, was surprisingly simple. It used "hypergolic" propellants, a fuel and an oxidizer that ignite on contact with each other, so no separate ignition system was needed.

If simplicity works for spacecraft, then it can work for software. In fact, I think it works even better for software. The simpler something is:

  • The fewer things there are to break or be flawed
  • The fewer things there are to test, and the more thoroughly each can be tested
  • The easier it is to understand, and the fewer mistakes will be made

3. Make Your Code Readable

At some point in my career, I noticed a widely held belief that senior programmers write more complicated code than junior programmers, and that you also need to be senior to understand it. It's seen as a badge of honor.

"There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other is to make it so complicated that there are no obvious deficiencies." — C. A. R. Hoare

In my opinion, the best programmers are the ones who can take complex behavior and break it down into relatively simple, readable, reasonable, and communicative code that most programmers can understand.
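
As a small illustration of breaking a dense rule into communicative pieces, here is a sketch in TypeScript. The `Order` shape, the discount rule, and every function name are hypothetical, invented purely for this example.

```typescript
// Hypothetical order shape, invented for this illustration.
interface Order {
  totalCents: number;
  customerSince: Date;
  couponCode?: string;
}

const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000;

// Harder to read: one dense condition that hides the business rule.
function discountDense(o: Order, now: Date): boolean {
  return (o.totalCents >= 5000 && now.getTime() - o.customerSince.getTime() >= ONE_YEAR_MS) || !!o.couponCode;
}

// Easier to read: each named predicate states one part of the rule.
function isLargeOrder(order: Order): boolean {
  return order.totalCents >= 5000;
}

function isLoyalCustomer(order: Order, now: Date): boolean {
  return now.getTime() - order.customerSince.getTime() >= ONE_YEAR_MS;
}

function hasCoupon(order: Order): boolean {
  return order.couponCode !== undefined && order.couponCode !== "";
}

// The top-level function now reads almost like the rule itself.
function qualifiesForDiscount(order: Order, now: Date): boolean {
  return (isLargeOrder(order) && isLoyalCustomer(order, now)) || hasCoupon(order);
}
```

Both versions compute the same result. The second one simply lets a reader check the rule without decoding the expression first.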

4. Follow an Incident Management Strategy

Two of the biggest problems I see with many teams are noisy error logs and an open feedback loop. These two go hand-in-hand.

Noisy Slack channels and server logs only get attention once a user complains. Then some poor engineer has to scour them for anything that might be relevant. By that point, it's already too late.

If your errors don't have a high signal-to-noise ratio, their value is greatly diminished.

If you don't already have one, come up with an Incident Management Strategy. Think of it as a pipeline: errors and warnings are captured (ideally as standardized, structured data), then filtered, and finally turned into incidents that alert real people in a way that depends on severity.
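
To make the pipeline idea concrete, here is a minimal sketch in TypeScript. Everything in it is an assumption for illustration: the `LogEvent` fields, the severity levels, the channel names, and the routing policy are hypothetical, and a real setup would plug into whatever logging and alerting tools you already use.

```typescript
type Severity = "info" | "warning" | "error" | "critical";

// A standardized, structured event. All field names are hypothetical.
interface LogEvent {
  severity: Severity;
  code: string;     // stable, machine-readable identifier
  message: string;  // human-readable description
  service: string;  // where the event originated
}

// Hypothetical alert channels, from least to most urgent.
type Channel = "daily-digest" | "ticket" | "page-on-call";

interface Incident {
  event: LogEvent;
  channel: Channel;
}

const SEVERITY_RANK: Record<Severity, number> = {
  info: 0,
  warning: 1,
  error: 2,
  critical: 3,
};

// Filter step: drop anything below the chosen threshold.
function isActionable(event: LogEvent, minimum: Severity): boolean {
  return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[minimum];
}

// Routing step: the more severe the event, the more urgent the alert.
function channelFor(severity: Severity): Channel {
  if (severity === "critical") return "page-on-call";
  if (severity === "error") return "ticket";
  return "daily-digest";
}

// The pipeline: captured events in, incidents for real people out.
function toIncidents(events: LogEvent[], minimum: Severity): Incident[] {
  return events
    .filter((event) => isActionable(event, minimum))
    .map((event) => ({ event, channel: channelFor(event.severity) }));
}

const incidents = toIncidents(
  [
    { severity: "info", code: "CACHE_MISS", message: "Cache miss", service: "catalog" },
    { severity: "critical", code: "DB_UNREACHABLE", message: "Database is down", service: "orders" },
  ],
  "warning"
);
// Only the critical event survives the filter, routed to "page-on-call".
```

The threshold is what keeps the signal high: low-severity events are still captured, but they never reach a person.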

5. Follow Error Handling Best Practices

No matter your strategy, good incident handling starts at the application level (the programmer). Remember the axiom, "Garbage In, Garbage Out."

"Almost all catastrophic failures are the result of incorrect handling of non-fatal errors explicitly signalled in software."

An unexpected edge case could occur, someone may log in with a bad password, or a third-party service could be down. Each of these should be communicated across all of your application layers and downstream infrastructure in a standardized way.

If your programming language has an Error construct, use it. Strings do not represent errors well and lack critical features, such as type, structure, context, and stack traces.
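
Here is a minimal sketch of what using the Error construct might look like in TypeScript. The `AppError` class, its error codes, and the `toHttpStatus` helper are hypothetical names invented for this example, not part of any library.

```typescript
// Hypothetical set of machine-readable error codes for this sketch.
type ErrorCode = "INVALID_CREDENTIALS" | "UPSTREAM_UNAVAILABLE" | "UNEXPECTED";

// Extending the built-in Error keeps the stack trace and adds type,
// structure, and context that a plain string cannot carry.
class AppError extends Error {
  readonly code: ErrorCode;
  readonly context: Record<string, unknown>;

  constructor(code: ErrorCode, message: string, context: Record<string, unknown> = {}) {
    super(message);
    this.name = "AppError";
    this.code = code;
    this.context = context;
  }
}

// Because the error is typed and structured, callers can branch on it
// instead of pattern-matching on message strings.
function toHttpStatus(err: unknown): number {
  if (err instanceof AppError) {
    switch (err.code) {
      case "INVALID_CREDENTIALS":
        return 401;
      case "UPSTREAM_UNAVAILABLE":
        return 503;
      case "UNEXPECTED":
        return 500;
    }
  }
  // Anything that is not an AppError is treated as an unexpected failure.
  return 500;
}

const loginError = new AppError("INVALID_CREDENTIALS", "Bad password", { userId: "u-123" });
```

The same structured object can travel through every layer, so a log line, an HTTP response, and an alert can all be derived from one source rather than from ad hoc strings.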

6. Use a Type-Safe Programming Language

"Types invalidate most of the silly errors that can sneak into codebases, and create a quick feedback loop to fix all the little mistakes when writing new code and refactoring."

Strongly typed languages raise an error when you unexpectedly mix types. Statically typed languages, such as TypeScript, Java, and C#, do this at compile time, so the error surfaces before you deploy your code.

Modern web application systems are mostly passing data around in different shapes, and these type-safe languages make it easy to define and declare these data structures as first-class citizens in the language, avoiding many common mistakes.
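
As a rough sketch of what declaring data shapes looks like in TypeScript, consider the example below. The payment shapes and field names are hypothetical, chosen only to show the idea.

```typescript
// Hypothetical payload shapes, invented for this illustration.
interface CardPayment {
  method: "card";
  last4: string;
  amountCents: number;
}

interface BankTransfer {
  method: "bank";
  iban: string;
  amountCents: number;
}

// A payment is exactly one of the declared shapes.
type Payment = CardPayment | BankTransfer;

function describePayment(payment: Payment): string {
  // Checking `method` tells the compiler which fields are available.
  switch (payment.method) {
    case "card":
      return `Card ending in ${payment.last4}`;
    case "bank":
      return `Transfer from ${payment.iban}`;
  }
}

// The compiler rejects mistakes like these before the code ever runs:
// describePayment({ method: "card", amountCents: 500 });  // missing `last4`
// describePayment({ method: "cash", amountCents: 500 });  // unknown method
```

Uncommenting either of the last two lines produces a compile-time error, which is the quick feedback loop described above.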

7. Code with GUTs (Good Unit Tests)

"I think when you hear the phrase 'it's just test code'. To me that's a code smell." — Alan Page

Unit Tests provide a "green light" that tells programmers that they haven't broken anything. Clever programmers run unit tests frequently during the development process, especially during a refactor.

Unfortunately, many unit tests I've seen are brittle, confusing, and overwhelming. These tests end up breaking every time there is a change and are a headache to deal with. These tests actually become worse than useless — they give a false sense of security.

GUTs test behavior, not implementation. They are easy to read and maintain. GUTs read like a straightforward specification, helping the reader to understand what the code being tested should be doing.
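
A small sketch of what that can look like, in TypeScript. The `Cart` class is hypothetical, and `scenario` and `expectEqual` are tiny stand-ins for a real test framework, written here only so the example is self-contained.

```typescript
// Hypothetical code under test.
class Cart {
  private items: { sku: string; priceCents: number; quantity: number }[] = [];

  add(sku: string, priceCents: number, quantity: number = 1): void {
    for (const item of this.items) {
      if (item.sku === sku) {
        item.quantity += quantity;
        return;
      }
    }
    this.items.push({ sku, priceCents, quantity });
  }

  totalCents(): number {
    return this.items.reduce((sum, item) => sum + item.priceCents * item.quantity, 0);
  }
}

// Minimal stand-ins for a test framework (assumptions for this sketch).
const passed: string[] = [];

function scenario(description: string, body: () => void): void {
  body(); // throws if an expectation fails
  passed.push(description);
}

function expectEqual<T>(actual: T, expected: T): void {
  if (actual !== expected) {
    throw new Error(`expected ${String(expected)}, got ${String(actual)}`);
  }
}

// Each test reads like a line of the specification and only touches
// the public behavior of Cart.
scenario("an empty cart costs nothing", () => {
  expectEqual(new Cart().totalCents(), 0);
});

scenario("adding the same item twice charges for both units", () => {
  const cart = new Cart();
  cart.add("dog-food", 1999);
  cart.add("dog-food", 1999);
  expectEqual(cart.totalCents(), 3998);
});
```

Neither test inspects the private `items` array, so the internal storage could change from an array to a map without breaking them.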

8. Use End-to-End Integration Testing Tools

Just like unit tests, end-to-end integration tests help programmers know that they haven't caused any regressions during development. Unlike unit tests, however, they exercise the entire system.

Many parts of an application are difficult to unit test, particularly UIs and database writes. For these, end-to-end integration tests might be the best bang for your buck.

9. Do Manual Testing

"Discovering the unexpected is more important than confirming the known." — George E. P. Box

Even with 100% unit test coverage and extensive automated end-to-end integration test coverage, I would still recommend at least some basic manual smoke testing, in both development and production environments, focused on the areas of impact.

Unless you are using some kind of visual diff tool, your UI can still look wrong while passing all of your tests. These kinds of problems are much more obvious to humans.

10. Learn From Your Mistakes

"Failure isn't fatal, but failure to change might be." — John Wooden

If you learn from your mistakes, not only are you preventing a repeat mistake, but you will learn how to prevent similar ones. In fact, sometimes you can get really lucky and a small mistake can lead you down the path of finding a critical design flaw in your system.

At my last company, I created a policy that for every production bug, the engineer who diagnosed and solved the issue must research and write a "Post Mortem," along with their recommendations for how to prevent a similar issue in the future. It was successful in revealing weaknesses and preventing future mistakes while giving engineers more ownership.


This article was originally published on Medium by Marc H. Weiner, co-founder of Aeroview.io and a startup veteran with two successful exits and a special interest in scaling and safety-critical systems.