Manufacturing Code: Risk Analysis

What is risk? How do we define it in the context of our application? This is a discussion of how risk affects your work, along with some tools to help you decide whether a given risk is worth addressing.

May 16, 2018

In an earlier article I discussed scarcity of time and some factors to consider when budgeting your time on a feature or piece of work. One of those factors was risk. Risk is a fairly broad term meant to cover a wide variety of negative outcomes. In school and while working in manufacturing I was introduced to this concept very early on, and the reasons are much more apparent there. Risks in a manufacturing plant or machining workshop are all around you: heavy rotating machinery, drums of chemicals, forklifts, heavy objects. Warning labels and SOPs for all of the above are part of your daily environment.

However, I want to talk about risk from a systems and engineering standpoint. In this field risk can be a great many things, with physical danger to a real person being just one. Any system, whether mechanical, chemical, or software, carries risks in its planning, development, and maintenance processes. Risks involving project staffing, technical debt, workflows, productivity, and scope are just a few of the general project management risks.

I want to get into the risks involved in making technical decisions, and software development offers plenty of examples. Do I need a null check here? Do I need a test case for this requirement? Should I handle this edge case? It's tempting to say yes to all of these, but saying yes means more code, more logic, and more work to maintain. Is it worth it?
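To make the trade-off concrete, here's a minimal sketch in Python; the function and field names (`average_order_value`, `order.total`) are invented for illustration. Each guard clause is a small up-front cost paid against a failure that may never actually occur:

```python
def average_order_value(orders):
    """Mean order total for a customer's list of orders."""
    # Null check: only worth the branch if callers can actually pass None.
    if orders is None:
        return 0.0
    # Empty-list edge case: without this guard, the division below raises
    # ZeroDivisionError. How likely, severe, and detectable is that?
    if not orders:
        return 0.0
    return sum(order.total for order in orders) / len(orders)
```

Two of the three lines of logic here exist purely to handle failure modes, and each one is a maintenance commitment. Whether they earn their keep is exactly the question this article is about.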

This question is fundamental in nature, and in a way I touched on it in the scarcity article. When faced with it, we usually react based on a weird combination of instinct and laziness. It is always a struggle to make an informed, conscious decision based on the facts, to stay objective. As human beings it's natural to react this way; it's part of how we operate and how we evolved. That's exactly why so many tools have been developed to help us stay objective.

One such tool is the Failure Modes and Effects Analysis (FMEA). FMEA was popularized by its use at NASA and in the US military, then adopted by the aerospace and automotive industries, and it has since seen use in almost every manufacturing industry, particularly in the United States, Germany, and Japan.

[Figure: FMEA diagram]

The goal of an FMEA is to identify the failure modes in a system and then systematically and objectively quantify how "risky" each one is. This quantification is done by answering three questions about each failure mode:

  1. How severe is it? (SEV) - Just how badly can the system fail if this failure mode occurs? Will our app crash? Could it cause physical harm or death? Will the user even notice it happened?
  2. How likely is it to occur? (OCCUR) - How many times a second/week/decade is this likely to occur? Does this likelihood scale with the number of users, or is it roughly constant over time?
  3. How likely are we to find it? (DETEC) - Once the failure occurs, will it fail in a way we know about (so we can take steps to rectify it), or will it fail silently?

A failure mode's final score, the "Risk Priority Number" (RPN), is the product of these three factors. The RPNs are then ranked to identify the most and least "risky" failure modes. In practice, when I have used an FMEA, this process is done by a team with diverse skill sets: a manager, an engineer, a business type, an hourly employee who runs the system in question, and so on. Both the diverse team and the act of dividing "how risky is this failure mode?" into three parts can produce some interesting results. Many times you will find items you didn't think were very important at the top of your list, and items you would have assumed to be critical near the bottom.
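To make the arithmetic concrete, here is a small Python sketch of the scoring step; the failure modes and their 1-to-10 scores are invented for illustration, not taken from a real FMEA:

```python
# Each factor is typically scored on a 1-10 scale by the team.
failure_modes = [
    # (description,                            SEV, OCCUR, DETEC)
    ("payment API returns malformed JSON",       7,   3,    2),
    ("config typo silently disables logging",    4,   2,    9),
    ("cache never expires stale entries",        5,   6,    7),
]

# RPN = SEV * OCCUR * DETEC; a higher number means address it sooner.
ranked = sorted(
    ((desc, sev * occ * det) for desc, sev, occ, det in failure_modes),
    key=lambda pair: pair[1],
    reverse=True,
)

for description, rpn in ranked:
    print(f"RPN {rpn:4d}  {description}")
```

Note how the scariest-sounding item (the SEV 7 payment failure) lands at the bottom of this made-up list: it rarely occurs and is easy to detect, while the unglamorous cache bug tops it. That reshuffling is exactly the kind of surprise the three-part scoring tends to produce.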

Now back to software: should you be making an FMEA spreadsheet listing every if/else or for loop in your code? Definitely not. Should you be making one for larger, system-level decisions? I think it would be an interesting exercise. The main value I hope someone derives from this article is not the FMEA itself but the questions it forces us to ask about our code.

So should you handle that edge case?

  1. If the edge case occurs and we don't handle it, what will the result be? Could we start a fire, damage our company's reputation, or will there be any consequence at all?
  2. How often is this piece of code run? Does it run once on an obscure, rarely used endpoint, or does it sit in some piece of middleware that every request to our application passes through?
  3. If it fails, are users going to get a frozen app at random without us ever knowing, or are we gracefully handling the error and logging it to our well-maintained analytics system (as sketched below)?
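The third question in particular often comes down to a few lines of code. Here's a hedged sketch using Python's standard logging module; `render_dashboard` and `fallback_dashboard` are hypothetical stand-ins for real application code:

```python
import logging

logger = logging.getLogger(__name__)

def load_dashboard(user_id):
    try:
        # render_dashboard is a hypothetical stand-in for the real work.
        return render_dashboard(user_id)
    except Exception:
        # Fail loudly to ourselves, gracefully to the user: record the full
        # stack trace, then return a safe default instead of freezing.
        logger.exception("Dashboard render failed for user %s", user_id)
        return fallback_dashboard()  # hypothetical safe default
```

The try/except costs almost nothing to write, but it moves the failure mode from "undetectable" to "logged with a stack trace", which drives its DETEC score (and therefore its RPN) way down.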

Approaching code with this mindset can really help you optimize your time, be productive, and produce quality work.