All bets are off
In the probability yardstick post, we explored the challenges of assigning probabilities to future events and reviewed some practical examples. The motivation was laid out:
- we want to fly airplanes with maximum safety
- we want to protect populations from extreme weather hazards
- we want to manage companies with well-informed decisions
- we want to make great personal choices
- ...
There is an abundance of reasons for wanting to predict the future.
Using one of the methods mentioned in that post, it is possible to obtain a probability distribution for a variable that will be observed in the future. But once we have it, what should we do with it? How does it enable rational decision-making? That is the topic of this post.
Let’s start with a simple example. We will provisionally assume that the probability distribution we found is 100% accurate. This idealized assumption changes nothing on a conceptual level; it simply makes our back-of-the-envelope calculations easier.
Let’s suppose our probability distribution shows that a specific part of an electronic system has a 5% probability of failure throughout its supported lifecycle. The outcome is binary, thus the probability of the part not failing is 95%. In probability yardstick nomenclature, the part is “almost certain” not to fail.
Now, should we assume it won’t fail and ignore the “remote chance” that it will? Or should we accept the consequences of a “remote chance” of failure? Should we try to make this chance of failure less likely to occur? Should we try to mitigate any consequences of failure?
We don't have enough information to decide what to do.
To move forward, we should at least consider:
- how many times will the variable be observed?
- what is the impact of failure?
- who is impacted by failure?
Let’s derive this information for particular cases.
Consumer perspective: a critical part of a laptop you purchased
From a consumer perspective, the answers are as follows:
- one
- you can't use your laptop until it's repaired
- you
Laptop owners can't reduce the probability of this hypothetical failure event — and even if they could, surveying and applying the options would be too time-consuming. The rational strategies for this case are mitigation, meaning: to have a backup plan in case of laptop failure, or inaction, meaning: to do nothing and improvise if it happens. Keep in mind, there's only a “remote chance” that the failure materializes in the lifecycle of the laptop.
Producer perspective: a critical part of a laptop we manufacture
From the perspective of a laptop manufacturer, answers are as follows:
- once per unit supplied (hundreds of thousands of units or more)
- laptop owners will be unable to use their laptops until they're repaired
- 5% of the customers who bought this laptop model; brand; stakeholders
This is the exact same case from a different perspective. For the manufacturer, there isn’t a “remote chance” of failure – given the large number of units produced, it is virtually certain that roughly 5% of the units will fail; with large numbers, probabilities manifest as frequencies. Therefore, a probability of 5% equates to unlikely in a single case, but it is “almost certain” over a million cases.
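The claim that probabilities manifest as frequencies at scale is easy to check with a quick simulation. The unit count and the seed below are illustrative, not taken from the post:

```python
import random

random.seed(42)

P_FAIL = 0.05        # per-unit failure probability from the example
N_UNITS = 1_000_000  # hypothetical production volume

# Simulate each unit independently failing with probability 5%.
failures = sum(random.random() < P_FAIL for _ in range(N_UNITS))
rate = failures / N_UNITS

print(f"observed failure rate: {rate:.4f}")  # lands very close to 0.05
```

Run it with any seed: over a million units, the observed failure rate barely strays from the 5% probability, which is exactly why the manufacturer should treat the failures as a certainty to be budgeted for.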
The rational strategy for the laptop manufacturer depends on several factors, including the stress this 5% problem brings to support services, the brand damage that might arise from it, but also how competitors handle similar scenarios. Laptop manufacturers may choose to lower failure probability, to deal with its impact, or a mixture of both.
Reseller perspective: a critical part of a laptop we resell
From the perspective of a laptop reseller, answers are as follows:
- once per unit supplied (tens to hundreds of units)
- laptop owners will be unable to use their laptops until they're repaired
- approximately 5% of the customers who bought this laptop model, with uncertainty; brand; stakeholders
The case of the reseller is particular in the sense that the number of units might not be large enough to materialize the 5% as a reliable average. In fact, if a reseller sold 30 units of the laptop in question, the expected value of faulty units would be 1.5, but the probability of 3 faulty units, i.e. twice the average, would be approximately 13%. If 300 were sold, rather than 30, the expected value of faulty units would be 15, but the probability of 30 faulty units would be only 0.02%.
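These figures follow directly from the Binomial probability mass function, and can be verified with the standard library alone (numbers as in the text):

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k failures among n independent units."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p = 0.05
print(f"{binom_pmf(3, 30, p):.3f}")    # ≈ 0.127 (about 13%)
print(f"{binom_pmf(30, 300, p):.4f}")  # ≈ 0.0002 (about 0.02%)
```

In both cases we ask for twice the expected number of faulty units; the probability of that deviation shrinks dramatically as the number of units grows.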
The reseller perspective gives us reason to detail what's going on: by observing a binary variable n times we are building a sampling distribution, in this case a Binomial distribution with parameters n and p, where p = 0.05. This Binomial distribution expresses the probability of the number of faulty units being k, for any k ≤ n. It can be easily shown that for the Binomial distribution, the ratio between the standard deviation and the mean is given by:
\( \frac{\sigma}{\mu} = \sqrt{\frac{1-p}{np}} \)
This means that, for a large number of units, relative deviations from the mean are small, whereas for small numbers relative deviations can be significant.
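Plugging our example figures into this ratio makes the contrast concrete (p = 0.05 throughout; the unit counts echo the reseller and manufacturer cases):

```python
from math import sqrt

def cv(n: int, p: float) -> float:
    """Coefficient of variation sigma/mu for a Binomial(n, p)."""
    return sqrt((1 - p) / (n * p))

for n in (30, 300, 1_000_000):
    print(f"n = {n:>9}: sigma/mu = {cv(n, 0.05):.3f}")
# n = 30 gives roughly 0.8 (deviations comparable to the mean itself),
# n = 300 roughly 0.25, and a million units well under 1%.
```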
Another particular feature of this case is that close business relationships with customers incentivize extra concerns in terms of mitigating product failures.
The rational strategy for the laptop reseller might be to prepare for a number of failures higher than the 5% average – the number of units sold, combined with the Binomial distribution, indicates the largest number of failures that still has a reasonable probability of occurring.
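One way to make that concrete is to ask for the smallest number of failures k that covers some target fraction of outcomes. A sketch, assuming 30 units sold and a hypothetical 99% coverage target:

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def planning_quantile(n: int, p: float, level: float = 0.99) -> int:
    """Smallest k whose cumulative probability reaches the target level."""
    k = 0
    while binom_cdf(k, n, p) < level:
        k += 1
    return k

# 30 units at 5%: the mean is 1.5 failures, but covering 99% of
# outcomes means preparing for considerably more than the average.
print(planning_quantile(30, 0.05))  # → 5
```

So a reseller who wants to be ready for 99% of scenarios should plan for up to five failures, more than three times the expected value – the small-n effect described above, made actionable.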
Another producer perspective: critical part of an airplane we supply
From the perspective of a hypothetical airplane supplier, answers are as follows:
- once per unit supplied (hundreds or thousands of units, depending on the scale)
- the airplane will crash, killing all passengers and crew
- hypothetically 5% × total airplanes supplied × average number of occupants, plus the families of the deceased, stakeholders, trust placed in airline companies, and even broader economic impacts caused by a decrease in travel.
The rational strategy for the airplane supplier is definitely to lower the probability of failure by several orders of magnitude, by improving the component in question and/or implementing redundancies. It's neither acceptable nor legal to supply planes with a single point of failure that materializes in 5% of the cases.
Discussion
The methods described in the probability yardstick post can be used to find probability distributions for different types of uncertain events. We might want to predict good things, such as economic recovery, or bad things like operational failures similar to the examples given above. The prediction of operational failures is of particular interest due to its broad applicability across industry.
It should be clear by now that even the simplest possible situation – a binary variable with a very dominant favorable outcome – can’t be dealt with in the absence of contextual information. And even with contextual information, the course of action will, to some extent, depend upon subjective judgment.
We have seen that differences in the number of observations, the magnitude of the impact, and the bearers of such impact, can dramatically influence the rational course of action. There is an omnipresent tension between investing in prevention and mitigation versus assuming the problem won’t happen.
Risk Management
So far we have considered individual variables in isolation. In practical settings, however, organizations face not a single uncertain variable, but a constellation of variables, each with a different probability distribution. The process that handles multiple variables in an integrated way is called Risk Management, which is part of formal certifications like ISO 27001.
If an organization is aware of N independent variables that are potential sources of operational failure, the risk associated with each is defined as:
\( R_i = P_i I_i \)
that is, the probability of operational failure times the impact of such failure. Note that the probability above is the sum of the probabilities of every outcome in the variable's distribution that counts as a failure.
The total operational risk the organization is exposed to would then be given by:
\( R = \sum\limits_{i = 1}^{N} P_i I_i \)
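As a toy illustration of this sum, a risk register reduces to a weighted total. Every entry below is hypothetical, with impacts expressed in currency units:

```python
# Hypothetical risk register: (description, P_i, impact I_i in currency units).
register = [
    ("critical laptop part fails", 0.05, 200_000),
    ("supplier delivery delayed",  0.20,  50_000),
    ("data-centre power outage",   0.01, 900_000),
]

# Total operational risk: R = sum of P_i * I_i over all variables.
total_risk = sum(p * impact for _, p, impact in register)
print(f"total operational risk: {total_risk:,.0f}")  # → 29,000
```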
Inserting the previously described examples into a Risk Management framework requires case by case analysis.
For the consumer perspective case – e.g. a freelancer who's concerned about business continuity – we can acknowledge the 5% as the probability of failure. But for the laptop manufacturer, that 5% is not part of the risk – it is certain to materialize due to the large number of units produced. If the decision was to accept the 5% failure rate and allocate the necessary resources to handle the consequences, the risk that must be considered is the risk of deviation from that value. Since the forecast is not 100% accurate, such deviations are likely to occur.
Taking allocated resources into account, the Risk Management process could define, for instance, that operational failure occurs if the failure rate of the critical component exceeds 7%, even though the forecast says 5%. How would this probability be determined? Ideally, from the history of forecast deviations. Alternatively, via ad-hoc reasoning.
Once probability is determined, impact needs to be calculated, in this case measured as a financial loss. It should be noted that deviations are problematic in both directions – a failure rate of more than 5% means that there will be insufficient resources available for mitigation, whereas a deviation in the opposite direction means that more than necessary resources have been allocated, leading to opportunity cost. The calculation of the impact is company-specific, and not devoid of assumptions and subjectivity.
There is yet an additional challenge, which is to normalize the impact on a scale that makes risks of different operational failures comparable. This exercise usually involves the definition of a subjective risk scale (e.g. 1–5) and a framing of the impact for each potential failure into that scale – a process not devoid of subjectivity either.
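A minimal sketch of such a normalization, with purely illustrative loss thresholds for a 1–5 scale:

```python
# Hypothetical loss thresholds (currency units) mapped to a subjective
# 1-5 impact scale; anything above the last threshold scores 5.
SCALE = [(10_000, 1), (50_000, 2), (250_000, 3), (1_000_000, 4)]

def impact_score(loss: float) -> int:
    """Map an estimated financial loss onto the 1-5 risk scale."""
    for threshold, score in SCALE:
        if loss <= threshold:
            return score
    return 5

print(impact_score(40_000))     # → 2
print(impact_score(2_000_000))  # → 5
```

The thresholds themselves are a judgment call, which is precisely the subjectivity the paragraph above warns about.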
When all this is in place, total risk can be reviewed periodically, and resources can be allocated to potential failures of higher risk.
Final notes
In these two posts, we walked all the way from the definition of probability to practical Risk Management, discussing probability assignment methods in between. Such a quick walk wouldn't be possible without simplifications since the topics at hand are rather complex – there are entire books devoted to them. But I had never come across an easy-to-read summary connecting the dots without overly dense mathematical material. When I needed it, I couldn't find it anywhere. Well, it's here now and I hope it proves helpful. All bets are off.