1. What is availability?
Availability \(A\) is the fraction of a given time window during which the system is "working correctly." It takes a value between 0 and 1 (or 0% and 100%), and the closer it is to 1, the higher the availability.
In availability design, we derive \(A\) from MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) (covered below), and combine multiple components to compute the overall system availability.
2. MTBF, MTTR, failure rate λ, and repair rate μ
Availability can be derived from a device's failure-and-repair cycle.
- MTBF (Mean Time Between Failures): the average time from one repair until the next failure occurs.
- MTTR (Mean Time To Repair): the average time from a failure until the system is restored.
Relationship with failure rate λ and repair rate μ
MTBF and MTTR are the reciprocals of the failure rate \(\lambda\) and the repair rate \(\mu\), respectively.
Using these, availability can be written as:
Worked example
For a component with MTBF = 1000 hours and MTTR = 10 hours, the availability is:
3. Availability of a series configuration
A series configuration is one in which the entire system stops if even one component fails. A typical example is a dependency chain such as Web server → App server → DB.
Assuming each component fails independently, the system availability is the product of each component's availability.
Worked example
When \(A_1 = 0.99\), \(A_2 = 0.99\), and \(A_3 = 0.999\):
4. Availability of a parallel configuration (1-out-of-n)
A parallel configuration is one in which the system keeps running as long as at least one of the components is alive. Typical examples include a dual-DB setup or redundant Web servers behind a load balancer.
We compute the probability that "every component fails at the same time" and subtract it from 1. Each component's unavailability is \(1 - A_i\).
Worked example
When \(A_1 = 0.99\) and \(A_2 = 0.99\):
5. Availability of a k-out-of-n configuration
A k-out-of-n configuration treats the system as alive if at least \(k\) of the \(n\) branches are up.
- \(k = 1\): same as a regular parallel (1-out-of-n) configuration.
- \(k = n\): essentially equivalent to a series configuration (all branches must be alive). This tool only covers the range \(1 \le k \le n - 1\).
- \(k \ge 2\): includes practical patterns such as 2-out-of-3 majority-vote redundancy (e.g. quorum-based clusters or Quorum configurations).
When all branches have the same availability (closed form)
When every component has the same availability \(A\), we can use the binomial distribution. We sum the probabilities of exactly \(j\) branches being alive.
When branches have different availabilities (general form)
When each branch has a different availability, the count of alive branches follows a Poisson binomial distribution. We enumerate every possible combination (subset) of which branches are alive or failed, and sum the probabilities of those cases in which the alive count is at least \(k\).
The meaning of this formula is clear, but it requires enumerating \(2^n\) subsets, so the cost grows exponentially with \(n\). For example, \(n = 30\) means about 1 billion combinations, which is impractical.
Speeding it up with dynamic programming (DP)
To address this, the tool uses dynamic programming to reduce the \(O(2^n)\) problem to \(O(n^2)\). Define \(p_j^{(i)}\) as "the probability that exactly \(j\) of the first \(i\) branches are alive":
Initial conditions
Before looking at any branch (\(i = 0\)), the probability of 0 alive is 1, and all others are 0.
Recurrence
When we add the \(i\)-th branch, we sum two cases: "the \(i\)-th branch fails" and "the \(i\)-th branch is alive."
After processing every branch (\(i = n\)), summing the probabilities for which the alive count is at least \(k\) gives the system availability.
Each \(i\) requires a single pass over a table of length \(n + 1\), so the total cost is \(O(n^2)\) — far faster than the naive \(2^n\) enumeration.
Worked example: 2-out-of-3
When \(A_1 = A_2 = A_3 = 0.9\), we sum the cases of exactly 2 and exactly 3 branches alive.
Computing via subset enumeration
For the heterogeneous case (\(A_1 = 0.9, A_2 = 0.8, A_3 = 0.7\)):
The same value via dynamic programming
Updating the \(p_j^{(i)}\) table according to the recurrence above gives the following progression. The highlighted row is the final result.
| \(i\) | Branch added \(A_i\) | \(p_0^{(i)}\) | \(p_1^{(i)}\) | \(p_2^{(i)}\) | \(p_3^{(i)}\) |
|---|---|---|---|---|---|
| 0 | — (initial state) | 1 | 0 | 0 | 0 |
| 1 | \(A_1 = 0.9\) | 0.1 | 0.9 | 0 | 0 |
| 2 | \(A_2 = 0.8\) | 0.02 | 0.26 | 0.72 | 0 |
| 3 | \(A_3 = 0.7\) | 0.006 | 0.092 | 0.398 | 0.504 |
For example, in row \(i = 2\), \(p_1^{(2)} = 0.26\) is the sum of "\(A_2\) fails with 1 alive in the previous step (\(0.9 \times 0.2 = 0.18\))" and "\(A_2\) is alive with 0 alive in the previous step (\(0.1 \times 0.8 = 0.08\))."
From the last row (\(i = n = 3\)), we sum the probabilities for which the alive count is at least \(k = 2\).
We get the same 0.902 as the subset enumeration above, confirming that DP scales to large branch counts in \(O(n^2)\).
6. Computing annual and monthly downtime
Once the system availability \(A\) is known, downtime is given by:
Availability vs. annual downtime
| Availability \(A\) | Common name | Annual downtime (approx.) |
|---|---|---|
| 0.99 | Two nines | about 87 h 36 min |
| 0.999 | Three nines | about 8 h 46 min |
| 0.9999 | Four nines | about 52 min 34 s |
| 0.99999 | Five nines | about 5 min 15 s |
| 0.999999 | Six nines | about 31 s |
7. Worked example: Web + App + dual DB
Let's compute a typical three-tier configuration combining series and parallel.
Assume the following availabilities for each component:
- Web: \(A_W = 0.999\)
- App: \(A_A = 0.999\)
- DB#1, DB#2: \(A_{D1} = A_{D2} = 0.99\) (two identical units)
First we compute the parallel DB portion.
Then we compute the overall availability as a series of Web → App → DB.
Annual downtime is: