1. What is availability?

Availability \(A\) is the fraction of a given time window during which the system is "working correctly." It takes a value between 0 and 1 (or 0% and 100%), and the closer it is to 1, the higher the availability.

\[ A \;=\; \dfrac{\text{uptime}}{\text{uptime} + \text{downtime}} \]

In availability design, we derive \(A\) from MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) (covered below), and combine multiple components to compute the overall system availability.

The "nines" notation: \(A = 0.999\) (three nines) corresponds to roughly 8.76 hours of downtime per year, and \(A = 0.99999\) (five nines) corresponds to roughly 5.26 minutes per year.

2. MTBF, MTTR, failure rate λ, and repair rate μ

Availability can be derived from a device's failure-and-repair cycle.

\[ A \;=\; \dfrac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \]

Relationship with failure rate λ and repair rate μ

MTBF and MTTR are the reciprocals of the failure rate \(\lambda\) and the repair rate \(\mu\), respectively.

\[ \lambda \;=\; \dfrac{1}{\mathrm{MTBF}}, \qquad \mu \;=\; \dfrac{1}{\mathrm{MTTR}} \]

Using these, availability can be written as:

\[ A \;=\; \dfrac{\mu}{\lambda + \mu} \]

Worked example

For a component with MTBF = 1000 hours and MTTR = 10 hours, the availability is:

\[ A \;=\; \dfrac{1000}{1000 + 10} \;=\; \dfrac{1000}{1010} \;\approx\; 0.9901 \]

3. Availability of a series configuration

A series configuration is one in which the entire system stops if even one component fails. A typical example is a dependency chain such as Web server → App server → DB.

Reliability block diagram with A, B, C connected in series
A, B, and C connected in series

Assuming each component fails independently, the system availability is the product of each component's availability.

\[ A_\text{system} \;=\; \prod_{i=1}^{n} A_i \;=\; A_1 \cdot A_2 \cdots A_n \]
Intuition: Adding more elements to a series chain lowers availability. Putting three 0.99 components in series gives \(0.99^3 \approx 0.9703\), because the probability that any one of them fails accumulates.

Worked example

When \(A_1 = 0.99\), \(A_2 = 0.99\), and \(A_3 = 0.999\):

\[ A_\text{system} \;=\; 0.99 \times 0.99 \times 0.999 \;\approx\; 0.9791 \]

4. Availability of a parallel configuration (1-out-of-n)

A parallel configuration is one in which the system keeps running as long as at least one of the components is alive. Typical examples include a dual-DB setup or redundant Web servers behind a load balancer.

Reliability block diagram with A and B connected in parallel
A and B connected in parallel

We compute the probability that "every component fails at the same time" and subtract it from 1. Each component's unavailability is \(1 - A_i\).

\[ A_\text{system} \;=\; 1 \;-\; \prod_{i=1}^{n} (1 - A_i) \]
Intuition: Parallelization raises availability through redundancy. Putting two components of availability 0.9 in parallel yields \(1 - (1 - 0.9)^2 = 1 - 0.01 = 0.99\), a large improvement over the single-instance value of 0.9.

Worked example

When \(A_1 = 0.99\) and \(A_2 = 0.99\):

\[ A_\text{system} \;=\; 1 - (1 - 0.99)(1 - 0.99) \;=\; 1 - 0.0001 \;=\; 0.9999 \]

5. Availability of a k-out-of-n configuration

A k-out-of-n configuration treats the system as alive if at least \(k\) of the \(n\) branches are up.

A 2-out-of-3 configuration consisting of three branches A, B, and C
OK if at least 2 of the 3 are alive (the badge shows k/n)

When all branches have the same availability (closed form)

When every component has the same availability \(A\), we can use the binomial distribution. We sum the probabilities of exactly \(j\) branches being alive.

\[ A_\text{system} \;=\; \sum_{j=k}^{n} \binom{n}{j} A^{j} (1 - A)^{n - j} \]

When branches have different availabilities (general form)

When each branch has a different availability, the count of alive branches follows a Poisson binomial distribution. We enumerate every possible combination (subset) of which branches are alive or failed, and sum the probabilities of those cases in which the alive count is at least \(k\).

\[ A_\text{system} \;=\; \sum_{\substack{S \subseteq \{1,\dots,n\} \\ |S| \ge k}} \left( \prod_{i \in S} A_i \right) \left( \prod_{i \notin S} (1 - A_i) \right) \]

The meaning of this formula is clear, but it requires enumerating \(2^n\) subsets, so the cost grows exponentially with \(n\). For example, \(n = 30\) means about 1 billion combinations, which is impractical.

Speeding it up with dynamic programming (DP)

To address this, the tool uses dynamic programming to reduce the \(O(2^n)\) problem to \(O(n^2)\). Define \(p_j^{(i)}\) as "the probability that exactly \(j\) of the first \(i\) branches are alive":

\[ p_j^{(i)} \;=\; \Pr\!\bigl(\text{exactly } j \text{ of branches } 1 \text{ to } i \text{ are alive}\bigr) \]

Initial conditions

Before looking at any branch (\(i = 0\)), the probability of 0 alive is 1, and all others are 0.

\[ p_0^{(0)} \;=\; 1, \qquad p_j^{(0)} \;=\; 0 \quad (j \ge 1) \]

Recurrence

When we add the \(i\)-th branch, we sum two cases: "the \(i\)-th branch fails" and "the \(i\)-th branch is alive."

\[ p_j^{(i)} \;=\; \underbrace{p_j^{(i-1)} \cdot (1 - A_i)}_{i\text{-th fails, with }j\text{ alive in the previous step}} \;+\; \underbrace{p_{j-1}^{(i-1)} \cdot A_i}_{i\text{-th is alive, with }j-1\text{ alive in the previous step}} \]

After processing every branch (\(i = n\)), summing the probabilities for which the alive count is at least \(k\) gives the system availability.

\[ A_\text{system} \;=\; \sum_{j=k}^{n} p_j^{(n)} \]

Each \(i\) requires a single pass over a table of length \(n + 1\), so the total cost is \(O(n^2)\) — far faster than the naive \(2^n\) enumeration.

Worked example: 2-out-of-3

When \(A_1 = A_2 = A_3 = 0.9\), we sum the cases of exactly 2 and exactly 3 branches alive.

\[ \begin{aligned} A_\text{system} &= \binom{3}{2}\, 0.9^{2}\, 0.1^{1} \;+\; \binom{3}{3}\, 0.9^{3} \\ &= 3 \times 0.81 \times 0.1 \;+\; 1 \times 0.729 \\ &= 0.243 \;+\; 0.729 \;=\; 0.972 \end{aligned} \]

Computing via subset enumeration

For the heterogeneous case (\(A_1 = 0.9, A_2 = 0.8, A_3 = 0.7\)):

\[ \begin{aligned} A_\text{system} &= A_1 A_2 (1 - A_3) + A_1 (1 - A_2) A_3 + (1 - A_1) A_2 A_3 + A_1 A_2 A_3 \\ &= 0.9 \times 0.8 \times 0.3 + 0.9 \times 0.2 \times 0.7 + 0.1 \times 0.8 \times 0.7 + 0.9 \times 0.8 \times 0.7 \\ &= 0.216 + 0.126 + 0.056 + 0.504 \;=\; 0.902 \end{aligned} \]

The same value via dynamic programming

Updating the \(p_j^{(i)}\) table according to the recurrence above gives the following progression. The highlighted row is the final result.

\(i\) Branch added \(A_i\) \(p_0^{(i)}\) \(p_1^{(i)}\) \(p_2^{(i)}\) \(p_3^{(i)}\)
0 — (initial state) 1 0 0 0
1 \(A_1 = 0.9\) 0.1 0.9 0 0
2 \(A_2 = 0.8\) 0.02 0.26 0.72 0
3 \(A_3 = 0.7\) 0.006 0.092 0.398 0.504

For example, in row \(i = 2\), \(p_1^{(2)} = 0.26\) is the sum of "\(A_2\) fails with 1 alive in the previous step (\(0.9 \times 0.2 = 0.18\))" and "\(A_2\) is alive with 0 alive in the previous step (\(0.1 \times 0.8 = 0.08\))."

From the last row (\(i = n = 3\)), we sum the probabilities for which the alive count is at least \(k = 2\).

\[ A_\text{system} \;=\; p_2^{(3)} + p_3^{(3)} \;=\; 0.398 + 0.504 \;=\; 0.902 \]

We get the same 0.902 as the subset enumeration above, confirming that DP scales to large branch counts in \(O(n^2)\).

6. Computing annual and monthly downtime

Once the system availability \(A\) is known, downtime is given by:

\[ \text{annual downtime [hours]} \;=\; (1 - A) \times 8760 \]
\[ \text{monthly downtime [hours]} \;=\; (1 - A) \times \dfrac{8760}{12} \;=\; (1 - A) \times 730 \]

Availability vs. annual downtime

Availability \(A\) Common name Annual downtime (approx.)
0.99 Two nines about 87 h 36 min
0.999 Three nines about 8 h 46 min
0.9999 Four nines about 52 min 34 s
0.99999 Five nines about 5 min 15 s
0.999999 Six nines about 31 s

7. Worked example: Web + App + dual DB

Let's compute a typical three-tier configuration combining series and parallel.

Series configuration of Web -> App -> (DB#1 and DB#2 in parallel)
Web → App → (DB#1 and DB#2 in parallel), connected in series

Assume the following availabilities for each component:

First we compute the parallel DB portion.

\[ A_\text{DB} \;=\; 1 - (1 - 0.99)(1 - 0.99) \;=\; 1 - 0.0001 \;=\; 0.9999 \]

Then we compute the overall availability as a series of Web → App → DB.

\[ A_\text{system} \;=\; A_W \cdot A_A \cdot A_\text{DB} \;=\; 0.999 \times 0.999 \times 0.9999 \;\approx\; 0.99780 \]

Annual downtime is:

\[ (1 - 0.99780) \times 8760 \;\approx\; 19.27 \text{ hours / year} \]
What if the DB were not duplicated? With a single DB (\(A_\text{DB} = 0.99\)), the overall availability becomes \(0.999 \times 0.999 \times 0.99 \approx 0.98802\), and annual downtime grows to about 105 hours / year. This quantifies the benefit of parallelization.