Reliability of Software

Typical expectations - "Working Correctly"

Reliability -> Continuing to "work correctly" even when things go wrong (faults)

Hardware Faults

Hard disks -> Mean Time to Failure (MTTF) of 10 to 50 years -> In a storage cluster of 10,000 disks, about one disk dies per day on average.
First Response - Add Redundancy => Disks - RAID configurations, Servers - dual power supplies and hot-swappable CPUs.
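
The disk-failure estimate can be checked with a quick calculation. This is a rough sketch only: the 10,000-disk cluster size is an illustrative assumption, the function name is made up for this note, and failures are assumed to be independent and spread evenly over the MTTF.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent failures
    at a constant rate of one failure per disk per MTTF."""
    return num_disks / (mttf_years * 365)


# Assumed cluster size: 10,000 disks (illustrative, not a measured figure).
for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years: {rate:.2f} failures/day")
# prints:
# MTTF 10 years: 2.74 failures/day
# MTTF 50 years: 0.55 failures/day
```

Either end of the MTTF range gives a rate on the order of one disk per day, which is why redundancy is the first response.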

Software Faults

Systematic Faults

Software-level Guarantees and Assertions
Testing through Process Isolation
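
One way to picture a software-level guarantee backed by an assertion: a queue that checks, after every operation, that messages in minus messages out equals messages currently held. A minimal sketch; the class `CheckedQueue` and its counters are hypothetical names invented for this note, not part of any library.

```python
from collections import deque


class CheckedQueue:
    """FIFO queue that continuously verifies its own bookkeeping."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0  # total messages ever added
        self.dequeued = 0  # total messages ever removed

    def put(self, item):
        self._items.append(item)
        self.enqueued += 1
        self._check()

    def get(self):
        item = self._items.popleft()  # raises IndexError if the queue is empty
        self.dequeued += 1
        self._check()
        return item

    def _check(self):
        # Guarantee: nothing is lost or duplicated inside the queue.
        assert self.enqueued - self.dequeued == len(self._items), \
            "queue lost or duplicated a message"


q = CheckedQueue()
q.put("a")
q.put("b")
print(q.get())  # prints: a
```

If a bug ever breaks the invariant, the assertion fails right at the faulty operation instead of letting corrupted state spread quietly.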

Human Errors

Being a dumbo

Solution - Well-designed Abstractions, Decouple the places where people make the most mistakes, Sandbox Testing and Thorough Testing at all levels

Make it fast to roll back changes, roll out new code gradually and provide tools to recompute data.
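
A gradual rollout with a fast rollback can be sketched as a percentage gate: each user is hashed into a stable bucket from 0 to 99, and new code runs only for buckets below the current percentage. The function name `in_rollout` and the user IDs are hypothetical; real systems usually read the percentage from a config or feature-flag service.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` buckets.

    The bucket is derived from a hash of the user ID, so a given user
    always lands in the same bucket between calls and between processes.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    enabled = sum(in_rollout(u, percent) for u in users)
    print(f"{percent}% rollout -> {enabled} of {len(users)} users on new code")
```

Raising the percentage only adds users, since a bucket below 10 is also below 50, and setting it back to 0 turns the new code off for everyone without a redeploy.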

Monitoring/telemetry

In some situations, we may choose to sacrifice reliability to reduce development or operational cost.