Reliability of Software

Typical expectations - "Working Correctly"

Performs the function the user expected.
Tolerates the user making mistakes or using the software in unexpected ways.
Performance - Good enough for the required use case.
The system prevents any unauthorized access and abuse.

Reliability -> Continues "Working Correctly" even when things go wrong (faults)

Hardware Faults

Hard disks -> Mean Time to Failure (MTTF) - 10 to 50 years -> On average, in a storage cluster with 10,000 disks, one disk dies per day.
First Response - Add Redundancy => Disks - RAID Configs, Servers - Dual Power Supplies and hot-swappable CPUs.
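
The disk figure above is simple arithmetic. A rough sketch, assuming a cluster of 10,000 disks (an illustrative size) and failures spread evenly over the MTTF; the function name is made up for this note:

```python
# Expected disk failures per day, assuming failures are spread evenly over the MTTF.
# The cluster size of 10,000 disks is an assumed figure for illustration.

DAYS_PER_YEAR = 365


def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average number of disks failing per day in a cluster of num_disks."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)


if __name__ == "__main__":
    for mttf in (10, 50):
        rate = expected_failures_per_day(10_000, mttf)
        print(f"MTTF {mttf} years: about {rate:.2f} failures per day")
```

With these assumptions the rate lands between roughly 0.5 and 2.7 failures per day, i.e. on the order of one dead disk per day, which is why redundancy is the first response.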

Software Faults

Systematic Faults

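Software bug that crashes the server
Runaway process using up a shared resource (CPU, memory, disk, network)
A service the system depends on slows down or goes down
Cascading failures, where a fault in one component triggers faults in others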

Software-level Guarantees and Assertions
Testing through Process Isolation
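
One way to read "guarantees and assertions" is a component that checks its own bookkeeping while it runs. A minimal sketch, assuming a hypothetical in-memory queue (the class and its counters are invented for this note, not taken from any library): every message that came in must either still be queued or have been handed out.

```python
from collections import deque


class CheckedQueue:
    """Hypothetical in-memory queue that checks its own bookkeeping."""

    def __init__(self) -> None:
        self._items: deque = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, item) -> None:
        self._items.append(item)
        self.enqueued += 1
        self._check_invariant()

    def get(self):
        item = self._items.popleft()  # raises IndexError if the queue is empty
        self.dequeued += 1
        self._check_invariant()
        return item

    def _check_invariant(self) -> None:
        # Guarantee: messages in == messages out + messages still pending.
        pending = len(self._items)
        if self.enqueued != self.dequeued + pending:
            raise RuntimeError(
                f"queue invariant violated: in={self.enqueued}, "
                f"out={self.dequeued}, pending={pending}"
            )
```

If the counts ever disagree, the check fails loudly at the point of the discrepancy instead of letting the fault spread quietly.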

Human Errors

Being a dumbo - even careful people make mistakes, e.g. configuration errors by operators

Solution - Well-designed Abstractions, Decouple the places where people make the most mistakes from the places where mistakes can cause failures, Sandbox Testing and Thorough Testing at all levels

Make it fast to roll back changes, roll out new code gradually and provide tools to recompute data.
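
Gradual rollout can be sketched as a percentage gate. This assumes users are identified by a string ID and are split into 100 buckets by hashing; the function name and the feature label are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if user_id falls inside the first `percent` of 100 buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash feature and user together so each feature gets its own stable split.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(in_rollout(u, "new-checkout", 10) for u in users)
    print(f"{enabled} of {len(users)} users see the new code at 10%")
```

Because each user's bucket is fixed, raising the percentage only adds users, and setting it back to 0 acts as a fast rollback without a redeploy.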

Monitoring/telemetry
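
A minimal telemetry sketch, assuming the only signal is whether each request succeeded; the class name, window size and alert threshold are all illustrative choices, not recommendations:

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the most recent `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        # deque with maxlen drops the oldest outcome once the window is full.
        self._outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self._outcomes.append(ok)

    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def should_alert(self) -> bool:
        # Alert only when the rate is strictly above the threshold.
        return self.error_rate() > self.threshold
```

Paired with a gradual rollout, a signal like this gives an early warning that a change should be rolled back.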

In some situations, we may choose to sacrifice reliability to reduce development cost or operational cost.