Reliability of Software
Typical expectations - "Working Correctly"
- Performs the function the user expected.
- Tolerates the user making mistakes or using the software in unexpected ways.
- Performance is good enough for the required use case.
- The system prevents any unauthorized access and abuse.
Reliability -> Continuing to "work correctly" even when things go wrong (faults)
Hardware Faults
Hard disks -> Mean Time to Failure (MTTF) of 10 to 50 years -> In a storage cluster with 10,000 disks, expect on average one disk to die per day.
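The jump from an MTTF measured in decades to frequent disk failures is easier to see with the arithmetic written out. The sketch below is a back-of-the-envelope estimate only: the cluster size of 10,000 disks and the function name `expected_failures_per_day` are assumptions made for illustration, and failures are assumed to be independent and evenly spread over time.

```python
# Back-of-the-envelope estimate: expected disk failures per day in a cluster.
# Assumes failures are independent and spread evenly over each disk's MTTF.
# The cluster size of 10,000 disks is an illustrative assumption.

DAYS_PER_YEAR = 365


def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected number of disk failures per day across the whole cluster."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)


if __name__ == "__main__":
    for mttf_years in (10, 50):
        rate = expected_failures_per_day(10_000, mttf_years)
        print(f"MTTF {mttf_years} years: ~{rate:.2f} failures/day")
```

Under these assumptions the estimate lands between roughly 0.5 and 2.7 failures per day, which is where the "about one disk per day" rule of thumb comes from.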
First Response - Add Redundancy => Disks - RAID configurations, Servers - dual power supplies and hot-swappable CPUs.
Software Faults
Systematic Faults
- A software bug that crashes the server
- A runaway process that uses up a shared resource
- A dependent service that slows down or goes down
- Cascading failures, where a fault in one component triggers faults in others
Software-level Guarantees and Assertions
Thorough testing and process isolation
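One way to read "software-level guarantees and assertions" is that a system can check its own invariants while it runs and raise an alert when one is violated. The sketch below is a toy illustration, not a production design: the class `CheckedQueue` and its counters are hypothetical names, and the invariant chosen (messages accepted = messages handed out + messages still queued) is just one example of such a guarantee.

```python
from collections import deque


class CheckedQueue:
    """A toy in-memory queue that checks its own invariant while running."""

    def __init__(self) -> None:
        self._items: deque = deque()
        self.enqueued = 0  # total messages ever accepted
        self.dequeued = 0  # total messages ever handed out

    def put(self, message: object) -> None:
        self._items.append(message)
        self.enqueued += 1
        self._check_invariant()

    def get(self) -> object:
        message = self._items.popleft()  # raises IndexError if the queue is empty
        self.dequeued += 1
        self._check_invariant()
        return message

    def _check_invariant(self) -> None:
        # Guarantee: every accepted message is either still queued or was handed out.
        in_flight = len(self._items)
        if self.enqueued != self.dequeued + in_flight:
            raise RuntimeError(
                f"queue invariant violated: in={self.enqueued}, "
                f"out={self.dequeued}, queued={in_flight}"
            )
```

In a real system the violation would more likely be logged and reported to monitoring than raised directly, but the idea is the same: the discrepancy is detected close to where it happens instead of surfacing later as corrupted data.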
Human Errors
Being a dumbo
Solution - Well-designed abstractions, decouple the places where people make the most mistakes from the places where they can cause failures, sandbox testing, and thorough testing at all levels.
Make it fast to roll back changes, roll out new code gradually and provide tools to recompute data.
Monitoring/telemetry
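Rolling out new code gradually, with a fast way to roll back, can be sketched as a percentage-based gate. This is a minimal sketch under stated assumptions: the function `in_rollout`, its parameters, and the choice of hashing the user ID into 100 buckets are all hypothetical and not taken from any particular deployment tool.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash feature and user together so each feature gets its own stable split.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Because the bucket for a given user never changes, raising `percent` only adds users to the new code path, and setting it back to 0 acts as an immediate rollback. Monitoring the users inside the rollout before widening it is what makes the gradual approach useful.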
In some situations, we may choose to sacrifice reliability to reduce development cost or operational cost.