Scalability

Ability of a system to cope with increased load.

Load can be described using load parameters, e.g. requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users, or the cache hit rate.

Twitter - November 2012 case study: delivering a tweet means fanning it out to each follower's home timeline, so the key load parameter is the distribution of followers per user, not just tweets per second.
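
A minimal sketch of the two approaches from that case study, using hypothetical in-memory dicts in place of Twitter's actual databases and caches:

```python
from collections import defaultdict

# Hypothetical in-memory stores standing in for real databases/caches.
tweets_by_user = defaultdict(list)   # user -> tweets they posted
followers = defaultdict(set)         # user -> set of their followers
home_timeline = defaultdict(list)    # user -> precomputed timeline cache

def post_tweet_fanout_on_write(user, tweet):
    """Do the work at write time: push the tweet into every follower's
    cached home timeline. Reads become cheap, but one tweet from a
    celebrity with millions of followers triggers millions of writes."""
    tweets_by_user[user].append(tweet)
    for follower in followers[user]:
        home_timeline[follower].append((user, tweet))

def read_timeline_fanout_on_read(user, following):
    """Do the work at read time: merge the tweets of everyone the user
    follows. Writes become cheap, but every timeline load is a big join."""
    return [(u, t) for u in following for t in tweets_by_user[u]]
```

Twitter moved from fan-out on read to fan-out on write, and then to a hybrid: fan-out on write for most users, fan-out on read for accounts with huge follower counts.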

Performance Metrics

In a batch processing system - we care about throughput (e.g. the number of records processed per second).
In online systems - we care about the service's response time.

Response Time - What the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays.
Latency - The duration that a request is waiting to be handled, during which it is latent

Thus, Response Time and Latency aren't the same.

It is better to think of the response time as a distribution of values, not as a single number.

It is better to summarize response times using percentiles rather than the mean: the median (p50) shows how long users typically wait, while higher percentiles (p95, p99, p99.9) show how bad the outliers are.
High percentiles of response times are known as tail latencies.
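
A small sketch of reading percentiles off a set of measured response times (the sample values are made up; numpy's percentile function does the work):

```python
import numpy as np

# Hypothetical response-time samples in milliseconds.
response_times_ms = np.array([12, 15, 14, 480, 16, 13, 18, 17, 950, 14])

print("mean:", response_times_ms.mean())              # dragged up by outliers
print("p50 :", np.percentile(response_times_ms, 50))  # the typical request
print("p95 :", np.percentile(response_times_ms, 95))
print("p99 :", np.percentile(response_times_ms, 99))  # the tail
```

Here the mean (~155 ms) is dominated by two outliers, while the median (~15 ms) still describes the typical request - which is why percentiles are the more honest summary.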

High percentiles are especially important in backend services that are called multiple times as part of serving a single end-user request. It takes just one slow call to make the entire request slow - an effect called tail latency amplification.
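
A back-of-the-envelope illustration of the amplification: if each backend call is independently slow with probability p, a request that waits on n such calls is slow with probability 1 - (1 - p)^n (the independence assumption is mine, for illustration):

```python
def p_request_slow(p_call_slow: float, n_calls: int) -> float:
    # The end-user request is slow iff at least one backend call is slow,
    # assuming each call is slow independently of the others.
    return 1 - (1 - p_call_slow) ** n_calls

# Even if only 1% of individual calls are slow, fanning out to 100
# backends makes roughly 63% of end-user requests slow.
for n in (1, 10, 100):
    print(n, round(p_request_slow(0.01, n), 3))
```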

Algorithms for efficiently approximating response-time percentiles over a stream: Forward Decay, t-digest, HdrHistogram.
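
For contrast, a naive but exact approach - keep a window of recent samples and sort on every query - shows the cost those algorithms are designed to avoid. A minimal sketch:

```python
from collections import deque

class RollingPercentiles:
    """Naive exact percentiles over the last `window` response times.

    O(window) memory and O(window log window) per query; forward decay,
    t-digest, and HdrHistogram get close to the same answers far cheaper.
    """
    def __init__(self, window: int = 10_000):
        self.samples = deque(maxlen=window)

    def record(self, response_time_ms: float) -> None:
        self.samples.append(response_time_ms)

    def percentile(self, p: float) -> float:
        if not self.samples:
            raise ValueError("no samples recorded yet")
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]
```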

Approaches for coping with Load

Scaling Up - Vertical Scaling - moving to a more powerful machine
Scaling Out - Horizontal Scaling - distributing the load across multiple smaller machines - aka a Shared-Nothing Architecture

Usually it is a pragmatic mixture of both approaches.

Elastic Systems -> Automatically add computing resources when they detect a load increase; non-elastic systems are scaled manually.
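
A sketch of the control loop an elastic system might run - the thresholds, metric source, and provisioning hooks are all hypothetical placeholders:

```python
import time

def autoscale_loop(get_load, add_node, remove_node,
                   scale_up_at=0.8, scale_down_at=0.3, interval_s=60):
    """Hypothetical elastic-scaling loop.

    `get_load` returns current utilisation in [0, 1]; `add_node` and
    `remove_node` would call whatever provisioning API the platform exposes.
    """
    while True:
        load = get_load()
        if load > scale_up_at:        # load increase detected -> scale out
            add_node()
        elif load < scale_down_at:    # load dropped -> release resources
            remove_node()
        time.sleep(interval_s)
```

Manually scaled systems are simpler and have fewer operational surprises; elasticity helps most when load is highly unpredictable.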

Common Wisdom - keep your database on a single node (scale up) until scaling cost or high-availability requirements force you to make it distributed.