When monitoring a distributed system, we usually rely on percentiles. For example, P99 stands for the 99th percentile: the value below which 99% of the measurements fall, leaving out the worst 1%.
As a concrete example, if a service's P99 latency is 100 ms, then 99% of its responses took 100 ms or less.
Computing a percentile is normally expensive, because we have to collect the samples (say, 100 of them), sort them, and pick the 99th one.
For monitoring, we usually track P50, P99 and P99.9.
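To make the sort-and-pick calculation concrete, here is a minimal Python sketch using the nearest-rank method. The function name `percentile` and the sample data are assumptions made for illustration; real monitoring systems often use approximate algorithms instead of sorting every sample.

```python
import math
from fractions import Fraction


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)  # the expensive step: O(n log n)
    # Fraction(str(p)) keeps values such as 99.9 exact, so the rank
    # is not pushed up by floating-point rounding.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)  # 1-based
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1 ms, 2 ms, ..., 1000 ms.
latencies_ms = list(range(1, 1001))

p50 = percentile(latencies_ms, 50)     # 500
p99 = percentile(latencies_ms, 99)     # 990
p999 = percentile(latencies_ms, 99.9)  # 999
```

With these samples, P99 = 990 means that 99% of the requests completed in 990 ms or less.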
Here is a good example by Elastic that helps to understand the concept, and another one for going deeper.
In the world of distributed systems, single point of failure (SPOF) is a key term that you should always keep in mind.
It means that if one part of the system fails, the whole system goes down. For example, if Service A sends messages to Service B through a single instance of a message queue and that queue fails, the communication between Service A and Service B is completely lost. The message queue is then a SPOF of the system.
The key to removing a SPOF is “redundancy”. Here is a very good document by Oracle that explains the point.
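As a rough illustration of redundancy, the sketch below simulates the message queue example with two queue instances instead of one. Everything here (`InMemoryQueue`, `RedundantPublisher`, `QueueDown`, the queue names) is hypothetical and in-memory only; it is not the API of any real message broker.

```python
class QueueDown(Exception):
    """Raised when a (simulated) queue instance is unavailable."""


class InMemoryQueue:
    """Stand-in for one message queue instance."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.messages = []

    def publish(self, message):
        if not self.up:
            raise QueueDown(self.name)
        self.messages.append(message)


class RedundantPublisher:
    """Tries each queue instance in order until one accepts the message."""

    def __init__(self, queues):
        self.queues = list(queues)

    def publish(self, message):
        for queue in self.queues:
            try:
                queue.publish(message)
                return queue.name  # name of the instance that took the message
            except QueueDown:
                continue  # this instance failed, fall back to the next one
        raise QueueDown("all queue instances are down")


primary = InMemoryQueue("queue-1")
standby = InMemoryQueue("queue-2")
publisher = RedundantPublisher([primary, standby])

first = publisher.publish("order-created")  # accepted by queue-1
primary.up = False                          # simulate a failure of queue-1
second = publisher.publish("order-paid")    # still accepted, by queue-2
```

With a single queue, the failure of `queue-1` would cut Service A off from Service B. With a second instance, messages keep flowing as long as at least one instance is up.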
System “Reliability” is explained by Amazon.