AWS Basic – Network

Here are the AWS networking concepts that are fundamental to cloud computing.

Region: (e.g. us-east-1)

AWS has the concept of a Region, which is a physical location around the world where AWS clusters data centers.

Each AWS Region is designed to be isolated from the other AWS Regions. This design achieves the greatest possible fault tolerance and stability.

VPC:

The Amazon Virtual Private Cloud (Amazon VPC) service lets you provision a private, isolated section of the AWS Cloud where you can launch AWS services and other resources in a virtual network that you define.

Availability Zone: (e.g. us-east-1a)

An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZs give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center.

Subnet:

A subnet is a range of IP addresses inside a VPC, and each subnet lives in a single Availability Zone. Use separate subnets for distinct routing requirements: AWS recommends public subnets for external-facing resources and private subnets for internal resources. The AWS Quick Start for VPC, for example, provisions one public subnet and one private subnet per Availability Zone by default.
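To make the "one public and one private subnet per Availability Zone" layout more concrete, here is a minimal sketch using Python's standard ipaddress module. The VPC range 10.0.0.0/16, the /24 subnet size, and the helper name plan_subnets are assumptions chosen for illustration; they are not values prescribed by AWS.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, new_prefix=24):
    """Carve one public and one private subnet per AZ out of a VPC range.

    Hypothetical helper: it only does the address arithmetic and does not
    call any AWS API.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    # Generator of consecutive, non-overlapping blocks inside the VPC range.
    blocks = vpc.subnets(new_prefix=new_prefix)
    plan = []
    for az in azs:
        plan.append({
            "az": az,
            "public": str(next(blocks)),
            "private": str(next(blocks)),
        })
    return plan


plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for entry in plan:
    print(entry)
```

With these assumed inputs, us-east-1a gets 10.0.0.0/24 (public) and 10.0.1.0/24 (private), and us-east-1b gets the next two blocks. Whether a subnet is actually public or private depends on its route table, not on its address range.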

Internet Gateway:

An internet gateway is an access point through which your resources can access the internet and be accessed from the internet.

NAT Gateway:

A NAT gateway lets resources in private subnets open outgoing connections to the internet, while connections initiated from the internet cannot reach those resources through it.

Route 53:

Amazon Route 53 is the managed DNS (Domain Name System) service available for your AWS resources.

Percentile – Monitoring

When we monitor a distributed system, we usually use percentiles. For example, P99 means the 99th percentile: we measure performance up to the 99% mark and exclude the slowest 1% of measurements.

As a concrete example, if we say a service's latency is P99 = 100ms, that means 99% of its responses take 100ms or less.

Calculating a percentile is normally expensive, because we have to keep the samples, sort them, and pick the value at the right rank: for example, take 100 samples, order them, and find the 99th one.

For monitoring, we usually take P50, P99 and P99.9.
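The sort-and-pick calculation described above can be sketched in a few lines of Python. This uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring tools often use interpolation or approximate algorithms instead. The function name and the sample latencies are made up for illustration.

```python
import math
from fractions import Fraction


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)  # the expensive part: O(n log n)
    # Fraction(str(p)) avoids floating-point surprises for values like 99.9.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[rank - 1]  # ranks are 1-based, list indexes are 0-based


# 100 hypothetical latency samples in milliseconds: mostly fast, two slow.
latencies_ms = [10] * 98 + [100, 2000]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)
```

Here P99 picks the 99th of the 100 ordered samples, so the single 2000ms outlier in the last 1% is excluded, while P50 shows what a typical request looks like.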

Here is a good example by Elastic that helps to understand the concept, and another one for going deeper.

The links:

https://www.elastic.co/blog/averages-can-dangerous-use-percentile

https://blog.bramp.net/post/2018/01/16/measuring-percentile-latency/

Single point of failure (SPOF)

In the distributed systems world, single point of failure is a key term that you should always be aware of.

It means that if one part of the system fails, the whole system goes down. For example, if Service A sends messages to Service B via a single instance of a message queue and that queue fails, the communication between Service A and Service B is completely lost. The message queue is then the SPOF of the system.

The key to removing a SPOF is “redundancy”; here is a very good document by Oracle that explains the point.

System “reliability” is explained by Amazon.
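To make the redundancy idea concrete for the message queue example, here is a minimal in-memory sketch. The names InMemoryQueue, QueueDown, and send_with_failover are hypothetical and do not belong to any real message queue library; a real setup would also need replication and health checks, which this sketch leaves out.

```python
class QueueDown(Exception):
    """Raised when a queue instance cannot accept messages."""


class InMemoryQueue:
    """Hypothetical stand-in for one message queue instance."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.messages = []

    def put(self, message):
        if not self.healthy:
            raise QueueDown(self.name)
        self.messages.append(message)


def send_with_failover(queues, message):
    """Try each redundant queue instance in order; return the one used."""
    for queue in queues:
        try:
            queue.put(message)
            return queue.name
        except QueueDown:
            continue  # this instance failed, try the next one
    raise QueueDown("all queue instances are down")


# Service A sends through two instances; the primary has failed.
primary = InMemoryQueue("primary", healthy=False)
standby = InMemoryQueue("standby")
used = send_with_failover([primary, standby], "order-42")
print(used, standby.messages)
```

With a single instance, the failed primary would stop all communication. With the standby in place, the message is still delivered, so the queue is no longer a single point of failure in this sketch.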

The links:

https://docs.oracle.com/cd/E19424-01/820-4806/fjdch/index.html

https://wa.aws.amazon.com/wat.pillar.reliability.en.html