This blog is inspired and based on the Alicloud WeChat article https://mp.weixin.qq.com/s/X1MtLk71LLZnsYZBVv-hHA
This year 2019-11.11, Alibaba Tmall double 11 sales event accomplished Turnover of 268.4 billion Chinese Yuan! The peak value of orders reached 544,000 units/second, and the data processing capacity per day reached 970PB! And all the system is based on Alibaba Cloud.
According to CTO Alibaba said in this blog, there were four technology secrets behind this.
- 3rd Generation of X-Dragon Architecture. An AWS nitro similar technology.
- OceanBase and PolarDB. Those are Alibaba’s self-made Databases.
- Calculation and storage are separated. The storage is on remote and could be easily for expansion
- RDMA(RemoteDirect Memory Access) in order to access the remote storage data quickly.
We could see in order to support and boost such large data requesting case, we should improve on the physical machine side and also database sides. Reading retrieving data quickly is the key.
Context: When we are dealing with a very large set of data or a data streaming, we could not put them all in the memory.
1.Bloom filter: Used to query if the item exists (Membership query). A Bloom filter is a bit array of m bits initialized to 0. To add an element, feed it to k hash functions to get k array position and set the bits at these positions to 1. To query an element, feed it to k hash functions to obtain k array positions. If any of the bits at these positions is 0, then the element is definitely not in the set. If the bits are all 1, then the element might be in the set.
2. HyperLogLog: used for estimating the number of distinct elements (Cardinality). HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the bit pattern observation that for a stream of randomly distributed numbers if there is a number x with the maximum of leading 0 bits k, the cardinality of the stream is very likely equal to 2^k.
3. Count-Min Sketch: used for querying single item count(Frequency).The basic data structure is a two-dimensional d , w array of counters with d pairwise independent hash functions h1 … hd of range w.
Link from https://dzone.com/articles/introduction-probabilistic-0