Pangu: In-depth analysis of AliCloud Feitian distributed storage system design

At the Cloud Community online training on October 19, Wu Yang from the Pangu team shared "Pangu: Feitian Distributed Storage System Practice". The talk covered three questions: What is Pangu? What problems does Pangu solve? And how does Pangu solve them? It focused on Pangu's distributed system architecture and design philosophy.

Review video: https://yq.aliyun.com/edu/lesson/play/392

This article is an edited writeup of that live session.

What is Pangu?

The chart above lists the current mainstream cloud computing vendors, and an interesting pattern emerges: although these vendors are all "rich second generation" players with deep pockets, every one of them builds its distributed storage on self-developed technology rather than on the familiar open-source distributed systems.

The Feitian Dream

The dream of the first generation of Feitian developers was to provide a wide range of computing and storage services on large numbers of cheap PC servers. Feitian consists of the following components: Kuafu, responsible for networking; Nuwa, responsible for coordination; Fuxi, responsible for scheduling; Pangu, responsible for storage; and Shennong, responsible for monitoring.

The figure above introduces Pangu, the underlying storage platform, which serves as the bridge between the layers above and below it. As a distributed storage system, Pangu provides two main types of interfaces: an Append-Only interface and a Random-Access interface.

What problem is Pangu used to solve?

No single machine's hardware or software is ever perfect; there is always a small probability of error. At the same time, the system must scale horizontally to manage a huge number of machines. Put these two facts together and failures become the norm.

At large scale, small-probability events are the norm:

4% annual disk failure rate, 1‰ (0.1%) daily machine downtime rate

RAID card crashes; battery/capacitor charge-discharge cycles silently switching write-back mode to write-through

Network partitions, switch packet loss, upgrade reboots, fiber damage, bandwidth dropping by 90%, routing errors between data centers

Rack power outages, power loss in an entire data center

NIC TCP checksum errors, disk data checksum errors

NTP time drift, kernel IO threads stuck in D state, dirty page cache that cannot be written back

Hotspots appearing constantly and shifting in an instant

Program defects causing resource leaks, creation of huge numbers of files, and access to dirty data

Misoperation: data deleted by mistake, the wrong disk pulled, test-machine environments not cleaned up before going online…

Problems and challenges faced by Pangu

As the figure above shows, as a unified storage layer Pangu must support block storage for virtual machines, object storage, table storage, file storage, offline big-data processing, big-data analytics, and many other businesses. The challenges it faces are enormous, and some of the requirements even contradict one another.

How does Pangu solve the problem?

Pangu made deliberate trade-offs in its system design. First, Pangu lets the cloud products above it interface with users, so that it can focus on building a stable, reliable distributed storage platform. High reliability and high availability cannot be compromised: data must remain strongly consistent, correct, reliable, and available in any situation. Pursuing low cost can sometimes threaten high availability, so the goal is high performance at reasonable cost, delivering cost-effective online storage. Finally, ease of use: servitization, lightweight user access, and operations that users never notice, supported by good monitoring, tooling, and documentation.

Pangu overall architecture

Pangu is divided into three parts: Client, Master, and ChunkServer. When a write is initiated, the Client asks the Master to create and open a file; the Master selects the locations of three replicas and returns them to the Client; the Client then writes the data to the ChunkServers at those locations. In other words, the Client drives the overall flow, the Master stores the metadata, and the ChunkServers store the data. A single point in the system is very vulnerable, so how is high availability ensured? Pangu's first step is to use Paxos: multiple Masters form a group that presents itself externally as a single server, achieving high availability. When the metadata outgrows a single machine's memory, horizontal scaling is needed: the MountTable partitions the directory tree into volumes, and distributing different volumes enables horizontal expansion of the Master.
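The write path described above can be sketched in a few lines. This is a minimal illustration under stated assumptions; all class and method names here are hypothetical stand-ins, not Pangu's real API.

```python
# Minimal sketch of a Pangu-style write path: Client asks Master for
# replica locations, then writes the data to each ChunkServer itself.
import random

class ChunkServer:
    """Stores the actual chunk data."""
    def __init__(self, cs_id):
        self.cs_id = cs_id
        self.chunks = {}

    def write(self, name, data):
        self.chunks[name] = data

class Master:
    """Stores file metadata and picks replica locations."""
    def __init__(self, chunkservers):
        self.chunkservers = chunkservers
        self.metadata = {}                  # file name -> replica locations

    def create_file(self, name, replicas=3):
        locations = random.sample(self.chunkservers, replicas)
        self.metadata[name] = locations
        return locations

class Client:
    """Drives the overall flow: ask the Master, then write each replica."""
    def __init__(self, master):
        self.master = master

    def write_file(self, name, data):
        locations = self.master.create_file(name)
        for cs in locations:                # push data to all three replicas
            cs.write(name, data)
        return [cs.cs_id for cs in locations]

servers = [ChunkServer(i) for i in range(4)]
client = Client(Master(servers))
written_to = client.write_file("/vol1/foo", b"hello")
print(len(written_to))  # the data lands on 3 distinct chunkservers
```

Note how the Master only hands out locations; the data itself never flows through it, which keeps the metadata path light.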

Highly Reliable Data

Pangu keeps three strongly consistent replicas located in different failure domains and automatically re-replicates data on failure. As shown in the figure above, a data center stores three copies of each piece of data across four racks. Suppose the diamond-shaped data has copies on RACK-1, RACK-3, and RACK-4, and RACK-1 suddenly loses power or its network. When the copy on RACK-1 is lost, Pangu uses efficient algorithms to copy the data from RACK-3 to RACK-2, keeping the data safe and reliable.
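The repair step for the figure's scenario can be sketched as follows. This is a simplified illustration (rack names follow the figure; the function names are hypothetical), not Pangu's actual placement algorithm.

```python
# Rack-aware re-replication sketch: after a rack fails, copy from any
# surviving replica into a rack that does not yet hold a copy.
def repair(replica_racks, failed_rack, all_racks):
    """Return the new placement plus the chosen source and target racks."""
    survivors = [r for r in replica_racks if r != failed_rack]
    target = next(r for r in all_racks if r not in replica_racks)
    source = survivors[0]          # any healthy copy can serve as the source
    return survivors + [target], source, target

all_racks = ["RACK-1", "RACK-2", "RACK-3", "RACK-4"]
placement = ["RACK-1", "RACK-3", "RACK-4"]   # the figure's diamond data
new_placement, src, dst = repair(placement, "RACK-1", all_racks)
print(src, "->", dst)        # RACK-3 -> RACK-2
print(new_placement)         # ['RACK-3', 'RACK-4', 'RACK-2']
```

The invariant being restored is simply "three copies, three distinct failure domains".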

Data Integrity Assurance

Pangu does two main things: end-to-end data validation and silent-error checking. With small probability, data stored in memory can change, and so can data stored on disk. Each piece of data is accompanied by a CRC, so that once written to disk, the data and its CRC can be checked against each other. A background process periodically scans; when it finds a mismatch between data and CRC, it concludes that the data has suffered a bit flip and overwrites it with a good copy from another replica.
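The check-and-repair cycle can be sketched with a toy store. This is an assumption-laden illustration: it uses `zlib.crc32` as the checksum and invented class names; Pangu's actual CRC and repair machinery may differ.

```python
# Sketch of CRC-protected storage with a background scrub that repairs
# silent corruption from a healthy replica.
import zlib

class CrcStore:
    def __init__(self):
        self.blocks = {}                       # key -> (data, crc)

    def write(self, key, data):
        self.blocks[key] = (bytearray(data), zlib.crc32(data))

    def read(self, key):
        data, crc = self.blocks[key]
        if zlib.crc32(bytes(data)) != crc:
            raise IOError(f"silent corruption detected in {key}")
        return bytes(data)

    def scrub(self, good_replica):
        """Background scan: repair any block whose data and CRC mismatch."""
        repaired = []
        for key, (data, crc) in self.blocks.items():
            if zlib.crc32(bytes(data)) != crc:
                self.write(key, good_replica.read(key))   # copy good data in
                repaired.append(key)
        return repaired

a, b = CrcStore(), CrcStore()
a.write("chunk-1", b"payload")
b.write("chunk-1", b"payload")
a.blocks["chunk-1"][0][0] ^= 0x01      # simulate a bit flip on one replica
repaired = a.scrub(good_replica=b)
print(repaired)                        # ['chunk-1']
```

The key point is that the CRC travels with the data end to end, so corruption introduced anywhere on the path is caught on read or during the scrub.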

Reasonable Cost

Pangu is optimized for reasonable cost. For example, a single offline cluster runs tens of thousands of machines and holds hundreds of petabytes of data. The Master of a single cluster was optimized to achieve 150K QPS for reads and 50K QPS for writes. Individual data nodes were optimized to push the limits of the software stack, yielding very low software overhead, combined with tiered storage. Finally, to achieve low cost, commodity PC servers and Erasure Coding are used.
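To see why erasure coding cuts cost relative to three replicas, here is the simplest possible code, a single XOR parity block. Production systems typically use Reed-Solomon with several parity blocks; this sketch only illustrates the overhead argument.

```python
# Simplest erasure code: k data blocks plus one XOR parity block.
# Any single lost block can be rebuilt from the k survivors.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: [x ^ y for x, y in zip(a, b)], blocks))

def encode(data_blocks):
    """Append a parity block computed as the XOR of all data blocks."""
    return data_blocks + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the block at lost_index from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

stripe = encode([b"aaaa", b"bbbb", b"cccc"])   # 3 data + 1 parity block
rebuilt = recover(stripe, 1)
print(rebuilt)   # b'bbbb'
```

Three-way replication stores 3x the data; this 3+1 scheme stores only about 1.33x while still surviving one failure, which is the cost argument in a nutshell.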

Self-service

Operations and maintenance are very important. Pangu achieves hot upgrades that applications never notice; O&M operations execute automatically according to configuration, without human intervention; environment standardization catches and corrects problems promptly; and problem diagnosis is self-healing. The structure is shown in the figure above: there is a centrally managed configuration library, and the Pangu control center pushes it to every Pangu component, automatically executing configuration changes and automatically re-aligning any configuration found to be incorrect. Standardized checking of the operating environment is essential for a large-scale distributed system.
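The push-and-align loop described above can be sketched as a simple drift check. Everything here, from the config keys to the `align` helper, is a hypothetical illustration of the idea, not Pangu's actual configuration system.

```python
# Sketch of configuration auto-alignment: compare each component's
# running config against the central library and correct any drift.
DESIRED = {"replica_count": 3, "scrub_interval_s": 3600}   # central library

class Component:
    def __init__(self, name, config):
        self.name, self.config = name, dict(config)

def align(component, desired=DESIRED):
    """Find keys that drifted from the library and push the fix."""
    drift = {k: v for k, v in desired.items()
             if component.config.get(k) != v}
    component.config.update(drift)
    return drift

node = Component("chunkserver-7", {"replica_count": 2})   # drifted config
fixed = align(node)
print(sorted(fixed))   # ['replica_count', 'scrub_interval_s']
```

In a real control center this loop would run continuously against every component, so misconfigurations are corrected before they cause incidents.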

Fault-Oriented Design

The core of distributed system is fault-oriented design:

Data safety is an article of faith: E2E checksums; background silent-error scanning; tolerance of system bugs, hardware failures, and O&M mistakes. In a large-scale system there are always problems of every kind, and when they compound they become very tricky.

Environment checks to eliminate hidden problems: disk partitioning; rack distribution; configuration errors; software errors; hardware errors.

Insensitivity to individual machine failures: data replication ensures safety; retrying on a different machine ensures reads and writes succeed; faulty machines are remembered and avoided.

Monitoring + self-healing: the Master runs self health checks and fails over; the Chunkserver detects faulty disks or machines and isolates them; the Client detects service conditions and switches Masters; the Client also self-checks and reports its status.
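The monitor-and-isolate behavior in the last point can be sketched minimally. Names like `Cluster` and `report` are illustrative inventions, not Pangu components.

```python
# Sketch of monitoring + self-healing: components report health, and
# unhealthy nodes are isolated so I/O is never routed to them.
class Cluster:
    def __init__(self, nodes):
        self.healthy = set(nodes)
        self.isolated = set()

    def report(self, node, ok):
        """Health-check callback: isolate any node reporting failure."""
        if not ok and node in self.healthy:
            self.healthy.discard(node)
            self.isolated.add(node)

    def targets_for_write(self):
        return sorted(self.healthy)    # never route writes to isolated nodes

c = Cluster(["cs-1", "cs-2", "cs-3"])
c.report("cs-2", ok=False)             # e.g. a faulty disk detected
print(c.targets_for_write())           # ['cs-1', 'cs-3']
```

The payoff is exactly the claim in the next sentence: failures are handled mechanically, so operators rarely need to intervene.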

The above design greatly reduces the pressure of operation and maintenance.

Master

The Master needs to solve three main classes of problems: large capacity, high efficiency, and stability. Large capacity means Federation-level expansion, with tightly packed memory allowing a single group to support 800 million files and 100K/s read/write OPS. High efficiency means optimal algorithms: hardware errors trigger fast replication to keep data safe, data traffic is planned dynamically for maximum throughput, and failure domains are adjusted dynamically to keep data highly available. Stability means Paxos for data consistency to prevent a single point of failure, multi-angle monitoring that automatically triggers failover, and multi-tenant isolation to prevent starvation. Pangu is a multi-tenant system: a cluster of 10,000 machines runs all kinds of applications that know nothing about each other, yet they share one Master. If one user floods the Master with requests, the whole cluster could stop serving everyone else. How is this prevented? Pangu applies multiple layers of isolation to solve the problem.
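One common form such isolation takes is per-tenant rate limiting on the metadata path. The token-bucket sketch below is a generic illustration of that idea, under my own hypothetical names and parameters; the source does not describe Pangu's actual isolation mechanism.

```python
# Per-tenant token-bucket throttling sketch: each user gets its own
# bucket, so one noisy tenant cannot starve the shared Master.
import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class MetaService:
    def __init__(self, per_user_qps=2, burst=2):
        self.per_user_qps, self.burst = per_user_qps, burst
        self.buckets = {}

    def handle(self, user):
        bucket = self.buckets.setdefault(
            user, TokenBucket(self.per_user_qps, self.burst))
        return "ok" if bucket.allow() else "throttled"

m = MetaService()
results = [m.handle("noisy") for _ in range(5)] + [m.handle("quiet")]
print(results)   # the noisy tenant gets throttled; the quiet one is served
```

Because each tenant draws from its own bucket, the "noisy" user exhausting its budget has no effect on the "quiet" user's first request.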

Chunkserver

The Chunkserver faces the following trade-offs: flash is expensive but delivers high IOPS; mechanical disks are cheap but deliver low IOPS; and a memory-only write buffer loses data on power failure. If the whole cluster loses power, data still sitting in memory is lost before it is written out, and losing all three replicas is unacceptable for cloud computing. How can flash and mechanical disks be combined to solve this at the lowest cost? Some solutions use a UPS, but a UPS has its own reliability problems, and data can still be lost. The final solution is a small amount of cache in front of a large number of mechanical disks: data is written to the cache in the foreground and dumped to the mechanical disks in the background.
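The foreground/background split can be sketched in a few lines. This is a schematic illustration with invented names; in the real design the write cache is a persistent fast medium, not a Python deque.

```python
# Sketch of the hybrid design: writes are acknowledged from a small
# fast cache in the foreground, then dumped to mechanical disks by a
# background job.
from collections import deque

class HybridStore:
    def __init__(self):
        self.cache = deque()          # small, fast, persistent write cache
        self.disk = {}                # large pool of cheap mechanical disks

    def write(self, key, data):
        self.cache.append((key, data))   # foreground: fast ack to the client
        return "ack"

    def flush(self):
        """Background job: drain cached writes onto the mechanical disks."""
        while self.cache:
            key, data = self.cache.popleft()
            self.disk[key] = data

store = HybridStore()
store.write("c1", b"x")
store.write("c2", b"y")
store.flush()
print(sorted(store.disk))   # ['c1', 'c2']
```

The client sees cache-speed latency while the system pays only mechanical-disk prices for capacity, which is exactly the cost/performance compromise described above.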

Client

The Client faces many problems. Among today's many programming languages, coroutines have become very popular. In traditional multithreaded programming, switching among many threads on a multicore system is very expensive, which high-performance programs cannot tolerate. One alternative is asynchronous programming, which uses only a few threads and avoids thread switches. But how can we get the convenience of synchronous programming with the performance of asynchronous programming? Coroutines are the answer. Many modern languages provide coroutines, but C++ does not, so Pangu implemented its own coroutines to achieve high performance. The Client must serve users who need extreme performance, users who need ease of programming, and a huge number of existing programs that must be supported seamlessly. The solution is thread synchronization primitives that support both coroutine and non-coroutine users. Within a coroutine thread there is no thread switching, so all tasks execute on a single thread, and if any task performs a blocking operation, the throughput of the whole thread drops.
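The coroutine idea, synchronous-looking code with asynchronous performance, can be demonstrated with Python's asyncio standing in for Pangu's own C++ coroutine library (which the source does not detail).

```python
# Coroutine illustration: each task reads like blocking, synchronous
# code, yet 100 tasks share one thread with no thread switching.
import asyncio

async def write_chunk(chunk_id):
    # Looks like a plain blocking call, but 'await' yields this single
    # thread to other tasks instead of stalling it.
    await asyncio.sleep(0.01)          # stands in for network/disk I/O
    return f"chunk-{chunk_id} written"

async def main():
    # 100 "synchronous-looking" writes run concurrently on one thread.
    return await asyncio.gather(*(write_chunk(i) for i in range(100)))

results = asyncio.run(main())
print(len(results))   # 100 tasks completed without any extra threads
```

This also makes the closing caveat concrete: if any task called a truly blocking function (say, `time.sleep`) instead of awaiting, the single shared thread would stall and the throughput of every task on it would drop.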
