On December 17, 2019, Polymarket CMO Daniele Antoniani was invited to attend the Conference on Principles of Distributed Systems. The conference brings together participants from industry, academia, and research institutions, and attendees praised the talks as substantive and informative.

The Conference on Principles of Distributed Systems was held in Neuchâtel, Switzerland. It is an open forum for exchanging the latest knowledge about distributed computing and distributed computer systems.

Among the presentations, "Distributed System Model and Analysis of Key Issues" by Polymarket CMO Daniele Antoniani drew particular attention. This article distills the essence of his talk.

1. In the data age, the importance of distributed systems is increasingly prominent

How much data will the Internet generate in one minute?

In a single minute, YouTube generates 1.3 million videos, Google processes 2 million search queries, and Facebook records 6 million page views.

According to a report by the consulting firm Gartner, enterprise Internet data grows by about 50% every year. The same report predicts that by 2020, global data volume will reach on the order of 35 ZB. Holding 35 ZB would take roughly 8 billion 4 TB hard drives.
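As a quick sanity check of that estimate (taking 1 ZB = 10^21 bytes and 1 TB = 10^12 bytes, decimal units):

\[
\frac{35\ \text{ZB}}{4\ \text{TB}} \;=\; \frac{35 \times 10^{21}\ \text{bytes}}{4 \times 10^{12}\ \text{bytes}} \;\approx\; 8.75 \times 10^{9}\ \text{drives},
\]

on the order of 8 to 9 billion drives, consistent with the figure above.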

With the advent of the 5G era, the amount of data will grow even faster. It is therefore necessary to connect isolated storage devices over the network into large distributed storage systems. Generally speaking, distributed storage refers to a persistent, organized storage system spread across many networked machines. Such systems usually fall into two categories: centralized distributed storage systems and decentralized P2P storage systems.

A centralized distributed storage system stores data on distributed, networked nodes and ultimately presents them as a single integrated namespace. Such a file system has a central control node and is organized overall as a tree;

the other type is the decentralized, or P2P, file system. It has no central control node and is organized as a mesh.

IPFS is a typical P2P file system. Like the traditional seven-layer HTTP network stack, it is organized in layers: its lower layers include a network layer, a routing layer, and an exchange layer, on top of which sits a content-addressed storage structure that makes it possible to search by content rather than by location. At the core of IPFS's incentive design is a consensus protocol relying on proofs of storage capacity, data possession, and retrievability, which complements the scalability of blockchain systems.
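To make "searching by content" concrete, here is a minimal Python sketch of content addressing, the idea at the heart of IPFS: an object's identifier is derived from a hash of its bytes, so any node holding the same bytes can serve the same identifier. This is an illustration only; the real IPFS CID format adds multihash and codec prefixes.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Derive an identifier from the content itself (illustrative, not a real CID)."""
    return hashlib.sha256(data).hexdigest()

# A toy content-addressed store: the key IS the hash of the value.
store: dict[str, bytes] = {}

def put(data: bytes) -> str:
    cid = content_id(data)
    store[cid] = data
    return cid

def get(cid: str) -> bytes:
    data = store[cid]
    # Integrity comes for free: re-hashing the bytes must reproduce the key.
    assert content_id(data) == cid, "corrupted or tampered content"
    return data

cid = put(b"hello, distributed storage")
print(cid, "->", get(cid))
```

A side effect of this design is automatic deduplication: identical content always hashes to the same identifier, no matter who stores it.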

2. The development history of distributed storage systems

The development of distributed storage dates back to 1983. Throughout the evolution of distributed file systems, the centralized and the decentralized P2P lines have repeatedly crossed: in one period centralized file systems hold the advantage, and in the next, decentralized peer-to-peer file systems dominate. Representative distributed storage projects from each stage are listed below:

In 1983, AFS. The Andrew File System (AFS) was developed at Carnegie Mellon University. It distributes files across different nodes on the network and is notable for its cross-platform support and strong security.

In 1995, Zebra. A distributed file system for massively parallel systems, it uses partitioning and striping for strong reliability. This type of file system is mainly used for high-density computing tasks.

In 2000, Oceanstore. A peer-to-peer distributed file system designed for global deployment, with strong network penetration and interconnection across different sub-networks. However, it lacks the incentive mechanisms of today's blockchain systems: participants contribute storage space voluntarily, and the absence of incentives leaves the underlying infrastructure unstable and without guarantees.

In 2003, GFS. A centralized file system that relies on a master node to manage task scheduling and data distribution for the entire cluster.

In 2005, XrootD. This project constructs a global file system. Rather than implementing its own low-level storage details, it simply federates different file spaces: each machine keeps its own file system, and instead of a protocol that merges them, XrootD places proxy mappings over the different file systems, attaches them to different nodes, and strings them together into a unified namespace.

In 2006, HDFS. Its great advantage is that it runs on cheap commodity hardware while still providing strong reliability and fault tolerance.

In 2014, IPFS. This P2P file storage protocol emerged mainly for archival storage: data that has not been accessed for a year or longer can be offloaded to it, and files can still be stored and retrieved quickly.

With the development of Internet technology and hardware, distributed systems have never stopped emerging and evolving. As data volumes increase rapidly, the demands on the stability, scalability, and security of distributed systems keep rising.

3. Key issues of distributed storage systems

Current distributed storage systems are developing mainly in the direction of P2P file systems. Every P2P file system rests on one basic principle: a hash mapping is established between data and the nodes that store it. Node IDs and data keys live in the same hash space; nodes and data are organized into a topology (a ring or a tree), the nodes are connected into a network, and each piece of data is stored on the corresponding node. In the end, data is accessed by a content-derived path rather than by a machine address.
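As a minimal sketch of that principle, the following Python example places node IDs and data keys in the same hash space and assigns each key to the first node clockwise on a ring. This is consistent hashing, one common way to realize the mapping described above; the node names are hypothetical.

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Map a string into a shared 32-bit hash space."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class HashRing:
    """Consistent hashing: nodes and keys share one hash space on a ring."""

    def __init__(self, nodes: list[str]):
        # Sort node positions so we can binary-search for a key's successor.
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        i = bisect.bisect(self.ring, (h(key), ""))
        return self.ring[i % len(self.ring)][1]  # wrap around the ring

ring = HashRing(["node-A", "node-B", "node-C"])  # hypothetical node names
for k in ["photo.jpg", "report.pdf", "backup.tar"]:
    print(k, "->", ring.node_for(k))
```

A useful property of this scheme is that adding or removing one node only remaps the keys in that node's arc of the ring, rather than reshuffling everything.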

Distributed systems must weigh fault tolerance, scalability, security, stability, efficiency, and so on. When implementing a file system, we must answer several questions in particular:

1. How should the server be designed, stateful or stateless? This choice bears on the stability of the system;

2. The system must provide basic file semantics: how are files opened, and how are file locks handled?

3. Fault tolerance: how do we ensure data consistency and reliability?

4. The efficiency of file retrieval. The naive approach is to search the storage network node by node based on content, first reaching one node and then its neighbors, but this is far too inefficient. A high-speed routing scheme is needed to improve retrieval efficiency; a sketch of one such scheme follows this list.
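One widely used answer to the retrieval problem, and the one IPFS builds on, is a Kademlia-style DHT: instead of flooding neighbors, each hop moves toward nodes whose IDs are closer to the key under the XOR metric, so a lookup takes O(log N) hops instead of a network-wide walk. The Python sketch below uses a hypothetical peer table and a simplified global view to illustrate the distance metric and the greedy hop selection.

```python
import hashlib

def node_id(name: str) -> int:
    """160-bit ID in the same space as content keys (Kademlia-style)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: bitwise XOR of the two IDs."""
    return a ^ b

def lookup(key: int, start: str, peers: dict[str, list[str]]) -> str:
    """Greedy routing: hop to whichever known peer is XOR-closest to the key."""
    current = start
    while True:
        candidates = peers[current] + [current]
        best = min(candidates, key=lambda n: xor_distance(node_id(n), key))
        if best == current:          # no neighbor is closer: lookup terminates
            return current
        current = best

# Hypothetical peer table: each node knows only a few others.
peers = {
    "alpha": ["beta", "gamma"],
    "beta":  ["alpha", "delta"],
    "gamma": ["beta", "delta"],
    "delta": ["gamma", "alpha"],
}
key = node_id("report.pdf")
print("closest node:", lookup(key, "alpha", peers))
```

Because the XOR distance strictly decreases at every hop, the walk is guaranteed to terminate, and in a real Kademlia network the routing tables are structured so each hop halves the remaining distance.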

4. The outlook for distributed storage

Resource utilization at the hardware level centers on computing and storage. On the computing side, much has been said about edge computing, fog computing, and even the borderless computing proposed by Huawei.

Storage has analogous concepts. The most common today are cloud storage, centralized storage, and borderless storage. Borderless storage refers to storing data across all kinds of devices, data platforms, and storage systems. It is predicted that by 2020 there will be more than 20 billion smart IoT devices worldwide. This mass of smart devices means that enormous computing and storage resources will sit idle; managing and fully utilizing these idle resources will be both a challenge and a market of great potential.