
Intro

FoundationDB is significant because it successfully combines the flexibility of NoSQL with the power of ACID transactions. It offers a modular architecture in which each subsystem can be configured and scaled independently, which improves scalability and availability and strengthens fault tolerance. In addition, FoundationDB relies on a rigorous simulation testing framework that ensures stability and efficiency, allowing developers to introduce and release new features quickly. Its fast recovery mechanism further improves availability and simplifies software upgrades and configuration changes, which typically complete within a few seconds.

The main design principles are:

  1. Divide-and-Conquer (or separation of concerns). FDB decouples the transaction management system (write path) from the distributed storage (read path) and scales them independently. Within the transaction management system, processes are assigned various roles representing different aspects of transaction management. Furthermore, cluster-wide orchestrating tasks, such as overload control and load balancing are also divided and serviced by additional heterogeneous roles.
  2. Make failure a common case. For distributed systems, failure is a norm rather than an exception. To cope with failures in the transaction management system of FDB, we handle all failures through the recovery path: the transaction system proactively shuts down when it detects a failure. Thus, all failure handling is reduced to a single recovery operation, which becomes a common and well-tested code path. To improve availability, FDB strives to minimize Mean-Time-To-Recovery (MTTR). In our production clusters, the total time is usually less than five seconds.
  3. Simulation testing. FDB relies on a randomized, deterministic simulation framework for testing the correctness of its distributed database. Simulation tests not only expose deep bugs, but also boost developer productivity and the code quality of FDB.

Architecture

[Figure: FoundationDB architecture, showing the control plane and the data plane]

  • The control plane is responsible for persisting critical system metadata, that is, the configuration of transaction systems, on Coordinators.

    • These Coordinators form a Paxos group and elect a ClusterController.

    • The ClusterController monitors all servers in the cluster and recruits three processes, Sequencer, DataDistributor, and Ratekeeper, which are re-recruited if they fail or crash.

    • The DataDistributor is responsible for monitoring failures and balancing data among StorageServers.

    • Ratekeeper provides overload protection for the cluster.

  • The data plane is responsible for transaction processing and data storage. FDB chooses an unbundled architecture:

    • A distributed transaction management system (TS) consists of a Sequencer, Proxies, and Resolvers, all of which are stateless processes.

      • The Sequencer assigns a read and a commit version to each transaction.

      • Proxies offer MVCC read versions to clients and orchestrate transaction commits.

      • Resolvers check for conflicts among transactions.

    • A log system (LS) stores Write-Ahead-Log (WAL) for TS, and a separate distributed storage system (SS) is used for storing data and servicing reads. The LS contains a set of LogServers and the SS has a number of StorageServers. LogServers act as replicated, sharded, distributed persistent queues, each queue storing WAL data for a StorageServer.

Clients read from sharded StorageServers, so reads scale linearly with the number of StorageServers.

Writes are scaled by adding more Proxies, Resolvers, and LogServers.

The control plane’s singleton processes (e.g., ClusterController and Sequencer) and Coordinators are not performance bottlenecks: they perform only limited, simple metadata operations, and data reads and writes do not go through them, scaling in parallel as described in the two preceding statements.

Bootstrapping

FDB has no dependency on external coordination services. All user data and most system metadata (keys that start with the 0xFF prefix) are stored in StorageServers. The metadata about StorageServers is persisted in LogServers, and the LogServers configuration data is stored on all Coordinators.

  1. The Coordinators are a disk Paxos group; servers attempt to become the ClusterController if one does not exist.
  2. A newly elected ClusterController reads the old LS configuration from the Coordinators and spawns a new TS and LS.
  3. Proxies recover system metadata from the old LS, including information about all StorageServers.
  4. The Sequencer waits until the new TS finishes recovery, then writes the new LS configuration to all Coordinators. The new transaction system is then ready to accept client transactions.

Reconfiguration

The Sequencer process monitors the health of Proxies, Resolvers, and LogServers. Whenever there is a failure in the TS or LS, or the database configuration changes, the Sequencer terminates. The ClusterController detects the Sequencer failure, then recruits and bootstraps a new TS and LS. In this way, transaction processing is divided into epochs, where each epoch represents a generation of the transaction management system with its own Sequencer.

End-to-end transaction processing

  1. Transaction Start and Read Operations:

    • A client starts a transaction by contacting a Proxy to obtain a read version (timestamp).

    • The Proxy requests a read version from the Sequencer that is greater than all previously issued commit versions and sends it to the client.

    • The client then reads from StorageServers at this specific read version.

  2. Buffered Write Operations:

    • Client writes are buffered locally and not sent to the cluster immediately.

    • Read-your-write semantics are preserved by combining the database lookups with the client’s uncommitted writes.

  3. Transaction Commit:

    • When the client commits, it sends the transaction data (read and write sets) to a Proxy, waiting for either a commit or abort response.

    • The Proxy commits a transaction in three steps:

      1. Obtain Commit Version: The Proxy requests a commit version from the Sequencer that is larger than all current read or commit versions.

      2. Conflict Check: The Proxy sends transaction data to the partitioned Resolvers, which check for read-write conflicts. If no conflicts are found, the transaction proceeds; otherwise, it is aborted.

      3. Persist to Log Servers: The transaction is sent to LogServers for persistence, and after all LogServers acknowledge, the transaction is considered committed. The Proxy then reports the committed version to the Sequencer and sends the response back to the client.

  4. Applying Writes:

    • StorageServers continuously pull mutation logs from LogServers and apply the committed changes to disk.
  5. Read-Only Transactions and Snapshot Reads:

    • Read-only transactions are serializable (at the read version) and high-performance (thanks to MVCC), allowing the client to commit locally without contacting the database, which is particularly important since most transactions are read-only.

    • Snapshot reads relax the isolation property of a transaction, reducing conflicts by allowing concurrent writes without conflicting with snapshot reads.

FoundationDB (FDB) achieves Serializable Snapshot Isolation (SSI) by combining Optimistic Concurrency Control (OCC) with Multi-Version Concurrency Control (MVCC).
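To make the flow above concrete, here is a minimal Python sketch of the client-side behavior (read version, locally buffered writes, read-your-writes, and the commit payload). The `Transaction` class and its method names are illustrative assumptions, not FDB's client API.

```python
# Illustrative sketch of the client-side transaction flow described above
# (buffered writes + read-your-writes); class and method names are assumptions.

class Transaction:
    def __init__(self, storage, read_version):
        self.storage = storage            # committed store: key -> list of (version, value), ascending
        self.read_version = read_version  # snapshot version obtained from the Sequencer via a Proxy
        self.writes = {}                  # locally buffered, uncommitted writes
        self.read_set = set()             # keys read, shipped to Resolvers at commit time

    def get(self, key):
        if key in self.writes:            # read-your-writes: check the local buffer first
            return self.writes[key]
        self.read_set.add(key)
        # Return the newest committed value at or below the read version.
        for version, value in reversed(self.storage.get(key, [])):
            if version <= self.read_version:
                return value
        return None

    def set(self, key, value):
        self.writes[key] = value          # buffered locally; nothing is sent to the cluster yet

    def commit_payload(self):
        # At commit time the client ships its read set and write set to a Proxy.
        return {"read_version": self.read_version,
                "read_set": self.read_set,
                "write_set": self.writes}
```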

Transaction Versions

  • Each transaction receives a read version and a commit version from the Sequencer.
  • The read version ensures that the transaction observes the results of all previously committed transactions, and the commit version is greater than all current read or commit versions, establishing a serial order for transactions.

Log Sequence Number (LSN)

  • The commit version serves as the LSN, defining a serial history of transactions.
  • To ensure no gaps between LSNs, the Sequencer also returns the previous LSN with each commit. Both the LSN and previous LSN are sent to Resolvers and LogServers to enforce serial processing of transactions.

Conflict Detection

  • FDB uses a lock-free conflict detection algorithm similar to write-snapshot isolation, but the commit version is chosen before conflict detection, enabling efficient batch processing of version assignments and conflict detection.
  • The key space is divided among multiple Resolvers, allowing conflict detection to be parallelized. A transaction can commit only if all Resolvers confirm no conflicts.
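The following is a minimal sketch of the partitioned, lock-free conflict check just described. It tracks individual keys rather than key ranges and uses hashing in place of range partitioning, so it is only an illustration of the idea, not FDB's Resolver implementation.

```python
# Minimal sketch of partitioned, lock-free conflict detection (illustrative;
# real Resolvers track key *ranges* and expire history outside the MVCC window).

class Resolver:
    def __init__(self):
        self.history = {}  # key -> commit version of the most recent write

    def check(self, read_version, commit_version, read_set, write_set):
        # Abort if any key in the read set was written after this txn's read version.
        if any(self.history.get(k, 0) > read_version for k in read_set):
            return False
        for k in write_set:                 # record writes for future conflict checks
            self.history[k] = commit_version
        return True

def commit(resolvers, read_version, commit_version, read_set, write_set):
    # The Proxy sends each Resolver only the keys in its partition (hashing stands
    # in for range partitioning) and commits only if all of them report no conflict.
    results = []
    for i, r in enumerate(resolvers):
        mine = lambda keys: {k for k in keys if hash(k) % len(resolvers) == i}
        results.append(r.check(read_version, commit_version,
                               mine(read_set), mine(write_set)))
    return all(results)
```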

Handling Aborted Transactions

  • If a transaction is aborted, some Resolvers may have already updated their history, leading to possible “false positive” conflicts for other transactions. However, this is rare because most transactions’ key ranges fall within one Resolver, and the effects of false positives are limited to a short MVCC window (5 seconds).

Efficiency of OCC

  • The OCC design avoids the complexity of acquiring and releasing locks, simplifying interactions between the Transaction System (TS) and Storage Servers (SS).
  • While OCC may result in some wasted work due to aborted transactions, FDB’s conflict rate in production is low (less than 1%), and clients can simply restart aborted transactions.

Logging protocol

Commit Logging:

  • Once a Proxy decides to commit a transaction, it sends the transaction’s changes (mutations) to the LogServers responsible for the modified key ranges. Other LogServers receive an empty message.
  • The log message includes the current and previous Log Sequence Number (LSN) from the Sequencer and the largest known committed version (KCV) of the Proxy.
  • The LogServers reply to the Proxy once the log data is durably stored. The Proxy updates its KCV if all replica LogServers acknowledge and the LSN is larger than the current KCV.
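A rough sketch of this commit-logging step is shown below, assuming a toy `LogServer` whose `append` acknowledges once data is durable; the message shape is illustrative, not FDB's wire format.

```python
# Sketch of the commit-logging step described above; the LogServer class and
# its methods are assumptions made for illustration, not FDB's interfaces.

class LogServer:
    def __init__(self, shard):
        self.shard = shard
        self.log = []                       # durable, ordered log entries

    def append(self, lsn, prev_lsn, kcv, mutations):
        self.log.append((lsn, prev_lsn, kcv, mutations))
        return True                         # ack once durably stored

def log_commit(proxy_kcv, lsn, prev_lsn, mutations, log_servers):
    acks = []
    for ls in log_servers:
        # LogServers responsible for modified shards receive the mutations;
        # all others receive an empty message, so every LogServer sees every LSN.
        body = [m for (shard, m) in mutations if shard == ls.shard]
        acks.append(ls.append(lsn, prev_lsn, proxy_kcv, body))
    # The Proxy advances its known committed version only after all replicas ack.
    if all(acks) and lsn > proxy_kcv:
        proxy_kcv = lsn
    return proxy_kcv

servers = [LogServer("a"), LogServer("b")]
kcv = log_commit(proxy_kcv=0, lsn=1, prev_lsn=0,
                 mutations=[("a", ("set", "k1", "v1"))], log_servers=servers)
```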

Shipping Redo Logs:

  • Shipping the redo log from LogServers to StorageServers happens in the background and is not part of the commit path, improving performance.

Applying Redo Logs:

  • StorageServers apply non-durable redo logs from LogServers to an in-memory index. In most cases, this happens before any client reads are processed, ensuring low-latency multi-version reads.
  • If the requested data is not yet available on a StorageServer, the client either waits or retries at another replica. If both reads time out, the client can restart the transaction.

I/O Efficiency:

  • Since log data is already durable on LogServers, StorageServers can buffer updates in memory and write batches to disk periodically, improving input/output (I/O) efficiency.

What if a StorageServer is lagging behind on applying the redo logs and a client requests a version of a key pair it does not have?

  1. Wait, up to a timeout threshold, for the StorageServer's known-committed-version to become greater than or equal to the read version
  2. If timeout, the client asks another StorageServer that stores the key
  3. Return error “request for a future version” (FDB error code 1009)
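A sketch of this client-side fallback logic follows. The replica objects and their `known_committed_version`/`get` methods are assumptions for illustration; error code 1009 is the FDB error named above.

```python
# Sketch of the read path when a StorageServer lags behind (helper names assumed).

import time

FUTURE_VERSION = 1009  # FDB error "request for a future version"

def read(replicas, key, read_version, wait_s=1.0, poll_s=0.05):
    for server in replicas:                       # try each replica storing the key
        deadline = time.monotonic() + wait_s
        while time.monotonic() < deadline:
            if server.known_committed_version() >= read_version:
                return server.get(key, read_version)
            time.sleep(poll_s)                    # wait for the replica to catch up
    # Every replica timed out: surface "request for a future version" so the
    # caller can restart the transaction with a fresh read version.
    raise RuntimeError(f"error {FUTURE_VERSION}: request for a future version")
```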

What if there are no further transaction logs to redo?

  • Without new transactions issued from the client, proxies still generate empty transactions to advance the known-committed-version
  • The known-committed-version and LSN of each transaction are sent to all LogServers (this limits write scalability)

Transaction system recovery

Simplified Recovery

  • Unlike traditional databases that require undo log processing, FoundationDB avoids this step by making the redo log processing the same as the normal log forward path. StorageServers pull logs from LogServers and apply them in the background.

Failure Detection and New Transaction System (TS)

  • Upon detecting a failure, a new TS is recruited. The new TS can start accepting transactions even before all old logs are fully processed. Recovery focuses on finding the end of the redo log, allowing StorageServers to asynchronously replay the logs from that point.

Epoch-based Recovery

  • The recovery process is handled per epoch. The ClusterController locks the old TS configuration, stops old LogServers from accepting new transactions, recruits a new set of transaction components (Sequencer, Proxies, Resolvers, and LogServers), and writes the new TS configuration to the Coordinators.
  • Stateless components like Proxies and Resolvers don’t require special recovery, but LogServers, which store committed transaction logs, must ensure all data is durable and retrievable by StorageServers.

Recovery Version (RV)

  • The recovery focuses on determining the Recovery Version (RV), which is essentially the end of the redo log. The Sequencer collects data from the old LogServers, specifically the Durable Version (DV) (maximum LSN persisted) and KCV (maximum committed version) from each.
  • Once enough LogServers have responded, the Previous Epoch Version (PEV) is established (the maximum of all KCVs). The start version of the new epoch is PEV + 1, and the minimum DV becomes the RV.
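A small sketch of this version arithmetic, assuming each old-generation LogServer reports a (KCV, DV) pair:

```python
# Sketch of how the new Sequencer derives PEV and RV from old LogServer responses
# (the response format is an assumption for illustration).

def recovery_versions(responses):
    """responses: list of (kcv, dv) pairs from enough old LogServers."""
    pev = max(kcv for kcv, _ in responses)   # Previous Epoch Version = max KCV
    rv = min(dv for _, dv in responses)      # Recovery Version = min Durable Version
    return pev, rv

pev, rv = recovery_versions([(100, 107), (102, 105), (101, 106)])
# New epoch starts at PEV + 1 = 103; redo logs in [PEV + 1, RV] = [103, 105] are
# copied to the new LogServers, and anything beyond RV is rolled back.
```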

Log Copying and Healing

  • Logs between PEV + 1 and RV are copied from old LogServers to the new ones to restore replication in case of LogServer failures. This copying process is lightweight since it only covers a few seconds of logs.

Rollback and Transaction Processing

  • The first transaction after recovery is a special recovery transaction that informs StorageServers of the RV, so they can discard in-memory multi-versioned data beyond the RV. StorageServers then pull data larger than the PEV from the new LogServers.
  • The rollback process simply discards in-memory multi-versioned data, as persistent data is only written to disk once it leaves the MVCC window.

Replication

  1. Metadata Replication:

    • System metadata related to the control plane is stored on Coordinators using the Active Disk Paxos protocol. As long as a majority (quorum) of Coordinators are operational, the metadata can be recovered in case of failure.
  2. Log Replication:

    • When a Proxy writes logs to LogServers, each log record is replicated synchronously across k = f + 1 LogServers (where f is the number of allowed failures). The Proxy only sends a commit response to the client after all k LogServers have successfully persisted the log. If a LogServer fails, a transaction system recovery is triggered.
  3. Storage Replication:

    • Each key range (shard) is asynchronously replicated across k = f + 1 StorageServers. These StorageServers form a team. A StorageServer typically hosts multiple shards, distributing its data across several teams. If a StorageServer fails, the DataDistributor moves the data from teams with the failed server to other healthy teams.

To prevent data loss in case of simultaneous failures, FoundationDB ensures that no more than one process in a replica group is placed within the same fault domain (e.g., a host, rack, or availability zone). As long as one process in each team is operational, no data is lost, provided at least one fault domain remains available.
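As a sketch of this placement constraint, the helper below builds a replica team only from servers in distinct fault domains; the brute-force team search is a simplification, not the DataDistributor's actual policy.

```python
# Sketch of fault-domain-aware team selection (simplified assumption).

import itertools

def build_team(servers, k):
    """servers: list of (server_id, fault_domain); returns k servers in distinct domains."""
    for team in itertools.combinations(servers, k):
        domains = {domain for _, domain in team}
        if len(domains) == k:          # every replica lives in a different fault domain
            return [sid for sid, _ in team]
    raise ValueError("not enough distinct fault domains for the requested replication")

# k = f + 1 = 3 replicas spread across racks r1, r2, r3.
team = build_team([("s1", "r1"), ("s2", "r1"), ("s3", "r2"), ("s4", "r3")], k=3)
```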

Simulation testing

  1. Deterministic Simulation:

    • FoundationDB uses deterministic discrete-event simulation to test its distributed system. This simulation runs real database code along with randomized synthetic workloads and fault injection to uncover bugs.

    • Determinism ensures that bugs are reproducible and can be investigated thoroughly.

  2. Fault Injection:

    • The simulation tests system resilience by injecting various faults, such as machine, rack, or data center failures, network issues, disk corruption, and delays.

    • Randomization of these faults increases the diversity of tested states, allowing for a wide range of potential issues to be examined.

    • “Buggification” is a technique used to deliberately introduce rare or unusual behaviors (e.g., unnecessary delays, errors) in the system to stress-test its handling of non-standard conditions.

  3. Swarm Testing:

    • Swarm testing increases simulation diversity by using random cluster sizes, configurations, workloads, and fault injection parameters.

    • This ensures that a broad range of scenarios is covered in testing, allowing for the discovery of rare bugs.

  4. Test Oracles:

    • Test oracles are built into the system to verify key properties like transaction atomicity, isolation, and recoverability. Assertions check these properties to detect failures during simulation.

    • They help confirm that the system’s expected behaviors are maintained, even under stressful conditions.

  5. Bug Detection Efficiency:

    • The simulation runs faster than real-time, allowing FoundationDB to quickly discover and trace bugs. The parallel nature of testing accelerates the process of finding bugs, particularly before major releases.

    • This approach uncovers bugs that may not appear during real-time testing, especially for issues that require long-running operations.

  6. Limitations:

    • Simulation cannot reliably detect performance issues (like imperfect load balancing).

    • It cannot test third-party libraries or external dependencies, focusing mainly on FoundationDB’s internal code and behaviors.
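The sketch below illustrates the core idea: all randomness (workload, fault injection, and buggified behavior) flows from a single seed, so a failing run can be replayed exactly. The `buggify` helper and the toy oracle are assumptions modeled on the description above, not FDB's simulator.

```python
# Minimal sketch of deterministic simulation with fault injection and an oracle.

import random

def buggify(rng, probability=0.05):
    """Occasionally enable a rare/unusual code path to stress-test error handling."""
    return rng.random() < probability

def simulate(seed, steps=1000):
    rng = random.Random(seed)              # all nondeterminism flows from this seed
    committed, storage = {}, {}
    for _ in range(steps):
        key, value = rng.randrange(100), rng.randrange(100)
        if buggify(rng):
            continue                       # injected fault: the request never commits
        committed[key] = value             # commit point...
        storage[key] = value               # ...and the corresponding storage apply
    # Test oracle: storage must reflect exactly the committed writes.
    assert storage == committed
    return committed

assert simulate(seed=42) == simulate(seed=42)   # determinism makes bugs reproducible
```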

Lessons learned

  1. Architecture Design

    • Divide-and-Conquer Principle: Separating the transaction system from the storage layer allows for independent scaling and deployment of resources, enhancing both flexibility and performance.

    • LogServers as Witness Replicas: In multi-region deployments, LogServers reduce the need for full StorageServer replicas while maintaining high availability.

    • Role Specialization: The design enables the creation of specialized roles, like separating DataDistributor and Ratekeeper from the Sequencer, and separating Proxies into Get-Read-Version and Commit Proxies, which improves performance and makes the system extensible.

    • Decoupling Enhances Extensibility: This design pattern allows features like replacing SQLite with RocksDB and adding new roles or functions without overhauling the entire system.

  2. Simulation Testing

    • High Productivity: FDB’s deterministic simulation testing enables bugs to be found and reproduced quickly. This approach has improved developer productivity and system reliability by reducing debugging time and improving test coverage.

    • Reliability: FDB has operated without any data corruption over several years of deployment (e.g., CloudKit), thanks to rigorous simulation testing. Simulation has allowed ambitious rewrites and improvements to be made safely.

    • Eliminating Dependencies: Simulation testing helped find bugs in external dependencies, leading to FDB replacing Apache Zookeeper with its own Paxos implementation. This change resulted in no further production bugs.

  3. Fast Recovery

    • Simplifies Upgrades: FDB allows fast recovery by restarting all processes simultaneously, typically within seconds, simplifying software upgrades and configuration changes. This method has been extensively tested and used in Apple’s production clusters.

    • Bug Healing: Fast recovery can automatically resolve certain latent bugs, similar to software rejuvenation, by resetting system states.

  4. 5-Second MVCC Window

    • Memory Efficiency: FDB uses a 5-second MVCC (Multi-Version Concurrency Control) window to limit memory usage in the transaction system and storage servers. This window is long enough for most OLTP workloads; a transaction that runs longer than 5 seconds usually signals an inefficiency.

    • TaskBucket Abstraction: Long-running processes, like backups, are broken into smaller transactions that fit within the 5-second window. FDB implements this through an abstraction called TaskBucket, which simplifies splitting large transactions into manageable jobs.
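A sketch of the splitting idea behind TaskBucket: a long job is chopped into chunks, each committed as its own transaction well inside the 5-second window. The `run_transaction` callback and the time-budget heuristic are assumptions for illustration, not the TaskBucket API.

```python
# Sketch of splitting a long-running job into transactions that each stay within
# the 5-second MVCC window (budget and batching scheme are assumptions).

import time

MVCC_WINDOW_S = 5.0

def process_range(items, run_transaction, budget_s=MVCC_WINDOW_S / 2):
    """Process `items` in chunks, committing each chunk before the budget is exceeded."""
    batch, start = [], time.monotonic()
    for item in items:
        batch.append(item)
        if time.monotonic() - start > budget_s:
            run_transaction(batch)         # commit this chunk as its own transaction
            batch, start = [], time.monotonic()
    if batch:
        run_transaction(batch)
```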

Questions

With FDB, what operations does a transaction commit perform when the transaction only reads the value of data items?

  • Read Version Retrieval: The client requests a read version from a Proxy, which obtains it from the Sequencer; the read version is guaranteed to be greater than or equal to any previously committed version.
  • Read Operation: The client reads the requested data at this specific read version from the StorageServers. The reads are served by the StorageServers, which are guaranteed to provide data consistent with the requested version.
  • No Writes or Conflicts: Since the transaction is read-only, there is no write set or conflicts to check. The transaction simply ends, and no data is written or modified, meaning it does not interact with LogServers or commit any changes.
  • Commit: Even though no actual commit occurs (because there’s no data change), the transaction is marked as successfully completed after the reads are done.

With FDB, is it possible for multiple resolvers to participate in the decision whether to commit or abort a write transaction?

Yes, multiple Resolvers can participate in the decision to commit or abort a write transaction in FDB. Here’s how it works:

  • Conflict Detection: When a transaction writes data, the write set (the keys it wants to write) is sent to a set of Resolvers. Each Resolver is responsible for a specific portion of the key space. Multiple Resolvers can be involved in checking the transaction’s read and write sets to detect conflicts (read-write conflicts or write-write conflicts).
  • Parallel Conflict Checking: Since the key space is partitioned, different Resolvers check different key ranges in parallel. A transaction can only commit if all Resolvers agree that there are no conflicts.

With FDB, what if a StorageServer is lagging behind on applying the redo logs and a client requests a version of a key pair it does not have?

  • Client Waits: The client can choose to wait for the StorageServer to catch up by applying the redo logs. Once the StorageServer finishes replaying the logs and reaches the required version, it can serve the requested data.
  • Retry at Another Replica: If the StorageServer does not have the requested version yet, the client can try to read from another replica of the key. FDB typically stores multiple replicas of data across different StorageServers, so the client can retry the request from a replica that is up to date.
  • Transaction Restart: If neither replica has the requested version or the delay is too long, the client may restart the transaction. Since FoundationDB uses MVCC (Multi-Version Concurrency Control), restarting the transaction allows it to obtain a fresh version of the key from an up-to-date StorageServer.

Consider a database for students enrolling in courses and professors teaching those courses. Provide an SDM model of this database.

Students: base concrete object class.

member property: student_id, name, age, email, department_id.

identifier: student_id.

Professors: base concrete object class.

member property: professor_id, name, age, email, department_id.

identifier: professor_id.

Courses: base concrete object class

member property: course_id, name, location, start_time, end_time, department_id.

derived member property: professor as Professors.professor_id.

identifier: course_id.

Enrollment: base duration event class.

member property: enrollment_id, date_of_enrollment.

member participant: student in Students, course in Courses.

identifier: enrollment_id.

Departments: abstract Students and Professors on common value of department_id.

derived member property: department_id as distinct value of (Students.department_id union Professors.department_id).

What is the difference between a monolithic database management system and a disaggregated database management system?

| Feature | Monolithic DBMS | Disaggregated DBMS |
| --- | --- | --- |
| Architecture | All components tightly integrated into a single system | Components like storage, computation, and query processing are separated |
| Scalability | Scales vertically (adding resources to the single server) | Scales horizontally (independent scaling of storage and compute) |
| Performance bottlenecks | May face bottlenecks as the system grows | Components are independently optimized, reducing bottlenecks |
| Resource management | Storage and compute resources are tightly coupled, hard to manage separately | Storage and compute resources can be managed independently, offering flexibility |
| Complexity | Easier to deploy and manage initially, but complexity increases with scale | More complex to manage and coordinate different components |
| Cost | Pay for all resources, even if they are not fully utilized | Can optimize resource usage and costs by scaling components independently |
| Consistency | Strong data consistency due to tight integration | Requires additional mechanisms to ensure consistency across components |

With Gamma and its data flow execution paradigm, how does the system know when the execution of a parallel query involving multiple operators is complete?

Data Dependency Graph: The query execution is modeled as a directed acyclic graph (DAG), where each node represents an operator (e.g., selection, join). Data flows between operators, and the system tracks the completion of each operator based on this graph.

Completion Signals: Each parallel operator sends a “done” signal once it finishes processing its data partition. The system monitors these signals to determine when all operators have finished.

Coordinator: A central coordinator tracks the progress of parallel tasks. When all tasks report completion, the system declares the query execution as complete.
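A minimal sketch of this completion bookkeeping, with assumed operator names and a toy coordinator:

```python
# Sketch of the "done signal" tracking described above (illustrative, not Gamma's code).

class Coordinator:
    def __init__(self, partitions_per_operator):
        # e.g. {"scan": 4, "join": 4} -> outstanding partitions per operator
        self.pending = dict(partitions_per_operator)

    def done(self, operator):
        self.pending[operator] -= 1        # one partition of this operator finished

    def query_complete(self):
        return all(count == 0 for count in self.pending.values())

coord = Coordinator({"scan": 2, "join": 2})
for op in ("scan", "scan", "join", "join"):
    coord.done(op)
assert coord.query_complete()
```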

Reference: https://sigmodrecord.org/publications/sigmodRecord/2203/pdfs/08_fdb-zhou.pdf

Multi-table relationships

  • One-to-many: put a foreign key on the "many" side that references the primary key of the "one" side, e.g., department-employee.
  • Many-to-many: create a third, junction table containing at least two foreign keys that reference the primary keys of the two sides, e.g., student-course.
  • One-to-one: used to split a single table, keeping the basic columns in one table and the detail columns in another to improve efficiency. Put a foreign key on either side referencing the other side's primary key, and make that foreign key UNIQUE.

[!NOTE]

When querying multiple tables, the invalid Cartesian product must be eliminated.

Join queries

Inner join

Equivalent to querying the rows in the intersection of the two tables.

  • Implicit inner join: SELECT column_list FROM table1, table2 WHERE condition;
  • Explicit inner join: SELECT column_list FROM table1 [INNER] JOIN table2 ON join_condition;

Outer joins

Left outer join: returns all rows of the left table, plus the rows in the intersection of the two tables.

SELECT column_list FROM table1 LEFT [OUTER] JOIN table2 ON condition;

Right outer join: returns all rows of the right table, plus the rows in the intersection of the two tables.

SELECT column_list FROM table1 RIGHT [OUTER] JOIN table2 ON condition;

Self join: a table joined with itself; table aliases are required. It can be either an inner or an outer join.

Union query: merges the results of multiple queries into a single result set. Without ALL, duplicates are removed. All queries being combined must have the same number of columns and compatible column types.

SELECT column_list FROM table1 UNION [ALL] SELECT column_list FROM table2;

Subqueries

A SELECT statement nested inside another SQL statement; the outer statement can be any of INSERT/UPDATE/DELETE/SELECT.

SELECT * FROM table_name WHERE col = (SELECT col FROM table2);

Kinds of subqueries

By result:

  • Scalar subquery: returns a single value.
  • Column subquery: returns a single column.
  • Row subquery: returns a single row (possibly with multiple columns).
  • Table subquery: returns multiple rows and columns.

By position: after WHERE, after FROM, after SELECT.

Intro

The core idea of the LSM-Tree (Log-Structured Merge Tree) is to turn large numbers of random writes into more efficient sequential writes. In short, it works as follows:

  1. Write to memory: new data is first stored in an in-memory buffer called the MemTable, an ordered structure in which entries are sorted by key.
  2. Batch write to disk: once the in-memory data reaches a threshold, the entire MemTable is written to disk in a single pass. This is a sequential write and therefore very efficient. Once written, the data becomes an immutable file called an SSTable (Sorted String Table).
  3. Merge and compact: over time, many SSTables accumulate on disk. To keep reads efficient, the system periodically merges and compacts them so the data stays sorted and redundancy is reduced.

By buffering frequent random writes in memory and flushing them to disk in sequential batches, the LSM-Tree greatly improves write performance. This approach suits write-intensive workloads while still keeping queries efficient.
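The toy implementation below sketches these three steps (memtable, flush to an immutable SSTable, merge/compaction); real engines add a WAL, Bloom filters, and leveled compaction policies, so this is only an illustration.

```python
# Minimal LSM-Tree write-path sketch of the three steps above (illustrative only).

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}                 # step 1: in-memory buffer, sorted on flush
        self.memtable_limit = memtable_limit
        self.sstables = []                 # newest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value         # buffer the write in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Step 2: write the whole memtable sequentially as an immutable SSTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Step 3: merge all SSTables, keeping only the newest version of each key.
        merged = {}
        for table in reversed(self.sstables):   # oldest first, newer overwrites older
            merged.update(table)
        self.sstables = [sorted(merged.items())]
```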

The basic structure of an LSM-Tree: how data moves from memory (the memtable) to disk and is stored through multiple levels of merging (compaction).

[Figure: LSM-Tree structure, from the memtable through leveled SSTables on disk]

  1. MemTable (in-memory table)

    • Writes first go into the in-memory memtable, usually an ordered data structure (such as a skip list or a B+ tree), so data stays sorted in memory for fast writes and lookups.

    • When the memtable fills up, or the system needs to persist data, its contents are flushed to disk to form an SSTable at the first on-disk level.

  2. Level-0 (the first level on disk)

    • Data flushed from memory is stored in Level-0 SSTables. Each SSTable keeps the order of its memtable, but there may be multiple SSTables whose key ranges overlap.

    • Level-0 SSTables simply accumulate; they are not sorted or reorganized until a compaction (merge) runs.

  3. Compaction (merging)

    • When Level-0 accumulates enough data, the system merges its SSTables and moves the merged, sorted data down to Level-1.

    • From Level-1 onward, all SSTables are sorted and non-overlapping: each SSTable owns its own key range and never overlaps another, so a query can quickly locate the target SSTable.

  4. Sinking through the levels

    • As the system runs, data gradually sinks from Level-0 into deeper levels (Level-1, Level-2, and so on). At every level, compaction makes the data more ordered and more compact.

    • After each merge, the data is reorganized into new, non-overlapping SSTables, preserving the physical ordering of keys.

LSM-Tree queries

Queries on an LSM-Tree fall into two categories, point lookups and range lookups, executed as follows (a minimal sketch follows the list):

  • Point lookup: search from top to bottom, first the memtable, then L0, then L1, and so on. Because upper levels always hold newer versions than lower levels, the search stops at the first match.
  • Range lookup: each level contributes a range of matching entries, and these ranges are combined with a multi-way merge; during the merge, only the newest version of each key is kept.
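A small sketch of the two query paths, assuming components are ordered newest-first as described above:

```python
# Sketch of point and range lookups over an LSM-Tree (each level is modeled as a
# dict of key -> value here; real levels are sets of SSTable files).

def point_lookup(key, memtable, levels):
    # Top-down: memtable, then L0, L1, ...; stop at the first match (it is the newest).
    if key in memtable:
        return memtable[key]
    for level in levels:
        if key in level:
            return level[key]
    return None

def range_lookup(lo, hi, memtable, levels):
    # Collect the matching range from every component, then merge, keeping only
    # the newest version of each key (earlier sources are newer, so they win).
    result = {}
    for source in [memtable] + levels:
        for key in sorted(k for k in source if lo <= k <= hi):
            result.setdefault(key, source[key])
    return sorted(result.items())
```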

LSM-Tree performance is evaluated mainly along three dimensions: space amplification, read amplification, and write amplification.

The first is space amplification. All writes in an LSM-Tree are sequential appends; an update does not modify the existing value in place but allocates new space for the new value (out-of-place update). Redundant data and old versions therefore linger in the system for a while, so the space actually occupied is larger than the live data itself; this is space amplification. Because space is limited, the LSM-Tree keeps compacting from L1 down through L2, L3, and L4 to clean out expired data and old versions of keys and reclaim space.

The second is read amplification. Suppose the target data is 1 KB; because of how the structure is laid out, reading it may trigger several I/O operations, each of which is a read request, and the amount of data actually read from disk on the backend is far larger than the target data itself, hurting read performance. This is read amplification. To mitigate it, LSM-Trees use Bloom filters to avoid reading SST files that cannot contain the queried key.

The third is write amplification. During compaction at each level, multiple SST files are repeatedly read, merge-sorted, stripped of old versions, and written out as new SST files. In effect, each key may be written many times, roughly once per level, and the resulting I/O cost is write amplification.
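As a rough, back-of-the-envelope illustration of write amplification under leveled compaction (assumed fanout of 10 and 4 levels; actual numbers depend on the engine and workload):

```python
# Back-of-the-envelope write amplification for leveled compaction (assumed parameters).
fanout, levels = 10, 4
# Each key is rewritten roughly once per level, and each rewrite at a level merges
# with about `fanout` times as much data already living there.
write_amplification = levels * fanout
print(write_amplification)   # ~40x: 1 KB of user writes can cause ~40 KB of disk writes
```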

The original idea of the LSM-Tree is to trade space amplification and read amplification for lower write amplification and thus good write performance, but the three must be kept in balance. Consider two hypothetical extremes.

The first extreme: with no compaction at all, the LSM-Tree is essentially just a log. As memtables keep flushing down, only L0 files are produced, and reading a key becomes very slow: if the key is not in the memtable, every SST file must be scanned. On the other hand, write amplification disappears entirely.

The second extreme: if compaction is pushed to the limit so that all data is globally sorted, read performance is optimal, because only one file needs to be read, it is fully sorted, and the key can be located quickly. But achieving this requires constant compaction, repeatedly reading, merging, and rewriting SST files, which causes severe write amplification.

Nova-LSM architecture

[Figure: Nova-LSM architecture: logging component, LTCs, and StoCs]

The first part is the logging component: the WAL is written first, and only after that write succeeds is the new data applied to the LSM-Tree's memtable.

The second part is the thread that handles LSM-Tree writes, abbreviated LTC (LSM-Tree Component), reflecting that this logic is factored out into a standalone component.

The third part is the underlying storage, which handles what the upper-level LTC components send down and exposes a standard file interface.

The core problems Nova-LSM addresses

The first core problem: storage systems built on the LSM-Tree, such as LevelDB and RocksDB, inevitably run into write slowdowns or write stalls. For example, the in-memory memtables may be configured to allow at most eight; under heavy writes they must all be flushed to disk. Meanwhile, L0 already holds many SST files and is about to start compaction. Compaction involves disk I/O, and while it is in progress it blocks the flushing of memtables to L0. When flushes cannot proceed, writes slow down, and as thresholds are reached writes may stop entirely; on the client side this shows up as request throughput dropping to zero.

To solve the write-slowdown and write-stall problem in LSM-Tree storage systems, the paper proposes two approaches:

  • The first is a compute-storage disaggregated architecture, as shown in the figure above. A key benefit of this architecture is that it splits the two main workloads, handling writes and handling disk I/O, into separate components; whichever part becomes the bottleneck can be scaled out by adding nodes, which is the paper's most notable contribution.
  • The second is dynamic partitioning, the Drange mechanism. Its purpose is to split the write load by key range across the memtables of the LTC (compute) layer; each range has its own memtable, and the partitioning lets multiple ranges be compacted in parallel. Taking L0 as an example, L0 can be kept free of overlaps between ranges, so L0 compaction can run in parallel, L0 files are digested faster, and the flushing of memtables to disk is less affected.

The second core problem: this approach creates many different Dranges, and each range adds memtables; the larger number of memtables hurts scan and get performance. When all data was written into a single memtable, a get only needed to consult that memtable's index and follow its skip list, returning as soon as the key was found. With data split across Dranges, there are more memtables, and more L0 SSTs, to search. The solution is an index that can locate the latest value of a key (if one exists) in the memtables or L0 SSTs.

Key designs in Nova-LSM

Write path between LTCs and StoCs

The first important design is the write path between LTCs and StoCs. It shows how, when a client issues a write request, the compute node and the storage node cooperate to get the data written.

First, the client on the compute node issues a new write request. Upon receiving it, the storage node, interacting over RDMA, allocates a region in its buffer, assigns an id for that memory region and its offset (i.e., which part of the buffer can be written), and returns this to the client. The client then writes the data and notifies the storage node when it is done. The storage node persists the data and acknowledges back to the client.

The flow above writes one data file, i.e., an SSTable. Afterwards, the metadata file is updated through the same flow, because in a distributed architecture the system must know which files are stored where, as well as each SST's key range and version number.

[Figure: write flow between an LTC and StoCs over RDMA]

Dynamic range partitioning

The second important design is dynamic range partitioning. Suppose the workload's key range is 0-10,000 and there are 10 compute nodes; the range is split into 10 equal parts, so the first node is responsible for keys 0-1,000, for example. Within the node responsible for 0-1,000, the range is subdivided again, transparently to the application. This is dynamic range partitioning, or Drange for short. Its main effects are the following (see the sketch below):

First, each range is its own LSM-Tree: partitioned by key range, each Drange has its own memtables. For example, the 0-1,000 range can be divided into 10 Dranges whose memtables are independent of one another. The benefit is that the Dranges' keys do not overlap, e.g., 0-100, 100-200, 200-300.

Second, beneath the Dranges there is another layer, Tranges. If part of a Drange, say keys 890-895, becomes a hotspot while the neighboring ranges do not, Tranges allow fine-grained rebalancing to dynamically even out the load.

Finally, because Drange key ranges are disjoint, once a memtable becomes immutable it is flushed to disk independently. The resulting L0 SSTables come from different Dranges, their keys do not overlap at all, and they can therefore be compacted in parallel.
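A minimal sketch of Drange-style routing: each disjoint range owns its memtable and produces its own non-overlapping L0 files, so their compactions can run in parallel. The classes are simplified assumptions, not Nova-LSM's code.

```python
# Sketch of Drange partitioning: disjoint key ranges with independent memtables
# and L0 files (illustrative only).

class Drange:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi          # half-open key range [lo, hi)
        self.memtable = {}
        self.l0 = []                       # L0 SSTables produced by this Drange only

    def put(self, key, value):
        self.memtable[key] = value

    def flush(self):
        self.l0.append(sorted(self.memtable.items()))
        self.memtable = {}

def route(dranges, key):
    for d in dranges:
        if d.lo <= key < d.hi:
            return d

dranges = [Drange(0, 100), Drange(100, 200), Drange(200, 300)]
route(dranges, 150).put(150, "v")
# Each Drange's L0 covers a disjoint key range, so their compactions can run in parallel.
```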

[Figure: parallel L0 compaction with and without Drange partitioning]

The paper also compares the cases with and without Drange partitioning:

  • Without Dranges, L0 compaction parallelizes poorly. In the worst case, a single L0 SST can cover the entire key space: suppose the live key range is 0-600 while an L0 SST spans 0-1,000; when compaction runs, that SST must be merged with the other four SSTs, which means reading and comparing all the other L0 SSTs and then reading every L1 SST for the merge sort. Write amplification becomes severe, L0-to-L1 compaction slows down, flushes slow down, and when flushes cannot keep up the front end sees write slowdowns or stalls.
  • With Dranges, compaction can be split by key range, as the comparison figure illustrates. In the 0-100 range, L0-to-L1 compaction can proceed independently, and likewise for 100-200, so compaction parallelizes well. In stock RocksDB, by contrast, parallel compaction is only possible from L1 downward.

Index lookup and scan operations

Because the key space is divided into many dynamic ranges, the number of memtables grows, and so does query latency. How, then, can read performance be preserved? The paper proposes the following:

Each LTC maintains a lookup index. If the data lives in a memtable or an L0 SST, the lookup index finds it quickly. When an L0 SST is compacted into L1, the corresponding keys are removed from the index.

The LTC also maintains a range index. Since the range of each Drange is known, whenever all keys touched by a scan can be found in the memtables and L0 SSTs, the range index can answer the scan quickly.
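The sketch below illustrates the lookup-index idea: it records where the latest value of a key lives (a memtable or an L0 SST) and drops entries once the corresponding L0 SST is compacted into L1. The structure is an assumption for illustration, not Nova-LSM's implementation.

```python
# Sketch of a per-LTC lookup index (illustrative data structure).

class LookupIndex:
    def __init__(self):
        self.where = {}                        # key -> ("memtable", id) or ("l0", sst_id)

    def on_write(self, key, memtable_id):
        self.where[key] = ("memtable", memtable_id)

    def on_flush(self, keys, sst_id):
        for key in keys:                       # memtable entries now live in an L0 SST
            self.where[key] = ("l0", sst_id)

    def on_compact_to_l1(self, sst_id):
        # Once an L0 SST moves to L1, its keys are found via the normal LSM search.
        self.where = {k: loc for k, loc in self.where.items() if loc != ("l0", sst_id)}

    def locate(self, key):
        return self.where.get(key)             # None -> fall back to searching L1 and below
```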

[Figure: lookup index and range index maintained by an LTC]

SSTable placement

The last important design concerns the storage layer. When an SST file is written to storage nodes, the distributed system must balance the load and ensure the data cannot be lost unrecoverably through a single point of failure.

The paper proposes scattering data files (SSTs) across multiple storage nodes according to a placement policy. To control storage cost, each SSTable is encoded with erasure coding (EC) and stored in a distributed fashion. By default, each SSTable uses a "3+1" EC configuration: the SSTable is split into 3 data blocks, from which a parity block is computed, giving the "3+1" layout. This saves more space than traditional multi-replica storage: a 3 MB SSTable occupies 4 MB under "3+1" and tolerates the loss of one node, the same fault tolerance as a two-replica scheme that would occupy 6 MB. Metadata files are small, so they are simply stored as multiple replicas, e.g., 3 copies per metadata file.
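To illustrate the space/fault-tolerance trade-off of the "3+1" layout, the sketch below uses simple XOR parity; production systems use Reed-Solomon-style codes, so this is only a stand-in for the idea.

```python
# Sketch of "3+1" erasure coding with XOR parity: 3 data blocks + 1 parity block,
# tolerating the loss of any single block (illustrative only).

def encode(sstable: bytes):
    third = (len(sstable) + 2) // 3
    chunks = [sstable[i * third:(i + 1) * third].ljust(third, b"\0") for i in range(3)]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*chunks))
    return chunks + [parity]               # 4 blocks total: 3 data + 1 parity

def recover(blocks, lost_index):
    # Any single missing block can be rebuilt by XOR-ing the remaining three.
    others = [b for i, b in enumerate(blocks) if i != lost_index]
    return bytes(a ^ b ^ c for a, b, c in zip(*others))

blocks = encode(b"sstable-bytes-go-here!")
assert recover(blocks, lost_index=1) == blocks[1]
```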

Challenges and Solutions

  1. Write Stalls, the solutions are:

    1. Vertical scaling: use large memory.

    2. Horizontal scaling: use the bandwidth of many StoCs.

  2. Scans are slowed down, the solutions are:

    1. Construct Dranges at runtime based on workload. Dranges facilitate parallel compaction.

    2. Construct range index dynamically.

  3. Gets are slowed down, the solution is: Use lookup index.

  4. Temporary Bottlenecks, the solution is:

    1. Scatter blocks of an SSTable across multiple StoCs.

    2. Power-of-d: power-of-d is applied in Nova-LSM to help with load balancing during SSTable placement. When writing data to storage components (StoCs), Nova-LSM doesn’t randomly select just one StoC. Instead, it chooses d StoCs at random and writes to the one with the shortest queue. This method helps avoid bottlenecks and improves throughput, ensuring that data is distributed evenly across storage nodes without overwhelming any individual node.

  5. Logging, the solution is: Replicating Log records in the memory of StoCs to provide high availability.

  6. Skewed Access Pattern, the solution is: Dranges enable LTC to write 65% less data to StoCs with skewed data access.
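A sketch of the power-of-d placement described in item 4 above, with an assumed queue-length callback standing in for real StoC load metrics:

```python
# Sketch of power-of-d-choices placement for SSTable blocks (queue model is a
# simplification; d and the StoC representation are assumptions).

import random

def pick_stoc(stocs, queue_len, d=2, rng=random):
    """Sample d StoCs at random and return the one with the shortest queue."""
    candidates = rng.sample(stocs, d)
    return min(candidates, key=queue_len)

stocs = ["stoc-0", "stoc-1", "stoc-2", "stoc-3"]
queues = {"stoc-0": 5, "stoc-1": 1, "stoc-2": 3, "stoc-3": 0}
target = pick_stoc(stocs, queue_len=lambda s: queues[s], d=2)
```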

Questions

Why do modern database systems disaggregate compute from storage?

Modern database systems disaggregate compute from storage to improve scalability, resource utilization, and fault isolation. By separating compute (processing) and storage, the system can independently scale each based on demand. Compute nodes handle processing, while storage nodes handle data access, optimizing resources and ensuring that failures in one component don’t impact the other. This separation also benefits cloud environments, where elastic scaling of resources is crucial.

How does Nova-LSM provide superior performance than monolithic data stores?

Nova-LSM improves performance by using a component-based architecture that disaggregates processing (LTC) and storage (StoC). It allows components to scale independently and uses RDMA for fast communication. Nova-LSM also introduces dynamic range partitioning (Dranges), allowing parallel compaction and reducing write stalls, which significantly enhances throughput. This architecture minimizes bottlenecks seen in monolithic stores like LevelDB and RocksDB, especially under skewed workloads.

Why does the standard cost-based optimizer produce sub-optimal query plans? How does Kepler improve both the query planning time and query execution time?

The standard cost-based optimizer can produce sub-optimal plans because it relies on simplified and static cost models that don’t always capture real execution costs, especially in dynamic environments. It also may lack up-to-date statistics, leading to inaccurate decisions. Kepler, on the other hand, uses machine learning to learn from past executions and adapts to current data distributions, improving query plan selection. By pruning the search space efficiently and using real-time data, it reduces both planning time and execution time while optimizing performance.


Constraints

Constraints are rules applied to table columns that restrict the data stored in the table, guaranteeing the correctness, validity, and integrity of that data.

  • NOT NULL: non-null constraint
  • UNIQUE: uniqueness constraint
  • PRIMARY KEY: primary key constraint
  • DEFAULT: default-value constraint
  • CHECK: check constraint
  • FOREIGN KEY: foreign key constraint; links the data in two tables, and the table that holds the foreign key is the child table.

Declaring a foreign key:

CREATE TABLE table_name (
    column_name data_type,
    [CONSTRAINT] [fk_name] FOREIGN KEY (fk_column) REFERENCES parent_table (parent_column)
);

ALTER TABLE table_name ADD CONSTRAINT fk_name FOREIGN KEY (fk_column) REFERENCES parent_table (parent_column);

Dropping a foreign key: ALTER TABLE table_name DROP FOREIGN KEY fk_name;

Delete/update behaviors:

  • NO ACTION: when deleting/updating a record in the parent table, first check whether the record is referenced by a foreign key; if it is, the delete/update is not allowed (same as RESTRICT).
  • RESTRICT: same as NO ACTION.
  • CASCADE: when deleting/updating a record in the parent table, first check whether the record is referenced by a foreign key; if it is, also delete/update the matching rows in the child table.
  • SET NULL: when deleting a record in the parent table, first check whether the record is referenced by a foreign key; if it is, set that foreign key column in the child table to NULL (the foreign key must allow NULL).
  • SET DEFAULT: when the parent table changes, set the child table's foreign key column to a default value (not supported by InnoDB).

ALTER TABLE table_name ADD CONSTRAINT fk_name FOREIGN KEY (fk_column) REFERENCES parent_table (parent_column) ON UPDATE [update_behavior] ON DELETE [delete_behavior];

String functions

  • CONCAT(S1, S2, …): string concatenation;
  • LOWER(str): convert to lowercase;
  • UPPER(str): convert to uppercase;
  • LPAD(str, n, pad): left-pad str with the string pad until its length is n;
  • RPAD(str, n, pad): right-pad str with the string pad until its length is n;
  • TRIM(str): remove leading and trailing spaces;
  • SUBSTRING(str, start, len): return the substring of str of length len starting at position start; note that character indexing starts at 1.

Numeric functions

  • CEIL(x): round up;
  • FLOOR(x): round down;
  • MOD(x, y): return x modulo y;
  • RAND(): return a random number between 0 and 1;
  • ROUND(x, y): round x to y decimal places.

Date functions

  • CURDATE(): current date;
  • CURTIME(): current time;
  • NOW(): current date and time;
  • YEAR(date): year of the given date;
  • MONTH(date): month of the given date;
  • DAY(date): day of the given date;
  • DATE_ADD(date, INTERVAL expr type): return the date/time value obtained by adding the interval expr to date;
  • DATEDIFF(date1, date2): return the number of days between start date date1 and end date date2.

Flow-control functions

  • IF(value, t, f): if value is true, return t, otherwise return f;
  • IFNULL(value1, value2): if value1 is not NULL, return value1, otherwise return value2;
  • CASE WHEN [val1] THEN [res1] … ELSE [default] END: if val1 is true, return res1, …, otherwise return the default value;
  • CASE [expr] WHEN [val1] THEN [res1] … ELSE [default] END: if expr equals val1, return res1, …, otherwise return the default value.