Why These Paths Matter
OceanBase targets high availability and scalability in a shared-nothing cluster. The core engineering challenge is to make four critical subsystems work together with predictable latency and correctness:
- Write transactions must be durable, replicated, and efficiently committed.
- Tablet replay must recover state quickly and safely.
- The SQL parse-to-execute path must produce efficient plans quickly while respecting multi-tenant constraints.
- Unit placement must map tenants to physical resources while minimizing fragmentation.
This article focuses on motivation, design, implementation highlights, and tradeoffs, using concrete code entry points from the OceanBase codebase.
System Architecture
OceanBase adopts a shared-nothing architecture where each node is equal and runs its own SQL engine, storage engine, and transaction engine. Understanding the overall architecture is essential before diving into implementation details.
Cluster, Zone, and Node Organization
```mermaid
graph TB
subgraph Cluster["OceanBase Cluster"]
subgraph Z1["Zone 1"]
N1["OBServer Node 1"]
end
subgraph Z2["Zone 2"]
N2["OBServer Node 2"]
end
subgraph Z3["Zone 3"]
N3["OBServer Node 3"]
end
end
subgraph Proxy["obproxy Layer"]
P1["obproxy 1"]
P2["obproxy 2"]
end
Client["Client Applications"] --> Proxy
Proxy --> N1
Proxy --> N2
Proxy --> N3
```
Key Concepts:
- Cluster: A collection of nodes working together
- Zone: Logical availability zones for high availability and disaster recovery
- OBServer: Service process on each node handling SQL, storage, and transactions
- obproxy: Stateless proxy layer routing SQL requests to appropriate OBServer nodes
Data Organization: Partition, Tablet, and Log Stream
```mermaid
graph TB
subgraph Table["Table"]
P1["Partition 1"]
P2["Partition 2"]
P3["Partition 3"]
end
subgraph LS1["Log Stream 1"]
T1["Tablet 1"]
T2["Tablet 2"]
end
subgraph LS2["Log Stream 2"]
T3["Tablet 3"]
end
subgraph LS3["Log Stream 3"]
T4["Tablet 4"]
end
P1 --> T1
P1 --> T2
P2 --> T3
P3 --> T4
T1 --> LS1
T2 --> LS1
T3 --> LS2
T4 --> LS3
```
Key Concepts:
- Partition: Logical shard of a table (hash, range, list partitioning)
- Tablet: Physical storage object storing ordered data records for a partition
- Log Stream (LS): Replication unit using Multi-Paxos for data consistency
- Replication: Log streams replicate data across zones via the Multi-Paxos protocol; each tablet therefore has multiple replicas, with the leader replica of its log stream accepting writes
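To make the mapping concrete, here is a minimal C++ sketch (hypothetical types, not OceanBase code): a tablet is bound to exactly one log stream, a log stream hosts many tablets, and routing a write means looking up the tablet's log stream.

```cpp
// Minimal sketch (hypothetical, not OceanBase code): tablet -> log stream
// containment. In the real system the binding lives in tablet metadata
// rather than a standalone map.
#include <cstdint>
#include <iostream>
#include <unordered_map>

int main() {
  // tablet_id -> log_stream_id
  std::unordered_map<uint64_t, uint64_t> tablet_to_ls = {
      {1, 1001},  // Tablet 1 -> Log Stream 1001
      {2, 1001},  // Tablet 2 -> Log Stream 1001 (a stream hosts many tablets)
      {3, 1002},  // Tablet 3 -> Log Stream 1002
  };

  // Every redo record for a tablet is appended to (and replicated by) the
  // log stream that hosts it.
  const uint64_t tablet_id = 2;
  std::cout << "writes to tablet " << tablet_id
            << " are replicated by log stream " << tablet_to_ls[tablet_id] << "\n";
  return 0;
}
```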
Multi-Tenant Resource Model
```mermaid
graph TB
subgraph Tenant["Tenant"]
T1["Tenant 1 MySQL Mode"]
T2["Tenant 2 Oracle Mode"]
T3["System Tenant"]
end
subgraph Pool["Resource Pool"]
RP1["Pool 1"]
RP2["Pool 2"]
RP3["Pool 3"]
end
subgraph Unit["Resource Unit"]
U1["Unit 1 CPU Memory Disk"]
U2["Unit 2 CPU Memory Disk"]
U3["Unit 3 CPU Memory Disk"]
end
subgraph Server["Physical Server"]
S1["Server 1"]
S2["Server 2"]
S3["Server 3"]
end
T1 --> RP1
T2 --> RP2
T3 --> RP3
RP1 --> U1
RP2 --> U2
RP3 --> U3
U1 --> S1
U2 --> S2
U3 --> S3
```
Key Concepts:
- Tenant: Isolated database instance (MySQL or Oracle compatibility)
- Resource Pool: Groups resource units for a tenant across zones
- Resource Unit: Virtual container with CPU, memory, and disk resources
- Unit Placement: Rootserver schedules units to physical servers based on resource constraints
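The following C++ sketch (hypothetical field names, not the actual `share/unit` definitions) illustrates the tenant → resource pool → unit relationship: a unit config states per-unit resource demands, and a pool instantiates one unit per zone for a tenant.

```cpp
// Minimal sketch (hypothetical types, not OceanBase code): a unit config
// describes the resources one unit may consume; a resource pool applies
// that config across a zone list for a tenant.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct UnitConfig {
  double min_cpu;
  double max_cpu;
  int64_t memory_size;    // bytes
  int64_t log_disk_size;  // bytes
};

struct ResourcePool {
  std::string tenant;
  UnitConfig config;
  std::vector<std::string> zone_list;  // one unit is placed in each zone
};

int main() {
  ResourcePool pool{"tenant_1",
                    {2.0, 4.0, 8LL << 30, 16LL << 30},
                    {"zone1", "zone2", "zone3"}};
  // The rootserver would place one unit per zone onto a physical server
  // that still has enough CPU, memory, and log disk headroom.
  for (const auto& zone : pool.zone_list) {
    std::cout << pool.tenant << ": unit in " << zone << " needs "
              << pool.config.max_cpu << " CPU, "
              << (pool.config.memory_size >> 30) << " GiB memory\n";
  }
  return 0;
}
```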
Layered Architecture
```mermaid
graph TB
subgraph App["Application Layer"]
Client["Client Applications"]
end
subgraph Proxy["Proxy Layer"]
ODP["obproxy Router"]
end
subgraph OBServer["OBServer Layer"]
subgraph Node["OBServer Node"]
SQL["SQL Engine"]
TX["Transaction Engine"]
ST["Storage Engine"]
end
end
subgraph Storage["Storage Layer"]
subgraph LS["Log Stream"]
Tablet["Tablet"]
Memtable["Memtable"]
SSTable["SSTable"]
end
Palf["Paxos Log Service"]
end
subgraph Resource["Resource Layer"]
Tenant["Tenant"]
Unit["Resource Unit"]
Pool["Resource Pool"]
RS["Rootserver"]
end
Client --> ODP
ODP --> SQL
SQL --> TX
TX --> ST
ST --> LS
LS --> Palf
Tenant --> Unit
Unit --> Pool
Pool --> RS
RS --> Node
```
Key Concepts:
- SQL Engine: Parses, optimizes, and executes SQL statements
- Transaction Engine: Manages transaction lifecycle, commit protocols, and consistency
- Storage Engine: Handles data organization, memtables, and SSTables
- Log Service: Provides Paxos-based replication and durability
- Rootserver: Manages cluster metadata, resource scheduling, and placement
Design Overview
At a high level, each node runs a full SQL engine, storage engine, and transaction engine. Data is organized into tablets, which belong to log streams. Log streams replicate changes through the Paxos-based log service. Tenants slice resources via unit configurations and pools, while rootserver components place those units on servers.
The following sections walk through each path with the relevant implementation anchors.
Architecture Diagrams
Transaction Write Path
```mermaid
flowchart LR
subgraph A["Transaction Write Path"]
C["Client"] --> S["SQL Engine"]
S --> T["Transaction Context"]
T --> M["Memtable Write"]
M --> R["Redo Buffer"]
R --> L["Log Service"]
L --> P["Replicated Log"]
P --> K["Commit Result"]
end
```
Tablet Replay Path
```mermaid
flowchart LR
subgraph B["Tablet Replay Path"]
L["Log Service"] --> RS["Replay Service"]
RS --> E["Tablet Replay Executor"]
E --> LS["Log Stream"]
LS --> TB["Tablet"]
TB --> CK["Replay Checks"]
CK --> AP["Apply Operation"]
AP --> ST["Updated Tablet State"]
end
```
SQL Compile and Execute
```mermaid
flowchart LR
subgraph C["SQL Compile and Execute"]
Q["SQL Text"] --> P["Parser"]
P --> R["Resolver"]
R --> W["Rewriter"]
W --> O["Optimizer"]
O --> LP["Logical Plan"]
LP --> CG["Code Generator"]
CG --> PP["Physical Plan"]
PP --> EX["Executor"]
end
```
Unit Placement
```mermaid
flowchart LR
subgraph D["Unit Placement"]
UC["Unit Config"] --> RP["Resource Pool"]
RP --> PS["Placement Strategy"]
PS --> CS["Candidate Servers"]
CS --> CH["Chosen Server"]
CH --> UN["Unit Instance"]
end
```
Write Transaction: From Memtable to Replicated Log
Motivation
A write transaction must be both fast and durable. OceanBase uses memtables for in-memory writes and log streams for redo replication. The design must allow low-latency commits while supporting parallel redo submission and multi-participant (2PC) transactions.
Design Sketch
- Each transaction is represented by a per-LS context (`ObPartTransCtx`).
- Redo is flushed based on pressure or explicit triggers.
- Commit chooses one-phase or two-phase based on participants.
- Logs are submitted via a log adapter backed by logservice.
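As a rough illustration of the commit decision, the sketch below (hypothetical types, not the actual `ObPartTransCtx` code) chooses the commit protocol from the number of log-stream participants.

```cpp
// Minimal sketch (hypothetical, not OceanBase code): a transaction that
// wrote to a single log stream can commit in one phase; touching multiple
// log streams requires two-phase-commit coordination.
#include <cstdio>
#include <vector>

enum class CommitProtocol { kOnePhase, kTwoPhase };

CommitProtocol choose_commit_protocol(const std::vector<int>& participants) {
  return participants.size() <= 1 ? CommitProtocol::kOnePhase
                                  : CommitProtocol::kTwoPhase;
}

int main() {
  std::vector<int> local_txn = {1001};        // one log stream participant
  std::vector<int> dist_txn = {1001, 1002};   // two log stream participants
  std::printf("local txn uses %s\n",
              choose_commit_protocol(local_txn) == CommitProtocol::kOnePhase
                  ? "1PC" : "2PC");
  std::printf("distributed txn uses %s\n",
              choose_commit_protocol(dist_txn) == CommitProtocol::kOnePhase
                  ? "1PC" : "2PC");
  return 0;
}
```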
Implementation Highlights
- Transaction context lifecycle and commit logic are in `src/storage/tx/ob_trans_part_ctx.cpp`.
- Redo submission is driven by `submit_redo_after_write`, which switches between serial and parallel logging based on thresholds.
- Commit decides between one-phase and two-phase commit depending on participant count.
- The log writer (`ObTxLSLogWriter`) submits serialized logs via `ObITxLogAdapter`, which is wired to logservice (`ObLogHandler`).
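A simplified view of the serial-versus-parallel switch is sketched below; the threshold and names are invented for illustration, and the real `submit_redo_after_write` bases its decision on its own counters and configuration.

```cpp
// Minimal sketch (hypothetical threshold, not the real redo logic):
// switch from serial to parallel redo submission once the pending redo
// for a transaction grows large enough.
#include <cstdint>
#include <cstdio>

enum class RedoMode { kSerial, kParallel };

// Illustrative threshold only.
constexpr int64_t kParallelRedoThreshold = 2LL << 20;  // 2 MiB

RedoMode choose_redo_mode(int64_t pending_redo_bytes) {
  return pending_redo_bytes >= kParallelRedoThreshold ? RedoMode::kParallel
                                                      : RedoMode::kSerial;
}

int main() {
  std::printf("64 KiB pending -> %s\n",
              choose_redo_mode(64 << 10) == RedoMode::kSerial ? "serial"
                                                              : "parallel");
  std::printf("8 MiB pending  -> %s\n",
              choose_redo_mode(8LL << 20) == RedoMode::kSerial ? "serial"
                                                               : "parallel");
  return 0;
}
```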
Tradeoffs
- Serial vs parallel redo: Serial logging is simpler and cheaper for small transactions, but parallel logging reduces tail latency for large transactions at the cost of more coordination.
- 1PC vs 2PC: 1PC is fast for single-participant transactions; 2PC is required for distributed consistency but increases coordination overhead.
- In-memory batching vs durability: Larger batching improves throughput but can delay durability and increase replay time.
Tablet Replay: Reconstructing State Safely
Motivation
Recovery needs to be deterministic and safe: the system must replay logs to reconstruct tablet state without violating invariants or applying obsolete data.
Design Sketch
- Logservice schedules replay tasks per log stream.
- Tablet replay executor fetches the LS, locates the tablet, validates replay conditions, and applies the log.
- Specialized replay executors handle different log types (e.g., schema updates, split operations).
Implementation Highlights
- Replay orchestration lives in `src/logservice/replayservice/ob_log_replay_service.cpp`.
- Tablet replay logic is in `src/logservice/replayservice/ob_tablet_replay_executor.cpp`.
- Specific tablet operations are applied in dedicated executors, such as `ObTabletServiceClogReplayExecutor` in `src/storage/tablet/ob_tablet_service_clog_replay_executor.cpp`.
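The overall shape of tablet replay can be sketched as follows; the types, lookup, and checkpoint rule here are simplified stand-ins for what the real executors do, not the actual interfaces.

```cpp
// Minimal sketch (hypothetical, not OceanBase code): locate the tablet,
// skip logs already covered by its checkpoint, otherwise apply the entry.
#include <cstdint>
#include <cstdio>
#include <optional>

struct Tablet {
  uint64_t id;
  uint64_t clog_checkpoint_scn;  // state up to this SCN is already persisted
};

// Pretend lookup: the real path goes through the log stream's tablet
// service; here a missing tablet is simply "not found".
std::optional<Tablet> get_tablet(uint64_t tablet_id) {
  if (tablet_id == 200001) return Tablet{200001, 150};
  return std::nullopt;
}

// Returns true if the log entry was applied, false if it was skipped.
bool replay_tablet_log(uint64_t tablet_id, uint64_t log_scn) {
  auto tablet = get_tablet(tablet_id);
  if (!tablet) {
    // Tablet may have been deleted or not yet created; the real executor
    // validates this case instead of failing the whole replay.
    std::printf("tablet %lu not found, skip scn=%lu\n",
                (unsigned long)tablet_id, (unsigned long)log_scn);
    return false;
  }
  if (log_scn <= tablet->clog_checkpoint_scn) {
    // The effect of this log is already reflected in persisted state.
    std::printf("scn=%lu <= checkpoint, skip\n", (unsigned long)log_scn);
    return false;
  }
  std::printf("apply scn=%lu to tablet %lu\n", (unsigned long)log_scn,
              (unsigned long)tablet_id);
  return true;
}

int main() {
  replay_tablet_log(200001, 120);  // skipped: covered by checkpoint
  replay_tablet_log(200001, 180);  // applied
  replay_tablet_log(999999, 180);  // tablet missing
  return 0;
}
```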
Tradeoffs
- Strictness vs throughput: Replay barriers enforce ordering for correctness but can reduce parallelism.
- Tablet existence checks: Allowing missing tablets can speed recovery but requires careful validation to avoid partial state.
- MDS synchronization: Metadata state updates improve correctness but add contention via locks.
SQL Parse to Execute: Compile Pipeline for Performance
Motivation
OceanBase supports MySQL and Oracle compatibility with rich SQL features. The compile pipeline must be fast, cache-friendly, and yield efficient execution plans.
Design Sketch
- SQL text enters the engine via `ObSql::stmt_query`.
- Parsing produces a parse tree.
- Resolution turns the parse tree into a typed statement tree.
- Rewrite and optimization generate a logical plan.
- Code generation produces a physical plan and execution context.
Implementation Highlights
- Entry and query handling: `src/sql/ob_sql.cpp` (`stmt_query`, `handle_text_query`).
- Resolver: `ObResolver` in `src/sql/resolver/ob_resolver.h`.
- Transform and optimize: `ObSql::transform_stmt` and `ObSql::optimize_stmt` in `src/sql/ob_sql.cpp`.
- Code generation: `ObSql::code_generate` in `src/sql/ob_sql.cpp`.
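Conceptually, the pipeline composes a handful of stage functions. The sketch below uses hypothetical stage types rather than the real `ObSql` interfaces, but the data flow mirrors the list above.

```cpp
// Minimal sketch (hypothetical types, not OceanBase code): SQL text flows
// through parse -> resolve -> rewrite -> optimize -> code generation.
#include <iostream>
#include <string>

struct ParseTree { std::string text; };
struct Stmt { std::string normalized; };
struct LogicalPlan { std::string ops; };
struct PhysicalPlan { std::string ops; };

ParseTree parse(const std::string& sql) { return {sql}; }
Stmt resolve(const ParseTree& tree) { return {"resolved(" + tree.text + ")"}; }
Stmt rewrite(const Stmt& stmt) { return {"rewritten(" + stmt.normalized + ")"}; }
LogicalPlan optimize(const Stmt& stmt) {
  return {"join order + access paths over " + stmt.normalized};
}
PhysicalPlan code_generate(const LogicalPlan& plan) {
  return {"physical operators for " + plan.ops};
}

int main() {
  // Each stage consumes the previous stage's output; the final physical
  // plan is what gets cached and executed.
  std::string sql = "SELECT c1 FROM t1 WHERE c2 = 42";
  PhysicalPlan plan = code_generate(optimize(rewrite(resolve(parse(sql)))));
  std::cout << plan.ops << "\n";
  return 0;
}
```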
Tradeoffs
- Plan cache vs compile accuracy: Plan caching reduces latency but may reuse suboptimal plans under changing data distributions.
- Rewrite aggressiveness: More transformations can yield better plans but increase compile cost.
- JIT and rich formats: Faster execution for some workloads, but added complexity and memory pressure.
Unit Placement: Scheduling Tenant Resources
Motivation
Multi-tenancy requires predictable isolation and efficient resource utilization. Unit placement must respect CPU, memory, and disk constraints while minimizing fragmentation.
Design Sketch
- Unit config defines resource demands.
- Resource pool groups units by tenant and zone.
- Placement strategy scores candidate servers to pick a host for each unit.
Implementation Highlights
- Resource types and pools: `src/share/unit/ob_unit_config.h`, `src/share/unit/ob_resource_pool.h`, `src/share/unit/ob_unit_info.h`.
- Placement policy: `src/rootserver/ob_unit_placement_strategy.cpp` uses a weighted dot-product of remaining resources to choose a server.
- Orchestration: `src/rootserver/ob_unit_manager.cpp` handles creation, alteration, and migration of units and pools.
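A stripped-down version of the scoring idea is shown below; the weights and field names are illustrative, not the ones used by `ob_unit_placement_strategy.cpp`. The key property is that hard resource constraints are checked first, and the score only ranks the remaining feasible candidates.

```cpp
// Minimal sketch (hypothetical weights and fields, not OceanBase code):
// score each candidate server by a weighted dot-product of its remaining
// resources and pick the best-scoring feasible server.
#include <cstdio>
#include <string>
#include <vector>

struct Server {
  std::string addr;
  double free_cpu;          // cores remaining after already-placed units
  double free_mem_gb;       // GiB remaining
  double free_log_disk_gb;  // GiB remaining
};

struct Demand {
  double cpu;
  double mem_gb;
  double log_disk_gb;
};

// Illustrative weights; the real strategy computes its own weighting of
// the resource dimensions.
double score(const Server& s) {
  return 0.5 * s.free_cpu + 0.3 * s.free_mem_gb + 0.2 * s.free_log_disk_gb;
}

bool fits(const Server& s, const Demand& d) {
  return s.free_cpu >= d.cpu && s.free_mem_gb >= d.mem_gb &&
         s.free_log_disk_gb >= d.log_disk_gb;
}

int main() {
  std::vector<Server> candidates = {
      {"10.0.0.1:2882", 6.0, 24.0, 120.0},
      {"10.0.0.2:2882", 2.0, 64.0, 200.0},
      {"10.0.0.3:2882", 10.0, 8.0, 80.0},
  };
  Demand unit{4.0, 16.0, 64.0};

  const Server* best = nullptr;
  for (const auto& s : candidates) {
    if (!fits(s, unit)) continue;  // hard constraints first
    if (best == nullptr || score(s) > score(*best)) best = &s;
  }
  std::printf("place unit on %s\n",
              best ? best->addr.c_str() : "<no feasible server>");
  return 0;
}
```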
Tradeoffs
- Greedy placement vs global optimality: Dot-product scoring is efficient and practical but may not be globally optimal.
- Capacity normalization: Assuming comparable server capacities simplifies scoring but may bias placement in heterogeneous clusters.
- Latency vs stability: Fast placement decisions can lead to more churn; conservative placement improves stability but can reduce utilization.
Closing Thoughts
These four paths demonstrate how OceanBase balances correctness, performance, and operability. The code structure follows clear separation of responsibilities: transaction logic is in `storage/tx`, replication and replay are in `logservice`, SQL compilation is in `sql`, and scheduling is in `rootserver` and `share/unit`. The tradeoffs are explicit and largely encoded in thresholds and policies, which makes tuning feasible without invasive rewrites.
If you are extending OceanBase, start with the entry points highlighted above and follow the call chains into the relevant subsystem. It is the fastest way to build a mental model grounded in the actual implementation.