In late 2023, Clyso was approached by a cutting-edge company to transition their existing HDD-backed Ceph cluster to a 10 petabyte NVMe deployment. This marked the beginning of an intense performance analysis and optimization journey that would push Ceph to new limits. The goal: achieving 1 TiB/s throughput with a Ceph cluster. This blog outlines the challenges, solutions, and outcomes of this remarkable achievement.
The Challenge: Setting Up a High-Performance Cluster
The client’s requirements were unique. They needed a system that could operate across 17 racks with no service interruptions during the migration. The new cluster had to be integrated into the existing infrastructure without disrupting operations. This meant maintaining a delicate balance between power, cooling, density, and vendor preferences. Despite the client already having a hardware design in mind, they sought Clyso’s expertise to finalize the configuration.
The final setup utilized 68 Dell PowerEdge R6615 nodes, each with an AMD EPYC 9454P processor, 192 GiB DDR5 RAM, and 10 Dell 15.36 TB NVMe drives. This setup provided not only higher memory throughput and more aggregate CPU resources but also better network throughput. With two 100GbE Mellanox ConnectX interfaces per node, the cluster was ready to handle extreme workloads.
The Initial Hurdles: Testing and Debugging
The burn-in testing was performed using CBT (the Ceph Benchmarking Tool) and FIO with the librbd engine. The customer didn’t need RBD, RGW, or CephFS; however, using librbd allowed Clyso to compare the results against previously published numbers. Initially, the tests revealed significant performance inconsistencies. One of the major issues was that the operating system had been accidentally deployed on two of the OSD drives instead of on the internal Dell BOSS M.2 boot drives.
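For readers who want to run a similar burn-in, a minimal FIO job using the librbd engine might look like the sketch below. The pool name, image name, and job parameters are placeholders for illustration, not the actual CBT-generated configuration used in these tests.

```bash
# Hypothetical librbd burn-in job; pool/image names and sizing are placeholders.
cat > rbd-burnin.fio <<'EOF'
[global]
# fio's userspace librbd engine (no kernel RBD mapping required)
ioengine=rbd
clientname=admin
pool=testpool
rbdname=fio_image
invalidate=0
time_based=1
runtime=300

[seq-read-4m]
# Large sequential reads, similar in spirit to the 4MB read tests described here
rw=read
bs=4M
iodepth=64
numjobs=4
EOF

fio rbd-burnin.fio
```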
The testing process was complex and revealed a pattern of erratic behavior. During multi-OSD tests, the system showed significant performance degradation, which only recovered after several hours or a system reboot. Further analysis showed that kernel-level issues were causing the system to block I/O operations, leading to these performance drops.
Fixing the Problems: The Three Key Solutions
**Fix One: Disabling CPU c-states**
It was discovered that Ceph is highly sensitive to latency introduced by CPU c-state transitions. Disabling c-states in the BIOS resulted in a 10-20% performance gain. While this was a good start, it wasn’t enough to meet the desired performance goals.
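In this case the change was made in the Dell BIOS, but as a rough software-level approximation (not the exact setting used here), idle states can also be restricted at runtime with cpupower:

```bash
# Sketch only: the actual fix was a BIOS setting, not a runtime change.

# Disable all CPU idle (c-)states whose exit latency is greater than 0 microseconds,
# keeping cores out of deep sleep states during latency-sensitive testing.
cpupower idle-set -D 0

# Optionally pin the frequency governor to performance while benchmarking.
cpupower frequency-set -g performance
```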
**Fix Two: Disabling IOMMU**
The second major fix involved disabling IOMMU in the kernel. Profiling showed that a significant amount of time was being spent in kernel spin locks tied to IOMMU handling, and the contention grew worse as the number of NVMe drives per node increased. Disabling IOMMU delivered a substantial performance boost and was crucial in overcoming the bottlenecks observed at larger scales.
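As a sketch of what this looks like on GRUB-based Ubuntu nodes like those described above (exact parameters and paths may vary by distribution and platform), the IOMMU can be disabled via the kernel command line:

```bash
# Sketch: append amd_iommu=off (AMD EPYC platforms) to the kernel command line.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&amd_iommu=off /' /etc/default/grub
sudo update-grub   # grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-family systems
sudo reboot

# After rebooting, confirm the parameter took effect:
cat /proc/cmdline
```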
**Fix Three: Recompiling RocksDB with Correct Flags**
The final fix was related to the way Ceph’s Ubuntu/Debian packages were compiled. It was discovered that the packages did not pass the proper compile flags through to RocksDB, so RocksDB was effectively built without optimizations, leading to slow compaction and poor 4K random write performance. After recompiling with the correct flags, compaction time dropped significantly, and 4K random write performance doubled.
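As an illustrative sketch only, building Ceph from source with an explicit release build type ensures the bundled RocksDB is compiled with optimizations; the exact flags that were missing from the distribution packages are not reproduced here.

```bash
# Sketch: build Ceph from source with an optimized build type so the bundled
# RocksDB is not compiled unoptimized (do_cmake.sh defaults to a Debug build).
git clone https://github.com/ceph/ceph.git
cd ceph
./install-deps.sh
./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
cd build
cmake --build . -j"$(nproc)"
```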
With these fixes in place, Clyso was able to push the performance closer to the target of 1 TiB/s.
The Breakthrough: Reaching 1 TiB/s
In the first week of January 2024, testing resumed with all 10 drives in each node active. Initial tests with 10 nodes achieved 213.5 GiB/s, roughly 98.4% of linear scaling. Encouraged by this, we increased the node count to 32, achieving a staggering 635 GiB/s of read throughput.
To reach 1 TiB/s, Clyso needed to test with all 63 nodes. The only option was to co-locate FIO processes on the same nodes as the OSDs, a move that would likely impact performance due to resource contention. After extensive testing and fine-tuning, we finally hit 1 TiB/s read throughput in the early hours of Monday morning.
Beyond the Goal: Exploring Erasure Coding and Encryption
After achieving 1 TiB/s with 3X replication, Clyso tested the cluster with 6+2 erasure coding, the configuration the customer would ultimately use. The results showed over 500 GiB/s for reads and nearly 400 GiB/s for writes. Reads took the biggest hit: with erasure coding, clients must fetch chunks from multiple OSDs over the network rather than reading whole objects from a single primary, so read performance dropped significantly compared to 3X replication.
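For reference, a 6+2 erasure-coded pool of the kind described above can be created roughly as sketched below; the profile name, pool name, and PG counts are placeholders.

```bash
# Sketch: create a 6+2 erasure-code profile and a data pool using it.
ceph osd erasure-code-profile set ec62 k=6 m=2 crush-failure-domain=host
ceph osd pool create ecpool 1024 1024 erasure ec62

# Required if RBD images are to live on the erasure-coded pool.
ceph osd pool set ecpool allow_ec_overwrites true
```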
We also tested msgr v2 encryption to evaluate its impact on performance. Enabling encryption reduced read throughput from 1 TiB/s to around 750 GiB/s, while other workloads experienced a more modest decline. This provided the customer with valuable data on the trade-offs involved in using encryption.
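For those curious, msgr v2 over-the-wire encryption is enabled by switching the messenger modes from the default crc to secure, roughly as sketched below (monitor-specific variants of these options may also need to be set):

```bash
# Sketch: require encrypted msgr v2 connections instead of crc-only framing.
ceph config set global ms_cluster_mode secure
ceph config set global ms_service_mode secure
ceph config set global ms_client_mode secure
```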
The Outcome: A Milestone for Ceph Performance
By mid-January, the customer’s cluster was fully migrated to the new NVMe nodes, achieving unprecedented performance levels. The key numbers are as follows:
- **4MB Read Performance:** Reached up to 1025 GiB/s with 630 OSDs.
- **4MB Write Performance:** Achieved 387 GiB/s with 630 OSDs using erasure coding.
- **4K Random Read IOPS:** Peaked at 25.5 million IOPS.
These results represent the fastest single-cluster Ceph performance ever published. Clyso is now focused on addressing remaining issues like the laggy PG problem observed during high write loads and exploring further ways to improve IOPS scaling.
Conclusion: A New Benchmark for Ceph
The journey to 1 TiB/s was filled with challenges, from hardware limitations to software bugs, but the outcome demonstrated the immense potential of Ceph. The results set a new benchmark for Ceph performance and hint at possibilities for even higher throughput in the future.
For those interested in pushing the boundaries of Ceph performance, this project provides valuable insights and a roadmap for achieving similar results. If you have a faster cluster or want to collaborate, the team at Clyso is always open to discussions.
This milestone was only possible through the collaborative efforts of the Ceph community, the hardware vendors, and the dedicated team at Clyso. The future looks bright for Ceph as it continues to evolve and break new performance barriers.