Data Storage

Information on how data is stored and how data loss is prevented

All data in the Uni Cloud Münster is stored in a Ceph object store [1][2]. In each region, this store is distributed over many storage nodes, which also host virtual machines in a hyperconverged architecture. Different regions are completely independent, i.e. no automatic mirroring, synchronization or backup of data takes place between them. Several methods of storing data in a Ceph object store exist, from which the end user may select when creating virtual block devices or shares.
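In OpenStack-based clouds, this choice is typically exposed through the volume type (for virtual disks) or share type (for shares) chosen at creation time. The following is a minimal sketch of how the available volume types could be listed with the openstacksdk Python bindings; the cloud name “uni-cloud” is a placeholder for whatever entry exists in your clouds.yaml.

```python
import openstack

# Connect using an entry from clouds.yaml; "uni-cloud" is a placeholder name.
conn = openstack.connect(cloud="uni-cloud")

# Each volume type corresponds to one of the storage methods described below
# (replicated vs. erasure-coded, "gold" vs. "bronze").
for volume_type in conn.block_storage.types():
    print(volume_type.name)
```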

5x Replication

Replicated storage is only available for virtual disks. With this kind of storage, every block of data is stored redundantly on five disk drives in five different hosts. Virtual disks are put into read-only mode when quorum is lost, i.e. when 3 disk drives on different hosts have failed. Data loss can only occur in the case of a simultaneous, permanent and complete failure of 5 disk drives in 5 different servers, or in the unlikely case that the same data is damaged on 5 disk drives in 5 different servers. This storage mode should mostly be used for small to medium-sized amounts of data, as it comes with a large storage overhead of 400%.
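The figures above follow directly from the replication parameters. A plain worked calculation (not an API call); the quorum of 3 surviving copies is inferred from the read-only behaviour described above:

```python
replicas = 5      # every block is stored on 5 drives in 5 different hosts
min_replicas = 3  # quorum: with fewer surviving copies, the disk goes read-only

overhead = (replicas - 1) * 100                # 400 % raw-storage overhead
read_only_after = replicas - min_replicas + 1  # read-only after 3 failed drives
data_loss_after = replicas                     # data loss only after 5 failed drives

print(overhead, read_only_after, data_loss_after)  # 400 3 5
```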

8+3 Erasure-Coding

With 8+3 erasure coding [3], every block of data is split into 8 equally sized chunks, and an additional 3 chunks of parity information are computed. These 11 chunks are distributed to different servers using a hashing method [4]. Data is put into read-only mode if 3 disks have failed simultaneously. Data loss can only occur in the case of a simultaneous, permanent and complete failure of 4 drives in 4 different servers, or in the unlikely case that the same data is damaged on 4 disk drives in 4 different servers. This method of data storage is mostly used for large amounts of data, as it has a storage overhead of only 37.5%.
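The same calculation for the 8+3 profile shows where the 37.5% overhead and the failure thresholds come from:

```python
k, m = 8, 3                  # 8 data chunks plus 3 parity chunks per block

chunks = k + m               # 11 chunks, placed on 11 different servers
overhead = m / k * 100       # 3/8 = 37.5 % raw-storage overhead
read_only_after = m          # read-only once 3 drives have failed
data_loss_after = m + 1      # data loss only after 4 simultaneous failures

print(chunks, overhead, read_only_after, data_loss_after)  # 11 37.5 3 4
```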

Recovery Time

As soon as a failure is discovered, the data that was stored on the affected drive is automatically reconstructed from the remaining drives and redistributed across the cluster. The time window of reduced redundancy therefore stays small. Restoring full redundancy typically takes less than 12 hours for replicated data and less than 48 hours for erasure-coded data.

Gold and Bronze Storage

The Uni Cloud Münster provides HDD storage (“bronze”) as well as NVMe storage (“gold”). Large amounts of data should be stored in the “bronze” category, while small amounts of data with tight requirements on disk latency (especially databases on virtual disks) should use the “gold” category. The difference in latency between “bronze” and “gold” storage is smaller than one might expect: synchronous writes on virtual disks require acknowledgement from at least 3 disks, and the Ceph BlueStore architecture makes CPU processing a significant bottleneck, so latencies are considerably higher than raw NVMe performance would suggest. Furthermore, we place the write-ahead logs (WALs) for “bronze” storage on NVMe devices, which makes infrequent writes of small amounts of data quite fast.
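As an illustration, volumes in a given category could be created with openstacksdk as sketched below. This assumes the categories are exposed as volume types literally named “bronze” and “gold”; the actual type names in your project may differ, so check them first (e.g. with the listing shown earlier).

```python
import openstack

conn = openstack.connect(cloud="uni-cloud")  # placeholder cloud name

# Bulk data on HDD-backed ("bronze") storage.
bulk = conn.block_storage.create_volume(
    name="project-data",
    size=500,              # GiB
    volume_type="bronze",  # assumed type name
)

# Latency-sensitive database volume on NVMe-backed ("gold") storage.
db = conn.block_storage.create_volume(
    name="postgres-data",
    size=50,
    volume_type="gold",    # assumed type name
)

print(bulk.id, db.id)
```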

Snapshots and Backups

There are no automatic snapshots or backups of data, because usage scenarios differ greatly between projects. Depending on the use case, snapshots of virtual disks may be taken, data may be replicated within your databases, or manual backups of important data may be created. If you need to store data in more than one region or run virtual machines for replication, please request an additional project in the other region.
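For example, a point-in-time snapshot of a virtual disk could be taken with openstacksdk as in the sketch below; the volume and snapshot names are placeholders. Note that such snapshots live in the same region and the same Ceph cluster as the original disk, so they do not replace an independent backup.

```python
import openstack

conn = openstack.connect(cloud="uni-cloud")  # placeholder cloud name

# Look up the volume to protect; the name is an example.
volume = conn.block_storage.find_volume("postgres-data")

# Create a point-in-time snapshot of the virtual disk.
snapshot = conn.block_storage.create_snapshot(
    volume_id=volume.id,
    name="postgres-data-before-upgrade",
)
print(snapshot.id)
```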

Encryption

All data is stored encrypted on disk. This means that if a drive fails, third parties can only recover encrypted content from the failed drive. The encryption keys themselves are stored encrypted on five master nodes. We use LUKS full-disk encryption for all Ceph OSDs as well as for the operating system and other partitions.


  1. Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D., & Maltzahn, C. (2006, November). Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (pp. 307-320). USENIX Association.

  2. Weil, S. A., Leung, A. W., Brandt, S. A., & Maltzahn, C. (2007, November). RADOS: A scalable, reliable storage service for petabyte-scale storage clusters. In Proceedings of the 2nd International Workshop on Petascale Data Storage, held in conjunction with Supercomputing '07 (pp. 35-44). ACM.

  3. Plank, J. S., Simmerman, S., & Schuman, C. D. (2007). Jerasure: A library in C/C++ facilitating erasure coding for storage applications. Technical Report CS-07-603, University of Tennessee.

  4. Weil, S. A., Brandt, S. A., Miller, E. L., & Maltzahn, C. (2006, November). CRUSH: Controlled, scalable, decentralized placement of replicated data. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (pp. 31-31). IEEE.