Archival Storage

Description

We have several active projects in archival storage, all of which are contributing to the ability to build more efficient, reliable, and secure long-term storage systems.

  • Deep Store: building more efficient archival storage using deduplication to take advantage of intra-file and inter-file redundancy.
  • POTSHARDS: long-term secure storage, which allows the secure preservation of data for decades without relying upon traditional encryption to prevent information leakage.
  • Pergamum: long-term evolvable storage built from intelligent network-attached bricks with both disk and NVRAM such as flash.

Digital reference data is produced at ever higher rates, increasing storage requirements, while users demand ever shorter access times. On-line deep storage, with sub-second latency, is far faster than robot-loaded near-line media, which can take minutes to deliver data. Disk-based deep storage is becoming practical because magnetic disks are rapidly becoming as inexpensive as magnetic tape and optical storage, the traditional media for backup and archiving. The Deep Store architecture uses inter-file (differential) and intra-file (sliding dictionary) data compression to increase storage density, and adds distribution and redundancy to improve request bandwidth and robustness. We expect the resulting media costs to be much lower than those of traditional backup and archival storage.
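The difference between intra-file and inter-file compression can be illustrated with a minimal sketch. It assumes Python's zlib with a preset dictionary as a stand-in for a real differential encoder; Deep Store's own encoders are not described on this page, and the function names below are hypothetical.

```python
import zlib


def compress_alone(data: bytes) -> bytes:
    """Intra-file only: sliding-dictionary (DEFLATE) compression of one file."""
    return zlib.compress(data, 9)


def compress_against(reference: bytes, data: bytes) -> bytes:
    """Inter-file stand-in: DEFLATE primed with a similar file as a preset
    dictionary, so content shared with `reference` is encoded as back-references.
    This is an assumption for illustration, not Deep Store's delta encoder.
    Only the last 32 KiB of `reference` is usable by DEFLATE."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(data) + comp.flush()


def decompress_against(reference: bytes, blob: bytes) -> bytes:
    """Inverse of compress_against; the same reference file is required."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(blob) + decomp.flush()
```

When a new file differs from an already stored file by a small edit, compressing it against that file typically yields a far smaller result than compressing it alone, at the cost of needing the reference file to read it back.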

The goal of the POTSHARDS project is to securely preserve data by breaking it into pieces (shards) and spreading them across multiple archives so that no individual archive can reconstruct the data or even know which shards it must steal from other archives to rebuild the data. However, a user who gathers all of the shards must be able to reconstruct the original data with no additional information (including encryption keys). We accomplish this using multiple levels of secret splitting and approximate pointers that limit the space that must be searched for related shards while requiring an attacker to obtain exponential numbers of shards that may not be identified in advance. Because it relies on secret splitting, this approach has information-theoretic security, unlike encryption, which might be broken by advances in algorithms or computer hardware. We believe that this approach will become common as the need to securely store data for decades becomes more pressing.
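The core idea of secret splitting can be sketched in a few lines. This is a minimal n-of-n XOR split, chosen here for illustration; the page does not say which splitting schemes POTSHARDS uses at each level, and the sketch does not model approximate pointers. The function names are hypothetical.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split(secret: bytes, n: int) -> list:
    """Split `secret` into n shards; all n are needed to recover it.
    The first n-1 shards are uniformly random, and the last is the secret
    XORed with all of them, so any n-1 shards are jointly uniform and
    reveal nothing about the secret."""
    if n < 2:
        raise ValueError("need at least two shards")
    shards = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    shards.append(reduce(xor_bytes, shards, secret))
    return shards


def combine(shards: list) -> bytes:
    """Recover the secret by XORing every shard together; no key is involved."""
    return reduce(xor_bytes, shards)
```

Each shard would be sent to a different archive. Recovery needs only the shards themselves, which matches the requirement that no encryption key has to survive for decades.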

Pergamum was created to explore evolvable archival storage. The project's goal is to develop a long-term system that controls the major storage cost contributors: static, operational and management. Pergamum consists of a fully distributed network of intelligent storage devices. Each node, called a tome, consists of a SATA hard drive, a low-power processor, NVRAM and a standardized network interface. Reliability is provided through two levels of redundancy encoding: intra-tome redundancy handles latent sector errors, and inter-tome redundancy handles lost devices. By keeping most of the devices spun down and using commodity hardware, Pergamum provides cost efficiency on par with tape-based systems while providing superior random access performance. Further cost savings come from hierarchical consistency checking, staged rebuilds and NVRAM-based metadata stores, which reduce disk spin-ups and so save considerable energy.
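The two levels of redundancy can be sketched as follows. Single XOR parity and CRC32 checksums are swapped in here for simplicity; the page does not name Pergamum's actual codes, and the class and function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


class Tome:
    """One storage node: data blocks, per-block checksums, and local parity."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.crcs = [zlib.crc32(b) for b in self.blocks]
        self.parity = xor_blocks(self.blocks)  # intra-tome redundancy

    def scrub(self):
        """Detect a latent sector error by checksum and repair it locally,
        without waking any other tome. Returns the indices repaired."""
        bad = [i for i, b in enumerate(self.blocks)
               if zlib.crc32(b) != self.crcs[i]]
        if len(bad) > 1:
            raise RuntimeError("single parity cannot repair more than one block")
        for i in bad:
            others = [b for j, b in enumerate(self.blocks) if j != i]
            self.blocks[i] = xor_blocks(others + [self.parity])
        return bad


def inter_tome_parity(tomes):
    """Inter-tome redundancy: one parity block per stripe across all tomes."""
    return [xor_blocks(stripe) for stripe in zip(*(t.blocks for t in tomes))]


def rebuild_tome(survivors, parity_blocks):
    """Reconstruct one lost tome from the surviving tomes and the parity."""
    stripes = zip(*(t.blocks for t in survivors))
    return Tome(xor_blocks(list(stripe) + [p])
                for stripe, p in zip(stripes, parity_blocks))
```

The split matters for energy: a bad sector is fixed inside one tome, so other disks stay spun down, and only the loss of a whole device requires reading from other tomes.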

Status

We are currently developing a scalable system architecture that addresses new problems: searching for similar files in a very large corpus to improve compression, maximizing storage throughput, distributing a large system for throughput and reliability, and managing file similarity data for billions of files. Because archival (reference) data is immutable, content-based addressing can be used to identify and locate entire files or portions of files. Our work currently focuses on organizing similar files containing arbitrary data using data fingerprinting and summarization. We are characterizing reference data and determining the suitability of Deep Store for various problem domains, such as scientific computing, simulations, and enterprise and organizational computing. We have experimented with chunk-based storage (variable-sized blocks) and delta-encoded storage to evaluate the relative merits of each technique for storage efficiency, performance, and workload applicability.
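Chunk-based storage with content-based addressing can be sketched as below. The Gear-style rolling hash, the size limits, and SHA-256 as the content address are assumptions made for illustration; the page does not specify the chunking or fingerprinting algorithms used in Deep Store, and all names are hypothetical.

```python
import hashlib
import random

# Hypothetical parameters: a fixed table of random 32-bit values for the
# rolling hash, and a mask that ends a chunk when the top 8 hash bits are zero.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = 0xFF000000


def chunks(data: bytes, min_size: int = 64, max_size: int = 1024):
    """Yield variable-sized chunks whose boundaries depend on content,
    so an insertion early in a file need not shift every later boundary."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]  # final partial chunk


class ChunkStore:
    """Content-addressed store: each unique chunk is kept once, keyed by hash."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> list:
        """Store a file and return its recipe (ordered list of chunk keys)."""
        recipe = []
        for chunk in chunks(data):
            key = hashlib.sha256(chunk).hexdigest()
            self.blobs.setdefault(key, chunk)  # duplicate chunks cost nothing
            recipe.append(key)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild a file from its recipe."""
        return b"".join(self.blobs[key] for key in recipe)
```

Because stored data never changes, a chunk's hash serves as both its name and its integrity check, and a file is just a list of such names.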

We have also implemented a prototype POTSHARDS system, and have tested its performance on both local clusters and the PlanetLab wide-area testbed. We have demonstrated the ability to reconstruct data from just the shards stored in the system; while this can be done relatively quickly if all of the shards are present, it is impossible to do using just the shards from a single archive. We are currently exploring different redundancy techniques and approaches that will reduce the storage overhead while maintaining a high level of security and resistance to attack.

Last modified 10 May 2008