RSS feed RSS: Events | News | Papers

PDSI News @ UCSC

No recent news at this time.

PDSI Events @ UCSC

Scalable Metadata Management

Description

As the number and variety of files stored and accessed by users dramatically increases, existing file system structures have begun to fail as a mechanism for managing all of the information contained in those files. Many applications, such as email clients, multimedia management applications, and desktop search engines, have been forced to develop their own richer metadata infrastructures. While effective, these solutions are generally non-standard, non-portable, and potentially non-scalable. These issues suggest search, indexing, and information retrieval are becoming increasingly important areas for file and storage systems. In conjunction with faculty and students specializing in information retrieval at the UC Santa Cruz Department for Information Systems and Technology Management, we are developing system architectures that address these issues, which are scalable up to billions of files.

Status

Our current areas of focus are scalable indexing architectures for storage systems, improved file system namespaces, and incorporating concepts from information retrieval, such as faceted search, into file systems. This work is being done in collaboration with NetApp. Our new metadata search index, Spyglass, leverages the characteristics unique to storage systems, such as data distributions and hierarchical namespaces, to design new search and indexing algorithms. Our current design has search performance that can outperform basic DBMS-based solutions by up to four orders of magnitude, allows time-traveling queries over versioned metadata, and can efficiently re-crawl very large file systems. We are now exploring how similar techniques can be applied to full-text search in file systems.

We are also exploring new file system designs where search is first-class functionality rather than an after thought. The current approach of using a search index in addition to the file system's index requires two large and separate index structures to be maintained. This separation forces users and applications to access and update two structures when using their data. Our file system designs take a new approach to internal file system structures, layouts, and logging that are search optimized. Our new design can improve search performance, allow data layouts based on how files are queried, and improve efficiency by reducing the number of index structures that must be maintained. We are in the process of implementing these concepts in the Ceph distributed file system.

In addition, we are designing a file system query language, QUASAR, that allows users to have powerful semantic access to stored data. QUASAR allows semantic file system views and directories to be created, which provide more meaningful data representations. Inter-file relationships, such as provenance, can be expressed and searched through links. To aid browsing, we are investigating applying faceted search to QUASAR. Faceted search uses rich key-value metadata to allow users to interactively navigate the search space and can allow interfaces to be automatically personalized for each user.

Publications

2009

2008

2007


Last modified 22 Sep 2009
Home | Research | People | Publications | Seminars | Sponsors
  Site powered by Django