Using XtreemFS for Data Analysis with Hadoop

by Christoph Kleineweber (Zuse Institute Berlin)

Hadoop and its associated distributed file system, HDFS, have become important tools for analyzing huge amounts of data with the MapReduce programming model. Because HDFS does not support POSIX semantics, many users are forced to run a second storage infrastructure with a general-purpose file system. This increases the administration effort and requires transferring large files between the two file systems.

XtreemFS (http://www.xtreemfs.org) is a POSIX-compliant distributed file system developed by the Zuse Institute Berlin and published under the BSD license. XtreemFS has a Hadoop adapter, which allows it to replace HDFS for MapReduce jobs. After a general introduction to the XtreemFS architecture and its features, we will present the architecture of our Hadoop adapter and show how it deals with the storage requirements of Hadoop. Benchmarks show that a general-purpose file system can compete with the performance of HDFS while avoiding the administrative overhead of a second storage system.

About the author Christoph Kleineweber:

Christoph Kleineweber is one of the developers of the open-source distributed file system XtreemFS. He is a scientific staff member in the distributed algorithms group at the Zuse Institute Berlin (ZIB). His research interests are performance modeling and quality-of-service guarantees for distributed storage systems. Before joining ZIB, he obtained Bachelor's and Master's degrees in computer science from the University of Paderborn.