sherriconnett: ElephantDB, a Distributed Database for Working with Hadoop

Sunday, February 20, 2011

ElephantDB, a Distributed Database for Working with Hadoop

We first told you about ElephantDB earlier this year in our article Secrets of BackType's Data Engineers. But we didn't link to the GitHub repo, which has been making the rounds in the blogosphere for the past couple of days.

As a refresher, ElephantDB is a distributed database created by BackType to export data from Hadoop and serve it to analytics applications, APIs, etc.

A bit more detail from the README:

ElephantDB is a database that specializes in exporting key/value data from Hadoop. ElephantDB is composed of two components. The first is a library that is used in MapReduce jobs for creating an indexed key/value dataset that is stored on a distributed filesystem. The second component is a daemon that can download a subset of a dataset and serve it in a read-only, random-access fashion. A group of machines working together to serve a full dataset is called a ring.
Since ElephantDB server doesn't support random writes, it is almost laughingly simple. Once the server loads up its subset of the data, it does very little. This leads to ElephantDB being rock-solid in production, since there's almost no moving parts.

ElephantDB server has a Thrift interface, so any language can make reads from it. The database itself is implemented in Clojure.

An ElephantDB datastore contains a fixed number of shards of a "Local Persistence". ElephantDB's local persistence engine is pluggable, and ElephantDB comes bundled with a local persistence implementation for Berkeley DB Java Edition. On the MapReduce side, each reducer creates or updates a single shard into the DFS, and on the server side, each server serves a subset of the shards.
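To make the shard and ring idea concrete, here is a minimal Python sketch. It is not ElephantDB's code or API: the hash scheme, the round-robin shard assignment, and every name in it (`shard_for`, `build_shards`, `assign_shards`, `Ring`) are assumptions made for illustration, and plain dicts stand in for the Berkeley DB shards and the DFS.

```python
import hashlib


def shard_for(key: bytes, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(pairs, num_shards: int):
    """Stand-in for the MapReduce side: each 'reducer' fills one shard."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in pairs:
        shards[shard_for(key, num_shards)][key] = value
    return shards


def assign_shards(servers, num_shards: int):
    """Hypothetical round-robin mapping of shard index -> server name."""
    return {s: servers[s % len(servers)] for s in range(num_shards)}


class Ring:
    """A group of read-only servers that together hold every shard."""

    def __init__(self, servers, shards):
        self.num_shards = len(shards)
        self.assignment = assign_shards(servers, self.num_shards)
        # Each server loads only the shards assigned to it, once, up front.
        self.loaded = {
            name: {s: shards[s] for s, owner in self.assignment.items()
                   if owner == name}
            for name in servers
        }

    def get(self, key: bytes):
        """Route a read to the server holding the key's shard."""
        shard = shard_for(key, self.num_shards)
        owner = self.assignment[shard]
        return self.loaded[owner][shard].get(key)


if __name__ == "__main__":
    pairs = [(f"user{i}".encode(), i) for i in range(100)]
    ring = Ring(["node-a", "node-b", "node-c"], build_shards(pairs, 8))
    print(ring.get(b"user42"))
```

Because nothing is written after the shards are loaded, the read path is a hash, a lookup of the owning server, and a local fetch, which is the simplicity the README is pointing at.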

Also of note is Cascalog, a Clojure-based query language for processing data on Hadoop.