I had blogged a while ago about SQL coming to Hadoop, citing a GigaOm article. That article had also mentioned Cloudera's Impala product as one of the strong contenders in this area.
Cloudera Impala is an open source SQL query engine that operates directly on Hadoop data, so there is no need to extract the data into an RDBMS first. Cloudera also plans to support Business Intelligence tools.
Here is the original announcement of Impala from Cloudera:
Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
Interestingly, they mention in that announcement that Google's Dremel paper was one of the things that inspired them to create Impala.
I had blogged about Dremel earlier:
Drill by Apache, like Google Dremel
More on Google Dremel - Wired article
These are the key benefits and features of Cloudera Impala that I found interesting (excerpted from their page; emphasis mine):
Key Benefits of Impala
Speed to Insight
Perform interactive analytics directly on data stored in Hadoop. Get answers as quickly as you can ask questions, without the bottlenecks caused by data movement and jumping between data silos.
Reduce data movement as well as duplicate storage with specialized systems by performing interactive analysis directly on full fidelity data.
Full Fidelity Analysis
Ask questions of all your data - without loss of fidelity from aggregations or conforming to fixed schemas.
Leverage existing BI tools and employee skill sets (SQL) to interact with data stored in Hadoop.
Enable more users to interact with more data by providing a single repository and metadata store from source to analysis.
Leverage the same file and data formats, metadata, security and resource management frameworks you use for the rest of the Hadoop system.
Key Features of Impala
SQL queries on CDH in seconds
Native MPP query engine
Integration with leading BI tools
Support for HDFS and HBase
Support for a wide variety of file formats including text, SequenceFiles, Avro, RCFile, LZO and Parquet
In-memory data transfers
Leverages metadata, ODBC driver, SQL syntax and Beeswax GUI (in Hue) from Apache Hive
Fine-grained, role-based authorization with Sentry
100% open source (Apache licensed)
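To make the "existing SQL skills" point above a bit more concrete, here is a minimal sketch of the kind of interactive query one might run. It is not taken from Cloudera's material: the `page_views` table and the `top_pages` and `make_demo_connection` helpers are hypothetical, and Python's built-in `sqlite3` is used only as a stand-in so that the example runs without a cluster. With Impala, the connection would presumably come from a DB-API style client library or the ODBC driver mentioned in the feature list; I have not verified this exact query against Impala.

```python
import sqlite3


def top_pages(conn, limit=3):
    """Return (page, hits) rows for the most-viewed pages, busiest first.

    `conn` is any DB-API 2.0 style connection. The table name `page_views`
    is a made-up example, not something from Cloudera's documentation.
    """
    cur = conn.cursor()
    # The limit is coerced to int and inlined because parameter placeholder
    # styles differ between DB-API drivers.
    cur.execute(
        "SELECT page, COUNT(*) AS hits "
        "FROM page_views "
        "GROUP BY page "
        "ORDER BY hits DESC, page ASC "
        "LIMIT " + str(int(limit))
    )
    rows = cur.fetchall()
    cur.close()
    return rows


def make_demo_connection():
    """Build an in-memory SQLite database that stands in for Hadoop data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO page_views VALUES (?, ?)",
        [
            ("/home", 1), ("/home", 2), ("/home", 3),
            ("/docs", 1), ("/docs", 4),
            ("/about", 2),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_connection()
    for page, hits in top_pages(demo, limit=2):
        print(page, hits)
```

The query itself is plain SQL (GROUP BY, ORDER BY, LIMIT), which is the sort of thing an analyst or a BI tool already knows how to issue; only the connection setup would change when pointing it at a real engine.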
Here are a few related interesting posts:
Cloudera gets $65 mil more to grow Hadoop based Big Data offerings
On GigaOm: Cloudera makes SQL a first-class citizen in Hadoop
Cloudera Touts Near Linear Scalability with Impala
And here is a video of a technical deep dive into Cloudera Impala, on their site.
Read other posts about Big Data on my blog.
Check out this photo of an impala with cheetahs:
Excerpts from the Wikipedia page about the cheetah:
[ The cheetah is a large feline inhabiting most of Africa and parts of the Middle East. The cheetah can run faster than any other land animal— as fast as 112 to 120 km/h (70 to 75 mph) in short bursts, and has the ability to accelerate from 0 to 100 km/h (62 mph) in three seconds. ]
Maybe Cloudera should have named Impala "Cheetah" instead :)
- Vasudev Ram - Dancing Bison Enterprises