We bring to you some of the best data tools in the expanding Hadoop ecosystem.
Thursday, October 17, 2013: MapReduce was rolled out to address the limitations of traditional databases. Tools such as Giraph, Hama, and Impala in turn address the limitations of MapReduce running on Hadoop. Apart from these, graph, document, column, and other NoSQL databases may also be part of the mix. The options are expanding faster than you might think. Here is a closer look at each of them.
1. Apache Hadoop – As cited on infoworld.com, Hadoop refers to the MapReduce framework along with the project's core tools for data storage and processing. Few Apache projects could support even one heavily capitalized startup, whereas Hadoop supports many. Analysts estimate that the Hadoop market will be worth tens of billions of dollars per year.
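The MapReduce model at the heart of Hadoop can be sketched in a few lines of plain Python. This toy example is not Hadoop's actual API; it simply runs the map, shuffle, and reduce phases over an in-memory list to show what the framework distributes across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data tools", "big graph data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 2, "tools": 1, "graph": 1}
```

On a real cluster, each phase runs in parallel on many nodes; the logic per phase stays this simple.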
2. Apache Sqoop - This enables quick data transfers from relational database systems to Hadoop through concurrent connections, customizable mapping of data types, and metadata propagation. It allows you to import to HDFS, Hive, and HBase. It also exports results back to relational databases. It takes care of all the complexities present in the use of data connectors and mismatched data formats.
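A typical Sqoop import is driven from the command line. The sketch below assembles such a command in Python; the JDBC URL, table, and directory are placeholder values for illustration, and `--num-mappers` controls the number of concurrent connections mentioned above:

```python
def build_sqoop_import(jdbc_url, table, target_dir, mappers=4):
    """Assemble a `sqoop import` command as an argument list.

    The flags used here (--connect, --table, --target-dir,
    --num-mappers) are standard Sqoop 1.x options; the values
    passed in below are placeholders for illustration.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC URL of the source database
        "--table", table,               # relational table to import
        "--target-dir", target_dir,     # destination directory in HDFS
        "--num-mappers", str(mappers),  # parallel map tasks / connections
    ]

cmd = build_sqoop_import(
    "jdbc:mysql://dbhost/sales", "orders", "/data/orders")
# Run with subprocess.run(cmd) on a machine where Sqoop is installed.
```

Exports back to the relational side follow the same shape with `sqoop export` and `--export-dir`.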
3. Talend Open Studio for Big Data – This allows you to load files into Hadoop without the use of manual coding. The graphical IDE produces native Hadoop code that makes use of Hadoop's distributed environment for data transformations on a large scale.
4. Apache Giraph – This is a graph processing system designed for high scalability and high availability. It is the open source equivalent of Google’s Pregel and is used by Facebook for analyzing social graphs of users and their connections. The system sidesteps the inefficiency of using MapReduce to process graphs by implementing Pregel's Bulk Synchronous Parallel processing model.
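Giraph's Pregel-style model can be illustrated with a toy, single-machine sketch of BSP supersteps (this is plain Python, not Giraph's Java API): vertices exchange messages with their neighbours in synchronized rounds, and the computation halts on a superstep in which nothing changes. Here the example propagates the maximum value through a small cycle:

```python
def bsp_max_propagation(graph, values):
    """Toy BSP loop: each superstep, every vertex sends its value to
    its neighbours; a vertex updates itself when it receives a larger
    value. `graph` maps each vertex to a list of its neighbours."""
    values = dict(values)
    changed = True
    while changed:  # one iteration == one superstep
        changed = False
        # Message-passing phase: deliver each vertex's value to its neighbours.
        inbox = {v: [] for v in graph}
        for v, neighbours in graph.items():
            for n in neighbours:
                inbox[n].append(values[v])
        # Compute phase: each vertex applies its incoming messages.
        for v, msgs in inbox.items():
            best = max(msgs, default=values[v])
            if best > values[v]:
                values[v] = best
                changed = True
    return values

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
result = bsp_max_propagation(graph, {"a": 3, "b": 7, "c": 1})
# result == {"a": 7, "b": 7, "c": 7}
```

In Giraph itself, the supersteps run across a cluster and the per-vertex compute function is user-supplied, but the round-based structure is the same.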
5. Apache Hama – This is very much like Giraph and brings Bulk Synchronous Parallel processing to the Hadoop ecosystem and runs on top of the Hadoop Distributed File System. But where Giraph is focused exclusively on graph processing, Hama is a generalized framework for performing massive matrix and graph computations.
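The kind of massive matrix computation Hama targets comes down to partitioning a matrix and combining partial results. A minimal single-machine sketch (plain Python, not Hama's API) multiplies two matrices row-block by row-block, the way work would be split across BSP peers:

```python
def matmul_rows(block, b):
    """Multiply a horizontal block of A by the full matrix B."""
    return [[sum(a_ik * b[k][j] for k, a_ik in enumerate(row))
             for j in range(len(b[0]))]
            for row in block]

def blocked_matmul(a, b, block_size=2):
    """Split A into row blocks, compute each partial product
    independently (as separate BSP tasks would), and concatenate
    the partial results in order."""
    result = []
    for i in range(0, len(a), block_size):
        result.extend(matmul_rows(a[i:i + block_size], b))
    return result

a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
product = blocked_matmul(a, b)
# product == [[25, 28], [57, 64], [89, 100]]
```

Because each row block depends only on B, the blocks can be computed on different machines and synchronized once at the end, which is exactly the pattern BSP formalizes.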
6. Cloudera Impala - The Impala engine sits atop all the data nodes in your Hadoop cluster, listening for queries. After parsing each query and optimizing an execution plan, it coordinates parallel processing among the worker nodes in the cluster, resulting in low-latency SQL queries across Hadoop and a near-real-time view into big data. As Impala can be accessed from any ODBC/JDBC source, it is a good companion for BI packages like Pentaho.
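Python clients for Impala (such as impyla) follow the standard DB-API pattern of connect, cursor, execute, fetch. The sketch below uses an in-memory SQLite database purely as a stand-in for a live Impala connection so that it can run anywhere; with impyla you would obtain the connection from `impala.dbapi.connect(...)` instead, and the SQL would run against tables in HDFS or HBase:

```python
import sqlite3

# Stand-in connection for this sketch. With impyla it would be roughly:
#   from impala.dbapi import connect
#   conn = connect(host="impala-host", port=21050)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Sample table standing in for data already living in the cluster.
cur.execute("CREATE TABLE pageviews (url TEXT, hits INTEGER)")
cur.executemany("INSERT INTO pageviews VALUES (?, ?)",
                [("/home", 120), ("/docs", 80), ("/home", 40)])

# The SQL itself is the shape of query Impala answers with low latency.
cur.execute("SELECT url, SUM(hits) FROM pageviews "
            "GROUP BY url ORDER BY SUM(hits) DESC")
rows = cur.fetchall()
# rows == [("/home", 160), ("/docs", 80)]
```

This DB-API uniformity is why BI tools can treat Impala like any other ODBC/JDBC database.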
7. Serengeti – This allows you to spin up Hadoop clusters dynamically on a shared server infrastructure. It leverages the Apache Hadoop Virtualization Extensions, created and contributed by VMware, which make Hadoop ready for virtualization. You can use this tool to deploy Hadoop cluster environments quickly without giving up configuration options like node placement, HA status, or job scheduling.
8. Apache Drill – This is designed for low-latency interactive analysis of big data sets. It supports multiple data sources, including HBase, Cassandra, and MongoDB, along with traditional relational databases. By drawing on Hadoop for storage, it can achieve very high data throughput.
9. Gephi – This is developed by a consortium of academics, corporations, and individuals. It is a visualization and exploration tool supporting many graph types and networks as large as one million nodes.
10. Neo4j – This is a very fast graph database that can be used in many ways, including social applications, recommendation engines, fraud detection, resource authorization, and data center network management. It has made steady progress, accompanied by performance improvements and better clustering/HA support.
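The recommendation-engine use case maps naturally onto a graph traversal. As a plain-Python illustration of the kind of query Neo4j would run (in Cypher it might read something like `MATCH (me)-[:FRIEND]->()-[:FRIEND]->(fof) ...`), the sketch below recommends friends-of-friends who are not already direct friends, ranked by mutual friends; the social graph is made up for the example:

```python
from collections import Counter

def recommend_friends(friends, person):
    """Suggest friends-of-friends, ranked by how many mutual friends
    link them to `person`. `friends` maps a name to the set of that
    person's direct friends."""
    direct = friends[person]
    scores = Counter()
    for friend in direct:
        for fof in friends.get(friend, set()):
            if fof != person and fof not in direct:
                scores[fof] += 1  # one more mutual friend found
    return [name for name, _ in scores.most_common()]

# Hypothetical social graph for illustration only.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}
suggestions = recommend_friends(friends, "alice")
# suggestions == ["dave", "erin"]  (dave has two mutual friends, erin one)
```

A graph database makes this traversal fast at scale because relationships are first-class and do not require join tables.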