Intel Software Adrenaline

Optimizing Java* and Apache Hadoop* for Intel® Architecture

Apache Hadoop Architecture and Java Components

Built on Java, Apache Hadoop is a scalable, distributed data storage and analytics engine that can store and analyze massive amounts of unstructured and semi-structured data. The distributed architecture of Apache Hadoop comprises multiple nodes, which are individual servers that run an off-the-shelf operating system and the Apache Hadoop software. This design lets Apache Hadoop manipulate large datasets across clusters of commodity hardware using a simple programming model.

Three core components written in Java form the foundation of Apache Hadoop:

• Apache Hadoop Distributed File System* (HDFS*): HDFS provides a high-performance file system that sits on top of an Apache Hadoop node's native file system and can span and replicate data across the nodes in the cluster. The file system can scale from a single node to thousands of nodes and provides resiliency and performance for large datasets.

• MapReduce: A processing framework that enables large-scale analytics across distributed, unstructured data. Apache Hadoop breaks down analytics jobs into smaller tasks that are then executed by MapReduce on individual nodes, as close as possible to the data being analyzed.

• Apache Hadoop Common: This component provides utilities that tie HDFS and MapReduce together, including Java Archive* (JAR) files, startup scripts, source code, and documentation.

One of the performance objectives of Apache Hadoop is to analyze data on the same node where the data resides. A typical Apache Hadoop cluster comprises a single master node that keeps track of analytics jobs and data location, plus multiple slave nodes that store the data and perform the actual analytics tasks.

When a client sends an analytics request to the master node, the request is processed by two Java services:

• NameNode, a directory component of HDFS that keeps track of where data is located within each of the cluster's slave nodes.

• JobTracker, which breaks the request down into multiple smaller tasks based on which slave nodes contain the data, and then assigns these tasks to those specific slave nodes.

These two services then communicate with additional Java services on the slave nodes. Each slave node utilizes three services that communicate with the master node and perform the actual analytics:

• DataNode, a component of HDFS that stores and retrieves data on the slave node at the request of NameNode.

• TaskTracker, which receives MapReduce tasks from JobTracker and runs each task within a separate JVM.

• MapReduce, the analytic service that performs the analytics on the data.

Figure 1: Overview of Apache Hadoop* architecture and components. The master node runs NameNode and JobTracker; each slave node runs DataNode, TaskTracker, and MapReduce. On every node, these Java services run on top of the node's operating system and hardware.
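The map-then-reduce pattern that the MapReduce framework applies across slave nodes can be sketched in plain Java. This is a minimal conceptual sketch with no Hadoop dependency: the class and method names are illustrative assumptions, not Hadoop's API. Each simulated "map task" counts words in one data block (standing in for a block stored on a slave node), and the "reduce" step merges the partial results into a final answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of the MapReduce model (not the Hadoop API):
// per-block map tasks followed by a merging reduce step.
public class MapReduceSketch {

    // Map phase: one task per data block, emitting local (word, count) pairs.
    static Map<String, Integer> mapTask(String block) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : block.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce phase: merge the per-block partial counts into a final result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two "blocks" standing in for data stored on different slave nodes.
        List<String> blocks = List.of("big data big cluster", "big analytics");
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String block : blocks) {
            partials.add(mapTask(block)); // in Hadoop, runs on the node holding the block
        }
        System.out.println(reduce(partials).get("big")); // 3
    }
}
```

In real Apache Hadoop, the framework handles the splitting, scheduling, and merging automatically; the developer supplies only the map and reduce logic.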
