Conventionally, achieving parallel distributed processing of Big Data requires complicated programs that handle synchronization and similar concerns. Under Hadoop, there is no need to consider parallel distributed processing when creating programs. The user simply writes two pieces of processing in accordance with the MapReduce algorithm: one that performs Map processing and one that performs Reduce processing. The distributed storage and extraction of data, and the parallel execution of the created processing, are all left to Hadoop.
The applications for performing processing under Hadoop include the following types:
MapReduce application
Java programs that operate in the Hadoop MapReduce framework are developed using the Hadoop API.
Hive query
These are queries written in HiveQL, an SQL-like language provided by Apache Hive (developed by the Apache Software Foundation), rather than written directly against the Hadoop API.
Pig script
As with Hive, the Hadoop API is not used directly. These scripts are written in the Pig Latin language, which provides functionality equivalent to that of SQL.
HBase application
The HBase API is used to develop Java programs that input and output HBase data and operate on the data stored in HBase.
The development of MapReduce applications is described below. Refer to the Apache Hadoop project website and related resources for information on developing the other application types.
Design the application processing logic. Processing such as input file splitting and merging, which must be designed explicitly under conventional parallel distributed processing, does not need to be designed here because it is executed by the Hadoop framework. Developers can therefore concentrate on designing the logic required for their jobs.
For MapReduce applications, the application developer must understand the Hadoop API and design applications in accordance with the MapReduce framework. The main design tasks required are:
Determining items corresponding to Key and Value
Content of Map processing
Content of Reduce processing
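The design tasks above can be sketched with the standard word count example, written here in plain Java without the Hadoop API so that the three decisions stand out: the Key is a word, the Value is an occurrence count, Map processing emits a (word, 1) pair per word, and Reduce processing sums the values for each key. The class and method names are illustrative only; in a real job the splitting, distribution, and shuffle steps simulated in main are performed by the Hadoop framework.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Map processing: for each input line, emit (word, 1) pairs.
    // Design decision: Key = word, Value = occurrence count.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce processing: sum all the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] input = { "apache hadoop", "hadoop mapreduce" };

        // In a real job, splitting the input, running Map tasks in
        // parallel, and grouping the emitted pairs by key (the shuffle)
        // are all handled by the Hadoop framework; here they are
        // simulated in-memory for illustration.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

Running this prints each word with its total count (here, "hadoop" appears twice). The same key/value and Map/Reduce decisions carry over unchanged when the logic is rewritten against the Hadoop API classes.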
Create the application based on the design results.
MapReduce applications
As with the creation of ordinary Java applications, create a Java project and perform coding.
The Hadoop API can be used by adding the Hadoop jar file to the Eclipse build path.
Note that specification of hadoop-core-xxx.jar is mandatory (enter the Hadoop version at "xxx"). Specify other Hadoop libraries as appropriate for the Hadoop APIs being used.
Refer to the following information provided by Apache Hadoop for MapReduce application references:
If the Hadoop API is used
Hadoop project top page
http://hadoop.apache.org/
Getting Started
Release page
Documentation
MapReduce project page
http://hadoop.apache.org/mapreduce/
MapReduce tutorial
http://hadoop.apache.org/common/docs/r1.0.1/mapred_tutorial.html
Note: The samples and explanations in the above tutorial are based on the "org.apache.hadoop.mapred" packages, which are not recommended for the current stable Hadoop version. Use the tutorial for reference only, since some of its content does not apply.