Here I am sharing my experience of setting up a Hadoop cluster on Ubuntu 12.04 LTS machines and running MapReduce programs on it, which was one of my assignments in the Evolving Architecture Lab.
A. Introduction to Hadoop
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
To set up Hadoop, we can choose one of the following available options:
- Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
- MapR http://www.mapr.com/
- CDH4 (Cloudera) http://www.cloudera.com/content/support/en/documentation/cdh4-documentation/cdh4-documentation-v4-latest.html
- Apache Hadoop http://hadoop.apache.org/
I have chosen Apache Hadoop for setting up the cluster.
What Is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
B. Setting up a single-node Hadoop cluster
Prerequisites:
1. Install Java JDK 7
1. Download the Java JDK (https://www.dropbox.com/s/h6bw3tibft3gs17/jdk-7u21-linux-x64.tar.gz)
2. Extract the file: tar -xvf jdk-7u21-linux-x64.tar.gz
3. Now move the JDK 7 directory to /usr/lib
1. sudo mkdir -p /usr/lib/jvm
2. sudo mv ./jdk1.7.0_21 /usr/lib/jvm/jdk1.7.0
4. Now run
1. sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1
2. sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1
3. sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1
5. Correct the file ownership and the permissions of the executables:
1. sudo chmod a+x /usr/bin/java
2. sudo chmod a+x /usr/bin/javac
3. sudo chmod a+x /usr/bin/javaws
4. sudo chown -R root:root /usr/lib/jvm/jdk1.7.0
6. Check the version of your new JDK 7 installation: java -version
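If more than one Java version ends up installed on the machine, you can also choose which one is the system default interactively (optional):
sudo update-alternatives --config java
sudo update-alternatives --config javac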
2. Install SSH Server
1. sudo apt-get install openssh-client
2. sudo apt-get install openssh-server
3. Adding a dedicated Hadoop system user
- sudo addgroup hadoop
- sudo adduser --ingroup hadoop hduser
4. Configure SSH
1. su - hduser
2. ssh-keygen -t rsa -P ""
3. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. ssh localhost
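Hadoop's control scripts use SSH to start and stop the daemons, so hduser must be able to reach localhost without being asked for a password. A quick sanity check (the echoed message is just illustrative):
ssh localhost 'echo "passwordless SSH is working"'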
5. Disabling IPv6
Run the following command in the run dialog (Alt + F2):
1. gksudo gedit /etc/sysctl.conf
Add the following lines to the bottom of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save the file and close it.
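To apply the change without waiting for the reboot and to verify it, you can run the following; a value of 1 means IPv6 has been disabled:
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6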
6. Restart your Ubuntu machine.
Hadoop Installation and Configuration
1. Download Apache Hadoop 1.1.2 (https://www.dropbox.com/s/znonl6ia1259by3/hadoop-1.1.2.tar.gz) and store it in the Downloads folder
2. Unzip the file (open up the terminal window)
1. cd Downloads
2. sudo tar xzf hadoop-1.1.2.tar.gz
3. cd /usr/local
4. sudo mv /home/hduser/Downloads/hadoop-1.1.2 hadoop
5. sudo addgroup hadoop
6. sudo chown -R hduser:hadoop hadoop
Open your .bashrc file via the run dialog (Alt + F2):
1. gksudo gedit .bashrc
Add the following lines to the bottom of the file:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export PIG_HOME=/usr/local/pig
export PIG_CLASSPATH=/usr/local/hadoop/conf

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$PIG_HOME/bin
Save the .bashrc file and close it.
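To pick up the new variables in the current terminal session (instead of opening a new one), reload the file and do a quick sanity check; hadoop version should report release 1.1.2:
source $HOME/.bashrc
echo $HADOOP_HOME
hadoop version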
Run
1. gksudo gedit /usr/local/hadoop/conf/hadoop-env.sh
Add the following lines
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/
Save and close the file.
In the terminal window, create a directory and set the required ownership and permissions:
1. sudo mkdir -p /app/hadoop/tmp
2. sudo chown hduser:hadoop /app/hadoop/tmp
3. sudo chmod 750 /app/hadoop/tmp
Run
1. gksudo gedit /usr/local/hadoop/conf/core-site.xml
Add the following between the
<configuration> … </configuration> tags
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority
  determine the FileSystem implementation. The uri's scheme determines the config
  property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's
  authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
Save and close the file.
Run
1. gksudo gedit /usr/local/hadoop/conf/mapred-site.xml
Add the following between the
<configuration> … </configuration> tags
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map and reduce task.
  </description>
</property>
Save and close the file.
Run
1. gksudo gedit /usr/local/hadoop/conf/hdfs-site.xml
Add the following between the
<configuration> … </configuration> tags
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if replication
  is not specified in create time.</description>
</property>
Format the HDFS
1. /usr/local/hadoop/bin/hadoop namenode -format
Press the Start button and type Startup Applications. Add an application with the following command:
1. /usr/local/hadoop/bin/start-all.sh
Restart Ubuntu and log in.
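After logging back in as hduser, you can confirm that the Hadoop daemons are running with the jps tool that ships with the JDK (it is called via its full path here because only java, javac and javaws were registered with update-alternatives above; the process IDs will differ on your machine):
su - hduser
$JAVA_HOME/bin/jps
The list should include NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.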
Running a MapReduce job
We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
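The steps below assume that /tmp/input on the local file system already contains some plain-text files. Any text files will do; as a minimal example, you could reuse the text files that ship with the Hadoop distribution (the exact file names are an assumption here):
mkdir -p /tmp/input
cp /usr/local/hadoop/*.txt /tmp/input/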
1. Login to hduser
su - hduser
2. Restart the Hadoop cluster
/usr/local/hadoop/bin/start-all.sh
3. Copy local example data to HDFS
cd /usr/local/hadoop
bin/hadoop dfs -copyFromLocal /tmp/input /user/hduser/input
4. Run the MapReduce job
bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/input /user/hduser/output
5. Retrieve the job result from HDFS
mkdir /tmp/gutenberg-output
bin/hadoop dfs -getmerge /user/hduser/output /tmp/gutenberg-output
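To take a quick look at the result, print the first lines of the merged local file (getmerge names it after the source directory, so the local path below is an assumption based on that behaviour), or read the output directly from HDFS; the exact part-file name may differ:
head /tmp/gutenberg-output/output
bin/hadoop dfs -cat /user/hduser/output/part-r-00000 | head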
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
These web interfaces provide concise information about what's happening in your Hadoop cluster.
C. Setting up a multi-node Hadoop cluster
Similarly, you can set up a multi-node cluster and run your job. For a multi-node cluster setup guide, you can visit here.