Thursday, April 24, 2014

Setting up a Hadoop Cluster and Running MapReduce Programs

Here I am sharing my experience of setting up a Hadoop cluster on Ubuntu 12.04 LTS machines and running MapReduce programs on it, which was one of my assignments in the Evolving Architecture Lab.

A. Introduction to Hadoop

 

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

To set up Hadoop, we can choose one of the following available options:
  1. Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
  2. MapR http://www.mapr.com/
  3. CDH4 (Cloudera) http://www.cloudera.com/content/support/en/documentation/cdh4-documentation/cdh4-documentation-v4-latest.html
  4. Apache Hadoop http://hadoop.apache.org/
I have chosen Apache Hadoop for setting up the cluster.

What Is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

 B. Setting up a single-node Hadoop cluster

Prerequisites

1. Install Java JDK 7

1.      Download the Java JDK (https://www.dropbox.com/s/h6bw3tibft3gs17/jdk-7u21-linux-x64.tar.gz)

2.      Extract the archive: tar -xvf jdk-7u21-linux-x64.tar.gz 

3.      Now move the JDK 7 directory to /usr/lib 

1.      sudo mkdir -p /usr/lib/jvm 

2.      sudo mv ./jdk1.7.0_21 /usr/lib/jvm/jdk1.7.0 

4.      Now run 

1.      sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1 

2.      sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1 

3.      sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1 

5.      Correct the file ownership and the permissions of the executables: 

1.      sudo chmod a+x /usr/bin/java 

2.      sudo chmod a+x /usr/bin/javac 

3.      sudo chmod a+x /usr/bin/javaws 

4.      sudo chown -R root:root /usr/lib/jvm/jdk1.7.0 

6.      Check the version of your new JDK 7 installation: java -version  

 

 2. Install SSH Server 

1.      sudo apt-get install openssh-client

2.      sudo apt-get install openssh-server

3. Add a dedicated Hadoop system user

  1. sudo addgroup hadoop
  2. sudo adduser --ingroup hadoop hduser 
4. Configure SSH 

1.      su - hduser

2.      ssh-keygen -t rsa -P ""

3.      cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 

4.      ssh localhost 

 5. Disable IPv6 – 
Run the following command in the extended terminal (Alt + F2) 

1.      gksudo gedit /etc/sysctl.conf 

Add the following lines to the bottom of the file
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
 Save the file and close it.
 
6. Restart your Ubuntu 
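
After the reboot, you can quickly check whether IPv6 has really been disabled by reading the corresponding kernel setting:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6
# prints 1 when IPv6 has been disabled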

Hadoop Installation and Configuration 

1. Download Apache Hadoop 1.1.2 (https://www.dropbox.com/s/znonl6ia1259by3/hadoop-1.1.2.tar.gz) and store it in the Downloads folder 

2. Extract the archive (open up a terminal window) 

1.      cd Downloads

2.      sudo tar xzf hadoop-1.1.2.tar.gz 

3.      cd /usr/local 

4.      sudo mv /home/hduser/Downloads/hadoop-1.1.2 hadoop 

5.      sudo addgroup hadoop 

6.      sudo chown -R hduser:hadoop hadoop

Open your .bashrc file in the extended terminal (Alt + F2) 

1.      gksudo gedit .bashrc

Add the following lines to the bottom of the file: 

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export PIG_HOME=/usr/local/pig
export PIG_CLASSPATH=/usr/local/hadoop/conf

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$PIG_HOME/bin
 Save the .bashrc file and close it. 
 Run 

1.      gksudo gedit /usr/local/hadoop/conf/hadoop-env.sh

Add the following lines 

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/

Save and close the file.

 In the terminal window, create a directory and set the required ownership and permissions: 

1.      sudo mkdir -p /app/hadoop/tmp

2.      sudo chown hduser:hadoop /app/hadoop/tmp

3.      sudo chmod 750 /app/hadoop/tmp

Run

1.      gksudo gedit /usr/local/hadoop/conf/core-site.xml 

 Add the following between the <configuration> … </configuration> tags 

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
Save and close the file. 
 Run 

1.      gksudo gedit /usr/local/hadoop/conf/mapred-site.xml 

Add the following between the <configuration> … </configuration> tags 

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
</description>
</property>
Save and close the file. 
Run 

1.      gksudo gedit /usr/local/hadoop/conf/hdfs-site.xml 

Add the following between the <configuration> … </configuration> tags 

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Format the HDFS 

1.      /usr/local/hadoop/bin/hadoop namenode -format

Open the Dash and type Startup Applications 
Add an application with the following command: 

1.      /usr/local/hadoop/bin/start-all.sh 

Restart Ubuntu and login
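
Once you have logged back in as hduser, a simple way to confirm that all Hadoop daemons came up is the jps tool that ships with the JDK (use the full path, since only java, javac and javaws were linked into /usr/bin above):

/usr/lib/jvm/jdk1.7.0/bin/jps
# a healthy single-node cluster shows NameNode, DataNode, SecondaryNameNode,
# JobTracker, TaskTracker and Jps itself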

Running a MapReduce job

We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is a set of text files and the output is a set of text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
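
Step 3 below copies the input from the local directory /tmp/input, so a few plain-text files need to exist there first. Any text files will do; as a minimal sketch, you could reuse the text files that ship with the Hadoop tarball (assuming the install location used above):

# create the local input directory and fill it with some plain-text files
mkdir -p /tmp/input
cp /usr/local/hadoop/*.txt /tmp/input/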

1. Login to hduser 
su - hduser  

2. Restart the Hadoop cluster  
/usr/local/hadoop/bin/start-all.sh 

3. Copy local example data to HDFS
cd /usr/local/hadoop 
bin/hadoop dfs -copyFromLocal /tmp/input /user/hduser/input 
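
You can verify that the files arrived in HDFS before starting the job:

bin/hadoop dfs -ls /user/hduser/input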

4. Run the MapReduce job
bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/input /user/hduser/output 
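
While the job is running, you can open a second terminal and check its progress from the command line:

cd /usr/local/hadoop
bin/hadoop job -list    # lists the currently running jobs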

5. Retrieve the job result from HDFS
mkdir /tmp/gutenberg-output
bin/hadoop dfs -getmerge /user/hduser/output /tmp/gutenberg-output 
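
To have a quick look at the result, you can also inspect the output directly in HDFS (the exact name of the part file may differ):

bin/hadoop dfs -ls /user/hduser/output
bin/hadoop dfs -cat /user/hduser/output/part-r-00000 | head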
 

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
  • http://localhost:50070/ – web UI of the NameNode daemon
  • http://localhost:50030/ – web UI of the JobTracker daemon
  • http://localhost:50060/ – web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster.

 C. Setting up a multi-node Hadoop cluster

Similarly, you can set up a multi-node cluster and run your job. For a multi-node cluster setup guide, you can visit here.

That's all about the Hadoop cluster setup; I hope it will be helpful to beginners of Hadoop!


 

 

