Thursday, April 24, 2014

Setting up a Hadoop Cluster and Running MapReduce Programs

Here I am sharing my experience of setting up a Hadoop cluster on Ubuntu 12.04 LTS machines and running MapReduce programs on it, which was one of my assignments in the Evolving Architecture Lab.

A. Introduction to Hadoop

 

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

To set up Hadoop, we can choose one of the following available options:
  1. Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
  2. MapR http://www.mapr.com/
  3. CDH4 (Cloudera) http://www.cloudera.com/content/support/en/documentation/cdh4-documentation/cdh4-documentation-v4-latest.html
  4. Apache Hadoop http://hadoop.apache.org/
I have chosen Apache Hadoop for setting up the cluster.

What Is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

 B. Setting up a single-node Hadoop cluster

Prerequisites

1. Install Java JDK 7

1.      Download the Java JDK (https://www.dropbox.com/s/h6bw3tibft3gs17/jdk-7u21-linux-x64.tar.gz)

2.      Extract the archive: tar -xvf jdk-7u21-linux-x64.tar.gz 

3.      Now move the JDK 7 directory to /usr/lib 

1.      sudo mkdir -p /usr/lib/jvm 

2.      sudo mv ./jdk1.7.0_21 /usr/lib/jvm/jdk1.7.0 

4.      Now run 

1.      sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1 

2.      sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1 

3.      sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1 

5.      Correct the file ownership and the permissions of the executables: 

1.      sudo chmod a+x /usr/bin/java 

2.      sudo chmod a+x /usr/bin/javac 

3.      sudo chmod a+x /usr/bin/javaws 

4.      sudo chown -R root:root /usr/lib/jvm/jdk1.7.0 

6.      Check the version of your new JDK 7 installation: java -version  

 

 2. Install SSH Server 

1.      sudo apt-get install openssh-client

2.      sudo apt-get install openssh-server

3. Add a dedicated Hadoop system user

  1. sudo addgroup hadoop
  2. sudo adduser --ingroup hadoop hduser 
4. Configure SSH 

1.      su - hduser

2.      ssh-keygen -t rsa -P ""

3.      cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 

4.      ssh localhost 

 5. Disable IPv6 – 
Run the following command in the extended terminal (Alt + F2) 

1.      gksudo gedit /etc/sysctl.conf 

Add the following lines to the bottom of the file
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
 Save the file and close it.
 
6. Restart your Ubuntu 
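
After the reboot, you can quickly check whether IPv6 has really been disabled by reading the corresponding kernel setting:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6
# prints 1 when IPv6 has been disabled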

Hadoop Installation and Configuration 

1. Download Apache Hadoop 1.1.2 (https://www.dropbox.com/s/znonl6ia1259by3/hadoop-1.1.2.tar.gz) and store it in the Downloads folder 

2. Extract the archive (open up a terminal window) 

1.      cd Downloads

2.      sudo tar xzf hadoop-1.1.2.tar.gz 

3.      cd /usr/local 

4.      sudo mv /home/hduser/Downloads/hadoop-1.1.2 hadoop 

5.      sudo addgroup hadoop 

6.      sudo chown -R hduser:hadoop hadoop

Open your .bashrc file in the extended terminal (Alt + F2) 

1.      gksudo gedit .bashrc

Add the following lines to the bottom of the file: 

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export PIG_HOME=/usr/local/pig
export PIG_CLASSPATH=/usr/local/hadoop/conf

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$PIG_HOME/bin
 Save the .bashrc file and close it. 
 Run 

1.      gksudo gedit /usr/local/hadoop/conf/hadoop-env.sh

Add the following lines 

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/

Save and close the file.

 In the terminal window, create a directory and set the required ownership and permissions: 

1.      sudo mkdir -p /app/hadoop/tmp

2.      sudo chown hduser:hadoop /app/hadoop/tmp

3.      sudo chmod 750 /app/hadoop/tmp

Run

1.      gksudo gedit /usr/local/hadoop/conf/core-site.xml 

 Add the following between the <configuration> … </configuration> tags 

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
Save and close the file. 
 Run 

1.      gksudo gedit /usr/local/hadoop/conf/mapred-site.xml 

Add the following between the <configuration> … </configuration> tags 

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
</description>
</property>
Save and close the file. 
Run 

1.      gksudo gedit /usr/local/hadoop/conf/hdfs-site.xml 

Add the following between the <configuration> … </configuration> tags 

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Format the HDFS 

1.      /usr/local/hadoop/bin/hadoop namenode -format

Open the Dash and type Startup Applications 
Add an application with the following command: 

1.      /usr/local/hadoop/bin/start-all.sh 

Restart Ubuntu and login
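
Once you have logged back in as hduser, a simple way to confirm that all Hadoop daemons came up is the jps tool that ships with the JDK (use the full path, since only java, javac and javaws were linked into /usr/bin above):

/usr/lib/jvm/jdk1.7.0/bin/jps
# a healthy single-node cluster shows NameNode, DataNode, SecondaryNameNode,
# JobTracker, TaskTracker and Jps itself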

Running a MapReduce job

We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is a set of text files and the output is a set of text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
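
Step 3 below copies the input from the local directory /tmp/input, so a few plain-text files need to exist there first. Any text files will do; as a minimal sketch, you could reuse the text files that ship with the Hadoop tarball (assuming the install location used above):

# create the local input directory and fill it with some plain-text files
mkdir -p /tmp/input
cp /usr/local/hadoop/*.txt /tmp/input/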

1. Login to hduser 
su - hduser  

2. Restart the Hadoop cluster  
/usr/local/hadoop/bin/start-all.sh 

3. Copy local example data to HDFS
cd /usr/local/hadoop 
bin/hadoop dfs -copyFromLocal /tmp/input /user/hduser/input 
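
You can verify that the files arrived in HDFS before starting the job:

bin/hadoop dfs -ls /user/hduser/input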

4. Run the MapReduce job
bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/input /user/hduser/output 
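
While the job is running, you can open a second terminal and check its progress from the command line:

cd /usr/local/hadoop
bin/hadoop job -list    # lists the currently running jobs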

5. Retrieve the job result from HDFS
mkdir /tmp/gutenberg-output
bin/hadoop dfs -getmerge /user/hduser/output /tmp/gutenberg-output 
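
To have a quick look at the result, you can also inspect the output directly in HDFS (the exact name of the part file may differ):

bin/hadoop dfs -ls /user/hduser/output
bin/hadoop dfs -cat /user/hduser/output/part-r-00000 | head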
 

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
  • http://localhost:50070/ – web UI of the NameNode daemon
  • http://localhost:50030/ – web UI of the JobTracker daemon
  • http://localhost:50060/ – web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster.

 C. Setting up a multi-node Hadoop cluster

Similarly, you can set up a multi-node cluster and run your job. For a multi-node cluster setup guide, you can visit here.

That's all about the Hadoop cluster setup; I hope it will be helpful to beginners of Hadoop!


 

 

