Thursday, April 24, 2014

Setting up a Hadoop Cluster and Running MapReduce Programs

Here I am sharing my experience of setting up a Hadoop cluster on Ubuntu 12.04 LTS machines and running MapReduce programs on it, which was one of my assignments in the Evolving Architecture Lab.

A. Introduction to Hadoop

 

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

For setting up Hadoop we can choose one of the following options:
  1. Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
  2. MapR http://www.mapr.com/
  3. CDH4 (Cloudera) http://www.cloudera.com/content/support/en/documentation/cdh4-documentation/cdh4-documentation-v4-latest.html
  4. Apache Hadoop http://hadoop.apache.org/
I have chosen Apache Hadoop for setting up the cluster.

What Is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

B. Setting up a single-node Hadoop cluster

Prerequisites

1. Install Java JDK 7

1.      Download the Java JDK (https://www.dropbox.com/s/h6bw3tibft3gs17/jdk-7u21-linux-x64.tar.gz)

2.      Unzip the file tar -xvf jdk-7u21-linux-x64.tar.gz 

3.      Now move the JDK 7 directory to /usr/lib 

1.      sudo mkdir -p /usr/lib/jvm

2.      sudo mv ./jdk1.7.0_21 /usr/lib/jvm/jdk1.7.0 

4.      Now run 

1.      sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1 

2.      sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1 

3.      sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1 

5.      Correct the file ownership and the permissions of the executables: 

1.      sudo chmod a+x /usr/bin/java

2.      sudo chmod a+x /usr/bin/javac

3.      sudo chmod a+x /usr/bin/javaws

4.      sudo chown -R root:root /usr/lib/jvm/jdk1.7.0

6.      Check the version of your new JDK 7 installation: java -version

 

 2. Install SSH Server 

1.      sudo apt-get install openssh-client

2.      sudo apt-get install openssh-server

3. Adding a dedicated Hadoop system user

  1. sudo addgroup hadoop
  2. sudo adduser --ingroup hadoop hduser 
4. Configure SSH 

1.      su - hduser

2.      ssh-keygen -t rsa -P ""

3.      cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 

4.      ssh localhost

 5. Disabling IPv6 – 
Run the following command in the extended terminal (Alt + F2) 

1.      gksudo gedit /etc/sysctl.conf

Add the following lines to the bottom of the file
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
 Save the file and close it.
 
6. Restart your Ubuntu 
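
After the restart, you can check whether IPv6 has really been disabled:

1.      cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A value of 1 means IPv6 is disabled.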

Hadoop Installation and Configuration 

1. Download Apache Hadoop 1.1.2 (https://www.dropbox.com/s/znonl6ia1259by3/hadoop-1.1.2.tar.gz) and store it in the Downloads folder

2. Unzip the file (open up the terminal window) 

1.      cd Downloads

2.      sudo tar xzf hadoop-1.1.2.tar.gz 

3.      cd /usr/local 

4.      sudo mv /home/hduser/Downloads/hadoop-1.1.2 hadoop 

5.      sudo addgroup hadoop

6.      sudo chown -R hduser:hadoop hadoop

Open your .bashrc file in the extended terminal (Alt + F2) 

1.      gksudo gedit .bashrc

Add the following lines to the bottom of the file: 

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export PIG_HOME=/usr/local/pig
export PIG_CLASSPATH=/usr/local/hadoop/conf

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$PIG_HOME/bin
 Save the .bashrc file and close it.
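
Note that the new settings only take effect in a new terminal session; to apply them to the current shell, reload the file:

1.      source $HOME/.bashrc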
 Run 

1.      gksudo gedit /usr/local/hadoop/conf/hadoop-env.sh

Add the following lines 

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0/

Save and close file

 In the terminal window, create a directory and set the required ownerships and permissions 

1.      sudo mkdir -p /app/hadoop/tmp

2.      sudo chown hduser:hadoop /app/hadoop/tmp

3.      sudo chmod 750 /app/hadoop/tmp

Run

1.      gksudo gedit /usr/local/hadoop/conf/core-site.xml

 Add the following between the <configuration> … </configuration> tags 

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
Save and close file 
 Run 

1.      gksudo gedit /usr/local/hadoop/conf/mapred-site.xml

Add the following between the <configuration> … </configuration> tags 

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
</description>
</property>
Save and close file 
Run 

1.      gksudo gedit /usr/local/hadoop/conf/hdfs-site.xml

Add the following between the <configuration> … </configuration> tags 

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Format the HDFS 

1.      /usr/local/hadoop/bin/hadoop namenode -format
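
If the format succeeds, the last few lines of the output should contain a message along the lines of "Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted." (The exact wording can differ between Hadoop versions.)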

Press the Start button and type Startup Applications 
Add an application with the following command: 

1.      /usr/local/hadoop/bin/start-all.sh 

Restart Ubuntu and log in.
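
After logging back in, you can verify that the Hadoop daemons are running with the jps tool that ships with the JDK:

1.      jps

On a working single-node setup the list should include NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (besides Jps itself).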

Running a MapReduce job

We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. Both the input and the output are plain-text files; each line of the output contains a word and the count of how often it occurred, separated by a tab.

1. Login to hduser 
su - hduser  

2. Restart the Hadoop cluster  
/usr/local/hadoop/bin/start-all.sh 

3. Copy local example data to HDFS
cd /usr/local/hadoop 
bin/hadoop dfs -copyFromLocal /tmp/input /user/hduser/input 
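
Note that the local directory (/tmp/input in this example) must already exist and contain one or more plain-text files before you copy it. As a quick test you could, for instance, use the text files that ship with Hadoop itself (this assumes the *.txt files are present in /usr/local/hadoop):
mkdir /tmp/input
cp /usr/local/hadoop/*.txt /tmp/input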

4. Run the MapReduce job
bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/input /user/hduser/output 

5. Retrieve the job result from HDFS
mkdir /tmp/gutenberg-output
bin/hadoop dfs -getmerge /user/hduser/output /tmp/gutenberg-output 
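
You can also inspect the result directly in HDFS; for this job the output file is typically named part-r-00000, although the exact name may vary:
bin/hadoop dfs -cat /user/hduser/output/part-r-00000 | head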
 

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
  • http://localhost:50070/ (web UI of the NameNode daemon)
  • http://localhost:50030/ (web UI of the JobTracker daemon)
  • http://localhost:50060/ (web UI of the TaskTracker daemon)
These web interfaces provide concise information about what’s happening in your Hadoop cluster.

 C. Setting up a multi-node Hadoop cluster

Similarly, you can set up a multi-node cluster and run your job. For a multi-node cluster setup guide, you can visit here.

That's all about the Hadoop cluster setup, and I hope it will be helpful to beginners of Hadoop!


 

 


Friday, April 11, 2014

CUDA installation and programming in CUDA and OpenCL

Introduction to CUDA:-

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by its graphics processing units (GPUs). It enables dramatic increases in computing performance by harnessing the power of the GPU.
CUDA was developed with several design goals in mind:
  • Provide a small set of extensions to standard programming languages, like C, that enable a straightforward implementation of parallel algorithms. With CUDA C/C++, programmers can focus on the task of parallelization of the algorithms rather than spending time on their implementation.
  • Support heterogeneous computation where applications use both the CPU and GPU. Serial portions of applications are run on the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be incrementally applied to existing applications. The CPU and GPU are treated as separate devices that have their own memory spaces. This configuration also allows simultaneous computation on the CPU and GPU without contention for memory resources.

Installing CUDA on Ubuntu 12.04:-

1. System Requirements

To use CUDA on your system, you will need the following installed:
  • A CUDA-capable GPU
  • A supported version of Linux with a gcc compiler and toolchain
  • The NVIDIA CUDA Toolkit

2. Pre-installation Actions

Some actions must be taken before the CUDA Toolkit and Driver can be installed on Linux:
  • Verify the system has a CUDA-capable GPU
  • Verify the system is running a supported version of Linux.
  • Verify the system has gcc installed.
  • Download the NVIDIA CUDA Toolkit. 
You can see here for these pre-installation actions. 


3. Package Manager Installation:-


The installation is a two-step process. First the small repository configuration package must be downloaded from the NVIDIA CUDA download page, and installed manually. The package sets the package manager database to include the CUDA repository. Then the CUDA Toolkit is installed using the package manager software.

3.1. Prerequisites

  • The package manager installations (RPM/DEB packages) and the stand-alone installer installations (.run file) of the NVIDIA driver are incompatible. Before using the distribution-specific packages, uninstall the NVIDIA driver with the following command:
sudo /usr/bin/nvidia-uninstall
 
  • For x86 to ARMv7 cross development, the host system must be an Ubuntu 12.04 system.

3.2. Ubuntu

  1. Perform the pre-installation actions.
  2. Enable armhf foreign architecture, if necessary
The armhf foreign architecture must be enabled in order to install the cross-armhf toolkit. To enable armhf as a foreign architecture, the following commands must be executed:
On Ubuntu 12.04,
$ sudo sh -c \
  'echo "foreign-architecture armhf" >> /etc/dpkg/dpkg.cfg.d/multiarch'
$ sudo apt-get update
On Ubuntu 12.10 or greater,
$ sudo dpkg --add-architecture armhf
$ sudo apt-get update
  3. Install repository meta-data
When using a proxy server with aptitude, ensure that wget is set up to use the same proxy settings before installing the cuda-repo package.
$ sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb
  4. Update the Apt repository cache
$ sudo apt-get update
  5. Install CUDA
$ sudo apt-get install cuda
  6. Add libcuda.so symbolic link, if necessary
The libcuda.so library is installed in the /usr/lib/nvidia-331 directory. For pre-existing projects which use libcuda.so, it may be useful to add a symbolic link from libcuda.so in the /usr/lib directory (see the example command below this list).
  7. Perform the post-installation actions.
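
For the optional symbolic link in step 6, a command along these lines can be used (the driver directory, /usr/lib/nvidia-331 here, depends on which driver version was actually installed):
$ sudo ln -s /usr/lib/nvidia-331/libcuda.so /usr/lib/libcuda.so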

4. Post-installation Actions

Some actions must be taken after installing the CUDA Toolkit and Driver before they can be completely used:
  • Set up environment variables.
  • Install a writable copy of the CUDA Samples.
  • Verify the installation.

4.1. Environment Setup

The PATH variable needs to include /usr/local/cuda-6.0/bin
The LD_LIBRARY_PATH variable needs to contain /usr/local/cuda-6.0/lib on a 32-bit system, and /usr/local/cuda-6.0/lib64 on a 64-bit system
To change the environment variables for 32-bit operating systems:
$ export PATH=/usr/local/cuda-6.0/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH
To change the environment variables for 64-bit operating systems:
$ export PATH=/usr/local/cuda-6.0/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib64:$LD_LIBRARY_PATH

4.2. (Optional) Install Writable Samples

In order to modify, compile, and run the samples, the samples must be installed with write permissions. A convenience installation script is provided:
$ cuda-install-samples-6.0.sh <dir>
This script is installed with the cuda-samples-6-0 package. The cuda-samples-6-0 package installs only a read-only copy in /usr/local/cuda-6.0/samples.
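For example, to put a writable copy of the samples into your home directory (which creates the ~/NVIDIA_CUDA-6.0_Samples directory used in the steps below):
$ cuda-install-samples-6.0.sh ~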

4.3. Verify the Installation

Before continuing, it is important to verify that the CUDA toolkit can find and communicate correctly with the CUDA-capable hardware. To do this, you need to compile and run some of the included sample programs.
Note: Ensure the PATH and LD_LIBRARY_PATH variables are set correctly.

4.3.1. Verify the Driver Version

If you installed the driver, verify that the correct version of it is installed.
This can be done through your System Properties (or equivalent) or by executing the command
cat /proc/driver/nvidia/version
Note that this command will not work on an iGPU/dGPU system.

4.3.2. Compiling the Examples

The version of the CUDA Toolkit can be checked by running nvcc -V in a terminal window. The nvcc command runs the compiler driver that compiles CUDA programs. It calls the gcc compiler for C code and the NVIDIA PTX compiler for the CUDA code.
The NVIDIA CUDA Toolkit includes sample programs in source form. You should compile them by changing to ~/NVIDIA_CUDA-6.0_Samples and typing make. The resulting binaries will be placed under ~/NVIDIA_CUDA-6.0_Samples/bin.
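In a terminal, that amounts to:
$ cd ~/NVIDIA_CUDA-6.0_Samples
$ make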

4.3.3. Running the Binaries

After compilation, find and run deviceQuery under ~/NVIDIA_CUDA-6.0_Samples.
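For example, on a 64-bit Linux build the binary typically ends up under bin/x86_64/linux/release (the exact sub-directory is an assumption and may differ between toolkit versions):
$ ~/NVIDIA_CUDA-6.0_Samples/bin/x86_64/linux/release/deviceQuery
If the CUDA software is installed and working correctly, the deviceQuery output ends with Result = PASS.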



Programming in CUDA

A simple “Hello, world” program using CUDA C is given here

In file hello.cu:

#include "stdio.h"

int main()

{

printf("Hello, world\n");

return 0;

}

On our Tesla machine, you can compile and run this with:

$ nvcc hello.cu

$ ./a.out

You can change the output file name with the -o flag: nvcc -o hello hello.cu
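
Note that the program above runs entirely on the CPU; nothing in it actually executes on the GPU. A minimal sketch of a real kernel launch looks like the following (the kernel name helloFromGPU is just illustrative, and device-side printf needs a GPU of compute capability 2.0 or newer):

#include <stdio.h>

// Kernel: this function runs on the GPU, once per thread.
__global__ void helloFromGPU()
{
    printf("Hello, world from GPU thread %d\n", threadIdx.x);
}

int main()
{
    helloFromGPU<<<1, 4>>>();   // launch 1 block of 4 threads on the GPU
    cudaDeviceSynchronize();    // wait for the kernel to finish before exiting
    return 0;
}

Compile and run it the same way, adding an architecture flag so that device-side printf is available, e.g. nvcc -arch=sm_20 -o hello_gpu hello_gpu.cu and then ./hello_gpu.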

Programming using OpenCL

On Linux the OpenCL header is <CL/cl.h>, while on Apple systems it is <OpenCL/opencl.h>, so a portable include block looks like this:

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
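
With that include block in place, a minimal "hello.c" sketch that simply locates an OpenCL platform and GPU device and prints the device name (enough to confirm the OpenCL runtime is working; this program is only an illustration) could look like this:

#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    /* Grab the first platform and the first GPU device on it */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
        fprintf(stderr, "No OpenCL platform found\n");
        return 1;
    }
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "No OpenCL GPU device found\n");
        return 1;
    }
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Hello, world from OpenCL device: %s\n", name);
    return 0;
}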
To run "Hello.c"
> gcc -I /path-to-NVIDIA/OpenCL/common/inc -L /path-to-NVIDIA/OpenCL/common/lib/Linux64 -o hello hello.c -lOpenCL (64-bit Linux)


That is all you have to do. I hope this blog is helpful to beginners!