Configuring Apache Hadoop 2.x

In the article Installing Hadoop on OS X (there are further articles to come on installing Hadoop on other operating systems), we looked at how to install an Hadoop Single Node Cluster on Mac OS X. We will now look at the next steps, which are to configure and run Hadoop. If you have not already installed Hadoop, read Installing Hadoop on OS X now, or look for an alternative article that explains how to install Hadoop on the operating system of your choice.

2nd March 2014 - This article is a work in progress. Please check back for updates.

This article assumes that you have installed Hadoop in the directory /usr/local/hadoop or, preferably, that /usr/local/hadoop is a symbolic link that references the real installation directory.
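For example, if you installed Hadoop 2.2.0 in the directory /usr/local/hadoop-2.2.0 (the layout used later in this article), you could create the symbolic link with a command such as the one below; sudo may or may not be needed, depending on the permissions of /usr/local.

sudo ln -s /usr/local/hadoop-2.2.0 /usr/local/hadoop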

In this article, we will look at the minimum configuration that is needed to get a basic Hadoop Single Node Cluster up and running. In later articles, we will look at more advanced configuration.

Setting up your PROFILE

The first thing that we need to do is to add some Hadoop configuration parameters to our login profile and add the Hadoop programs to our execution path. These programs are located in the directories /usr/local/hadoop/bin and /usr/local/hadoop/sbin. The file that needs editing depends on your operating system and command interpreter (shell). If you are running on Mac OS X, Linux or on Windows (using Cygwin), then you will most likely be able to create or edit the file $HOME/.bash_profile.

Use caution when editing login profiles

I would recommend caution when editing any login profile. If the login profile already exists, make a backup copy of it first. Once you have finished editing it, run it in a new shell, so that you can test that it does not contain errors. In some circumstances, having an invalid login profile will prevent you from logging in to your computer.
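For example, assuming that the file already exists, the following command will take a backup copy (the .orig suffix is simply a convention that is used throughout this article).

cp $HOME/.bash_profile $HOME/.bash_profile.orig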

Updating your login profile

Enter the following command (using the editor of your choice).

vi $HOME/.bash_profile

If this file already exists, have a look to see if it already contains an entry for PATH, for example, export PATH=$PATH:/usr/local/bin. If it does, then append a new entry for Hadoop (:/usr/local/hadoop/bin:/usr/local/hadoop/sbin). If the file does not exist or does not contain an entry for PATH, add the new entry export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin. You can now save this file.

The settings in the file should look something like this: -

export HADOOP_PREFIX="/usr/local/hadoop"
export HADOOP_HOME="${HADOOP_PREFIX}"
export HADOOP_COMMON_HOME="${HADOOP_PREFIX}"
export HADOOP_CONF_DIR="${HADOOP_PREFIX}/etc/hadoop"
export HADOOP_HDFS_HOME="${HADOOP_PREFIX}"
export HADOOP_MAPRED_HOME="${HADOOP_PREFIX}"
export HADOOP_YARN_HOME="${HADOOP_PREFIX}"

export "PATH=${PATH}:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin"

If you have created a new .bash_profile, you will need to grant execute permission to the owner. You can do this using chmod, by entering the following command.

chmod u+x $HOME/.bash_profile

You can now validate your login profile, by executing the following command. You should see no errors.

sh $HOME/.bash_profile

If there are no errors in your login profile, you can now load these changes into the current shell by entering the following command.

. $HOME/.bash_profile

Finally, we can test that the changes have taken effect by entering the following command.

echo $PATH

You should see a response similar to the one shown below. The important part is that you should see :/usr/local/hadoop/bin:/usr/local/hadoop/sbin appended to the PATH value.

/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/local/MacGPG2/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
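As a further check, you can ask the shell where it will find the hadoop program, by entering the following command.

which hadoop

Assuming the directory layout described earlier, this should print /usr/local/hadoop/bin/hadoop.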

General Configuration

In this section, we'll look at the general configuration of Hadoop.

hadoop-env.sh

This is the main Hadoop configuration file. As a minimum, you should modify this file and tell Hadoop where the Java Home directory is. The Java Home directory is the parent of the bin directory that contains the java program. In my case (OS X (Mavericks)), the Java Home directory is /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home.
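If you are unsure what this directory is, OS X provides the java_home utility, which will usually report it for you.

/usr/libexec/java_home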

I would recommend that you make a backup copy of this file, for example: -

cp /usr/local/hadoop/etc/hadoop/hadoop-env.sh /usr/local/hadoop/etc/hadoop/hadoop-env.sh.orig

Enter the following command (using the editor of your choice).

vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Amend the file as shown below, by commenting out the original value for JAVA_HOME and replacing it with the correct value. Once this modification has been made, you can save this file and you will have completed the basic Hadoop set-up.

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home"

HDFS Configuration

This section describes the configuration for the Hadoop Distributed File System (HDFS).

hdfs-site.xml

We specify HDFS site-specific configuration parameters in the file hdfs-site.xml. In this initial configuration, we will specify both our NameNode and DataNode.

By default, hdfs-site.xml has no properties defined. I would still recommend that you make a backup copy of this file, for example: -

cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml /usr/local/hadoop/etc/hadoop/hdfs-site.xml.orig

NameNode

The Hadoop NameNode is the centrepiece of HDFS and maintains a directory tree of all of the files in the Hadoop Distributed File System, including the location of files within the cluster. No data is stored on the NameNode. Client applications communicate with the NameNode when they want to retrieve data. For more information on the NameNode, read the article on NameNode, on the Hadoop Wiki.

Now add an entry for the NameNode, between the <configuration>...</configuration> tags, as shown below.

<configuration>
        <property>
                <name>dfs.namenode.servicerpc-address</name>
                <value>localhost:8020</value>
        </property>
</configuration>
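A malformed configuration file will cause problems when Hadoop starts, so it is worth checking the file after each change. If you have the xmllint utility available (it ships with OS X and most Linux distributions), the following command will validate the file; no output means that the XML is well-formed.

xmllint --noout /usr/local/hadoop/etc/hadoop/hdfs-site.xml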

DataNode

DataNodes are individual computers within the Hadoop Cluster and store the file system data. An Hadoop Cluster consists of one or more DataNodes. In our examples, we have a Single Node Cluster; our local computer acts as both the NameNode and the DataNode. Once the NameNode has provided a client with the location of a data file, the client may communicate directly with the DataNode. Data is replicated across DataNodes, so there is no need to use RAID storage for data. For more information on the DataNode, read the article on DataNode, on the Hadoop Wiki.
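By default, HDFS replicates each block of data three times, which serves no purpose on a Single Node Cluster that has only one DataNode. One property that you may wish to add to hdfs-site.xml, alongside the NameNode entry above, is a replication factor of 1; a minimal sketch is shown below.

        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>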

Hadoop and Mac OS X

If you're running Hadoop on OS X (this issue persists with Java 7), there appears to be a Kerberos-related issue that produces the error Unable to load realm info from SCDynamicStore, as documented in issue HADOOP-7489. This appears to be a minor issue and is not specifically related to Hadoop.
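A workaround that is commonly suggested in the HADOOP-7489 discussion is to pass empty Kerberos realm and KDC settings to the JVM, by adding a line such as the following to hadoop-env.sh. Treat this as a sketch, and only apply it if you actually see the error.

export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="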

A quick test

We can now run a quick test, to see how things are progressing.

Enter the following command.

hadoop version

You should see a response similar to the following.

Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar

Next Steps

These steps have completed the basic configuration of Hadoop. In the next tutorials, we'll look at running Hadoop and testing some basic commands.



