Wednesday, 23 October 2013

Hadoop Pseudo Distributed Mode

The pseudo-distributed mode runs Hadoop as a “cluster of one”, with all the daemons on a single machine. This mode complements the standalone mode for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions.

Edit conf/hadoop-env.sh so that the line below points to your Java installation:

# The java implementation to use. Required.
export JAVA_HOME=/usr/local/jdk1.7.0_45
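
A quick sanity check (using the JDK path assumed above) is to run that JVM's java binary directly and confirm the version:

/usr/local/jdk1.7.0_45/bin/java -version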
Edit the XML files in the conf/ directory.

core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
</configuration>

mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.</description>
  </property>
</configuration>

hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created.</description>
  </property>
</configuration>

In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively. 

In hdfs-site.xml we specify the default replication factor for HDFS, which should be 1 because we are running on only a single node.
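
That default only applies when a file does not request its own replication. As a purely illustrative example (the file and path names here are hypothetical), a per-file replication factor can be passed when the file is written, or changed afterwards with setrep, once the cluster is up:

bin/hadoop fs -D dfs.replication=1 -put myfile.txt /user/hduser/myfile.txt
bin/hadoop fs -setrep -w 1 /user/hduser/myfile.txt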

We must also specify the location of the Secondary NameNode in the masters file and the slave nodes in the slaves file. Make sure you are in the conf directory:

echo localhost > masters

echo localhost > slaves
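
Both files should now contain the single line localhost, which you can confirm with:

cat masters slaves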

While all the daemons run on the same machine, Hadoop's start and stop scripts still reach them over SSH, just as they would if the nodes were spread across a cluster. For single-node operation, simply check whether your machine already allows you to SSH back to itself:

ssh localhost

If it logs you in without prompting for a password, you're good. Otherwise, install an SSH server and set up passwordless, key-based login:

sudo apt-get install openssh-server
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
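
If ssh localhost still prompts for a password after this, one common cause (not Hadoop-specific) is that sshd refuses keys when ~/.ssh or authorized_keys is too permissive; tightening the permissions usually fixes it:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys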

You are almost ready to start Hadoop. But first you’ll need to format your HDFS by using the command

bin/hadoop namenode -format

We can now launch the daemons with the start-all.sh script. The Java jps command lists the running Java processes, so you can verify that all five daemons (NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker) started successfully.

bin/start-all.sh

jps
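
Beyond jps, a couple of quick checks (assuming the default Hadoop 1.x ports) are to list the HDFS root directory and to open the built-in web interfaces:

bin/hadoop fs -ls /

The NameNode status page should be available at http://localhost:50070/ and the JobTracker page at http://localhost:50030/.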