Vicci Hadoop Tutorial

This tutorial will demonstrate running a Hadoop experiment on Vicci.

Prerequisites

  • We will be running a demo described at http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/. Please visit that page and familiarize yourself with the demo.
  • Your slice should not have any existing slivers. We're going to be setting the image and the network configuration for the slivers, so we need a clean slate to work with. If you have any slivers, please delete them now or create a fresh slice with no slivers.

Configuring The Slice

Log into Vicci and go to the Manage Slice Slivers page.

  1. Click the Configure Network tab and select Private Bridge with GRE Tunnels. Press Submit.
  2. Click the Select Image tab and select Hadoop. Press Submit.
  3. Click the Add Slivers tab. Select Princeton for the site and 3 for the number of nodes to add. Press Submit.
  4. Click the Show Slivers tab. This will list the hosts where your slivers are located. In particular, make note of the designated "head node". The head node is where we will launch our Hadoop experiment.

At this point, Vicci will start initializing your slivers. It may take up to 15 minutes for the slivers to be created.
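
If you would rather check readiness from a terminal than refresh the page, a loop along the following lines works as a sketch; <head_node_hostname> is a placeholder for the host shown on the Show Slivers tab:

# retry every 30 seconds until the head node accepts SSH connections
until ssh <head_node_hostname> true; do
    sleep 30
done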

Logging Into the Head Node and Starting Hadoop

Begin by using an SSH client to log into the head node (if you've forgotten which node is the head node, use the Show Slivers tab on the Manage Slice Slivers page). Once you've logged in, execute the following commands:


# switch to the slice user, then cd into that user's home directory
su <name_of_slice>
cd ~

# download the demo code and source files and un-tar them
wget http://www.vicci.org/files/hadoop/hadoop-demo.tar.gz
tar -xzf hadoop-demo.tar.gz
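
# (optional) sanity-check the demo scripts locally before involving the
# cluster: this pipeline mimics Hadoop Streaming's map -> shuffle -> reduce
# flow, using a plain sort in place of the shuffle. It assumes the tarball
# unpacked mapper.py and reducer.py into the current directory as
# executable scripts; adjust the paths if your layout differs.
echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py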

# configure the Vicci nodes for Hadoop, setting up SSH keys, etc.
./hadoop-startup.py

# format the HDFS namenode
hadoop namenode -format

# start the Hadoop DFS and MapReduce services
start-dfs.sh
start-mapred.sh
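
# (optional) confirm the daemons came up; jps ships with the JDK (assumed
# to be installed alongside Hadoop) and should list processes such as
# NameNode and JobTracker on the head node
jps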

# copy the data files into the Hadoop file system (HDFS)
hadoop dfs -mkdir /gutenberg
hadoop dfs -copyFromLocal data/*.txt /gutenberg
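
# verify the upload; the .txt files should now be listed under /gutenberg
hadoop dfs -ls /gutenberg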

# submit the word-count job to Hadoop via the streaming jar
hadoop jar hadoop-*streaming*.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /gutenberg/* -output /gutenberg-output
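
When the job completes, its results are written to /gutenberg-output on HDFS. The exact part-file names depend on how many reducers the job ran with, so list the directory first and then print the counts; the wildcard below is a sketch that should match however many part files were produced:

# list the job output and print the aggregated word counts
hadoop dfs -ls /gutenberg-output
hadoop dfs -cat /gutenberg-output/part-*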