Tuesday, December 27, 2011

Multi-Node Cluster Setup

Administration Lab 6: Multi-node Cluster Setup

Create Additional VM Image

Repeat the steps outlined in Lab 1 to create a new Virtual Machine and install the JDK and Hadoop


Configure Bridged Adapter
  1. Power off both Virtual Machines (Machine => ACPI Shutdown)
  2. Highlight each VM and click on Settings => Network
  3. In the “Attached To” field, select “Bridged Adapter”
  4. In the “Name” field, select the network adapter (wired/wireless) that your host Operating System uses to connect to the Internet


Uncheck all devices except “Hard Disk”


Find out Master & Slave IPs
  1. Run ifconfig on master and slave
  2. Write down the IPs for both hosts
Execute Installation from Lab 1
  1. Repeat the installation steps from Lab 1 on the slave node, except the following:
  2. sudo apt-get install hadoop-0.20-namenode
  3. sudo apt-get install hadoop-0.20-jobtracker
  4. These two packages are not needed on the slave node
Configuration For Both Nodes
  1. Edit /etc/hadoop-0.20/conf/mapred-site.xml on both nodes
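On both nodes, mapred-site.xml must point MapReduce at the master’s JobTracker. A minimal sketch — the hostname master and port 8021 (the CDH default JobTracker port) are assumptions, adjust to your setup:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>master:8021</value>
</property>
```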
Provisioning IPs
  1. /etc/hosts
  2. Add entries for the master and slave nodes to ensure proper network communication
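For example, assuming the IPs you wrote down earlier were 192.168.1.10 (master) and 192.168.1.11 (slave) — these addresses are illustrative — the entries on both machines would look like:

```
192.168.1.10    master
192.168.1.11    slave
```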
Reformat namenode and delete data from data directory
  1. sudo su hdfs
  2. Execute hadoop-0.20 namenode -format
  3. Delete data from the data directories on both machines:
  4. /var/lib/hadoop-0.20/cache/hdfs/dfs/data
Start Distributed Cluster
  1. Start master node normally
  2. Start datanode and tasktracker on the slave node only:
  3. sudo /etc/init.d/hadoop-0.20-tasktracker start
  4. sudo /etc/init.d/hadoop-0.20-datanode start
Verification
  1. Check the Web interface to verify that the second node is connected
  2. Run the sleep example to make sure that MapReduce is working properly

Format Conversion

Programming Lab 3: Format Conversion
  1. This lab demonstrates how to use different map-reduce formats for input and output
  2. It also shows how to convert from one format to another
Problem to Solve
  1. We will work with one set of data, OrgChart, and will represent that data in different formats
Create Project and Link with Libraries
  1. Copy provided libraries and java code from USB drives
  2. Create project in NetBeans or Eclipse
  3. Link with libraries
  4. Create new class and copy provided code
  5. Modify input and output directory
  6. Run code and examine result
  7. For detailed instructions on creating a project, refer to Lab 1.
Task1. Parse Existing Key Value Text Input
  1. We need to parse the existing Key Value Text input and write it back out as Key Value Text
  2. The data representation is not going to change, but the data will be sorted by its key
  3. Set the proper input format
  4. Set the proper output format
  5. Don’t forget to change the input path to the correct path on your workstation
  6. Expected output
Task2. Generate Sequence File From Key Value Text Input
  1. We need to parse the existing Key Value Text input and write it out as a Sequence File
  2. The data representation is not going to change, but the data will be sorted by its key and stored in compressed form
  3. Set the proper input format
  4. Set the proper output format
  5. Enable compression
  6. The expected output will not be readable since it is compressed
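The format switch happens entirely in the job configuration. A sketch using the old (0.20 mapred) API — the driver class name FormatConversion is a placeholder from this lab, not a fixed name:

```java
JobConf conf = new JobConf(FormatConversion.class);
// read "key=value" text lines, write a compressed sequence file
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(conf, true);
SequenceFileOutputFormat.setOutputCompressionType(conf,
    SequenceFile.CompressionType.BLOCK);
```

Tasks 3 and 4 follow the same pattern, swapping in MapFileOutputFormat or the matching input format.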
Task3. Generate Map File From Key Value Text Input
  1. We need to parse the existing Key Value Text input and write it out as a Map File
  2. The data representation is not going to change, but the data will be sorted by its key and stored in compressed form. Also, the Map File will create an index for faster access to the data.
  3. Set the proper input format
  4. Set the proper output format
  5. Compression remains enabled
  6. The expected output will not be readable since it is compressed, and the Map File output will also generate an index file
Task4. Generate Key Value Text Format From Map File
  1. We need to parse the existing Map File and write it out as Key Value Text
  2. The data representation is not going to change, but the data will be sorted by its key and has to match the input data
  3. Keep in mind that if you keep compression enabled, the data will be compressed
  4. Set the proper input format
  5. Set the proper output format
  6. We can choose to keep compression enabled
  7. If compression is left enabled, the expected output will not be human-readable
Summary
  1. This lab has demonstrated how to work with different data formats
  2. You should be comfortable using any of them

Map Side Join


Programming Lab 4: Map Side Join
  1. This lab demonstrates the Map-Side join technique, used to quickly and efficiently join data in multiple directories
  2. In a lot of cases this is the fastest way to process data.
Problem to Solve
  1. We will work with two sets of data, OrgChart and Salary, and will need to join them together using the Map-Side join technique
Create Project and Link with Libraries
  1. Copy provided libraries and java code from USB drives
  2. Create project in NetBeans or Eclipse
  3. Link with libraries
  4. Create new class and copy provided code
  5. Modify the input and output directories to the SalaryMap and OrgChartMap folders
  6. For detailed instructions on creating a project, refer to Lab 1.
Task1. Create Map Files Using Previous Lab
  1. Using the Lab 3 code, point the input directory to lab4/input/Salary and the output directory to lab4/input/SalaryMap
  2. Run the code and it will generate a map file (a sequence file with an index)
  3. Repeat the same procedure with OrgChart
Task2. Point MapSide Directories to Newly Generated Map Files
  1. Change the directories to point to the newly generated map files
  2. Run the code and see the outcome
  3. Please note that, unlike a regular MR job, a map-side join guarantees the order of the data.
Walk Through
  1. Map-Side join takes an array of the paths to join
  2. Set the specific Map-Side join input format
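The two steps above can be sketched with the old (0.20 mapred) API. This is an illustrative sketch, not the provided lab code — the driver class name and the exact paths are assumptions; "inner" requests an inner join:

```java
JobConf conf = new JobConf(MapSideJoin.class);
conf.setInputFormat(CompositeInputFormat.class);
// compose() builds the join expression over the map-file directories
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", SequenceFileInputFormat.class,
    new Path("lab4/input/SalaryMap"),
    new Path("lab4/input/OrgChartMap")));
```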
 
Walk Through Mapper
Mapper provides access to values via the TupleWritable value:
    value.has(0) checks if you have the left value
    value.get(0) gives you the left value
    value.has(1) checks if you have the right value
    value.get(1) gives you the right value
Task3. Adding Vacation Time
  1. Generate Map File for vacation time
  2. Extend Map-Reduce job to add another folder in join
  3. Modify Mapper code to add vacation time into output value
  4. Hint: to add two Text objects:
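The hint itself is elided in the original. A minimal sketch of one way to combine the two values, shown here with plain Strings (with Hadoop’s Text you would wrap the combined String in a new Text); the method and field names are illustrative, not from the lab code:

```java
public class JoinValues {
    // Combine two record values into a single tab-separated output value
    static String combine(String salary, String vacation) {
        return salary + "\t" + vacation;
    }
}
```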

Monday, December 26, 2011

Reduce Side Join

Programming Lab 5: Reduce Side Join
  1. This lab demonstrates reduce-side join with different reducers
Problem to Solve
  1. We will work with two sets of data, OrgChart and Salary, and we will need to join them together using the Reduce Side join technique
  2. However, in some cases a position description can consist of two different words, and we will only compare the first one
 
Create Project and Link with Libraries
  1. Copy provided libraries and java code from USB drives
  2. Create project in NetBeans or Eclipse
  3. Link with libraries
  4. Create a new class for each provided file (with the same name as the file), or copy the java files into your source directory
  5. For detailed instructions on creating a project, please refer to Programming Lab 1
Walk Through
  1. Please make sure to adjust the input path to your local machine
  2. We must assign different mappers to different input folders
  3. The Engineer salary should be applicable to both Engineer and Engineer 2
  4. Position will be represented by the Position class, and we will compare and match only the first word
  5. The Position class will implement a comparator method
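The first-word comparison can be sketched in plain Java — the class and method shapes below are assumptions for illustration, not the provided lab code (which would additionally implement WritableComparable for Hadoop):

```java
public class Position implements Comparable<Position> {
    private final String description;

    public Position(String description) {
        this.description = description;
    }

    // Compare only the first whitespace-delimited word, so that
    // "Engineer" and "Engineer 2" are treated as the same position
    @Override
    public int compareTo(Position other) {
        return firstWord(description).compareTo(firstWord(other.description));
    }

    private static String firstWord(String s) {
        return s.trim().split("\\s+")[0];
    }
}
```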
Expected Result
  1. The expected result is the following:

Input Format

Programming Lab 2: Input Format
  1. We will learn how to join different sets of data on the same key using the regular Map-Reduce approach.
  2. Also, this lab will show the limitations of the regular Map-Reduce approach and what a developer needs to do in order to overcome those limitations
Problem to Solve
  1. We have two sets of data: position information and salary information
  2. We need to join them together!
  3. Produce:
Create Project and Link with libraries
  1. Copy provided libraries and java code from USB drives
  2. Create project in NetBeans or Eclipse
  3. Link with libraries
  4. Create new class and copy provided code
  5. Modify input and output directory
  6. Run code and examine result
  7. For detailed instructions on creating a project, refer to Lab 1.
  8. Our key is the position in the company (CEO, Engineer, etc.)
  9. Our data is separated by “=”. We will use KeyValueTextInputFormat.
  10. We also need to set a special separator, since it is different from the default tab
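With the old (0.20 mapred) API, the separator for KeyValueTextInputFormat is set through a configuration property on the JobConf:

```java
// split each input line on '=' instead of the default tab
conf.set("key.value.separator.in.input.line", "=");
```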
Expected output
Please note that the order is not guaranteed. We need to do special processing to guarantee order.
Task
  1. Additional input folder contains data for vacation time
  2. Additional input folder can be specified by calling  FileInputFormat.addInputPath
  3. We need to add that data for every record

Word Count

Programming Lab 1:  Word Count
  1. We will learn how to set up a development environment for Hadoop projects
  2. Run the “Word Count” application
  3. Create new application to count letters in text documents
Prerequisites
  1. Java 1.6
  2. Hadoop and log4j libraries
  3. NetBeans or Eclipse
Create Project and Link with Libraries
  1. Copy provided libraries and java code from USB drives
  2. Create project in NetBeans or Eclipse (specific instructions on the next page)
  3. Link with libraries
  4. Create new class and copy provided code
  5. Modify input and output directory
  6. Run code and examine result
Create Project with NetBeans
  1. Click on File -> New Project
  2. Select Java Application type
  3. Set main class to WordCount
  4. Name your project HadoopLab1WordCount
  5. Follow instructions on the screen
  6. See the next page for a screenshot
  1. Open a file from the provided material in ProgLabs/lab1/original
  2. Copy everything except the package name
  3. Insert code into newly created class right after the package name
Let’s link with appropriate libraries
  1. Right click on project name and select properties
  2. Pick Libraries option and click on the Compile tab
  3. Click on Add JAR/Folder button
  4. Add everything from ProgLabs/lib folder
  5. NetBeans will re-evaluate dependencies and you should not see any errors at this point
  6. Adjust input/output values in run method to ProgLabs/lab1/<<input|output>> accordingly  
 
  7. Right-click and run!
Create Project with Eclipse
  1. Create New Project: Click on File -> New Project
  2. Name project HadoopLab1WordCount
  3. Select Java 1.6
  4. Click Next and add libraries from ProgLabs/lib
  1. Create a new class – WordCount
  2. Right-click on the project -> select New -> Class
  3. Copy code from ProgLabs/lab1/original into the new class
  4. Right click and run it!
Let’s examine the output directory
You will see two files: one indicating the status of the Map Reduce job, and one containing the job’s result
Result
Adam        1
Brandon        1
Graig        1
Kim        1
Marty        1
Mike        1
Nancy        1
Nick        1
Nishani        1
Steve        2
Tracy        2
Vidur        1
Exercise: Let’s Count Letters!
  1. Modify the word count application to count letters in the document
  2. Create another class that implements Reducer and switch the application to use it in the run method
  3. Hint:  
                String ch = String.valueOf(line.charAt(i));
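The counting logic for the exercise can be sketched in plain Java. The helper class below is illustrative, not part of the provided code — in the actual job the per-letter emit would happen in the Mapper, with the Reducer summing counts:

```java
import java.util.Map;
import java.util.TreeMap;

public class LetterCount {
    // Count how many times each letter occurs in a line,
    // lower-casing and skipping anything that is not a letter
    static Map<Character, Integer> countLetters(String line) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < line.length(); i++) {
            char ch = Character.toLowerCase(line.charAt(i));
            if (Character.isLetter(ch)) {
                counts.merge(ch, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```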

Wednesday, December 21, 2011

MapReduce Filesystem Recovery


Administration Lab 5:  MapReduce Filesystem Recovery

Restore the Last State of VM
*    Open Virtual Box application
*    Start the last VM state
*    Stop Hadoop cluster
for x in /etc/init.d/hadoop-* ; do sudo $x stop; done
Procedure to Recover Namenode
*    Change user to hdfs
sudo su hdfs
*    Let’s try to import the image
hadoop-0.20 namenode -importCheckpoint
*    This will fail!!! (Why???)
*    Let’s move the data to the home directory
mkdir ~/name
mv /var/lib/hadoop-0.20/cache/hadoop/dfs/name/* ~/name
ls /var/lib/hadoop-0.20/cache/hadoop/dfs/name
*    Let’s try the import again
hadoop-0.20 namenode -importCheckpoint
*    This will work!
*    Ctrl-C to terminate the node and exit from the shell
*    Start the cluster
for x in /etc/init.d/hadoop-* ; do sudo $x start; done
Checking Filesystem
hadoop-0.20 fsck <file name> -files -blocks -racks