Programming Lab 4: Map Side Join
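- This lab demonstrates the map-side join technique, which joins data stored in multiple directories quickly and efficiently.
- When the inputs are already sorted by the join key and partitioned the same way, this is often the fastest way to join them, because the join happens in the mappers and no shuffle or reduce phase is needed.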
Problem to Solve
- We will work with two data sets, OrgChart and Salary, and join them using the MapReduce map-side join technique
Create Project and Link with Libraries
- Copy provided libraries and java code from USB drives
- Create project in NetBeans or Eclipse
- Link with libraries
- Create new class and copy provided code
- Modify the input and output directories to point to the SalaryMap and OrgChartMap folders
- For detailed instructions on creating a project, refer to Lab 1.
Task1. Create Map Files Using Previous Lab
- Using the Lab 3 code, point the input directory to lab4/input/Salary and the output directory to lab4/input/SalaryMap
- Run the code; it will generate a map file (a sequence file with an index)
- Repeat the same procedure for OrgChart
Task2. Point MapSide Directories to Newly Generated Map Files
- Change the join input directories to point to the newly generated map files
- Run the code and inspect the output
- Note that, unlike a regular MR job, a map-side join guarantees the order of the data.
Walk Through
- The map-side join takes an array of the paths to join
- Set the map-side join input format on the job
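To see why sorted map files make this kind of join cheap, the idea can be sketched in plain Java as a single forward pass over two key-sorted inputs. This is only an illustration of the merge idea, not the Hadoop implementation: the class name MergeJoinSketch, the sample keys and values, and the "-" placeholder for a missing side are all made up for this sketch, and it assumes each key appears at most once per input.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; none of these names are Hadoop classes.
class MergeJoinSketch {

    // Outer-joins two lists of {key, value} pairs in one forward pass.
    // Assumes both lists are sorted by key and keys are unique within a list.
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() || j < right.size()) {
            int cmp;
            if (i >= left.size()) {
                cmp = 1;   // left input exhausted, only right records remain
            } else if (j >= right.size()) {
                cmp = -1;  // right input exhausted, only left records remain
            } else {
                cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            }

            if (cmp == 0) {
                // Key present on both sides: emit the joined record
                out.add(left.get(i)[0] + "\t" + left.get(i)[1] + "\t" + right.get(j)[1]);
                i++;
                j++;
            } else if (cmp < 0) {
                // Key only on the left side
                out.add(left.get(i)[0] + "\t" + left.get(i)[1] + "\t-");
                i++;
            } else {
                // Key only on the right side
                out.add(right.get(j)[0] + "\t-\t" + right.get(j)[1]);
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample records, both lists sorted by employee id
        List<String[]> orgChart = List.of(
            new String[] {"e1", "Engineering"},
            new String[] {"e2", "Sales"});
        List<String[]> salary = List.of(
            new String[] {"e1", "100"},
            new String[] {"e3", "80"});
        join(orgChart, salary).forEach(System.out::println);
    }
}
```

Because neither input is ever re-read or re-sorted, the output comes out in key order, which is consistent with the ordering guarantee noted in Task2.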
Walk Through Mapper
The mapper receives the joined values through a TupleWritable value
value.has(0) checks whether a left value is present
value.get(0) returns the left value
value.has(1) checks whether a right value is present
value.get(1) returns the right value
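The has/get pattern above can be tried out without a cluster using a small stand-in class. The sketch below is not Hadoop's TupleWritable: JoinedValueSketch, its format method, and the "N/A" placeholder are hypothetical names chosen for illustration, and only the has(i)/get(i) calls mirror what the lab describes.

```java
// Stand-in for the joined value. It only mimics the has(i)/get(i) calls
// described above and is NOT Hadoop's TupleWritable.
class JoinedValueSketch {
    private final String[] parts;

    JoinedValueSketch(String... parts) {
        this.parts = parts;
    }

    // True when position i exists and holds a value
    boolean has(int i) {
        return i < parts.length && parts[i] != null;
    }

    String get(int i) {
        return parts[i];
    }

    // Builds an output value the way a mapper body might:
    // check each side with has() before reading it with get().
    static String format(JoinedValueSketch value) {
        String left = value.has(0) ? value.get(0) : "N/A";
        String right = value.has(1) ? value.get(1) : "N/A";
        return left + "\t" + right;
    }

    public static void main(String[] args) {
        // Both sides present
        System.out.println(format(new JoinedValueSketch("Engineering", "100")));
        // Right side missing
        System.out.println(format(new JoinedValueSketch("Sales", null)));
    }
}
```

Checking has() before get() keeps the mapper from reading a side that was not present for a given key; a third joined input would presumably be handled the same way at the next position.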
Task3. Adding Vacation Time
- Generate a map file for vacation time
- Extend the MapReduce job to add another folder to the join
- Modify the mapper code to add vacation time to the output value
- Hint: to add two text objects: