求助:懂MapReduce的大神们帮我看看 Hortonworks 资格证 (HDPCD:Java) 考试题的代码吧

wuyeyoulanjian 2018-07-21 02:40:53
之前hortonworks官方做活动推广他们家的大数据资格证,可以免费注册考一次试,所以我和我一个朋友都报了名。朋友不是搞大数据的,没有准备,是裸考的,胡乱写了一通就退出来了,也没管结果;我做这一行,为这个考试花了两个多月准备,还蛮想考下来这个证的,结果没有考过。

给 hortonworks 发了邮件咨询,但是官方回复明确说了不会告知没有考过的原因,但是告诉我他们的评判标准是代码和结果两方面。我想自己掏银子再考一次,不过考试费挺贵的,$250, 大概 1500RMB,为了确保下次考能做对,希望版上的大神们能帮我参详一下目前写的代码。

首先介绍一下考试背景要求:

Hortonworks的资格证考试主要集中在大数据这块,其中 HDPCD:JAVA 以考察 MapReduce的代码编写为主,也需要掌握HDFS的基础知识。MapReduce的考核重点是 combiners, partitioners, custom keys, custom sorting 和 joining of datasets。

官方提供了一套练习题和相关的练习环境,真实的考试题型和考试环境跟练习环境是一样的,不过难度小很多。
根据官方提供的配置说明搭建练习环境,需要自己开一个AWS 账户,然后创建一个EC2 instance (费用大概一小时 $0.4 左右)。 这个EC2 instance 是ubuntu操作系统,里面装了一套hadoop,一个eclipse,在hadoop的HDFS上提供了一些练习题需要用到的数据。为了省去这些环境搭建的麻烦,我下面会把练习题的题目要求和数据贴上来。我和朋友参加的真实考试题目我也会贴出来,但是题目中用到的真实数据没办法提供,不过它的数据结构和练习题用到的数据的结构是一样的。

附官方给的考试说明:

The HDPCD:Java Exam
Our certifications are exclusively hands-on, performance-based exams that require you to complete a set of tasks. By performing tasks on an actual Hadoop cluster instead of just guessing at multiple-choice questions, Hortonworks Certified Professionals have proven competency and Big Data expertise. The HDPCD:Java exam consists of tasks associated with writing Java MapReduce jobs, including the development and configuring of:

combiners,
partitioners,
custom keys,
custom sorting,
and the joining of datasets.

The exam is based on the Hortonworks Data Platform 2.2 and candidates are provided with an Eclipse environment that is pre-configured and ready for the writing of Java classes.
...全文
884 9 打赏 收藏 转发到动态 举报
写回复
用AI写文章
9 条回复
切换为时间正序
请发表友善的回复…
发表回复
  • 打赏
  • 举报
回复
Scala能用吗?
wuyeyoulanjian 2019-02-16
  • 打赏
  • 举报
回复
引用 7 楼 lxhiwyn 的回复:
你好, 我也想考这个证书想问下 这个证书在国内找工作有用吗


我也不是很清楚,我不在国内。
在国外挺有用的,有些招聘方指定要这个证或者是Cloudera 的CCP/CCA。
lxhiwyn 2019-02-15
  • 打赏
  • 举报
回复
你好, 我也想考这个证书想问下 这个证书在国内找工作有用吗
wuyeyoulanjian1 2018-07-23
  • 打赏
  • 举报
回复
另一题是我自己参加考试后凭记忆记录下的题目要求,求每个机场延误的最大值:

HDFS系统中/user/horton/flights下有两个文件, 2007.cvs 和 2008.cvs (结构和上面的附件第一题里的一样)
要求写一个mapreduce程序,实现以下功能:

1. 求出每个Arrival airport code中arrival delay时间最长的一个

2. 结果存放在 HDFS中的 /user/horton/task1下

3. 每条结果包含以下数据:
Arrival airport code, Maximu Arrival delay, Departure airport code, Year, Month, DayOfMonth
每个字段之间用逗号隔开

4. 输出结果按照2007和2008分成两个文件

5. 最终结果以Arrival airport code的字母正序排列

用到的数据和上一题一样:

https://github.com/WileyWu12555/big-data-fun/tree/master/sample_data/2007_2008_average_maximum

我的解答代码地址是:

https://github.com/WileyWu12555/big-data-fun/tree/master/src/flight2/maximum/delay

这一题我的解答肯定是不对的,但是我不确定到底是哪里出了问题,程序跑出来的结果看起来也符合题目的要求,我觉得唯一有可能出错的地方,是为在对原始文件中某些数据进行判断的时候遗漏了什么条件,最大的可能性,是 Task1Maximum.java 这个类的第125行,给 int maximum 赋的初始值不应该为 0, 而应该是 Integer.MIN_VALUE, 因为有可能 Arrival delay 的值为负数。
希望版上的大神帮我检查一下,看看问题出在哪里。


package flight2.maximum.delay;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.Iterator;

/**
Data:

------------ 2007.csv ------------
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,1,6,7,1050,1050,1211,1210,WN,680,N283WN,81,80,65,1,0,LAX,SFO,337,6,10,0,,0,NA,NA,NA,NA,NA
2007,1,6,7,1244,1245,1405,1405,WN,776,N720WN,81,80,68,0,-1,LAX,SFO,337,3,10,0,,0,NA,NA,NA,NA,NA
2007,1,6,7,1547,1455,1655,1600,WN,173,N350SW,68,65,58,55,52,LAX,SJC,308,4,6,0,,0,21,0,3,0,31

---------- 2008.csv ------------

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,9,3,1552,1550,1856,1900,WN,438,N307SW,124,130,113,-4,2,LIT,BWI,912,3,8,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,706,705,809,810,WN,7,N902WN,63,65,49,-1,1,LIT,DAL,296,3,11,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,1454,1500,1558,1605,WN,41,N465WN,64,65,53,-7,-6,LIT,DAL,296,4,7,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,732,735,826,835,WN,1194,N660SW,54,60,44,-9,-3,MAF,AUS,294,4,6,0,,0,NA,NA,NA,NA,NA

Understand:

1. Map side: Airport codes (column 17) as the key, [arrival delay (column 15) + Departure airport code, Year, Month, DayOfMonth] => flightDelay as the value;
2. Reduce side find the Maximum arrival delay for each key;
3. Reduce side write the result: (key, value);
4. Set Partition:
public static HashMap<Integer, Integer> years = new HashMap<>();
static{
airCode.put(2007, 1);
airCode.put(2008, 2);
}
getPartition(){
years.get(flightDelay.year)
}

*/

public class Task1Maximum extends Configured implements Tool {

public static class MaximumMapper extends Mapper<LongWritable, Text, Text, FlightDelay>{

@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

String line = value.toString();

String[] delays = StringUtils.split(line, '\\', ',');

if(delays[0].equalsIgnoreCase("year")){
return;
}

if(Utils.replaceNAwithZero(delays)){
return;
}

String airCode = delays[16];

FlightDelay flightDelay = new FlightDelay(
Integer.parseInt(delays[14]),
delays[17],
Integer.parseInt(delays[0]),
Integer.parseInt(delays[1]),
Integer.parseInt(delays[2]));

context.write(new Text(airCode), flightDelay);

}
}

public static class MaximumReducer extends Reducer<Text, FlightDelay, Text, FlightDelay>{

@Override
protected void reduce(Text airCode, Iterable<FlightDelay> values, Context context) throws IOException, InterruptedException {

int maximum = 0;

String dptAirCode = "";
int year = 0;
int month = 0;
int dayOfMonth = 0;

for (FlightDelay flightDelay : values) {

if(flightDelay.getArrDelay() > maximum){
maximum = flightDelay.getArrDelay();
dptAirCode = flightDelay.getDptAirCode();
year = flightDelay.getYear();
month = flightDelay.getMonth();
dayOfMonth = flightDelay.getDayOfMonth();
}
}
context.write(airCode, new FlightDelay(maximum, dptAirCode, year, month, dayOfMonth ));
}
}

public static void main(String[] args) {

int result = 0;

try {
result = ToolRunner.run(new Configuration(), new Task1Maximum(), args);
} catch (Exception e) {
e.printStackTrace();
}

System.exit(result);

}

@Override
public int run(String[] strings) throws Exception {

Job job = Job.getInstance(getConf(), "Task1Maximum");
Configuration conf = job.getConfiguration();
conf.set(TextOutputFormat.SEPERATOR, ",");

job.setJarByClass(Task1Maximum.class);
job.setMapperClass(MaximumMapper.class);
job.setReducerClass(MaximumReducer.class);
job.setPartitionerClass(YearPartitioner.class);
job.setNumReduceTasks(2);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlightDelay.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlightDelay.class);
FileInputFormat.setInputPaths(job, new Path("/user/horton/flights/real/"));
FileOutputFormat.setOutputPath(job, new Path("/user/horton/Task1Maximum"));

return job.waitForCompletion(true) ? 0 : 1;

}

}




package flight2.maximum.delay;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

public class FlightDelay implements Writable {

private int arrDelay;
private String dptAirCode;
private int year;
private int month;
private int dayOfMonth;

public FlightDelay() {
}

public FlightDelay(int arrDelay, String dptAirCode, int year, int month, int dayOfMonth) {
this.arrDelay = arrDelay;
this.dptAirCode = dptAirCode;
this.year = year;
this.month = month;
this.dayOfMonth = dayOfMonth;
}

@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeInt(arrDelay);
dataOutput.writeUTF(dptAirCode);
dataOutput.writeInt(year);
dataOutput.writeInt(month);
dataOutput.writeInt(dayOfMonth);
}

@Override
public void readFields(DataInput dataInput) throws IOException {

this.arrDelay = dataInput.readInt();
this.dptAirCode = dataInput.readUTF();
this.year = dataInput.readInt();
this.month = dataInput.readInt();
this.dayOfMonth = dataInput.readInt();
}

public int getArrDelay() {
return arrDelay;
}

public void setArrDelay(int arrDelay) {
this.arrDelay = arrDelay;
}

public String getDptAirCode() {
return dptAirCode;
}

public void setDptAirCode(String dptAirCode) {




package flight2.maximum.delay;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;
import java.util.Map;

public class YearPartitioner extends Partitioner<Text, FlightDelay> {

public static Map<Integer, Integer> yearMap = new HashMap<>();

static {
yearMap.put(2007, 0);
yearMap.put(2008, 1);
}

@Override
public int getPartition(Text key, FlightDelay flightDelay, int i) {
Integer partitionID = yearMap.get(flightDelay.getYear());
return partitionID;
}
}



package flight2.maximum.delay;

public class Utils {

public static boolean replaceNAwithZero(String[] strs){
if(strs == null || strs.length == 0){
return false;
}

for (String str : strs ) {
if(str.trim().equalsIgnoreCase("NA")){
return true;
}
}

return false;
}
}
wuyeyoulanjian1 2018-07-23
  • 打赏
  • 举报
回复
接下来是两道考试原题,第一道是我朋友考试时的截图,求飞机延误时间的平均值:

大致翻译一下题目的要求:

1. 在HDFS上有2个csv文件,放在/user/horton/flights/下,分别是2007.csv 和 2008.csv。

2. 写一个MapReduce程序,计算每个机场 (第17列 Origin) 飞机离开的平均延误时间 (第16列 DepDelay)

3. 运行结果存放在HDFS上的 /user/horton/task1下

4. 输出结果要按照机场编码的首字母 partition 成两个文件,其中一个文件首字母为 ‘A’ 到 ‘M’, 另一个文件首字母为 ‘N’ 到 ‘Z’

5. 输出结果每一行包含两个用逗号分隔的值: 机场编码 和 平均离开的延误时间

6. 不要按年份分别计算平均值,计算的是2007和2008两年所有延误时间的平均值。

题目用到的原始数据一个文件有 135MB, 考试的时候没办法打开看;这里提供的参考数据是对练习题的参考数据加以修改后的得到的csv文件,
但是数据结构和真实考试的数据结构是一模一样的,github地址是:

https://github.com/WileyWu12555/big-data-fun/tree/master/sample_data/2007_2008_average_maximum

我的解答代码地址是: https://github.com/WileyWu12555/big-data-fun/tree/master/src/flight1/average/delay
不确定对不对


package flight1.average.delay;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

/**
Write and execute a Java MapReduce application that satisfies all of the following criteria:

1. The input of the application is the two text files in /user/horton/flights/.
2. Your application computes the average departure delay (column 16) for each distinct airport code (column 17).
3. Store the output in a new folder in HDFS named /user/horton/task1.
4. The output is partitioned into exactly two files. Airport codes that start with 'A' through 'M' should be in one file,
and airport codes that start with 'N' through 'Z' should be in another file.
5. Each row in the output should consist of two values separated by a comma: the airport code and the value you computed for the average departure delay.
6. Do NOT compute two averages (one for each year). Compute the average departure delay over the two year span of 2007 and 2008.

Data:

------------ 2007.csv ------------
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,1,6,7,1050,1050,1211,1210,WN,680,N283WN,81,80,65,1,0,LAX,SFO,337,6,10,0,,0,NA,NA,NA,NA,NA
2007,1,6,7,1244,1245,1405,1405,WN,776,N720WN,81,80,68,0,-1,LAX,SFO,337,3,10,0,,0,NA,NA,NA,NA,NA
2007,1,6,7,1547,1455,1655,1600,WN,173,N350SW,68,65,58,55,52,LAX,SJC,308,4,6,0,,0,21,0,3,0,31
2007,1,6,7,1909,1910,1918,1915,WN,160,N489WN,69,65,56,3,-1,LBB,ABQ,289,5,8,0,,0,NA,NA,NA,NA,NA
2007,1,6,7,1759,1745,1859,1850,WN,555,N512SW,60,65,49,9,14,LBB,AUS,341,3,8,0,,0,NA,NA,NA,NA,NA
2007,1,6,7,847,850,954,955,WN,836,N775SW,67,65,52,-1,-3,LBB,AUS,341,5,10,0,,0,NA,NA,NA,NA,NA

---------- 2008.csv ------------

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,9,3,1552,1550,1856,1900,WN,438,N307SW,124,130,113,-4,2,LIT,BWI,912,3,8,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,706,705,809,810,WN,7,N902WN,63,65,49,-1,1,LIT,DAL,296,3,11,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,1454,1500,1558,1605,WN,41,N465WN,64,65,53,-7,-6,LIT,DAL,296,4,7,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,732,735,826,835,WN,1194,N660SW,54,60,44,-9,-3,MAF,AUS,294,4,6,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,1835,1830,1928,1925,WN,2374,N347SW,53,55,41,3,5,MAF,AUS,294,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,9,3,1537,1535,1636,1635,WN,43,N305SW,59,60,46,1,2,MAF,DAL,319,3,10,0,,0,NA,NA,NA,NA,NA

Understand:

1. Map side: Airport codes (column 17) as the key, departure delay (column 16) as the value;
2. Reduce side: get the sum and the count of the departure delay;
3. Reduce side write the result: (key, sum / count)
4. Set Partition:
public static HashMap<String, Integer> airCode = new HashMap<>();
static{
airCode.put("A", 0);
airCode.put("B", 0);
...
airCode.put("N", 0);;
...
}
* */

public class Task1Average extends Configured implements Tool {

public static class AverageMapper extends Mapper<LongWritable, Text, Text, IntWritable>{

@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] airCodes = StringUtils.split(value.toString(), '\\', ',');
if(airCodes[16].equalsIgnoreCase("Origin")){
return;
}
boolean naCheck = Utils.replaceNAWithZero(airCodes);
if(naCheck){
return;
}

String airCode = airCodes[16];
int delay = Integer.parseInt(airCodes[15]);

context.write(new Text(airCode), new IntWritable(delay));

}
}

public static class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
count++;
}

int avg = sum / count;

context.write(key, new IntWritable(avg));
}

}

@Override
public int run(String[] args) throws Exception {

Job job = Job.getInstance(getConf(), "Task1Average");
Configuration conf = job.getConfiguration();
conf.set(TextOutputFormat.SEPERATOR, ",");

job.setJarByClass(Task1Average.class);
job.setMapperClass(AverageMapper.class);
job.setReducerClass(AverageReducer.class);
job.setPartitionerClass(AirCodePartitioner.class);
job.setNumReduceTasks(2);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path("/user/horton/flights/real/"));
FileOutputFormat.setOutputPath(job, new Path("/user/horton/Task1Average"));

return job.waitForCompletion(true)? 0:1;
}

public static void main(String[] args) {
//

int result = 0;

try {
result = ToolRunner.run(new Configuration(), new Task1Average(), args);
} catch (Exception e) {
e.printStackTrace();
}

System.exit(result);
}
}



package flight1.average.delay;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;
import java.util.Map;

public class AirCodePartitioner extends Partitioner<Text, IntWritable> {

public static Map<String, Integer> airCodeMap = new HashMap<>();

static{
airCodeMap.put("A", 0);
airCodeMap.put("B", 0);
airCodeMap.put("C", 0);
airCodeMap.put("D", 0);
airCodeMap.put("E", 0);
airCodeMap.put("F", 0);
airCodeMap.put("G", 0);
airCodeMap.put("H", 0);
airCodeMap.put("I", 0);
airCodeMap.put("J", 0);
airCodeMap.put("K", 0);
airCodeMap.put("L", 0);
airCodeMap.put("M", 0);
airCodeMap.put("N", 1);
airCodeMap.put("O", 1);
airCodeMap.put("P", 1);
airCodeMap.put("Q", 1);
airCodeMap.put("R", 1);
airCodeMap.put("S", 1);
airCodeMap.put("T", 1);
airCodeMap.put("U", 1);
airCodeMap.put("V", 1);
airCodeMap.put("W", 1);
airCodeMap.put("X", 1);
airCodeMap.put("Y", 1);
airCodeMap.put("Z", 1);
}



@Override
public int getPartition(Text text, IntWritable intWritable, int i) {
String prefix = text.toString().substring(0,1);
Integer patitionID = airCodeMap.get(prefix);
return patitionID;
}
}



package flight1.average.delay;


public class Utils {

public static boolean replaceNAWithZero(String[] strs){
if(strs == null || strs.length == 0){
return false;
}

for(int i = 0; i < strs.length; i++){
if(strs[i].trim().toUpperCase().equals("NA")){
return true;
}
}

return false;
}

}


附原题截图:


wuyeyoulanjian1 2018-07-23
  • 打赏
  • 举报
回复

package flightdelay1.join.practice;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

public class Datee implements WritableComparable<Datee> {

public int year;
public int month;
public int day;

public Datee() {
}

public Datee(int year, int month, int day) {
this.year = year;
this.month = month;
this.day = day;
}

@Override
public void write(DataOutput dataOutput) throws IOException {

dataOutput.writeInt(year);
dataOutput.writeInt(month);
dataOutput.writeInt(day);
}

@Override
public void readFields(DataInput dataInput) throws IOException {

year = dataInput.readInt();
month = dataInput.readInt();
day = dataInput.readInt();

}

public int getYear() {
return year;
}

public void setYear(int year) {
this.year = year;
}

public int getMonth() {
return month;
}

public void setMonth(int month) {
this.month = month;
}

public int getDay() {
return day;
}

public void setDay(int day) {
this.day = day;
}

@Override
public int compareTo(Datee o) {

int response = this.year - o.year;
if(response == 0){
response = this.month - o.month;
}
if(response == 0){
response = this.day - o.day;
}

return response;
}

@Override
public boolean equals(Object o) {

if(o instanceof Datee){
Datee datee = (Datee) o;
if(year == datee.year && month == datee.month && day == datee.day){
return true;
}
}

return false;

}


@Override
public int hashCode() {

return year + month + day;
}

@Override
public String toString() {
return this.year + "," +
this.month + "," +
this.day;
}
}




package flightdelay1.join.practice;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class DelayFileOutputFormat extends FileOutputFormat {


@Override
public RecordWriter<DateDelay, DelayWeather> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

int partition = taskAttemptContext.getTaskAttemptID().getTaskID().getId();
Path outDir = FileOutputFormat.getOutputPath(taskAttemptContext);
Path fileName = new Path(outDir.getName() + Path.SEPARATOR + taskAttemptContext.getJobName() + "_" + partition);

FileSystem fileSystem = fileName.getFileSystem(taskAttemptContext.getConfiguration());

FSDataOutputStream out = fileSystem.create(fileName);

return new DelayRecordWriter(out);
}
}



package flightdelay1.join.practice;

import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import java.io.DataOutputStream;
import java.io.IOException;

public class DelayRecordWriter extends RecordWriter<DateDelay, DelayWeather> {

private DataOutputStream out;

private final static String SEPERATOR = ",";

public DelayRecordWriter() {
}

public DelayRecordWriter(DataOutputStream out) {
this.out = out;
}

@Override
public void write(DateDelay dateDelay, DelayWeather delayWeather) throws IOException, InterruptedException {

StringBuilder builder = new StringBuilder();
builder.append(dateDelay.datee);
builder.append(SEPERATOR);
builder.append(delayWeather);
builder.append("\n");
out.write(builder.toString().getBytes());
}

@Override
public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

out.close();

}
}



package flightdelay1.join.practice;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DelayWeather implements Writable {

public FlightDelay flightDelay;
public Weather weather;

@Override
public void write(DataOutput dataOutput) throws IOException {

flightDelay.write(dataOutput);
weather.write(dataOutput);
}

@Override
public void readFields(DataInput dataInput) throws IOException {

flightDelay = new FlightDelay();
flightDelay.readFields(dataInput);
weather = new Weather();
weather.readFields(dataInput);
}

@Override
public String toString() {
return this.flightDelay + "," + this.weather;
}
}



package flightdelay1.join.practice;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlightDelay implements Writable {

public int depTime;
public int arrTime;
public String uniqueCarrier;
public int flightNum;
public int actualElapsedTime;
public int arrDelay;
public int depDelay;
public String origin;
public String destination;


public FlightDelay() {
}


public FlightDelay(int depTime, int arrTime, String uniqueCarrier, int flightNum, int actualElapsedTime, int arrDelay, int depDelay, String origin, String destination) {
this.depTime = depTime;
this.arrTime = arrTime;
this.uniqueCarrier = uniqueCarrier;
this.flightNum = flightNum;
this.actualElapsedTime = actualElapsedTime;
this.arrDelay = arrDelay;
this.depDelay = depDelay;
this.origin = origin;
this.destination = destination;
}

@Override
public void write(DataOutput dataOutput) throws IOException {

dataOutput.writeInt(depTime);
dataOutput.writeInt(arrTime);
dataOutput.writeUTF(uniqueCarrier);
dataOutput.writeInt(flightNum);
dataOutput.writeInt(actualElapsedTime);
dataOutput.writeInt(arrDelay);
dataOutput.writeInt(depDelay);
dataOutput.writeUTF(origin);
dataOutput.writeUTF(destination);



}

@Override
public void readFields(DataInput dataInput) throws IOException {

this.depTime = dataInput.readInt();
this.arrTime = dataInput.readInt();
this.uniqueCarrier = dataInput.readUTF();
this.flightNum = dataInput.readInt();
this.actualElapsedTime = dataInput.readInt();
this.arrDelay = dataInput.readInt();
this.depDelay = dataInput.readInt();
this.origin = dataInput.readUTF();
this.destination = dataInput.readUTF();

}

@Override
public String toString() {
return
this.depTime + "," +
this.arrTime + "," +
this.uniqueCarrier + "," +
this.flightNum + "," +
this.actualElapsedTime + "," +
this.arrDelay + "," +
this.depDelay + "," +
this.origin + "," +
this.destination + ",";
}
}



package flightdelay1.join.practice;

public class Utils {

public static boolean replaceNAWithZero(String[] strs){
if(strs == null || strs.length == 0){
return false;
}

for(int i = 0; i < strs.length; i++){
if(strs[i].trim().toUpperCase().equals("NA")){
return true;
}
}

return false;
}
}



package flightdelay1.join.practice;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Weather implements Writable {

private int prcp;
private int tMax;
private int tMin;

public Weather() {
}

public Weather(int prcp, int tMax, int tMin) {
this.prcp = prcp;
this.tMax = tMax;
this.tMin = tMin;
}

@Override
public void write(DataOutput dataOutput) throws IOException {

dataOutput.writeInt(prcp);
dataOutput.writeInt(tMax);
dataOutput.writeInt(tMin);

}

@Override
public void readFields(DataInput dataInput) throws IOException {

prcp = dataInput.readInt();
tMax = dataInput.readInt();
tMax = dataInput.readInt();
}

public String toString(){
return this.prcp + "," + this.tMax + "," + this.tMin;
}

}
wuyeyoulanjian 2018-07-21
  • 打赏
  • 举报
回复

package flightdelay1.join.practice;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DateDelay implements WritableComparable<DateDelay> {

public Datee datee;
public int arrDelay;

public DateDelay() {
}

public DateDelay(Datee datee, int arrDelay) {
this.datee = datee;
this.arrDelay = arrDelay;
}

@Override
public void write(DataOutput dataOutput) throws IOException {

datee.write(dataOutput);
dataOutput.writeInt(arrDelay);

}

@Override
public void readFields(DataInput dataInput) throws IOException {

datee = new Datee();
datee.readFields(dataInput);
arrDelay = dataInput.readInt();

}

@Override
public int compareTo(DateDelay o) {

int response = this.datee.compareTo(o.datee);

if(response == 0){
response = o.arrDelay - this.arrDelay;
}

return response;
}

@Override
public String toString() {
return this.datee + "," + this.arrDelay;
}
}
wuyeyoulanjian 2018-07-21
  • 打赏
  • 举报
回复
练习题的解答代码和数据如下:
https://github.com/WileyWu12555/big-data-fun/tree/master/sample_data/sfo_weather
https://github.com/WileyWu12555/big-data-fun/tree/master/src/flightdelay1/join/practice



import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class Task1 extends Configured implements Tool {

private static final String DESTINATION = "Dest";

public static class DelayJoinMapper extends Mapper<LongWritable, Text, DateDelay, DelayWeather>{

private Map<Datee, Weather> map = new HashMap<>();

private String destination;

@Override
protected void setup(Mapper<LongWritable, Text, DateDelay, DelayWeather>.Context context) throws IOException {

destination = context.getConfiguration().get(DESTINATION);
BufferedReader reader = new BufferedReader(new FileReader("sfo_weather.csv"));
String line;
String[] wStr;
Datee datee;
Weather weather;
while((line = reader.readLine()) != null){
wStr = StringUtils.split(line, '\\', ',');
if(wStr[1].equals("YEAR")){
continue;
}

datee = new Datee(Integer.parseInt(wStr[1]),
Integer.parseInt(wStr[2]),
Integer.parseInt(wStr[3]));

weather = new Weather(Integer.parseInt(wStr[4]),
Integer.parseInt(wStr[5]),
Integer.parseInt(wStr[6]));

map.put(datee, weather);
}

reader.close();
}

protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, DateDelay, DelayWeather>.Context context) throws IOException, InterruptedException {

String[] delays = StringUtils.split(value.toString(), '\\', ',');
DateDelay dateDelay;
Datee datee;
if(delays[0].equals("Year")){
return;
}

if(delays[17].trim().equals(destination)){

boolean xx = Utils.replaceNAWithZero(delays);
if(xx){
return;
}


datee = new Datee(Integer.parseInt(delays[0]),
Integer.parseInt(delays[1]),
Integer.parseInt(delays[2]));

if(map.containsKey(datee)){

dateDelay = new DateDelay(datee, Integer.parseInt(delays[14]));

FlightDelay flightDelay = new FlightDelay(
Integer.parseInt(delays[4]),
Integer.parseInt(delays[6]),
delays[8],
Integer.parseInt(delays[9]),
Integer.parseInt(delays[11]),
Integer.parseInt(delays[14]),
Integer.parseInt(delays[15]),
delays[16],
delays[17]
);
DelayWeather delayWeather = new DelayWeather();
delayWeather.flightDelay = flightDelay;
delayWeather.weather = map.get(datee);
context.write(dateDelay, delayWeather);
}

}

}
}

public static final class DelayJoinReducer extends Reducer<DateDelay, DelayWeather, DateDelay, DelayWeather>{

@Override
protected void reduce(DateDelay key, Iterable<DelayWeather> values, Reducer<DateDelay, DelayWeather, DateDelay, DelayWeather>.Context context) throws IOException, InterruptedException {
Iterator<DelayWeather> iterator = values.iterator();
while(iterator.hasNext()){
context.write(key, iterator.next());
}
}
}

// sudo date -s "2018-06-17 22:04:30"

public static void main(String[] args) {
//
int result = 0;
try{
result = ToolRunner.run(new Configuration(), new Task1(), args);
} catch (Exception e) {
e.printStackTrace();
}

System.exit(result);

}

@Override
public int run(String[]



wuyeyoulanjian 2018-07-21
  • 打赏
  • 举报
回复
首先要明确感谢一下 Younge__ 大神,准备考试的过程中参考了他的博客,帮助非常大。之前也有私下请他帮我检查代码,他都很认真的回复我了。下面这道练习题的解答代码是从他的博客上复制过来的 https://blog.csdn.net/yongaini10/article/details/78453210

我大概翻译一下题目要求:

在HDFS上有4个csv文件,/user/horton/weather/下放的是sfo_weather.csv,/user/horton/flightdelays 下放的是 flight_delays1.csv,flight_delays2.csv,flight_delays3.csv。

四个文件的字段都是以逗号分隔,flightdelays中的文件记录了2008年飞机延误的信息,sfo_weather.csv文件中记录了这一年SFO的天气状况,要写一个MapReduce程序,满足以下需求:

根据日期把 flightdelays 中的 delay 信息和 sfo_weather.csv 的天气信息 join 起来,即以日,月,年 join,并且 dest 列为SFO

输出结果的字段包括:

Year,Month,DayofMonth,DepTime,ArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,ArrDelay,DepDelay,Origin,Dest,PRCP,TMAX,TMIN

举例: 2008年1月3号,从 LAS 飞往 SFO 的488号航班延误信息如下:

2008,1,3,1426,1605,WN,488,99,35,31,LAS,SFO,43,150,94

最终的输出结果,以日期正序排序,如果日期相同,则以ArrDelay进行倒序排序(delay时间最久的排最前); 输出结果包括两个文本文件,字段以逗号分隔,放在HDFS的/user/horton/task1

---------- 附英文原版题目要求 ------------
Environment Details
A one-node HDP cluster is running on a server named namenode that is installed with various HDP components, including HDFS, MapReduce, YARN, Tez and Slider.
You are currently logged in to an Ubuntu instance as a user named horton. As the horton user, you can SSH onto the cluster as the root user:
$ ssh root@namenode
The root password on the namenode is hadoop.
Eclipse is installed and a shortcut is provided on the Desktop.
A project named Task1 is created for you, and a class named task1.Task1 is stubbed out already. The build file for this project is preconfigured to use task1.Task1 as the main class, and the project has the proper build path for developing Hadoop MapReduce applications.
To build the project, right-click on the Task1 project folder in Eclipse and select Run As -> Gradle Build.
Ambari is available at http://namenode:8080. The username and password for Ambari are both admin.

TASK 1
There are two folders in HDFS in the /user/horton folder: flightdelays and weather. These are comma-separated files that contain flight delay information for airports in the U.S. for the year 2008, along with the weather data from the San Francisco airport. Write and execute a Java MapReduce application that satisfies the following criteria:

Join the flight delay data in flightdelays with the weather data in weather. Join the data by the day, month and year and also where the "Dest" column in flightdelays is equal to "SFO".

The output of each delayed flight into SFO consists of the following fields:
Year,Month,DayofMonth,DepTime,ArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,ArrDelay,DepDelay,Origin,Dest,PRCP,TMAX,TMIN

For example, for the date 2008-01-03, there is a delayed flight number 488 from Las Vegas (LAS) to San Francisco (SFO). The corresponding output would be:

2008,1,3,1426,1605,WN,488,99,35,31,LAS,SFO,43,150,94

The output is sorted by date ascending, and on each day the output is sorted by ArrDelay descending (so that the longest arrival delays appear first).
The output is in text files in a new folder in HDFS named task1 with values separated by commas
The output is in two text files


Data:
-------------------------sfo_weather.csv ---------------------------------------
STATION_NAME,YEAR,MONTH,DAY,PRCP,TMAX,TMIN
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,01,0,122,39
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,02,0,117,39
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,03,43,150,94
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,04,533,150,100
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,05,196,122,78
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,06,15,106,50
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,07,0,111,67
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,08,20,128,61

------flight_delays1.csv,flight_delays2.csv,flight_delays3.csv-----

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,
ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,
CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,1829,1755,1959,1925,WN,3920,N464WN,90,90,77,34,34,IND,BWI,515,3,10,0,,0,2,0,0,0,32
2008,1,3,4,1940,1915,2121,2110,WN,378,N726SW,101,115,87,11,25,IND,JAX,688,4,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,1937,1830,2037,1940,WN,509,N763SW,240,250,230,57,67,IND,LAS,1591,3,7,0,,0,10,0,0,0,47
2008,1,3,4,1039,1040,1132,1150,WN,535,N428WN,233,250,219,-18,-1,IND,LAS,1591,7,7,0,,0,NA,NA,NA,NA,NA

Understand:

1.Inner join, we can see weather data is small enough to get into memory, so let's start with map side join.
a.Add weather data as cache file.Use day, month and year as the join key.
b.Use "Dest" column in flightdelays as filter, which will filter "Dest" column in flightdelays is equal to "SFO".
c.Get the "SFO" from arguments in main().

2. Use Year,Month,DayofMonth as key, use DepTime,ArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,ArrDelay,DepDelay,Origin,Dest,PRCP,TMAX,TMIN as value.

a. DepTime,ArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,ArrDelay,DepDelay,Origin,Dest from flight_delays1.csv,flight_delays2.csv,flight_delays3.csv.
b.PRCP,TMAX,TMIN from sfo_weather.csv.

3.Modify Year,Month,DayofMonth, ArrDelay as key. Custom output format.

4.Output dir is "task1", output file is text file, fields separated by commas.

5.Reducer task is two.

20,810

社区成员

发帖
与我相关
我的任务
社区描述
Hadoop生态大数据交流社区,致力于有Hadoop,hive,Spark,Hbase,Flink,ClickHouse,Kafka,数据仓库,大数据集群运维技术分享和交流等。致力于收集优质的博客
社区管理员
  • 分布式计算/Hadoop社区
  • 涤生大数据
加入社区
  • 近7日
  • 近30日
  • 至今
社区公告
暂无公告

试试用AI创作助手写篇文章吧