Total-order sorting problem when using TotalOrderPartitioner in Hadoop

大数据中的大叔 2017-09-26 05:53:10

I have recently been trying to get total-order sorting working in Hadoop, following the code in this post:
https://www.iteblog.com/archives/2147.html#TotalOrderPartitioner-2

The input data was generated with this script (saved as iteblog.sh):

#!/bin/sh

for i in {1..100000}; do
    echo $RANDOM
done;

which I ran as:

sh iteblog.sh > data1

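My job does use TotalOrderPartitioner, and the split points were generated and stored as a SequenceFile. However, the files produced by my 3 reducers are each sorted only internally; there is no total order across the files. Most of the blog posts I have read say that TotalOrderPartitioner does produce a totally sorted result. My code is below.

Hadoop version: 2.7.3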

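Mapper:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SimpleMapper extends Mapper<Text, Text, Text, IntWritable> {

    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        // The input lines contain no tab, so KeyValueTextInputFormat puts the whole line in the key.
        IntWritable intWritable = new IntWritable(Integer.parseInt(key.toString()));
        context.write(key, intWritable);
    }
}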

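Reducer:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SimpleReducer extends Reducer<Text, IntWritable, IntWritable, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Emit every value so that duplicate numbers are kept.
        for (IntWritable value : values) {
            context.write(value, NullWritable.get());
        }
    }
}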

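Driver:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SimpleDriver {

    // args[0]: input path, args[1]: output path, args[2]: partition file path
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Total Order Sorting");
        job.setJarByClass(SimpleDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setSortComparatorClass(KeyComparator.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(3);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);

        // Sample the input keys and write the split points to the partition file.
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path(args[2]));
        InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<>(0.01, 1000, 100);
        InputSampler.writePartitionFile(job, sampler);

        job.setPartitionerClass(TotalOrderPartitioner.class);
        job.setMapperClass(SimpleMapper.class);
        job.setReducerClass(SimpleReducer.class);

        job.setJobName("iteblog");
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}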

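KeyComparator:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class KeyComparator extends WritableComparator {

    protected KeyComparator() {
        super(Text.class, true);
    }

    // Compare Text keys by their numeric value rather than byte by byte.
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        int v1 = Integer.parseInt(w1.toString());
        int v2 = Integer.parseInt(w2.toString());
        return Integer.compare(v1, v2);
    }
}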

I would be very grateful for any help with this.
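
One thing that may be worth checking in the setup above: the map output keys are Text. As far as I can tell from the Hadoop 2.x source, TotalOrderPartitioner by default (the boolean property named by TotalOrderPartitioner.NATURAL_ORDER, true unless overridden) looks up partitions for BinaryComparable keys such as Text through a byte-wise trie rather than through the job's sort comparator. If that applies here, the split points are chosen in numeric order while keys are routed in byte order, which would match reducer outputs that are sorted internally but overlap each other. The sketch below is a plain-Java model of that mismatch, not Hadoop code; PartitionOrderDemo, partition, and the two split points are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

public class PartitionOrderDemo {

    // Numeric order on decimal strings, like the KeyComparator in the post.
    static final Comparator<String> NUMERIC =
            (a, b) -> Integer.compare(Integer.parseInt(a), Integer.parseInt(b));

    // Unsigned byte-wise order on UTF-8 bytes, a stand-in for the natural order of Text.
    static final Comparator<String> BYTEWISE = (a, b) -> {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    };

    // Partition index = number of split points that are <= key under the given order.
    static int partition(String key, String[] splits, Comparator<String> order) {
        int p = 0;
        for (String s : splits) {
            if (order.compare(key, s) >= 0) {
                p++;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical split points for 3 reducers over the range 0..32767.
        String[] splits = {"10923", "21845"};
        for (String key : new String[] {"9", "200", "15000", "30000"}) {
            System.out.println(key
                    + " numeric=" + partition(key, splits, NUMERIC)
                    + " bytewise=" + partition(key, splits, BYTEWISE));
        }
    }
}
```

With these split points, the key 9 lands in partition 0 under numeric order but in partition 2 under byte order. If this is indeed the cause, setting that property to false in the job configuration before submitting should make the partitioner use the sort comparator; I have not verified this against the job above.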

2 replies

大数据中的大叔 2017-10-30

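Hello, I just tried that and it did not work. I read this blog post and got the general idea: https://www.iteblog.com/archives/2147.html. I have been in touch with the blog author, and my code is identical to his, yet for some reason I still cannot get the expected result.

If I change the Mapper key to IntWritable, that does not fit how InputSampler.writePartitionFile is implemented, because that function requires the input key type and the output key type to be the same. When I changed the key to IntWritable, I got:

Exception in thread "main" java.io.IOException: wrong key class: org.apache.hadoop.io.Text is not class org.apache.hadoop.io.IntWritable

I did not change the InputFormat. It is still KeyValueTextInputFormat, so the input key of the map function is of type Text.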
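
The exception above is consistent with how the partition file appears to be written in Hadoop 2.x, as far as I understand the source: InputSampler.writePartitionFile samples keys from the job's InputFormat (Text keys here), but opens the SequenceFile writer with the map output key class, and the writer rejects any key whose class differs from the declared one. The sketch below is a plain-Java model of that check, not Hadoop code; TypedKeyWriter and writeSamples are hypothetical names, and String and Integer stand in for Text and IntWritable.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyClassCheckDemo {

    // Hypothetical stand-in for a writer that is opened with a declared key class
    // and refuses keys of any other class on append.
    static final class TypedKeyWriter {
        private final Class<?> keyClass;
        private final List<Object> written = new ArrayList<>();

        TypedKeyWriter(Class<?> keyClass) {
            this.keyClass = keyClass;
        }

        void append(Object key) throws IOException {
            if (key.getClass() != keyClass) {
                throw new IOException("wrong key class: "
                        + key.getClass().getName() + " is not " + keyClass);
            }
            written.add(key);
        }

        int size() {
            return written.size();
        }
    }

    // Writes sampled input keys through a writer declared with the map output key class.
    // Returns a short status string, or the exception message if a key is rejected.
    static String writeSamples(List<?> sampledInputKeys, Class<?> mapOutputKeyClass) {
        TypedKeyWriter writer = new TypedKeyWriter(mapOutputKeyClass);
        try {
            for (Object key : sampledInputKeys) {
                writer.append(key);
            }
            return "ok: " + writer.size() + " keys";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("4821", "17", "30110");
        // Input key class and declared key class match: all samples are written.
        System.out.println(writeSamples(samples, String.class));
        // Declared key class differs from the sampled keys: the first append fails.
        System.out.println(writeSamples(samples, Integer.class));
    }
}
```

Under this reading, changing only the map output key class is not enough: the InputFormat would also have to produce keys of that class, for example through a custom InputFormat, so that the sampled keys and the declared key class agree.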