Hadoop reducer is very slow even though the algorithm is simple

zjh9058 2012-06-17 01:37:53

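My algorithm is very simple. The data file is 24 GB, and every line consists of two numbers, for example:
12 14
12 15
12 29
13 90
...
The algorithm uses the first number as the key and collects all the second numbers that belong to that key, a bit like finding "followers" in a social network.
The final output is:
12 [14,15,29..]
13 [.....]

But for some reason, when I run it on Hadoop, the map phase is fast and the reduce phase is extremely slow. Reduce gets to 67% quickly, but going from 67% to 67.42% took more than a freaking hour.
Yet when I run wordcount (the example from the official Hadoop site) to count how often each first number occurs in the same 24 GB of data, it is quite fast.
Is it a problem with the algorithm? My code is below; any pointers from the experts would be appreciated.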

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class followerTwitter {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        //private final static IntWritable one = new IntWritable(1);
        private Text twitterID = new Text();
        private Text followerID = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            if (tokenizer.hasMoreTokens()) {
                twitterID.set(tokenizer.nextToken());
            }
            if (tokenizer.hasMoreTokens()) {
                followerID.set(tokenizer.nextToken());
            }
            output.collect(twitterID, followerID);
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            Text followers = new Text();
            String orginal = "Followers:[";
            //Text sum2 = new Text();
            while (values.hasNext()) {
                Text temp = values.next();
                String temps = temp.toString();
                orginal = orginal + temps + ',';
            }

            orginal = orginal + ']';
            Text followerList = new Text(orginal);
            output.collect(key, followerList);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(followerTwitter.class);
        conf.setJobName("followerTwitter");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(Map.class);
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

5 replies

mrsworf 2012-06-19

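In the reducer, string + concatenation is probably slow; try switching to StringBuilder or something similar.
Also, avoid excessive append calls; appending a great many times can be slow too.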

zjh9058 2012-06-17

[Quote=Quoting reply #1:]

In the reducer, string + concatenation is probably slow; try switching to StringBuilder or something similar.

Java code
//Text followers = new Text();
//String orginal = "Followers:[";
StringBuilder orginal = new StringBuilder("Followers:[");
//Text sum2 ……
[/Quote]
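Bro! I'm about to fall in love with you!! I had tried append before too, but in the crudest way, one append at a time, so that also got stuck in the reducer! With your code the reduce finished in ten-odd minutes! Now it's merging the results! Chinese netizens are amazing, I love my country!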

brightyq 2012-06-17

This has nothing to do with Hadoop; the problem is most likely the string handling here:

String temps=temp.toString();
orginal = orginal+temps+',';

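When concatenating strings and no multithreading is involved, use StringBuilder; it is faster than StringBuffer.
In everyday code you won't notice any efficiency difference from orginal = orginal+temps+',';, but with around 20 GB of data like yours the difference becomes very large.
Do as reply #1 suggests and concatenate the strings with StringBuilder's append method.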
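
To make the difference concrete, here is a minimal standalone sketch (no Hadoop needed) of the two concatenation styles. The class name FollowerListDemo and its method names are made up for illustration and are not part of the original job. The + version builds a brand-new String on every iteration and copies everything accumulated so far, so the copying cost grows roughly quadratically with the number of values for a key; the StringBuilder version appends into a single growable buffer.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: FollowerListDemo and its methods are hypothetical names,
// not part of the original job or of Hadoop.
class FollowerListDemo {

    // Same pattern as the original reducer: each '+' allocates a new String
    // and copies everything accumulated so far, so n values cost O(n^2) copying.
    static String joinWithPlus(Iterator<String> values) {
        String result = "Followers:[";
        while (values.hasNext()) {
            result = result + values.next() + ',';
        }
        return result + ']';
    }

    // StringBuilder appends into one growable buffer, so the total work is
    // roughly linear in the length of the output.
    static String joinWithBuilder(Iterator<String> values) {
        StringBuilder sb = new StringBuilder("Followers:[");
        while (values.hasNext()) {
            sb.append(values.next()).append(',');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<String> followers = Arrays.asList("14", "15", "29");
        // Both calls produce the same text; only the cost of building it differs.
        System.out.println(joinWithPlus(followers.iterator()));
        System.out.println(joinWithBuilder(followers.iterator()));
    }
}
```

Both methods return identical strings, so swapping one for the other in the reducer should not change the job's output, only how long a key with many followers takes to process.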

MiceRice 2012-06-17

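Another possible cause: with data this large, everything ends up being reduced in one place.

I rather suspect overly frequent GC is to blame. How much memory have you configured?

I'd also suggest turning on GC logging with "-XX:+PrintGCDetails" and running the job again, to see whether it ends up doing GC constantly.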
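
On the "one place" point, a rough sketch of how keys get spread over reduce tasks may help. It assumes the job uses Hadoop's default HashPartitioner, which picks a partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the posted job sets no partitioner and never calls JobConf.setNumReduceTasks, so unless the cluster configuration overrides it, it probably runs with the default of a single reduce task. The class name PartitionSketch is made up for illustration, and String.hashCode() stands in for Text.hashCode(), so the actual partition numbers on a cluster will differ.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: PartitionSketch is a hypothetical name. The formula mirrors
// Hadoop's default HashPartitioner, but String.hashCode() stands in for
// Text.hashCode(), so real partition numbers on a cluster will differ.
class PartitionSketch {

    // Partition index for a key, given the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many map output records each reduce task would receive.
    static int[] recordsPerReducer(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Keys taken from the sample input lines: 12 14, 12 15, 12 29, 13 90.
        List<String> keys = Arrays.asList("12", "12", "12", "13");
        // With one reduce task, every record lands on it.
        System.out.println(Arrays.toString(recordsPerReducer(keys, 1)));
        // With more reduce tasks, different keys can land on different tasks.
        System.out.println(Arrays.toString(recordsPerReducer(keys, 4)));
    }
}
```

Note that all values for a given key always go to the same reduce call, so adding reduce tasks spreads different keys around but does not help with a single key that has a huge follower list.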

qybao 2012-06-17

In the reducer, string + concatenation is probably slow; try switching to StringBuilder or something similar.

  //Text followers = new Text();
  //String orginal = "Followers:[";
  StringBuilder orginal = new StringBuilder("Followers:[");
  //Text sum2 = new Text();
  while (values.hasNext()) {
    //Text temp =values.next();
    //String temps=temp.toString();
    //orginal = orginal+temps+',';
    orginal.append(values.next().toString()).append(",");
  }

  //orginal =orginal+']';
  orginal.append("]");
  //Text followerList=new Text(orginal);
  Text followerList=new Text(orginal.toString());
  output.collect(key, followerList);