100 points for help with 2 simple problems; beginner looking for guidance

plumebobo 2014-01-10 03:03:21

Problem 1:
I have two files like this:
a|b
a|c|d
I want to use MapReduce to turn them into
a|b|c|d
but I don't know how...

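Problem 2:
One file:
a1|b1|c1|d1
a2|b2|c2|d2
How can I turn it into an array (the data is currently about 2 million rows; I don't know what structure is best, so I'll try an array first) and pass it into MapReduce?
What I actually want to do is this:
check whether a and b both appear in that array.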

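I don't know whether the above can be done... I've only just started learning to write Hadoop code...
Until now I've only done environment setup, Hadoop upgrades and the like, which was a real pain...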

13 replies


plumebobo 2014-02-27

Quoting tntzbzc's reply (#12):

Is the OP from Shanda?

Er... I work at a crummy little company in Nanjing doing outsourced work for Huawei...

撸大湿 2014-01-14

Is the OP from Shanda?

plumebobo 2014-01-14

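Quoting tntzbzc's reply (#1):

Both can be solved. I'll post the source code for you later.

OK, got it working. The master is amazing; the 100 points are yours~~~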

zippooooo 2014-01-13

The poster above is amazing. 撸大湿 is really impressive.

撸大湿 2014-01-13

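Quoting plumebobo's reply (#8):

This is roughly what I want for the second problem.
What I really want to know is how to pass that list into the mapper...
I'm not sure where to start... I'm reading books and looking for examples... the API docs alone are rough going...

Here is an answer to the second problem.

In SQL this is simple:

SELECT * FROM A
  WHERE EXISTS(
    SELECT * FROM B WHERE A.C1=B.C1 AND A.C2=B.C2)

Table A is the database, and table B is the list the OP wants to filter by.

But Hive, which runs on MapReduce, has no EXISTS syntax. What then?
Take a different approach:

SELECT C1,C2
FROM (
  SELECT DISTINCT C1,C2 FROM A
  UNION ALL
  SELECT DISTINCT C1,C2 FROM B
  )TB
GROUP BY C1,C2 HAVING COUNT(1)>1

Rewriting the SQL like this does the job.
Now all that's left is to translate this Hive SQL into Java MapReduce.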

Input file 1, the database file: /tmp/plumebobo/test0002
a1|b1|c1|d1
a2|b2|c2|d2
a3|b3|c4|d4
a1|b3|c3|d2
a4|b2|c3|d2
a1|b4|c2|d1
a8|b1|c4|d9
a5|b6|c1|d3

Input file 2, the list file: /tmp/plumebobo/keyword0002
a1|b3
a2|b2
a7|b1
a8|b1
a3|b9

Output: /tmp/plumebobo/out002/part-r-xxxxx
a1|b3
a2|b2
a8|b1



import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class plumebobo0002 {

	public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
		Text myKey = new Text();
		Text myValue = new Text();

		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			if (!value.toString().contains("|"))
				return;
			String[] myStr = value.toString().split("\\|");
			if (myStr.length < 2)
				return; // e.g. "a|": split drops the trailing empty column
			myKey.set(myStr[0] + "|" + myStr[1]); // only the first two columns are needed
			if (myStr.length > 2) {
				myValue.set("W"); // database row
			}
			else {
				myValue.set("L"); // list row
			}
			context.write(myKey, myValue);
		}
	}

	public static class MyCombiner extends Reducer<Text, Text, Text, Text>
	{
		Text myValue = new Text();

		public void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {

			// Don't underestimate this Combiner; on large data it improves performance noticeably.
			// Database rows and list rows never show up in the same map task,
			// so keeping only the first value is enough.
			myValue.set(values.iterator().next().toString());
			context.write(key, myValue);
		}
	}

	public static class MyReducer extends Reducer<Text, Text, Text, NullWritable>
	{
		NullWritable myValue = NullWritable.get();

		public void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {

			int WordCount = 0;
			int ListCount = 0;
			for (Text val : values) {
				if (val.toString().equals("W") && WordCount == 0) {
					WordCount++;
				}
				else if (val.toString().equals("L") && ListCount == 0)
				{
					ListCount++;
				}
			}
			// Emit only keys that appear in both the database and the list
			if (WordCount == 1 && ListCount == 1)
				context.write(key, myValue);
		}
	}

	public static void main(String[] args) throws Exception {
		String Oarg[] = new String[3];
		Oarg[0] = "/tmp/plumebobo/test0002"; // database file
		Oarg[1] = "/tmp/plumebobo/out002";
		Oarg[2] = "/tmp/plumebobo/keyword0002"; // list file
		Configuration conf = new Configuration();
		conf.set("mapred.job.tracker", "m04.ct1.r01.hdp:9001");
		Job job = new Job(conf, "plumebobo0002");

		job.setJarByClass(plumebobo0002.class);
		job.setMapperClass(MyMapper.class);
		job.setCombinerClass(MyCombiner.class);
		job.setReducerClass(MyReducer.class);
		job.setNumReduceTasks(1);
		job.setOutputFormatClass(TextOutputFormat.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);

		FileInputFormat.addInputPath(job, new Path(Oarg[0]));
		FileInputFormat.addInputPath(job, new Path(Oarg[2]));
		FileOutputFormat.setOutputPath(job, new Path(Oarg[1]));
		job.waitForCompletion(true);

	}
}

I re-edited this post to fix a bug that showed up across blocks.
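
To check the expected output without a cluster, the same tag-and-intersect idea can be tried in plain Java. This is only a rough sketch: the class name TagIntersectSketch and its helper methods are hypothetical, there is no Hadoop dependency, and an in-memory TreeMap stands in for the shuffle and the reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the job above; not part of the posted code.
class TagIntersectSketch {

    // Mirrors the mapper: key = first two columns, tag = "W" (database row) or "L" (list row).
    // Returns null for lines that have fewer than two columns.
    static String[] tag(String line) {
        String[] cols = line.split("\\|");
        if (cols.length < 2) {
            return null;
        }
        return new String[] { cols[0] + "|" + cols[1], cols.length > 2 ? "W" : "L" };
    }

    // Mirrors shuffle + reducer: keep the keys that carry both a "W" tag and an "L" tag.
    static List<String> intersect(List<String> lines) {
        // TreeMap keeps the keys sorted, much like the sorted reducer input.
        Map<String, boolean[]> seen = new TreeMap<>();
        for (String line : lines) {
            String[] kv = tag(line);
            if (kv == null) {
                continue;
            }
            boolean[] flags = seen.computeIfAbsent(kv[0], k -> new boolean[2]);
            flags["W".equals(kv[1]) ? 0 : 1] = true;
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, boolean[]> e : seen.entrySet()) {
            if (e.getValue()[0] && e.getValue()[1]) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Both sample files from the post, concatenated.
        List<String> lines = Arrays.asList(
            "a1|b1|c1|d1", "a2|b2|c2|d2", "a3|b3|c4|d4", "a1|b3|c3|d2",
            "a4|b2|c3|d2", "a1|b4|c2|d1", "a8|b1|c4|d9", "a5|b6|c1|d3",
            "a1|b3", "a2|b2", "a7|b1", "a8|b1", "a3|b9");
        for (String key : intersect(lines)) {
            System.out.println(key); // prints a1|b3, a2|b2, a8|b1, one per line
        }
    }
}
```

Like the mapper, the sketch tells the two inputs apart only by column count, so it inherits the same assumption: list rows have exactly two columns and database rows have more.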

plumebobo 2014-01-12

Quoting tntzbzc's reply (#1):

Both can be solved. I'll post the source code for you later.


import java.util.ArrayList;
import java.util.List;

public class Test 
{
	public static void main(String[] args) {
		String lineValue = "12345678905|read.qidian.com|http://read.qidian.com/BookReader/2932090,50482961.aspx|0|2932090";
		String[] tempArray = lineValue.split("\\|");
		
		String secondUrl = tempArray[1];// second-level domain
		String bookId = tempArray[4];// bookid
		// Prefix to look for; the trailing "|" keeps 2932090 from matching a longer id such as 29320901
		String tempStr = secondUrl + "|" + bookId + "|";
		boolean isFlag = false;// whether the record is in the knowledge base
		
		List<String> list = new ArrayList<String>();
		list.add("read.qidian.com|2932091|bookname1|author1|xuanhuan");
		list.add("read.qidian.com|2932092|bookname2|author2|xuanhuan");
		list.add("read.qidian.com|2932093|bookname3|author3|yanqing");
		list.add("read.qidian.com|2932094|bookname4|author4|yanqing");
		list.add("read.qidian.com|2932090|bookname5|author5|xuanhuan");
		
		for (String entry : list)
		{
			if (entry.startsWith(tempStr))
			{
				isFlag = true;// found in the knowledge base
				System.out.println("-----This site already exists in the knowledge base-----");
				break;
			}
		}
	}
}

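This is roughly what I want for the second problem. What I really want to know is how to pass that list into the mapper... I'm not sure where to start... I'm reading books and looking for examples... the API docs alone are rough going...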
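
If the list does fit in memory, one option is to index it once in a HashSet keyed by "domain|bookid" and look each record up in constant time instead of scanning the list. The sketch below is plain Java with hypothetical names (KnowledgeBaseLookupSketch, buildIndex, isKnown). In a Hadoop job the index would presumably be built once per task, for example in Mapper.setup(Context), from a side file made available to the tasks; that wiring is an assumption and is not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not from the thread: set-based membership test on a composite key.
class KnowledgeBaseLookupSketch {

    // Builds a set of "secondLevelDomain|bookId" keys from knowledge-base lines
    // laid out as domain|bookid|title|author|category.
    static Set<String> buildIndex(List<String> knowledgeBaseLines) {
        Set<String> index = new HashSet<>();
        for (String line : knowledgeBaseLines) {
            String[] cols = line.split("\\|");
            if (cols.length >= 2) {
                index.add(cols[0] + "|" + cols[1]);
            }
        }
        return index;
    }

    // Record layout as in the Test class above: column 1 is the domain, column 4 is the bookid.
    static boolean isKnown(Set<String> index, String logLine) {
        String[] cols = logLine.split("\\|");
        return cols.length >= 5 && index.contains(cols[1] + "|" + cols[4]);
    }

    public static void main(String[] args) {
        Set<String> index = buildIndex(Arrays.asList(
            "read.qidian.com|2932091|bookname1|author1|xuanhuan",
            "read.qidian.com|2932090|bookname5|author5|xuanhuan"));
        String logLine = "12345678905|read.qidian.com|"
            + "http://read.qidian.com/BookReader/2932090,50482961.aspx|0|2932090";
        System.out.println(isKnown(index, logLine)); // prints true
    }
}
```

Because whole keys are compared, a bookid that is merely a prefix of another one does not produce a false match.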

zcw1967 2014-01-12

Quoting tntzbzc's reply (#3):

The first one is simple. Input file /tmp/plumebobo/test0001 contains a|b, a|c|d, b|a|c, b|d, b|e|f; the output in /tmp/plumebobo/out001/part-r-xxxxx is a|b|c|d and b|a|c|d|e|f. The output is sorted.


A true expert. Respect!

plumebobo 2014-01-10

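Quoting tntzbzc's reply (#1):

Both can be solved. I'll post the source code for you later.

Creating a new txt file and typing the data into it works, but any output containing Chinese characters comes out garbled, and after switching to UTF-8 I get extra line breaks~~~ I'll fix that in a bit, or I could just pass the category id instead.

I develop on Windows and package the project as a jar for the server... the data has to be uploaded to HDFS... I don't know how to debug with cygwin, so all I have is LOGGER.info... My way of working seems really inefficient...

It was only when I saw classes being extended that I didn't recognize that I thought of checking the API. Thanks again!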

plumebobo 2014-01-10

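Quoting tntzbzc's reply (#1):

Both can be solved. I'll post the source code for you later.

Here's the situation: I wrote the code for a reading topic. Take http://read.qidian.com/BookReader/3033699,49363375.aspx as an example. In this URL, read.qidian.com is the second-level domain and 3033699 is the bookid. Used together as a composite key, they uniquely determine whether a book exists among all Qidian novels.

I have a knowledge base that holds many bookids, the matching second-level domains, and other information about each novel. When a user reads a novel, a URL is generated; to classify it, I first have to check whether it exists in the knowledge base.

I extracted the knowledge base into a file laid out as second-level domain|bookid|title|author|category, and I want to see whether read.qidian.com|3033699 is in it...

First of all, thanks to the moderator for the help with my first problem. It feels like a whole new world has opened up~~~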
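
For illustration, the composite key described above could be pulled out of such a URL roughly as follows. This is a sketch with hypothetical names (BookKeySketch, bookKey), and it assumes every URL has the same shape as the example, with the bookid being the part of the last path segment before the comma.

```java
import java.net.URI;

// Hypothetical helper, not from the thread: derives "host|bookid" from a reader URL.
class BookKeySketch {

    // Returns "host|bookId", or null when the URL does not have the assumed shape.
    static String bookKey(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return null; // not a syntactically valid URI
        }
        String host = uri.getHost();
        String path = uri.getPath();
        if (host == null || path == null) {
            return null;
        }
        // Last path segment, e.g. "3033699,49363375.aspx"
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int comma = lastSegment.indexOf(',');
        if (comma <= 0) {
            return null; // no bookid before a comma
        }
        return host + "|" + lastSegment.substring(0, comma);
    }

    public static void main(String[] args) {
        // prints read.qidian.com|3033699
        System.out.println(bookKey("http://read.qidian.com/BookReader/3033699,49363375.aspx"));
    }
}
```

The resulting string has the same form as the first two columns of the knowledge-base file, so it can be compared against them directly.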

mimixigu5 2014-01-10

Leaving a comment so I can follow along and learn...

撸大湿 2014-01-10

The first one is simple.

Input file: /tmp/plumebobo/test0001
a|b
a|c|d
b|a|c
b|d
b|e|f

Output: /tmp/plumebobo/out001/part-r-xxxxx
a|b|c|d
b|a|c|d|e|f

The output is sorted; the code is as follows.


import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class plumebobo0001 {

	public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
		Text myKey = new Text();
		Text myValue = new Text();

		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			if (!value.toString().contains("|"))
				return;
			String[] myStr = value.toString().split("\\|");
			for (int i = 1; i < myStr.length; i++) {
				myKey.set(myStr[0] + "|" + myStr[i]); // emit the data in the key; the value stays empty
				context.write(myKey, myValue);
			}
			
		}
	}

	public static class MyReducer extends Reducer<Text, Text, Text, NullWritable>
	{
		Text myKey = new Text();
		NullWritable myValue = NullWritable.get();

		public void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {

			StringBuilder myStr = new StringBuilder("");
			
			// Iterate and pull the data out of the key (the key changes as the values advance)
			// Grouping is overridden, so no secondary sort is needed here
			for (Text val : values) {
				if (myStr.length() == 0) {
					myStr.append(key.toString());
				}
				else {
					myStr.append("|");
					myStr.append(key.toString().split("\\|")[1]);
				}
			}
			myKey.set(myStr.toString());
			context.write(myKey, myValue);
		}
	}

	public static void main(String[] args) throws Exception {
		String Oarg[] = new String[2];
		Oarg[0] = "/tmp/plumebobo/test0001";
		Oarg[1] = "/tmp/plumebobo/out001";
		Configuration conf = new Configuration();
		conf.set("mapred.job.tracker", "m04.ct1.r01.hdp:9001");
		Job job = new Job(conf, "plumebobo0001");

		job.setJarByClass(plumebobo0001.class);
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		job.setNumReduceTasks(1);
		job.setOutputFormatClass(TextOutputFormat.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);

		job.setPartitionerClass(MyPartitioner.class);
		job.setGroupingComparatorClass(MyGroupingComparator.class);

		FileInputFormat.setInputPaths(job, new Path(Oarg[0]));
		FileOutputFormat.setOutputPath(job, new Path(Oarg[1]));
		job.waitForCompletion(true);

	}
}

// Partition by the first column
class MyPartitioner extends HashPartitioner<Text, Text>
{
	@Override
	public int getPartition(Text key, Text value, int numPartitions) {
		Text cols = new Text(key.toString().split("\\|")[0]);
		return super.getPartition(cols, value, numPartitions);// cols[0]
	}
}

// Group by the value of the first column
class MyGroupingComparator implements RawComparator<Text>
{

	@Override
	public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
		Text key1 = new Text();
		Text key2 = new Text();

		// Deserialize both keys so their first columns can be compared
		DataInputBuffer buffer = new DataInputBuffer();
		try {
			buffer.reset(b1, s1, l1);
			key1.readFields(buffer);
			buffer.reset(b2, s2, l2);
			key2.readFields(buffer);
		}
		catch (IOException e) {
			throw new RuntimeException(e);
		}

		return compare(key1, key2);
	}

	@Override
	public int compare(Text o1, Text o2) {
		// Compare on the first column only, same as the raw comparison above
		String str1 = o1.toString().split("\\|")[0];
		String str2 = o2.toString().split("\\|")[0];
		return str1.compareTo(str2);
	}
}
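
The flow of the job above (emit composite keys, sort them, then group on the first column) can be followed on the sample input with a small plain-Java sketch. Everything here is hypothetical and free of Hadoop: GroupByFirstColumnSketch, mapLine, partition and merge are illustration names, and partition uses String.hashCode() only as a stand-in for the hash the real partitioner computes on the Text key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical local stand-in for the job above; not part of the posted code.
class GroupByFirstColumnSketch {

    // Map step: one composite key "first|other" for every remaining non-empty column.
    static List<String> mapLine(String line) {
        List<String> keys = new ArrayList<>();
        String[] cols = line.split("\\|");
        for (int i = 1; i < cols.length; i++) {
            if (!cols[i].isEmpty()) {
                keys.add(cols[0] + "|" + cols[i]);
            }
        }
        return keys;
    }

    // Partition on the first column only, so all keys of one group land in the same partition.
    static int partition(String compositeKey, int numPartitions) {
        String first = compositeKey.split("\\|")[0];
        return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Shuffle + reduce: sort the composite keys, then merge runs that share a first column.
    static List<String> merge(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            keys.addAll(mapLine(line));
        }
        Collections.sort(keys); // sorting whole keys keeps each group contiguous and ordered

        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentFirst = null;
        for (String key : keys) {
            String[] parts = key.split("\\|");
            if (!parts[0].equals(currentFirst)) {
                if (currentFirst != null) {
                    out.add(current.toString()); // close the previous group
                }
                current = new StringBuilder(key);
                currentFirst = parts[0];
            } else {
                current.append('|').append(parts[1]);
            }
        }
        if (currentFirst != null) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a|b", "a|c|d", "b|a|c", "b|d", "b|e|f");
        for (String row : merge(lines)) {
            System.out.println(row); // prints a|b|c|d and then b|a|c|d|e|f
        }
    }
}
```

As in the posted reducer, repeated values are not removed: if the same pair occurs twice in the input, it appears twice in the merged row.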

What output do you need for your second problem? OP, please spell out the requirements for it more clearly; I'll post code for you later.

plumebobo 2014-01-10