HBase ingestion optimization: could the experts take a look at how to improve my code?

曹宇 2014-03-04 11:59:52

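The cluster has a 17+1 architecture with 3 ZooKeeper nodes.

The data to load into HBase is 25 million rows. Each row has 46 fields separated by |, and the 10th field is used as the row key.
The HBase table structure is a row key plus one column family, with 45 columns under that family.

The ingestion code is: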



package com.ericsson.andromeda.put;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class PutToHbase {

	public static Configuration conf;
	private static HTable table;
	private static File file;

	static {
		conf = HBaseConfiguration.create();
		conf.set("hbase.zookeeper.property.clientPort", "2181");
		conf.set(
				"hbase.zookeeper.quorum",
				"sdw017.andromeda.com,sdw016.andromeda.com,sdw015.andromeda.com");
		conf.set("hbase.master", "192.168.1.202:60000");
//		conf.set("hbase.zookeeper.property.dataDir", "/root/zookeeper");
		conf.set("hbase.cluster.distributed", "true");
		conf.set("hbase.rootdir", "hdfs://mdw.andromeda.com:8020/apps/hbase/data");

		file = new File("/andromeda/cdrfile/NC_5G.csv");
	}

	public static void main(String[] args) {
		long startTime = System.currentTimeMillis();

		// buffer of Put objects that are committed to HBase in batches
		List<Put> puts = new ArrayList<Put>();

		// reader for the source file
		BufferedReader bufr;

		int count = 0;   // number of lines read
		int count2 = 0;  // number of batches committed

		try {
			table = new HTable(conf, "testnc");

			//table.setAutoFlush(false);

			// check that the source file exists
			if (!file.exists()) {
				System.out.println("source file does not exist");
				System.exit(0);
			}

			bufr = new BufferedReader(new FileReader(file));
			String read = null;

			// print the time at which the job starts
			System.out.println("Start time : " + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(System.currentTimeMillis()));

			// start the put process
			while ((read = bufr.readLine()) != null) {
				// split the source line into a String[]
				String[] fields = read.split("\\|");

				// the callingNum (10th field) is the row key
				String callingNum = fields[9];

				// build the Put object for this row
				Put put = new Put(callingNum.getBytes());
				for (int i = 0; i < fields.length; i++) {
					// map each source field to an HBase cell: family, column and value
					put.add("other".getBytes(), ("field" + i).getBytes(), fields[i].getBytes());
				}
				puts.add(put);

				// once more than 5000 Put objects are buffered, commit them to HBase
				if (puts.size() > 5000) {
					table.put(puts);
					table.flushCommits();
					puts.clear();
					System.out.println("now it is " + (count2++) + ".........5000.........");
				}

				count++;
			}
			bufr.close();

			// commit whatever is left in the buffer
			if (puts.size() > 0) {
				table.put(puts);
				table.flushCommits();
				puts.clear();
			}

			System.out.println("Put data is done. Total is " + count + " lines");

			table.close();

			long endTime = System.currentTimeMillis();
			long time = endTime - startTime;  // elapsed time in milliseconds
			System.out.println("end time : " + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(System.currentTimeMillis()));
			System.out.println("The put job took: " + time + " ms");

		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}

A sample row of the source data looks like this:
20140217 11:16:58:177|20140217 11:16:58:177|116|24|1|1|1|1|1|15201023462|13101023462|13101023462|18501023457|18601023460|0|0|0|0|1|0|3|1|3|65534|1|311|490|813|421|956|603|413|377|2|4|3|6|6|6|1|3|0|0|12345678903|82345638902|12345638982

-------------------------------------
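My main concern is that the TPS feels very slow: it takes roughly 1 second for the puts list in the if check above to exceed 5000 Put objects and be committed.

In other words, about 5000 Put objects are written per second, and each Put object holds 45 family:qualifier:value cells.

That is roughly 5000*45=225000 TPS.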
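
For reference, the figures above can be worked through in a few lines. This is only a back-of-the-envelope sketch: ThroughputEstimate and its method names are made up for illustration, and it assumes the rate of 5000 rows per second stays steady over all 25 million rows.

```java
// Back-of-the-envelope figures using only the numbers quoted in this post.
class ThroughputEstimate {
    // Cells written per second = rows per second * columns per row.
    static long cellsPerSecond(long rowsPerSecond, int columnsPerRow) {
        return rowsPerSecond * columnsPerRow;
    }

    // Seconds needed to load totalRows at a steady rowsPerSecond (rounded up).
    static long secondsToLoad(long totalRows, long rowsPerSecond) {
        return (totalRows + rowsPerSecond - 1) / rowsPerSecond;
    }

    public static void main(String[] args) {
        long cells = cellsPerSecond(5000, 45);            // 225000 cells per second
        long seconds = secondsToLoad(25000000L, 5000);    // 5000 seconds
        System.out.println(cells + " cells/s, about " + (seconds / 60) + " minutes in total");
    }
}
```

At that rate the full file would take about 5000 seconds, a little under an hour and a half.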

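This is currently a single thread loading from one node; multiple threads should be faster.

1. I would appreciate any optimization advice, thank you.
2. I originally used table.setAutoFlush(false); but with it enabled, region servers go offline during ingestion, and which one goes offline is random. I can't make sense of it and would like an explanation.

Many thanks to all the experts.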

3 replies

java8964 2014-03-09

One second for one put operation is too slow, whether it is a single put or a batch put. In my experience, an HBase random write should take around 10 ms. Here is what you need to find out:
1) Have you run the HBase performance evaluation test on your cluster? See http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
2) Check whether the network is the cause. Run the tests from the client, then directly from one of the region servers. If the numbers are much higher from the client, you may have a network problem between your client and the HBase cluster.
3) Everything 撸大湿 said is correct. What I can suggest is to test your cluster with the HBase performance evaluation first, so that your own code is out of the picture. If the evaluation shows that your cluster is fine, then look for what is wrong in your own code.

jpjiang4648 2014-03-05

I'd suggest the OP use bulk load. Put feels suited to loading small amounts of data; for massive data, bulk load is more convenient.

撸大湿 2014-03-04

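This question is troublesome to explain, because it touches on a lot of areas.

Code:
1. The put list + put list size approach to controlling batch commits can be replaced with put + writebuffersize. It is hard to say which is better; you need to test.
2. For bulk updates, consider turning off the WAL via setWriteToWAL. This gives a 5% to 10% performance gain, but if the region goes down, data that has not yet been flushed to the store files is lost.
3. Is split performance the bottleneck? Take a look at this post: http://bbs.csdn.net/topics/390715632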
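
On item 3 of the code list: assuming it refers to the read.split("\\|") call in the posted code, one thing to try is a hand-written splitter on the single '|' character. The following is only a sketch (PipeSplitter is a made-up name). As far as I know, recent JDKs already have a fast path in String.split for simple delimiters like this one, so whether it helps has to be measured. Note that, unlike String.split, it keeps trailing empty fields.

```java
import java.util.ArrayList;
import java.util.List;

class PipeSplitter {
    // Splits a line on a single delimiter character without using a regex.
    // Unlike String.split, trailing empty fields are kept.
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        int end;
        while ((end = line.indexOf(delimiter, start)) >= 0) {
            fields.add(line.substring(start, end));
            start = end + 1;
        }
        fields.add(line.substring(start)); // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        List<String> fields = split("20140217 11:16:58:177|116|24||", '|');
        System.out.println(fields.size() + " fields: " + fields);
    }
}
```

Timing the read-and-split loop on its own, with the HBase calls commented out, would show whether parsing takes a meaningful share of the one second per batch.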

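Put design:
1. Before putting, increase the number of regions of the htable.
2. Row keys must be hashed; monotonically increasing keys put too much pressure on a single region.
3. Shorten the cf and column names, for example "other" can become "_0", and "field"+i can become "f0"+i (don't forget the zero padding).
4. Multithreading is a must! Don't even compare single-threaded with multithreaded. But don't overlook network bandwidth: if multiple threads on one client machine write to an HBase cluster, is that client's network the bottleneck? Check it!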
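
The multithreading point can be sketched roughly like this. It is not a drop-in replacement for the posted code: BatchSink is a hypothetical stand-in for "write one batch to HBase" and ParallelBatchLoader is a made-up name. In a real loader each worker thread would need its own table instance, since HTable is not thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class ParallelBatchLoader {
    // Hypothetical stand-in for "write one batch to HBase".
    interface BatchSink {
        void write(List<String> batch) throws Exception;
    }

    // Cuts the rows into batches and writes the batches from several threads.
    // Returns the number of rows the sink accepted without throwing.
    static long load(Iterable<String> rows, BatchSink sink, int threads, int batchSize) {
        AtomicLong written = new AtomicLong();
        // Bounded queue + CallerRunsPolicy: when the writers fall behind, the
        // reading thread writes a batch itself instead of buffering the whole file.
        ExecutorService pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(threads * 2),
                new ThreadPoolExecutor.CallerRunsPolicy());
        List<String> batch = new ArrayList<String>(batchSize);
        for (String row : rows) {
            batch.add(row);
            if (batch.size() >= batchSize) {
                submit(pool, sink, batch, written);
                batch = new ArrayList<String>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            submit(pool, sink, batch, written);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written.get();
    }

    private static void submit(ExecutorService pool, final BatchSink sink,
                               final List<String> batch, final AtomicLong written) {
        pool.execute(new Runnable() {
            public void run() {
                try {
                    sink.write(batch);
                    written.addAndGet(batch.size());
                } catch (Exception e) {
                    e.printStackTrace(); // a real loader would retry or record the failed batch
                }
            }
        });
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<String>();
        for (int i = 0; i < 12345; i++) {
            rows.add("row" + i);
        }
        final AtomicLong calls = new AtomicLong();
        long n = load(rows, new BatchSink() {
            public void write(List<String> b) {
                calls.incrementAndGet(); // no-op sink: only counts the batches
            }
        }, 4, 5000);
        System.out.println(n + " rows in " + calls.get() + " batches");
    }
}
```

Running it with a no-op sink like the one in main also gives a rough upper bound on how fast the client side can go without HBase or the network involved.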

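For configuration, follow these principles (settings differ between HBase versions, so I won't paste specific values; the OP should consult the official documentation for their own HBase version):
1. The memstore size and count can be increased, especially for large batches of random puts.
2. Don't merge store files too frequently.
3. Splits need to be reduced, so pre-create regions before putting (see item 1 of the put design list).
The settings above reduce TPS fluctuation and the stalls caused by flush, compact, and split.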
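
Pre-created regions only help if the row keys actually spread across them, so the hashed row key and the region boundaries are best designed together. Below is one possible sketch, assuming a two-digit hash prefix ("salt") in front of the key. SaltedKeys, salt and splitPoints are made-up names, and the bucket count is an arbitrary choice. The split points would then be supplied as split keys when the table is created, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class SaltedKeys {
    // Prefixes the key with a two-digit bucket derived from its hash, so that
    // similar keys spread over `buckets` key ranges. Assumes buckets <= 100.
    static String salt(String key, int buckets) {
        int bucket = (key.hashCode() & 0x7fffffff) % buckets;
        return String.format("%02d", bucket) + "-" + key;
    }

    // Boundaries between buckets: "01", "02", ... (buckets - 1 of them).
    static List<String> splitPoints(int buckets) {
        List<String> points = new ArrayList<String>();
        for (int b = 1; b < buckets; b++) {
            points.add(String.format("%02d", b));
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(salt("15201023462", 16));
        System.out.println(splitPoints(16));
    }
}
```

The prefix is computed from the key itself, so a lookup by phone number can recompute it. The trade-off is that a range scan over the original keys has to query every bucket.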

Finally, if the data volume is really huge, consider a custom MapReduce job or bulk load.