Efficiency of bulk-importing data into HBase

JOE-1992 2014-11-23 10:40:22

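I've recently been studying Hadoop and HBase. Importing 56.7 MB of data into HBase in standalone mode took 9 minutes 34 seconds, which seems very slow. I'd like to ask anyone with experience how to bulk-import data efficiently.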

I'm on Linux with HBase in standalone mode. The data is a .dat file whose fields are separated by " | ", with 1 million records in total.

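Here is what I do: read the file line by line, split each line on "|", and store the fields in an array in their original order. Once a complete record is in the array, I put each field into its corresponding column. The write buffer size (setWriteBufferSize) is set to 5 MB.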

Below is my basic code. Any pointers would be appreciated:

import java.io.FileInputStream;
import java.util.Scanner;
import java.util.StringTokenizer;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Read the data file and store the fields of each line in array.
// table (the HBase table handle) and n (the row counter) are set up elsewhere.
String filepath = "/home/joe/data.dat";
String array[] = new String[10];

Scanner in = new Scanner(new FileInputStream(filepath)); // open the file

while (in.hasNextLine()) { // process each line
    int count = 0;
    String line = in.nextLine();
    StringTokenizer lineTokenizer = new StringTokenizer(line, "|"); // split the line into fields on '|'
    // ArrayList<String> l = new ArrayList<String>();
    while (lineTokenizer.hasMoreTokens()) { // store each field of the line
        String num = lineTokenizer.nextToken();

        // int temp = Integer.parseInt(num);
        array[count] = num;
        // System.out.print(array[count]+" ");
        count++;
    }
    // table.flushCommits();
    for (int j = 0; j < 10; j++) {
        // System.out.print(array[j] + " ");
        // Insert each field of the line into the table.
        // array[8] is the row key, so there is no case 8.
        switch (j) {
        case 0: {
            Put put0 = new Put(Bytes.toBytes(array[8]));
            put0.add(Bytes.toBytes("Info"), Bytes.toBytes("SrcIP"),
                    Bytes.toBytes(array[0]));
            table.put(put0);
            break;
        }
        case 1: {
            Put put1 = new Put(Bytes.toBytes(array[8]));
            put1.add(Bytes.toBytes("Info"), Bytes.toBytes("DestIP"),
                    Bytes.toBytes(array[1]));
            table.put(put1);
            break;
        }
        case 2: {
            Put put2 = new Put(Bytes.toBytes(array[8]));
            put2.add(Bytes.toBytes("Info"), Bytes.toBytes("SrcPort"),
                    Bytes.toBytes(array[2]));
            table.put(put2);
            break;
        }
        case 3: {
            Put put3 = new Put(Bytes.toBytes(array[8]));
            put3.add(Bytes.toBytes("Info"), Bytes.toBytes("DestPort"),
                    Bytes.toBytes(array[3]));
            table.put(put3);
            break;
        }
        case 4: {
            Put put4 = new Put(Bytes.toBytes(array[8]));
            put4.add(Bytes.toBytes("Info"),
                    Bytes.toBytes("CaptureTime"),
                    Bytes.toBytes(array[4]));
            table.put(put4);
            break;
        }
        case 5: {
            Put put5 = new Put(Bytes.toBytes(array[8]));
            put5.add(Bytes.toBytes("Info"), Bytes.toBytes("Flag"),
                    Bytes.toBytes(array[5]));
            table.put(put5);
            break;
        }
        case 6: {
            Put put6 = new Put(Bytes.toBytes(array[8]));
            put6.add(Bytes.toBytes("Info"), Bytes.toBytes("Protocol"),
                    Bytes.toBytes(array[6]));
            table.put(put6);
            break;
        }
        case 7: {
            Put put7 = new Put(Bytes.toBytes(array[8]));
            put7.add(Bytes.toBytes("Info"), Bytes.toBytes("ISP"),
                    Bytes.toBytes(array[7]));
            table.put(put7);
            break;
        }
        case 9: {
            Put put9 = new Put(Bytes.toBytes(array[8]));
            put9.add(Bytes.toBytes("Info"), Bytes.toBytes("QueryType"),
                    Bytes.toBytes(array[9]));
            table.put(put9);
            break;
        }
        }
    }
    n++;
    // System.out.println();

    if (n % 100000 == 0) { // flush the write buffer every 100,000 rows
        table.flushCommits();
        // System.out.println("--------------------------------------------");
    }
}
table.flushCommits(); // flush whatever is still buffered
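
For comparison, here is a minimal sketch of collecting all fields of a line into one record per row and handing records off in batches, rather than issuing one put per column. It uses only the JDK so that it runs on its own: `RowBatcher`, `RowRecord`, `BatchSink`, and the `batchSize` parameter are hypothetical names, with `RowRecord` standing in for an HBase `Put` and `BatchSink.write` standing in for a call such as `HTable.put(List<Put>)`. The column names and the row key in field 8 are taken from the code above. Note that `split` keeps empty fields, whereas `StringTokenizer` skips them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowBatcher {
    // Column qualifier for each field position; field 8 is the row key, so it has no column.
    static final String[] COLUMNS = {
        "SrcIP", "DestIP", "SrcPort", "DestPort", "CaptureTime",
        "Flag", "Protocol", "ISP", null, "QueryType"
    };
    static final int ROW_KEY_INDEX = 8;

    /** Hypothetical stand-in for one HBase Put: a row key plus all of its column values. */
    static final class RowRecord {
        final String rowKey;
        final Map<String, String> columns = new LinkedHashMap<>();

        RowRecord(String rowKey) {
            this.rowKey = rowKey;
        }
    }

    /** Hypothetical stand-in for the table client that receives a whole batch at once. */
    interface BatchSink {
        void write(List<RowRecord> batch);
    }

    /** Parses one '|'-delimited line into a single record that covers every column. */
    static RowRecord parseLine(String line) {
        String[] fields = line.split("\\|", -1);
        if (fields.length < COLUMNS.length) {
            throw new IllegalArgumentException(
                "expected " + COLUMNS.length + " fields but got " + fields.length);
        }
        RowRecord record = new RowRecord(fields[ROW_KEY_INDEX]);
        for (int i = 0; i < COLUMNS.length; i++) {
            if (COLUMNS[i] != null) {
                record.columns.put(COLUMNS[i], fields[i]);
            }
        }
        return record;
    }

    /** Sends records to the sink in batches of batchSize and returns the number of rows read. */
    static int load(Iterable<String> lines, BatchSink sink, int batchSize) {
        List<RowRecord> batch = new ArrayList<>(batchSize);
        int rows = 0;
        for (String line : lines) {
            batch.add(parseLine(line));
            rows++;
            if (batch.size() == batchSize) {
                sink.write(new ArrayList<>(batch)); // pass a copy so the buffer can be reused
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.write(new ArrayList<>(batch)); // final partial batch
        }
        return rows;
    }
}
```

With a real table, the sink would presumably build one `Put` per `RowRecord`, add every column to it, and pass the resulting list to the table in a single call. Whether this helps enough for a given data set is something to measure rather than assume.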


10 replies


人生偌只如初见 2014-11-25


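Quoting reply #2 by dingji_ping:

[quote=Quoting reply #1 by wulinshishen:] To import large amounts of data into HBase, you can first generate HFiles with MapReduce and then load them into HBase with BulkLoad.

Does that mean calling its API methods cannot be made efficient? When you say to first generate HFiles with MapReduce, do you mean the step of converting the text file into HFiles?[/quote] Yes: first use MapReduce to convert the text files on HDFS into HFiles, the storage format specific to HBase, then use the BulkLoad method that HBase provides to move the generated HFiles into the appropriate location.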
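
To make the text-to-HFile step a little more concrete: an HFile stores its cells sorted by row key and then by column, so the conversion job essentially turns each input line into cells and emits them in that order. The sketch below imitates only that ordering using the JDK. It is not HBase or MapReduce code, and a real job would rely on HBase's own HFile output format instead of producing strings. `CellSorter` and `toSortedCells` are hypothetical names, the `Info` column names and the row key in field 8 come from the original post's code, and plain `String` ordering matches byte ordering only for ASCII keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CellSorter {
    // Column qualifier for each field position; field 8 is the row key, so it has no column.
    static final String[] QUALIFIERS = {
        "SrcIP", "DestIP", "SrcPort", "DestPort", "CaptureTime",
        "Flag", "Protocol", "ISP", null, "QueryType"
    };
    static final int ROW_KEY_INDEX = 8;

    /** Turns '|'-delimited lines into "rowKey/Info:qualifier=value" cells in sorted order. */
    static List<String> toSortedCells(List<String> lines) {
        // TreeMap keeps row keys, and the columns within each row, sorted.
        TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\\|", -1);
            if (fields.length < QUALIFIERS.length) {
                throw new IllegalArgumentException("too few fields: " + line);
            }
            TreeMap<String, String> columns =
                rows.computeIfAbsent(fields[ROW_KEY_INDEX], key -> new TreeMap<>());
            for (int i = 0; i < QUALIFIERS.length; i++) {
                if (QUALIFIERS[i] != null) {
                    columns.put("Info:" + QUALIFIERS[i], fields[i]);
                }
            }
        }
        List<String> cells = new ArrayList<>();
        for (Map.Entry<String, TreeMap<String, String>> row : rows.entrySet()) {
            for (Map.Entry<String, String> column : row.getValue().entrySet()) {
                cells.add(row.getKey() + "/" + column.getKey() + "=" + column.getValue());
            }
        }
        return cells;
    }
}
```

In the MapReduce version, the framework's shuffle and sort phase plays the role that the `TreeMap` plays here, which is one reason the job is a natural fit for producing HFiles.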

JOE-1992 2014-11-25