When Sqoop imports a larger data set, the data concentrates in a few files

ycj1918 2014-09-03 04:30:04
When I use Sqoop to import data from Oracle into HDFS, the distribution becomes severely skewed once the data reaches a few tens of GB.
With -m 4 the import produces four files, but two of them are empty and almost everything lands in the other two; with -m 3 it all ends up in a single file.

On the Hadoop monitoring page, the tasks behind the empty files all show:

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.

How can I fix this? As long as the data is a bit smaller, the distribution is even and every task runs smoothly.

Any pointers would be appreciated.
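Roughly the kind of command involved (the connection string, credentials, table name and target directory below are hypothetical placeholders):

# Plain Sqoop import from Oracle to HDFS with four parallel map tasks.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password '***' \
  --table BIG_TABLE \
  --target-dir /data/big_table \
  -m 4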
9 replies
qingyuan18 2014-09-13
Where are you importing to, HDFS or HBase?
java8964 2014-09-13
In Sqoop you have to choose your own split (partition) key, and its values need to be evenly distributed so the partitions come out balanced.
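A minimal sketch of that suggestion, assuming a numeric column such as EMPLOYEE_ID whose values are spread evenly across their range (the column name, connection details and table name are hypothetical):

# Override Sqoop's default split column (the primary key) with an
# evenly distributed column so the four ranges carry similar row counts.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password '***' \
  --table BIG_TABLE \
  --split-by EMPLOYEE_ID \
  --target-dir /data/big_table \
  -m 4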
hhhhhhhhik 2014-09-11
Quoting reply #6 by hhhhhhhhik below (the Sqoop User Guide excerpt and the OraOop suggestion):
Just looked: it has already been integrated into Sqoop 1.4.5, so check the documentation.
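A hedged sketch of using that connector, assuming the Sqoop 1.4.5 setup where the built-in Oracle connector (formerly OraOop) is enabled with --direct (connection details and table name are placeholders):

# With --direct the Oracle connector splits the work by ROWID ranges
# rather than by primary-key ranges, which sidesteps key skew.
sqoop import --direct \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password '***' \
  --table BIG_TABLE \
  --target-dir /data/big_table \
  -m 4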
hhhhhhhhik 2014-09-11
Quote (Sqoop User Guide):
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html You could try OraOop, the Sqoop connector for Oracle; it can split tasks by ROWID, which should give a better balance.
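One way to confirm that the primary key is the problem is to look at its range and row count before importing; a minimal sketch with sqoop eval (the column name ID, connection details and table name are hypothetical):

# If COUNT(*) is large but the rows cluster in a narrow part of
# [MIN(ID), MAX(ID)], Sqoop's evenly sized ID ranges will be unbalanced.
sqoop eval \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password '***' \
  --query "SELECT MIN(ID), MAX(ID), COUNT(*) FROM BIG_TABLE"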
ycj1918 2014-09-05
No one able to help?
ycj1918 2014-09-04
Quoting reply #2 by zixin1990 (the jar-distribution explanation below):
I have seen that too, but that's not it. I'm using Sqoop and haven't written any code of my own. It only happens when the data volume is large: pulling around ten million rows works fine, but hundreds of millions does not.
zixin1990 2014-09-03
The above is from http://www.iteblog.com/archives/789 and https://issues.apache.org/jira/browse/YARN-182
zixin1990 2014-09-03
I Googled around and found someone else's take. In their experience this is usually caused by one of the following: (1) you wrote a Java library, packaged it as a jar, and then wrote a Hadoop program that calls that jar to implement the mapper and reducer; (2) you wrote a Hadoop program that calls a third-party Java library. You then distributed your jar or the third-party jar into HADOOP_HOME on each TaskTracker, ran your Java program, and got the error above. The implication is that the nodes are missing the jar's libraries.
zixin1990 2014-09-03
Are the different tasks running on different nodes?
