hive multi-group-by: wrong results when deduplicating with distinct

tchqiq 2014-07-15 07:05:02
I ran into a problem writing HiveQL, mainly involving a Hive-specific syntax: Hive supports a multi-group-by style (also called multi-insert?) in which several INSERTs share a single FROM clause. After the job finished, I found the computed uv was wrong. Without further ado, here is the code:

FROM (
select expid(curl,'pid2.properties') as epid, -- extract product id
expid(exagent(agt)['name'],'browse.properties') as browser, -- extract browser from the agent header
expid(exagent(agt)['category'],'category.properties') as category, -- extract platform from the agent header
expid(exagent(agt)['os'],'os.properties') as os, -- extract OS from the agent header
uid,
cid
from file_pv_track where pdate = '$date'
) a
INSERT OVERWRITE DIRECTORY '$output1' $sql1
INSERT OVERWRITE DIRECTORY '$output2' $sql2
INSERT OVERWRITE DIRECTORY '$output3' $sql3


The three SQL variables are as follows:


sql1="select '$date',
a.epid,
a.os,
count(1) as pv,
count(distinct a.cid) as uv,
count(distinct a.uid) as l_uv
group by a.epid,a.os";

sql2="select '$date',
a.epid,
a.category,
count(1) as pv,
count(distinct a.cid) as uv,
count(distinct a.uid) as l_uv
group by a.epid,a.category";

sql3="select '$date',
a.epid,
a.browser,
count(1) as pv,
count(distinct a.cid) as uv,
count(distinct a.uid) as l_uv
group by a.epid,a.browser";


My intent is to first produce one intermediate result set and then run different group by operations over it.
This HiveQL runs as 4 jobs: the first executes the FROM subquery, and the remaining 3 depend on it and can run in parallel. The execution plan is as follows:
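To pin down the intended semantics, here is a small Python sketch with made-up data (the column order mirrors the subquery's select list): materialize the intermediate rows once, then run three independent group-bys, each computing a global distinct count per group.

```python
# The intent of the query, as a toy sketch: one scan, then three
# independent aggregations, each with a GLOBAL distinct count per group.
from collections import defaultdict

rows = [  # (epid, browser, category, os, uid, cid)
    ("p1", "chrome", "mobile", "ios", "u1", "c1"),
    ("p1", "chrome", "mobile", "ios", "u2", "c1"),
    ("p1", "safari", "mobile", "ios", "u1", "c2"),
]

def agg(rows, key_fn):
    out = defaultdict(lambda: [0, set(), set()])
    for row in rows:
        g = out[key_fn(row)]
        g[0] += 1         # pv   = count(1)
        g[1].add(row[5])  # uv   = count(distinct cid)
        g[2].add(row[4])  # l_uv = count(distinct uid)
    return {k: (pv, len(cids), len(uids)) for k, (pv, cids, uids) in out.items()}

by_os       = agg(rows, lambda r: (r[0], r[3]))  # group by epid, os
by_category = agg(rows, lambda r: (r[0], r[2]))  # group by epid, category
by_browser  = agg(rows, lambda r: (r[0], r[1]))  # group by epid, browser
print(by_os[("p1", "ios")])  # (3, 2, 2)
```

This is what the single-query version computes; the question is whether the multi-insert plan preserves it.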


ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME file_pv_track))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTION expid (TOK_TABLE_OR_COL curl) 'pid2.properties') epid) (TOK_SELEXPR (TOK_FUNCTION expid ([ (TOK_FUNCTION exagent (TOK_TABLE_OR_COL agt)) 'name') 'browse.properties') browser) (TOK_SELEXPR (TOK_FUNCTION expid ([ (TOK_FUNCTION exagent (TOK_TABLE_OR_COL agt)) 'category') 'category.properties') category) (TOK_SELEXPR (TOK_FUNCTION expid ([ (TOK_FUNCTION exagent (TOK_TABLE_OR_COL agt)) 'os') 'os.properties') os) (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_TABLE_OR_COL cid))) (TOK_WHERE (= (TOK_TABLE_OR_COL pdate) '2014-07-14')))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR '/output/online/fact_agent/2014-07-14/1405393817/fact_agent_os/')) (TOK_SELECT (TOK_SELEXPR '2014-07-14') (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) epid)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) os)) (TOK_SELEXPR (TOK_FUNCTION count 1) pv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL cid)) uv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL uid)) l_uv)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) epid) (. (TOK_TABLE_OR_COL a) os))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR '/output/online/fact_agent/2014-07-14/1405393817/fact_agent_platform/')) (TOK_SELECT (TOK_SELEXPR '2014-07-14') (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) epid)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) category)) (TOK_SELEXPR (TOK_FUNCTION count 1) pv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL cid)) uv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL uid)) l_uv)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) epid) (. (TOK_TABLE_OR_COL a) category))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR '/output/online/fact_agent/2014-07-14/1405393817/fact_agent_browser/')) (TOK_SELECT (TOK_SELEXPR '2014-07-14') (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) epid)) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL a) browser)) (TOK_SELEXPR (TOK_FUNCTION count 1) pv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL cid)) uv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL uid)) l_uv)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) epid) (. (TOK_TABLE_OR_COL a) browser))))
STAGE DEPENDENCIES:
Stage-3 is a root stage
Stage-4 depends on stages: Stage-3
Stage-0 depends on stages: Stage-4
Stage-5 depends on stages: Stage-3
Stage-1 depends on stages: Stage-5
Stage-6 depends on stages: Stage-3
Stage-2 depends on stages: Stage-6
STAGE PLANS:
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
a:file_pv_track
TableScan
alias: file_pv_track
Select Operator
expressions:
expr: expid(curl, 'pid2.properties')
type: string
expr: expid(exagent(agt)['name'], 'browse.properties')
type: string
expr: expid(exagent(agt)['category'], 'category.properties')
type: string
expr: expid(exagent(agt)['os'], 'os.properties')
type: string
expr: uid
type: string
expr: cid
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Reduce Output Operator
key expressions:
expr: _col5
type: string
expr: _col4
type: string
sort order: ++
Map-reduce partition columns:
expr: _col5
type: string
expr: _col4
type: string
tag: -1
value expressions:
expr: _col0
type: string
expr: _col3
type: string
expr: 1
type: int
expr: _col2
type: string
expr: _col1
type: string
Reduce Operator Tree:
Forward
Group By Operator
aggregations:
expr: count(VALUE._col2)
expr: count(DISTINCT KEY._col0)
expr: count(DISTINCT KEY._col1)
bucketGroup: false
keys:
expr: VALUE._col0
type: string
expr: VALUE._col1
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Group By Operator
aggregations:
expr: count(VALUE._col2)
expr: count(DISTINCT KEY._col0)
expr: count(DISTINCT KEY._col1)
bucketGroup: false
keys:
expr: VALUE._col0
type: string
expr: VALUE._col3
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Group By Operator
aggregations:
expr: count(VALUE._col2)
expr: count(DISTINCT KEY._col0)
expr: count(DISTINCT KEY._col1)
bucketGroup: false
keys:
expr: VALUE._col0
type: string
expr: VALUE._col4
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-4
Map Reduce
Alias -> Map Operator Tree:
hdfs://csdn/tmp/hive-hadoop/hive_2014-07-15_11-10-24_836_3002486700983717417-1/-mr-10003
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
sort order: ++
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: string
tag: -1
value expressions:
expr: _col2
type: bigint
expr: _col3
type: bigint
expr: _col4
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(VALUE._col1)
expr: count(VALUE._col2)
bucketGroup: false
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: string
mode: final
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Select Operator
expressions:
expr: '2014-07-14'
type: string
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: bigint
expr: _col3
type: bigint
expr: _col4
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Move Operator
files:
hdfs directory: true
destination: /output/online/fact_agent/2014-07-14/1405393817/fact_agent_os
Stage: Stage-5
Map Reduce
Alias -> Map Operator Tree:
hdfs://csdn/tmp/hive-hadoop/hive_2014-07-15_11-10-24_836_3002486700983717417-1/-mr-10004
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
sort order: ++
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: string
tag: -1
value expressions:
expr: _col2
type: bigint
expr: _col3
type: bigint
expr: _col4
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(VALUE._col1)
expr: count(VALUE._col2)
bucketGroup: false
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: string
mode: final
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Select Operator
expressions:
expr: '2014-07-14'
type: string
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: bigint
expr: _col3
type: bigint
expr: _col4
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: false
GlobalTableId: 2
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-1
Move Operator
files:
hdfs directory: true
destination: /output/online/fact_agent/2014-07-14/1405393817/fact_agent_platform
..........


As you can see, it seems to perform the distinct right away in Stage-3 instead of waiting until Stage-4/Stage-5. The problem should be here.
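A toy Python sketch of how this could inflate the counts, based on my reading of the plan above (data and partitioner are made up; the partitioner is a deterministic stand-in for Hive's hash partitioning on (cid, uid)): Stage-3 shuffles on (cid, uid), each reducer emits per-group partial distinct counts in "mode: hash", and the next stage merely merges (sums) those partials in "mode: final", so the same cid can be counted on more than one reducer.

```python
# Toy model of the suspect multi-insert plan: partial distinct counts per
# reducer, summed in the follow-up stage. Made-up data.
from collections import defaultdict

rows = [  # (epid, os, cid, uid)
    ("p1", "ios", "c1", "u1"),
    ("p1", "ios", "c1", "u2"),  # same cid as above, different uid
    ("p1", "ios", "c2", "u1"),
]

NUM_REDUCERS = 2
pairs = sorted({(cid, uid) for _, _, cid, uid in rows})
# same (cid, uid) always lands on the same reducer, like a hash partitioner
partition = {p: i % NUM_REDUCERS for i, p in enumerate(pairs)}

# Stage-3 reduce side: per-reducer, per-group partial (pv, |cids|, |uids|)
partials = []
for r in range(NUM_REDUCERS):
    groups = defaultdict(lambda: [0, set(), set()])
    for epid, os_, cid, uid in rows:
        if partition[(cid, uid)] == r:
            g = groups[(epid, os_)]
            g[0] += 1
            g[1].add(cid)
            g[2].add(uid)
    for key, (pv, cids, uids) in groups.items():
        partials.append((key, pv, len(cids), len(uids)))

# Stage-4 "final" merge: just sums the partial counts per group key
final = defaultdict(lambda: [0, 0, 0])
for key, pv, uv, l_uv in partials:
    final[key][0] += pv
    final[key][1] += uv
    final[key][2] += l_uv

true_uv = len({cid for _, _, cid, _ in rows})  # 2 distinct cids
buggy_uv = final[("p1", "ios")][1]             # 3: c1 counted on both reducers
print("final:", dict(final), "true uv:", true_uv)
```

In this toy run, pv stays correct (3) but uv comes out as 3 instead of 2, which matches the symptom: the multi-group-by uv is larger than the true value.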

If multi-group-by is not used, there is no problem. The code is as follows:

select '2014-07-07',
a.epid,
a.category,
count(1) as pv,
count(distinct cid) as uv,
count(distinct uid) as l_uv
FROM (
select expid(curl,'pid2.properties') as epid,
expid(exagent(agt)['name'],'browse.properties') as browser,
expid(exagent(agt)['category'],'category.properties') as category,
expid(exagent(agt)['os'],'os.properties') as os,
uid,
cid
from file_pv_track where pdate = '2014-07-07'
) a
group by a.epid,a.category


The uv computed with multi-group-by is larger than the uv computed without it... That's a lot of text; could some expert explain what is going on?
3 replies
tchqiq 2014-07-21
Quoting reply #2 from weitao1234:
Thanks! I was beginning to wonder whether this problem was too complicated, since nobody was answering...
weitao1234 2014-07-19
tchqiq 2014-07-15
In addition, the execution plan of the SQL that does not use multi-group-by is as follows:

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME file_pv_track))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTION expid (TOK_TABLE_OR_COL curl) 'pid2.properties') epid) (TOK_SELEXPR (TOK_FUNCTION expid ([ (TOK_FUNCTION exagent (TOK_TABLE_OR_COL agt)) 'name') 'browse.properties') browser) (TOK_SELEXPR (TOK_FUNCTION expid ([ (TOK_FUNCTION exagent (TOK_TABLE_OR_COL agt)) 'category') 'category.properties') category) (TOK_SELEXPR (TOK_FUNCTION expid ([ (TOK_FUNCTION exagent (TOK_TABLE_OR_COL agt)) 'os') 'os.properties') os) (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_TABLE_OR_COL cid))) (TOK_WHERE (= (TOK_TABLE_OR_COL pdate) '2014-07-07')))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR '/output/debug/t')) (TOK_SELECT (TOK_SELEXPR '2014-07-07') (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) epid)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) category)) (TOK_SELEXPR (TOK_FUNCTION count 1) pv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL cid)) uv) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL uid)) l_uv)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) epid) (. (TOK_TABLE_OR_COL a) category))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:file_pv_track
TableScan
alias: file_pv_track
Select Operator
expressions:
expr: expid(curl, 'pid2.properties')
type: string
expr: expid(exagent(agt)['category'], 'category.properties')
type: string
expr: uid
type: string
expr: cid
type: string
outputColumnNames: _col0, _col2, _col4, _col5
Select Operator
expressions:
expr: _col0
type: string
expr: _col2
type: string
expr: _col5
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col2, _col5, _col4
Group By Operator
aggregations:
expr: count(1)
expr: count(DISTINCT _col5)
expr: count(DISTINCT _col4)
bucketGroup: false
keys:
expr: _col0
type: string
expr: _col2
type: string
expr: _col5
type: string
expr: _col4
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: string
sort order: ++++
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: string
tag: -1
value expressions:
expr: _col4
type: bigint
expr: _col5
type: bigint
expr: _col6
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(DISTINCT KEY._col2:0._col0)
expr: count(DISTINCT KEY._col2:1._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: string
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Select Operator
expressions:
expr: '2014-07-07'
type: string
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: bigint
expr: _col3
type: bigint
expr: _col4
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Move Operator
files:
hdfs directory: true
destination: /output/debug/t
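Note how this plan differs: the shuffle partitions only on the group keys (epid, category) while the distinct columns ride along in the sort key, so each group reaches exactly one reducer and its distinct sets are complete. A toy Python sketch of that behavior (made-up data; the partitioner is a stand-in for Hive's):

```python
# Toy model of the single-insert plan: partition on the GROUP keys, so a
# whole group lands on one reducer and distinct counts are exact.
from collections import defaultdict

rows = [  # (epid, category, cid, uid)
    ("p1", "mobile", "c1", "u1"),
    ("p1", "mobile", "c1", "u2"),
    ("p1", "mobile", "c2", "u1"),
]

NUM_REDUCERS = 2
group_keys = sorted({(epid, cat) for epid, cat, _, _ in rows})
partition = {k: i % NUM_REDUCERS for i, k in enumerate(group_keys)}

final = {}
for r in range(NUM_REDUCERS):
    groups = defaultdict(lambda: [0, set(), set()])
    for epid, cat, cid, uid in rows:
        if partition[(epid, cat)] == r:
            g = groups[(epid, cat)]
            g[0] += 1
            g[1].add(cid)
            g[2].add(uid)
    for k, (pv, cids, uids) in groups.items():
        final[k] = (pv, len(cids), len(uids))  # exact: the group is never split

print(final[("p1", "mobile")])  # (3, 2, 2): uv and l_uv are correct
```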
