distinct和group by的问题

Xiao_Ma123456789 2015-12-11 02:33:03

加精

目的：用item_06到item_10这个几个字段过滤掉重复的数据
select * from temp 经查询后有5322条数据

--用distinct过滤数据
select distinct item_06,item_07,item_08,item_09,item_10 from temp 经查询后有1363条记录

--用group by 过滤数据 (经查询重复的数据有1344条)
select * from temp a where (a.item_06,a.item_07,a.item_08,a.item_09,a.item_10) in (select item_06,item_07,item_08,item_09,item_10 from temp group bytem_06,item_07,item_08,item_09,item_10 having count(*)>1 )

--去除重复数据 (去除1195条)
delete from temp a where (a.item_06,a.item_07,a.item_08,a.item_09,a.item_10) in (select item_06,item_07,item_08,item_09,item_10 from temp group by item_06,item_07,item_08,item_09,item_10 having count(*)>1 ) and rowid not in (select min(rowid) from temp group by item_06,item_07,item_08,item_09,item_10 having count(*)>1)

那么问题来了，总数据为5322条，去掉重复的数据后还剩4127条。同样是用这几个字段做过滤，为什么数据会产生这么大的差异。

经实际数据的查看，发现用group by这种方式过滤的数据不正确，里面仍然存在重复的数据（用item_06到item_10做条件判断）。跪求高手解释下到底怎么回事！！！！！！！！！！！

...全文

1648 9 打赏收藏转发到动态举报

写回复

用AI写文章

9 条回复

切换为时间正序

请发表友善的回复…

发表回复

小灰狼W 2015-12-14

打赏
举报

Nice. 的确把空值问题忽略了

Xiao_Ma123456789 2015-12-14

打赏
举报

情况一：表内存在字段为null或者多个字段为null的情况，且有多条这样的重复数据存在。
更正一下前面的说法。总结一下，distinct和group by过滤数据时（多列），都适用于多列的值是否为null的情况，但删除数据时，不会处理null值为空的情况，无论是单个字段为null或是多个字段为null的情况。（前提是有多条同样字段为null的重复数据存在）
create table sales(
empid varchar(20), --雇员ID
depid number, --部门ID
area varchar(20), --区域
salenum number)--销售额

insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','2','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN',null,'EUR',null);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN',null,'EUR',null);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN',null,'EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN',null,'EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','EUR',null);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','EUR',null);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','1','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','1','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',1000);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',1000);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',800);

select distinct depid,salenum from sales

select depid,salenum from sales group by depid,salenum having count(*)>1

执行下面的去重复语句后
delete from sales a where (a.DEPID,a.SALENUM) in(select depid,salenum from sales group by depid,salenum having count(*)>1)
and rowid not in(select min(rowid) from sales group by depid,salenum having count(*)>1)
再查看sales表的所有数据后发现，无论单列为null值的数据或多列为null值的数据用group by 这条语句去重复都未
select * from sales （总表共17条数据，通过执行delete语句后，还剩10条数据）

因此，造成了distinct和group by数据上的区别。

情况二：表内容存在字段为null或者多个字段为null的情况，但有且只有一条这样的空数据存在。
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','2','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','ASIA',500);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN',null,'EUR',null);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN',null,'EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('ZHANGSAN','1','EUR',null);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','1','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','1','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',1000);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',1000);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',800);
insert into sales(EMPID,DEPID,AREA,SALENUM) values('LISI','2','EUR',800);

再来看一下distinct和group by的情况
select * from sales;

用group by过滤数据时，null值字段的数据压根没有被算进去
select depid,salenum from sales group by depid,salenum having count(*)>1

用distinct过滤数据
select distinct depid,salenum from sales

因此可想而知，用group by的语句删除重复数据时，必然会少删除了大量重复数据！

csliu_cool887 2015-12-14

打赏
举报

为什么我这样写会select 出来 null值呢？

with t as (
select '1' as id,'a' as name  from dual
union all
select '1' as id,'b' as name from dual
union all
select '2' as id ,'c' as name from dual
union all
select '' as id ,'d' as name from dual
union all
select '' as id,'d' as name from dual
union all
select '' as id,'' as name from dual


)
select distinct id,name name from t ;

Xiao_Ma123456789 2015-12-14

打赏
举报

引用 5 楼 ssqtjffcu1 的回复:

select distinct item_06,item_07,item_08,item_09,item_10 from temp  --这里distinct后不包括item_06,item_07,item_08,item_09,item_10都为null的数据哦，你的数据会不会存在这四列都为null的数据呢，所以造成跟select * from temp 5322条数据差距,同样，group by 中也不会包括这四列都为空的数据

大侠，我做了实验哦，distinct过滤的时候会把null值显示出来，无论是所有列数据都为null或者单个列的数据为null但是其他列有值的情况都会有值，但是group by就不行。它只会列出所有列都不为null值的数据。所以最后造成了大量的数据差异！最后还是要感谢大侠的提醒！

Xiao_Ma123456789 2015-12-13

打赏
举报

引用 3 楼 chenfeng1122 的回复:

应该是有字段数据是null

意思有null值的就会不正确么？

ssqtjffcu 2015-12-13

打赏
举报

select distinct item_06,item_07,item_08,item_09,item_10 from temp  --这里distinct后不包括item_06,item_07,item_08,item_09,item_10都为null的数据哦，你的数据会不会存在这四列都为null的数据呢，所以造成跟select * from temp 5322条数据差距,同样，group by 中也不会包括这四列都为空的数据

陈灬风 2015-12-11

打赏
举报

应该是有字段数据是null

Xiao_Ma123456789 2015-12-11

打赏
举报

回复1楼的大侠

大侠，这个语句执行完整好是4127条，右侧的数据全为空。浪费您的一片苦心了，麻烦再给解释具体点呗。

小灰狼W 2015-12-11

打赏
举报

从描述上没看出问题。但是，如果distinct 没有问题的话，重复的数据应该在4000条以上，那么试试这个吧

SELECT t.item_06,t.item_07,t.item_08,t.item_09,t.item_10,
x.item_06,x.item_07,x.item_08,x.item_09,x.item_10 from temp t,(
select a.item_06,a.item_07,a.item_08,a.item_09,a.item_10,a.rowid rid
from temp a where (a.item_06,a.item_07,a.item_08,a.item_09,a.item_10) in
(select item_06,item_07,item_08,item_09,item_10 from temp group bytem_06,item_07,item_08,item_09,item_10 having count(*)>1 ))x
WHERE t.rowid=x.rid(+)
ORDER BY 1,2,3,4,5;

将全表数据和“用group by 过滤数据” 进行比较，观察下数据有什么问题

变量定义，字段约束、等等，Select 、Where、Top、Group By 、Order By，Distinct、Declare等，详细可看图片和课程，本课程带有完整的SQL数据库课程课件，代码有注释解析等等，学习起来简单易懂，易上手直接操作，...

MySQL 中的 distinct 和 group by 的性能比较

sql中distinct和group by都可以用于去重复，distinct是将重复的行从结果中出去，group by是按指定的列分组。区别是group by 中可以包含聚合函数。如下所示的表结构，id为编号，name是姓名，class是班级，score是...

在数据库中，DISTINCT 和 GROUP BY 都是用来过滤重复数据的，但它们的原理和用法是不同的。 DISTINCT 是通过比较每一行数据，仅返回不重复的数据，因此如果数据量较大，DISTINCT 的效率较低。 GROUP BY 是通过分组，...

原创文章，希望多多关注...group by和distinct都能使用索引，效率相同。在语义相同，无索引的情况下： distinct效率高于group by。原因是distinct 和 group by都会进行分组操作，但group by可能会进行排序，触发fil