Comparing articles for similarity is already very slow at 1000 articles. Is there something wrong with my program?

keaizhong 2006-11-25 10:25:51

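// Check for duplicated content
$content = $repeat = array();
$num = 0;
$dir = "cs/";

if (is_dir($dir)) {
    if ($dh = opendir($dir)) {
        while (($file = readdir($dh)) !== false) {
            // Skip subdirectories (this also skips "." and "..")
            if (is_dir($dir.$file)) continue;
            $f = file($dir.$file);
            if ($f === false) continue; // unreadable file
            // Drop the first three lines of each file
            unset($f[0]);
            unset($f[1]);
            unset($f[2]);
            // Strip tags and whitespace so that only the text itself is compared
            $text = str_replace(array(" ", "\n", "\t"), "", trim(strip_tags(join("", $f))));
            $lenText = strlen($text);
            // Skip empty texts; they would cause a division by zero below
            if ($lenText == 0) continue;
            //echo $text;exit;

            // Compare against every distinct article kept so far
            foreach ($content as $key => $val) {
                $similar = similar_text($val, $text);
                if ($similar / $lenText > 0.9) {
                    $repeat[$key][] = $file;
                    $num++;
                    continue 2;
                }
            }
            $content[$file] = $text;
        }
        closedir($dh);
    }
}
echo "Repeat:".$num." ";
echo "content:".count($content);
print_r($repeat);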
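
One likely reason for the slowdown: every new file is compared against every text kept so far, so 1000 files can mean up to roughly half a million similar_text() calls, and the PHP manual gives the complexity of that function as O(N**3) in the string length. Below is a minimal sketch of a cheap pre-filter, assuming the same 0.9 rule as the code above. findDuplicate is a hypothetical helper name, not something from the original post. It relies on the fact that similar_text() can never return more than the length of the shorter string, so a stored text whose length is at most 0.9 times the new text's length cannot pass and can be skipped without calling it.

```php
<?php
// Hypothetical helper (not in the original post): returns the key of the first
// stored text that matches $text under the original rule, or null if none does.
function findDuplicate(array $content, string $text, float $threshold = 0.9): ?string
{
    $lenText = strlen($text);
    if ($lenText === 0) {
        return null; // nothing to compare, and this avoids a division by zero
    }
    foreach ($content as $key => $val) {
        // similar_text() <= strlen($val), so a candidate this short can never
        // exceed the threshold: skip the expensive call entirely.
        if (strlen($val) <= $threshold * $lenText) {
            continue;
        }
        if (similar_text($val, $text) / $lenText > $threshold) {
            return (string) $key;
        }
    }
    return null;
}
```

The filter only helps when article lengths vary a lot; it does not change which files are reported as duplicates.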

14 replies

懒得去死 2006-12-05

Strongly agree with 唠叨. It really doesn't seem to have much point.

iasky 2006-12-05

mark

超级大笨狼 2006-12-05

SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO

-- Similarity of two short strings: the length of their longest common
-- substring (two characters or more) as a percentage of the longer string.
CREATE function get_semblance_By_2words
(
    @word1 varchar(50),
    @word2 varchar(50)
)
returns int
as
begin
    declare @re int
    declare @maxLenth int
    declare @i int,@l int
    declare @tb1 table(child varchar(50))
    declare @tb2 table(child varchar(50))
    set @i=1
    set @l=2
    set @maxLenth=len(@word1)
    if len(@word1)<len(@word2)
    begin
        set @maxLenth=len(@word2)
    end
    -- Collect every substring of @word1 of length @l (2 and up)
    while @l<=len(@word1)
    begin
        -- The last valid start position for length @l is len - @l + 1
        while @i<=len(@word1)-@l+1
        begin
            insert @tb1 (child) values( SUBSTRING(@word1,@i,@l) )
            set @i=@i+1
        end
        set @i=1
        set @l=@l+1
    end

    set @i=1
    set @l=2

    -- Same for @word2
    while @l<=len(@word2)
    begin
        while @i<=len(@word2)-@l+1
        begin
            insert @tb2 (child) values( SUBSTRING(@word2,@i,@l) )
            set @i=@i+1
        end
        set @i=1
        set @l=@l+1
    end

    -- Longest substring present in both tables, as a percentage of the longer input
    select @re=isnull(max( len(a.child)*100/ @maxLenth ) ,0) from @tb1 a, @tb2 b where a.child=b.child
    return @re
end

GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
-- Test
--select dbo.get_semblance_By_2words('我是谁','我是谁啊')
-- 75
-- (the similarity, in percent)
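
For readers following the PHP side of the thread, the same measure can be sketched in PHP without building tables of substrings. This is only an illustration under stated assumptions: semblance is a hypothetical name, the inputs are assumed to be UTF-8, and the longest common substring is found with a standard dynamic-programming pass rather than the join used in the SQL above.

```php
<?php
// Hypothetical PHP counterpart of the SQL function above: longest common
// substring length * 100 / length of the longer string (integer division).
function semblance(string $word1, string $word2): int
{
    // Split into UTF-8 characters so multi-byte text is counted per character.
    // preg_split() returns false on invalid UTF-8; treat that as an empty string.
    $a = preg_split('//u', $word1, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $b = preg_split('//u', $word2, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $maxLen = max(count($a), count($b));
    if ($maxLen === 0) {
        return 0;
    }
    $best = 0;
    $prev = array_fill(0, count($b) + 1, 0);
    foreach ($a as $ca) {
        $cur = [0];
        foreach ($b as $j => $cb) {
            // Length of the common run that ends at this pair of characters
            $cur[$j + 1] = ($ca === $cb) ? $prev[$j] + 1 : 0;
            $best = max($best, $cur[$j + 1]);
        }
        $prev = $cur;
    }
    // Like the SQL version, ignore common substrings shorter than two characters.
    return $best >= 2 ? intdiv($best * 100, $maxLen) : 0;
}
```

With the test strings from the SQL comment, semblance('我是谁', '我是谁啊') gives 75, since the longest common substring has 3 characters and the longer string has 4.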

visam168 2006-12-05

Bumping this to help.

keaizhong 2006-11-28

Is nobody replying to me anymore? If you have any ideas, let's keep discussing.

keaizhong 2006-11-27

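I don't know whether it is meaningful, but the function exists, so I might as well use it.

At the very least it can compare how similar two articles are as text. It isn't meant to compare what a sentence means.

I have crawled a large number of articles from the web and want to compare them against each other to find the duplicates. It's as simple as that.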
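
For this kind of duplicate hunt over crawled articles, one common alternative to calling similar_text() on every pair is to compare sets of short substrings ("shingles") with the Jaccard coefficient. The sketch below is a different technique from the one in the original code, not a drop-in replacement: shingles and jaccard are hypothetical helper names, the shingle length of 4 bytes is an arbitrary assumption, and a threshold for "duplicate" would have to be tuned on real data.

```php
<?php
// Hypothetical helper: the set of distinct byte n-grams ("shingles") of a string.
// Works on bytes, which is good enough for estimating overlap in any encoding.
function shingles(string $text, int $n = 4): array
{
    $set = [];
    $last = strlen($text) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $set[substr($text, $i, $n)] = true; // array keys act as a set
    }
    return $set;
}

// Jaccard similarity |A ∩ B| / |A ∪ B|, between 0.0 and 1.0.
function jaccard(array $a, array $b): float
{
    $union = count($a + $b); // array union keeps each key once
    if ($union === 0) {
        return 0.0;
    }
    return count(array_intersect_key($a, $b)) / $union;
}
```

The shingle set of each article only needs to be computed once, after which each pairwise comparison is a hash-key intersection instead of a full string comparison.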

li1229363 2006-11-27

Strongly agree with 唠叨. It really doesn't seem to have much point.

xuzuning 2006-11-27

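int similar_text ( string first, string second [, float &percent] )
The percent argument directly reflects the degree of similarity.

I don't see what practical use computing this has:
echo similar_text('检查内容重复情况', '检查内容不重复情况', $r); // 16, the number of matching bytes
echo $r; // 94.1176470588, the similarity of the two strings as a percentage
Yet the two strings mean opposite things: the first reads "check for duplicated content", the second "check for non-duplicated content".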
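
To make the relation between the return value and percent concrete, here is a small sketch using ASCII strings, so the byte counts do not depend on the file encoding. The strings are illustrative assumptions and not from the thread. The percentage is the matching length doubled, divided by the combined length of both strings, times 100.

```php
<?php
$a = 'World';
$b = 'Word';

// Return value: number of matching bytes; $percent is filled in by reference.
$sim = similar_text($a, $b, $percent);

// The same percentage computed by hand from the return value.
$expected = $sim * 2 / (strlen($a) + strlen($b)) * 100;

echo $sim, "\n";      // 4 ("Wor" plus "d")
echo $percent, "\n";  // about 88.89
```

Note that this percentage is normalised by both lengths, whereas the original question divides the return value by the length of the new text only, so the two ratios are not interchangeable.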