Bounty for an algorithm faster than mine, haha

沙老师 2010-02-10 06:58:30

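Here is the problem: given a set of strings and a file, remove every line of the file that contains at least one of those strings (case-insensitive). It looks simple, but when the string set and the input file are both large, the conventional approach runs into serious performance problems.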

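If you are interested, download data.zip from http://www.rayfile.com/files/ecb5d0d7-1630-11df-904f-0015c55db73d/. In it, strings.txt holds the strings, one per line; in.txt is the file to process; out.txt is my processed output, provided for comparison.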

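I wrote my code in C++/STL and compiled it with VC 7.1 with all optimizations enabled. On my laptop with a 1.6 GHz P4 Celeron it averages about 3.64 seconds (VC's STL is quite well done; with GCC it takes over 6 seconds, roughly double). Considering only the CPU factor, 3.64/1.6 = 2.3.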

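Everyone, give it a try. Any language, any algorithm. The full 100 points go to the first person who beats my time. Of course, you have to post your code, and then I will post mine as well, haha.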

BTW: you can look at how grep is implemented on UNIX. I tried the Windows port of grep and it finished in the blink of an eye!


沙老师 2010-02-12


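I wrote a number of Lua scripts: from the most primitive double loop at over 80 seconds, to hashing at 4 seconds, to the AC algorithm at under 1 second. The power of algorithms really should not be underestimated, haha.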
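For reference, the "double loop" starting point mentioned above can be sketched in C++ as follows. This is only a baseline under stated assumptions, not the script that was actually timed: the function names (`to_lower_copy`, `contains_any_naive`, `filter_lines`) are made up for illustration, and the patterns are assumed to be lowercased already.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lowercase a copy of s byte by byte; going through unsigned char keeps
// tolower well-defined for non-ASCII bytes.
std::string to_lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Baseline: one find() per pattern for every line.
// `patterns` must already be lowercased (assumption of this sketch).
bool contains_any_naive(const std::string& line,
                        const std::vector<std::string>& patterns) {
    const std::string lowered = to_lower_copy(line);
    for (const std::string& p : patterns) {
        if (!p.empty() && lowered.find(p) != std::string::npos) return true;
    }
    return false;
}

// Keep only the lines that contain none of the patterns.
std::vector<std::string> filter_lines(const std::vector<std::string>& lines,
                                      const std::vector<std::string>& patterns) {
    std::vector<std::string> kept;
    for (const std::string& line : lines) {
        if (!contains_any_naive(line, patterns)) kept.push_back(line);
    }
    return kept;
}
```

The cost grows with (number of lines) x (number of patterns), which is why the hashing and AC versions pull so far ahead on a large strings.txt.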

沙老师 2010-02-12


Thanks, everyone. I learned a lot of "pattern matching" algorithms and found that they really are impressive!

As promised, here is my ugly code. It uses a hashing-style lookup; please don't laugh.



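#include <time.h>
#include <stdlib.h>
#include <ctype.h>

#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <algorithm>

using namespace std;

#define ASSERT(T) if (!(T)) {exit(-1);}

// Lowercased pattern strings. std::set is an ordered tree, so each lookup
// costs O(log n) string comparisons.
set<string> matchstrs;
// Length of the longest pattern
size_t maxmatchstrlen = 0;

// Lowercase one byte; the unsigned char cast keeps tolower well-defined
// for non-ASCII bytes such as those of Chinese text
char lower_char(char c)
{
    return (char)tolower((unsigned char)c);
}

// Load the pattern strings
void init_matchstrs()
{
    ifstream f("strings.txt");
    ASSERT(f);
    string s;
    while (getline(f, s))
    {
        transform(s.begin(), s.end(), s.begin(), lower_char);
        matchstrs.insert(s);
        size_t len = s.length();
        if (len > maxmatchstrlen)
        {
            maxmatchstrlen = len;
        }
    }
    f.close();
}

// Check whether the line contains any pattern
bool match(string str)
{
    // Convert to lowercase (str is a copy, so the caller's line is untouched)
    transform(str.begin(), str.end(), str.begin(), lower_char);
    size_t len = str.length();
    for (size_t offset = 0; offset < len; offset++)
    {
        // Try every substring that starts at offset and is no longer than the longest pattern
        for (size_t count = 1; count <= min(maxmatchstrlen, len - offset); count++)
        {
            string sub_s = str.substr(offset, count);
            if (matchstrs.find(sub_s) != matchstrs.end())
            {
                return true;
            }
        }
    }
    return false;
}

int main()
{
    // Load the pattern strings
    init_matchstrs();

    ifstream fin("in.txt");
    ofstream fout("out.txt");
    ASSERT(fin);
    ASSERT(fout);

    string line;
    while (getline(fin, line))
    {
        if (!match(line))
        {
            // '\n' instead of endl avoids flushing the stream after every line
            fout << line << '\n';
        }
    }

    fin.close();
    fout.close();

    cout << "Elapsed: " << clock() * 1.0 / CLOCKS_PER_SEC << " s" << endl;

    return 0;
}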
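One way the substring-lookup idea above might be tightened, offered as a sketch rather than as what was actually measured in this thread: probe only the pattern lengths that really occur, and look substrings up through `std::string_view` in a real hash table so that no temporary strings are allocated. The class name `HashedPatterns` is hypothetical, C++17 is assumed, and both patterns and lines are assumed to be lowercased before they get here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>
#include <utility>
#include <vector>

class HashedPatterns {
public:
    // `patterns` must already be lowercased; empty patterns are ignored.
    explicit HashedPatterns(std::vector<std::string> patterns)
        : storage_(std::move(patterns)) {
        for (const std::string& p : storage_) {
            if (p.empty()) continue;
            index_.insert(std::string_view(p));
            lengths_.push_back(p.size());
        }
        std::sort(lengths_.begin(), lengths_.end());
        lengths_.erase(std::unique(lengths_.begin(), lengths_.end()), lengths_.end());
    }

    // The index holds views into storage_, so copying is disabled.
    HashedPatterns(const HashedPatterns&) = delete;
    HashedPatterns& operator=(const HashedPatterns&) = delete;

    // `line` must already be lowercased.
    bool contains_any(std::string_view line) const {
        for (std::size_t offset = 0; offset < line.size(); ++offset) {
            const std::size_t remaining = line.size() - offset;
            for (std::size_t len : lengths_) {
                if (len > remaining) break;  // lengths_ is sorted ascending
                if (index_.count(line.substr(offset, len)) != 0) return true;
            }
        }
        return false;
    }

private:
    std::vector<std::string> storage_;            // owns the pattern bytes
    std::unordered_set<std::string_view> index_;  // views into storage_
    std::vector<std::size_t> lengths_;            // distinct pattern lengths, ascending
};
```

Whether this beats the `std::set` version on the thread's data is untested here; it still does work proportional to (line length) x (number of distinct pattern lengths), so an automaton-based matcher should remain faster.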

Later I implemented the AC algorithm as a Lua script, and it is incredibly fast!



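--[[
Multi-pattern matching with the Aho-Corasick algorithm; the run time stays under 1 second
]]

-- Initialize the root
root = {};
root.value = 0;
root.failure = root;

-- Insert one pattern string into the trie
function insert(pattern)
	-- Insert byte by byte
	local p = root;
	for pos = 1, pattern:len() do
		local byte = pattern:byte(pos);
		if not p[byte] then
			-- Create the node
			p[byte] = {};
			p[byte].value = byte;
			-- Parent node
			p[byte].parent = p;
		end;
		p = p[byte];
	end;
	-- Skip empty patterns so that the root never counts as a match
	if p ~= root then
		p.matched = true;
	end;
end;

-- Build the failure links breadth-first, after every pattern has been inserted.
-- (Creating them during insertion does not work: the target node may not exist yet.)
function makefailures()
	local queue = {root};
	local head = 1;
	while queue[head] do
		local node = queue[head];
		head = head + 1;
		for byte = 0, 255 do
			local child = node[byte];
			if child then
				if node == root then
					child.failure = root;
				else
					-- Follow the parent's failure chain until some node has an edge for this byte
					local p = node.failure;
					while p ~= root and not p[byte] do
						p = p.failure;
					end;
					child.failure = p[byte] or root;
				end;
				-- A node also matches when a shorter pattern ends at its failure target
				if child.failure.matched then
					child.matched = true;
				end;
				queue[#queue + 1] = child;
			end;
		end;
	end;
end;

-- Build the tree
function init(patternfile)
	-- Read the pattern strings line by line
	for pattern in io.lines(patternfile) do
		-- Insert the string
		insert(pattern:lower());
	end;
	makefailures();
end;

-- Check whether the string contains any pattern
function match(str)
	local p = root;
	for pos = 1, str:len() do
		local byte = str:byte(pos);
		-- On a mismatch, fall back along the failure links
		while p ~= root and not p[byte] do
			p = p.failure;
		end;
		-- Consume the byte if there is an edge for it; otherwise stay at the root
		p = p[byte] or root;

		if p.matched then
			return true;
		end;
	end;
	return false;
end;

-- Function: convert one file
function convert(inname, outname)
	io.write("Converting: \"", inname, "\" -> \"", outname, "\"\n");

	-- Open the output file
	local fileout = assert(io.open(outname, "w"));

	-- Read the input file line by line
	for line in io.lines(inname) do
		-- Write the line to the output file if it contains no pattern
		if not match(line:lower()) then
			fileout:write(line, "\n");
		end;
	end;

	-- Close the output file (io.lines closes the input file by itself)
	fileout:close();
end;

starttime = os.time();
print("Initializing...");
init("matchstr.txt");
convert("in\\DOSNET.INF", "out\\DOSNET.INF");
convert("in\\DRVINDEX.INF", "out\\DRVINDEX.INF");
convert("in\\TXTSETUP.SIF", "out\\TXTSETUP.SIF");
endtime = os.time();
io.write("Elapsed: ", os.difftime(endtime, starttime), " s\n");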

subenly 2010-02-11


Read it, though I don't really understand it.

saramand9 2010-02-11






#include <iostream>
#include <string>
#include <deque>
#include <cstdio>
#include <cstring>
#include <time.h>
using namespace std;

// Offset added to every byte so that negative char values (e.g. the bytes of
// Chinese text) still give a non-negative index. Assumes char is signed.
int step = 250;

struct trieTree
{
    trieTree * next[400];   // one child slot per shifted byte value
    int n;                  // non-zero when a pattern ends here or on the fail chain
    trieTree * fail;        // failure link

    trieTree() {
        memset(next, 0, sizeof(next));
        n = 0;
        fail = 0;
    }
};

trieTree * u;
deque<trieTree*> dq;

bool isLetter(char c)
{
    return c >= 'a' && c <= 'z';
}

// Map a byte to its table index, folding lowercase letters to uppercase
int turnLetter(char c)
{
    if (!isLetter(c))
        return c + step;
    else
        return c + step - 32;
}

// Build the failure links breadth-first
void Fail(trieTree * root) {
    root->fail = root;
    dq.clear();
    dq.push_back(root);

    int i;
    while (!dq.empty()) {
        u = dq.front(); dq.pop_front();

        for (i = 0; i < 400; i++) {
            if (!u->next[i]) continue;

            if (u == root) {
                u->next[i]->fail = root;
            }
            else {
                trieTree * tmp = u;
                while (!tmp->fail->next[i]) {
                    tmp = tmp->fail;
                    if (tmp == root) break;
                }
                if (tmp != root)
                    u->next[i]->fail = tmp->fail->next[i];
                else
                    u->next[i]->fail = root;
            }
            // A pattern that ends at the fail target also ends here, as a suffix
            u->next[i]->n += u->next[i]->fail->n;
            dq.push_back(u->next[i]);
        }
    }
}

// Return true if text contains at least one pattern
bool run(trieTree * root, char * text) {
    trieTree * p = root;
    int i, j;
    for (i = 0; text[i]; i++) {
        j = turnLetter(text[i]);
        while (!p->next[j] && p != root) p = p->fail;
        p = p->next[j];
        if (!p) p = root;
        if (p != root && p->n != 0) return true;
    }
    return false;
}

// Insert one pattern into the trie
void insert(char key[], trieTree * root) {
    int i = 0, j = 0;
    for (i = 0; key[i]; i++) {
        j = turnLetter(key[i]);
        if (!root->next[j])
            root->next[j] = new trieTree();
        root = root->next[j];
    }
    root->n++;
}

char key[550];
char text[10000];

int main()
{
    trieTree * root = new trieTree();
    freopen("strings.txt", "r", stdin);
    // fgets instead of gets: gets has no bounds check and is no longer part of the language
    while (fgets(key, sizeof(key), stdin)) {
        key[strcspn(key, "\r\n")] = 0;
        if (strlen(key) < 1) continue;
        insert(key, root);
    }
    Fail(root);
    freopen("in.txt", "r", stdin);
    freopen("myOut.txt", "w", stdout);

    while (fgets(text, sizeof(text), stdin)) {
        text[strcspn(text, "\r\n")] = 0;
        if (!run(root, text))
            puts(text);
    }

    // stdout now points at myOut.txt, so report the time on stderr
    fprintf(stderr, "Elapsed: %lf s\n", clock() * 1.0 / CLOCKS_PER_SEC);
    return 0;
}

Fixed. The output now matches the one LZ (the original poster) provided.

mLee79 2010-02-11


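I compared against your final result: yours has a lot of extra blank lines, but otherwise it is the same. Is the \r\n supposed to be kept after a line is deleted...?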

mLee79 2010-02-11


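LZ's result is wrong. For example, I did a quick search for "cq90alg6.out": it is in strings.txt, but the lines containing it do not seem to have been removed....
I wrote one and put it here: http://x4c.googlecode.com/svn/trunk/x4c/build/win32/xdl/xdl.c
On my beat-up DELL laptop it runs in about 0.1 seconds, of which 0.08 seconds go to building the finite automaton and 0.02 seconds to the scan....

Please close the thread...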

mLee79 2010-02-11


Personally, I think this cannot take more than 1 second however you run it. I'm reserving your points ....

日立奔腾浪潮微软松下联想 2010-02-11


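Does your time include reading the input file and writing the output?
Also, I suggest the OP post his own code first so that we can compare properly, instead of handing people an exe to run. :)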

urakvv7 2010-02-10


Read. [There, the reply is long enough now]

绿色夹克衫 2010-02-10


This looks like fairly standard multi-pattern matching; with a trie or a suffix tree it can be done in linear time.
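To make the trie-based, linear-time idea concrete, here is a compact Aho-Corasick sketch. It is an illustration under assumptions, not any poster's actual code: the class name `AhoCorasick` and its methods are invented for this example, bytes are matched as-is (lowercase patterns and text beforehand for case-insensitive use), and a full 256-entry transition table per node trades memory for simplicity.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

class AhoCorasick {
public:
    AhoCorasick() : nodes_(1) {}

    // Add one pattern to the trie. Empty patterns are ignored.
    void add(const std::string& pattern) {
        if (pattern.empty()) return;
        int cur = 0;
        for (unsigned char c : pattern) {
            if (nodes_[cur].next[c] < 0) {
                nodes_[cur].next[c] = static_cast<int>(nodes_.size());
                nodes_.emplace_back();
            }
            cur = nodes_[cur].next[c];
        }
        nodes_[cur].terminal = true;
    }

    // Build failure links breadth-first and fill in every missing transition,
    // so that scanning needs exactly one table lookup per input byte.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            const int child = nodes_[0].next[c];
            if (child < 0) {
                nodes_[0].next[c] = 0;
            } else {
                nodes_[child].fail = 0;
                q.push(child);
            }
        }
        while (!q.empty()) {
            const int u = q.front();
            q.pop();
            // A pattern ending at the failure target is a suffix of what was read.
            if (nodes_[nodes_[u].fail].terminal) nodes_[u].terminal = true;
            for (int c = 0; c < 256; ++c) {
                const int child = nodes_[u].next[c];
                const int via_fail = nodes_[nodes_[u].fail].next[c];
                if (child < 0) {
                    nodes_[u].next[c] = via_fail;
                } else {
                    nodes_[child].fail = via_fail;
                    q.push(child);
                }
            }
        }
    }

    // True if text contains at least one pattern. build() must be called first.
    bool contains_any(const std::string& text) const {
        int cur = 0;
        for (unsigned char c : text) {
            cur = nodes_[cur].next[c];
            if (nodes_[cur].terminal) return true;
        }
        return false;
    }

private:
    struct Node {
        std::array<int, 256> next;
        int fail = 0;
        bool terminal = false;
        Node() { next.fill(-1); }
    };
    std::vector<Node> nodes_;
};
```

The detail that is easy to miss is the `terminal` propagation in `build()`: without it, a short pattern such as "bc" goes undetected while the automaton is walking down a longer branch such as "abcd".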

沙老师 2010-02-10


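I tested it and found two problems: 1. it is case-sensitive; 2. the Chinese text issue still has to be solved.

You can use WinDiff or WinMerge to inspect the differences between the output files.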
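Both problems can be handled at the point where a byte is turned into a table index. The sketch below shows one way, with hypothetical helper names (`byte_index`, `fold_ascii_case`): cast to `unsigned char` so that the bytes of Chinese text land in 128..255 instead of going negative, and fold only ASCII letters.

```cpp
#include <string>

// Map one byte to a table index in [0, 255], folding ASCII 'A'..'Z' to
// lowercase. Bytes >= 0x80 (for example the bytes of encoded Chinese text)
// pass through unchanged.
inline int byte_index(char c) {
    unsigned char u = static_cast<unsigned char>(c);
    if (u >= 'A' && u <= 'Z') u = static_cast<unsigned char>(u - 'A' + 'a');
    return u;
}

// Apply the same folding to a whole string, e.g. to a pattern before insertion.
std::string fold_ascii_case(const std::string& s) {
    std::string out(s);
    for (char& c : out) c = static_cast<char>(byte_index(c));
    return out;
}
```

A trie node then needs 256 child slots instead of 128. One caveat, stated as an assumption about the data: in a double-byte encoding such as GBK the second byte of a character can fall in the ASCII letter range, so byte-wise folding may also alter such bytes. Because patterns and text are folded the same way this rarely matters, but it is not strictly exact.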

saramand9 2010-02-10




#include <iostream>
#include <string>
#include <deque>
#include <cstdio>
#include <cstring>
#include <time.h>
using namespace std;

// Note: this version is case-sensitive and assumes 7-bit ASCII input;
// a byte >= 128 would index outside next[].
struct trieTree
{
    trieTree * next[128];   // one child slot per ASCII byte
    int n;                  // non-zero when a pattern ends here or on the fail chain
    trieTree * fail;        // failure link

    trieTree() {
        memset(next, 0, sizeof(next));
        n = 0;
        fail = 0;
    }
};

trieTree * u;
deque<trieTree*> dq;

// Build the failure links breadth-first
void Fail(trieTree * root) {
    root->fail = root;
    dq.clear();
    dq.push_back(root);

    int i;
    while (!dq.empty()) {
        u = dq.front(); dq.pop_front();

        for (i = 0; i < 128; i++) {
            if (!u->next[i]) continue;

            if (u == root) {
                u->next[i]->fail = root;
            }
            else {
                trieTree * tmp = u;
                while (!tmp->fail->next[i]) {
                    tmp = tmp->fail;
                    if (tmp == root) break;
                }
                if (tmp != root)
                    u->next[i]->fail = tmp->fail->next[i];
                else
                    u->next[i]->fail = root;
            }
            // A pattern that ends at the fail target also ends here, as a suffix
            u->next[i]->n += u->next[i]->fail->n;
            dq.push_back(u->next[i]);
        }
    }
}

// Return true if text contains at least one pattern
bool run(trieTree * root, char * text) {
    trieTree * p = root;
    int i, j;
    for (i = 0; text[i]; i++) {
        j = text[i];
        while (!p->next[j] && p != root) p = p->fail;
        p = p->next[j];
        if (!p) p = root;
        if (p != root && p->n != 0) return true;
    }
    return false;
}

// Insert one pattern into the trie
void insert(char key[], trieTree * root) {
    int i = 0, j = 0;
    for (i = 0; key[i]; i++) {
        j = key[i];
        if (!root->next[j])
            root->next[j] = new trieTree();
        root = root->next[j];
    }
    root->n++;
}

char key[550];
char text[10000];

int main()
{
    trieTree * root = new trieTree();
    freopen("strings.txt", "r", stdin);
    // fgets instead of gets: gets has no bounds check and is no longer part of the language
    while (fgets(key, sizeof(key), stdin)) {
        key[strcspn(key, "\r\n")] = 0;
        if (strlen(key) < 1) continue;
        insert(key, root);
    }
    Fail(root);
    freopen("in.txt", "r", stdin);
    freopen("myOut.txt", "w", stdout);

    while (fgets(text, sizeof(text), stdin)) {
        text[strcspn(text, "\r\n")] = 0;
        if (!run(root, text))
            puts(text);
    }

    // stdout now points at myOut.txt, so report the time on stderr
    fprintf(stderr, "Elapsed: %lf s\n", clock() * 1.0 / CLOCKS_PER_SEC);
    return 0;
}

The file sizes look about the same. LZ, could you strip the Chinese text and run the data again? - -||
The above is my code.

saramand9 2010-02-10


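Ha, finished. But I quietly removed the Chinese text first.. The answer I get seems a bit different from yours..
It runs in under 1 second, though I have 2 GB of RAM and a P8700 CPU.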

沙老师 2010-02-10


This is the complete program plus the files; you can download it and test it yourself: http://www.rayfile.com/files/706f2cc0-1648-11df-9ec2-0015c55db73d/

saramand9 2010-02-10


....I debugged for ages and then suddenly realized that the IN file contains Chinese text - -||

沙老师 2010-02-10


This is the output file produced after the fix: http://www.rayfile.com/files/0cc32f4c-1648-11df-ab5e-0015c55db73d/

沙老师 2010-02-10


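I got one spot wrong: it should have been <=, but I wrote <. After the fix it is actually faster, though: 3.2 seconds averaged over 5 runs, hehe.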

saramand9 2010-02-10


LZ ....
The entry "3155pcl.gpd" appears in your strings file.. so why does it still show up in your out file? Did I misunderstand the problem?

saramand9 2010-02-10


Hehe.. I'll go have a play with KMP..

沙老师 2010-02-10


By the way, this is how to measure a program's running time in C++:



#include <time.h>
...
int main()
{
    ...
    ...
    // Print the elapsed time like this at the very end of the program
    cout << "Elapsed: " << clock() * 1.0 / CLOCKS_PER_SEC << " s" << endl;
    return 0;
}

Likewise, in C you only need to replace cout with printf.
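The clock() call above gives a single figure for the whole run. Since the thread also asks whether file I/O is included and how the time splits between building the automaton and scanning, a per-phase timer can help. The helper below is a sketch that assumes a C++11 compiler; the name `seconds_taken` is made up for illustration.

```cpp
#include <chrono>

// Run `work` once and return the elapsed wall-clock time in seconds.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Possible use, with the phase bodies left to the program being measured:
//   double build_s = seconds_taken([&] { /* load strings.txt, build the matcher */ });
//   double scan_s  = seconds_taken([&] { /* read in.txt, filter, write out.txt */ });
```

Reporting the phases separately makes results from different machines and programs easier to compare than one total.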