A very hard algorithm: extracting the part of a URL such as bbs.csdn.net

Uindex 2007-09-06 03:17:38

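For example, take this address: http://user:pass@www.163.com:8080/music/chs/index.aspx?mid=23081712

You can first delete http://user:pass@, then delete everything from :8080/ onward, which leaves www.163.com. In case that is not clear, the routine below spells it out: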

function GetDomainRoot(url: string): string;
// Returns the domain part of a URL
var p1:integer;
begin
     //http://user:pass@www.163.com:8080/music/chs/index.aspx?mid=23081712
     //    p1
     result:='';
     if LeastUrlRequest(url) then // validate the URL: checks for characters such as , { < >
     begin
        // strip the protocol
        p1:=pos('://',url);
        if (p1>0) then
           Delete(url,1,p1+2);
        p1:=pos('/',url);
        if (p1>1) then
        begin
           // host part: user:pass@www.163.com:8080
           url:=Copy(url,1,p1-1);
           if url<>'' then
           begin
              // strip the user name and password: www.163.com:8080
              p1:=pos('@',url);
              if (p1>0) then
                 Delete(url,1,p1);
              // strip the port: www.163.com
              Delete(url,pos(':',url),Length(url));
           end;
           if (pos('192.168.',url)>0)  or (NCpos('localhost',url)>0) or (Pos('.',url)<=0) then // skip LAN addresses
              result:=''
           else
              result:=url;
        end else if (length(url)>=3) and (url[1]<>'/') then // a.b: has the form of a domain or file name
            result:=url;
     end;
end;

This is the best algorithm I could come up with. Is there a better way? How else could it be improved?

Uindex 2007-09-16

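This way the problem is solved in a single loop, which is far smarter than the copy-and-delete approach. Thanks neweipeng, thanks 浪子阿鹏!!
I tested extracting from the URL string 1,000,000 times: the original code took over 250 ms, this version needs only 17 ms.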

很土 2007-09-14

pCh: PChar;

很土 2007-09-14

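Every url[i] access costs an extra level of indirection. I suggest using pCh := Pointer(url) and reading each character directly with pCh^; to move to the next character, just call Inc(pCh).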
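
A minimal sketch of the pointer walk described above. FirstDelimiter is a hypothetical helper name, not something from this thread. It uses PChar(s) instead of Pointer(s) on the assumption that an empty string should be handled too: Pointer(s) is nil for '', while PChar(s) always points at a #0 terminator.

```pascal
// Delphi syntax; with Free Pascal, compile in Delphi mode.
// Hypothetical helper: 1-based index of the first '/', ':' or '@' in s,
// or 0 if none of them occurs.
function FirstDelimiter(const s: string): Integer;
var
  start, pCh: PChar;
begin
  Result := 0;
  start := PChar(s);        // never nil, even when s = ''
  pCh := start;
  while pCh^ <> #0 do       // stop at the string's #0 terminator
  begin
    if pCh^ in ['/', ':', '@'] then
    begin
      Result := (pCh - start) + 1;   // pointer distance to 1-based index
      Exit;
    end;
    Inc(pCh);               // advance to the next character
  end;
end;
```

Because the loop stops at the first #0, a string with an embedded #0 would be cut short there.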

neweipeng 2007-09-12

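Functions such as Copy and Pos are presumably implemented with address (pointer) operations as well, so calling them several times may mean traversing the url string several times.
For the URL form the original poster gave, finding the characters /, : and @ is enough to extract the result.
So consider a single traversal with a character pointer. The routine is as follows: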
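function GetDomainRoot(url: string): string;
// Returns the domain part of a URL
var
  i: Integer;     // index into url
  pstr: PChar;    // start of the candidate domain
  flag: Boolean;  // True once a '.' has been seen
begin
  // http://user:pass@www.163.com:8080/music/chs/index.aspx?mid=23081712
  Result := '';
  if not LeastUrlRequest(url) then // validates the URL: checks for characters such as , { } < >
    Exit;
  // I could not find LeastUrlRequest in Delphi 7; it presumably scans url as well,
  // so its check could be merged into the loop below
  UniqueString(url);   // make the buffer private before taking pointers into it
  pstr := PChar(url);
  flag := False;
  i := 1;
  while i <= Length(url) do
  // a for loop here might be a better replacement for: if (length(url)>=3) and (url[1]<>'/') then
  // I do not quite understand what that line is for ^_^
  // bounded by Length(url); testing for the #0 terminator would be an alternative
  begin
    if url[i] = '.' then
      flag := True // a '.' means we are inside the domain, so stop moving the start
    else if (url[i] = '/') or (url[i] = '@') or (url[i] = ':') then
    begin
      if not flag then
        pstr := PChar(url) + i // keep moving the domain start to the character after the delimiter
      else
      begin
        url[i] := #0; // cut off the rest of url
        Break;
      end;
    end;
    Inc(i);
  end;
  Result := pstr; // copies up to the first #0
  if Pos('192.168.', Result) > 0 then // skip LAN addresses; the Pos('.',url)<=0 test is already covered by the loop
    Result := '';
end;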

ahjoe 2007-09-10

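It's simple: you just want the domain name or IP from the URL.
1. Find //, and delete it together with everything to its left
2. Find /, and delete it together with everything to its right
3. Find @; if found, delete it together with everything to its left
4. Find :; if found, delete it together with everything to its right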
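
The four steps above could be sketched roughly as follows. HostBySteps is a hypothetical name, and the sketch assumes the input has no '/' before the '//' and does no validation of its own.

```pascal
// Delphi syntax; with Free Pascal, compile in Delphi mode.
// Hypothetical sketch of the four steps, using only Pos, Delete and Copy.
function HostBySteps(url: string): string;
var
  p: Integer;
begin
  // 1. Drop "//" and everything to its left (the scheme).
  p := Pos('//', url);
  if p > 0 then
    Delete(url, 1, p + 1);
  // 2. Drop the first "/" and everything to its right (path and query).
  p := Pos('/', url);
  if p > 0 then
    url := Copy(url, 1, p - 1);
  // 3. Drop "@" and everything to its left (user name and password).
  p := Pos('@', url);
  if p > 0 then
    Delete(url, 1, p);
  // 4. Drop ":" and everything to its right (the port).
  p := Pos(':', url);
  if p > 0 then
    url := Copy(url, 1, p - 1);
  Result := url;
end;
```

The order matters: cutting the path in step 2 before looking for @ and : keeps an @ or : inside the path from being mistaken for part of the host section.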

很土 2007-09-10

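The original poster's routine calls Pos, Copy and Delete many times, and code like that will never be very efficient. Remember: use string concatenation, copying and similar functions sparingly.

Extracting the Host part of a URL is actually very easy: a single scan does it! Use a state machine.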
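
For illustration, here is one way such a single-scan state machine might look. This is only a sketch under assumptions, not the poster's actual code: the names TScanState and HostByScan are made up, the choice of states is just one possibility, and IPv6 literals or passwords containing '/' are not handled.

```pascal
// Delphi syntax; with Free Pascal, compile in Delphi mode.
type
  // ssStart: a scheme may still follow; ssHost: reading user info or host;
  // ssAfterColon: reading a password or a port; ssDone: host section ended.
  TScanState = (ssStart, ssHost, ssAfterColon, ssDone);

function HostByScan(const url: string): string;
var
  p, hostStart, hostEnd: PChar;
  state: TScanState;
begin
  p := PChar(url);          // never nil, even for ''
  hostStart := p;
  hostEnd := nil;           // nil means no tentative end has been seen yet
  state := ssStart;
  while state <> ssDone do
  begin
    case p^ of
      #0, '/', '?', '#':    // end of the host section
        begin
          if hostEnd = nil then
            hostEnd := p;
          state := ssDone;
        end;
      '@':                  // everything before this was user:pass
        begin
          hostStart := p + 1;
          hostEnd := nil;
          state := ssHost;
        end;
      ':':
        if (state = ssStart) and (p[1] = '/') and (p[2] = '/') then
        begin
          Inc(p, 2);        // skip "://"; the Inc below steps past the last '/'
          hostStart := p + 1;
          state := ssHost;
        end
        else if hostEnd = nil then
        begin
          hostEnd := p;     // tentative end: either ":pass@" or ":port" follows
          state := ssAfterColon;
        end;
    end;
    if state <> ssDone then
      Inc(p);
  end;
  SetString(Result, hostStart, hostEnd - hostStart);  // the only copy made
end;
```

Each character is examined once and the string is copied once at the very end, which is the point of the single-scan suggestion.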

ly_liuyang 2007-09-10

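I suggest doing this with the standard functions; if reliability matters, don't rely on a component's routines.
Even Copy, Pos and the like are still adequate unless the requirements are extremely demanding.
A few million calls a day? That is certainly fast enough.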

Uindex 2007-09-10

kyee: what is a state machine? Could you give a short demonstration?

Uindex 2007-09-08

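CSDN's editor has a problem and displays this incorrectly: p1 is at the @, and p2 is at the : in front of 8080. It looked right while editing, so why is it misaligned when displayed?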

Uindex 2007-09-08

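Regular expressions really are not an option here; regex is the slowest approach!

http://user:pass@www.163.com:8080/music/chs/index.aspx?mid=23081712
                p1          p2

To quickly get the string between p1 (the @ may be absent, in which case use ://) and p2 (which may also be absent, in which case use /), is there really no better way?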

Uindex 2007-09-07

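Would a regular expression be faster than this?
Below is how a regex engine searches a string. Could it serve as a reference for improving the routine?

Please help, this function is called upwards of a million times a day!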

function TRegExpr.ExecPrim (AOffset: integer) : boolean;
 procedure ClearMatchs;
  // Clears matchs array
  var i : integer;
  begin
   for i := 0 to NSUBEXP - 1 do begin
     startp [i] := nil;
     endp [i] := nil;
    end;
  end; { of procedure ClearMatchs;
..............................................................}
 function RegMatch (str : PRegExprChar) : boolean;
  // try match at specific point
  begin
   //###0.949 removed clearing of start\endp
   reginput := str;
   Result := MatchPrim (programm + REOpSz);
   if Result then begin
     startp [0] := str;
     endp [0] := reginput;
    end;
  end; { of function RegMatch
..............................................................}
 var
  s : PRegExprChar;
  StartPtr: PRegExprChar;
  InputLen : integer;
 begin
  Result := false; // Be paranoid...

  ClearMatchs; //###0.949
  // ensure that Match cleared either if optimization tricks or some error
  // will lead to leaving ExecPrim without actual search. That is
  // importent for ExecNext logic and so on.

  if not IsProgrammOk //###0.929
   then EXIT;

  // Check InputString presence
  if not Assigned (fInputString) then begin
    Error (reeNoInpitStringSpecified);
    EXIT;
   end;

  InputLen := length (fInputString);

  //Check that the start position is not negative
  if AOffset < 1 then begin
    Error (reeOffsetMustBeGreaterThen0);
    EXIT;
   end;
  // Check that the start position is not longer than the line
  // If so then exit with nothing found
  if AOffset > (InputLen + 1) // for matching empty string after last char.
   then EXIT;

  StartPtr := fInputString + AOffset - 1;

  // If there is a "must appear" string, look for it.
  if regmust <> nil then begin
    s := StartPtr;
    REPEAT
     s := StrScan (s, regmust [0]);
     if s <> nil then begin
       if StrLComp (s, regmust, regmlen) = 0
        then BREAK; // Found it.
       inc (s);
      end;
    UNTIL s = nil;
    if s = nil // Not present.
     then EXIT;
   end;

  // Mark beginning of line for ^ .
  fInputStart := fInputString;

  // Pointer to end of input stream - for
  // pascal-style string processing (may include #0)
  fInputEnd := fInputString + InputLen;

  {$IFDEF ComplexBraces}
  // no loops started
  LoopStackIdx := 0; //###0.925
  {$ENDIF}

  // Simplest case:  anchored match need be tried only once.
  if reganch <> #0 then begin
    Result := RegMatch (StartPtr);
    EXIT;
   end;

  // Messy cases:  unanchored match.
  s := StartPtr;
  if regstart <> #0 then // We know what char it must start with.
    REPEAT
     s := StrScan (s, regstart);
     if s <> nil then begin
       Result := RegMatch (s);
       if Result
        then EXIT
        else ClearMatchs; //###0.949
       inc (s);
      end;
    UNTIL s = nil
   else begin // We don't - general case.
     repeat //###0.948
       {$IFDEF UseFirstCharSet}
       if s^ in FirstCharSet
        then Result := RegMatch (s);
       {$ELSE}
       Result := RegMatch (s);
       {$ENDIF}
       if Result or (s^ = #0) // Exit on a match or after testing the end-of-string.
        then EXIT
        else ClearMatchs; //###0.949
       inc (s);
     until false;
(*  optimized and fixed by Martin Fuller - empty strings
    were not allowed to pass thru in UseFirstCharSet mode
     {$IFDEF UseFirstCharSet} //###0.929
     while s^ <> #0 do begin
       if s^ in FirstCharSet
        then Result := RegMatch (s);
       if Result
        then EXIT;
       inc (s);
      end;
     {$ELSE}
     REPEAT
      Result := RegMatch (s);
      if Result
       then EXIT;
      inc (s);
     UNTIL s^ = #0;
     {$ENDIF}
*)
    end;
  // Failure
 end; { of function TRegExpr.ExecPrim

walllacecn 2007-09-07

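This is just simple string extraction. An address only has so many characters, so how bad can the performance be?
If the address is not fixed (it might be SINA, it might be Sohu) but its structure is always WWW.XXX.COM, you can determine it directly with the Pos function.
There is no need to strip out the other characters.
var
  STR1, STR2, STR3: string;
begin
  STR1 := 'http://user:pass@www.163.com:8080/music/chs/index.aspx?mid=23081712';
  STR2 := 'www.163.com'; // the address used as the keyword (www.sina.com, for instance); for more complex addresses, use the keyword www.
  STR3 := 'http://';     // Pos is case-sensitive, so this must match the case used in STR1
  if (Pos(STR2, STR1) > 0) and (Pos(STR3, STR1) <> 0) then
    STR1 := STR2; // with several possible addresses, use a regular expression (I forget how to write it) to take what lies between WWW. and .XXX
end;

I think what really matters is keeping the program short: use more control-flow logic and fewer function calls. Overly complex functions surely won't run well.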