[High points offered] How can I use regular expressions to get all the links on a web page?

xieyj 2007-04-12 02:43:26
As the title says.
2 replies
蒋晟 2007-04-13
HTML is not a regular language. Like matching balanced parentheses, matching balanced HTML tags cannot be done with regular expressions (in CS terms, an NFA or DFA) alone. This is impossible in the provable sense, not merely difficult: a finite automaton has no way to count arbitrarily deep nesting.
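
A small illustration of the problem, sketched in Python. The sample strings and the helper name `first_div_body` are made up for this example. A pattern that looks right on flat markup picks the wrong closing tag as soon as elements nest:

```python
import re

# Hypothetical sample markup, not taken from any real page.
flat = "<div>a</div>"
nested = "<div>a<div>b</div>c</div>"

# Naive pattern: an opening tag, anything (as little as possible), a closing tag.
pattern = re.compile(r"<div>(.*?)</div>")

def first_div_body(html):
    """Return what the naive pattern believes is the body of the first <div>."""
    match = pattern.search(html)
    return match.group(1) if match else None

print(first_div_body(flat))    # a
print(first_div_body(nested))  # a<div>b  (stops at the inner </div>)
```

Making the quantifier greedy only moves the failure elsewhere: it would then run past the first closing tag when two sibling elements follow each other.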

You can write complex regexes that work on the sample documents you have, but you will never find a regex that is correct for all valid HTML, let alone invalid HTML. What you *can* do is use regular expressions to match individual opening and closing tags and use a stack to keep track of your depth. In CS terms, this is a pushdown automaton, or PDA.
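
The tag-plus-stack approach described above can be sketched in Python roughly as follows. This is only an illustration under simplifying assumptions: the names `A_TAG`, `HREF` and `extract_links` are invented here, attribute values are assumed to be quoted and to contain no `>`, and comments and scripts are not treated specially.

```python
import re

# Matches one opening <a ...> tag or one closing </a> tag at a time.
A_TAG = re.compile(r"<(/?)a\b([^>]*)>", re.IGNORECASE)
# Pulls a quoted href value out of an opening tag's attribute text.
HREF = re.compile(r"""\bhref\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)

def extract_links(html):
    """Return (url, inner_html) pairs for each <a href=...>...</a> element."""
    links = []
    stack = []  # one (url, offset just past the opening tag) entry per open <a>
    for m in A_TAG.finditer(html):
        if not m.group(1):
            # Opening tag: remember its URL and where its content starts.
            href = HREF.search(m.group(2))
            stack.append((href.group(2) if href else None, m.end()))
        elif stack:
            # Closing tag: it belongs to the most recently opened <a>.
            url, start = stack.pop()
            if url is not None:
                links.append((url, html[start:m.start()]))
    return links

page = '<p><a href="http://example.com/">Example</a> and <a href=\'/about\'>About <b>us</b></a></p>'
print(extract_links(page))
# [('http://example.com/', 'Example'), ('/about', 'About <b>us</b>')]
```

The regular expressions only ever look at a single tag; the nesting is handled by the stack, which is exactly the part a regex cannot do. Links are reported in the order their closing tags appear, so an inner element is emitted before the one that encloses it.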

The .NET regular expression engine does provide a way to imitate a simple PDA, through balancing groups (http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf).

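Here is a regular expression that matches balanced <a> tags, capturing the link target as "url" and the element content as "name". The doubled double quotes suggest it was copied from a C# verbatim string, where "" stands for a single quote character.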

(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)
xiaocai800322 2007-04-12