Using regular expressions in Python to find hyperlinks in web pages (Crawler)

mercury1231 2004-10-07 11:22:48

The requirements are as follows:

(a) A hyperlink tag has the form <a() href(s)=(s)url ()>, where (s) is whitespace and ()
stands for other attributes. Note that

i. () and (s) may be absent.
ii. a and href are case-insensitive.
iii. A tag can span multiple lines.
iv. Ignore hyperlink tags that are inside comment tags (<!-- -->).

Samples of hyperlink tag:
• <a href= www.company.com>
• <A title="Link to homepage"
  href="http://www.company.com/index.html">

(b) URLs that differ only in the fragment (i.e. the part that follows #) should be considered the same page. For example,
http://www.cuhk.edu.hk/index.html and http://www.cuhk.edu.hk/index.html#people
are the same page.

(c) Only HTTP URLs need to be considered.

(d) The initial link is an HTTP URL.
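
Requirement (b) can be illustrated with a short sketch. It assumes Python 3; the helper name strip_fragment is hypothetical and not part of the assignment. The regex version stays within the "RE only" constraint, and urldefrag from the standard library is shown only as a cross-check.

```python
import re
from urllib.parse import urldefrag

def strip_fragment(url):
    """Return url without its '#fragment' part, using a regex only."""
    return re.sub(r'#.*$', '', url)

a = "http://www.cuhk.edu.hk/index.html"
b = "http://www.cuhk.edu.hk/index.html#people"

# Both URLs normalize to the same page address.
print(strip_fragment(a) == strip_fragment(b))

# The standard library performs the same normalization.
print(urldefrag(b).url == a)
```

Comparing the normalized strings (for example by storing them in a set of visited pages) is one way to avoid crawling the same page twice.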

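This is giving me a real headache. Can the requirements above be met with as few regular expressions as possible? I'm a beginner and not very good at this; I wrote the code below but can't tell what is wrong with it. Could someone please help? It's urgent.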

hyperlink_pat = re.compile(r'<\s*(A|a)\s+[^>]*?\s*?(href|HREF)\s*=\s*["\'][^>]+?["\']\s*>')
comment_pat = re.compile(r'<!--.*?-->', re.S)

#search the matched patterns through the HTML source content string
match_comments = re.search(comment_pat, html_source)
match_links = re.search(hyperlink_pat, html_source)

17 replies

metaphy 2005-05-24

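Following this thread.
I've long wanted to write a web spider of some kind myself, but I've never found any material on it.

Has the original poster finished it? Could you send me a copy?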
moonlinking@hotmail.com

mercury1231 2004-10-07

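Thank you both.

But this is what the assignment requires, so I have no choice. The point is to write a Web Crawler as practice in using RE.

Could anyone help me find a concise way to exclude hyperlinks inside comments and non-HTTP hyperlinks?
If you think the points offered are too few, I can open another thread to give out more :)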

xyzxyz1111 2004-10-07

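The problems you run into when processing HTML are very complex, and re cannot really handle all of them. If you must use it,
take a look at the function HTMLParser.py:parse_starttag().
I think what matters is solving the problem conveniently; there is no need to insist on a single tool.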
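
For comparison, here is a minimal sketch of the parser-based approach this reply refers to. HTMLParser.py is the Python 2 module; the sketch assumes Python 3, where the same class lives in html.parser. The class name LinkCollector and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag outside comments."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '''<!-- <a href="http://hidden.example/">ignored</a> -->
<A title="Link to homepage"
   href="http://www.company.com/index.html">home</A>'''

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

The parser already copes with mixed case, tags that span lines, and links inside comments, which is why it is suggested here as the easier route when the assignment does not forbid it.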

xyzxyz1111 2004-10-07

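groups() returns the parts of the match captured by the parentheses in the expression. Your pattern contains no parentheses, so groups() returns an empty tuple. group(0) returns the whole matched string, and group(1), ... return what each pair of parentheses matched.
So you should add parentheses around the parts you want to extract.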
r'<\s*[Aa]{1}\s+[^>]*?[Hh][Rr][Ee][Ff]\s*=\s*["\']?([^>]+)?["\']?.*?>'

Then try groups() on the match for
"""<a
haid="sd>f"
href = "http://www.google.com"
hihi="wesdfh"

>"""
and """<a
haid="sdf"
href = "http://www.google.com"
hihi="wesdfh"

>"""

mercury1231 2004-10-07

How can I exclude non-HTTP links? The trouble is that some links are HTTP links but are local (relative) ones, such as /index.html.
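
One possible way to treat local links, sketched under the assumption that the URL of the page being crawled is known: resolve each href against that URL first, then test the scheme with a regex. The sketch assumes Python 3 (urllib.parse); the helper name resolve_http and the base URL are hypothetical.

```python
import re
from urllib.parse import urljoin

# Requirement (c) asks for HTTP URLs only, so https is deliberately not accepted here.
HTTP_PAT = re.compile(r'^http://', re.I)

def resolve_http(base_url, href):
    """Resolve href against base_url; return the result only if it is an HTTP URL."""
    absolute = urljoin(base_url, href)
    return absolute if HTTP_PAT.match(absolute) else None

base = "http://www.cuhk.edu.hk/dir/page.html"

print(resolve_http(base, "/index.html"))                 # local link becomes absolute
print(resolve_http(base, "other.html"))                  # relative to the current directory
print(resolve_http(base, "mailto:someone@example.com"))  # not HTTP, so None
```

With this ordering a link such as /index.html is kept as an HTTP link on the current host, while mailto: and ftp:// links are dropped.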

mercury1231 2004-10-07

The hyperlink pattern seems to be right now:
r'<\s*[Aa]{1}\s+[^>]*?[Hh][Rr][Ee][Ff]\s*=\s*["\']?[^>]+?["\']?.*?>'

mercury1231 2004-10-07

That's not allowed; I can only use RE.

xyzxyz1111 2004-10-07

Wouldn't it be better to use htmlparser?

mercury1231 2004-10-07

A reasonably concise method, of course.

mercury1231 2004-10-07

I'd also like to ask how to exclude hyperlinks that are inside comments.

mercury1231 2004-10-07

I wrote another one; I'm not sure whether it works:

a=re.compile(r'<\s*[Aa]{1}\s+[^>]*?href\s*=\s*["\']?[^>]+?["\']?.*?>')

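Also, could someone explain this to me: the Match Object returned by re.search() has methods called .group() and groups(). What does "group" mean here? The API documentation says it is a subgroup matched by the RE, but I still don't understand how to use it.
Thanks.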
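
The behaviour of group() and groups() asked about here can be seen in a small sketch. The pattern below is a simplified, hypothetical one written for illustration (Python 3), not any poster's final pattern; it has exactly one pair of parentheses, around the URL.

```python
import re

# One capturing group: the URL after href=, with or without quotes.
pat = re.compile(r'<a\s+[^>]*?href\s*=\s*["\']?([^>"\'\s]+)', re.I)

tag = '<a\nhaid="sdf"\nhref = "http://www.google.com"\nhihi="wesdfh"\n\n>'
m = pat.search(tag)

print(m.group(0))   # the whole text matched by the pattern
print(m.group(1))   # only what the first pair of parentheses matched
print(m.groups())   # a tuple with one entry per pair of parentheses

# A pattern without parentheses still matches, but has no subgroups.
no_group = re.compile(r'<a\s+[^>]*>', re.I).search(tag)
print(no_group.groups())
```

In other words, each pair of parentheses in the pattern defines one subgroup, numbered from 1 in the order of the opening parentheses, and group(0) is always the entire match.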

mercury1231 2004-10-07

I didn't know you could ignore case like that. I'll give it a try.

mercury1231 2004-10-07

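Thank you both. I had just thought of that myself and have already written it.
I'm just worried about whether these common schemes are enough.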

hyperlink_pat = re.compile(r'<\s*[Aa]{1}\s+[^>]*?[Hh][Rr][Ee][Ff]\s*=\s*[\"\']?([^>\"\']+)[\"\']?.*?>')
comment_pat = re.compile(r'<!--.*?-->', re.S)

# remove the comments from the original html_source
html_source = comment_pat.sub('', html_source)

# find the matched patterns by scanning through the HTML source content string
match_links = hyperlink_pat.findall(html_source)

# extract only HTTP hyperlinks and drop the other kinds
http_pat = re.compile(r"^([Hh][Tt]{2}[Pp]://){1}")
https_pat = re.compile(r"^([Hh][Tt]{2}[Pp][Ss]://){1}")
ftp_pat = re.compile(r"^([Ff][Tt][Pp]://){1}")
news_pat = re.compile(r"^([Nn][Ee][Ww][Ss]://){1}")
ldap_pat = re.compile(r"^([Ll][Dd][Aa][Pp]://){1}")
mailto_pat = re.compile(r"^([Mm][Aa][Ii][Ll][Tt][Oo]:){1}")
script_pat = re.compile(r"^(.*?[Ss][Cc][Rr][Ii][Pp][Tt]:){1}")
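
Instead of one pattern per scheme, a single case-insensitive pattern can pull out whatever scheme a link starts with. This is only a sketch under assumptions: Python 3, a hypothetical helper name scheme_of, and a made-up list of links. It treats anything up to the first colon as a scheme if it looks like one, so a host:port link written without "http://" would be misread.

```python
import re

# A scheme is a letter followed by letters, digits, '+', '.' or '-', then ':'.
SCHEME_PAT = re.compile(r'^([a-z][a-z0-9+.-]*):', re.I)

def scheme_of(link):
    """Return the lowercased scheme of link, or None if it has none."""
    m = SCHEME_PAT.match(link)
    return m.group(1).lower() if m else None

links = [
    "HTTP://www.company.com/",
    "ftp://host/file",
    "mailto:a@b.c",
    "javascript:void(0)",
    "/index.html",
]

http_links = [link for link in links if scheme_of(link) == "http"]
print(http_links)
```

Links for which scheme_of returns None, such as /index.html, have no scheme at all and can be handled separately as relative links.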

shhgs 2004-10-07

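To put it in more detail: the pattern for every a tag should be "(?si)<a\s*(.*?)>". Here (?si) sets the regex matching mode: ignore case, and treat the whole string as a single line, so that . also matches \n. (.*?) means non-greedy. If you still don't get it, then you really don't understand anything yet, so go read a book. Text Processing in Python is good, and the Python documentation is good too.

re is very tricky, so use it a lot. The more you use it, the more powerful it becomes.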
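
The effect of the (?si) flags in this pattern can be checked with a short sketch (Python 3; the sample string is made up from the tags in the question). Note that <a\s* as written would also match other tags that begin with "a", which is not addressed here.

```python
import re

A_TAG = re.compile(r'(?si)<a\s*(.*?)>')

html = ('<A title="Link to homepage"\n'
        'href="http://www.company.com/index.html">home</A> '
        '<a href= www.company.com>')

# With (?s) the dot crosses the line break, so both tags are found.
with_s = A_TAG.findall(html)
print(with_s)

# Without (?s) the first tag, which spans two lines, is missed.
without_s = re.findall(r'(?i)<a\s*(.*?)>', html)
print(without_s)
```

Each item returned by findall is the attribute text of one a tag, ready for a second pattern that looks for href inside it.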

limodou 2004-10-07

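You can first use sub from re to replace all comments with an empty string, and then analyze the links.

As for link protocols, only a few are common, such as http://, ftp://, and mailto:. If none is present, just treat the link as http by default.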
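
The "strip comments first, then look for links" order can be sketched as follows. Both patterns and the sample page are assumptions made for illustration (Python 3), not the original poster's code.

```python
import re

COMMENT_PAT = re.compile(r'<!--.*?-->', re.S)   # re.S lets a comment span lines
HREF_PAT = re.compile(r'<a\s[^>]*?href\s*=\s*["\']?([^>"\'\s]+)', re.I)

html = '''<!-- old link:
<a href="http://old.example.com/">old</a>
-->
<a href="http://www.company.com/index.html">home</a>'''

# Searching the raw page also finds the link hidden in the comment.
raw_links = HREF_PAT.findall(html)

# Deleting comments first leaves only the visible link.
cleaned = COMMENT_PAT.sub('', html)
links = HREF_PAT.findall(cleaned)

print(raw_links)
print(links)
```

The non-greedy .*? matters here: a greedy .* would swallow everything between the first <!-- and the last --> on the page, including real links between two comments.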

shhgs 2004-10-07

1. To handle tags that span lines, use single-line mode (?s).
2. To exclude comments, simply delete them.
re.sub("(?s)<!--.*?-->", "", string)
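3. To handle case, use ignore case (?i).
4. To handle partial (relative) links, use the urlparse library.

First find all the a tags with <a\s*(.*?)>, then pick out the attributes one by one. Don't be in too much of a hurry: divide and conquer.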
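
The four steps and the two-stage "divide and conquer" search can be put together in one sketch. Everything here is an assumption for illustration: Python 3 (where the urlparse library is urllib.parse), the function name extract_links, the three patterns, and the sample page. \b is added after <a so that other tags starting with "a" are not picked up.

```python
import re
from urllib.parse import urljoin, urldefrag

COMMENT = re.compile(r'(?s)<!--.*?-->')
A_TAG = re.compile(r'(?si)<a\b(.*?)>')
# href value in double quotes, single quotes, or unquoted.
HREF = re.compile(r'(?si)\bhref\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))')

def extract_links(html, base_url):
    """Return the distinct HTTP URLs linked from html, without fragments."""
    html = COMMENT.sub('', html)              # step 2: delete comments
    found = []
    for attrs in A_TAG.findall(html):         # stage 1: every a tag
        m = HREF.search(attrs)                # stage 2: its href attribute
        if not m:
            continue
        href = next(g for g in m.groups() if g is not None)
        url = urldefrag(urljoin(base_url, href.strip())).url   # step 4
        if url.lower().startswith('http://') and url not in found:
            found.append(url)
    return found

page = '''<!-- <a href="http://hidden.example/">skip</a> -->
<A title="Link to homepage"
   href="http://www.cuhk.edu.hk/index.html#people">people</A>
<a href='/index.html'>home</a>
<a href="mailto:someone@example.com">mail</a>'''

result = extract_links(page, "http://www.cuhk.edu.hk/")
print(result)
```

Splitting the work into a tag pattern and an attribute pattern keeps each regex short, which is the point of the advice above; the price is a second pass over each tag.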

limodou 2004-10-07