Using regular expressions in Python to extract hyperlinks from a web page (Crawler)
The requirements are as follows:
(a) A hyperlink tag has the form <a() href(s)=(s)url ()>, where (s) is whitespace and ()
stands for other attributes. Note that
i. () and (s) may be absent.
ii. a and href are case-insensitive.
iii. A tag can span multiple lines.
iv. Hyperlink tags inside comment tags ( <!-- --> ) should be ignored.
Samples of hyperlink tags:
• <a href= www.company.com>
• <A title="Link to homepage"
  href="http://www.company.com/index.html">
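
One possible reading of requirement (a), points i to iii, as a single pattern is sketched below. The names LINK_PAT and find_links are made up for illustration, and accepting single-quoted and unquoted URLs is an assumption drawn from the first sample rather than something the requirement spells out. Comment handling (point iv) is not covered here.

```python
import re

# Hypothetical pattern: <a, optional other attributes, whitespace, href = url.
LINK_PAT = re.compile(
    r"""<a\b            # tag name, case-insensitive because of re.IGNORECASE
        [^>]*?          # optional other attributes; [^>] also matches newlines
        \s href         # href must be preceded by whitespace
        \s*=\s*         # optional whitespace around the equals sign
        (?:"([^"]*)"    # group 1: URL in double quotes
          |'([^']*)'    # group 2: URL in single quotes
          |([^\s>"']+)) # group 3: unquoted URL
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_links(html):
    """Return the href value of every <a ... href=...> tag found in html."""
    links = []
    for m in LINK_PAT.finditer(html):
        # Exactly one of the three groups took part in the match.
        links.append(next(g for g in m.groups() if g is not None))
    return links

samples = ('<a href= www.company.com>\n'
           '<A title="Link to homepage"\n'
           'href="http://www.company.com/index.html">')
print(find_links(samples))
```

Because the other attributes are matched with [^>] instead of a dot, the pattern crosses line breaks without needing re.DOTALL.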
(b) URLs that differ only in the fragment (i.e. the part that follows #) should be considered the same page. For example,
http://www.cuhk.edu.hk/index.html and http://www.cuhk.edu.hk/index.html#people
are the same page.
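
Requirement (b) does not need a regular expression at all. A minimal sketch, assuming Python 3, is to drop the fragment with urllib.parse.urldefrag before recording a URL as visited; page_key and seen are hypothetical names.

```python
from urllib.parse import urldefrag

def page_key(url):
    """Return url without its fragment, so same-page URLs compare equal."""
    return urldefrag(url).url

seen = set()
for url in ["http://www.cuhk.edu.hk/index.html",
            "http://www.cuhk.edu.hk/index.html#people"]:
    seen.add(page_key(url))

print(seen)  # both URLs collapse to one entry
```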
(c) Only HTTP URLs need to be considered.
(d) The initial link is an HTTP URL.
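
For (c) and (d), one hedged approach is to resolve every extracted href against the URL of the page it came from and keep only results whose scheme is http. The function name http_only is hypothetical, and both resolving relative links this way and excluding https are assumptions about what the assignment intends.

```python
from urllib.parse import urljoin, urlsplit

def http_only(base_url, hrefs):
    """Resolve hrefs against base_url and keep those with the http scheme."""
    resolved = (urljoin(base_url, h) for h in hrefs)
    return [u for u in resolved if urlsplit(u).scheme == "http"]

links = http_only(
    "http://www.company.com/index.html",
    ["about.html", "mailto:me@company.com",
     "ftp://ftp.company.com/f", "http://www.cuhk.edu.hk/"],
)
print(links)
```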
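This is giving me a real headache. Is it possible to meet the requirements above with as few regular expressions as possible? I'm a beginner, so my skills are weak; I wrote an attempt but can't tell what is wrong with it. Could the experts here please help? It's urgent.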
import re

hyperlink_pat = re.compile(r'<\s*(A|a)\s+[^>]*?\s*?(href|HREF)\s*=\s*["\'][^>]+?["\']\s*>')
comment_pat = re.compile(r'<!--.*?-->')
# Search the HTML source content string for each pattern
match_comments = re.search(comment_pat, html_source)
match_links = re.search(hyperlink_pat, html_source)
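
A few things in the attempt above can be checked directly. Without re.DOTALL the dot in comment_pat does not match a newline, so multi-line comments are missed; re.search returns only the first match, while finditer yields all of them; and hyperlink_pat requires a quoted URL followed immediately by >, so the unquoted sample is skipped. The sketch below reuses both patterns unchanged and only adds a DOTALL variant of the comment pattern. The test HTML and the hidden.example host are invented for illustration.

```python
import re

# The two patterns from the attempt, unchanged.
hyperlink_pat = re.compile(r'<\s*(A|a)\s+[^>]*?\s*?(href|HREF)\s*=\s*["\'][^>]+?["\']\s*>')
comment_pat = re.compile(r'<!--.*?-->')

# With re.DOTALL the dot also matches newlines, covering multi-line comments.
comment_dotall = re.compile(r'<!--.*?-->', re.DOTALL)

# Invented sample input: a two-line comment followed by two visible links.
html_source = '''<!-- <a href="http://hidden.example/">
old link</a> -->
<a href="http://www.company.com/index.html">home</a>
<a href= www.company.com>'''

# Remove comments first, then collect every remaining match.
visible = comment_dotall.sub('', html_source)
matches = [m.group(0) for m in hyperlink_pat.finditer(visible)]
print(matches)
```

Stripping comments before searching keeps requirement iv out of the link pattern itself, which leaves two regular expressions in total.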