Why can't this page be parsed with BeautifulSoup?

xueshi 2009-05-31 09:04:04

import urllib
from BeautifulSoup import BeautifulSoup

url = 'http://news.cnad.com/html/Article/2009/0520/2009052011244679.shtml'

data = urllib.urlopen(url).read()
soup = BeautifulSoup(data)

# print soup

contents = soup.findAll('div', "newsbody")
print len(contents)
for c in contents:
    print c


It raises the following error:
Traceback (most recent call last):
File "ipiao.py", line 36, in <module>
soup = BeautifulSoup(data)
File "/home/lixueshi/BeautifulSoup-3.1.0.1/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/home/lixueshi/BeautifulSoup-3.1.0.1/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/home/lixueshi/BeautifulSoup-3.1.0.1/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.5/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.5/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/lib/python2.5/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib/python2.5/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 319, column 198


Other URLs parse fine, though. Is this page's markup malformed?
xueshi 2009-05-31
Found the cause.

%python test.py
Traceback (most recent call last):
File "test.py", line 37, in <module>
main()
File "test.py", line 33, in main
soup2 = BeautifulSoup(t2)
File "/usr/local/lib/python2.5/site-packages/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.5/site-packages/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.5/site-packages/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/local/lib/python2.5/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/local/lib/python2.5/HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "/usr/local/lib/python2.5/HTMLParser.py", line 314, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "/usr/local/lib/python2.5/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"</if' + 'rame>", at line 632, column 381

BeautifulSoup is one of the best tools for parsing HTML and XML. I used to do this with regular expressions and later switched to BeautifulSoup, but today it threw an error on one particular page.

Since this is where it chokes, we can simply skip the offending lines:

t2 = open("t2.txt", "r").readlines()
data = ''
for i in t2:
    # Keep only the lines that do not contain the obfuscated end tag
    if i.find("</if' + 'rame>") == -1:
        data += i
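The same filtering step can be written as a self-contained helper (the function name and sample data below are illustrative; the obfuscated tag string is the one from the traceback):

```python
def drop_obfuscated_lines(lines, marker="</if' + 'rame>"):
    # Keep only the lines that do NOT contain the obfuscated end tag
    # that trips up HTMLParser.
    return [line for line in lines if marker not in line]

sample = [
    '<div class="newsbody">story text</div>\n',
    "document.write('</if' + 'rame>');\n",
]
cleaned = drop_obfuscated_lines(sample)
data = ''.join(cleaned)
```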

After this change, a new error appeared.

check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib/python2.5/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 49, column 20

A web search turned up an answer to this exact problem from the author of BeautifulSoup, posted in the beautifulsoup Google group:

http://groups.google.com/group/beautifulsoup/msg/d5a7540620538d14

In short, the latest version of BeautifulSoup cannot handle this markup, and there are three workarounds:

1. Pre-process the data so that HTMLParser can handle it.
2. Use lxml or html5lib.
3. Use Beautiful Soup 3.0.7a, the last version that uses SGMLParser.
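Option 1 might be sketched as a small regex pass that strips the split-tag fragments before handing the markup to the parser (a sketch under the assumption that these obfuscated tags are the only malformed markup on the page; the pattern below is mine, not from the author's reply):

```python
import re

# Matches tag fragments built by JavaScript string concatenation,
# e.g. <if' + 'rame src="..."> and </if' + 'rame>, which HTMLParser
# rejects as malformed start/end tags.
OBFUSCATED_TAG = re.compile(r"</?\w+' \+ '[^>]*>")

def preprocess(markup):
    # Drop the unparseable fragments; everything else passes through.
    return OBFUSCATED_TAG.sub("", markup)
```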

So it seems the only option is to fall back to 3.0.7a...

Frustratingly, after reverting to the older version, all of the problems above went away. Maddening.
