SGMLParser misbehaves when filtering JS

lioujian47 2008-07-14 03:30:57

I did not filter out the JavaScript when scraping a web page, and got this error:



Traceback (most recent call last):
  File "F:\Python25\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "F:\Python25\CTR_my_way.py", line 190, in <module>
    print cn2juhao(body(i,j))
  File "F:\Python25\CTR_my_way.py", line 131, in body
    parser.feed(html)
  File "F:\Python25\lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "F:\Python25\lib\sgmllib.py", line 169, in goahead
    k = self.parse_declaration(i)
  File "F:\Python25\lib\markupbase.py", line 98, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "F:\Python25\lib\markupbase.py", line 388, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "F:\Python25\lib\sgmllib.py", line 106, in error
    raise SGMLParseError(message)
SGMLParseError: expected name token at '<!!---["+bb+"]-start'

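I suspect SGMLParser is mishandling this input, for the reason another poster described, but I don't know how to use re to avoid it. Could the experts here lend a hand?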


5 replies


lioujian47 2008-07-14


Thanks a lot
^.^

maplele20 2008-07-14


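This problem is most likely caused by markup nested inside a <script>xxx</script> block.
Solution:
1.

html = re.sub('onload=\"\s*[^\"]*\"','',html)
html = re.sub('onmouseover=\"\s*[^\"]*\"','',html)

# Change to:
# (.* is greedy and does not cross newlines, so each pattern removes
# everything up to the last double quote on the same line)

html = re.sub(r'b2="[^"].*"', '', html)
html = re.sub(r'e2="[^"].*"', '', html)
html = re.sub(r'b="[^"].*"', '', html)
html = re.sub(r'e="[^"].*"', '', html)

2.

html = re.sub('onload=\"\s*[^\"]*\"','',html)
html = re.sub('onmouseover=\"\s*[^\"]*\"','',html)

# Change to:
# (meant to remove pseudo-declarations such as '<!!---[...]-start-->',
# i.e. a '<!' that does not open a regular '<!--' comment,
# through the last '-->' on the same line)

html = re.sub(r'<![^-->].*-->','',html)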
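
The same idea, cleaning the HTML with re before it reaches parser.feed, could also be sketched more generally. This is only an illustration and not the fix from the reply above: the helper name strip_unparseable and the three pattern names are made up for this example, and it assumes that the contents of <script> blocks and comments are not needed in the extracted text.

```python
import re

# Hypothetical helper (not from the thread): remove markup that a strict
# SGML-style parser may reject, before handing the page to the parser.

# Whole <script>...</script> blocks, including any pseudo-markup inside them.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>', re.IGNORECASE | re.DOTALL)

# Regular HTML comments.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)

# "<!" followed by a character that cannot start a declaration name
# (not a letter, "-", "[" or ">"), e.g. '<!!---[x]-start-->'.
# A normal '<!DOCTYPE ...>' is left untouched.
BAD_DECL_RE = re.compile(r'<![^A-Za-z\->\[][^>]*>')


def strip_unparseable(html):
    """Return html with scripts, comments and malformed '<!' runs removed."""
    html = SCRIPT_RE.sub('', html)
    html = COMMENT_RE.sub('', html)
    html = BAD_DECL_RE.sub('', html)
    return html


if __name__ == '__main__':
    page = ('<html><script type="text/javascript">'
            'document.write("<!!---["+bb+"]-start-->");</script>'
            '<body><!-- note --><p>Hello</p></body></html>')
    print(strip_unparseable(page))
```

Calling html = strip_unparseable(html) just before parser.feed(html) would be the intended use. Removing whole script blocks first is a deliberate choice here: it avoids having to guess which variable names (b2, e2, ...) a particular page uses.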

lioujian47 2008-07-14


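By the way, there is a module called IEC (see http://www.mayukhbose.com/python/IEC/index.php) that includes a feature for converting HTML to plain text. It works perfectly!!! But I don't know what its code looks like :(

I think it would be really cool if someone could work out how it does that.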

lioujian47 2008-07-14




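import re
import urllib  # Python 2: provides urllib.urlopen

def body(url1, url2):
    # Fetch url1; if that request fails, fall back to url2.
    try:
        html = urllib.urlopen(url1).read()
    except Exception:
        html = urllib.urlopen(url2).read()
    #txt = unicode(txt,"gbk")
    # Strip inline event-handler attributes before parsing.
    html = re.sub(r'onload="\s*[^"]*"', '', html)
    html = re.sub(r'onmouseover="\s*[^"]*"', '', html)
    # html2txt is the sgmllib-based parser class defined elsewhere in the script (not shown here).
    parser = html2txt()
    parser.feed(html)
    parser.close()
    return parser.text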
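
For readers on Python 3, where sgmllib no longer exists, the feed/close/text pattern used in body() could be approximated with the standard html.parser module. The sketch below is a stand-in, not the html2txt class from this thread (which is not shown): the names TextExtractor and html_to_text are invented for the example, and it assumes that text inside <script> and <style> should simply be dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for html2txt: collects visible text only."""

    SKIP_TAGS = ('script', 'style')

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []       # text fragments found so far
        self.skip_depth = 0   # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

    @property
    def text(self):
        return '\n'.join(self.parts)


def html_to_text(html):
    """Feed html to a TextExtractor and return the collected text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text


if __name__ == '__main__':
    page = ('<html><head><script>var a = "<!!---[" + bb + "]-start";</script>'
            '</head><body><p>Hello</p><p>World</p></body></html>')
    print(html_to_text(page))
```

html.parser treats the contents of a script element as raw text until the closing tag, so the string concatenation inside the script is passed to handle_data and ignored here instead of being parsed as a declaration.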