Problem displaying web page content read with Python

黑夜愁客 2011-07-31 08:29:54

I want to read a page from the Tudou website, for example http://www.tudou.com/programs/view/kS03BynGs8Q
but this is what I get:

>>> req = urllib2.Request('http://www.tudou.com/programs/view/kS03BynGs8Q')
>>> req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0)')
>>> page = urllib2.urlopen(req)
>>> data = page.read()
>>> print data
ﾋ
>>> print len(data)
7202

Why doesn't the page's HTML get printed, even though the length is 7202?

8 replies

I_NBFA 2011-08-03

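For this kind of problem, start by capturing the traffic so you can see what the server actually sends back.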
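
A lightweight way to follow this advice, short of a full packet capture, is to inspect the raw bytes and headers of the response. The sketch below is only an illustration: it is written for Python 3 (the thread itself uses Python 2), the helper name `describe_body` is hypothetical, and the response is simulated so that it runs offline. It relies on the fact that a gzip stream starts with the two bytes 0x1f 0x8b, which are not printable text.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream (RFC 1952)


def describe_body(body, headers):
    """Summarize a response body. `headers` is any dict-like mapping.

    Hypothetical helper for illustration; not part of the original thread.
    """
    return {
        "length": len(body),
        "content_encoding": headers.get("Content-Encoding"),
        "starts_with_gzip_magic": body[:2] == GZIP_MAGIC,
        "first_bytes": body[:4].hex(),
    }


# Simulated response (assumption): with a real request these values would
# come from response.read() and response.headers.
sample_body = gzip.compress(b"<html><title>demo</title></html>")
sample_headers = {"Content-Encoding": "gzip"}

info = describe_body(sample_body, sample_headers)
print(info)
```

If `starts_with_gzip_magic` is true, the body is compressed data rather than HTML, which would be consistent with a non-empty length and unreadable output when printed.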

zenxiaoxian 2011-08-03

import urllib, re
url = 'http://www.tudou.com/programs/view/kS03BynGs8Q'
wp = urllib.urlopen(url).read()
#bug = re.findall(r"<title>*......", wp)
#for mai in bug

print wp

Waistcoat22 2011-08-01

The site is probably returning its content gzip-compressed. In my tests it did not return gzip every time, so I added a check:

import urllib2
import StringIO
import gzip

url = 'http://www.tudou.com/programs/view/kS03BynGs8Q'
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0)')
response = urllib2.urlopen(req)
content = response.read()
response.close()

# Use .get() so a response without a Content-Encoding header does not raise KeyError.
if response.headers.get("Content-Encoding") == 'gzip':
    # Wrap the compressed bytes in a file-like object and decompress them.
    html = gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
else:
    html = content

print html
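
For readers on Python 3, where `urllib2` became `urllib.request` and `StringIO.StringIO` is replaced by `io.BytesIO` for byte data, the same check could look roughly like the sketch below. The function name `decode_body` is hypothetical, and the responses are simulated with `gzip.compress` so the example runs without network access.

```python
import gzip
import io


def decode_body(body, content_encoding):
    """Return the raw page bytes, decompressing only when the server
    declared gzip. Hypothetical helper, not from the original thread."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        # Wrap the compressed bytes in a file-like object and decompress.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body


# Simulated responses (assumption): one compressed, one plain.
page = b"<html><body>hello</body></html>"
compressed = gzip.compress(page)

print(decode_body(compressed, "gzip") == page)
print(decode_body(page, None) == page)
```

With a real request, the arguments would come from the response object, for example `response.read()` and `response.headers.get("Content-Encoding")` on the object returned by `urllib.request.urlopen(req)`. The result is still bytes, so it would need decoding with the page's character set before being treated as text.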