Python crawler problem, looking for help

Linux光 2017-08-20 10:50:38
I wrote a simple crawler to save the images from a website, but it fails when I run it. Any pointers would be appreciated.


#!C:\Python27
#coding=utf-8
import urllib
import re

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = 'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print imgre
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        urllib.urlretrieve(imgurl)
        print imgurl

html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print '*******************************************************************************'
getImg(html)
7 replies
LinDRon 2017-08-23
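Your code runs under Python 2; in Python 3 the library names changed. If you are getting started with Python 3, I suggest learning with the requests and bs4 libraries.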
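The reply above recommends requests and bs4, which are third-party packages and have to be installed separately. As a minimal sketch of the same idea (parse the HTML rather than match it with a regular expression), the following uses only the Python 3 standard library's html.parser instead of bs4. The names ImgSrcParser and extract_img_sources and the sample HTML are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(value)


def extract_img_sources(html):
    """Return the src values of all <img> tags, in document order."""
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.sources


# Made-up sample input, not taken from the site in the question
sample = '<p><img src="http://example.com/a.jpg"><img alt="x" src="/b.png"></p>'
print(extract_img_sources(sample))
```

Unlike the regular expression in the question, this does not filter by file extension, so a caller would still need to keep only the .jpg/.jpeg entries if that is what is wanted.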
Billy___Chen 2017-08-22
@胡工: Running it under Python 3.* fails with:

Traceback (most recent call last):
  File "C:/Users/************(path omitted).py", line 21, in <module>
    import bs4
ImportError: No module named 'bs4'
Linux光 2017-08-21
Quoting reply #2 by xpresslink:
If you must run it under Python 3.x, change it to the following:


#coding=utf-8
import urllib.request
import re
 
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read().decode('gb2312')
    return html
 
def getImg(html):
    reg = 'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print(imgre)
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl)
        print(imgurl)
 
html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print('*******************************************************************************')
getImg(html)


I ran this code of yours and it does not work:

C:\Users\wdi>python D:\Python\capture.py
Traceback (most recent call last):
  File "D:\Python\capture.py", line 3, in <module>
    import urllib.request
ImportError: No module named request

Then I changed import urllib.request to import urllib, and it still fails with the following error:

C:\Users\wdi>python D:\Python\capture.py
Traceback (most recent call last):
  File "D:\Python\capture.py", line 20, in <module>
    html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
  File "D:\Python\capture.py", line 7, in getHtml
    page = urllib.request.urlopen(url)
AttributeError: 'module' object has no attribute 'request'
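Both tracebacks above look like what a Python 2 interpreter prints when it is given Python 3 code: urllib.request exists only in Python 3, so the python command on this machine is most likely a 2.x install. A minimal sketch, assuming the goal is one script that runs under either version, is to try the Python 3 import first and fall back to the Python 2 names:

```python
import sys

try:
    # Python 3: urlopen and urlretrieve live in urllib.request
    from urllib.request import urlopen, urlretrieve
except ImportError:
    # Python 2: the same functions live directly in urllib
    from urllib import urlopen, urlretrieve

# Show which major version is actually running the script
print(sys.version_info[0])
```

After this import, the rest of the script can call urlopen(url) and urlretrieve(url, filename) without a module prefix. Note that the Python 2 code in this thread also uses the print statement, which would still need to be written as print(...) to run under Python 3.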
胡争辉 2017-08-21
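To show clearly the indentation that Python strictly requires, please see the post on my blog. A detailed explanation is given in the comments in the source code: http://blog.csdn.net/hu_zhenghui/article/details/77450246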
混沌鳄鱼 2017-08-21


#!C:\Python27
#coding=utf-8
import urllib
import re
import os.path

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = 'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print imgre
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        fil_name = os.path.split(imgurl)[-1]
        file_data = urllib.urlretrieve(imgurl,fil_name)
        print imgurl

html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print '*******************************************************************************'
getImg(html)

I ran it under Python 2.7 and it works fine.
================== RESTART: D:/sharefolder/python_temp/1.py ==================
*******************************************************************************
<_sre.SRE_Pattern object at 0x00000000038BFB70>
http://img.qqbody.com/uploads/allimg/201412/08-184024_462.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184023_846.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184023_704.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184020_333.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184021_301.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184022_67.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184013_389.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184014_122.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184015_286.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184016_797.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184016_996.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184017_902.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184018_103.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184019_319.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184020_74.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184025_227.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184025_112.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184026_123.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184027_328.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184028_587.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184028_246.jpg
http://upload.qqbody.com/allimg/1703/14-1F3161531260-L.jpg
>>>
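The change in the reply above is the second argument to urlretrieve. Without a file name, urlretrieve stores the download in a temporary file, so nothing shows up in the working directory. A rough Python 3 sketch of the same file-name step follows. filename_from_url and save_images are hypothetical helper names, and save_images is only defined here, not called, so the example makes no network request.

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


def save_images(urls):
    """Download each URL into the current directory under its own file name."""
    for url in urls:
        name = filename_from_url(url)
        if name:  # skip URLs whose path ends in '/'
            urlretrieve(url, name)
            print(url, "->", name)


# One of the URLs from the output above, used only to derive a file name
print(filename_from_url("http://img.qqbody.com/uploads/allimg/201412/08-184024_462.jpg"))
```

Taking the name from the parsed URL path rather than from os.path.split on the whole URL keeps a trailing query string such as ?x=1 out of the file name.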
混沌鳄鱼 2017-08-20
If you must run it under Python 3.x, change it to the following:


#coding=utf-8
import urllib.request
import re
 
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read().decode('gb2312')
    return html
 
def getImg(html):
    reg = 'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print(imgre)
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl)
        print(imgurl)
 
html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print('*******************************************************************************')
getImg(html)


混沌鳄鱼 2017-08-20
This code was written for Python 2.7 and must be run under Python 2.7. It raises errors under Python 3.x because the library names are different.
