Python web crawler problem: looking for help

Linux光 2017-08-20 10:50:38

I wrote a simple crawler to save the images from a website, but it fails when I run it. Any pointers would be appreciated.



#!C:\Python27
#coding=utf-8
import urllib
import re

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = 'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print imgre
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        urllib.urlretrieve(imgurl)
        print imgurl

html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print '*******************************************************************************'
getImg(html)


7 replies


LinDRon 2017-08-23


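Your code runs under Python 2; in Python 3 the library names changed. If you are getting started with Python 3, I suggest learning with the requests and bs4 libraries.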
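The rename mentioned in this reply can be bridged with a guarded import. This is a minimal sketch using only the standard library; the try/except pattern and the helper name `get_html` are illustrative and do not come from the original post.

```python
# Import the download helpers under whichever name this interpreter provides.
try:
    from urllib.request import urlopen, urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlopen, urlretrieve  # Python 2 location


def get_html(url):
    """Return the raw response body for url (bytes on Python 3, str on Python 2)."""
    page = urlopen(url)
    try:
        return page.read()
    finally:
        page.close()
```

With this in place the rest of the script can call `urlopen` and `urlretrieve` directly, without caring which major version is running.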

Billy___Chen 2017-08-22


@胡工: Running it under Python 3.* raises an error:

Traceback (most recent call last):
  File "C:/Users/************ (path omitted).py", line 21, in <module>
    import bs4
ImportError: No module named 'bs4'
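This ImportError means bs4 is not installed for the interpreter running the script. It is a third-party package (published on PyPI as beautifulsoup4), not part of the standard library. Below is a small sketch for checking availability before importing; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("bs4"):
    # Install into the same interpreter that runs this script.
    print("bs4 is missing; try: python -m pip install beautifulsoup4")
```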

Linux光 2017-08-21


Quoting xpresslink's reply (#2):

If you really have to run it under Python 3.x, change it to the following:



#coding=utf-8
import urllib.request
import re

def getHtml(url):
    page = urllib.request.urlopen(url)
    # read() returns bytes in Python 3; decode before matching with a str pattern
    html = page.read().decode('gb2312')
    return html

def getImg(html):
    # raw string so the backslash in \. reaches the regex engine unchanged
    reg = r'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print(imgre)
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        # with no filename argument, urlretrieve saves to a temporary file
        urllib.request.urlretrieve(imgurl)
        print(imgurl)

html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print('*******************************************************************************')
getImg(html)

I ran your code and it fails:

C:\Users\wdi>python D:\Python\capture.py
Traceback (most recent call last):
  File "D:\Python\capture.py", line 3, in <module>
    import urllib.request
ImportError: No module named request

Then I changed import urllib.request to import urllib, and it still fails, with the error below:

C:\Users\wdi>python D:\Python\capture.py
Traceback (most recent call last):
  File "D:\Python\capture.py", line 20, in <module>
    html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
  File "D:\Python\capture.py", line 7, in getHtml
    page = urllib.request.urlopen(url)
AttributeError: 'module' object has no attribute 'request'
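Both error messages here are what Python 2 reports for urllib.request, which suggests that the python command on this machine starts a 2.x interpreter. That is an inference from the tracebacks, not something confirmed in the thread. A quick sketch to check which interpreter is running, using a hypothetical helper name:

```python
import sys


def urllib_module_name():
    """Name the module that provides urlopen on the running interpreter."""
    if sys.version_info[0] >= 3:
        return "urllib.request"
    return "urllib"


# Report the interpreter version alongside the module to import from.
print("Python %d.%d -> use %s.urlopen"
      % (sys.version_info[0], sys.version_info[1], urllib_module_name()))
```

If this prints a 2.x version, the Python 3 code needs to be run with the Python 3 interpreter explicitly.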

胡争辉 2017-08-21


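To show clearly the indentation that Python strictly requires, I posted the code on my blog; the detailed explanation is in the source code comments: http://blog.csdn.net/hu_zhenghui/article/details/77450246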

混沌鳄鱼 2017-08-21




#!C:\Python27
#coding=utf-8
import urllib
import re
import os.path

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = 'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print imgre
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        fil_name = os.path.split(imgurl)[-1]
        file_data = urllib.urlretrieve(imgurl,fil_name)
        print imgurl

html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print '*******************************************************************************'
getImg(html)

I ran it under Python 2.7 and it works fine.
================== RESTART: D:/sharefolder/python_temp/1.py ==================
*******************************************************************************
<_sre.SRE_Pattern object at 0x00000000038BFB70>
http://img.qqbody.com/uploads/allimg/201412/08-184024_462.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184023_846.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184023_704.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184020_333.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184021_301.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184022_67.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184013_389.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184014_122.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184015_286.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184016_797.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184016_996.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184017_902.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184018_103.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184019_319.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184020_74.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184025_227.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184025_112.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184026_123.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184027_328.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184028_587.jpg
http://img.qqbody.com/uploads/allimg/201412/08-184028_246.jpg
http://upload.qqbody.com/allimg/1703/14-1F3161531260-L.jpg
>>>
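The key change in this version is passing a local filename to urlretrieve, taken from the last path segment of each image URL. Below is a Python 3 sketch of that step, assuming URLs shaped like those in the output listed here. The name `filename_from_url` is illustrative, and it uses urllib.parse with posixpath instead of os.path.split so that a query string, if present, is not included in the name.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


print(filename_from_url("http://img.qqbody.com/uploads/allimg/201412/08-184024_462.jpg"))
# prints: 08-184024_462.jpg
```

The result can then be passed as the second argument to urllib.request.urlretrieve so each image lands in the current directory under its original name.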

混沌鳄鱼 2017-08-20


If you really have to run it under Python 3.x, change it to the following:



#coding=utf-8
import urllib.request
import re

def getHtml(url):
    page = urllib.request.urlopen(url)
    # read() returns bytes in Python 3; decode before matching with a str pattern
    html = page.read().decode('gb2312')
    return html

def getImg(html):
    # raw string so the backslash in \. reaches the regex engine unchanged
    reg = r'src="([^ >]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    print(imgre)
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        # with no filename argument, urlretrieve saves to a temporary file
        urllib.request.urlretrieve(imgurl)
        print(imgurl)

html = getHtml("https://www.qqbody.com/gxtouxiang/19290.html")
print('*******************************************************************************')
getImg(html)
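The decode call in getHtml matters because in Python 3 read() returns bytes, and a str regular expression cannot search bytes. The matching step can be tried offline on a small sample, as in this sketch. The function name `extract_image_urls`, the sample markup, and the errors="replace" choice are assumptions for illustration; the pattern is the one from the post.

```python
import re

# Same pattern as in the post, written as a raw string.
IMG_SRC = re.compile(r'src="([^ >]+\.(?:jpeg|jpg))"')


def extract_image_urls(raw_bytes, encoding="gb2312"):
    """Decode a response body and return every src value ending in .jpg or .jpeg."""
    html = raw_bytes.decode(encoding, errors="replace")
    return IMG_SRC.findall(html)


sample = (b'<img src="http://img.qqbody.com/uploads/allimg/201412/08-184024_462.jpg">'
          b' <img src="logo.png">')
print(extract_image_urls(sample))
# prints: ['http://img.qqbody.com/uploads/allimg/201412/08-184024_462.jpg']
```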

混沌鳄鱼 2017-08-20