Why do I still get errors when crawling through a proxy server, even though I set the request headers?

Peter_Luoz 2018-01-30 10:24:12

import re
import time

import requests

# url = "http://31f.cn/"

'''
1. Fetch the page
2. Extract the fields with a regular expression and store them in dicts
3. Validate each proxy server address
4. Write the results to a text file
'''

url = "http://www.xicidaili.com"

def get_proxy(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"}
    ipcheck = []
    content = requests.get(url, headers=headers).text
    # Captured groups, in order: IP, port, region, protocol
    iplist = re.findall(r'''<tr class=".*?">.*?<td>([0-9.]*?)</td>\s*?<td>([0-9]*?)</td>\s*?<td>(\w*?)</td>\s*?<td class="country">.*?</td>\s*?<td>(.*?)</td>.*?</tr>''', content, flags=re.S)
    for i in iplist:
        # Keys: 地区 = region, 端口 = port, 协议 = protocol
        dic1 = {"地区": i[2], "IP": i[0], "端口": i[1], "协议": i[3]}
        ipcheck.append(dic1)
    return ipcheck

get_proxy(url)

class Serveragent_check():
    def __init__(self, iplist):
        self.iplist = iplist
        self.timeout = 10
        self.testurl = "http://www.baidu.com"
        self.testinfo = "柳絮"  # text expected in the test page
        self.checkedlist = []

    def check(self):
        for i in self.iplist:
            # Route plain-HTTP requests through the candidate proxy
            proxy = {"http": "http://%s:%s" % (i["IP"], i["端口"])}
            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"}
            t1 = time.time()
            try:
                f = requests.get(self.testurl, headers=headers, proxies=proxy, timeout=self.timeout)
                result = f.text
                time.sleep(5)
                pos = result.find(self.testinfo)
                t2 = time.time()
                timeused = t2 - t1
                if pos > 1:
                    self.checkedlist.append({"地区": i[2], "IP": i[0], "端口": i[1], "协议": i[3], "time": timeused})
                else:
                    continue

            except Exception as e:
                print(e)  # The errors start here: each one is either a timeout or a proxy error
                continue

    def sorting(self):
        sorted(self.checkedlist, key=lambda x: x[4])
        print(self.checkedlist)

ipcheck = get_proxy(url)
c = Serveragent_check(ipcheck)
c.check()
c.sorting()

Here is the error output:
C:\ProgramData\Anaconda3\python.exe C:/Users/Administrator/PycharmProjects/untitled/代理服务器抓取.py
HTTPConnectionPool(host='119.148.160.71', port=808): Max retries exceeded with url: http://www.baidu.com/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x0000000003B5FE48>, 'Connection to 119.148.160.71 timed out. (connect timeout=10)'))
HTTPConnectionPool(host='223.199.253.175', port=8118): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None)))
HTTPConnectionPool(host='39.86.32.59', port=8118): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',)))
HTTPConnectionPool(host='117.36.103.170', port=8118): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',)))
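
A quick way to read output like this is to tally the outermost "Caused by" class per proxy. The sketch below is only an illustration: `summarize_failures` and `FAILURE_RE` are hypothetical names, and the sample strings are abbreviated copies of the four messages above, with `...` marking elided text.

```python
import re
from collections import Counter

# Matches the host, the port, and the outermost "Caused by" class in a
# requests/urllib3 "Max retries exceeded" message.
FAILURE_RE = re.compile(r"host='([^']+)', port=(\d+)\).*?\(Caused by (\w+)\(")

def summarize_failures(lines):
    """Count failures per cause and record the cause seen for each proxy."""
    counts = Counter()
    by_proxy = {}
    for line in lines:
        match = FAILURE_RE.search(line)
        if match is None:
            continue  # not a connection-failure line
        host, port, cause = match.group(1), int(match.group(2)), match.group(3)
        counts[cause] += 1
        by_proxy[(host, port)] = cause
    return counts, by_proxy

# Abbreviated from the error output above ("..." marks elided text).
sample = [
    "HTTPConnectionPool(host='119.148.160.71', port=808): Max retries exceeded with url: http://www.baidu.com/ (Caused by ConnectTimeoutError(...))",
    "HTTPConnectionPool(host='223.199.253.175', port=8118): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', ...))",
    "HTTPConnectionPool(host='39.86.32.59', port=8118): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', ...))",
    "HTTPConnectionPool(host='117.36.103.170', port=8118): Max retries exceeded with url: http://www.baidu.com/ (Caused by ProxyError('Cannot connect to proxy.', ...))",
]

counts, by_proxy = summarize_failures(sample)
print(counts)
```

In these four messages the failure is reported while connecting to the proxy itself (a connect timeout, or "Cannot connect to proxy"), which suggests those particular entries were unreachable or refusing connections, whatever headers were sent.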

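My code looks correct to me. I tried several proxy-list sites, but it made no difference: every run ends without a single usable proxy IP. Yet when I ping these proxies they do respond, and quickly.
I also switched the test URL to several other sites, but none of them could be reached. Could it be a cookie problem? But I tried quite a few sites, and surely they can't all require cookies before they can be accessed?
I set the request headers too, with no effect. Is the site still detecting me even with the headers set? What should I do then?
At one point I thought the proxy-list site had blocked my IP, but on reflection that seems unlikely. (I don't have a computer science background, so some of the basics are unclear to me.)
I hope ...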
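
On the ping observation in the question: ping uses ICMP and only shows that the host is reachable. It says nothing about whether the proxy port accepts TCP connections or forwards HTTP. A minimal sketch of a TCP connect check with the standard library follows; `port_is_open` is a hypothetical helper name, and a successful connect still does not prove the proxy will relay requests.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures,
        # all of which are OSError subclasses.
        return False
```

For example, `port_is_open("119.148.160.71", 808)` checks the first proxy from the error output; the result depends on the network at the time of the call.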

6 replies

Peter_Luoz 2018-01-31

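Quoting reply #5 by BFInWR:

    BFInWR (reply #1): proxy={"http": "http", "https": "https"}. xici also lists many HTTPS-type proxies, so you need to check the type here. Use print or try more often to see which errors your code raises.

    Peter_Luoz (reply #3): That's not the problem. Surely not every one of my HTTP proxies is failing? I printed the content fetched through the proxy, and the fetch does work. So why does the check afterwards fail? The errors I get back are either that the remote end closed the connection or a proxy error, and if that were really the case I shouldn't be able to fetch any content at all.

    BFInWR (reply #5): I ran your code and found that the headers are the problem. Switch to a different user_agent and it works. The final result is an empty list, '[ ]', so your code probably has a logic problem as well.

Thanks. After changing the header, some results did come through. I also only just noticed that the mistake was in building the dict after the check: I treated the entry as a list and indexed it the wrong way, which is why the list came out empty. Thanks again.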
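
The fix described in the last reply (looking up the proxy entry by key rather than by position) can be sketched roughly as below. Assumptions, none of which come from the thread: `check_proxies` is a hypothetical function name; `fetch` is a hypothetical callable standing in for `requests.get(testurl, headers=..., proxies=proxies, timeout=...).text` so the logic can be run without a network; the sample proxy entries are made up. The `in` test and the in-place sort by the "time" key are further changes the thread does not discuss.

```python
import time

def check_proxies(iplist, fetch, testinfo):
    """Return the entries whose proxy yields a page containing testinfo,
    ordered from fastest to slowest."""
    checked = []
    for entry in iplist:
        proxies = {"http": "http://%s:%s" % (entry["IP"], entry["端口"])}
        start = time.time()
        try:
            body = fetch(proxies)
        except Exception as exc:
            print(exc)
            continue
        elapsed = time.time() - start
        if testinfo in body:
            # entry is a dict, so copy it by key; entry[0], entry[1], ...
            # would raise KeyError and land in the except branch.
            checked.append(dict(entry, time=elapsed))
    # sorted() returns a new list and leaves the original untouched,
    # so sort in place on the "time" key.
    checked.sort(key=lambda item: item["time"])
    return checked

def fake_fetch(proxies):
    """Made-up stand-in for a real request: one proxy fails, one answers."""
    if proxies["http"].endswith(":808"):
        raise TimeoutError("simulated timeout")
    return "<html>ok marker</html>"

# Made-up entries in the same shape get_proxy() produces.
proxy_list = [
    {"地区": "A", "IP": "10.0.0.1", "端口": "808", "协议": "HTTP"},
    {"地区": "B", "IP": "10.0.0.2", "端口": "8118", "协议": "HTTP"},
]

result = check_proxies(proxy_list, fake_fetch, "marker")
print([p["IP"] for p in result])
```

Passing the fetch step in as a callable is only a convenience for trying the logic offline; in the original class the same two changes apply inside `check` and `sorting`.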