Crawler gets 403, but the page opens fine in a browser. How do I disguise the crawler?

lornechang 2011-04-18 01:55:45

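The page being reachable shows that my IP has not been blocked. The crawler uses cURL with a header that imitates a browser, so why does it stop working after running for only a short while? Besides setting a browser User-Agent, how else can I disguise it?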

	$curl = curl_init();
	curl_setopt($curl,CURLOPT_URL,$url);
	// Send a browser-like User-Agent.
	curl_setopt($curl,CURLOPT_USERAGENT,"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )");
	curl_setopt($curl,CURLOPT_HEADER,1);          // include response headers in the output
	curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);  // return the response as a string
	$a=curl_exec($curl);

17 replies

happypiggy2010 2011-04-19

Are you sending requests too frequently?

木目子 2011-04-19

Open the page in a browser and capture the traffic. It probably needs a cookie, and your crawler is not sending the cookie back!

lornechang 2011-04-19

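[Quote=Quoting reply #14 by happypiggy2010:]
Are you sending requests too frequently?
[/Quote]
The point is that the crawler gets 403 while the browser can still open the page. The crawler even has sleep(3), whereas I can refresh in the browser 3 times per second for a full minute without any problem.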

Ali 2011-04-19

Glad you were able to solve your issue with a different technique.

Are you sending multiple requests to the same server concurrently? That could be why the server blocks the IP: many concurrent requests do not look like human behavior (a person using a browser would not access the same server that many times simultaneously).

Just a thought :)

//Ali

lornechang 2011-04-18

[Quote=Quoting reply #12 by alinaqvi:]

if same code working on one side and not working on other side then could be something different in environment. Let's analyze in different way:

I'm using below versions: (Actually I'm using WAMP……
[/Quote]
My versions are HIGHER than yours, and I have enabled ALL the features you mentioned. I solved my problem another way: I collected 200+ proxy IPs and put them in an array; once an IP is refused by the server, the crawler switches to another one. It seems to be running well now.

Although I solved my problem that way, I'm still curious why my crawler can run with a proxy IP but not with my local one. If it is because the server refused my IP, why can I still visit the site with my browser? Is there some aspect that I didn't disguise well in my crawler?

Thank you very much! You really helped!
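
The proxy rotation described in this reply could be sketched roughly as below. This is only an illustration, not the poster's actual code: the function names `fetchViaProxy` and `fetchWithRotation`, the example proxy address and the timeout values are all assumptions. The fetch step is passed in as a callable so the rotation logic can be tried without touching the network.

```php
<?php
// Sketch only: rotate through a list of proxies when the server refuses one.
// The function names and the example proxy address are hypothetical.

/**
 * Fetch $url through one proxy with cURL.
 * Returns [httpStatus, body]; status 0 means the transfer itself failed.
 */
function fetchViaProxy(string $url, string $proxy): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_PROXY          => $proxy,   // e.g. "203.0.113.10:8080" (placeholder)
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 30,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)',
    ]);
    $body   = curl_exec($ch);
    $status = ($body === false) ? 0 : (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return [$status, $body === false ? '' : $body];
}

/**
 * Try each proxy in turn, starting at $start, until one returns HTTP 200.
 * $fetch is injectable so the rotation can be exercised with a fake fetcher.
 * Returns [body or null, index of the proxy that worked or -1].
 */
function fetchWithRotation(string $url, array $proxies, int $start = 0, ?callable $fetch = null): array
{
    $fetch = $fetch ?? 'fetchViaProxy';
    $count = count($proxies);
    for ($i = 0; $i < $count; $i++) {
        $index = ($start + $i) % $count;
        [$status, $body] = $fetch($url, $proxies[$index]);
        if ($status === 200) {
            return [$body, $index];
        }
        // 403 or a failed transfer: move on to the next proxy.
    }
    return [null, -1];
}
```

A caller would keep the returned index and pass it as `$start` for the next page, so a proxy that still works keeps being used until it is refused.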

ImN1 2011-04-18

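It's Douban. Doesn't what you're scraping require a login?

Also: #1 can read and write Chinese (I've seen him use it, hopefully not through an online translator); he just doesn't use it often, haha.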

Ali 2011-04-18

If the same code works on one side and not on the other, something in the environment may be different. Let's analyze it a different way:

I'm using the versions below (actually I'm using WAMP):

PHP :
PHP 5.2.5 (cli) (built: Nov 8 2007 23:18:51)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
with the ionCube PHP Loader v3.3.18, Copyright (c) 2002-2010, by ionCube Ltd., and
with Xdebug v2.1.0, Copyright (c) 2002-2010, by Derick Rethans

curl
curl version: 7.16.0

with features:
CURL_VERSION_SSL
CURL_VERSION_LIBZ

//Ali

lornechang 2011-04-18

[Quote=Quoting reply #10 by alinaqvi:]

I just tried the below code on my local machine using the URL which you were using (seen from the logs) and I'm getting the results as intended:

PHP code

$url = 'http://book.douban.com/subject/1……
[/Quote]

After using your code I cannot get any data at all, so I think the problem is now different from the one I first encountered. Here is my whole log (I set a limit so that if it cannot get data it runs only 10 times). Thanks again for your help!

{
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
[The same request and 403 response repeat nine more times, identical to the block above.]
}

Ali 2011-04-18

I just tried the code below on my local machine using the URL you were using (seen in the logs), and I'm getting the results as intended:

$url = 'http://book.douban.com/subject/1044915/';

// Request headers that mimic a browser.
$curl_header = array(
    'Accept: */*',
    'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )',
    'Connection: Keep-Alive');

$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
curl_setopt($c, CURLOPT_HTTPHEADER, $curl_header);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($c, CURLOPT_TIMEOUT, 30);
curl_setopt($c, CURLOPT_HEADER, 0);           // do not include response headers in $res

$res = curl_exec($c);
curl_close($c);

echo "<H1>HERE ARE THE RESULTS</H1>";
echo $res;

I still believe some piece of information is missing from your code, and that is what causes the 403 response on your side.

Hope it helps.

//Ali

lornechang 2011-04-18

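[Quote=Quoting reply #6 by snmr_com:]

It's Douban. Doesn't what you're scraping require a login?

Also: #1 can read and write Chinese (I've seen him use it, hopefully not through an online translator); he just doesn't use it often, haha.
[/Quote]
Why log in? I simply loop over the pages I need. Every item has a fixed ID, like "douban.com/subject/xxxxxxx", and fetching them directly also saves them some resources. I thought Douban's information was well organized, so I stopped scraping e-commerce sites, but it seems Douban is too tough; nothing I try works.

Thanks to our foreign friend!!!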

lornechang 2011-04-18

[Quote=Quoting reply #7 by alinaqvi:]

Yes, it is due to cookie. You can use something like below to handle cookies in Curl

PHP code

....
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($curl, CURLOPT_COOKIEFIL……
[/Quote]
Well, it doesn't fix it. I still get error 403 when I use my crawler, but I can visit the site with my browser. (Does this mean that my IP is still usable but my crawler's disguise is not good enough?) I have to admit that Douban is pretty powerful.

lornechang 2011-04-18

It seems to be caused by cookies, doesn't it? How can I handle that?

lornechang 2011-04-18

[Quote=Quoting reply #3 by lornechang:]

Quoting reply #2 by alinaqvi:

Could be Cookie or Referrer issue. Better use verbose mode and post logs here.

Just a sample on how to add logging :
PHP code

$fp_err = fopen('verbose_log.txt', 'ab+');……
[/Quote]
Above I only pasted the part where the problem occurs; here is the whole log:
{
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044639/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 23684
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="mRyYY+c5VLs"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044639"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:01 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044640/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 25035
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="FVAMqO5XkaQ"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044640"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:02 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044641/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

/* the same '200 OK' pattern as above repeats here */

// here is where the problem starts
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044907/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

lornechang 2011-04-18

[Quote=Quoting reply #2 by alinaqvi:]

Could be Cookie or Referrer issue. Better use verbose mode and post logs here.

Just a sample on how to add logging :
PHP code

$fp_err = fopen('verbose_log.txt', 'ab+');
curl_setopt($ch, CURLOPT……
[/Quote]

Here is my log. Can you read Chinese? I think I'd better translate my question here: I disguise my crawler as a browser, but it still gets the 403 error, while visiting their site with my browser works fine even if I refresh at a considerable frequency. How can I handle this? Thanks a lot!

{
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
}

Ali 2011-04-18

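It could be a cookie or Referer issue. Better to use verbose mode and post the logs here.

Just a sample of how to add logging ($ch is your existing cURL handle):

$fp_err = fopen('verbose_log.txt', 'ab+');      // append verbose output to this file
curl_setopt($ch, CURLOPT_VERBOSE, 1);           // log the request/response exchange
curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP codes >= 400 as errors
curl_setopt($ch, CURLOPT_STDERR, $fp_err);      // send the verbose log to the file

Once done, post the contents of verbose_log.txt here.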

//Ali
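
Once a verbose log like this exists, a small helper can tally the outcomes and show how many requests succeeded before the 403s began. The sketch below is illustrative only: `summarizeVerboseLog` is a made-up name, and it recognizes just the two line shapes that appear in the logs posted in this thread (a `< HTTP/1.1 200 OK` status line, and the `* The requested URL returned error: 403` line that cURL writes when CURLOPT_FAILONERROR is set).

```php
<?php
// Sketch only: tally status codes found in a cURL verbose log.
// summarizeVerboseLog() is a hypothetical helper name; the two line shapes it
// looks for are the ones visible in the logs posted in this thread.

function summarizeVerboseLog(string $log): array
{
    $counts = [];
    foreach (preg_split('/\R/', $log) as $line) {
        if (preg_match('/^< HTTP\/\d(?:\.\d)? (\d{3})/', $line, $m)
            || preg_match('/^\* The requested URL returned error: (\d{3})/', $line, $m)) {
            $code = (int) $m[1];
            $counts[$code] = ($counts[$code] ?? 0) + 1;
        }
    }
    ksort($counts); // lowest status code first
    return $counts;
}
```

For example, `print_r(summarizeVerboseLog(file_get_contents('verbose_log.txt')));` would print one count per status code found in the file.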

lornechang 2011-04-18

I refreshed in the browser 3 times per second for a minute without a single 403, yet my crawler, even with sleep(3) added, gets a 403....
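
One variation on the fixed sleep(3) is to lengthen the pause each time a 403 comes back. This is a different technique from anything tried in this thread, and nothing here shows it would satisfy Douban's checks; the helper name `backoffDelay` and its default values are assumptions made for illustration.

```php
<?php
// Sketch only: grow the pause after each consecutive 403 instead of a fixed sleep(3).
// backoffDelay() and its defaults are hypothetical, not something from this thread.

function backoffDelay(int $consecutive403, float $base = 3.0, float $cap = 60.0): float
{
    // 0 failures -> $base seconds, doubling per consecutive failure, never above $cap.
    $delay = $base * (2 ** max(0, $consecutive403));
    return min($cap, $delay);
}
```

In a crawl loop the counter would be reset to 0 after every successful response, and the pause applied with something like `usleep((int) (backoffDelay($failures) * 1000000));`.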

Ali 2011-04-18

Yes, it is due to cookies. You can use something like the below to handle cookies in cURL:

....
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');   // save received cookies to this file
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');  // load cookies from this file and send them back
....
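
A fuller sketch of the cookie handling above follows. In the verbose log posted earlier in the thread, each response sets a different `bid` cookie and the requests carry no Cookie header, which is consistent with cookies not being sent back. The code reuses one handle with a cookie jar so that cookies from one response go out with the next request. The function names, the cookie path and the 3-second delay are assumptions, and the original poster reported that enabling cookies alone did not stop the 403s, so treat this as one piece of the disguise rather than a fix.

```php
<?php
// Sketch only: one cURL handle with a cookie jar, reused for several pages.
// Function names, the cookie path and the delay are assumptions for illustration.

/** Options that make cURL store cookies and send them back on later requests. */
function sessionOptions(string $cookieFile): array
{
    return [
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is closed or freed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here; also turns on the cookie engine
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)',
        CURLOPT_CONNECTTIMEOUT => 30,
        CURLOPT_TIMEOUT        => 30,
    ];
}

/** Fetch each URL with the same handle so cookies set by one response accompany the next request. */
function fetchAllWithCookies(array $urls, string $cookieFile, int $delaySeconds = 3): array
{
    $ch = curl_init();
    curl_setopt_array($ch, sessionOptions($cookieFile));
    $results = [];
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $body = curl_exec($ch);
        $results[$url] = [
            'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
            'body'   => $body === false ? null : $body,
        ];
        sleep($delaySeconds); // pause between pages
    }
    curl_close($ch);
    return $results;
}
```

Calling `fetchAllWithCookies($urls, '/tmp/cookies.txt')` returns one entry per URL with its HTTP status and body, so the first 403 is easy to spot.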
