Crawler gets a 403, but the page loads fine in a browser. How do I disguise the crawler?

lornechang 2011-04-18 01:55:45
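The page is reachable in a browser, which shows that my IP has not been banned. The crawler uses cURL with a header that imitates a browser, so why does it stop working after running for only a short while? Besides setting a browser User-Agent, how else can I disguise it?
$curl = curl_init();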
curl_setopt($curl,CURLOPT_URL,$url);
curl_setopt($curl,CURLOPT_USERAGENT,"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )");
curl_setopt($curl,CURLOPT_HEADER,1);
curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);
$a=curl_exec($curl);
happypiggy2010 2011-04-19
Are you sending requests too frequently?
木目子 2011-04-19
Visit the page in a browser and capture the traffic. The site probably requires a cookie, and your crawler is not sending any cookie information!
lornechang 2011-04-19
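[Quote=Quoting happypiggy2010's reply #14:]
Are you sending requests too frequently?
[/Quote]
The point is that the crawler gets a 403 while the browser can still load the page. The crawler even has a sleep(3), whereas I can refresh the browser 3 times a second for a full minute without any problem.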
Ali 2011-04-19
Glad you were able to solve your issue by using a different technique.

Are you sending multiple requests to the same server concurrently? That could be why the server blocks the IP: many concurrent requests from one address do not look like human behavior (a person using a browser would not hit the same server with that many simultaneous requests).

Just a thought :)

//Ali
lornechang 2011-04-18
[Quote=Quoting alinaqvi's reply #12:]

if same code working on one side and not working on other side then could be something different in environment. Let's analyze in different way:

I'm using below versions: (Actually I'm using WAMP……
[/Quote]
My versions are higher than yours, and I have enabled all the features you mentioned. I solved the problem another way: I collected more than 200 proxy IPs and put them in an array; whenever the server refuses one IP, the crawler switches to another. It seems to be running well now.

Although that solved my problem, I'm still curious why my crawler works through a proxy IP but not with my local one. If the server has refused my IP, why can I still visit the site with my browser? Is there some aspect of my crawler that I didn't disguise well?

Thank you very much! You really did help!
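
The rotation described above (keep an array of proxies and switch whenever the server refuses one) could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the names `curlFetchViaProxy` and `fetchWithProxyRotation` are made up for this example, the proxy address in the comment is a placeholder, and treating a failed request (status 0) the same as a 403 is an assumption.

```php
<?php
// Hypothetical helper: fetch $url through one proxy and return [status, body].
function curlFetchViaProxy(string $url, string $proxy): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_PROXY          => $proxy, // e.g. "203.0.113.10:8080" (placeholder)
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )',
    ]);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response arrived
    return [$status, $body === false ? '' : $body];
}

// Try each proxy in turn; move on when the server answers 403 or the request fails.
// $fetch is injectable so the rotation logic can be exercised without a network.
function fetchWithProxyRotation(string $url, array $proxies, callable $fetch): ?array
{
    foreach ($proxies as $proxy) {
        [$status, $body] = $fetch($url, $proxy);
        if ($status !== 403 && $status !== 0) {
            return ['proxy' => $proxy, 'status' => $status, 'body' => $body];
        }
    }
    return null; // every proxy in the list was refused
}
```

A real run would pass the cURL helper as the callback, for example `fetchWithProxyRotation($url, $proxies, 'curlFetchViaProxy')`. Note that this only works around the block; it does not explain why the local IP is refused.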
ImN1 2011-04-18
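It's Douban. Doesn't the content you're scraping require a login?


Also: #1 can read and write Chinese (I've seen him do it, and I hope it wasn't online translation); he just doesn't use it often, heh.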
Ali 2011-04-18
If the same code works on one side and not on the other, something may differ in the environment. Let's analyze it a different way:

I'm using the versions below (actually, I'm using WAMP):

PHP :
PHP 5.2.5 (cli) (built: Nov 8 2007 23:18:51)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
with the ionCube PHP Loader v3.3.18, Copyright (c) 2002-2010, by ionCube Ltd., and
with Xdebug v2.1.0, Copyright (c) 2002-2010, by Derick Rethans

curl
curl version: 7.16.0

with features:
CURL_VERSION_SSL
CURL_VERSION_LIBZ


//Ali
lornechang 2011-04-18
[Quote=Quoting alinaqvi's reply #10:]

I just tried the below code on my local machine using the URL which you were using (seen from the logs) and I'm getting the results as intended:

PHP code

$url = 'http://book.douban.com/subject/1……
[/Quote]

After using your code, I cannot get any data at all. I think the problem is now different from the one I first encountered. Here is my whole log (I set a limit so that the crawler stops after 10 attempts if it cannot get data). Thanks again for your help!

{
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive

* The requested URL returned error: 403
* Closing connection #0
}
Ali 2011-04-18
I just tried the code below on my local machine, using the URL you were using (seen in the logs), and I'm getting the results as intended:


$url = 'http://book.douban.com/subject/1044915/';
$c = curl_init();
$curl_header = array(
'Accept: */*',
'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )',
'Connection: Keep-Alive');
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_HTTPHEADER, $curl_header);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($c, CURLOPT_TIMEOUT, 30);
curl_setopt($c, CURLOPT_HEADER, 0);

$res = curl_exec($c);

echo "<H1>HERE ARE THE RESULTS</H1>";
echo $res;



I still believe something is missing from your code, and that is what's causing the 403 response on your side.

Hope it helps.

//Ali
lornechang 2011-04-18
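[Quote=Quoting snmr_com's reply #6:]

It's Douban. Doesn't the content you're scraping require a login?


Also: #1 can read and write Chinese (I've seen him do it, and I hope it wasn't online translation); he just doesn't use it often, heh.
[/Quote]
Why would I log in? I just loop over the pages I need. Every item has a fixed ID, like "douban.com/subject/xxxxxxx", so doesn't that save them some resources too? I thought Douban's information was well organized, so I gave up on scraping e-commerce sites. It looks like Douban is just too tough; nothing I try solves it.

Thanks to our foreign friend!!!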
lornechang 2011-04-18
[Quote=Quoting alinaqvi's reply #7:]

Yes, it is due to cookie. You can use something like below to handle cookies in Curl

PHP code

....
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($curl, CURLOPT_COOKIEFIL……
[/Quote]
Well, that doesn't fix it. I still get a 403 when I use my crawler, but I can visit the site with my browser, which shows the IP itself is fine. (Does this mean my IP can still be used and my crawler just isn't disguised well enough?) I have to admit that Douban is pretty powerful.
lornechang 2011-04-18
It seems to be caused by cookies, doesn't it? How can I handle that?
lornechang 2011-04-18
[Quote=Quoting lornechang's reply #3:]

Quoting alinaqvi's reply #2:

Could be Cookie or Referrer issue. Better use verbose mode and post logs here.

Just a sample on how to add logging :
PHP code

$fp_err = fopen('verbose_log.txt', 'ab+');……
[/Quote]
Above I posted only the part where the problem occurs; here is the whole log:
{
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044639/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 23684
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="mRyYY+c5VLs"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044639"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:01 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044640/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 25035
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="FVAMqO5XkaQ"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044640"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:02 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044641/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

/* the responses here follow the same '200 OK' pattern as above */

// here is where the problem starts
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044907/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
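
One detail in the log above may be worth checking: the two 200 responses set different `bid` cookies, and none of the requests carries a `Cookie:` header, which suggests the crawler is not sending cookies back between requests. Whether that is what triggers the 403 is not established by the log. A small sketch for pulling these two facts out of a verbose log follows; the helper names `extractBidValues` and `sendsCookieHeader` are hypothetical.

```php
<?php
// Collect the value of every 'Set-Cookie: bid="..."' response line in a cURL verbose log.
function extractBidValues(string $log): array
{
    preg_match_all('/^< Set-Cookie: bid="([^"]*)"/m', $log, $matches);
    return $matches[1];
}

// True if any line of the log starts with a Cookie request header.
// Response lines start with "< Set-Cookie:", so they do not match.
function sendsCookieHeader(string $log): bool
{
    return preg_match('/^Cookie:/mi', $log) === 1;
}
```

If `extractBidValues` returns a new value for every request while `sendsCookieHeader` is false, the server is seeing each request as a brand-new visitor.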
lornechang 2011-04-18
[Quote=Quoting alinaqvi's reply #2:]

Could be Cookie or Referrer issue. Better use verbose mode and post logs here.

Just a sample on how to add logging :
PHP code

$fp_err = fopen('verbose_log.txt', 'ab+');
curl_setopt($ch, CURLOPT……
[/Quote]

Here is my log. Can you read Chinese? I'd better translate my question here: I disguised my crawler as a browser, but it still gets a 403 error, while visiting the site in my browser works fine even when I refresh quite frequently. How can I handle this? Thanks a lot!

{
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
}
Ali 2011-04-18
It could be a cookie or Referer issue. It's best to enable verbose mode and post the logs here.

Here is a sample of how to add logging:

$fp_err = fopen('verbose_log.txt', 'ab+');
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_STDERR, $fp_err);


Once done, post the contents of verbose_log.txt here.

//Ali
lornechang 2011-04-18
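I refreshed the page in my browser 3 times a second for a minute without a single 403, yet my crawler, even with sleep(3) added, still gets a 403....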
Ali 2011-04-18
Yes, it is due to cookies. You can use something like the code below to handle cookies in cURL:


....
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
....
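
For context, the two cookie options above might sit in a complete request like the following sketch. It assumes the cURL extension is loaded and that `/tmp/cookies.txt` is writable; the function names `buildCrawlerOptions` and `fetchWithCookies` are invented for this example and are not from the thread.

```php
<?php
// Build the option set in one place so every request shares the same cookie file.
function buildCrawlerOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // received cookies are saved here
        CURLOPT_COOKIEFILE     => $cookieFile, // saved cookies are loaded and sent back
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )',
    ];
}

// Fetch one page and return [status, body]; body is false if the transfer failed.
function fetchWithCookies(string $url, string $cookieFile): array
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCrawlerOptions($url, $cookieFile));
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    unset($ch); // releasing the handle is what writes the cookie jar to disk
    return [$status, $body];
}
```

Calling `fetchWithCookies` repeatedly with the same file means the cookies set by one response are sent with the next request, which is closer to what a browser does.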