Python crawler returns "Access denied"

gbl959001 2014-11-13 10:18:32
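Crawler:
import urllib
import re

def getHtml(url):
    # Python 2: urllib.urlopen sends the default "Python-urllib" User-Agent
    page = urllib.urlopen(url)
    html = page.read()
    return html

html = getHtml("http://www.xxxx.com/")
# Save the response body so it can be inspected
f = file('html.txt','w')
f.write(html)
f.close()
Returned page: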
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Access denied | www.javlibrary.com used CloudFlare to restrict access</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>
<!--[if lt IE 9]><script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script><![endif]-->
<!--[if gte IE 9]><!--><script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/zepto/1.0/zepto.min.js"></script><!--<![endif]-->
<script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script>

</head>
<body>
<div id="cf-wrapper">
<div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
<div id="cf-error-details" class="cf-error-details-wrapper">
<div class="cf-wrapper cf-header cf-error-overview">
<h1>
<span class="cf-error-type" data-translate="error">Error</span>
<span class="cf-error-code">1010</span>
<small class="heading-ray-id">Ray ID: 18877297d76c022d</small>
</h1>
<h2 class="cf-subheadline" data-translate="error_desc">Access denied</h2>
</div><!-- /.header -->

<section></section><!-- spacer -->

<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="what_happened">What happened?</h2>
<p>The owner of this website (www.javlibrary.com) has banned your access based on your browser's signature (18877297d76c022d-ua48).</p>
</div>


</div>
</div><!-- /.section -->

<div class="cf-error-footer cf-wrapper">
<p>
<span class="cf-footer-item">CloudFlare Ray ID: <strong>18877297d76c022d</strong></span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span data-translate="your_ip">Your IP</span>: 101.231.129.82</span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span data-translate="performance_security_by">Performance & security by</span> <a data-orig-proto="https" data-orig-ref="www.cloudflare.com/5xx-error-landing" id="cloudflare_link" target="_blank">CloudFlare</a></span>

</p>
</div><!-- /.error-footer -->


</div><!-- /#cf-error-details -->
</div><!-- /#cf-wrapper -->

<script type="text/javascript">
window._cf_translation = {};


</script>

</body>
</html>



The cause seems to be that the browser is not supported. But how should I handle this in the crawler?
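The error page says access was banned based on the "browser's signature", which most likely refers to the User-Agent header. As a minimal sketch (assuming Python 3 and `urllib.request` rather than the Python 2 `urllib` used in the question; the helper name `default_user_agent` is hypothetical), the following shows what the standard library sends when no User-Agent is set:

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that urllib.request sends when none is set."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request.
    headers = {name.lower(): value for name, value in opener.addheaders}
    return headers.get("user-agent")


# Prints something like Python-urllib/3.x, depending on the interpreter.
print(default_user_agent())
```

A value that starts with `Python-urllib/` is easy for a site to recognize as a script rather than a browser, which would be consistent with the block shown above.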
adrianlynn 2014-11-13
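Consider disguising the crawler as a browser by sending a browser User-Agent header (this is Python 3; in Python 2 the equivalent call is urllib2.build_opener()):
import urllib.request

class UrlRequest:
    @staticmethod
    def Read(url):
        opener = urllib.request.build_opener()
        # Replace the default "Python-urllib" User-Agent with a browser one
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1')]
        html = opener.open(url).read()
        return html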
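The same idea can also be sketched with `urllib.request.Request`, which attaches the header to a single request instead of an opener. This assumes Python 3; the names `BROWSER_UA`, `build_request`, and `fetch_html` are hypothetical, and whether a browser User-Agent alone is enough depends on the site's rules.

```python
import urllib.error
import urllib.request

# Assumption: any current desktop browser User-Agent would do;
# this one is copied from the reply above.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 "
              "(KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1")


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Fetch url and return the body as bytes, including error pages."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
            return page.read()
    except urllib.error.HTTPError as err:
        # urllib.request raises on 4xx/5xx responses; the error object still
        # holds the body, so a block page can be saved and inspected.
        return err.read()
```

Calling `fetch_html("http://www.xxxx.com/")` and writing the result to a file opened in binary mode (`open('html.txt', 'wb')`) would mirror the original script. Unlike Python 2's `urllib.urlopen`, `urllib.request.urlopen` raises `HTTPError` for error statuses, which is why the sketch catches it and returns the body.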
