Python crawler keeps re-scraping the content of the first page

Rainit 2018-12-15 05:16:23
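I've recently been scraping information from Amazon and want to loop through the pages to collect the data. The URLs I construct open the correct pages, but when I fetch them with requests I always get the HTML of the first page. Does anyone know what is going on and how I should fix it? Many thanks!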
import re
import requests

Maxpage=9
headers={"User-Agent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36"}

question=[]
ans_url=[]

for p in range(Maxpage):
    # the page number (1-based) is part of the URL path
    url="https://www.amazon.com/ask/questions/asin/B07HCTCMT7/"+str(p+1)+"/ref=ask_ql_psf_ql_hza?isAnswered=true"

    #requests
    response=requests.get(url,headers=headers)
    html=response.text
    qq=re.findall(r'<p .*?a-text-bold">(.*?)</p>',html,re.S)
    if qq:
        # re.findall returns a list, so keep only the first captured string
        first=re.findall(r'.*?Q:\n[\s]*(.*?) \n[\s]*',qq[0])
        if first:
            qq[0]=first[0]
    for i in range(len(qq)-1):
        qq[i+1]=re.sub(r'(Q:).*?','',qq[i+1]).strip()
    question=question+qq
    ansurl=re.findall(r'<a href="(.*?)" class=".*?link-noborder',html,re.S)
    ans_url=ans_url+ansurl
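Before changing the parsing, it may help to confirm that the server really returns the same content for every page. The sketch below is only one possible diagnostic, not something posted in this thread: `build_url` repeats the URL pattern from the question, while `pages_repeating_first` and `summarize_response` are hypothetical helper names. It compares the extracted items per page rather than the raw HTML, on the assumption that raw bodies may differ in incidental ways even when the visible content is the same.

```python
BASE = "https://www.amazon.com/ask/questions/asin/B07HCTCMT7/"
SUFFIX = "/ref=ask_ql_psf_ql_hza?isAnswered=true"


def build_url(page):
    """Build the URL for a 1-based page number, same pattern as the question."""
    return BASE + str(page) + SUFFIX


def pages_repeating_first(items_by_page):
    """Given {page_number: [extracted items]}, return the page numbers
    whose extracted items equal those of the lowest-numbered page."""
    if not items_by_page:
        return []
    first = min(items_by_page)
    reference = items_by_page[first]
    return [
        page
        for page in sorted(items_by_page)
        if page != first and items_by_page[page] == reference
    ]


def summarize_response(page, status_code, final_url, body):
    """One-line summary of a response (hypothetical helper): a changed
    final URL or an unexpected status code hints at a redirect or block."""
    return "page=%d status=%d length=%d final_url=%s" % (
        page, status_code, len(body), final_url)
```

With requests, the summary could be printed inside the loop as `print(summarize_response(p+1, response.status_code, response.url, html))`. If `pages_repeating_first` reports every page, the problem lies in what the server sends back, not in the regular expressions.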
weixin_44131290 2018-12-17
Thanks for sharing, everyone. This helped a lot.
qq_32650669 2018-12-16
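I modified the original poster's code slightly and ran it on both Windows and Ubuntu; it does iterate over the content of all the pages. (I didn't tidy up the data, so the output isn't pretty, haha.)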
import requests
import re

headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

question=[]
ans_url=[]

for p in range(9):
    url="https://www.amazon.com/ask/questions/asin/B07HCTCMT7/"+str(p+1)+"/ref=ask_ql_psf_ql_hza?isAnswered=true"
    #requests
    response=requests.get(url,headers=headers)
    html=response.text
    qq=re.findall('{"metricName":"top-question-text-click"}">(.*?)</span>',html,re.S)
    print('Page',p+1)
    print(qq)
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
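Since the output above is left untidied, here is a rough sketch of one way to clean it up. It reuses the same marker as the regular expression in this reply; the function names `extract_questions` and `merge_pages` are made up for illustration, and the whitespace and HTML-entity handling is an assumption about what the raw matches look like.

```python
import html as html_lib
import re

# Same marker as in the reply above, with the braces escaped explicitly.
QUESTION_RE = re.compile(
    r'\{"metricName":"top-question-text-click"\}">(.*?)</span>', re.S)


def extract_questions(page_html):
    """Return the question texts of one page with entities decoded
    and runs of whitespace collapsed to single spaces."""
    cleaned = []
    for raw in QUESTION_RE.findall(page_html):
        text = " ".join(html_lib.unescape(raw).split())
        if text:
            cleaned.append(text)
    return cleaned


def merge_pages(pages):
    """Merge several pages (HTML strings, in page order) into one list
    of unique questions, keeping first-seen order."""
    seen = set()
    merged = []
    for page_html in pages:
        for question in extract_questions(page_html):
            if question not in seen:
                seen.add(question)
                merged.append(question)
    return merged
```

Inside the loop, `extract_questions(html)` would take the place of the bare `re.findall` call. Note that de-duplicating in `merge_pages` would also hide the symptom from the original question, so it is probably best applied only once the pages are known to differ.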
小布 2018-12-16
  <body class="askPage askDesktopAllAnswersPage a-m-us a-aui_149818-c a-aui_152852-c a-aui_157141-c a-aui_158613-c a-aui_160684-c a-aui_57326-c a-aui_72554-c a-aui_accessibility_49860-c a-aui_attr_validations_1_51371-c a-aui_bolt_62845-c a-aui_perf_130093-c a-aui_tnr_v2_180836-c a-aui_ux_113788-c a-aui_ux_114039-c a-aui_ux_138741-c a-aui_ux_145937-c a-aui_ux_60000-c"><div id="a-page"><script type="a-state" data-a-state="{"key":"a-wlab-states"}">{"AUI_160684":"C","AUI_157141":"C","AUI_TNR_V2_180836":"C","AUI_UX_114039":"C","AUI_UX_113788":"C","AUI_UX_145937":"C","AUI_57326":"C","AUI_158613":"C","AUI_149818":"C","AUI_72554":"C","AUI_BOLT_62845":"C","AUI_UX_60000":"C","AUI_ATTR_VALIDATIONS_1_51371":"C","AUI_ACCESSIBILITY_49860":"C","AUI_152852":"C","AUI_PERF_130093":"C","AUI_UX_138741":"C"}</script>


<script type='text/javascript'>

(function(){function l(a){for(var c=b.location.search.substring(1).split("&"),e=0;e<c.length;e++){var d=c[e].split("=");if(d[0]===a)return d[1]}}window.amzn=window.amzn||{};amzn.copilot=amzn.copilot||{};var b=window,f=document,g=b.P||b.AmazonUIPageJS,h=f.head||f.getElementsByTagName("head")[0],m=0,n=0;amzn.copilot.checkCoPilotSession=function(){f.cookie.match("cpidv")&&("undefined"!==typeof jQuery&&k(jQuery),g&&g.when&&g.when("jQuery").execute(function(a){k(a)}),b.amznJQ&&b.amznJQ.available&&b.amznJQ.available("jQuery",
function(){k(jQuery)}),b.jQuery||g||b.amznJQ||q())};var q=function(){m?b.ue&&"function"===typeof b.ue.count&&b.ue.count("cpJQUnavailable",1):(m=1,f.addEventListener?f.addEventListener("DOMContentLoaded",amzn.copilot.checkCoPilotSession,!1):f.attachEvent&&f.attachEvent("onreadystatechange",function(){"complete"===f.readyState&&amzn.copilot.checkCoPilotSession()}))},k=function(a){if(!n){n=1;amzn.copilot.jQuery=a;a=l("debugJS");var c="https:"===b.location.protocol?1:0,e=1;url="/gp/copilot/handlers/copilot_strings_resources.html";
window.texas&&texas.locations&&(url=texas.locations.makeUrl(url));g&&g.AUI_BUILD_DATE&&(e=0);amzn.copilot.jQuery.ajax&&amzn.copilot.jQuery.ajax({url:url,dataType:"json",data:{isDebug:a,isSecure:c,includeAUIP:e},success:function(a){amzn.copilot.vip=a.serviceEndPoint;amzn.copilot.enableMultipleTabSession=a.isFollowMe;r(a)},error:function(){b.ue.count("cpLoadResourceError",1)}})}},r=function(a){var c=amzn.copilot.jQuery,e=function(){amzn.copilot.setup(c.extend({isContinuedSession:!0},a))};a.CSSUrls&&
c.each(a.CSSUrls[0],function(a,c){var b=f.createElement("link");b.type="text/css";b.rel="stylesheet";b.href=c;h.appendChild(b)});a.CSSTag&&s(a.CSSTag);if(a.JSUrls){var d=l("forceSynchronousJS"),b=a.JSUrls[0];c.each(b,function(a,c){a===b.length-1?p(c,d,e):p(c,d)})}a.JSTag&&(t(a.JSTag),P.when("CSCoPilotPresenterAsset").execute(function(){e()}))},t=function(a){var c=f.createElement("div");c.innerHTML=a;a=0;for(var b=c.children.length;a<b;a++){var d=f.createElement("script");d.type="text/javascript";
d.innerHTML=c.children[a].innerHTML;h.appendChild(d)}},s=function(a){var b=f.createElement("div");b.innerHTML=a;a=0;for(var e=b.children.length;a<e;a++)h.appendChild(b.children[a])},p=function(a,b,e){var d=f.createElement("script");d.type="text/javascript";d.src=a;d.async=b?!1:!0;e&&(d.onload=e);h.appendChild(d)}})();

amzn.copilot.checkCoPilotSession();

</script>

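Look closely at the returned code: these questions and answers are evidently loaded dynamically by JavaScript, so you will have to parse that JavaScript yourself.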
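As a starting point for "parsing it yourself", the HTML pasted above contains a `<script type="a-state">` block whose body is plain JSON. The sketch below pulls such blocks out with `re` and `json`. It is only an illustration: `extract_state_blocks` is a hypothetical name, and nothing in this thread confirms that the question-and-answer data itself is stored in blocks of this kind (the one shown above holds only flag values).

```python
import json
import re

# Matches <script type="a-state" ...>BODY</script> and captures BODY.
STATE_RE = re.compile(r'<script type="a-state"[^>]*>(.*?)</script>', re.S)


def extract_state_blocks(page_html):
    """Return the JSON payloads of all a-state script blocks that
    parse successfully; bodies that are not valid JSON are skipped."""
    blocks = []
    for body in STATE_RE.findall(page_html):
        try:
            blocks.append(json.loads(body))
        except json.JSONDecodeError:
            continue
    return blocks
```

Calling `extract_state_blocks(response.text)` on each page and printing the keys of every block is a quick way to see whether any of them carry the data you need. If none do, the content is probably fetched by a separate request that would have to be identified in the browser's network panel.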
