What should I do when the URL changes while batch-scraping dictionary data with Python?

NA_QUEEN 2017-08-23 02:10:50

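I'm a Python beginner. For work, I need to batch-scrape the entries for a list of words from an online dictionary, so I wrote a simple crawler in Python. The difficulty is that the URL generated after looking up each word is different, and I can't work out the pattern. How should I handle this?
A simple example:
Home page:
http://www.zdic.net/
After looking up '怰', the URL becomes:
http://www.zdic.net/z/19/js/6030.htm
After looking up '颴', the URL becomes:
http://www.zdic.net/z/28/js/98B4.htm

I'm stumped. I have no idea where to look to find out how this URL is generated.
Could someone experienced take a look?
Even a pointer on the general approach would help.
Thanks in advance.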
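
One part of the pattern can be read off the two example URLs: the file name (6030, 98B4) matches the hexadecimal Unicode code point of the character that was looked up. The sketch below only checks that observation. The helper name `code_point_hex` is made up for illustration, and the directory segment (19, 28) is not explained by it, so the full URL still cannot be rebuilt from the character alone.

```python
def code_point_hex(char):
    """Return the Unicode code point of a single character as uppercase hex."""
    return format(ord(char), "X")


# Compare with the file names in the two example URLs above.
for ch in ("怰", "颴"):
    print(ch, code_point_hex(ch))
```

Because the directory part stays unknown, going through the site's search form is still the more dependable route.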

10 replies

LinDRon 2017-08-23


Use a packet-capture tool to capture the POST request, see which fields it sends, and then send the same data yourself.

混沌鳄鱼 2017-08-23


<!doctype html>
<html lang="en">
 <head>
  <meta charset="UTF-8">
  <meta name="Generator" content="EditPlus®">
  <meta name="Author" content="">
  <meta name="Keywords" content="">
  <meta name="Description" content="">
  <title>Document</title>
 </head>
 <body>
  <div class="secpan">
<div class="sec_m">
<span id="tp1">条目</span><div>|</div>
<span id="tp2">字典</span><div>|</div>
<span id="tp3">词典</span><div>|</div>
<span id="tp4">成语</span><div>|</div>
<span id="tp5">全站</span>
</div>
  <form method="post" action="/sousuo/" id="f1" name="f1">
  <div class="tp_c" id="tp_ts1">请直接输入汉字或词语进行查询，支持拼音查询，例：“han”;“han4”;“han yu”;“han4 yu3”。</div>
  <div class="tp_c" id="tp_ts2">
    <label><input name="lb_a" type="radio" value="hp" checked>汉字或拼音</label>
    <label><input name="lb_a" type="radio" value="bis">笔顺</label>
    <label><input name="lb_a" type="radio" value="wb86">五笔编码</label>
    <label><input name="lb_a" type="radio" value="cj">仓颉编码</label>
    <label><input name="lb_a" type="radio" value="fc">四角号码</label>
    <label><input name="lb_a" type="radio" value="uno">unicode</label>
  </div>
  <div class="tp_c" id="tp_ts3">
    <label><input name="lb_b" type="radio" value="mh" checked>模糊搜索</label>
    <label><input name="lb_b" type="radio" value="jq">精确搜索</label>
  </div>
  <div class="tp_c" id="tp_ts4">
    <label><input name="lb_c" type="radio" value="mh" checked>模糊搜索</label>
    <label><input name="lb_c" type="radio" value="jq">精确搜索</label>  
  </div>
    <input name="tp" id="tp" type="hidden" value="tp1">
    <div class="secpan_qb">
    <DIV class="query">
    <input class="q" id="q" name="q" type="text" value=""  maxlength="30">
    </DIV>
    <BUTTON class="btn" type="submit"></BUTTON>
    </div>
  </form>
  <div class="tp_c" id="tp_ts5"></div>
<form name="f2" id="cse-search-box" onsubmit="return g(this)" style="display:none" target="_blank">
<input name=ie type=hidden value=utf-8>
    <div class="secpan_qb">
    <DIV class="query">      
      <input class="q" type="text" name="word" value=""/>
    </div>
    <BUTTON class="btn" name="sa" value="搜索" type="submit"></BUTTON>
	</div>
<input name=tn type=hidden value="bds">
<input name=cl type=hidden value="3">
<input name=ct type=hidden>
<div class="chide">
<input name=si type=hidden value="www.zdic.net">
<input name=s type=radio> 互联网
<input name=s type=radio checked> www.zdic.net
</div>
</form>
<div class="tp_c" id="tp_tx1"><a href="/z/zxjs/">【汉字拆分】</a> | <a href="/z/kxzd/">【康熙字典】</a> | <a href="/z/swjz/">【說文解字】</a> | <a href="/z/jbs/">【字典部首】</a> | <a href="/c/cibs/">【词典部首】</a></div>
    <div class="tp_c" id="tp_tx2"><a href="/z/zxjs/">【汉字拆分】</a> | <a href="/z/kxzd/">【康熙字典】</a> | <a href="/z/swjz/">【說文解字】</a> | <a href="/z/jbs/">【部首索引】</a> | <a href="/z/pyjs/">【拼音索引】</a></div>
    <div class="tp_c" id="tp_tx3"><a href="/c/cibs/">【词典部首】</a> | <a href="/c/cipy/">【词典拼音】</a></div>
    <div class="tp_c" id="tp_tx4"><a href="/c/cybs/">【成语部首】</a> | <a href="/c/cypy/">【成语拼音】</a></div>
    <div class="tp_c" id="tp_tx5"></div>
  <script type="text/javascript" src="http://img.zdic.net/zdicpic/js/cse.js"></script>
</div>  
 </body>
</html>
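
The form above can be reproduced from Python roughly as follows. This is a minimal sketch using only the standard library. The field names and default values are taken from the form (`lb_a`, `lb_b`, `lb_c`, `tp`, `q`), while the helper names, the `User-Agent` header, and the assumption that the site answers the POST with the entry page or a redirect to it are mine.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# action="/sousuo/" from the form, resolved against the site root.
SEARCH_URL = "http://www.zdic.net/sousuo/"


def build_form_data(word):
    """Return the fields the form submits when its default options are left alone."""
    return {
        "lb_a": "hp",   # radio checked by default (hanzi or pinyin)
        "lb_b": "mh",   # radio checked by default (fuzzy search)
        "lb_c": "mh",   # radio checked by default (fuzzy search)
        "tp": "tp1",    # hidden field selecting the first tab
        "q": word,      # the text box
    }


def build_search_request(word):
    """Build a POST request; passing data= makes urllib use the POST method."""
    body = urlencode(build_form_data(word)).encode("utf-8")
    return Request(SEARCH_URL, data=body, headers={"User-Agent": "Mozilla/5.0"})


def fetch_entry(word, timeout=10):
    """Send the request and return (final URL, raw page bytes).

    urlopen follows redirects, so the first element is the address
    the search finally lands on, if the site redirects at all.
    """
    with urlopen(build_search_request(word), timeout=timeout) as resp:
        return resp.geturl(), resp.read()
```

Calling `fetch_entry("怰")` for each word in a list would give the batch behaviour asked about, ideally with a pause between requests. The block itself sends nothing until `fetch_entry` is called.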

NA_QUEEN 2017-08-23


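Quoting chuifengde's reply (#7):

From a packet capture, you can use this address to look up the keyword "飞翔". Change the keyword at the end and you will reach the page for that character or word: http://www.zdic.net/sousuo?cnzz_eid=864159807-1503468926-&ntime=1503479726lb_a=hp&lb_b=mh&lb_c=mh&tp=tp1&q=飞翔

Thanks a lot! Your sousuo hint pointed me in the right direction, and I finally found the form.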

chuifengde 2017-08-23


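From a packet capture, you can use this address to look up the keyword "飞翔". Change the keyword at the end and you will reach the page for that character or word: http://www.zdic.net/sousuo?cnzz_eid=864159807-1503468926-&ntime=1503479726lb_a=hp&lb_b=mh&lb_c=mh&tp=tp1&q=飞翔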
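
Building that address for a whole word list might look like the sketch below. It keeps the `lb_a`, `lb_b`, `lb_c`, `tp`, and `q` parameters from the URL above and drops `cnzz_eid` and `ntime`, on the untested assumption that those two are analytics values the search does not need. The function names are illustrative only.

```python
from urllib.parse import urlencode

BASE = "http://www.zdic.net/sousuo"


def search_url(word):
    """Return the search URL for one word, with the keyword percent-encoded as UTF-8."""
    params = {"lb_a": "hp", "lb_b": "mh", "lb_c": "mh", "tp": "tp1", "q": word}
    return BASE + "?" + urlencode(params)


def search_urls(words):
    """Return one search URL per word, in the same order."""
    return [search_url(w) for w in words]


for url in search_urls(["飞翔", "怰", "颴"]):
    print(url)
```

Each URL can then be fetched with whatever HTTP client the crawler already uses. If the site turns out to require the two dropped parameters, they can be added back to `params`.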

NA_QUEEN 2017-08-23