A question about following pages with CrawlSpider in Scrapy

qq_25498407 2015-07-31 11:24:27
I've been learning Scrapy for web crawling recently. After skimming the tutorial I practiced on Xiaomi's recruitment site. Crawling just the first page with a plain scrapy.Spider was easy, but after switching to CrawlSpider to crawl all the job listings, it stopped working and I can't figure out why: not only does it not follow links, it no longer scrapes the first page either. At first I suspected the rules were the problem, but testing in the shell I can extract the follow-up links just fine, so I'm puzzled. Hoping someone can point me in the right direction.

The spider code is as follows:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from xiaomihr.items import XiaomihrItem

class XiaomihrSpider(CrawlSpider):
    name = "xiaomihr"
    allowed_domains = ["http://hr.xiaomi.com"]
    start_urls = ["http://hr.xiaomi.com/job/list"]

    rules = [
        # Extract the links to the next page and follow them
        Rule(LinkExtractor(allow=('/job/list/.*',), restrict_xpaths=('//a[@number="numbers last"]')), callback='parse_item')
    ]

    def parse_item(self, response):
        self.logger.info('Now is spidering in this page: %s', response.url)
        base = response.xpath('//div[@class="bd bd-table-list"]//tr')
        for sel in base:
            item = XiaomihrItem()
            item['work'] = sel.xpath('td[1]/a/text()').extract()
            item['worktype'] = sel.xpath('td[2]/text()').extract()
            item['location'] = sel.xpath('td[3]/text()').extract()
            item['detail_link'] = sel.xpath('td[1]/a/@href').extract()
        return item



Part of the output after running it:

2015-07-31 23:16:32 [scrapy] INFO: Scrapy 1.0.1 started (bot: xiaomihr)
2015-07-31 23:16:32 [scrapy] INFO: Optional features available: ssl, http11
2015-07-31 23:16:32 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiaomihr.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['xiaomihr.spiders'], 'FEED_URI': 'items.csv', 'BOT_NAME': 'xiaomihr'}
2015-07-31 23:16:32 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.

2015-07-31 23:16:32 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-31 23:16:32 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-31 23:16:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-31 23:16:32 [scrapy] INFO: Enabled item pipelines:
2015-07-31 23:16:32 [scrapy] INFO: Spider opened
2015-07-31 23:16:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-31 23:16:32 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-31 23:16:32 [scrapy] DEBUG: Crawled (200) <GET http://hr.xiaomi.com/job/list> (referer: None)
2015-07-31 23:16:32 [scrapy] INFO: Closing spider (finished)
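One thing worth checking (an observation, not stated in the thread): `allowed_domains` is expected to contain bare domain names, but the spider above includes the scheme (`"http://hr.xiaomi.com"`). Scrapy's OffsiteMiddleware compares the request's hostname against the allowed domains, so a value with `http://` in it never matches and every extracted link is silently filtered — consistent with the log above, where the spider closes right after the first request. A minimal stdlib sketch of that hostname check (a simplified illustration, not Scrapy's actual code):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Simplified version of the offsite check: the request's hostname must
    # equal an allowed domain or be a subdomain of it.
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# With the scheme included, the hostname never matches, so the link is filtered:
print(is_offsite("http://hr.xiaomi.com/job/list/2", ["http://hr.xiaomi.com"]))  # True -> dropped
# With a bare domain, follow-up requests pass:
print(is_offsite("http://hr.xiaomi.com/job/list/2", ["hr.xiaomi.com"]))         # False -> allowed
```

If this is the cause, changing the line to `allowed_domains = ["hr.xiaomi.com"]` should let the follow-up requests through.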
2 replies
yjpSunshine 2016-02-26
follow is a boolean value that specifies whether the links extracted from the response by this rule should themselves be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
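Building on this reply: since the rule above does set a callback, `follow` defaults to False, so even when pagination links are extracted they are crawled once and not followed further. A sketch of the rule with `follow=True` set explicitly (an untested fragment, assuming Scrapy 1.0+ import paths; the `restrict_xpaths` value from the original spider is kept as-is):

```python
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = [
    # With a callback present, follow defaults to False; setting
    # follow=True makes CrawlSpider keep following pagination links.
    Rule(LinkExtractor(allow=('/job/list/.*',),
                       restrict_xpaths=('//a[@number="numbers last"]')),
         callback='parse_item',
         follow=True),
]
```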
hujizhangkun 2015-11-12
Hi, did you ever solve this problem in the end?
