I've been learning Scrapy for fun lately. After skimming the tutorials I practiced on Xiaomi's recruitment site. Scraping just the first page with a plain scrapy.Spider was straightforward, but when I switched to CrawlSpider to crawl all the job listings, it somehow stopped working entirely: not only does it not follow links, it no longer scrapes the first page either. At first I suspected something was wrong with the rules, but when I tested them in scrapy shell I could extract the follow-up links just fine, so I'm confused. Hoping someone more experienced can point me in the right direction.
Here is the spider code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from xiaomihr.items import XiaomihrItem

class XiaomihrSpider(CrawlSpider):
    name = "xiaomihr"
    allowed_domains = ["http://hr.xiaomi.com"]
    start_urls = ["http://hr.xiaomi.com/job/list"]
    rules = [
        # extract the next-page links and follow them
        Rule(LinkExtractor(allow=('/job/list/.*',),
                           restrict_xpaths=('//a[@number="numbers last"]')),
             callback='parse_item')
    ]

    def parse_item(self, response):
        self.logger.info('Now is spidering in this page: %s', response.url)
        base = response.xpath('//div[@class="bd bd-table-list"]//tr')
        for sel in base:
            item = XiaomihrItem()
            item['work'] = sel.xpath('td[1]/a/text()').extract()
            item['worktype'] = sel.xpath('td[2]/text()').extract()
            item['location'] = sel.xpath('td[3]/text()').extract()
            item['detail_link'] = sel.xpath('td[1]/a/@href').extract()
        return item
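For reference, this is roughly how I sanity-checked the allow pattern on its own (a minimal sketch using Python's re module rather than Scrapy; the sample next-page URL is made up for illustration, since LinkExtractor's allow patterns are, as far as I understand, matched with re.search against the full URL):

```python
import re

# The same pattern I pass to LinkExtractor(allow=...)
pattern = re.compile(r'/job/list/.*')

# A hypothetical pagination URL of the kind the rule should match
sample = "http://hr.xiaomi.com/job/list/2"

# re.search only needs the pattern to occur somewhere in the URL
print(bool(pattern.search(sample)))  # True
```

So the pattern itself seems fine, which matches what I saw in scrapy shell.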
And here is part of the output from the run:
2015-07-31 23:16:32 [scrapy] INFO: Scrapy 1.0.1 started (bot: xiaomihr)
2015-07-31 23:16:32 [scrapy] INFO: Optional features available: ssl, http11
2015-07-31 23:16:32 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xiaomihr.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['xiaomihr.spiders'], 'FEED_URI': 'items.csv', 'BOT_NAME': 'xiaomihr'}
2015-07-31 23:16:32 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-07-31 23:16:32 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-31 23:16:32 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-31 23:16:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-31 23:16:32 [scrapy] INFO: Enabled item pipelines:
2015-07-31 23:16:32 [scrapy] INFO: Spider opened
2015-07-31 23:16:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-31 23:16:32 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-31 23:16:32 [scrapy] DEBUG: Crawled (200) <GET http://hr.xiaomi.com/job/list> (referer: None)
2015-07-31 23:16:32 [scrapy] INFO: Closing spider (finished)
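One thing I started wondering about after posting: maybe allowed_domains is supposed to hold a bare hostname rather than a full URL, since (if I understand OffsiteMiddleware correctly) it compares entries against the netloc of each request. A quick check of what the netloc of my start URL actually looks like (run under Python 3 here just for illustration):

```python
from urllib.parse import urlparse

# The netloc is the bare hostname, without the "http://" scheme --
# which would not match an allowed_domains entry that includes the scheme
print(urlparse("http://hr.xiaomi.com/job/list").netloc)  # hr.xiaomi.com
```

Could that be why every extracted link gets filtered out? I'd appreciate confirmation from someone who knows the middleware better.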