Scraping all reviews of dark tea products on JD with Scrapy

蔡艺君小朋友 2018-04-10 08:49:21

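I have a question.
I adapted the code from https://blog.csdn.net/xiaoquantouer/article/details/51841016#python to Python 3 in order to scrape the reviews of the 4800+ products in JD's dark tea category. I did not use the proxy IPs from the original code, and I dropped some fields I don't need, such as price and province. When I run the spider file jd_goods there are no errors, but nothing is scraped either.
Code of the spider file jd_goods: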
import json
import re

import scrapy
from scrapy import Spider
from scrapy.selector import Selector

from ..items import goodsItem  # assumed import path; goodsItem is defined in the project's items.py


class jdsql(Spider):
    name = "jd"
    allowed_domains = ['jd.com']
    start_urls = []
    for i in range(1, 10):  # Set the number of pages yourself; for now this only covers the first 9 pages of the dark tea category
        url = 'https://list.jd.com/list.html?cat=1320,12202,12212&page=' + str(i)
        start_urls.append(url)

    def parse_getCommentnum(self, response):
        # meta passes data on to the next callback; it only accepts a dict, e.g. value1 = response.meta['key1']
        # https://www.zhihu.com/question/54773510 explains meta
        item1 = response.meta['item']
        # The response is JSON. response.text is the decoded str; in Python 3, str(response.body) would give "b'...'"
        js = json.loads(response.text)
        item1['score1count'] = js['CommentsCount'][0]['Score1Count']
        item1['score2count'] = js['CommentsCount'][0]['Score2Count']
        item1['score3count'] = js['CommentsCount'][0]['Score3Count']
        item1['score4count'] = js['CommentsCount'][0]['Score4Count']
        item1['score5count'] = js['CommentsCount'][0]['Score5Count']
        item1['comment_num'] = js['CommentsCount'][0]['CommentCount']
        return item1

    def parse_detail(self, response):
        item1 = response.meta['item']
        # response.body is bytes in Python 3 and cannot be split with a str separator, so use response.text
        temp = response.text.split('commentVersion:')
        # The number after commentVersion is the comment count
        pattern = re.compile(r"'(\d+)'")
        if len(temp) < 2:
            item1['commentVersion'] = -1
        else:
            # search() tolerates whitespace before the quoted number; group(1) is the digits without the quotes
            match = pattern.search(temp[1][:10])
            item1['commentVersion'] = match.group(1) if match else -1
        url = "https://club.jd.com/clubservice.aspx?method=GetCommentsCount&referenceIds=" + str(item1['ID'][0])
        yield scrapy.Request(url, meta={'item': item1}, callback=self.parse_getCommentnum)

    def parse(self, response):  # Parse the listing page
        sel = Selector(response)  # XPath selector
        # 60 products per page
        goods = sel.xpath('//li[@class="gl-item"]')
        for good in goods:
            item1 = goodsItem()
            # A leading "." keeps each XPath relative to the current <li>; a bare "//" searches the whole page
            item1['ID'] = good.xpath('.//div/@data-sku_temp').extract()
            item1['name'] = good.xpath('.//div/div[@class="p-name"]/a/em/text()').extract()
            item1['shop_name'] = good.xpath('.//div/div[@class="p-shop"]/span/a/text()').extract()
            # a/@href returns the attribute value; a[@href] would return the whole <a> element
            item1['link'] = good.xpath('.//div/div[@class="p-img"]/a/@href').extract()
            # items.py and the database have more fields than these; could that be why there are no results?
            # The link is the href under //div/div[@class="p-img"]/a, i.e. //item.jd.com/<product ID>.html; the third value obtained there is the comment count
            url = "https:" + item1['link'][0] + "#comment"
            yield scrapy.Request(url, meta={'item': item1}, callback=self.parse_detail)
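
A Python 3 detail that matters for callbacks like these: `response.body` is `bytes`, so `str(response.body)` produces a string that begins with `b'` rather than JSON text, and splitting `bytes` with a `str` separator raises `TypeError`. The sketch below uses only the standard library to show decoding before parsing. The `sample_body` and `sample_page` values and the helper names are illustrative assumptions, not real JD responses or part of the original spider. In Scrapy, `response.text` gives the decoded string directly.

```python
import json
import re

# Illustrative payloads only (assumptions); real JD responses may differ.
sample_body = b'{"CommentsCount": [{"CommentCount": 120, "Score5Count": 100}]}'
sample_page = "var cfg = { commentVersion:'4321', other: 1 };"


def parse_comment_counts(body: bytes) -> dict:
    """Decode the raw bytes first, then parse the JSON text."""
    data = json.loads(body.decode("utf-8"))
    return data["CommentsCount"][0]


def extract_comment_version(page_text: str) -> int:
    """Return the quoted number after 'commentVersion:', or -1 if absent."""
    found = re.search(r"commentVersion:\s*'(\d+)'", page_text)
    return int(found.group(1)) if found else -1


def naive_parse_fails(body: bytes) -> bool:
    """Report whether json.loads(str(body)) fails, as it does in Python 3."""
    try:
        json.loads(str(body))  # str(b'...') is the text "b'...'", not JSON
    except json.JSONDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(parse_comment_counts(sample_body)["CommentCount"])
    print(extract_comment_version(sample_page))
    print(naive_parse_fails(sample_body))
```

Using `re.search` with the `commentVersion:` prefix in the pattern avoids depending on the exact spacing before the quoted number, which `re.match` on a fixed slice would be sensitive to.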


3 replies


蔡艺君小朋友 2018-04-11


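After removing the .text from sel = Selector(response.text) and running it again, it prints "测试" ("test") a great many times, and I stopped it partway through with Ctrl+C. Given that "测试" does get printed, which part is wrong and keeps anything from being scraped?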
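
Once the loop body is known to run, one thing that may be worth checking next is what each per-item extraction returns. In Scrapy, an XPath that starts with `//` searches the whole document even when it is called on a sub-selector such as `good`; prefixing it with `.` (for example `.//div/...`) scopes it to the current item. The sketch below shows the same scoping idea with the standard library's `xml.etree.ElementTree` instead of Scrapy, so it runs without a crawl. The markup in `SAMPLE`, the `item.example` links, and the helper names are made-up stand-ins, not JD's real HTML.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a product list page (an assumption).
SAMPLE = """
<ul>
  <li class="gl-item"><div><div class="p-name"><a href="//item.example/1.html"><em>Tea A</em></a></div></div></li>
  <li class="gl-item"><div><div class="p-name"><a href="//item.example/2.html"><em>Tea B</em></a></div></div></li>
</ul>
"""

root = ET.fromstring(SAMPLE)
items = root.findall(".//li[@class='gl-item']")


def names_document_wide(tree_root, item_count):
    """Query from the document root once per item: every item sees all names."""
    return [
        [em.text for em in tree_root.findall(".//div[@class='p-name']/a/em")]
        for _ in range(item_count)
    ]


def names_per_item(item_nodes):
    """Query relative to each item: every item sees only its own name."""
    return [
        [em.text for em in item.findall(".//div[@class='p-name']/a/em")]
        for item in item_nodes
    ]


def links_per_item(item_nodes):
    """Read the href attribute value rather than the whole <a> element."""
    return [item.find(".//div[@class='p-name']/a").get("href") for item in item_nodes]


if __name__ == "__main__":
    print(names_document_wide(root, len(items)))
    print(names_per_item(items))
    print(links_per_item(items))
```

If every item ends up with the same first link, each iteration would build the same detail URL, which is the symptom the document-wide query produces here.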

半吊子Py全栈工程师 2018-04-10


Then see whether it prints "测试" ("test").

半吊子Py全栈工程师 2018-04-10


Take this:

def parse(self, response):  # Parse the listing page
    sel = Selector(response)  # XPath selector
    # 60 products per page
    goods = sel.xpath('//li[@class="gl-item"]')
    for good in goods:
        item1 = goodsItem()

and change it to:

def parse(self, response):  # Parse the listing page
    sel = Selector(response.text)  # XPath selector
    # 60 products per page
    goods = sel.xpath('//li[@class="gl-item"]')
    for good in goods:
        print("测试")
        item1 = goodsItem()