#Python Help# Scrapy pipeline error (with code and comments, simple and easy to follow)

Naza_051038 2017-04-03 10:47:15
Since I don't really understand the mechanics of how data gets passed along, I've been stuck on this Scrapy problem for almost half a month. I've gone through a lot of material and still don't get it. My fundamentals are weak, so I'm here asking the experts for help!
Leaving customization aside, take Scrapy's default setup as the example.
What format does the stuff a spider returns need to be in?
A dict?
{a: 1, b: 2, .....}

Or:
[{a: 1, aa: 11}, {b: 2, bb: 22}, {......}]

Where does the returned stuff get sent?
Is it the item in the code below?
class pipeline:
    def process_item(self, item, spider):
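
For reference, a minimal sketch of that contract under Scrapy's defaults (names like MinimalSpider are made up for illustration): parse() can yield plain dicts or Item objects one at a time, and the engine then calls each enabled pipeline's process_item() once per yielded item, never once with a whole list:

import scrapy

class MinimalSpider(scrapy.Spider):
    # hypothetical spider, only to illustrate the data flow
    name = "minimal"
    start_urls = ["http://www.pm25.com/rank/1day.html"]

    def parse(self, response):
        # each yield hands ONE dict (or Item) to the engine
        yield {"rank": "1", "city": "sample"}
        yield {"rank": "2", "city": "sample"}

class MinimalPipeline(object):
    def process_item(self, item, spider):
        # called once per yielded item; item is a single dict/Item here,
        # never the list form [{...}, {...}]
        print(item["rank"])
        return item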

I'm really a beginner, but I really want to learn, and I hope to get some help from you all! My code is below; please point out its weaknesses.

spider:
# -*- coding: utf-8 -*-
import scrapy
from pm25.items import Pm25Item
import re


class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile(r"\d+-\d+-\d+")
        date = response.xpath("/html/body/div[4]/div/div/div[2]/span").extract()[0]  # parse out the date separately
        # items = []

        selector = response.selector.xpath("/html/body/div[5]/div/div[3]/ul[2]/li")  # narrow down the part of the response to parse
        for subselector in selector:  # parse it entry by entry within that scope
            try:  # guard against [0] raising IndexError
                rank = subselector.xpath("span[1]/text()").extract()[0]
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank, quality, city, province, aqi, pm25)

            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)

            yield item  # I don't understand how to use this, or what format comes out;
                        # some tutorials return items instead, so I'm hoping for some pointers
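
To make the yield question concrete, here is a tiny standalone sketch in plain Python (no Scrapy involved): a function containing yield becomes a generator that produces one value per iteration instead of building the whole list in memory first, which is why Scrapy can consume parse() item by item:

def gen_items():
    for i in range(3):
        yield {"n": i}  # hands out ONE dict at a time

for d in gen_items():
    print(d)  # {'n': 0}, then {'n': 1}, then {'n': 2}

# contrast: a return would build the full list up front
def list_items():
    return [{"n": i} for i in range(3)]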


pipeline:
import time

class Pm25Pipeline(object):

    def process_item(self, item, spider):
        today = time.strftime("%y%m%d", time.localtime())
        fname = str(today) + ".txt"

        with open(fname, "a") as f:
            for tmp in item:  # not sure whether this is right;
                              # my understanding is that the item the spider returns is a yielded dict:
                              # [{a:1,aa:11},{b:2,bb:22},{......}]
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n'
                        )
            f.close()
        return item


items:
import scrapy

class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()


Part of the error output from the run:
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30',
 'city': '新疆',
 'date': '2017-04-02',
 'pm25': '13 ',
 'province': '伊犁哈萨克州',
 'quality': '优',
 'rank': '357'}
[... the identical traceback and matching "ERROR: Error processing {...}" entries repeat for every remaining item (林芝, 丽江, 玉溪, 楚雄州, 迪庆州, 怒江州, ...); 363 errors in total, per 'log_count/ERROR' below ...]
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 38229,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 363,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)
7 replies
ziggyff 2018-01-25
https://www.cnblogs.com/kongzhagen/p/6549053.html is what I followed, just writing one for fun.
ziggyff 2018-01-25
What gets returned should be an instance of the structure you defined in items.py, which you then use in pipelines.py. How exactly it gets passed along is something the Scrapy framework handles automatically; we just yield to produce the data and then go use it. If I'm wrong, please don't flame me: I only picked up Python a day ago, and this is the first thing I looked at.
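
That matches how scrapy.Item behaves: it is a dict-like container, which also points at the bug above, because iterating over it yields its field names as strings. A small standalone sketch (DemoItem and the field values are made-up sample data):

import scrapy

class DemoItem(scrapy.Item):
    city = scrapy.Field()
    pm25 = scrapy.Field()

item = DemoItem()
item['city'] = 'Lijiang'
item['pm25'] = '11'

print(item['city'])  # 'Lijiang'  -> index fields directly
print(dict(item))    # {'city': 'Lijiang', 'pm25': '11'}
print(list(item))    # ['city', 'pm25']  -> iterating gives key strings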
Naza_051038 2017-04-10
Quoting the reply from 屎克螂 (#4, reproduced in full below):
Thank you! I still don't quite get it, but I really appreciate it.
屎克螂 2017-04-05
Re "yield item  # I don't understand how to use this or what format comes out; some tutorials return items instead": yield makes a generator, and the point is to use memory sensibly. Say an array holds 100 elements and takes 100 units of memory, but the machine only has 10; returning range(100) outright would blow out the memory, so yield hands them over one at a time, using 1 unit each time.
Re "for tmp in item:  # not sure whether this is right; my understanding is that the item the spider returns is a yielded dict [{a:1,aa:11},{b:2,bb:22},{......}]": you're half right. Judging from the error message, part of item is strings and part is a dict. You need to type-check tmp before doing the subscript operation.
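
To pin the error down a bit further: process_item() receives one Pm25Item per call, and iterating a dict-like Item yields its field names (strings), so tmp ends up as a string like 'date', and tmp["pm25"] raises TypeError: string indices must be integers, exactly as in the traceback. A minimal corrected pipeline under that assumption:

import time

class Pm25Pipeline(object):
    def process_item(self, item, spider):
        # item is ONE Pm25Item per call; index its fields directly
        # instead of iterating over it (iteration yields key strings)
        fname = time.strftime("%y%m%d", time.localtime()) + ".txt"
        with open(fname, "a") as f:  # the with block closes f, so no f.close() is needed
            f.write('\t'.join([item["date"], item["rank"],
                               item["quality"], item["province"],
                               item["city"], item["aqi"],
                               item["pm25"]]) + '\n')
        return item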
师傅不姓唐 2017-04-03
You can take a look at this for reference: http://blog.csdn.net/zsl10/article/details/52691505
I've also been learning scrapy lately. Can you say what the goal of this project is? I'd like to write one and give it a try.
师傅不姓唐 2017-04-03
Quoting the reply from sinat_36561734 (#2):
"No particular goal, just monitoring smog readings and then charting them."
OK, thanks, I'll give it a try too.
Naza_051038 2017-04-03
No particular goal, just monitoring smog readings and then charting them.
