Python: parsing a large text file, loading it into a database, and writing the converted output to a new file

tim_spac 2011-12-20 10:42:50
#!/usr/bin/python2.7
# encoding: utf-8

''' Process a large text file; write the results to both a database and a text file '''

import re

from mylib.dbi import DataBaseInterface
from app_common import config

class Klass(object):

    FMT = '%(id)d, %(name)s'

    def __init__(self, **kwg):
        self.id = kwg.get('id')
        self.name = kwg.get('name')

    def __str__(self):
        ''' Format the object's attributes as a string using FMT '''
        return self.FMT % self.__dict__

    def sqltuple(self):
        ''' Return the object's attributes as a tuple, in a fixed order '''
        return (self.id, self.name)

patt = re.compile(r'^(?P<id>\d+)\t(?P<name>.*)[\s\r\n]+?$', re.I|re.X|re.U)

def process(line):
    ''' Parse a line in the expected format and build an instance '''
    m = patt.match(line)
    if not m:
        return None
    # groupdict() values are strings; FMT's %d needs an int
    return Klass(id=int(m.group('id')), name=m.group('name'))

dbi = DataBaseInterface(**config)
dbi.open()

# dbi.batch is a method of DataBaseInterface:
# it runs the bulk operation through dbi.conn.executemany,
# supports `with` for automatic setup/teardown, and manages its buffer size automatically.
# insertsql, srcfilename and wrtfilename are defined elsewhere (left out of this simplified version).
buff = dbi.batch(insertsql)
src = open(srcfilename, 'r')
wrt = open(wrtfilename, 'w')

with src, wrt, buff:
    for ln in src:
        instance = process(ln)
        if instance:
            wrt.write('%s\n' % instance)
            buff.append(instance.sqltuple())

dbi.close()
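
The comment on dbi.batch describes a buffer that flushes through executemany and works as a context manager. DataBaseInterface itself is not shown in this thread, so the following is only a rough sketch of how such a buffer might look. The class name BatchBuffer, the size parameter, and the use of sqlite3 are all assumptions for illustration, not the poster's actual library.

```python
import sqlite3


class BatchBuffer(object):
    """Hypothetical stand-in for dbi.batch(): collects rows, flushes via executemany."""

    def __init__(self, conn, sql, size=1000):
        self.conn = conn
        self.sql = sql
        self.size = size      # flush automatically once this many rows are buffered
        self.rows = []

    def append(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.size:
            self.flush()

    def flush(self):
        if self.rows:
            self.conn.executemany(self.sql, self.rows)
            self.conn.commit()
            self.rows = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Flush the remaining rows only if the block finished without an error.
        if exc_type is None:
            self.flush()
        return False


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (id INTEGER, name TEXT)')
with BatchBuffer(conn, 'INSERT INTO items VALUES (?, ?)', size=2) as buff:
    for row in [(1, 'a'), (2, 'b'), (3, 'c')]:
        buff.append(row)
row_count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
```

With size=2, the first two rows are flushed as soon as the second one is appended, and the third is flushed when the with block exits, so memory use stays bounded no matter how long the input is.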


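The code above is simplified from a program that currently works.
After it finishes, a separate step moves the newly written file into a compressed archive.
I'd now like to optimize this a bit further: write the new file directly into the archive (zip, gz, or anything similar is fine).
The one approach I know of is to keep all the output text in memory and then write it into the archive in one go,
but the file is very large, so that isn't practical.

Is there a way to do something like this:
wrt = SomePackageMethord.open(wrtfilename, 'w', compressmode)
or:
package = SomePackageMethord(packagefile, 'w', compressmode)
wrt = package.write(arc_filename)
?


Any help would be appreciated.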
4 replies
iambic 2011-12-20
zipfile doesn't support this. Take a look at:
http://stackoverflow.com/questions/297345/create-a-zip-file-from-a-generator-in-python
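tim_spac 2011-12-20
zipfile's write method writes an existing file, filename, into the archive under the name arcname;
zipfile's writestr method writes the contents of a buffer, bytes, into the archive under the name arcname.

1. The first method is what I use now (in a separate module): it packs the file this script produces. What I want is to write straight into the archive without creating the file on disk first.
2. The second method writes everything in one call; if I write a second time, the earlier content is lost. Holding the data in memory and writing it once at the end would do what I want, but when the data is huge ... :(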
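
A note for readers on a newer interpreter: since Python 3.6, ZipFile.open() accepts mode 'w' and returns a writable binary file object, so an archive member can be written piece by piece. This is not available on the Python 2.7 used in this thread. A minimal sketch, assuming Python 3.6+; the helper name write_lines_to_zip and the member name result.txt are made up for illustration:

```python
import io
import zipfile


def write_lines_to_zip(zip_target, arcname, lines):
    """Stream lines into one archive member without holding the whole text in memory."""
    with zipfile.ZipFile(zip_target, 'w', zipfile.ZIP_DEFLATED) as package:
        # ZipFile.open(name, 'w') requires Python 3.6 or later.
        with package.open(arcname, 'w') as wrt:
            for line in lines:
                wrt.write(('%s\n' % line).encode('utf-8'))


# Demo against an in-memory archive; a file path works the same way.
archive = io.BytesIO()
write_lines_to_zip(archive, 'result.txt',
                   ('%d, name%d' % (i, i) for i in range(3)))
with zipfile.ZipFile(archive) as package:
    content = package.read('result.txt').decode('utf-8')
```

If a single member may grow past 2 GiB, open() also takes force_zip64=True, which is worth setting when the final size is unknown in advance.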
askandstudy 2011-12-20
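[Quote=Quoting the original post by tim_spac:]
The code above is simplified from a program that currently works.
After it finishes, a separate step moves the newly written file into a compressed archive.
I'd now like to optimize this a bit further: write the new file directly into the archive (zip, gz, or anything similar is fine).
The one approach I know of is to keep all the output text in memory and then write it into the archive in one go,
but the file is very large, so that isn't practical.

Is there a way to do something like this:
wrt = SomePackageMethord.open(wrtfilename, 'w', compressmode)
or:
package = SomePackageMethord(packagefile, 'w', compressmode)
wrt = package.write(arc_filename)
[/Quote]

I didn't quite follow what you're after. I've used the zipfile module a little; it has write and writestr methods, plus a few others. I don't know whether they are what you need.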


write(filename, [arcname, [compress_type]])
Write the file named filename to the archive, giving it the archive name arcname (by default, this will
be the same as filename, but without a drive letter and with leading path separators removed). If given,
compress_type overrides the value given for the compression parameter to the constructor for the new entry.
The archive must be open with mode 'w' or 'a'; calling write() on a ZipFile created with mode 'r'
will raise a RuntimeError. Calling write() on a closed ZipFile will raise a RuntimeError.
Note: There is no official file name encoding for ZIP files. If you have unicode file names, you must
convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets
all file names as encoded in CP437, also known as DOS Latin.
Note: Archive names should be relative to the archive root, that is, they should not start with a path
separator.
Note: If arcname (or filename, if arcname is not given) contains a null byte, the name of the file in
the archive will be truncated at the null byte.

writestr(zinfo_or_arcname, bytes, [compress_type])
Write the string bytes to the archive; zinfo_or_arcname is either the file name it will be given in the archive,
or a ZipInfo instance. If it's an instance, at least the filename, date, and time must be given. If it's a
name, the date and time is set to the current date and time. The archive must be opened with mode 'w' or
'a'; calling writestr() on a ZipFile created with mode 'r' will raise a RuntimeError. Calling
writestr() on a closed ZipFile will raise a RuntimeError.
If given, compress_type overrides the value given for the compression parameter to the constructor for the
new entry, or in the zinfo_or_arcname (if that is a ZipInfo instance).
Note: When passing a ZipInfo instance as the zinfo_or_arcname parameter, the compression method
used will be that specified in the compress_type member of the given ZipInfo instance. By default, the
ZipInfo constructor sets this member to ZIP_STORED.
Changed in version 2.7: The compress_type argument.


See the library documentation for the details.
tim_spac 2011-12-20
Thanks, everyone. I've decided to go with gzip.
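
The thread does not show the final code, so here is a minimal sketch of what the gzip approach could look like. gzip.open returns a file-like object, so it can stand in for the plain open(wrtfilename, 'w') call and lines get compressed as they are written. The helper name write_lines_gz and the sample data are assumptions for illustration.

```python
import gzip
import os
import tempfile


def write_lines_gz(path, lines):
    """Write lines to a gzip file one at a time; nothing is accumulated in memory."""
    with gzip.open(path, 'wb') as wrt:
        for line in lines:
            # Binary mode: encode explicitly so the same code runs on Python 2.7 and 3.
            wrt.write(('%s\n' % line).encode('utf-8'))


# Demo with a temporary file and two made-up records.
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, 'result.txt.gz')
write_lines_gz(gz_path, ['1, foo', '2, bar'])
with gzip.open(gz_path, 'rb') as src:
    restored = src.read().decode('utf-8')
```

Unlike zip, a gzip file holds a single compressed stream and no member names, which is enough here because the script produces exactly one output file.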
