Python: parsing a large text file, loading it into a database, and writing the converted output to a new file

tim_spac 2011-12-20 10:42:50

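#!/usr/bin/python2.7
# encoding: utf-8

''' Process a large text file; write the results to both a database and a text file '''

import re

from mylib.dbi import DataBaseInterface
from app_common import config


class Klass(object):

    FMT = '%(id)d, %(name)s'

    def __init__(self, **kwg):
        # Regex groups arrive as strings, but FMT formats id with %d, so convert it.
        id_ = kwg.get('id')
        self.id = int(id_) if id_ is not None else None
        self.name = kwg.get('name')

    def __str__(self):
        ''' Format the object's attributes as a string using FMT '''
        return self.FMT % self.__dict__

    def sqltuple(self):
        ''' Return the object's attributes as a tuple, in a fixed order '''
        return (self.id, self.name)


# One record per line: numeric id, a tab, then the name; trailing whitespace
# (including \r\n) is stripped, and a final line without a newline still matches.
patt = re.compile(r'^(?P<id>\d+)\t(?P<name>.*?)\s*$', re.I|re.X|re.U)


def process(line):
    ''' Parse a line in the expected format and build an instance (None if it does not match) '''
    m = patt.match(line)
    return None if not m else Klass(**m.groupdict())


dbi = DataBaseInterface(**config)
dbi.open()

# dbi.batch is a DataBaseInterface method:
# it runs bulk operations through dbi.conn.executemany,
# works as a context manager (automatic setup and close) and flushes its buffer automatically.
# insertsql, srcfilename and wrtfilename are not shown in this simplified excerpt.
buff = dbi.batch(insertsql)
src = open(srcfilename, 'r')
wrt = open(wrtfilename, 'w')

with src, wrt, buff:
    for ln in src:
        instance = process(ln)
        if instance:
            wrt.write('%s\n' % instance)
            buff.append(instance.sqltuple())

dbi.close()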

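The code above is simplified from a program that currently works.
After it finishes, a separate step moves the newly written file into a compressed archive.
I'd now like to optimize this further: write the new file directly into the archive (zip, gz, or anything similar is fine).
One approach I know of is to hold all the text to be written in memory and then write it into the archive in one go;
... the problem is that the file is very large, so that isn't practical.

Is there a way to do something like this:
wrt = SomePackageMethord.open(wrtfilename, 'w', compressmode)
or:
package = SomePackageMethord(packagefile, 'w', compressmode)
wrt = package.write(arc_filename)
?

Any help would be appreciated.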


iambic 2011-12-20


zipfile doesn't support this. See:
http://stackoverflow.com/questions/297345/create-a-zip-file-from-a-generator-in-python

tim_spac 2011-12-20


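zipfile's write method adds an existing file filename to the archive under the name arcname;
zipfile's writestr method writes an in-memory buffer bytes to the archive under the name arcname.

1. The first method is what I use now (in a separate module): it packs the file this script generates. What I want is to write straight into the archive without creating the file on disk first.
2. The second method writes everything in one call; if I write a second time, the earlier content is discarded. Keeping the data in memory and writing it once when processing finishes would do what I want, but when the data volume is huge ... :(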
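
Since the original post says gz is acceptable, one route that avoids both drawbacks is the standard gzip module: gzip.open returns a file-like object that compresses incrementally, so each line can be written as it is produced, without holding the whole output in memory or creating an uncompressed file on disk. A minimal sketch, assuming Python 3 (on 2.7 the same idea should work by opening in 'wb' mode and writing byte strings); stream_to_gzip and the sample rows are made-up names for illustration, not part of the original program:

```python
import gzip
import os
import tempfile


def stream_to_gzip(lines, gz_path):
    """Write an iterable of text lines into a .gz file, one line at a time.

    Returns the number of lines written. Only gzip's internal buffer is
    held in memory, never the whole output.
    """
    count = 0
    # 'wt' opens the compressed stream in text mode (Python 3.3+);
    # newline='\n' disables platform-specific newline translation.
    with gzip.open(gz_path, 'wt', encoding='utf-8', newline='\n') as wrt:
        for line in lines:
            wrt.write('%s\n' % line)
            count += 1
    return count


if __name__ == '__main__':
    # Hypothetical sample data standing in for the parsed instances.
    rows = ('%d, name%d' % (i, i) for i in range(1000))
    path = os.path.join(tempfile.mkdtemp(), 'out.txt.gz')
    print(stream_to_gzip(rows, path))
```

In the loop from the original post, wrt would simply become the object returned by gzip.open, and the rest of the loop could stay unchanged.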

askandstudy 2011-12-20


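[Quote=Quoting the original post by tim_spac:]
The code above is simplified from a program that currently works.
After it finishes, a separate step moves the newly written file into a compressed archive.
I'd now like to optimize this further: write the new file directly into the archive (zip, gz, or anything similar is fine).
One approach I know of is to hold all the text to be written in memory and then write it into the archive in one go;
... the problem is that the file is very large, so that isn't practical.

Is there a way to do something like this:
wrt = SomePackageMethord.open(wrtfilename, 'w', compressmode)
or:
package = SomePackageMethord(packagefile, 'w', compressmode)
wrt = package.write(arc_filename)
[/Quote]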

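I don't quite follow what you're asking. I've used the zipfile module a little; it has write and writestr methods, plus a few others. I'm not sure whether they're what you need.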



write(filename, [arcname, [compress_type]])

Write the file named filename to the archive, giving it the archive name arcname (by default, this will be the same as filename, but without a drive letter and with leading path separators removed). If given, compress_type overrides the value given for the compression parameter to the constructor for the new entry. The archive must be open with mode 'w' or 'a' - calling write() on a ZipFile created with mode 'r' will raise a RuntimeError. Calling write() on a closed ZipFile will raise a RuntimeError.

Note: There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets all file names as encoded in CP437, also known as DOS Latin.

Note: Archive names should be relative to the archive root, that is, they should not start with a path separator.

Note: If arcname (or filename, if arcname is not given) contains a null byte, the name of the file in the archive will be truncated at the null byte.

writestr(zinfo_or_arcname, bytes, [compress_type])

Write the string bytes to the archive; zinfo_or_arcname is either the file name it will be given in the archive, or a ZipInfo instance. If it's an instance, at least the filename, date, and time must be given. If it's a name, the date and time is set to the current date and time. The archive must be opened with mode 'w' or 'a' - calling writestr() on a ZipFile created with mode 'r' will raise a RuntimeError. Calling writestr() on a closed ZipFile will raise a RuntimeError.

If given, compress_type overrides the value given for the compression parameter to the constructor for the new entry, or in the zinfo_or_arcname (if that is a ZipInfo instance).

Note: When passing a ZipInfo instance as the zinfo_or_arcname parameter, the compression method used will be that specified in the compress_type member of the given ZipInfo instance. By default, the ZipInfo constructor sets this member to ZIP_STORED. Changed in version 2.7: The compression_type argument.
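
The excerpt above describes the Python 2.7 API, where write() needs an existing file and writestr() needs the whole payload at once. For readers on a newer interpreter: since Python 3.6, ZipFile.open() also accepts mode 'w' and returns a writable file-like object for a single archive member, which is close to the package.write(arc_filename) shape asked for in the original post. A rough sketch, assuming moving to Python 3.6+ is an option; stream_to_zip and its arguments are illustrative names, not part of the original program:

```python
import io
import os
import tempfile
import zipfile


def stream_to_zip(lines, zip_path, arcname):
    """Stream text lines into one member of a zip archive (Python 3.6+).

    Returns the number of lines written.
    """
    count = 0
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as package:
        # force_zip64=True lets the member grow past 2 GiB, because its
        # final size is not known when the entry is opened.
        with package.open(arcname, 'w', force_zip64=True) as raw:
            # The member handle is binary, so wrap it to write text;
            # newline='\n' disables platform-specific newline translation.
            with io.TextIOWrapper(raw, encoding='utf-8', newline='\n') as wrt:
                for line in lines:
                    wrt.write('%s\n' % line)
                    count += 1
    return count


if __name__ == '__main__':
    # Hypothetical sample data standing in for the parsed instances.
    rows = ('%d, name%d' % (i, i) for i in range(1000))
    path = os.path.join(tempfile.mkdtemp(), 'out.zip')
    print(stream_to_zip(rows, path, 'out.txt'))
```

Only one member can be open for writing at a time, and other ZipFile methods should not be called on the archive until that handle is closed, so this fits the single-output-file case in the thread rather than interleaved writes to several members.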