Controlling the memory usage of a Python dict

yunfeifan 2009-11-24 10:17:40

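Below is a piece of code I wrote. It randomly generates 1 million rows of 4 columns each, stores them in a dict, and then writes the contents to a txt file.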



#!/usr/bin/python2.5
import random
import time

time.sleep(5)
shapes = {}
print "--------------START---------------------------"
for i in range(1000000):
  shape_id = str(random.randint(0, 10000))
  r1 = str(random.randint(0, 1000000))
  r2 = str(random.randint(0, 1000000))
  r3 = str(random.randint(0, 1000000))
  r4 = str(random.randint(0, 1000000))
  shapes.setdefault(shape_id, []).append([r1, r2, r3, r4])
print '---------------Dictionary---------------------'
time.sleep(5)

f = open("test.txt", "w")
for shape_id, res in shapes.items():
  for (r1, r2, r3, r4) in res:
    f.write(shape_id + ',' + r1 + ',' + r2 + ',' + r3 + ',' + r4 + '\n')
f.close()

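When I run it, memory usage climbs to over 200 MB once shapes has been built, yet the generated file is only a little over 30 MB. Why is that? Can this approach be optimized? I may need to import a 500 MB file, and at this rate it would probably take 3-4 GB of memory to hold it, which my local machine clearly does not have. Does anyone know a way to optimize it?

Thanks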

11 replies (newest first)

yunfeifan 2009-12-22

I found that storing the values as int or float instead of str saves nearly half the space.
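A rough way to check this kind of saving is `sys.getsizeof`, which reports the size of a single object. The sketch below is an illustration only: it is written for Python 3 on CPython rather than the Python 2.5 used in the thread, the sample values are made up, and the exact byte counts depend on the interpreter build.

```python
import sys

def row_size(row):
    """Approximate bytes used by one row: the container plus each element."""
    return sys.getsizeof(row) + sum(sys.getsizeof(x) for x in row)

# Made-up sample values in the range used by the script in the question
str_row = ("123456", "654321", "111111", "999999")  # values kept as strings
int_row = tuple(int(s) for s in str_row)             # same values as ints

str_bytes = row_size(str_row)
int_bytes = row_size(int_row)
print("row of str:", str_bytes, "bytes")
print("row of int:", int_bytes, "bytes")
```

The ratio between the two numbers will vary between Python versions, so it may not match the "nearly half" observed above.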

bladesoft 2009-12-09

shapes.setdefault(shape_id, []).append([r1, r2, r3, r4])
This line still looks wrong to me. Why are the keys also such large random numbers? They are also likely to repeat.

海楓 2009-12-01

Try mmap.
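The reply gives no details, so the following is only one possible reading of the suggestion: map the data file with the standard `mmap` module and walk it line by line, so that the operating system pages the file in and out instead of every row living in a Python object at once. It is a Python 3 sketch; `iter_rows` and the sample file are hypothetical names introduced for illustration.

```python
import mmap
import os
import tempfile

def iter_rows(path):
    """Yield each comma-separated line of `path` as a list of str via mmap."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # readline() returns b"" at end of file, which ends the iteration
            for line in iter(mm.readline, b""):
                yield line.decode("ascii").rstrip("\n").split(",")
        finally:
            mm.close()

# Hypothetical sample file in the same format as test.txt from the question
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("12,1,2,3,4\n")
    out.write("7,5,6,7,8\n")

rows = list(iter_rows(sample_path))
os.remove(sample_path)
print(rows)
```

Collecting everything with `list(...)` is done here only to show the result; for a large file the rows would be consumed one at a time.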

thy38 2009-11-25

The keys are integers, so why not just use a sequential list? It would use roughly half the memory.
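A minimal sketch of what this could look like, assuming the ids stay in the range 0 to 10000 as in the question. It is written for Python 3, uses a smaller row count to keep it quick, and does not by itself verify the "roughly half" figure.

```python
import random

MAX_ID = 10000  # ids in the question come from random.randint(0, 10000)

# One slot per possible id; the id itself is the list index, so no key
# objects and no hash table are stored.
shapes = [None] * (MAX_ID + 1)

random.seed(0)  # fixed seed so the sketch is repeatable
for _ in range(1000):
    shape_id = random.randint(0, MAX_ID)
    row = tuple(random.randint(0, 1000000) for _ in range(4))
    if shapes[shape_id] is None:
        shapes[shape_id] = []
    shapes[shape_id].append(row)

total_rows = sum(len(rows) for rows in shapes if rows is not None)
print(total_rows)
```

This only works well when the ids are dense; with sparse or very large ids most slots would stay empty.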

angel_su 2009-11-24

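This probably has nothing to do with dict. Python objects are fairly large to begin with, and switching to another container type might take even more memory, because a dict at least eliminates duplicate keys. If the data set is too large and you want to load it all at once rather than work on it through files, you will probably have to switch to a lower-level language.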

notax 2009-11-24

Why?

Dictionary implementation

Dictionaries use a similar expandable model, though they are hashtables, and their structure is thus a bit more complex. Essentially, Python dictionaries today use table probing instead of chains of items at hash table slots, along with hashing algorithms tailored for common Python usage patterns. According to Python 2.6's dictobject.c file: "This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4. Open addressing is preferred over chaining since the link overhead for chaining would be substantial (100% with typical malloc overhead)."

In terms of memory, dictionary tables also begin small or presized, and may grow or shrink over time; they double or quadruple in size when they become 2/3 full, and may shrink as items are removed. Also from dictobject.c: "If fill >= 2/3 size, adjust size. Normally, this doubles or quaduples the size, but it's also possible for the dict to shrink [...] Quadrupling the size improves average dictionary sparseness (reducing collisions) at the cost of some memory and iteration speed (which loops over every possible entry). It also halves the number of expensive resize operations in a growing dictionary. Very large dictionaries (over 50K items) use doubling instead. This may help applications with severe memory constraints." In other words, dictionaries are already more efficient than you or I could probably make them.
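The stepwise growth described in the quote can be observed with `sys.getsizeof`. The sketch below is for Python 3 on CPython, whose resize thresholds and growth factors differ from the Python 2.6 source quoted above, so only the general step pattern carries over. `resize_points` is a hypothetical helper, and `getsizeof` counts only the dict's own table, not the key and value objects.

```python
import sys

def resize_points(n):
    """Return (item_count, table_bytes) each time the dict's size changes."""
    d = {}
    last = sys.getsizeof(d)
    points = [(0, last)]
    for i in range(n):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:
            points.append((len(d), size))
            last = size
    return points

points = resize_points(100000)
for count, size in points:
    print(count, size)
```

Each printed line marks an insertion that triggered a resize; between those points the reported size stays constant.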

If you use bsddb, you have to convert everything to str first.
Besides bsddb there are other options, such as ZODB BTrees.

anyway, you are on your own.

yunfeifan 2009-11-24

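It seems bsddb can only store strings.
My key is an ID that maps to a list, and the list holds many tuples, for example:
123=>[(2,3,4),(3,4,5)...]
234=>[(232,433,493),(384,987,239)...]
The value for each key also needs to be sorted.

I have not figured out how to do that with bsddb.
I just want to know why a dict needs more than 1 GB of memory for a 200 MB file. Is there any way around this?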
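One standard-library option that is not mentioned in the thread is `shelve`, which pickles values behind a dict-like interface, so a list of tuples can be stored on disk under a string key. The sketch below is a Python 3 illustration with made-up data in the shape described above; it is not a statement about how bsddb itself should be used.

```python
import os
import shelve
import shutil
import tempfile

# Hypothetical data in the shape described above: id -> list of tuples
data = {
    123: [(3, 4, 5), (2, 3, 4)],
    234: [(384, 987, 239), (232, 433, 493)],
}

tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "shapes_shelf")

db = shelve.open(db_path)
try:
    for shape_id, rows in data.items():
        # shelve keys must be str; values may be any picklable object
        db[str(shape_id)] = sorted(rows)
finally:
    db.close()

# Reopen and read back one key at a time; only that value is unpickled
db = shelve.open(db_path)
try:
    first = db["123"]
    second = db["234"]
finally:
    db.close()

shutil.rmtree(tmp_dir)
print(first)
print(second)
```

Note that appending to a stored list means reading the value, changing it, and assigning it back, which can be slow for very long lists.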

notax 2009-11-24

You can use bsddb to keep the dict in a file; see the bsddb documentation for details.

>>> import bsddb
>>> db = bsddb.btopen('/tmp/spam.db', 'c')
>>> for i in range(10): db['%d'%i] = '%d'% (i*i)
...
>>> db['3']
'9'
>>> db.keys()
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>>> db.first()
('0', '0')
>>> db.next()
('1', '1')
>>> db.last()
('9', '81')
>>> db.set_location('2')
('2', '4')
>>> db.previous()
('1', '1')
>>> for k, v in db.iteritems():
...     print k, v
...
0 0
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
>>> '8' in db
True
>>> db.sync()
0

angel_su 2009-11-24

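With a large amount of data, integer keys should save memory. If you really must use long strings and the data set is too large, store in the dict the position of each record in the file instead of the record itself, so that you can read it back by random access.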
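The idea of keeping file positions instead of the data can be sketched as follows, assuming the comma-separated line format written by the script in the question. This is a Python 3 illustration; `build_index`, `read_rows`, and the sample file are hypothetical names, not code from the thread.

```python
import os
import tempfile

def build_index(path):
    """Map each shape_id to the byte offsets of its lines in the file."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            shape_id = int(line.split(b",", 1)[0])
            index.setdefault(shape_id, []).append(offset)
    return index

def read_rows(path, index, shape_id):
    """Fetch the rows for one id by seeking to the stored offsets."""
    rows = []
    with open(path, "rb") as f:
        for offset in index.get(shape_id, []):
            f.seek(offset)
            fields = f.readline().decode("ascii").strip().split(",")
            rows.append(tuple(int(x) for x in fields[1:]))
    return rows

# Hypothetical sample file in the format written by the question's script
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("5,10,20,30,40\n7,1,2,3,4\n5,11,21,31,41\n")

index = build_index(sample_path)
rows_for_5 = read_rows(sample_path, index, 5)
os.remove(sample_path)
print(rows_for_5)
```

Memory then holds one integer per row instead of four values per row, at the cost of a disk seek for every lookup.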

yunfeifan 2009-11-24

Thanks to the poster above.
My problem is that loading 20 MB of data into a dict takes more than 100 MB of memory.
Is there a way to solve this?
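Much of the gap between file size and memory use comes from per-object overhead: every string, int, list, and tuple carries its own header. One way to reduce it, not proposed in the thread and shown here only as a hedged Python 3 sketch, is to pack the numbers for each key into a flat `array.array`. The sample rows and the `get_row` helper are made up for illustration, and the byte counts depend on the interpreter build.

```python
import sys
from array import array

# Hypothetical rows for one shape_id, four integers per row
rows = [(i, i + 1, i + 2, i + 3) for i in range(100000, 101000)]

# Nested Python objects: a list of tuples of int objects
nested_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(row) + sum(sys.getsizeof(x) for x in row) for row in rows
)

# Flat typecode 'l' array: row k occupies items 4*k .. 4*k + 3
flat = array("l")
for row in rows:
    flat.extend(row)
flat_bytes = sys.getsizeof(flat)

def get_row(flat_array, k):
    """Return row k of the flat array as a tuple."""
    return tuple(flat_array[4 * k: 4 * k + 4])

print(nested_bytes, flat_bytes)
print(get_row(flat, 0))
```

The trade-off is that sorting the rows for a key requires unpacking them into tuples first, or sorting row indices instead.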

notax 2009-11-24

#-----------------------------------------------------
dict_test1.py

#!/usr/bin/python2.5
import random
import time

shapes = {}
for i in xrange(1000000):
  shape_id = str(random.randint(0, 10000))
  r1 = str(random.randint(0, 1000000))
  r2 = str(random.randint(0, 1000000))
  r3 = str(random.randint(0, 1000000))
  r4 = str(random.randint(0, 1000000))
  shapes.setdefault(shape_id, []).append([r1, r2, r3, r4])

f = open("test.txt", "w")
for shape_id, res in shapes.items():
  for (r1, r2, r3, r4) in res:
    f.write(shape_id + ',' + r1 + ',' + r2 + ',' + r3 + ',' + r4 + '\n')

f.close()

#-----------------------------------------------------
dict_test5.py

#!/usr/bin/python2.5
#
from __future__ import generators

from random import randint as r

def gen_random(shapes, shape_id=None, nums=None):
  for i in xrange(1000000):
    shape_id = r(0, 10000)
    nums = r(0, 1000000), r(0, 1000000), r(0, 1000000), r(0, 1000000)
    # Overwrites any earlier row stored under the same shape_id, so the
    # dict never holds more than 10001 rows (unlike dict_test1.py).
    shapes[shape_id] = nums
    yield shapes[shape_id]

shapes = {}
shape_id = 0
nums = ()
randoms = gen_random(shapes, shape_id, nums)

f = open("test5.txt", "w")
s = ''
for g in randoms:
  # Only the four values are written; shape_id is not part of the output.
  s = '%s,%s,%s,%s\n' % (g[0], g[1], g[2], g[3])
  f.write(s)
f.close()

#-----------------------------------------------------

results:
$ time python dict_test1.py

real 0m20.350s
user 0m19.765s
sys 0m0.176s

$ time python dict_test5.py

real 0m12.766s
user 0m12.417s
sys 0m0.068s