MemoryError when calling sklearn's RandomForestClassifier

flyinghorse_2012 2016-01-18 08:54:34
I've recently been taking part in a big-data competition and am running the data locally: 32-bit Windows, a Python development environment, and the sklearn machine-learning package.
Calling sklearn's RandomForestClassifier raises a MemoryError, and Task Manager does show memory usage spiking. Could someone advise me on how to deal with this?
The code is as follows:
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

ISOTIMEFORMAT = '%Y-%m-%d %X'

print 'Begin:'
print time.strftime(ISOTIMEFORMAT,time.localtime())

resutlfeaturespath = r"C:\Users\robbert\Desktop\bigDataCompetition\tianyi\data\resultfeaturesday"
featuresdatapath = r"C:\Users\robbert\Desktop\bigDataCompetition\tianyi\data\week6and7features.txt"
traindata = np.loadtxt(featuresdatapath,delimiter = ',',dtype = np.int)

trainfeatures = traindata[:,:6]
trainlabel = traindata[:,6]

clf = RandomForestClassifier(n_estimators = 100)
clf.fit(trainfeatures,trainlabel)

print 'model constructed:'
print time.strftime(ISOTIMEFORMAT,time.localtime())

for i in range(7):
    # predict the feature file for each of the 7 days, one file at a time
    resultfp = resutlfeaturespath + str(i + 1) + ".txt"
    testdata = np.loadtxt(resultfp, delimiter = ',', dtype = np.int)
    testfeatures = testdata[:,:6]
    testlabel = clf.predict(testfeatures)
    resultpath = resutlfeaturespath + str(i + 1) + "result.txt"
    np.savetxt(resultpath, testlabel, delimiter = ',', fmt = '%d')  # np.save takes no delimiter and needs the array; savetxt writes the labels as text

print 'Done:'
print time.strftime(ISOTIMEFORMAT,time.localtime())

The output is as follows:
Begin:
2016-01-18 20:36:41
model constructed:
2016-01-18 20:37:08
Traceback (most recent call last):
File "C:\Users\robbert\Desktop\bigDataCompetition\tianyi\pythonproject2\train_randomForest\train_randomForest.py", line 32, in <module>
testlabel = clf.predict(testfeatures)
File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 498, in predict
proba = self.predict_proba(X)
File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 547, in predict_proba
for e in self.estimators_)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 183, in __init__
self.results = batch()
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 125, in _parallel_helper
return getattr(obj, methodname)(*args, **kwargs)
File "C:\Python27\lib\site-packages\sklearn\tree\tree.py", line 673, in predict_proba
proba = self.tree_.predict(X)
File "sklearn/tree/_tree.pyx", line 736, in sklearn.tree._tree.Tree.predict (sklearn\tree\_tree.c:8449)
File "sklearn/tree/_tree.pyx", line 738, in sklearn.tree._tree.Tree.predict (sklearn\tree\_tree.c:8321)
MemoryError
3 replies
斯温jack 2016-09-30
sklearn's random forest defaults to n_estimators = 10. Also, judging from your code you are not using parallelism, i.e. n_jobs = 1, so only one CPU core is used; switching to multi-core with n_jobs = -1 would only make the memory situation worse. You could consider using several random forests, each with a fairly small n_estimators, and then vote/average over their results; or simply build the random forest yourself out of decision trees, predicting one tree at a time and averaging at the end. That guarantees nothing runs in parallel and saves some memory.
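A minimal sketch of that tree-by-tree idea, assuming the fitted clf and the testfeatures array from the code in the original post; it keeps only a running sum of the per-tree class probabilities instead of letting the forest hold one array per tree at once:

import numpy as np

# Average the class probabilities one tree at a time, so peak memory is
# roughly a single (n_samples, n_classes) array rather than one per tree.
proba_sum = None
for tree in clf.estimators_:          # the individual DecisionTreeClassifier objects
    p = tree.predict_proba(testfeatures)
    if proba_sum is None:
        proba_sum = p
    else:
        proba_sum += p
proba_sum /= len(clf.estimators_)
testlabel = clf.classes_.take(np.argmax(proba_sum, axis=1))

The same loop also works across several small forests trained separately: predict with each one, add up the probabilities, and take the argmax at the end.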
weixin_33601960 2016-01-23
A big-data competition, impressive. I'm a newbie, just here to bump the thread.
flyinghorse_2012 2016-01-18
To add the data sizes: traindata.shape = (369605, 7), testdata.shape = (1486780, 7). I also ran into a MemoryError before when using pandas. Is my RAM simply too small, or is something else the problem?
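For a rough estimate, assuming a two-class problem: this sklearn version appears to collect one float64 probability array per tree (that is the joblib Parallel call in the traceback) before summing them, which is about 100 × 1,486,780 × 2 × 8 bytes ≈ 2.4 GB, more than a 32-bit Python process (roughly 2 GB of address space on Windows) can hold. A minimal sketch of predicting in chunks instead, assuming the fitted clf and testfeatures from the code above (predict_in_chunks and the chunk count are made-up names for illustration):

import numpy as np

def predict_in_chunks(model, X, n_chunks = 20):
    # Split the test rows into blocks so each predict call only holds the
    # per-tree probability arrays for one block at a time.
    parts = []
    for block in np.array_split(X, n_chunks):
        parts.append(model.predict(block))
    return np.concatenate(parts)

testlabel = predict_in_chunks(clf, testfeatures)

Switching to a 64-bit Python build (with enough RAM) is the other obvious way out.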
