MemoryError when calling sklearn's RandomForestClassifier

flyinghorse_2012 2016-01-18 08:54:34
I've recently been taking part in a big-data competition and am running the data locally: 32-bit Windows, a Python development environment, and the sklearn machine-learning package.
Calling sklearn's RandomForestClassifier raises a MemoryError, and Task Manager does show memory usage spiking. Could someone advise me on how to deal with this?
The code is as follows:
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

ISOTIMEFORMAT = '%Y-%m-%d %X'

print 'Begin:'
print time.strftime(ISOTIMEFORMAT,time.localtime())

resutlfeaturespath = r"C:\Users\robbert\Desktop\bigDataCompetition\tianyi\data\resultfeaturesday"
featuresdatapath = r"C:\Users\robbert\Desktop\bigDataCompetition\tianyi\data\week6and7features.txt"
traindata = np.loadtxt(featuresdatapath,delimiter = ',',dtype = np.int)

trainfeatures = traindata[:,:6]
trainlabel = traindata[:,6]

clf = RandomForestClassifier(n_estimators = 100)
clf.fit(trainfeatures,trainlabel)

print 'model constructed:'
print time.strftime(ISOTIMEFORMAT,time.localtime())

for i in range(7):
    # predict the feature file for each of the 7 days, one file at a time
    resultfp = resutlfeaturespath + str(i + 1) + ".txt"
    testdata = np.loadtxt(resultfp, delimiter = ',', dtype = np.int)
    testfeatures = testdata[:,:6]
    testlabel = clf.predict(testfeatures)
    resultpath = resutlfeaturespath + str(i + 1) + "result.txt"
    np.savetxt(resultpath, testlabel, delimiter = ',', fmt = '%d')  # np.save takes no delimiter and needs the array; savetxt writes the labels as text

print 'Done:'
print time.strftime(ISOTIMEFORMAT,time.localtime())

The output is as follows:
Begin:
2016-01-18 20:36:41
model constructed:
2016-01-18 20:37:08
Traceback (most recent call last):
File "C:\Users\robbert\Desktop\bigDataCompetition\tianyi\pythonproject2\train_randomForest\train_randomForest.py", line 32, in <module>
testlabel = clf.predict(testfeatures)
File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 498, in predict
proba = self.predict_proba(X)
File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 547, in predict_proba
for e in self.estimators_)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 183, in __init__
self.results = batch()
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Python27\lib\site-packages\sklearn\ensemble\forest.py", line 125, in _parallel_helper
return getattr(obj, methodname)(*args, **kwargs)
File "C:\Python27\lib\site-packages\sklearn\tree\tree.py", line 673, in predict_proba
proba = self.tree_.predict(X)
File "sklearn/tree/_tree.pyx", line 736, in sklearn.tree._tree.Tree.predict (sklearn\tree\_tree.c:8449)
File "sklearn/tree/_tree.pyx", line 738, in sklearn.tree._tree.Tree.predict (sklearn\tree\_tree.c:8321)
MemoryError
3 replies
斯温jack 2016-09-30
sklearn's random forest defaults to n_estimators = 10. Also, judging from your code you are not using parallelism, i.e. n_jobs = 1, so only one CPU core is used; switching to multi-core with n_jobs = -1 would only make the memory situation worse. You could consider using several random forests, each with a fairly small n_estimators, and then vote/average over their results; or simply build the random forest yourself out of decision trees, predicting one tree at a time and averaging at the end. That guarantees nothing runs in parallel and saves some memory.
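A minimal sketch of that tree-by-tree idea, assuming the fitted clf and the testfeatures array from the code in the original post; it keeps only a running sum of the per-tree class probabilities instead of letting the forest hold one array per tree at once:

import numpy as np

# Average the class probabilities one tree at a time, so peak memory is
# roughly a single (n_samples, n_classes) array rather than one per tree.
proba_sum = None
for tree in clf.estimators_:          # the individual DecisionTreeClassifier objects
    p = tree.predict_proba(testfeatures)
    if proba_sum is None:
        proba_sum = p
    else:
        proba_sum += p
proba_sum /= len(clf.estimators_)
testlabel = clf.classes_.take(np.argmax(proba_sum, axis=1))

The same loop also works across several small forests trained separately: predict with each one, add up the probabilities, and take the argmax at the end.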
weixin_33601960 2016-01-23
A big-data competition, impressive. I'm a newbie, just here to bump the thread.
flyinghorse_2012 2016-01-18
To add the data sizes: traindata.shape = (369605, 7), testdata.shape = (1486780, 7). I also ran into a MemoryError before when using pandas. Is my RAM simply too small, or is something else the problem?
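For a rough estimate, assuming a two-class problem: this sklearn version appears to collect one float64 probability array per tree (that is the joblib Parallel call in the traceback) before summing them, which is about 100 × 1,486,780 × 2 × 8 bytes ≈ 2.4 GB, more than a 32-bit Python process (roughly 2 GB of address space on Windows) can hold. A minimal sketch of predicting in chunks instead, assuming the fitted clf and testfeatures from the code above (predict_in_chunks and the chunk count are made-up names for illustration):

import numpy as np

def predict_in_chunks(model, X, n_chunks = 20):
    # Split the test rows into blocks so each predict call only holds the
    # per-tree probability arrays for one block at a time.
    parts = []
    for block in np.array_split(X, n_chunks):
        parts.append(model.predict(block))
    return np.concatenate(parts)

testlabel = predict_in_chunks(clf, testfeatures)

Switching to a 64-bit Python build (with enough RAM) is the other obvious way out.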
