How large a dataset can k-means clustering in R handle? I have about 2 million samples with seven variables, and whenever I try to determine k, I keep getting an error that the data is too large. How do I fix this?

Watch_dou 2017-07-12 10:54:53
# Determine the best k graphically (elbow method)
wssplot <- function(data, nc = 15, seed = 1234) {
  # wss[1]: total within-group sum of squares when all points form one cluster
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters',
       ylab = 'Within groups sum of squares')
}
wssplot(norm_data)
The result is always:
> wssplot(norm_data)
Error: cannot allocate vector of size 132.3 Mb
Called from: aperm.default(X, c(s.call, s.ans))

How can I solve this?

3 replies
wung888888 2017-08-13
Consider whether you really need all this data explicitly, or whether the matrix can be sparse. There is good support in R for sparse matrices (see the Matrix package, for example).

Keep all other processes and objects in R to a minimum when you need to make objects of this size. Use gc() to clear now-unused memory, or better, only create the object you need in one session.

If the above cannot help, get a 64-bit machine with as much RAM as you can afford, and install 64-bit R. If you cannot do that, there are many online services for remote computing. Failing that, memory-mapping tools like the ff package (or bigmemory, as Sascha mentions) will help you build a new solution. In my limited experience ff is the more advanced package, but you should read the High Performance Computing topic on CRAN Task Views.

Source: https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb
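For the elbow plot that actually triggered the error, a common practical workaround is to estimate k on a random subsample and only fit the final model on the full data. A minimal sketch, assuming norm_data is the numeric matrix/data frame from the question (the 50000-row sample size and k = 4 are illustrative choices, not from this thread):

# Estimate k on a random subsample; with only 7 variables, a few
# tens of thousands of rows usually show the same elbow as all 2M.
set.seed(1234)
sub_data <- norm_data[sample(nrow(norm_data), 50000), ]
wssplot(sub_data)                 # elbow plot now fits in memory

# Once k is chosen (k = 4 here is a placeholder), fit once on the
# full data; tot.withinss equals sum(kmeans(...)$withinss) above.
fit <- kmeans(norm_data, centers = 4, iter.max = 50)
gc()                              # release memory held by temporaries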
zara 2017-07-12
I haven't used R, but judging by that message, it's usually that there isn't enough available physical memory.
Only 132.3 MB? Even a 15-year-old computer should handle that.
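Note that the 132.3 Mb in the error message is only the size of the single vector R failed to allocate at that moment, not the session's total footprint; R may already be holding most of the available RAM when that last allocation is attempted. A quick way to check, in base R:

# gc() triggers garbage collection and reports the memory currently
# held by R (the "(Mb)" columns); compare with the machine's free RAM.
gc()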
