Here's my situation. When building a random forest, the InMemBuilder I'm using requires a dataset path to be passed in:
InMemBuilder(TreeBuilder treeBuilder, org.apache.hadoop.fs.Path dataPath, org.apache.hadoop.fs.Path datasetPath)
Following the reply at http://stackoverflow.com/questions/24744831/apache-mahout-how-to-save-a-dataset-object-to-hdfs, I save the generated Dataset as JSON:
Dataset dataInfo = new Dataset(attrs, values, nb, regression);
DFUtils.storeWritable(conf, infoPath, new Text(dataInfo.toJSON()));
But the Dataset file saved on HDFS starts with a few garbled bytes, so the job fails with errors like:
14/12/09 15:12:04 INFO mapreduce.Job: Task Id : attempt_1418092978489_0003_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: org.codehaus.jackson.JsonParseException: Unexpected character (' ' (code 65533 / 0xfffd)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader@150ff8e3; line: 1, column: 2]
at org.apache.mahout.classifier.df.data.Dataset.fromJSON(Dataset.java:375)
at org.apache.mahout.classifier.df.data.Dataset.load(Dataset.java:330)
at org.apache.mahout.classifier.df.mapreduce.Builder.loadDataset(Builder.java:224)
at org.apache.mahout.classifier.df.mapreduce.MapredMapper.setup(MapredMapper.java:61)
at org.apache.mahout.classifier.df.mapreduce.inmem.InMemMapper.setup(InMemMapper.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
If I manually strip the leading garbage bytes, the job runs fine.
Environment: Ubuntu 14.04, hadoop-2.2.0
Any suggestions? Or is there some other way to store the Dataset?
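For what it's worth, my guess is that the "garbage" is not corruption at all: `DFUtils.storeWritable` serializes the `Text` as a Hadoop Writable, and `Text.write()` prefixes the UTF-8 payload with a variable-length byte count. If `Dataset.load` then reads the file back as raw bytes and hands the whole thing to Jackson, that length prefix sits in front of the `{`, which matches the parse failure at line 1, column 2. Below is a minimal self-contained sketch of the mismatch; the length-prefix encoding here is simplified (Hadoop's real vint format differs in detail), and the suggested fix in the comment (writing raw bytes via `fs.create(...)`) is an assumption on my part, not something I've verified against your cluster:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class DatasetJsonDemo {

    // Mimic what storing "new Text(json)" as a Writable does: the JSON payload
    // gets a binary length prefix, so the file no longer starts with '{'.
    // (Simplified: Hadoop's Text uses a vint, not a single length byte.)
    static byte[] writableStyle(String json) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            byte[] payload = json.getBytes(StandardCharsets.UTF_8);
            out.writeByte(payload.length); // length prefix precedes the JSON
            out.write(payload);
            out.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Candidate fix: write the JSON string's raw UTF-8 bytes with no prefix.
    // On HDFS the equivalent would be something like
    //   FSDataOutputStream out = fs.create(datasetPath);
    //   out.write(dataset.toJSON().getBytes(StandardCharsets.UTF_8));
    //   out.close();
    // so that a raw read of the file yields valid JSON starting at byte 0.
    static byte[] rawStyle(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = "{\"values\":null}";
        byte[] prefixed = writableStyle(json);
        byte[] raw = rawStyle(json);
        // The prefixed form starts with a binary length byte, not '{',
        // which is exactly what trips up Dataset.fromJSON.
        System.out.println("first byte (writable-style): " + prefixed[0]);
        System.out.println("first byte (raw):            " + (char) raw[0]);
    }
}
```

If this diagnosis is right, manually deleting the leading bytes works for the same reason: it restores a file whose first byte is `{`.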