Spark输出ORC格式的文件
请问spark上RDD怎么输出到ORC文件里,我知道spark-sql可以做到,但是除了这种方法呢?
我想知道用saveAsNewAPIHadoopFile 这个API可以做到么?
我现在的代码是
ex.map(x => (NullWritable.get(), x)).repartition(partition.toInt).saveAsNewAPIHadoopFile("/user/tmp/orc", classOf[NullWritable], classOf[Model], classOf[OrcNewOutputFormat])
case class Model(uid: String, province: String, city: String)
里面的ex是RDD[Model]
运行报错
Model cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow at org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat$OrcRecordWriter.write(OrcNewOutputFormat.java:37) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)