Spark Functions: distinct and first 2015-08-21 21:05

distinct

Returns a new RDD containing only the distinct elements of the source RDD. It takes an optional parameter that sets the number of partitions of the resulting RDD.

scala> val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:21

scala> c.distinct.collect
res11: Array[String] = Array(Dog, Cat, Gnu, Rat)

scala> c.distinct(2).partitions.length
res12: Int = 2

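For reference, distinct can be expressed in terms of other RDD operations: pair every element with a dummy value, collapse duplicate keys with reduceByKey, then drop the dummy values again. The sketch below is illustrative only (the helper name distinctSketch is made up here, not part of the Spark API), and it assumes a plain RDD[String] to keep the types simple.

import org.apache.spark.rdd.RDD

// Illustrative sketch: an equivalent of distinct built from map and
// reduceByKey. The dummy null value exists only so that duplicates
// collapse onto the same key; numPartitions controls the shuffle,
// just like the optional argument to distinct.
def distinctSketch(rdd: RDD[String], numPartitions: Int): RDD[String] =
  rdd.map(x => (x, null))
     .reduceByKey((a, _) => a, numPartitions)
     .map(_._1)

distinctSketch(c, 2).collect() should return the same elements as c.distinct(2).collect(), though the ordering may differ because the result comes out of a shuffle.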

first

Returns the first element of the RDD.

scala> val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[21] at parallelize at <console>:21

scala> c.first
res10: String = Gnu
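
Note that first only fetches the head of the first non-empty partition, and it throws an exception when the RDD is empty, whereas take(1) returns an empty array. The snippet below is a sketch (not from the original post), assuming the same spark-shell session where sc is available:

// Sketch: first versus take(1) on an empty RDD.
val empty = sc.parallelize(Seq.empty[String], 2)

empty.take(1).headOption   // None, no exception is thrown
// empty.first()           // would throw UnsupportedOperationException: empty collection
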
Tags: #Spark    Posted in Spark-API