Spark functions: count, countByKey, and countByValue 2015-08-21 21:02

count

Returns the number of elements in the RDD, as a Long.

scala> val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
scala> c.count
res2: Long = 4

countByKey

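Similar to count, but counts elements per key, so it applies to RDDs of key-value pairs.

Note: this function returns a Map from each key to its count, not a single Int.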

scala> val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
scala> c.countByKey
res1: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
scala> c.countByKey.size
res2: Int = 2
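
To make the per-key semantics concrete, here is a minimal sketch that reproduces the same result on an ordinary Scala collection. It is only a local analogue, not how Spark implements countByKey; the names CountByKeySketch and countByKeyLocal are hypothetical and exist only for this illustration.

```scala
// Local-collection analogue of countByKey (illustration only, not Spark's implementation).
object CountByKeySketch {
  // Group the pairs by their key, then count how many pairs fall under each key.
  def countByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.size.toLong }

  def main(args: Array[String]): Unit = {
    val pairs = List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog"))
    val counts = countByKeyLocal(pairs)
    println(counts)      // one entry per distinct key
    println(counts.size) // number of distinct keys, not number of elements
  }
}
```

Because countByKey returns the whole Map to the driver, it is best suited to RDDs with a modest number of distinct keys. For very large key sets, the Spark documentation suggests `rdd.mapValues(_ => 1L).reduceByKey(_ + _)`, which keeps the counts in an RDD.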

countByValue

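Counts how many times each distinct value occurs in an RDD. It returns a Map whose keys are the element values and whose values are the number of occurrences.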

scala> val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:21

scala> b.countByValue
res3: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
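
The same counting can be sketched on a plain Scala collection to show what the returned Map contains. This is a local analogue under the assumption that the data fits in memory, not Spark's actual implementation; CountByValueSketch and countByValueLocal are hypothetical names used only here.

```scala
// Local-collection analogue of countByValue (illustration only, not Spark's implementation).
object CountByValueSketch {
  // Group equal elements together, then count the size of each group.
  def countByValueLocal[T](xs: Seq[T]): Map[T, Long] =
    xs.groupBy(identity).map { case (value, occurrences) => value -> occurrences.size.toLong }

  def main(args: Array[String]): Unit = {
    val b = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)
    val counts = countByValueLocal(b)
    // Print in key order; a Map itself has no guaranteed ordering.
    counts.toSeq.sortBy(_._1).foreach { case (value, n) => println(s"$value -> $n") }
  }
}
```

The entries of the returned Map are unordered, which is why the REPL output above lists the keys in an arbitrary order; sort the entries explicitly if a stable order is needed.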