Spark functions: countApproxDistinct and countApproxDistinctByKey 2015-08-21 21:03

countApproxDistinct

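Counts the number of distinct elements in an RDD. The result is an approximation, and the relativeSD parameter controls its accuracy: the smaller relativeSD is, the more accurate the result.

Note: relativeSD must not be 0.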

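// a holds 1..10000 and b holds 1..10001, each split into 20 partitions
val a = sc.parallelize(1 to 10000, 20)
val b = sc.parallelize(1 to 10001, 20)
// c is five copies of a plus b, so it still has only 10001 distinct values
val c = a ++ a ++ a ++ a ++ a ++ b
c.countApproxDistinct(0.1)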
res14: Long = 8224

b.countApproxDistinct(0.05)
res15: Long = 9760

b.countApproxDistinct(0.01)
res16: Long = 9949

b.countApproxDistinct(0.001)
res0: Long = 10000
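
To put these numbers in perspective, the relative error of each estimate can be worked out by hand. The sketch below is plain Scala (no Spark required) and assumes the true distinct count is 10001 for both b and c, since a only contributes values that are already in b. The names RelativeErrors, trueCount, estimates and relativeError are introduced here purely for illustration.

```scala
object RelativeErrors {
  // Assumed true number of distinct values in both b and c (1 to 10001).
  val trueCount: Long = 10001L

  // (relativeSD, estimate) pairs copied from the session above.
  val estimates: Seq[(Double, Long)] =
    Seq(0.1 -> 8224L, 0.05 -> 9760L, 0.01 -> 9949L, 0.001 -> 10000L)

  // Relative error = |estimate - true| / true.
  def relativeError(estimate: Long): Double =
    math.abs(estimate - trueCount).toDouble / trueCount

  def main(args: Array[String]): Unit =
    estimates.foreach { case (sd, est) =>
      // Print the error as a percentage with two decimals.
      println(f"relativeSD=$sd estimate=$est error=${relativeError(est) * 100}%.2f%%")
    }
}
```

The error shrinks from roughly 17.8% at relativeSD = 0.1 to about 0.01% at relativeSD = 0.001. The first run misses by more than 10%, which suggests that relativeSD is best read as a relative standard deviation (as its name implies) rather than as a hard upper bound on the error.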

countApproxDistinctByKey

Similar to countApproxDistinct, but it estimates the number of distinct values separately for each key.

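// Four keys, sampled with replacement (seed 0) into 10000 elements
val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
// c supplies one unique value for every element of b
val c = sc.parallelize(1 to b.count().toInt, 20)
// d pairs each sampled key with its unique value
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect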
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))
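
Because every value in c is unique and zip pairs the elements one to one, the exact per-key distinct counts must add up to 10000. As a quick sanity check, the sketch below (plain Scala, no Spark required) sums the per-key estimates copied from the output above for each relativeSD. The names PerKeyTotals, results, expectedTotal and totals are illustrative only.

```scala
object PerKeyTotals {
  // Per-key estimates copied from the session above, grouped by relativeSD.
  val results: Seq[(Double, Map[String, Long])] = Seq(
    0.1   -> Map("Rat" -> 2567L, "Cat" -> 3357L, "Dog" -> 2414L, "Gnu" -> 2494L),
    0.01  -> Map("Rat" -> 2555L, "Cat" -> 2455L, "Dog" -> 2425L, "Gnu" -> 2513L),
    0.001 -> Map("Rat" -> 2562L, "Cat" -> 2464L, "Dog" -> 2451L, "Gnu" -> 2521L)
  )

  // The exact counts sum to 10000 because each value appears exactly once.
  val expectedTotal: Long = 10000L

  // Sum of the per-key estimates for each relativeSD.
  def totals: Seq[(Double, Long)] =
    results.map { case (sd, perKey) => sd -> perKey.values.sum }

  def main(args: Array[String]): Unit =
    totals.foreach { case (sd, total) =>
      println(s"relativeSD=$sd total=$total difference=${total - expectedTotal}")
    }
}
```

The totals are 10832, 9948 and 9998, so the sum moves closer to 10000 as relativeSD decreases. Most of the excess at relativeSD = 0.1 comes from the Cat key, whose estimate of 3357 is far above the roughly 2460 reported by the two more precise runs.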