Spark functions: groupBy and groupByKey 2015-08-21 21:14

groupBy

groupBy applies a function to each element to generate a key, turning the data into key-value pairs; elements that share the same key are then grouped together.

scala> val a = sc.parallelize(1 to 9)

scala> a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res67: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9)))

scala> a.groupBy(x => x % 3).collect
res68: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(3, 6, 9)), (2,CompactBuffer(2, 5, 8)),
    (1,CompactBuffer(1, 4, 7)))
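The same semantics can be checked locally with plain Scala collections, whose `groupBy` works the same way (minus the distributed `CompactBuffer` result type). A minimal sketch, needing no cluster or SparkContext:

```scala
// Plain Scala collections: groupBy derives a key from each element,
// then collects elements that share the same key into one group.
val nums = (1 to 9).toList

// Group by parity, mirroring the RDD example above.
val byParity: Map[String, List[Int]] =
  nums.groupBy(x => if (x % 2 == 0) "even" else "odd")
// byParity("even") == List(2, 4, 6, 8)
// byParity("odd")  == List(1, 3, 5, 7, 9)

// Group by remainder mod 3.
val byMod3: Map[Int, List[Int]] = nums.groupBy(_ % 3)
// byMod3(0) == List(3, 6, 9)
```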

groupByKey

An operation on key-value RDDs. It is similar to groupBy, except that the grouping key is not produced by a supplied function but is the key already present in each element.

scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[93] at parallelize at <console>:21

scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[94] at keyBy at <console>:23

scala> b.collect
res69: Array[(Int, String)] = Array((3,dog), (5,tiger), (4,lion), (3,cat), (6,spider), (5,eagle))

scala> b.groupByKey.collect
res70: Array[(Int, Iterable[String])] = Array((4,CompactBuffer(lion)), (6,CompactBuffer(spider)), 
(3,CompactBuffer(dog, cat)), (5,CompactBuffer(tiger, eagle)))

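`keyBy` and `groupByKey` can likewise be sketched with plain collections: `keyBy(f)` is equivalent to mapping each element `x` to the pair `(f(x), x)`, and `groupByKey` then gathers the values that share a key. A local sketch using the same word list as above:

```scala
// keyBy(_.length): pair each word with its length as the key.
val words = List("dog", "tiger", "lion", "cat", "spider", "eagle")
val keyed: List[(Int, String)] = words.map(w => (w.length, w))

// groupByKey: group the pairs by key, keeping only the values.
val grouped: Map[Int, List[String]] =
  keyed.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
// grouped(3) == List("dog", "cat")
// grouped(5) == List("tiger", "eagle")
```

On real RDDs, note that groupByKey shuffles every value across the cluster; when the goal is a per-key reduction, reduceByKey or aggregateByKey is usually the better choice.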
Tags: #Spark    Post on Spark-API