Spark Functions: coalesce and repartition 2015-08-20 21:05

coalesce

coalesce adjusts the number of partitions of an RDD. The first argument is the target partition count, and the second specifies whether a shuffle is performed. If the target count is greater than the current count, shuffle must be set to true; otherwise coalesce silently keeps the original number of partitions.

var x = sc.parallelize(1 to 12, 6)

// Print the contents of each partition.
// Note: iter.toList exhausts the iterator, so materialize it once
// and return a fresh iterator; returning the already-consumed iter
// would make collect return an empty result.
def myfunc(index: Int, iter: Iterator[Int]): Iterator[Int] = {
  val elems = iter.toList
  println("[partID:" + index + ", val: " + elems + "]")
  elems.iterator
}
x.mapPartitionsWithIndex(myfunc).collect

// coalesce down to 2 partitions; no shuffle is needed when decreasing
var y = x.coalesce(2, false)
y.mapPartitionsWithIndex(myfunc).collect

// Increasing the partition count requires shuffle = true.
// coalesce up to 12 partitions
var z = x.coalesce(12, true)
z.mapPartitionsWithIndex(myfunc).collect
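
As a sanity check on the rule above, here is a minimal sketch (variable names are illustrative, assuming x and z from the snippet above are still in scope in spark-shell) showing that coalesce with a larger target but shuffle = false leaves the partition count unchanged:

// Without shuffle, coalesce cannot increase the partition count,
// so w still has 6 partitions even though 12 were requested.
var w = x.coalesce(12, false)
println(w.partitions.length)  // 6
println(z.partitions.length)  // 12, because z was created with shuffle = true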

repartition

repartition is simply coalesce with the shuffle flag fixed to true; in the Spark source it is defined as coalesce(numPartitions, shuffle = true).

var x = sc.parallelize(1 to 12, 6)
// repartition(2) is equivalent to coalesce(2, shuffle = true)
var m = x.repartition(2)
m.mapPartitionsWithIndex(myfunc).collect
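
To see the equivalence, the short sketch below (illustrative, assuming x and m from the snippet above are still in scope) compares the partition counts produced by the two calls:

// Both calls shuffle the data into 2 partitions.
var n = x.coalesce(2, true)
println(m.partitions.length)  // 2
println(n.partitions.length)  // 2

Because both paths shuffle, the 12 elements end up spread roughly evenly across the 2 resulting partitions.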
Tags: #Spark    Post on Spark-API