Spark functions: id, intersection, keyBy, keys, values and lookup 2015-08-24 21:01

id

As the name suggests, this is the RDD's id: a unique Int identifier assigned to the RDD within its SparkContext.

intersection

Returns the elements that appear in both RDDs:

val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)
z.sortBy(e => e,true).collect
res79: Array[Int] = Array(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

keyBy

Generates a key for each element by applying the given function, producing a key-value (pair) RDD.

scala> val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[123] at parallelize at <console>:21

scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[124] at keyBy at <console>:23

scala> b.collect
res80: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))

keys

Extracts the key from each element of a key-value RDD, producing a new RDD of the keys.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

values

Extracts the value from each element of a key-value RDD, producing a new RDD of the values.
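Element-wise, `values` behaves like `map(_._2)` on each pair. A plain-Scala sketch of the same transformation (no SparkContext needed; with an actual pair RDD `b` you would call `b.values.collect`, in the same style as the `keys` example above):

```scala
// The same pair data as the keys example, as a plain Scala collection
val pairs = List((3, "dog"), (5, "tiger"), (4, "lion"),
                 (3, "cat"), (7, "panther"), (5, "eagle"))

// What RDD.values does to each element: keep only the second component
val vals = pairs.map(_._2)
// vals: List(dog, tiger, lion, cat, panther, eagle)
```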

lookup

Returns all values associated with the given key in a key-value RDD. Note that the result is a Scala sequence, not an RDD.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)
res0: Seq[String] = WrappedArray(tiger, eagle)
Tags: #Spark    Post on Spark-API