Spark functions: persist, unpersist, cache and getStorageLevel 2015-08-21 21:12

Overview

persist sets the RDD's StorageLevel.

Calling persist with no arguments, or cache, is equivalent to persist(StorageLevel.MEMORY_ONLY).

unpersist removes the RDD's cached blocks from memory and disk.

getStorageLevel queries the RDD's current StorageLevel.

Note: an RDD's StorageLevel can only be set once; it cannot be changed after it has been assigned.
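As a minimal sketch of that restriction (assuming a running SparkContext bound to `sc`, as in the spark-shell session below), an explicit level, the error on reassignment, and unpersist as the escape hatch look like this:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 3)

// Set an explicit level instead of the MEMORY_ONLY default.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Assigning a different level now would throw
// UnsupportedOperationException: cannot change storage level
// of an RDD after it was already assigned a level.
// rdd.persist(StorageLevel.DISK_ONLY)

// unpersist removes the cached blocks and clears the level...
rdd.unpersist()

// ...after which a new level may be assigned.
rdd.persist(StorageLevel.DISK_ONLY)
println(rdd.getStorageLevel.description)
```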

scala> var a = sc.parallelize(1 to 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[81] at parallelize at <console>:21

scala> a.getStorageLevel
res56: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)

scala> a.getStorageLevel.description
res57: String = Serialized 1x Replicated

scala> a.persist
res58: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[81] at parallelize at <console>:21

scala> a.getStorageLevel.description
res59: String = Memory Deserialized 1x Replicated

scala> var b = sc.parallelize(1 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[82] at parallelize at <console>:21

scala> b.getStorageLevel.description
res60: String = Serialized 1x Replicated

scala> b.cache
res61: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[82] at parallelize at <console>:21

scala> b.getStorageLevel.description
res62: String = Memory Deserialized 1x Replicated

scala>
Tags: #Spark    Post on Spark-API