Getting Started with Spark MLlib: A K-means Example    2015-08-07 19:08

Introduction

K-means: you specify the number of clusters, and the algorithm automatically partitions the dataset into that many clusters.
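To make the idea concrete, here is a minimal single-machine sketch of the k-means loop on 1-D values, written without Spark; the sample values, the fixed 20 iterations, and the naive "first k points" initialization are illustrative assumptions only, not what MLlib does internally (MLlib defaults to k-means|| initialization).

import scala.math.abs

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val points = Array(48.2, 47.5, 43.0, 68.8, 42.4, 69.5)
    val k = 2
    // Naive initialization: take the first k points as centroids.
    var centroids = points.take(k)
    for (_ <- 1 to 20) {
      // Assignment step: group each point under its nearest centroid.
      val assigned = points.groupBy(p => centroids.minBy(c => abs(p - c)))
      // Update step: move each centroid to the mean of its assigned points.
      centroids = centroids.map(c =>
        assigned.get(c).map(ps => ps.sum / ps.length).getOrElse(c))
    }
    println(centroids.mkString("centroids: ", ", ", ""))
  }
}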

Prepare the data (one value per line):

48.2
47.5
48.3
47.5
43
42.9
45.7
68.8
43
42.8
43
42.4
43.1
42.3
68.9
69.5
48.2
67.5
48.3
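
If you want to reproduce the run, one simple way to write these values into the file that the code reads (the /tmp/data/a.txt path used in the example) is, for instance:

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Write the sample values, one per line, to the path used by sc.textFile below.
val values = Seq(48.2, 47.5, 48.3, 47.5, 43.0, 42.9, 45.7, 68.8, 43.0, 42.8,
                 43.0, 42.4, 43.1, 42.3, 68.9, 69.5, 48.2, 67.5, 48.3)
Files.createDirectories(Paths.get("/tmp/data"))
Files.write(Paths.get("/tmp/data/a.txt"),
            values.mkString("\n").getBytes(StandardCharsets.UTF_8))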

Code:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data (one numeric value per line)
val data = sc.textFile("file:///tmp/data/a.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.toDouble)).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

parsedData.foreach(x => println(x + " belongs to cluster " + clusters.predict(x)))

// Print all elements of cluster 0:
// parsedData.filter(clusters.predict(_)==0).foreach(x => println(x))

// Evaluate clustering by computing Within Set Sum of Squared Errors
//val WSSSE = clusters.computeCost(parsedData)
// println("Within Set Sum of Squared Errors = " + WSSSE)
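
The commented-out WSSSE lines point at a simple way to compare different cluster counts: train a model for each candidate k and print its cost. A rough sketch, reusing parsedData and numIterations from above (the range 1 to 5 is an arbitrary choice):

// Train a model for each candidate k and print its Within Set Sum of Squared Errors.
(1 to 5).foreach { k =>
  val model = KMeans.train(parsedData, k, numIterations)
  println(s"k = $k, WSSSE = " + model.computeCost(parsedData))
}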

Result:

[42.9] belongs to cluster 0
[45.7] belongs to cluster 0
[68.8] belongs to cluster 1
[43.0] belongs to cluster 0
[42.8] belongs to cluster 0
[43.0] belongs to cluster 0
[42.4] belongs to cluster 0
[43.1] belongs to cluster 0
[42.3] belongs to cluster 0
[68.9] belongs to cluster 1
[69.5] belongs to cluster 1
[48.2] belongs to cluster 0
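
To see where the two centroids ended up (roughly one for the values in the 40s and one for those around 68-69, judging by the output above), the trained model exposes clusterCenters:

// Print the learned centroids; index i corresponds to "cluster i" in the output above.
clusters.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"cluster $i center: $center")
}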

Tags: #Spark #MLlib #MachineLearning