Using SparkSQL to Analyze the Leaked CSDN User Data 2015-08-20 21:12

Overview

The leaked CSDN user data has the following format:

aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com___csdn_1
aaaaaaa # bbbbbb # xxxxxx@hotmail.com

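Each line holds a username, a password, and an email address, in that order. Fields are separated by " # " (a hash sign with one space on each side).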
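As a rough illustration of this format, the sketch below parses a single line in plain Scala, without Spark. The names `ParsedUser` and `parseLine` are hypothetical and do not appear in the original post. The sketch also assumes that a line which does not split into exactly three fields should be treated as malformed.

```scala
// Hypothetical record type for one line of the dump (not from the original post).
case class ParsedUser(username: String, password: String, email: String)

// Split on the " # " separator and keep only lines with exactly three fields.
// Returns None for malformed lines instead of throwing an exception.
def parseLine(line: String): Option[ParsedUser] =
  line.split(" # ") match {
    case Array(u, p, e) => Some(ParsedUser(u.trim, p.trim, e.trim))
    case _              => None
  }
```

Returning an `Option` makes it easy to drop bad lines later, for example with `flatMap(parseLine)` over a collection of lines.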

Finding the top N most commonly used passwords

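// In spark-shell (Spark 1.x), `sc` and `sqlContext` are predefined; this import enables toDF().
import sqlContext.implicits._

case class User(username: String, password: String, email: String)

val filePath = "/data/www.csdn.net.sql"
val linesRDD = sc.textFile(filePath)
// Fields are separated by " # ", not by commas; lines without exactly three fields are skipped.
val partsRDD = linesRDD.map(l => l.split(" # ")).filter(_.length == 3)
val csdnRDD = partsRDD.map(r => User(username = r(0).trim, password = r(1).trim, email = r(2).trim))
val csdnDF = csdnRDD.toDF()
csdnDF.printSchema()
csdnDF.count()

// Register the DataFrame as a temporary table so it can be queried with SQL.
csdnDF.registerTempTable("csdn")

// Top 50 passwords by number of users. The query stays on one line because a plain Scala string literal cannot span lines.
val pwdSet = sqlContext.sql("SELECT password, COUNT(password) AS password_cnt FROM csdn GROUP BY password ORDER BY password_cnt DESC LIMIT 50")
pwdSet.collect().foreach(r => println("Password: " + r(0) + " Count: " + r(1)))

// The same grouping through the DataFrame API; show() prints the first 20 rows, unsorted.
csdnDF.groupBy("password").count().show()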
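
To sanity-check the query on a small in-memory sample, the same "group, count, sort descending, take N" logic can be sketched with plain Scala collections. This is an illustrative sketch only: `topPasswords` is a hypothetical helper that is not in the original post. It breaks ties by password text to keep the output deterministic, whereas the SQL query leaves tie order unspecified.

```scala
// Hypothetical helper: the n most frequent passwords in a small sample.
def topPasswords(passwords: Seq[String], n: Int): Seq[(String, Int)] =
  passwords
    .groupBy(identity)                            // password -> all of its occurrences
    .map { case (pwd, occ) => (pwd, occ.size) }   // password -> count
    .toSeq
    .sortBy { case (pwd, cnt) => (-cnt, pwd) }    // count descending, then password ascending
    .take(n)

// Usage on a tiny hand-made sample:
val sample = Seq("123456", "abc", "123456", "abc", "123456", "zzz")
topPasswords(sample, 2).foreach { case (pwd, cnt) =>
  println("Password: " + pwd + " Count: " + cnt)
}
```

This only suits data that fits in memory on one machine; for the full dump, the Spark version above is the appropriate tool.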
Tags: #Spark #SparkSQL    Post on Spark