使用Hive分析CSDN泄露的用户数据 2015-08-09 23:00

准备

  1. 已经将csdn用户数据库文件上传到HDFS的/data/目录下,文件为是csdn.correct。
  2. 已经修正了原始文件部分记录末尾多出了"__csdn1"字样的问题
  3. 已经将原始文件中记录的分隔符由" # “改为”,“

创建表

连接hive:

/opt/hive/bin/beeline --color=true --fastConnect=true
!connect jdbc:hive2://ctrl:10000
username:root

创建表:

1
2
3
4
5
6
7
8
CREATE EXTERNAL TABLE csdn(
    username STRING,
    password STRING,
    email STRING    
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/data/csdn.correct' OVERWRITE INTO TABLE csdn;

分析

查询总记录数

1
SELECT COUNT(*) from csdn;

分析最多人使用的TOPn个密码

1
SELECT password,COUNT(password) AS password_cnt FROM csdn GROUP BY password ORDER BY password_cnt DESC LIMIT 50;

注意: 以下命令不支持(UDF不能直接放在ORDER BY中,要使用别名)

1
SELECT password,COUNT(password) FROM csdn GROUP BY password ORDER BY COUNT(password) DESC LIMIT 50;

统计使用纯数字作为密码的人数

1
2
SELECT * FROM csdn WHERE password RLIKE '^\\d+$' LIMIT 20;
SELECT COUNT(1) FROM csdn WHERE password RLIKE '^\\d+$';

统计使用纯字母作为密码的人数

1
2
SELECT * FROM csdn WHERE password RLIKE '^[a-zA-Z]+$' LIMIT 20;
SELECT COUNT(1) FROM csdn WHERE password RLIKE '^[a-zA-Z]+$';

统计使用人数最多的Email服务商

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
SELECT lower(split(email, '@')[1]) AS emailProvider,COUNT(1) AS cnt 
    FROM csdn WHERE email LIKE '%@%' 
    GROUP BY lower(split(email, '@')[1]) 
    ORDER BY cnt DESC 
    LIMIT 50;

SELECT t.emailProvider,COUNT(*) AS cnt 
    FROM (SELECT lower(split(email, '@')[1]) AS emailProvider
          FROM csdn WHERE email LIKE '%@%') t
    GROUP BY t.emailProvider
    ORDER BY cnt DESC
    LIMIT 50;

注意: 以下命令不支持

1
2
3
4
SELECT lower(split(email, '@')[1]) AS emailProvider,COUNT(1) AS cnt 
    FROM csdn WHERE email LIKE '%@%' 
    GROUP BY emailProvider 
    ORDER BY cnt DESC LIMIT 50;

分析用户名密码相同的用户

1
2
SELECT * FROM csdn WHERE username=password LIMIT 50;
SELECT COUNT(*) FROM csdn WHERE username=password;

分析简单密码的用户(长度小于8,只有数字和字母)

1
2
SELECT * FROM csdn WHERE length(password)<8 AND password RLIKE "^\\w+$" LIMIT 50;
SELECT COUNT(*) FROM csdn WHERE length(password)<8 AND password RLIKE "^\\w+$";

分析密码长度分布

1
2
3
4
SELECT length(password),COUNT(*) as cnt
FROM csdn
GROUP BY length(password)
ORDER BY cnt DESC;
Tags: #Hive    Post on Hadoop