Hive TRANSFORM Example 2015-05-07 22:30

Overview

Hive UDFs and UDAFs must be written in Java. Hive also provides another, simpler way to achieve the same goal: TRANSFORM. TRANSFORM lets you implement UDF-like functionality in a variety of languages.

Hive additionally offers the MAP and REDUCE keywords, but these are generally just aliases for TRANSFORM; they do not guarantee that the script runs in the map phase or the reduce phase. See the official documentation for details.

Data Preparation

Create the table:

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Download the data:

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip

Load the data:

LOAD DATA LOCAL INPATH '<path>/u.data'
OVERWRITE INTO TABLE u_data;

SELECT COUNT(*) FROM u_data;

Prepare the transform function (Python)

# weekday_mapper.py
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  # Convert the Unix timestamp to an ISO weekday (1 = Monday .. 7 = Sunday)
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))
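The mapping logic can be tested locally before registering the script with Hive. A minimal sketch, which factors the per-line logic into a function and runs it on a hypothetical sample row (the field values are made up; the format matches u.data):

```python
import datetime

def map_line(line):
    # Same logic as weekday_mapper.py: split the TAB-delimited input row,
    # convert the Unix timestamp to an ISO weekday (1 = Monday .. 7 = Sunday),
    # and re-emit a TAB-delimited output row.
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# A hypothetical sample row in the u.data layout: userid, movieid, rating, unixtime
row = '196\t242\t3\t881250949'
print(map_line(row))
```

Note that the weekday printed depends on the local time zone, since fromtimestamp converts to local time.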

Note (from the Hive documentation):
Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns.

Note: no matter what field delimiter the Hive table itself uses, TRANSFORM always feeds the script TAB-delimited input and expects the script's output columns to be TAB-delimited as well.
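One more serialization detail worth guarding against: with Hive's default text serialization, NULL column values are passed to the script as the literal string \N (this is Hive's default null marker; verify it for your table's serde settings). A defensive sketch of the timestamp conversion, with the \N convention as an assumption:

```python
import datetime

NULL_MARKER = '\\N'  # assumed: Hive's default representation of NULL in TRANSFORM input

def safe_weekday(unixtime):
    # Return the ISO weekday for a Unix-timestamp string, or pass the NULL
    # marker through when the column was NULL or is not a valid number.
    if unixtime == NULL_MARKER:
        return NULL_MARKER
    try:
        return str(datetime.datetime.fromtimestamp(float(unixtime)).isoweekday())
    except ValueError:
        return NULL_MARKER

print(safe_weekday('\\N'))       # NULL passes through unchanged
print(safe_weekday('not-a-ts'))  # malformed input also maps to the NULL marker
```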

Create a new table:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';

Add the file containing the transform function, then use it:

ADD FILE /tmp/test/weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
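The final GROUP BY simply counts rows per weekday in the mapper's output. As a quick sanity check, the same aggregation can be reproduced in plain Python over a few hypothetical output rows (values made up for illustration):

```python
from collections import Counter

# Hypothetical rows in the u_data_new layout: userid, movieid, rating, weekday
rows = [
    '196\t242\t3\t5',
    '186\t302\t3\t2',
    '22\t377\t1\t5',
]

# Count rows per weekday, mirroring SELECT weekday, COUNT(*) ... GROUP BY weekday
counts = Counter(line.split('\t')[3] for line in rows)
for weekday, n in sorted(counts.items()):
    print(weekday, n)  # prints "2 1" then "5 2"
```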

References

  1. Official documentation