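Installing Pig (0.14.0) 2015-02-26 16:00

Overview

Pig is client-side software. It can be installed on any machine and does not need to be installed on the Hadoop cluster itself.

Installation Steps

Install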

wget http://mirror.bit.edu.cn/apache/pig/pig-0.14.0/pig-0.14.0.tar.gz
tar -zxvf pig-0.14.0.tar.gz
mv pig-0.14.0 /opt/hadoop/client/pig
cd /opt/hadoop/client/pig

Configure

Add Pig's bin directory to $PATH by appending the following line to /etc/profile:

export PATH=$PATH:/opt/hadoop/client/pig/bin
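To pick up the change in the current shell and confirm that it worked, something like the following can be used. This is only a sketch: it assumes Pig was moved to /opt/hadoop/client/pig as in the install step, and the PIG_BIN variable is a name introduced here for illustration. The case statement avoids appending the same directory twice if the snippet is run repeatedly.

```shell
# Assumption: Pig lives in /opt/hadoop/client/pig, as in the install step above.
PIG_BIN=/opt/hadoop/client/pig/bin

# Append PIG_BIN to PATH only if it is not already there.
case ":$PATH:" in
  *":$PIG_BIN:"*) ;;                       # already present, nothing to do
  *) export PATH="$PATH:$PIG_BIN" ;;
esac

# Check that the pig launcher can now be found.
if command -v pig >/dev/null 2>&1; then
  pig -version
else
  echo "pig not found; check that $PIG_BIN exists"
fi
```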

Run

cd bin
./pig -x mapreduce  # -x mapreduce selects cluster (MapReduce) mode rather than local mode

Pig uses the HADOOP_HOME environment variable to connect to the Hadoop cluster.
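Because cluster mode depends on HADOOP_HOME, it can help to check the variable before starting Pig. The helper below is a hypothetical sketch, not part of Pig: the function name check_hadoop_home is made up, and it assumes the usual layout in which the hadoop launcher sits at $HADOOP_HOME/bin/hadoop.

```shell
# Report whether HADOOP_HOME points at something that looks like a Hadoop install.
# Assumption: the launcher is at $HADOOP_HOME/bin/hadoop.
check_hadoop_home() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set"
    return 1
  fi
  if [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
    echo "no executable hadoop launcher under $HADOOP_HOME/bin"
    return 1
  fi
  echo "HADOOP_HOME=$HADOOP_HOME looks usable"
}

check_hadoop_home || echo "fix HADOOP_HOME before running pig -x mapreduce"
```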

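A Simple Example

Suppose we have a score sheet containing a student ID, a Chinese score, and a math score, with fields separated by |, as follows: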

20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98
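One way to create this file is with a here-document. The sketch below writes score.txt into the current directory (an assumption; run it from your home directory so the file ends up where the upload step expects it) and then checks that every line has exactly three |-separated fields.

```shell
# Write the five sample records to score.txt in the current directory.
cat > score.txt <<'EOF'
20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98
EOF

# Sanity check: count records and any lines that do not have three fields.
awk -F'|' 'NF != 3 { bad++ } END { print NR " records, " bad + 0 " malformed" }' score.txt
```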

Upload the file to HDFS

hadoop fs -put ~/score.txt /tmp/score.txt
hadoop fs -ls /tmp
hadoop fs -cat /tmp/score.txt

Load the raw data with LOAD

grunt> scores = LOAD 'hdfs://ctrl:9000/tmp/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);

Input file: 'hdfs://ctrl:9000/tmp/score.txt'
Relation (bag) name: scores
Each record (tuple) read from the input file is split on |
Each tuple has three fields: student ID (num), Chinese score (Chinese), and math score (Math), all of type int.

Inspect the schema of the relation

grunt> DESCRIBE scores;
scores: {num: int,Chinese: int,Math: int}

Suppose we need to filter out the record whose student ID is 20130005

grunt> filter_scores = FILTER scores BY num != 20130005;

View the filtered records

grunt> dump filter_scores;
(20130001,80,90)
(20130002,85,96)
(20130003,60,70)
(20130004,74,86)

Compute each student's total score

grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;
grunt> dump totalScore;
(20130001,170)
(20130002,181)
(20130003,130)
(20130004,160)
(20130005,163)
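The filtered records and the totals shown above can be cross-checked without a cluster by running the same logic over the sample data with awk. This is only a local sanity check, not how Pig executes the statements; the function names keep_except and totals are made up for this sketch.

```shell
# keep_except ID: print records whose first field differs from ID,
# mirroring FILTER scores BY num != ID.
keep_except() {
  awk -F'|' -v id="$1" '$1 != id { printf "(%s,%s,%s)\n", $1, $2, $3 }'
}

# totals: print (num, Chinese+Math) for each record,
# mirroring FOREACH scores GENERATE num, Chinese+Math.
totals() {
  awk -F'|' '{ printf "(%s,%d)\n", $1, $2 + $3 }'
}

data='20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98'

printf '%s\n' "$data" | keep_except 20130005
printf '%s\n' "$data" | totals
```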

Write each student's total score to a file

grunt> store totalScore into 'hdfs://ctrl:9000/tmp/out/result' using PigStorage('|');
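The same statements can also be run non-interactively by saving them in a script file and passing it to pig. The sketch below uses local mode so that it needs no cluster. The file names score.pig and score.txt and the output directory result are assumptions for illustration, and the paths are local rather than hdfs:// URLs.

```shell
# Save the Pig Latin statements from the grunt session into a script file.
cat > score.pig <<'EOF'
-- Load the |-separated score sheet from the local working directory.
scores = LOAD 'score.txt' USING PigStorage('|') AS (num:int, Chinese:int, Math:int);
-- Total score per student.
totalScore = FOREACH scores GENERATE num, Chinese + Math;
-- Write the result as |-separated text.
STORE totalScore INTO 'result' USING PigStorage('|');
EOF

if command -v pig >/dev/null 2>&1 && [ -f score.txt ]; then
  # The output directory must not exist yet, otherwise STORE is expected to fail.
  pig -x local score.pig && cat result/part-*
else
  echo "pig or score.txt not available; wrote score.pig only"
fi
```

To run against the cluster instead, the paths would be changed back to the hdfs:// URLs used above and the script started with pig -x mapreduce score.pig.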