hive中一些DML的操作

DML（数据操纵语言）主要指数据的增删查改

数据导入

有5种导入数据的方法，最常用的是 Load 和 Insert

Load

从文件系统中导入数据

load data [local] inpath '/xxxxxx' 
[overwrite] into table tablename [partition (partcol1=val1,…)];

local: 指本地文件系统，否则为HDFS
overwrite: 指覆盖表中已有数据，否则表示追加
partition: 表示上传到指定分区

Insert

通过查询语句导入数据

# 覆盖
insert overwrite table tablename [partition(partcol1=val1,partclo2=val2)] select_statement;
# 追加
insert into table tablename [partition(partcol1=val1,partclo2=val2)] select_statement;

As Select

查询语句中创建表并导入数据

create table if not exists tablename
as select id, name from student;

Location

创建表时通过Location指定数据的路径，再直接put数据到hdfs上

hive (default)> dfs -put /xxxxx /user/hive/warehouse/tablename;

Import

只能导入export导出的数据

import table tablename partition(month='201909') from '/user/hive/warehouse/export/student';

数据导出

最常用的是 Insert 和 Hadoop

Insert

# 将数据导入本地（并格式化处理），不加local就是导入HDFS
insert overwrite local directory '/xxxxx'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from tablename;

Hadoop

# hadoop命令导出
hive > dfs -get /user/hive/warehouse/student/month=201909/000000_0 /xxxxxxx;

Export

这个导出的数据除了数据还有元数据，可用Import导入

hive > export table tablename to '/user/hive/warehouse/export/student';

数据清除

只能清除内部表的数据

hive > truncate table tablename;

查询

查询的关键字较多，要知道它们的顺序（重点）

写的顺序：

select ... from ... join on ... where ... group by ... having ... order by ... limit ...

执行顺序：大体思路是限定(where)，分组，限定（having），选择，排序

from -> join on -> where -> group by -> having -> select -> order by -> limit

select…where…limit

# 简单查询
select empno, ename from emp;

select count(*) cnt from emp;

select * from emp where sal >1000 limit 5;

group by 和 having

group by 语句通常会和聚合函数一起使用，按照一个或者多个列队结果进行分组，然后对每个组执行聚合操作

# 计算emp表每个部门的平均工资
select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;

where 作用在分组（group by）和聚合（sum等）计算之前，选取哪些行，也就是在查询前筛选；having 对分组后计算的数据进行过滤。它只用于group by分组统计语句。

# 求每个部门的平均薪水大于2000的部门
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

join

Hive支持通常的SQL JOIN语句，但是只支持等值连接，不支持非等值连接。且连接谓词中不支持or。
这个非等值连接可以从以前学的reducejoin的流程思索原因，reducejoin是在shuffer时将数据按关联值相等的（on的条件）分为一组，再在reducer阶段进行处理。

# 合并员工表和部门表
select * from emp e join dept d on e.deptno = d.deptno;

排序

全局排序 Order By

全局排序，只一个Reducer。全排很明显最后生成一个总的排序文件，1个reducer

# 查询员工信息按工资降序排列
select * from emp order by sal desc;

按reducer排序 sort by

每个reducer端都会做排序，出来的数据是有序的。假如有n个Reducer，就会生成n个有序文件。当n=1时，它就是Order By。

扩展一波，Reducer个数默认按原始数据256M一个，当然也可手动设置其个数。

分区排序 Distribute By…Sort By

先分区，后排序。这个分区类型mapreduce的分区，多少个分区，就有多少个reduce任务，后面就生成多少个文件。说白了这个和上面的区别就是它通过Distribute By指定怎么分区，即指定怎么分reducer。

# 设置reduce个数
set mapreduce.job.reduces=3;
# 先按照部门编号分区，再按照员工编号降序排序
insert overwrite local directory '/Users/zxy/IdeaProjects/bigdata-learning/hive-learning/data/output'
select * from emp distribute by deptno sort by empno desc;

分桶 Cluster By

当distribute by 和 sorts by 字段相同时，可以使用cluster by方式。但排序只能是升序排序。

可以从取名看出，我没用分桶排序。你可以理解 Cluster 就是把数据分区，然后每个分区生成一个文件，这样就好解释为啥只能升序排序，我理解它压根就不需要排序，只是把数据分到不同区就ok。

# 以下两种写法等价
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

细节一：来波小结理一下分区表分区排序分桶(前两个分区意思截然不同)

partition(month='201909') 这个是分区表，针对的是数据的存储路径
Distribute By...Sort By 这个是分区排序，和MR中的分区概念一样，多少个分区，就有多少个reduce任务，后面就生成多少个文件，分区之后对区里的数据进行排序。
Cluster By 分桶，针对的是数据文件，将大的数据集分区。

所以将 Cluster By 理解为分区， Distribute By…Sort By 理解为分区排序，岂不美哉

细节二：注意导入数据到分桶中，要用insert，且设置hive.enforce.bucketing=true和 hive.enforce.bucketing=true

细节三：分桶抽样查询

select * from tablename tablesample(bucket x out of y on id);

x 表示从哪个bucket开始抽取

y 表示抽样间隔，共抽取总数/y 个桶，且x的值必须小于等于y的值

举例：如果 x = 1, y = 4 ，共16个桶，那么将抽取16/4个桶，分别是 1、5、9、13

行转列、列转行

行转列：将不同行的聚合到一起

collect_set(col)：函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总(不去重用list)，产生array类型字段

例子：

select
    base,
    concat_ws("|", collect_set(name)) name
from
(select
    name,
    concat(constellation, ",", blood_type) base
from
    constellation) t1
group by
    base;

列转行：将列拆分成多行。

explode(col) 将hive一列中复杂的array或者map结构拆分成多行

LATERAL VIEW udtf(expression) tableAlias AS columnAlias 用于和split, explode等UDTF一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。

例子：

select
    movie,
    category_name
from 
    movie_info lateral view explode(category) table_tmp as category_name;

窗口函数

基本结构：函数 over(范围) 。用前面的函数处理over中的规定的数据

除了count、sum等一些常用函数，还有只能配合over使用的函数：

lag(col,n)：往前第n行数据
lead(col,n)：往后第n行数据
ntile(n)：给数据编号，编号从1开始，对于每一行，NTILE返回此行所属的组的编号。
rank() 排序相同时会重复，总数不会变，如12225668
dense_rank() 排序相同时会重复，总数会减少，如12223445
row_rank() 单纯顺序计算，如12345678

over里面可以规定窗口范围：

()：全部数据
(partition by xxx order by xxx)：分区有序
(rows between xxxx and xxxx)：手动指定范围
- current row：当前行
- n preceding：往前n行数据
- n following：往后n行数据
- unbounded preceding：从起点开始
- unbounded following：到终点结束

一些例子：

business.name	business.orderdate	business.cost
jack	2017-01-01	10
tony	2017-01-02	15
jack	2017-02-03	23
tony	2017-01-04	29
jack	2017-01-05	46
jack	2017-04-06	42
tony	2017-01-07	50
jack	2017-01-08	55
mart	2017-04-08	62
mart	2017-04-09	68
neil	2017-05-10	12
mart	2017-04-11	75
neil	2017-06-12	80
mart	2017-04-13	94

#（1）查询在2017年4月份购买过的顾客及总人数，over()针对groupby后的全部数据
select
    name,
    count(*) over () 
from business 
where 
    substring(orderdate,1,7) = '2017-04' 
group by 
    name;

#（2）查询顾客的购买明细 并 让cost按照日期进行累加
select
    name,
    orderdate,
    cost,
    sum(cost) over(partition by name order by orderdate) 
from
    business;

#（3）查看顾客上次的购买时间，在窗口中分区排序
select
    name,
    orderdate,
    cost, 
    lag(orderdate, 1, '1970-01-01') over(partition by name order by orderdate)
from business;

#（4）查询前20%时间的订单信息

# 加分组号
select
    name,
    orderdate,
    cost, 
    ntile(5) over(order by orderdate) sorted
from business; t1

# 过滤出组号为1的数据
select
    name,
    orderdate,
    cost
from 
(select
    name,
    orderdate,
    cost, 
    ntile(5) over(order by orderdate) sorted
from business) t1
where 
    sorted = 1;

#（5）计算每个人消费的排名
select
    name,
    orderdate,
    cost,
    rank() over(partition by name order by cost desc)
from
    business;