
Hive CookBook

source link: http://chen-tao.github.io/2017/10/17/hive-tips/


Posted on 2017-10-17

| Filed under Hive

Some tips for handling data with Hive.

Set the job name: set mapreduce.job.name=xxx_yourname;

Specify the job queue: set mapreduce.job.queuename=offline.data;

Print a header row in CLI query output: set hive.cli.print.header=true;

Set the job priority: set mapreduce.job.priority=HIGH;

Priority notes:
Set mapreduce.job.priority to adjust priority; supported values:

Value              YARN priority   Description
NORMAL (default)   1               normal jobs
HIGH               2               nearline jobs
VERY_HIGH          3               online jobs
CRITICAL           6               critical jobs
VERY_CRITICAL      9               highly critical jobs

Reduce-stage memory configuration:

set mapreduce.reduce.java.opts=-Xmx3072m;  -- JVM heap
set mapreduce.reduce.memory.mb=4096;       -- container size; keep the heap below this

Map-stage memory configuration:

set mapreduce.map.java.opts=-Xmx3072m;
set mapreduce.map.memory.mb=4096;

Too many map tasks (raise the split size; 536870912 bytes = 512 MB):

set mapred.min.split.size=536870912;
set mapred.max.split.size=536870912;
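
On newer Hadoop releases the mapred.* names above are deprecated; to the best of my knowledge the current equivalents are the mapreduce.input.fileinputformat split settings (same 512 MB value):

```sql
set mapreduce.input.fileinputformat.split.minsize=536870912;
set mapreduce.input.fileinputformat.split.maxsize=536870912;
```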

Computing per-group Top N or Top percentile with Rank()

Docs: Rank()

Rank() generates a new column for each row of the original data, holding the row's rank within its group. You can then keep rows by rank value to get the Top N, or, by also consulting the group's total row count, keep the top n% of rows.

Example:

SELECT
  *
FROM (
  SELECT
    *,
    -- the ORDER BY column was garbled in the original source; substitute your sort key
    RANK() OVER (
      PARTITION BY gid
      ORDER BY ...
    ) AS gir
  FROM is_daily
  WHERE
    `date` = "20170803"
    AND `is` = "f"
) AS tmp_t
WHERE
  gir < 10;
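
The top-percentile variant mentioned above can be sketched by also computing the group size with a window count. A minimal sketch; the score column is illustrative, not from the original:

```sql
SELECT *
FROM (
  SELECT
    *,
    RANK() OVER (PARTITION BY gid ORDER BY score DESC) AS r,  -- rank within group
    COUNT(*) OVER (PARTITION BY gid) AS n                     -- total rows in group
  FROM is_daily
) t
WHERE r <= n * 0.1;  -- keep the top 10% of each group
```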

使用 NDV() 实现快速计数

文档: NDV()

相比 count distinct 更快、更节省内存,但存在误差 (HyperLogLog). 在查询很大的 count distinct,且对准确性要求量级正确即可时,e.g. 月活跃用户(MAU),可以考虑用 NDV 代替 count distinct.

NDV() 只能对单列做计数,所以如果需要多列组合计数,需要做一个拼接转换。

Example:

-- approximate: NDV over the concatenated key
-- (consider a separator between columns to avoid collisions)
SELECT
  NDV(
    concat(
      cast(uid AS STRING),
      cast(user_uid AS STRING)
    )
  )
FROM default.isd
WHERE
  `date` = "20170803"
  AND `is` = "f";

-- exact: count distinct over both columns
SELECT
  count(distinct uid, user_uid)
FROM default.isd
WHERE
  `date` = "20170803"
  AND `is` = "f";

Unnesting an array with lateral view and explode

Split a string into an array, then explode the array.

Example: it is a string column, e.g. 'refresh,push'

select
  tb.it
from
  t LATERAL VIEW explode(split(it, ',')) tb as it
where
  `date` = '20161229';

Unnesting a map with lateral view and explode

Filter on the values to pick out the qualifying keys.

Example: lt_lk_keyword is a Map column, e.g. {123: 0.1234, 234: 0.2345}

select
  key,
  value
from
  x LATERAL VIEW explode(lt_lk_keyword) tb as key, value
where
  `date` = '20151229';

Handling JSON with lateral view and json_tuple

SELECT
  t1.gs,
  t1.uid
FROM
  x g LATERAL VIEW json_tuple(extra, "gs", "uid") t1 AS gs, uid
JOIN default.web_article_item_dict AS i ON g.date = i.date
  AND g.kid = iid
WHERE
  g.date = "20171016";
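
When only one or two keys are needed, get_json_object is a common alternative to json_tuple. A minimal sketch against the same hypothetical extra column:

```sql
-- '$.uid' is a JSONPath expression selecting the uid key
SELECT get_json_object(extra, '$.uid') AS uid
FROM x
WHERE `date` = "20171016";
```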

Saving query results to a Hive table

Create a table directly from a query result:

CREATE TABLE test.tb_abc AS
SELECT distinct uit, uid
FROM default.isd
WHERE `date` = "20150803";

Creating an empty partitioned Parquet table

CREATE EXTERNAL TABLE test.tmp_users1(
  ut tinyint,
  uid bigint
)
PARTITIONED BY (
  `date` string
)
STORED AS PARQUET;
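
Once the empty partitioned table exists, a partition can be filled with a static-partition insert. A sketch, assuming the default.isd table from earlier examples exposes matching ut and uid columns:

```sql
INSERT OVERWRITE TABLE test.tmp_users1 PARTITION (`date` = '20150803')
SELECT ut, uid
FROM default.isd
WHERE `date` = '20150803';
```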

Creating an external table over JSON data

Docs: CreateTable

Example: create a temporary external table with 2 fields.

HDFS data: xx.json. The directory may contain multiple files. JSON file content:

{"ut": 12, "uid": 8}
{"ut": 14, "uid": 12345}
CREATE EXTERNAL TABLE test.tmp_users1(
  ut tinyint,
  uid bigint
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
STORED AS TEXTFILE
LOCATION '/locate';

Creating an external table over CSV data (data on the local filesystem)

Instead of storing the data on HDFS first, load it into the table directly from a local file.

-- test.csv:
-- 12,674930234
-- 14,759234924

-- step 1: create the table
create external table csv_load_test(
  ut bigint,
  uid bigint
)
row format
  delimited fields terminated by ','
stored as textfile;

-- step 2: load the data
load data local inpath './test.csv' overwrite into table csv_load_test;

Creating an external table over CSV data (data on HDFS)

Docs: CreateTable

Example: create a temporary external table with 2 fields.

HDFS data: /users.csv. The directory may contain multiple files. CSV file content, comma-separated:

To make the table visible to Impala after creating it in Hive, run in Impala: invalidate metadata test.source_list_focus;

-- 12,8
-- 14,12345
CREATE EXTERNAL TABLE test.tmp_users2(
  ut tinyint,
  uid bigint
)
ROW FORMAT
  DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/locate';

Creating a temporary table with map and array columns:

create external table test.impala_complex_type_json(
  uid string,
  properties map<string,string>,
  ids array<int>
)
ROW FORMAT DELIMITED
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':'
STORED AS TEXTFILE;
-- local file
u1 {"name":"xiaoming","age":"18","gender":"male"} [1,2,3]
u2 {"name":"dagou","age":"18","gender":"female"} [4,5,6]
-- cli - command
load data local inpath './a' overwrite into table test.impala_complex_type_json;
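
Map and array columns can then be addressed by key and index respectively. A quick sketch:

```sql
SELECT
  uid,
  properties['name'] AS name,  -- map lookup by key
  ids[0] AS first_id           -- arrays are 0-indexed in Hive
FROM test.impala_complex_type_json;
```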
