数据仓库开发技巧

该系列包含三篇:

1、批量刷数据的写法,建议使用动态分区,尽量保证存量脚本和增量脚本是同一个,如果数据量少可以一次重刷,如果数据量多可以设置按月重刷
按月重刷的推荐写法:

1
2
select * from tabel_name
where ds between '${pre1day}' and last_day('${pre1day}') and ds<current_date()

2、计算占比

1
2
3
4
5
6
7
8
select
campaign_id
,cnt
,sum(cnt) over() as total
,1.0000*cnt/sum(cnt) over() as percent
from (
select campaign_id,count(1) cnt from bdl_jumplog_dsp where ds='2020-12-25' group by campaign_id order by count(1)
)

3、计算分位

1
2
3
4
select
dept
,PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM lxw1234;

4、计算每条记录之间的时间间隔

  • LAG函数
  • LEAD函数
  • 自关联,rn=rn+1
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    select
    t.ds
    ,t1.ds
    ,case when t1.ds is not null and t.ds<='2017-09-30' and t1.ds<='2018-01-01' then 1
    else datediff(coalesce(t1.ds,'${pre1day}'),t.ds)
    end as diff
    ,t1.terminal as end_terminal
    ,t1.app_tag as end_app_tag
    from stage1_idl_xxx_zipper_his_uc t
    left join stage1_idl_xxx_zipper_his_uc t1 on t.appd=t1.appd and t.userid=t1.userid and t1.rn=t.rn+1

5、计算月标,统计月活,需要把当月截止当前的数据刷到当前日的分区中(重点是用到了一个日期维表)
完美的解法应该是 count(distinct userid) over(partition by appd,active_terminal_type order by ds asc)。但是旧版本hive的count窗口函数不支持distinct,所以只能用下面不完美的写法
还有一种解法就是先group by,然后把数据写到min_ds中,但是这样出来的数据和需求是有出入的。实际需求数据是每日增长,这样统计出来的数据是仅当日的活跃,前端用的时候在做累加展示(实际还是需要在关联日期维表)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
insert overwrite table adl_xxx_class partition (ds)
select
t.f_ds
,t1.f_month
,t1.appd
,t1.mon_diff
,t1.active_terminal_type
,count(distinct if(t.f_ds>=t1.min_ds,t1.userid,null)) as active_user_cnt
,count(distinct if(t.f_ds>=t1.min_ds and t1.is_student=1,t1.userid,null)) as active_student_cnt
,t.f_ds as ds
from (
select ds as f_ds from dim_ds_crm where ds<='${pre1day}' and ds>=concat(substr('${pre1day}',1,8),'01')
) t
left join (
-- 其实这样刷是有问题的,假如某个用户2020.12.20在班,那么2020.12.19之前都算在在班里面
select
concat(substr('${pre1day}',1,8),'01') f_month
,appd
,userid
,active_terminal_type
,start_mon_diff mon_diff
,max(coalesce(is_student,0)) is_student
,min(ds) min_ds
,max(ds) max_ds
from idl_xxx_class
where ds<='${pre1day}' and ds>=concat(substr('${pre1day}',1,8),'01')
group by concat(substr('${pre1day}',1,8),'01'),appd,start_mon_diff,active_terminal_type,userid
) t1 on 1=1
group by t.f_ds,t1.f_month,t1.appd,t1.mon_diff,t1.active_terminal_type

6、统计注册之后前7天的浏览数 / 统计注册之后30日的订单数

  • 刷当日的很简单
  • 刷历史数据
    • 借助维表
    • 使用lateral view
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- 借助维表
insert overwrite table adl_xxx_class partition (ds)
select
t.f_ds
,t1.f_month
,t1.appd
,t1.mon_diff
,t1.active_terminal_type
,count(distinct if(t.f_ds>=t1.min_ds,t1.userid,null)) as active_user_cnt
,count(distinct if(t.f_ds>=t1.min_ds and t1.is_student=1,t1.userid,null)) as active_student_cnt
,t.f_ds as ds
from (
select ds as f_ds from dim_ds_crm where ds<='${pre1day}' and ds>=concat(substr('${pre1day}',1,8),'01')
) t
left join (
select
concat(substr('${pre1day}',1,8),'01') f_month
,appd
,userid
,active_terminal_type
,start_mon_diff mon_diff
,max(coalesce(is_student,0)) is_student
,min(ds) min_ds
,max(ds) max_ds
from idl_xxx_class
where ds<='${pre1day}' and ds>=concat(substr('${pre1day}',1,8),'01')
group by concat(substr('${pre1day}',1,8),'01'),appd,start_mon_diff,active_terminal_type,userid
) t1 on 1=1
group by t.f_ds,t1.f_month,t1.appd,t1.mon_diff,t1.active_terminal_type
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
-- lateral view实现
select ds1,userid,count(distinct class_id)
from (
select
'intro' qk_type,
life_type,
start_date,
user_id,
root_groupid,
server_date,
class_id,
duration,
pv,
ds,
date_add(ds,detaday) as ds1
from (
select
life_type,
start_date,
user_id,
root_groupid,
server_date,
class_id,
duration,
1 as pv,
ds
from idl_xxxx_class
where ds>'2021-02-10'
and user_id>0
) t2
lateral view explode(array(0,1,2,3,4,5,6)) dual as detaday
) tt
group by ds1,userid

7、做拉链表(间隔表),参考idl_xxx_zipper_v2_uc
这种表有点复杂,当日的脚本还算好写,主要是刷历史数据的脚本,重点在于拆分sql,拆分逻辑。这种表一般历史数据的脚本和当天的脚本不可能写成一个,写完要多对下数据,保证历史脚本和每日脚本的执行结果是一样的