CarbonData在BI场景中的测试.doc
-CarbonData在BI场景中的测试测试目的CarbonData格式与Parquet格式在BI场景下的性能比照测试版本集群配置cpu:Intel(R) Core(TM) i7-4820K CPU 3.70GHz 1U *4Coremem: 64G节点个数:5测试素材使用*BI工程中的事实表作为测试表,建表DDL语句如下:create table isdzd_b0_10y_c(userid_ string, btype_ string, shtag_ string, shtag2_ string, option_ string, f2 double, f3 double, f4 double, f5 double, f6 double, f7 double, f8 double, f9 double, f10 double, f11 double, f12 double, f13 double, f14 double, f15 double, f16 double, f17 double, f18 double, f19 double, f20 double, f21 double, f22 double, f23 double, f24 double, f25 double, f26 double, f27 double, f28 double, f29 double, f30 double, f31 double, f32 double, f33 double, f34 double, f35 double, f36 double, f37 double, f38 double, f39 double, f40 double, f41 double, f42 double, f43 double, f44 double, f45 double, f46 double, f47 double, f48 double, f49 double, f50 double, f51 double, f52 double, f53 double, f54 double, f55 double, f56 double, f57 double, f58 double, f59 double, f60 double, f61 double, f62 double, f63 double, f64 double, f65 double, f66 double, f67 double, f68 double, f69 double, f70 double, f71 double, f72 double, f73 double, f74 double, f75 double, f76 double, f77 double, f78 double, f79 double, f80 double, f81 double, f82 double, f83 double, f84 double, f85 double, f86 double, f87 double, f88 double, f89 double, f90 double, f91 double, f92 double, f93 double, f94 double, f95 double, f96 double, f97 double, f98 double, f99 double, f100 double, f101 double, f102 double, f103 double, f104 double, f105 double, f106 double, f107 double, f108 double, f109 double, f110 double, f111 double, f112 double, f113 double, f114 double, nsrsbh string, nsrmc string, hydm string, djzcl* string, *zqh string, zzsjnfs string, zzsyhzc string, upid_ string,bbq_ string ) STORED BY 'carbondata'测试方法分别测试在2亿和10亿的数据量下,CarbonData与parquet格式在ad-hoc场景下的性能优劣。测试用例场景用例sql用例即席分析_1个维、1个指标select sum(a.F7) as F70, SUBSTR(a.*ZQH, 1, 2) as *ZQH0from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL)group by SUBSTR(a.*ZQH, 1, 2) order by true limit 600;即席分析_ 2个维、5个指标select sum(a.F7) as F70,sum(a.F8) as F80,sum(a.F9) as F90,sum(a.F12) as F120,sum(a.F13) as F130, SUBSTR(a.*ZQH,1,2) as *ZQH0, SUBSTR(a.DJZCL*,1,1) as DJZCL*0from ISDZD_B0_10Y_c awhere (a.HYDM IS NOT NULL)group by SUBSTR(a.*ZQH,1,2), SUBSTR(a.DJZCL*,1,1) order by true limit 600;一般报表_1个维,5个指标select count(a.NSRSBH) as C2,sum(a.F15) as D2,sum(a.F25) as E2,sum(a.F30) as F2,sum(a.F33) as G2,sum(a.F10) as H2, SUBSTR(a.*ZQH, 1, 2) as B2from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2)order by sum(a.F10) desc nulls last limit 1000;一般报表_1个维,20个指标select C2 as C2,D2 as D2,E2 as E2,F2 as F2,G2 as G2,H2 as H2,I2 as I2,J2 as J2,K2 as K2,L2 as L2,M2 as M2,N2 as N2,O2 as O2,P2 as P2,Q2 as Q2,R2 as R2,S2 as S2,T2 as T2,U2 as U2,V2 as V2,B2 as B2,B2_sortn as B2_sortnfrom (select sum(a.F15) as C2,sum(a.F25) as D2,sum(a.F30) as E2,sum(a.F33) as F2,sum(a.F10) as G2,sum(a.F37) as H2,sum(a.F42) as I2,sum(a.F48) as J2,sum(a.F49) as K2,sum(a.F50) as L2,sum(a.F52) as M2,sum(a.F53) as N2,sum(a.F54) as O2,sum(a.F55) as P2,sum(a.F56) as Q2,sum(a.F57) as R2,sum(a.F59) as S2,sum(a.F60) as T2,sum(a.F60) as U2,sum(a.F58) as V2, SUBSTR(a.*ZQH, 1, 2) as B2,row_number() over ( order by sum(a.F10) desc nulls last) as B2_sortnfrom isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2) awhere (B2_sortn<=40)order by B2_sortn nulls first limit 1000;多级浮动_ 2个维,5个指标select C3 as C3,D3 as D3,E3 as E3,F3 as F3,G3 as G3,C2 as C2,D2 as D2,E2 as E2,F2 as F2,G2 as G2,B3 as B3,B2 as B2,B2_sortn as B2_sortnfrom (select a.C3 as C3,a.D3 as D3,a.E3 as E3,a.F3 as F3,a.G3 as G3,b.C2 as C2,b.D2 as D2,b.E2 as E2,b.F2 as F2,b.G2 as G2,a.B3 as B3,a.B2 as B2,row_number() over (partition by a.B3 order by b.G2 desc nulls last) as B2_sortnfrom (select sum(a.F15) as C3,sum(a.F25) as D3,sum(a.F30) as E3,sum(a.F33) as F3,sum(a.F10) as G3, SUBSTR(a.DJZCL*, 1, 1) as B3, SUBSTR(a.*ZQH, 1, 2) as B2from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2), SUBSTR(a.DJZCL*, 1, 1) aleft join (select sum(a.F15) as C2,sum(a.F25) as D2,sum(a.F30) as E2,sum(a.F33) as F2,sum(a.F10) as G2, SUBSTR(a.*ZQH, 1, 2) as B2from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2) bon (a.B2=b.B2) awhere (B2_sortn<=40)order by B3 nulls first,B2_sortn nulls first limit 1000;多级浮动_ 2个维,20个指标select C3 as C3,D3 as D3,E3 as E3,F3 as F3,G3 as G3,H3 as H3,I3 as I3,J3 as J3,K3 as K3,L3 as L3,M3 as M3,N3 as N3,O3 as O3,P3 as P3,Q3 as Q3,R3 as R3,S3 as S3,T3 as T3,U3 as U3,V3 as V3,C2 as C2,D2 as D2,E2 as E2,F2 as F2,G2 as G2,H2 as H2,I2 as I2,J2 as J2,K2 as K2,L2 as L2,M2 as M2,N2 as N2,O2 as O2,P2 as P2,Q2 as Q2,R2 as R2,S2 as S2,T2 as T2,U2 as U2,V2 as V2,B3 as B3,B2 as B2,B2_sortn as B2_sortnfrom (select a.C3 as C3,a.D3 as D3,a.E3 as E3,a.F3 as F3,a.G3 as G3,a.H3 as H3,a.I3 as I3,a.J3 as J3,a.K3 as K3,a.L3 as L3,a.M3 as M3,a.N3 as N3,a.O3 as O3,a.P3 as P3,a.Q3 as Q3,a.R3 as R3,a.S3 as S3,a.T3 as T3,a.U3 as U3,a.V3 as V3,b.C2 as C2,b.D2 as D2,b.E2 as E2,b.F2 as F2,b.G2 as G2,b.H2 as H2,b.I2 as I2,b.J2 as J2,b.K2 as K2,b.L2 as L2,b.M2 as M2,b.N2 as N2,b.O2 as O2,b.P2 as P2,b.Q2 as Q2,b.R2 as R2,b.S2 as S2,b.T2 as T2,b.U2 as U2,b.V2 as V2,a.B3 as B3,a.B2 as B2,row_number() over (partition by a.B3 order by b.G2 desc nulls last) as B2_sortnfrom (select sum(a.F15) as C3,sum(a.F25) as D3,sum(a.F30) as E3,sum(a.F33) as F3,sum(a.F10) as G3,sum(a.F37) as H3,sum(a.F42) as I3,sum(a.F48) as J3,sum(a.F49) as K3,sum(a.F50) as L3,sum(a.F52) as M3,sum(a.F53) as N3,sum(a.F54) as O3,sum(a.F55) as P3,sum(a.F56) as Q3,sum(a.F57) as R3,sum(a.F59) as S3,sum(a.F60) as T3,sum(a.F60) as U3,sum(a.F58) as V3, SUBSTR(a.DJZCL*, 1, 1) as B3, SUBSTR(a.*ZQH, 1, 2) as B2from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2), SUBSTR(a.DJZCL*, 1, 1) aleft join (select sum(a.F15) as C2,sum(a.F25) as D2,sum(a.F30) as E2,sum(a.F33) as F2,sum(a.F10) as G2,sum(a.F37) as H2,sum(a.F42) as I2,sum(a.F48) as J2,sum(a.F49) as K2,sum(a.F50) as L2,sum(a.F52) as M2,sum(a.F53) as N2,sum(a.F54) as O2,sum(a.F55) as P2,sum(a.F56) as Q2,sum(a.F57) as R2,sum(a.F59) as S2,sum(a.F60) as T2,sum(a.F60) as U2,sum(a.F58) as V2, SUBSTR(a.*ZQH, 1, 2) as B2from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2) bon (a.B2=b.B2) awhere (B2_sortn<=40)order by B3 nulls first,B2_sortn nulls first limit 1000;图表展现_单图单表select C2 as C2,D2 as D2,B2 as B2,B2_sortn as B2_sortnfrom (select sum(a.F15) as C2,sum(a.F33) as D2, SUBSTR(a.*ZQH, 1, 2) as B2,row_number() over ( order by sum(a.F33) desc nulls last) as B2_sortnfrom isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2) awhere (B2_sortn<=10)order by B2_sortn nulls first limit 1000;图表展现_双图双表select * from (select C2 as C2,D2 as D2,B2 as B2,B2_sortn as B2_sortnfrom (select sum(a.F15) as C2,sum(a.F33) as D2, SUBSTR(a.HYDM, 1, 1) as B2,row_number() over ( order by sum(a.F33) desc nulls last) as B2_sortnfrom isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.HYDM, 1, 1) awhere (B2_sortn<=10)order by B2_sortn nulls first limit 1000 ) aunion allselect * from (select C2 as C2,D2 as D2,B2 as B2,B2_sortn as B2_sortnfrom (select sum(a.F15) as C2,sum(a.F33) as D2, SUBSTR(a.*ZQH, 1, 2) as B2,row_number() over ( order by sum(a.F33) as B2_sortnfrom isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2) awhere (B2_sortn<=10)order by B2_sortn nulls first limit 1000) b;统计报表下钻 _折叠展开select * from (select sum(a.F15) as B2,sum(a.F25) as C2,sum(a.F30) as D2,sum(a.F33) as E2,sum(a.F10) as F2,sum(a.F37) as G2,sum(a.F42) as H2,sum(a.F48) as I2,sum(a.F49) as J2,sum(a.F50) as K2,sum(a.F52) as L2,sum(a.F53) as M2,sum(a.F54) as N2,sum(a.F55) as O2,sum(a.F56) as P2,sum(a.F57) as Q2,sum(a.F59) as R2,sum(a.F60) as S2,sum(a.F60) as T2,sum(a.F58) as U2,'000000' as A2from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1) order by true limit 1000) aunion allselect * from (select sum(a.F15) as B2,sum(a.F25) as C2,sum(a.F30) as D2,sum(a.F33) as E2,sum(a.F10) as F2,sum(a.F37) as G2,sum(a.F42) as H2,sum(a.F48) as I2,sum(a.F49) as J2,sum(a.F50) as K2,sum(a.F52) as L2,sum(a.F53) as M2,sum(a.F54) as N2,sum(a.F55) as O2,sum(a.F56) as P2,sum(a.F57) as Q2,sum(a.F59) as R2,sum(a.F60) as S2,sum(a.F60) as T2,sum(a.F58) as U2, SUBSTR(a.*ZQH, 1, 2) as A2from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1) AND (2>1)group by SUBSTR(a.*ZQH, 1, 2) order by true limit 1000 ) b;统计报表下钻 _跳转select sum(a.F15) as C2,sum(a.F25) as D2,sum(a.F30) as E2,sum(a.F33) as F2,sum(a.F10) as G2,sum(a.F37) as H2,sum(a.F42) as I2,a.DJZCL* as B2,ma*(b.caption_) as mc_B20from isdzd_b0_10y_c aleft join DIM_2021NSSZLD_NSRDJZCL* bon (a.DJZCL*=b.id_)where (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by a.DJZCL*order by sum(a.F33) desc nulls last limit 1000;图表联动select sum(a.F30) as B3,sum(a.F80) as C3,sum(a.F102) as D3, SUBSTR(a.*ZQH, 1, 2) as A3from isdzd_b0_10y_c awhere (a.HYDM IS NOT NULL) AND (a.BBQ_ like '2021%') AND (2>1)group by SUBSTR(a.*ZQH, 1, 2) order by true limit 1000;测试结果场景用例2亿10亿CarbonDataParquetCarbonDataParquet即席分析_1个维、1个指标19.3s17.3s27.8s25.1s即席分析_ 2个维、5个指标18.3s17.2s45.8s44.8s一般报表_1个维,5个指标10.9s13s13.3s42.9s一般报表_1个维,20个指标19.5s19.9s46s50s多级浮动_ 2个维,5个指标14.1s16.5s17.1s48s多级浮动_ 2个维,20个指标22.7s24.8s61.7s76.7s图表展现_单图单表14.9s15.8s22.2s18.9s图表展现_双图双表12.5s14.5s17.6s29.6s统计报表下钻 _折叠展开24.6s46.4s42.2s67.9s统计报表下钻 _跳转17.1s15.6s21.9s33.3s图表联动12.2s10.8s17.8s42.7s测试结论1. CarbonData与Parquet在全表扫描方式下的性能根本相当,没有明显差异。参考:用例1,22. 对于含过滤条件的筛选查询,当数据规模到一定量级时,CD较之Parquet有较明显的优势,主要是CarbonData的多级索引发挥了作用,而且数据量越大,性能差异越明显,从本次的测试用例可以看出,2亿行级的数据量两者差异很小,10亿行级的数据量两者的性能差距较明显。参考:用例3,5,113. CarbonData很好地利用了索引扫描,使得查询耗时的增加幅度明显低于数据量的膨胀速度,从测试结果我们不难发现,当数据量从2亿增长到10亿时,parquet格式的查询响应时间也成倍增加。而CarbonData格式的增加幅度相对平稳得多,这得益于索引扫描替代了全表扫描。参考:用例3,5,10,114. 含过滤条件的查询,假设使用开窗函数,CarbonData的性能优势不复存在,笔者推测是因为开窗函数通常需要走全表扫描,导致CarbonData的索引无法被利用参考:用例4,65. CarbonData上的union all操作比parquet要快,原因还在分析中。参考:用例8,96. 从实测结果来看,CarbonData存储格式适合BI的应用场景. z.