翻译自?https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-AutoConversiontoSMBMapJoin
目录结构
Join Optimization ----Join 调优 Improvements to the Hive Optimizer ----Hive的优化 Star Join Optimization ----星型结构的优化 Star Schema Example ----例子 Prior Support for MAPJOIN ----MapJoin的预备知识 Limitations of Prior Implementation ----使用限制 Enhancements for Star Joins ----优化星型join Optimize Chains of Map Joins ----优化一连串的Map join Current and Future Optimizations ----当前以及未来的优化方向 Optimize Auto Join Conversion ----自动join转换 Current Optimization ----当前的优化方向 Auto Conversion to SMB Map Join ----自动转换成SMS join Generate Hash Tables on the Task Side ----在task side生成hash表 Pros and Cons of Client-Side Hash Tables ----优缺点 Task-Side Generation of Hash Tables ----在task siade 生成hash表 Further Options for Optimization----优化方向
?
Improvements to the Hive Optimizer
?
Hive automatically recognizes various use cases and optimizes for them. Hive 0.11 improves the optimizer for these cases: ? hive可以自动优化,在0.11里面改进了一些优化用例
- Joins where one side fits in memory. In the new optimization: ?join的一边适合放进内存,有新的优化方案
- that side is loaded into memory as a hash table?把表按照hash表的形式读进内存
- only the larger table needs to be scanned?只扫描大表
- fact tables have a smaller footprint in memory?fact表只使用少量内存
- Star-schema joins 星型join
- Hints are no longer needed for many cases. 在很多情况下不再需要HINT
- Map joins are automatically picked up by the optimizer. Map join 自动优化
Star Join Optimization
Star Schema Example
?
Select count(*) cnt From store_sales ss join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk) join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk) join store s on (s.s_store_sk = ss.ss_store_sk) Where t.t_hour = 8 t.t_minute >= 30 hd.hd_dep_count = 2 order by cnt;?
?
DW 常用的star schema,这个属于BI 基本概念了就不解释了。
Prior Support for MAPJOIN
?
The default value for hive.auto.convert.join was false in Hive 0.10.0. ?Hive 0.11.0 changed the default to true (HIVE-3297).hive.auto.convert.join 在0.10默认是fales,到了0.11就是变成了true
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through. The prior implementation has this division of labor:
MAPJOINs ?把小表以hash map的形式读进内存,然后和大表匹配key,以下是各个阶段的分工
- Local work:本地
- read records via standard ta