日期:2014-05-16 浏览次数:20570 次
问题:
线上出现垃圾数据,由于数据进hive采用的是dynamic partition,这些垃圾数据有些称为partiition字段
like:
>show partitions dpdw_traffic_base;
hp_stat_time=234123417234719237491/hp_log_type=0/hp_host=__HIVE_DEFAULT_PARTITION__/hp_from_type=__HIVE_DEFAULT_PARTITION__
hp_stat_time=r_ready%3D407;r_load%3D641/hp_log_type=0/hp_host=__HIVE_DEFAULT_PARTITION__/hp_from_type=__HIVE_DEFAULT_PARTITION__
hp_stat_time=r_ready%3D438;r_load%3D797/hp_log_type=0/hp_host=__HIVE_DEFAULT_PARTITION__/hp_from_type=__HIVE_DEFAULT_PARTITION__
.......
写了一个脚本清除垃圾数据:
#!/bin/sh pname=`hadoop fs -ls /user/hive/warehouse/bi.db/dpdw_traffic_base | awk '{print $NF}' | awk '{gsub(/\/user\/hive\/warehouse\/bi.db\/dpdw_traffic_base\/hp_stat_time=/,"",$0); print $0}'` for i in $pname do len=`expr length $i` if [ $len -ne 10 -a $i != '__HIVE_DEFAULT_PARTITION__' -a $i != 'items' ] then name="hdfs://10.2.6.102/user/hive/warehouse/bi.db/dpdw_traffic_base/hp_stat_time="$i hadoop fs -rmr $name hive -e "use bi; alter table dpdw_traffic_base drop partition(hp_stat_time='$i')" fi done
hp_stat_time=r_ready%3D407;r_load%3D641
的partition无法删除
跟进后发现hive会自动对一些字符进行UTF-8编码,此处%3D解码后是'='
另外‘;’ 需要被转义
so:
dpdw_traffic_base的partition:
hp_stat_time=r_ready%3D91;r_load%3D351/hp_log_type=0/hp_host=http%3A%2F%2Fwww.dianping.com%2Fsearch%2Fkeyword%2F10%2F0_%25E9%259B%2581%25E8%258D%25A1%25E6%2583%2585/hp_from_type=__HIVE_DEFAULT_PARTITION__
正确的删除语句是:
> alter table dpdw_traffic_base drop partition(hp_stat_time='r_ready=91\;r_load=351');
Dropping the partition hp_stat_time=r_ready%3D91;r_load%3D351/hp_log_type=0/hp_host=http%3A%2F%2Fwww.dianping.com%2Fsearch%2Fkeyword%2F10%2F0_%25E9%259B%2581%25E8%258D%25A1%25E6%2583%2585/hp_from_type=__HIVE_DEFAULT_PARTITION__
OK