日期:2014-05-16  浏览次数:20570 次

hive删除partition遇到的编码、转义问题

问题:

线上出现垃圾数据,由于数据进hive采用的是dynamic partition,这些垃圾数据有些称为partiition字段

like:
>show partitions dpdw_traffic_base;

hp_stat_time=234123417234719237491/hp_log_type=0/hp_host=__HIVE_DEFAULT_PARTITION__/hp_from_type=__HIVE_DEFAULT_PARTITION__
hp_stat_time=r_ready%3D407;r_load%3D641/hp_log_type=0/hp_host=__HIVE_DEFAULT_PARTITION__/hp_from_type=__HIVE_DEFAULT_PARTITION__
hp_stat_time=r_ready%3D438;r_load%3D797/hp_log_type=0/hp_host=__HIVE_DEFAULT_PARTITION__/hp_from_type=__HIVE_DEFAULT_PARTITION__
.......


写了一个脚本清除垃圾数据:

#!/bin/sh

pname=`hadoop fs -ls /user/hive/warehouse/bi.db/dpdw_traffic_base | awk '{print $NF}' | awk  '{gsub(/\/user\/hive\/warehouse\/bi.db\/dpdw_traffic_base\/hp_stat_time=/,"",$0); print $0}'`

for i in $pname
do 
        len=`expr length $i`
        if [ $len -ne 10 -a $i != '__HIVE_DEFAULT_PARTITION__' -a $i != 'items' ]
        then 
                name="hdfs://10.2.6.102/user/hive/warehouse/bi.db/dpdw_traffic_base/hp_stat_time="$i
                hadoop fs -rmr $name
                hive -e "use bi; alter table dpdw_traffic_base drop partition(hp_stat_time='$i')"
        fi
done

出现问题,类似于:

hp_stat_time=r_ready%3D407;r_load%3D641

的partition无法删除

跟进后发现hive会自动对一些字符进行UTF-8编码,此处%3D解码后是'='

另外‘;’ 需要被转义

so:

dpdw_traffic_base的partition:
hp_stat_time=r_ready%3D91;r_load%3D351/hp_log_type=0/hp_host=http%3A%2F%2Fwww.dianping.com%2Fsearch%2Fkeyword%2F10%2F0_%25E9%259B%2581%25E8%258D%25A1%25E6%2583%2585/hp_from_type=__HIVE_DEFAULT_PARTITION__

正确的删除语句是:

> alter table dpdw_traffic_base drop partition(hp_stat_time='r_ready=91\;r_load=351');   
Dropping the partition hp_stat_time=r_ready%3D91;r_load%3D351/hp_log_type=0/hp_host=http%3A%2F%2Fwww.dianping.com%2Fsearch%2Fkeyword%2F10%2F0_%25E9%259B%2581%25E8%258D%25A1%25E6%2583%2585/hp_from_type=__HIVE_DEFAULT_PARTITION__
OK