[Research] HttpClient garbles individual Chinese characters: how to fix it - Java Tutorials - 爱易网

Date: 2014-05-17

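When I fetch web pages with HttpClient, the Chinese character "埇" comes out garbled while every other character is fine.
The Baidu web search page decodes correctly, but Baidu Zhidao shows the garbled character.
How can I avoid the garbling? Any help is appreciated, thanks in advance; I can award extra points.
Environment: jdk1.5.0_14, httpclient-4.1.2.jar
Code attached: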

Java code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class Test {
    public static void main(String[] args) {
        try {
            String[] urlAry = new String[]{
                    // Baidu web search: "埇" is decoded correctly
                    "http://www.baidu.com/s?cl=3&wd=%CB%DE%88%AC%D6%B4",
                    // Baidu Zhidao: "埇" is garbled
                    "http://zhidao.baidu.com/q?word=%CB%DE%88%AC%D6%B4&lm=0&fr=search&ct=17&pn=0&tn=ikaslist&rn=10"
            };
            for (String queryURL : urlAry) {
                DefaultHttpClient client = new DefaultHttpClient();
                try {
                    HttpGet httpget = new HttpGet(queryURL);
                    HttpResponse response = client.execute(httpget);
                    HttpEntity entity = response.getEntity();
                    // Read the response body as text
                    String returnText = EntityUtils.toString(entity, "gb2312");
                    // Full page source
                    // System.out.println(returnText);
                    // Use a regular expression to extract the part to compare
                    getTextByRule(".*?(宿.*?执).*", returnText);
                } finally {
                    // Release the connection even if the request fails
                    client.getConnectionManager().shutdown();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Prints the first capture group of the first match, if there is one
    public static void getTextByRule(String pattern, String str) {
        Pattern p = Pattern.compile(pattern);
        Matcher matcher = p.matcher(str);
        if (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

------Solution--------------------

Java code

String returnText = EntityUtils.toString(entity,"GBK");

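------Solution--------------------

Have you noticed that the charset in this call actually has no effect at all? defaultCharset only takes effect when the entity does not supply a charset of its own.
String returnText = EntityUtils.toString(entity,"gb2312");
Change the charset to anything you like; even changing it to 123 will not affect the result.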

public static String toString(HttpEntity entity,
                             String defaultCharset)
                      throws IOException,
                             ParseException

   Get the entity content as a String, using the provided default character set if none is found in the entity. If defaultCharset is null, the default "ISO-8859-1" is used.  
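
To make the fallback rule in that Javadoc concrete, here is a minimal sketch of the selection order it describes. The class and the `pickCharset` helper are hypothetical names written only for illustration; this is a model of the documented behaviour, not HttpClient's actual source.

```java
final class CharsetFallbackSketch {

    // Hypothetical helper modelling the documented order:
    // 1. the charset the entity declares, 2. the caller's default, 3. ISO-8859-1.
    static String pickCharset(String declaredByEntity, String defaultCharset) {
        if (declaredByEntity != null) {
            return declaredByEntity;   // the default is never consulted
        }
        if (defaultCharset != null) {
            return defaultCharset;
        }
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        // The entity declares GB2312, so the second argument is ignored.
        System.out.println(pickCharset("GB2312", "GBK"));
        System.out.println(pickCharset("GB2312", "123"));
        // Only when nothing is declared does the default matter.
        System.out.println(pickCharset(null, "GBK"));
        System.out.println(pickCharset(null, null));
    }
}
```

Under this model, whenever the response declares a charset, the value passed as defaultCharset cannot change the decoded text.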

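The difference between the two results comes from the charset that each URL's response carries: the first is GBK, the second is GB2312.
Because 埇 has no code in GB2312 but is 88AC in GBK (decimal: 34988), the second response cannot render it.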
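
The same effect can be reproduced without any network access. The sketch below decodes the six bytes behind the query string %CB%DE%88%AC%D6%B4 once as GBK and once as GB2312. The class and helper names are made up for illustration, and it assumes the JRE ships both the GBK and GB2312 charsets (standard full JDK builds do).

```java
import java.io.UnsupportedEncodingException;

final class GbkVsGb2312Demo {

    // Percent-decoded bytes of %CB%DE%88%AC%D6%B4 from the URLs in the question.
    static final byte[] QUERY_BYTES = {
        (byte) 0xCB, (byte) 0xDE, (byte) 0x88, (byte) 0xAC, (byte) 0xD6, (byte) 0xB4
    };

    // Decodes bytes with the named charset; unknown names become an unchecked error.
    static String decode(byte[] bytes, String charsetName) {
        try {
            return new String(bytes, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName);
        }
    }

    public static void main(String[] args) {
        // GBK knows the byte pair 88 AC, so all three characters survive.
        System.out.println("GBK:    " + decode(QUERY_BYTES, "GBK"));
        // GB2312 has no mapping for 88 AC, so the middle of the string is lost.
        System.out.println("GB2312: " + decode(QUERY_BYTES, "GB2312"));
    }
}
```

How exactly the GB2312 decoder replaces the unmappable bytes may differ between JDK versions, so only the GBK line should be expected to contain 埇.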


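------Solution--------------------
Discussion

One thing is rather odd: although this site sets its charset to GB2312, browsers still display GBK characters correctly. Evidently the browser automatically promotes the encoding to GBK, or to a GBK-compatible encoding, when decoding.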
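
That browser behaviour can be imitated on the client side. The following is a rough sketch, assuming you decode the raw body bytes yourself instead of letting the declared charset decide: a declared GB2312 is promoted to GBK, which is safe in practice because GBK is a superset of GB2312. `promote` and `decodeBody` are hypothetical helpers, not HttpClient API. With HttpClient 4.x the bytes could be obtained with `EntityUtils.toByteArray(entity)` and the declared charset with `EntityUtils.getContentCharSet(entity)`.

```java
import java.io.UnsupportedEncodingException;

final class Gb2312Promotion {

    // Hypothetical helper: treat a declared GB2312 label as GBK.
    static String promote(String declaredCharset) {
        if (declaredCharset != null && "GB2312".equalsIgnoreCase(declaredCharset.trim())) {
            return "GBK";
        }
        return declaredCharset;
    }

    // Decodes the raw body with the promoted charset, or ISO-8859-1 if none was declared.
    static String decodeBody(byte[] body, String declaredCharset) {
        String charset = promote(declaredCharset);
        if (charset == null) {
            charset = "ISO-8859-1";
        }
        try {
            return new String(body, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charset);
        }
    }

    public static void main(String[] args) {
        // Same bytes as the query string %CB%DE%88%AC%D6%B4 in the question.
        byte[] body = {
            (byte) 0xCB, (byte) 0xDE, (byte) 0x88, (byte) 0xAC, (byte) 0xD6, (byte) 0xB4
        };
        // Declared as GB2312, decoded as GBK, so the GBK-only character survives.
        System.out.println(decodeBody(body, "GB2312"));
    }
}
```

In the question's code this would mean replacing the EntityUtils.toString call with a byte-level read followed by decodeBody, so that the page's declared GB2312 no longer forces the narrower decoder.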
