大神救救！C#获取html源文件很不稳定，对同一个网页，有时候对，有时错-C#教程-爱易网页

大神救救！C#获取html源文件很不稳定，对同一个网页，有时候对，有时错

日期：2014-05-17　浏览次数：21367 次

大神救救！！！C#获取html源文件很不稳定，对同一个网页，有时对，有时错
附上源代码

C# code


private static string getHtml(string url, string charSet)//url是要访问的网站地址，charSet是目标网页的编码，如果传入的是null或者""，那就自动分析网页的编码 
        {
            charSet = "gb2312";
            WebClient myWebClient = new WebClient(); //创建WebClient实例myWebClient 
            // 需要注意的： 
            //有的网页可能下不下来，有种种原因比如需要cookie,编码问题等等 
            //这是就要具体问题具体分析比如在头部加入cookie 
            // webclient.Headers.Add("Cookie", cookie); 
            //这样可能需要一些重载方法。根据需要写就可以了 
            //获取或设置用于对向 Internet 资源的请求进行身份验证的网络凭据。 
            myWebClient.Credentials = CredentialCache.DefaultCredentials;
            //如果服务器要验证用户名,密码 
            //NetworkCredential mycred = new NetworkCredential(struser, strpassword); 
            //myWebClient.Credentials = mycred; 
            //从资源下载数据并返回字节数组。（加@是因为网址中间有"/"符号） 
            byte[] myDataBuffer = myWebClient.DownloadData(url);
            string strWebData = Encoding.GetEncoding("GB2312").GetString(myDataBuffer);
            //StreamWriter sw = new StreamWriter("D:/qq-utf-8.txt");
            //sw.Write(strWebData);
            //sw.Close();
            //获取网页字符编码描述信息 
            Match charSetMatch = Regex.Match(strWebData, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase | RegexOptions.Multiline);
            string webCharSet = charSetMatch.Value;

            int start = webCharSet.IndexOf("charset=");
            
            //if (start == -1)
            //{
            //    charSet = "utf-8";
            //    strWebData = Encoding.GetEncoding(charSet).GetString(myDataBuffer);
            //    charSetMatch = Regex.Match(strWebData, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase | RegexOptions.Multiline);
            //    webCharSet = charSetMatch.Value;
            //    start = webCharSet.IndexOf("charset=");
            //}
            int end = webCharSet.IndexOf("\"", start);
            webCharSet=webCharSet.Substring(start+8,end-(start+8));
            if(webCharSet!=charSet)
                strWebData = Encoding.GetEncoding(webCharSet).GetString(myDataBuffer);
            //if (charSet == null || charSet == "")
            //    charSet = webCharSet;
            
            //if (charSet != null && charSet != "" && Encoding.GetEncoding(charSet) != Encoding.GetEncoding("utf-8"))
            //    strWebData = Encoding.GetEncoding(charSet).GetString(myDataBuffer);
            return strWebData;
        }

症状如下：
1.对于同一个网页，如www.qq.com 我用gb2312解析，发现有时解析的htnl正确，有时解析的html是乱码，很奇葩。。
2.对于同一个url获得的字节数组，例如cn.msn.com(utf-8的编码),如果先用gb2312,后用utf-8还是乱码，而只有utf-8是对的，难道用gb2312编码后会修改网页的字节数组？？？？？
3.目前该段程序，仍然无法很好的解析所有的网页，尤其是utf-8的网页。而且稳定性极差。

------解决方案--------------------

你用dedault编码试试看

不要指定

免责声明： 本文仅代表作者个人观点，与爱易网无关。其原创性以及文中陈述文字和内容未经本站证实，对本文以及其中全部或者部分内容、文字的真实性、完整性、及时性本站不作任何保证或承诺，请读者仅作参考，并请自行核实相关内容。

大神救救！C#获取html源文件很不稳定，对同一个网页，有时候对，有时错

相关资料更多>

推荐阅读更多>