Chinese text extracted with a regex comes out garbled: how to fix it - C# Tutorials - 爱易网页

Date: 2014-05-18  Views: 21434

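Chinese text extracted with a regex is garbled
The page I am extracting from with a regex is encoded as UTF-8. How do I get the extracted text to come out as readable Chinese?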

------Solution--------------------
Regex has nothing to do with encoding. It works fine with UTF-8 and with Chinese text.
------Solution--------------------

Was the encoding already broken when you fetched the page source?

A regex will definitely not produce garbled text by itself.
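
To illustrate the point above, here is a minimal sketch of where the garbling usually comes from: the same bytes decoded with the wrong charset. The class and method names (`EncodingDemo`, `RoundTrip`, `CountChinese`) are hypothetical, and ISO-8859-1 is only a stand-in for whatever wrong encoding a downloader might pick; the original poster's actual setup is not known.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class EncodingDemo
{
    // Encodes the text as UTF-8 bytes, then decodes those bytes with the
    // given encoding, mimicking a downloader that guesses the page charset.
    public static string RoundTrip(string text, Encoding decodeWith)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        return decodeWith.GetString(raw);
    }

    // Counts characters in the basic CJK range U+4E00..U+9FA5,
    // the same range used by the pattern quoted in this thread.
    public static int CountChinese(string text)
    {
        return Regex.Matches(text, @"[\u4e00-\u9fa5]").Count;
    }
}
```

Calling `EncodingDemo.RoundTrip("价格123", Encoding.UTF8)` gives back the original string, so the regex finds both Chinese characters. Decoding the same bytes with `Encoding.GetEncoding("iso-8859-1")` leaves the ASCII digits `123` intact but turns each Chinese character into three unrelated Latin characters, so the regex finds nothing. The regex behaves identically in both cases; only the decoding step differs.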
------Solution--------------------
[\u4e00-\u9fa5]+

This pattern extracts Chinese characters.

Post your code.
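------Solution--------------------

Discussion

The page looks fine in the browser, but the text I extract is garbled. Digits come out fine; only the Chinese is garbled.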

------Solution--------------------

Discussion

http://topic.csdn.net/u/20120225/22/b5912ce0-ed81-4932-8bb3-a456708d69d4.html

This is the page. I followed the code in reply #5, and what I extract is garbled.

------Solution--------------------
How come I get no garbled text when I fetch it??

For example, just extracting all the Chinese text:

C# code

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    /// <summary>
    /// Downloads the full HTML source of a page.
    /// </summary>
    /// <param name="Url">Address of the page to fetch.</param>
    /// <returns>The page source, or "error" if the request fails.</returns>
    public static string _GetHtml(string Url)
    {
        try
        {
            HttpWebRequest MyRequest = (HttpWebRequest)WebRequest.Create(Url);

            // The encoding must match the page's real charset (UTF-8 here);
            // a mismatch is what turns the Chinese text into garbage.
            Encoding encode = System.Text.Encoding.UTF8;

            // "using" closes the response, the stream and the reader even if reading fails.
            using (HttpWebResponse MyResponse = (HttpWebResponse)MyRequest.GetResponse())
            using (Stream MyInStream = MyResponse.GetResponseStream())
            using (StreamReader sr = new StreamReader(MyInStream, encode))
            {
                // Read the whole body at once instead of appending 256-char chunks to a string.
                return sr.ReadToEnd();
            }
        }
        catch (Exception)
        {
            // Sentinel value kept from the original post; callers cannot tell it from real content.
            return "error";
        }
    }

    static void Main(string[] args)
    {
        string htmlStr = _GetHtml("http://topic.csdn.net/u/20120225/22/b5912ce0-ed81-4932-8bb3-a456708d69d4.html");

        // Match every run of characters in the basic CJK range U+4E00..U+9FA5.
        Regex re = new Regex(@"[\u4e00-\u9fa5]+", RegexOptions.None);
        MatchCollection mc = re.Matches(htmlStr);
        foreach (Match ma in mc)
        {
            Console.WriteLine(ma.Value);
        }

        // Keep the console window open until Enter is pressed.
        Console.ReadLine();
    }
}

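------Solution--------------------
Discussion

Quote:
Your code for fetching the page source... you almost certainly got the following two lines wrong:

Encoding encode = System.Text.Encoding.UTF8;
StreamReader sr = new StreamReader(MyInStream, encode);


I did not write those two lines, and I do not know how to use them. What is MyInStream?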
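
For reference, in the code posted earlier in the thread `MyInStream` is the stream returned by `GetResponseStream()`, that is, the raw bytes of the page body, and the `StreamReader` turns those bytes into characters using the encoding passed to it. The sketch below shows one way those two lines might be adapted so the encoding follows the charset reported by the server instead of being hard-coded. `PageFetcher` and `ResolveEncoding` are hypothetical names, the UTF-8 fallback is an assumption, and `HttpWebRequest` is kept only for consistency with the thread.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageFetcher
{
    // Maps a charset name (for example from the Content-Type header) to an
    // Encoding, falling back to UTF-8 when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Some servers wrap the charset name in quotes; strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
        catch (NotSupportedException)
        {
            return Encoding.UTF8;
        }
    }

    // Downloads a page and decodes it with the charset the server reported.
    public static string GetHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream MyInStream = response.GetResponseStream()) // raw bytes of the body
        using (StreamReader sr = new StreamReader(MyInStream, ResolveEncoding(response.CharacterSet)))
        {
            return sr.ReadToEnd(); // bytes decoded into text
        }
    }
}
```

This relies on the server sending a correct charset. When the header omits it, the value of `CharacterSet` may not reflect the page's real encoding, and the `<meta>` charset declaration inside the HTML would have to be checked instead; that step is left out of this sketch.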
