Date: 2014-05-18  Views: 20718

A web-scraping problem, and a difficult one; please advise.
The main code of the HttpHelper class is as follows:

C# code

        private CookieContainer cc;
        private string contentType = "application/x-www-form-urlencoded";
        private string accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/x-silverlight, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-silverlight-2-b1, */*";
        private string userAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)";
        private Encoding encoding = Encoding.GetEncoding("gb2312");

        public string GetHtml(string url, CookieContainer cookieContainer)
        {
            HttpWebRequest httpWebRequest;

            httpWebRequest = (HttpWebRequest)HttpWebRequest.Create(url);
            httpWebRequest.CookieContainer = cookieContainer;
            httpWebRequest.ContentType = contentType;
            httpWebRequest.Referer = url;
            httpWebRequest.Accept = accept;
            httpWebRequest.UserAgent = userAgent;
            httpWebRequest.Method = "GET";

            HttpWebResponse httpWebResponse;
            httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
            Stream responseStream = httpWebResponse.GetResponseStream();
            StreamReader streamReader = new StreamReader(responseStream, encoding);
            string html = streamReader.ReadToEnd();
            streamReader.Close();
            responseStream.Close();

            return html;
        }



The code that calls this method is as follows:
C# code

            HttpHelper helper = new HttpHelper();
            string ss = helper.GetHtml("http://bill.finance.sina.com.cn/bill/detail.php?stock_code=sh600550&bill_size=40000");



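The page I want to scrape is http://bill.finance.sina.com.cn/bill/detail.php?stock_code=sh600550&bill_size=40000
If I scrape http://www.sina.com.cn instead, there is no problem at all.
Scraping the page above fails, though. That page probably applies some restriction or check. Could someone experienced take a look?
Thanks!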

------Solution--------------------
Just use this method of mine. I have tested it and it works!

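public string gethtml(string url)
{
    string text2 = "";
    // WebClient is IDisposable, so release it once the download is done.
    using (WebClient client1 = new WebClient())
    {
        try
        {
            // Download the raw bytes, then decode them with the system default encoding.
            byte[] buffer1 = client1.DownloadData(url);
            string text1 = Encoding.Default.GetString(buffer1);
            text2 = text1;
        }
        catch
        {
            // Any failure (network error, bad URL, ...) is reported as null.
            text2 = null;
        }
    }
    return text2;
}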


------Solution--------------------
http://blog.csdn.net/jiang_jiajia10/archive/2008/11/18/3325407.aspx
------Solution--------------------
The page is deflate-compressed, so wrap the response stream in a DeflateStream before reading it:

System.IO.Compression.DeflateStream responseStream = new System.IO.Compression.DeflateStream(httpWebResponse.GetResponseStream(), System.IO.Compression.CompressionMode.Decompress);
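
To make the decompression step concrete, here is a minimal sketch that can be run without touching the network. It is not the original poster's code: `DeflateDemo`, `ReadDeflated`, and `Deflate` are hypothetical names, the `MemoryStream` merely stands in for `httpWebResponse.GetResponseStream()`, and UTF-8 is assumed in place of gb2312 so the sample runs on any .NET runtime.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class DeflateDemo
{
    // Hypothetical helper: wraps a deflate-compressed stream in a DeflateStream
    // and decodes the decompressed bytes to text with the given encoding.
    public static string ReadDeflated(Stream compressed, Encoding encoding)
    {
        using (var inflater = new DeflateStream(compressed, CompressionMode.Decompress))
        using (var reader = new StreamReader(inflater, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Stand-in for the server side: deflate-compresses a string into a byte array.
    public static byte[] Deflate(string text, Encoding encoding)
    {
        using (var buffer = new MemoryStream())
        {
            // Disposing the DeflateStream flushes the final compressed block.
            using (var deflater = new DeflateStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = encoding.GetBytes(text);
                deflater.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        Encoding encoding = Encoding.UTF8; // assumption; the thread's code uses gb2312
        byte[] body = Deflate("<html>stock_code=sh600550</html>", encoding);

        // In GetHtml, the MemoryStream would be httpWebResponse.GetResponseStream().
        string html = ReadDeflated(new MemoryStream(body), encoding);
        Console.WriteLine(html);
    }
}
```

If decompressing by hand is not required, setting `httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;` before calling `GetResponse()` should let the framework handle it; whether that is enough for this particular page has not been verified here.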
------Solution--------------------
See here. This is the method root_ gave me: http://topic.csdn.net/u/20081215/23/28f9ae30-2fa4-4b8d-8f84-710b4b5ddb6e.html
------Solution--------------------
Right, that's it: decompress the stream first, then hand it to the StreamReader.
Quote:
The page is deflate-compressed.

System.IO.Compression.DeflateStream responseStream =new System.IO.Compression.DeflateStream( h