Date: 2014-05-20

Help wanted (high points offered): a strange problem with page fetching
http://www.worldmetals.com.cn/search/metsearch.jsp?search=(铁矿石)%20and%20docchannel=(36)

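This URL returns the expected results in IE, but whether I fetch it with java.net or with a raw Socket, I only ever get back a page whose result set is empty (0 results).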

The response headers are:
HTTP/1.1 200 OK
Date: Thu, 24 May 2007 03:42:59 GMT
Server: IBM_HTTP_SERVER/1.3.19.3 Apache/1.3.20 (Win32)
Set-Cookie: JSESSIONID=0000OQ1AS0ENLEHAPX2IXG4VCQY:vdebn6i3;Path=/
Cache-Control: no-cache="set-cookie,set-cookie2"
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Transfer-Encoding: chunked
Content-Type: text/html;charset=gb2312
Content-Language: zh
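
The Content-Type header above declares charset=gb2312, so the response body should be decoded with that charset rather than the platform default. A minimal sketch of pulling the charset out of a Content-Type value follows; the class name ContentTypeCharset, the helper charsetOf, and the fallback value are hypothetical names chosen for illustration, not part of the original program.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Return the charset parameter of a Content-Type value,
    // or the given fallback when no charset parameter is present.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Header value taken from the response shown above.
        System.out.println(charsetOf("text/html;charset=gb2312", "ISO-8859-1"));
    }
}
```

With an HttpURLConnection, the header value would come from getContentType(), and the result could be passed as the second argument of InputStreamReader.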

At first I suspected the chunked transfer encoding, but http://www.worldmetals.com.cn/search/metsearch.jsp?search=(china)%20and%20docchannel=(36) can be fetched correctly.

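So I now suspect the problem is in how the Chinese characters are passed. I have re-encoded the URL several times, but I still cannot get the result set I want.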
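
One way to test the encoding suspicion is to compare how the keyword is percent-encoded under different charsets, since the same Chinese text produces different byte sequences in UTF-8 and GBK. The sketch below is only illustrative: the class name EncodingCompare and the helper encode are hypothetical, and which charset the server actually expects is an assumption that has to be checked against the live site.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodingCompare {
    // Percent-encode a keyword using the named charset.
    static String encode(String keyword, String charset) throws UnsupportedEncodingException {
        return URLEncoder.encode(keyword, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Unicode escapes for the keyword 铁矿石, so the result does not
        // depend on the encoding the source file was saved in.
        String keyword = "\u94c1\u77ff\u77f3";

        String utf8 = encode(keyword, "UTF-8");
        String gbk = encode(keyword, "GBK");

        System.out.println("UTF-8: " + utf8);
        System.out.println("GBK:   " + gbk);

        // Decoding with the wrong charset does not give the keyword back,
        // which is one way a search term can silently stop matching.
        System.out.println("GBK bytes read as UTF-8 match: "
                + URLDecoder.decode(gbk, "UTF-8").equals(keyword));
        System.out.println("GBK bytes read as GBK match:   "
                + URLDecoder.decode(gbk, "GBK").equals(keyword));
    }
}
```

If the two encoded forms differ, each one can be tried in the search URL in turn to see which the server accepts.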


------Solution--------------------
The program that fetches the page with the java.net package:
public void getHtml(String url)
{
    // Fetch the page at the given URL and print its body.
    try
    {
        java.net.URL l_url = new java.net.URL(url);
        java.net.HttpURLConnection l_connection = (java.net.HttpURLConnection) l_url.openConnection();
        l_connection.connect();

        StringBuilder sTotalString = new StringBuilder();

        // The server declares charset=gb2312 in Content-Type, so decode with it
        // instead of relying on the platform default charset.
        try (java.io.BufferedReader l_reader = new java.io.BufferedReader(
                new java.io.InputStreamReader(l_connection.getInputStream(), "GB2312")))
        {
            String sCurrentLine;
            while ((sCurrentLine = l_reader.readLine()) != null)
            {
                sTotalString.append(sCurrentLine).append("\n");
            }
        }
        System.out.println(sTotalString);
    }
    catch (Exception ex)
    {
        System.out.println(ex.toString());
    }
}
------Solution--------------------
Try fetching a URL made up entirely of English characters, with nothing escaped, and look at that result first.
------Solution--------------------
String condition = java.net.URLEncoder.encode("铁矿石", "UTF-8"); 
String url = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(" + condition + ")%20and%20docchannel=(36)";
getHtml(url);

I tried this for now and it seems to work.
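
The same idea can be wrapped in a small helper so that the keyword is always encoded before it is placed in the query. This is a sketch under assumptions: the class SearchUrlBuilder and the method buildSearchUrl are hypothetical names, the channel number 36 is taken from the URLs in the question, and UTF-8 is used only because it is what the reply above reported as working.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlBuilder {
    // Search endpoint copied from the question.
    static final String BASE = "http://www.worldmetals.com.cn/search/metsearch.jsp";

    // Build the search URL, percent-encoding only the keyword and keeping
    // the rest of the query in the same form as the original URL.
    static String buildSearchUrl(String keyword, int channel, String charset)
            throws UnsupportedEncodingException {
        String encoded = URLEncoder.encode(keyword, charset);
        return BASE + "?search=(" + encoded + ")%20and%20docchannel=(" + channel + ")";
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Unicode escapes for 铁矿石 avoid any dependence on source file encoding.
        System.out.println(buildSearchUrl("\u94c1\u77ff\u77f3", 36, "UTF-8"));
        // An ASCII keyword passes through unchanged, matching the URL that already worked.
        System.out.println(buildSearchUrl("china", 36, "UTF-8"));
    }
}
```

The returned string can be passed directly to getHtml(url). If the site turns out to expect a different charset, only the third argument needs to change.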