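Urgent help needed (high points offered): a strange page-fetching problem
http://www.worldmetals.com.cn/search/metsearch.jsp?search=(铁矿石)%20and%20docchannel=(36)
This address returns normal results in IE, but whether I fetch it with java.net or with a raw Socket, I only ever get a page whose result set is empty.
The response headers are: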
HTTP/1.1 200 OK
Date: Thu, 24 May 2007 03:42:59 GMT
Server: IBM_HTTP_SERVER/1.3.19.3 Apache/1.3.20 (Win32)
Set-Cookie: JSESSIONID=0000OQ1AS0ENLEHAPX2IXG4VCQY:vdebn6i3;Path=/
Cache-Control: no-cache="set-cookie,set-cookie2"
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Transfer-Encoding: chunked
Content-Type: text/html;charset=gb2312
Content-Language: zh
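At first I suspected the chunked transfer encoding, but then I found that http://www.worldmetals.com.cn/search/metsearch.jsp?search=(china)%20and%20docchannel=(36) can be fetched normally.
So I now suspect the problem is in how the Chinese characters are transmitted. I have tried re-encoding the URL several different ways, but I still cannot get the result set I want.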
------Solution--------------------The program that fetches the page with the java.net package:
public void getHtml(String url)
{
    try
    {
        String sCurrentLine;
        // StringBuilder avoids re-copying the whole page on every appended line
        StringBuilder sTotalString = new StringBuilder();
        java.net.URL l_url = new java.net.URL(url);
        java.net.HttpURLConnection l_connection = (java.net.HttpURLConnection) l_url.openConnection();
        l_connection.connect();
        java.io.InputStream l_urlStream = l_connection.getInputStream();
        // The response declares charset=gb2312, so decode with it instead of the platform default
        java.io.BufferedReader l_reader = new java.io.BufferedReader(new java.io.InputStreamReader(l_urlStream, "gb2312"));
        try
        {
            while ((sCurrentLine = l_reader.readLine()) != null)
            {
                sTotalString.append(sCurrentLine).append("\n");
            }
        }
        finally
        {
            // Closing the reader also closes the underlying connection stream
            l_reader.close();
        }
        System.out.println(sTotalString);
    }
    catch(Exception ex)
    {
        System.out.println(ex.toString());
    }
}
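A side note on decoding: the response headers quoted earlier declare Content-Type: text/html;charset=gb2312, so the body should be read with that charset, not with whatever the local platform defaults to. Below is a minimal sketch of pulling the charset out of a Content-Type value. The class name ContentTypeCharset, the method charsetOf and the fallback value are hypothetical and are not part of the original program.

```java
import java.util.Locale;

public class ContentTypeCharset {
    // Hypothetical helper: returns the charset parameter of a Content-Type
    // header value, or the supplied fallback when none is present.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';'
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The value may optionally be wrapped in double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Header value taken from the response quoted in this thread
        System.out.println(charsetOf("text/html;charset=gb2312", "ISO-8859-1"));
    }
}
```

In getHtml, the result could be passed as the second argument of InputStreamReader, for example charsetOf(l_connection.getContentType(), "ISO-8859-1"). This only affects how the returned page is decoded; it does not by itself explain the empty result set.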
------Solution--------------------Try fetching a URL made up entirely of English characters, with nothing escaped, and see what comes back first.
------Solution--------------------
String condition = java.net.URLEncoder.encode("铁矿石", "UTF-8");
String url = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(" + condition + ")%20and%20docchannel=(36)";
getHtml(url);
I just tried this and it seems to work.
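To make it easier to see what the fix above actually sends, here is a small sketch that builds the same URL and prints it. Only the keyword is percent-encoded; the parentheses and %20 are left exactly as in the thread. The class name SearchUrlBuilder and the method buildSearchUrl are hypothetical, and the keyword is written with Unicode escapes so the source file's encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    // Hypothetical helper: percent-encodes the keyword with the given charset
    // and splices it into the search URL used in this thread.
    static String buildSearchUrl(String keyword, String charset) throws UnsupportedEncodingException {
        String condition = URLEncoder.encode(keyword, charset);
        return "http://www.worldmetals.com.cn/search/metsearch.jsp?search=("
                + condition + ")%20and%20docchannel=(36)";
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // ASCII keyword: encoding leaves it unchanged, which matches the
        // observation that the "china" query could always be fetched.
        System.out.println(buildSearchUrl("china", "UTF-8"));
        // The three escapes below spell the Chinese keyword for "iron ore".
        System.out.println(buildSearchUrl("\u94c1\u77ff\u77f3", "UTF-8"));
    }
}
```

With UTF-8 the keyword becomes %E9%93%81%E7%9F%BF%E7%9F%B3, so the request line contains only ASCII characters. If the server turned out to expect another charset (the pages themselves are gb2312), the second argument would be the thing to change; that is an assumption to test, not something the thread confirms.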