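Reading a web page's source code in Java
For work I need to write a class that fetches the source code of a web page. The URL passed in can be arbitrary, and the returned source must not contain garbled characters.
I've spent a whole day on this without solving it, so I'd really appreciate some help.
(Whatever URL is passed in, the class must return that page's source. If the page cannot be opened, it should return the string "页面不存在", i.e. "page does not exist".)
I'm out of points, so I hope nobody minds the small reward.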
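------Solution--------------------
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedList;
import java.util.List;

public class URLSource {
    // Linked list used to store the related page links
    private static List<String> list = new LinkedList<String>();

    public static void downLoad(String eventName, int documentsNumber) {
        String source = null; // holds the page source
        // Use Google News search; results are sorted by relevance
        // String url = http://news.google.cn/archivesearch?q=躲猫猫&num=50&hl=zh-CN&ned=ccn&scoring=a
        // Note: eventName is concatenated unencoded; non-ASCII terms may need URLEncoder.encode
        String url = "http://news.google.cn/archivesearch?q=" + eventName + "&num=" + documentsNumber + "&hl=zh-CN&ned=ccn&scoring=a";
        source = getSource(url);
        // Extract the main text of each page (analyzer is not shown in this post)
        analyzer(source);
    }

    // Fetch the source of a page; returns null if it cannot be read
    private static String getSource(String link) {
        String charset = "GBK"; // default page encoding is GBK
        try {
            URL url = new URL(link);
            // Open the connection. openConnection() does not return null;
            // failures surface as exceptions, mostly from getInputStream()
            URLConnection connection = url.openConnection();
            // Download the raw bytes
            byte[] buf = new byte[2048];
            InputStream is = null;
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            int count = 0;
            try {
                is = connection.getInputStream();
                while ((count = is.read(buf)) >= 0) {
                    os.write(buf, 0, count);
                }
            } catch (Exception e) {
                e.printStackTrace();
                if (os.size() == 0) {
                    return null;
                }
            } finally {
                if (is != null) {
                    try { is.close(); } catch (Exception e) { }
                }
            }
            // Detect the page encoding. ISO-8859-1 maps bytes 1:1 to chars,
            // so the ASCII markup can be searched whatever the real encoding is
            String content = os.toString("ISO-8859-1");
            int fromIndex = content.indexOf("charset=");
            if (fromIndex >= 0) {
                int start = fromIndex + 8;
                // Skip an opening quote, as in <meta charset="utf-8">
                if (start < content.length() && "\"'".indexOf(content.charAt(start)) >= 0) {
                    start++;
                }
                int end = start;
                while (end < content.length() && "\"'> ;/".indexOf(content.charAt(end)) < 0) {
                    end++;
                }
                if (end > start) {
                    charset = content.substring(start, end);
                }
            }
            // An unknown charset name throws here and falls through to return null
            return new String(os.toByteArray(), charset);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}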
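A note on the encoding step: the charset lookup in getSource only looks at the page body. Servers often announce the encoding in the Content-Type response header as well, so one possible refinement is to check the header first, then the markup, and only then fall back to a default. The following is a minimal sketch of that idea and not part of the original answer; CharsetSniffer, detect, and decode are hypothetical names, and scanning only the first 4096 bytes is an arbitrary assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post): picks a charset for a downloaded page.
final class CharsetSniffer {
    // Matches charset=NAME in a Content-Type header or in a <meta> tag, quoted or not.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Order of preference: header value, then markup, then the caller's fallback.
    static Charset detect(String contentType, byte[] body, Charset fallback) {
        Charset fromHeader = parse(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays searchable.
        int n = Math.min(body.length, 4096); // assumption: the declaration sits near the top
        Charset fromMarkup = parse(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        return fromMarkup != null ? fromMarkup : fallback;
    }

    // Returns the first charset in the text that this JVM supports, or null.
    private static Charset parse(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: keep looking.
            }
        }
        return null;
    }

    // Decodes the raw bytes with whatever charset detect() settles on.
    static String decode(String contentType, byte[] body, Charset fallback) {
        return new String(body, detect(contentType, body, fallback));
    }
}
```

With the variables from getSource, a call could look like CharsetSniffer.decode(connection.getContentType(), os.toByteArray(), Charset.forName("GBK")), which keeps GBK as the default the answer already assumes.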
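------Solution--------------------
Java code
// Inside a method that declares throws IOException; url and htmlFileName are supplied by the caller.
// The bytes are saved to the file exactly as received, so no charset decoding happens here.
URL urlC = new URL(url);
URLConnection connection = urlC.openConnection();
// try-with-resources closes both streams even if the copy fails
try (InputStream ips = connection.getInputStream();
     FileOutputStream fos = new FileOutputStream(htmlFileName)) {
    challage(ips, fos);
}

// Copies every byte from ips to ops
private static void challage(InputStream ips, OutputStream ops) throws IOException {
    byte[] contents = new byte[1024];
    int len = 0;
    while ((len = ips.read(contents)) != -1) {
        ops.write(contents, 0, len);
    }
}
That's roughly it.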
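The question also asks for the string "页面不存在" when a page can't be opened, which the snippets above don't return. Here is a rough sketch of one way to add that, under some assumptions of mine: PageFetcher and fetchOrNotFound are hypothetical names, the timeout values are arbitrary, any non-2xx HTTP status is treated as "cannot be opened", and the caller passes in the charset name instead of having it detected.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper (not from the original post).
final class PageFetcher {
    // The fixed string the question asks for when a page cannot be opened.
    static final String NOT_FOUND = "页面不存在";

    // Returns the page source decoded with charsetName, or NOT_FOUND on any failure.
    static String fetchOrNotFound(String link, String charsetName) {
        try {
            URLConnection conn = new URL(link).openConnection();
            conn.setConnectTimeout(5000);  // assumption: 5 s to connect
            conn.setReadTimeout(10000);    // assumption: 10 s per read
            if (conn instanceof HttpURLConnection) {
                int status = ((HttpURLConnection) conn).getResponseCode();
                if (status < 200 || status >= 300) {
                    return NOT_FOUND; // assumption: non-2xx counts as "cannot be opened"
                }
            }
            try (InputStream in = conn.getInputStream();
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[4096];
                int len;
                while ((len = in.read(buf)) != -1) {
                    out.write(buf, 0, len);
                }
                return out.toString(charsetName);
            }
        } catch (IOException | RuntimeException e) {
            // Malformed URL, unreachable host, timeout, unknown charset name, and so on.
            return NOT_FOUND;
        }
    }
}
```

Something like PageFetcher.fetchOrNotFound("http://example.com/", "UTF-8") would then return either the page text or "页面不存在". Swallowing every exception keeps the contract simple, but it also hides why a fetch failed, so logging the exception before returning may be worth adding.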