Reading web page source code in Java: solutions

Date: 2014-05-20  Views: 20,858

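Reading web page source code in Java
For work I need to write a class that fetches the source code of a web page. The URL passed in is arbitrary, and the returned source must not contain garbled characters.
I have spent a whole day on this without solving it, so I would really appreciate some help.
(Whatever URL is passed in, I need to get its source. If the page cannot be opened, return the string "页面不存在", i.e. "page does not exist".)

I am out of points, so I hope nobody minds the small reward.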

------Solution--------------------
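import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.util.LinkedList;
import java.util.List;

public class URLSource {

    // Linked list that stores the related page links
    private static List<String> list = new LinkedList<String>();

    public static void downLoad(String eventName, int documentsNumber) throws UnsupportedEncodingException {
        String source = null; // holds the page source

        // Use Google News search; results are sorted by relevance
        // String url = http://news.google.cn/archivesearch?q=躲猫猫&num=50&hl=zh-CN&ned=ccn&scoring=a
        // The keyword is URL-encoded so that non-ASCII queries survive (UTF-8 is assumed here)
        String url = "http://news.google.cn/archivesearch?q=" + URLEncoder.encode(eventName, "UTF-8")
                + "&num=" + documentsNumber + "&hl=zh-CN&ned=ccn&scoring=a";
        source = getSource(url);

        // Extract the body text of each page
        analyzer(source);
    }

    // Placeholder: the original post calls analyzer() but does not include its code
    private static void analyzer(String source) {
    }

    // Fetch the source of a web page; returns null if nothing could be downloaded
    private static String getSource(String link) {
        String charset = "GBK"; // default page encoding is GBK
        URLConnection connection = null;
        try {
            URL url = new URL(link);
            // Open the connection (the network request itself happens in getInputStream() below)
            connection = url.openConnection();
            if (null == connection)
                return null;

            // Download the raw bytes
            byte[] buf = new byte[2048];
            InputStream is = null;
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            int count = 0;

            try {
                is = connection.getInputStream();
                while ((count = is.read(buf)) >= 0) {
                    os.write(buf, 0, count);
                }
            } catch (Exception e) {
                e.printStackTrace();
                if (os.size() == 0) {
                    return null;
                }
            } finally {
                // is is still null if getInputStream() threw
                if (is != null) {
                    try { is.close(); } catch (Exception e) { }
                }
            }

            // Look for a charset=... declaration in the page. The bytes are decoded as
            // ISO-8859-1 for this search so it does not depend on the platform default.
            String content = os.toString("ISO-8859-1");
            int fromIndex = content.indexOf("charset=");
            if (fromIndex >= 0) {
                int start = fromIndex + 8;
                // Skip an opening quote, as in <meta charset="utf-8">
                if (start < content.length() && content.charAt(start) == '"') {
                    start++;
                }
                int end = content.indexOf("\"", start);
                if (end > start) {
                    String declared = content.substring(start, end).trim();
                    try {
                        if (Charset.isSupported(declared)) {
                            charset = declared;
                        }
                    } catch (IllegalArgumentException e) {
                        // Not a legal charset name: keep the GBK default
                    }
                }
            }

            return new String(os.toByteArray(), charset);
        } catch (Exception e) {
            e.printStackTrace();
        }

        return null;
    }
}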
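The charset lookup at the end of getSource only reads the page body. The server often names the encoding in the Content-Type response header as well, so a more tolerant lookup could check the header first and the markup second. Below is a minimal sketch of that idea, not a complete encoding sniffer. The class CharsetSniffer and its methods find and decode are hypothetical names that do not appear in the thread, and the 4096-byte scan window is an arbitrary assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class CharsetSniffer {
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8' (case-insensitive)
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset named in text, or null if there is none
    // or the JVM does not support it.
    static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return null;
        }
    }

    // Decodes body using, in order: the Content-Type header, a charset
    // declaration near the start of the markup, then the fallback.
    static String decode(byte[] body, String contentTypeHeader, Charset fallback) {
        Charset cs = find(contentTypeHeader);
        if (cs == null) {
            // ISO-8859-1 maps every byte to one char, so the ASCII markup can be
            // searched without knowing the real encoding yet.
            int n = Math.min(body.length, 4096); // assumed scan window
            cs = find(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        }
        return new String(body, cs != null ? cs : fallback);
    }
}
```

With a URLConnection in hand, the header value would come from connection.getContentType(), which may be null, and the fallback could stay GBK as in the answer above.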
------Solution--------------------

Java code


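import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

        // url (the page address) and htmlFileName (the output file) are supplied by the caller
        URL urlC = new URL(url);
        URLConnection connection = urlC.openConnection();
        // try-with-resources closes both streams even if the copy fails halfway
        try (InputStream ips = connection.getInputStream();
             FileOutputStream fos = new FileOutputStream(htmlFileName)) {
            challage(ips, fos);
        }

    // Copy everything from ips to ops in 1 KB chunks
    private static void challage(InputStream ips, OutputStream ops) throws IOException {
        byte[] contents = new byte[1024];
        int len = 0;
        while ((len = ips.read(contents)) != -1) {
            ops.write(contents, 0, len);
        }
    }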
That is roughly how it goes.
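Neither answer covers the part of the question that asks for the string "页面不存在" when a page cannot be opened. A minimal sketch of one way to add that, assuming any failure (malformed URL, timeout, non-200 status, unknown charset) counts as "cannot be opened". The class PageFetcher, its method names, and the timeout values are hypothetical choices, not something from the thread. The message is written with Unicode escapes so the source file stays pure ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of the original answers.
class PageFetcher {
    // The Chinese string the question asks for ("page does not exist"),
    // spelled with Unicode escapes
    static final String NOT_FOUND = "\u9875\u9762\u4e0d\u5b58\u5728";

    // Downloads the raw bytes of a page, or throws IOException on failure.
    static byte[] fetchBytes(String link) throws IOException {
        URLConnection connection = new URL(link).openConnection();
        connection.setConnectTimeout(5000);  // assumed value, in milliseconds
        connection.setReadTimeout(10000);    // assumed value, in milliseconds
        if (connection instanceof HttpURLConnection) {
            // Treat any status other than 200 as a failure
            int status = ((HttpURLConnection) connection).getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP status " + status);
            }
        }
        try (InputStream in = connection.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[2048];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Returns the page decoded with charsetName, or NOT_FOUND if anything goes wrong.
    static String fetchOrNotFound(String link, String charsetName) {
        try {
            return new String(fetchBytes(link), charsetName);
        } catch (Exception e) {
            // Deliberately broad: every failure maps to the same message
            return NOT_FOUND;
        }
    }
}
```

A call would look like PageFetcher.fetchOrNotFound("http://example.com/", "UTF-8"). Passing a fixed charset name keeps the sketch short; in practice the encoding would be detected from the response rather than assumed.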
