Date: 2014-05-16

Scraping JavaScript-rendered pages with HtmlUnit

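Requirement:

Some sites build their pages with JavaScript, so the crawler has to capture the page as it looks after the scripts have run, not just the raw HTML returned by the server.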

Implementation:

Based on HtmlUnit:

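import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// HtmlUnit 2.9 API; later releases move these setters to webClient.getOptions().
public static void getAjaxPage() throws Exception {
	WebClient webClient = new WebClient();
	webClient.setJavaScriptEnabled(true);             // run the page's scripts
	webClient.setCssEnabled(false);                   // CSS is not needed for extraction
	// Run AJAX calls issued from the main thread synchronously
	webClient.setAjaxController(new NicelyResynchronizingAjaxController());
	webClient.setTimeout(Integer.MAX_VALUE);          // effectively no connection timeout
	webClient.setThrowExceptionOnScriptError(false);  // tolerate script errors on the page
	try {
		HtmlPage rootPage = webClient.getPage("http://tt.mop.com/read_14304066_1_0.html");
		// Print the DOM as it stands after script execution
		System.out.println(rootPage.asXml());
	} finally {
		webClient.closeAllWindows();                  // release windows and background threads
	}
}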
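
Content that scripts load in the background may not be in the DOM yet when getPage() returns. One possible workaround is to re-read the page source until an expected marker shows up or a deadline passes. The sketch below is only an illustration of that idea: the class RenderWait, the method waitForMarker and the marker string are hypothetical names, not part of HtmlUnit, and the Supplier stands in for something like rootPage::asXml.

```java
import java.util.function.Supplier;

public final class RenderWait {
    private RenderWait() {}

    /**
     * Re-reads the source until it contains the marker or the timeout elapses.
     * Returns the last snapshot read, whether or not the marker was found.
     */
    public static String waitForMarker(Supplier<String> source, String marker,
                                       long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        String snapshot = source.get();
        while (!snapshot.contains(marker) && System.nanoTime() < deadline) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up early
                Thread.currentThread().interrupt();
                break;
            }
            snapshot = source.get();
        }
        return snapshot;
    }

    public static void main(String[] args) {
        // Fake source standing in for rootPage::asXml (assumption for the demo):
        // the marker only appears on the third read.
        int[] reads = {0};
        Supplier<String> fake =
                () -> ++reads[0] < 3 ? "<div/>" : "<div id=\"content\"/>";
        String xml = waitForMarker(fake, "id=\"content\"", 2000, 10);
        System.out.println(xml);
    }
}
```

With a real page the call would look roughly like waitForMarker(rootPage::asXml, "some-marker", 10000, 200), where the marker is any string known to appear only in the fully rendered page.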

Maven dependencies:

<dependency>
	<groupId>net.sourceforge.htmlunit</groupId>
	<artifactId>htmlunit-core-js</artifactId>
	<version>2.9</version>
	<scope>compile</scope>
</dependency>
<dependency>
	<groupId>net.sourceforge.htmlunit</groupId>
	<artifactId>htmlunit</artifactId>
	<version>2.9</version>
	<scope>compile</scope>
</dependency>

Notes:

Nutch plugin: nutch-htmlunit replaces Nutch's built-in HTTP fetch component.
