Date: 2014-05-17

htmlcleaner usage example

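When programming, the data source is sometimes HTML, so the HTML has to be parsed and the data extracted from it. Fortunately the Java community has good libraries for parsing HTML. Having tried both, I personally find htmlcleaner easier to use than htmlparser, and its XPath support is especially handy. It may also simply be that I am less familiar with htmlparser.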

htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip

Write an HTML file for testing: html-clean-demo.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=GBK"/>
    <meta http-equiv="Content-Language" content="zh-CN"/>
    <title>html clean demo</title>
</head>
<body>
<div class="d_1">
    <ul>
        <li>bar</li>
        <li>foo</li>
        <li>gzz</li>
    </ul>
</div>
<div>
    <ul>
        <li><a name="my_href" href="1.html">text-1</a></li>
        <li><a name="my_href" href="2.html">text-2</a></li>
        <li><a name="my_href" href="3.html">text-3</a></li>
        <li><a name="my_href" href="4.html">text-4</a></li>
    </ul>
</div>
</body>
</html>


Mock requirements: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The code below does this with htmlcleaner: HtmlCleanerDemo.java

package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * htmlcleaner usage example.
 *
 * @author chenlb 2008-11-26 14:12:02
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Parse the test file, which is encoded as GBK.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Select by tag name (true = search recursively): the page title.
        Object[] ns = node.getElementsByName("title", true);

        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        System.out.println("ul/li:");
        // Select by XPath: every li under the div with class="d_1".
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        System.out.println("a:");
        // Select by attribute value: name="my_href", recursive, case-sensitive.
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
        }
    }
}
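
As a cross-check of what the two selections in the demo are meant to match, the sketch below evaluates comparable XPath expressions with the JDK's built-in javax.xml.xpath engine rather than htmlcleaner. The class name XPathCheck, the helper select, and the inline SAMPLE fragment are hypothetical names made up for this illustration. Note that the JDK parser only accepts well-formed markup, so this is no substitute for htmlcleaner on real-world pages.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathCheck {

    // Hypothetical well-formed fragment mirroring the two div blocks of the demo page.
    static final String SAMPLE =
        "<body>"
        + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "<div><ul><li><a name=\"my_href\" href=\"1.html\">text-1</a></li></ul></div>"
        + "</body>";

    /** Evaluates an XPath expression and returns the text of each matched node. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            // Parsing fails on malformed markup; surface that as an unchecked error.
            throw new IllegalStateException("could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//div[@class='d_1']//li"));    // [bar, foo, gzz]
        System.out.println(select(SAMPLE, "//a[@name='my_href']/@href")); // [1.html]
    }
}
```

The first expression is the one passed to evaluateXPath in the demo. The second is an XPath counterpart to the getElementsByAttValue call, written for the JDK engine only; whether htmlcleaner's own XPath implementation accepts it is not something this sketch checks.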

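The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. In my view the most commonly used methods are evaluateXPath, getElementsByAttValue, and getElementsByName. Also worth noting: htmlcleaner copes well with non-standard (malformed) HTML.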
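
Since clean() also accepts the HTML content as a string, the same kind of extraction can be tried without a file on disk. The following is a minimal sketch under that assumption: the class name HtmlCleanerStringDemo, the helpers title and listItems, and the inline SAMPLE markup are hypothetical stand-ins, while the htmlcleaner calls are the ones used in HtmlCleanerDemo.java plus clean(String).

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class HtmlCleanerStringDemo {

    // Hypothetical inline markup standing in for html-clean-demo.html.
    static final String SAMPLE =
        "<html><head><title>html clean demo</title></head><body>"
        + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "</body></html>";

    /** Returns the text of the first title element, or null if there is none. */
    static String title(String html) {
        try {
            TagNode root = new HtmlCleaner().clean(html);
            TagNode[] titles = root.getElementsByName("title", true);
            return titles.length > 0 ? titles[0].getText().toString() : null;
        } catch (Exception e) {
            // Some htmlcleaner versions declare a checked exception on clean(String).
            throw new IllegalStateException("could not parse the markup", e);
        }
    }

    /** Returns the text of every li under the div with class="d_1". */
    static List<String> listItems(String html) {
        try {
            TagNode root = new HtmlCleaner().clean(html);
            List<String> items = new ArrayList<>();
            for (Object match : root.evaluateXPath("//div[@class='d_1']//li")) {
                items.add(((TagNode) match).getText().toString().trim());
            }
            return items;
        } catch (Exception e) {
            // evaluateXPath declares the checked XPatherException.
            throw new IllegalStateException("could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("title=" + title(SAMPLE));
        System.out.println("li=" + listItems(SAMPLE));
    }
}
```

Catching Exception broadly is a deliberate shortcut here so the sketch compiles whether or not the installed htmlcleaner version declares a checked exception on clean(String); production code would catch the specific types.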