Date: 2014-05-16

Parsing web resources with Jsoup

Jsoup is an HTML parser for Java. It can parse a page directly from a URL or from a string of HTML text.
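
As a small illustration of the second case, the sketch below parses an HTML string offline and reads the href and text of a link. The HTML fragment, the category id in it, and the class name ParseHtmlDemo are invented for this example; it assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseHtmlDemo {
    // Hypothetical HTML fragment, made up for illustration only.
    static final String HTML =
            "<div id=\"booksort\"><ul><li>"
            + "<a href=\"/products/1713-3258-000.html\">Fiction</a>"
            + "</li></ul></div>";

    // Parses the HTML text and returns the first link inside #booksort,
    // or null if no such link exists.
    static Element firstLink(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("#booksort a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink(HTML);
        if (link != null) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```

Parsing from a URL works the same way once the Document has been obtained; only the source of the HTML differs.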

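The scenario is as follows:

1. Fetch the book categories from JD.

2. Store them in a map, with the category id as the key and the category name as the value.

The code is as follows: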

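import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// LOG is assumed to be a logger field of the enclosing class.
// Returns a map of category id -> category name; empty if the request or parsing fails.
private static Map<String, String> getWareCategory() {
		Connection conn = Jsoup.connect(JDConstants.CATEGORY_URL_FORMAT).userAgent(
		        JDConstants.MOZILLA_AGENT).timeout(JDConstants.TIME_OUT);
		Map<String, String> categoryMap = new HashMap<String, String>();
		Document document = null;
		try {
			Connection.Response response = conn.execute();
			if (response.statusCode() != JDConstants.HTTP_OK_CODE) {
				return categoryMap;
			}
			// Reuse the response already fetched; conn.get() would send a second request.
			document = response.parse();
			// Guard against a changed page layout: first() returns null when nothing matches.
			Element bookSort = document.select("div.left").select("#booksort").first();
			Element list = bookSort == null ? null : bookSort.select("div.mc ul").first();
			if (list == null) {
				LOG.error("getCategory: category list not found");
				return categoryMap;
			}
			for (Element e : list.select("li")) {
				String url = e.select("a").attr("href");
				String name = e.select("a").text();
				// The id is the middle part of an href shaped like "1713-3269-000.html".
				// Entries without a recognizable id are skipped rather than stored under "".
				String[] parts = url.split("-");
				if (parts.length == 3) {
					categoryMap.put(parts[1], name);
				}
			}
		} catch (Exception e) {
			LOG.error("getCategory response:" + document);
			LOG.error("getCategory error:" + e.getMessage());
		}
		LOG.info("***********categoryMap:" + categoryMap);
		return categoryMap;
	}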
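
The id extraction inside the loop can be tried on its own without any network access. The following is a minimal sketch of that step only; the class name CategoryIdDemo and the method name extractCategoryId are hypothetical and do not appear in the original code.

```java
public class CategoryIdDemo {
    // Returns the middle segment of a URL that splits into exactly three
    // parts on '-', or "" when the URL is null or has a different shape.
    static String extractCategoryId(String url) {
        if (url == null) {
            return "";
        }
        String[] parts = url.split("-");
        return parts.length == 3 ? parts[1] : "";
    }

    public static void main(String[] args) {
        // prints 3269
        System.out.println(extractCategoryId("http://www.360buy.com/products/1713-3269-000.html"));
        // prints true, because this URL contains no '-' separators
        System.out.println(extractCategoryId("http://www.360buy.com/").isEmpty());
    }
}
```

Splitting on '-' is fragile: a URL with a hyphen elsewhere (for example in the host name) would no longer yield three parts, and the method would return an empty string.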

The constants used above are as follows:

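public abstract class JDConstants {
	// Request timeout in milliseconds (30 minutes)
	public static final int TIME_OUT = 1000 * 60 * 30;
	// Value sent as the User-Agent header
	public static final String MOZILLA_AGENT = "Mozilla";
	// HTTP status code of a successful response
	public static final int HTTP_OK_CODE = 200;
	// URL of the page that lists the book categories
	public static final String CATEGORY_URL_FORMAT = "http://www.360buy.com/products/1713-3269-000.html";
}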

Evaluation:

Jsoup is very convenient to use.