Jsoup is an HTML parser for Java. It can parse a page directly from a URL or from a string of HTML text.
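The scenario is as follows:
1. Fetch the book categories from JD.com (京东)
2. Save them to a map, with the category id as the key and the category name as the value
The code is as follows: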
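// Needs org.jsoup.Connection, org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element, org.jsoup.select.Elements, java.util.HashMap and java.util.Map.
// StringUtils and LOG belong to the surrounding class, which the post does not show.
private static Map<String, String> getWareCategory() {
    Connection conn = Jsoup.connect(JDConstants.CATEGORY_URL_FORMAT)
            .userAgent(JDConstants.MOZILLA_AGENT)
            .timeout(JDConstants.TIME_OUT);
    Map<String, String> categoryMap = new HashMap<String, String>();
    Document document = null;
    try {
        Connection.Response response = conn.execute();
        int statusCode = response.statusCode();
        if (statusCode != JDConstants.HTTP_OK_CODE) {
            return categoryMap;
        }
        // Parse the response that was already fetched. Calling conn.get() here
        // would send a second request for the same page.
        document = response.parse();
        // Category links live in the <li> items of the first list inside #booksort.
        Elements tmp = document.select("div.left").select("#booksort").first()
                .select("div.mc ul").first().select("li");
        for (int i = 0; i < tmp.size(); i++) {
            Element e = tmp.get(i);
            String url = e.select("a").attr("href");
            String name = e.select("a").text();
            // The category id is the middle segment of a URL such as .../1713-3269-000.html
            String[] parts = StringUtils.isNotEmpty(url) ? url.split("-") : new String[0];
            String categoryId = parts.length == 3 ? parts[1] : "";
            categoryMap.put(categoryId, name);
        }
    } catch (Exception e) {
        // A missing #booksort or list makes first() return null; that case also ends up here.
        LOG.error("getCategory response:" + document);
        LOG.error("getCategory error:" + e.getMessage());
    }
    LOG.info("***********categoryMap:" + categoryMap);
    return categoryMap;
}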
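The id extraction in the loop above can be tried on its own, without a network call. The following is only a minimal sketch using plain Java: the class name CategoryIdSketch, both method names, and the sample links are hypothetical and not part of the original code. Unlike the method above, it skips links whose URL does not yield an id, so that they do not all end up under the empty-string key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CategoryIdSketch {

    // Returns the middle segment of a URL shaped like ".../1713-3269-000.html",
    // or an empty string when the URL does not split into exactly three parts.
    public static String extractCategoryId(String url) {
        if (url == null || url.isEmpty()) {
            return "";
        }
        String[] parts = url.split("-");
        return parts.length == 3 ? parts[1] : "";
    }

    // Builds an id -> name map from {href, name} pairs.
    // Pairs whose href yields no id are skipped (an assumption of this sketch).
    public static Map<String, String> toCategoryMap(String[][] links) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] link : links) {
            String id = extractCategoryId(link[0]);
            if (!id.isEmpty()) {
                result.put(id, link[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; real hrefs and names come from the parsed page.
        String[][] links = {
            {"http://www.360buy.com/products/1713-3258-000.html", "Fiction"},
            {"http://www.360buy.com/help.html", "Help"}
        };
        System.out.println(toCategoryMap(links));
    }
}
```

Because the split is on every "-", a URL with a hyphen anywhere else (for example in the host name) would produce more than three parts and be treated as having no id.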
The constants it uses are defined as follows:
public abstract class JDConstants {
    // Timeout in milliseconds (30 minutes)
    public static final int TIME_OUT = 1000 * 60 * 30;
    public static final String MOZILLA_AGENT = "Mozilla";
    public static final int HTTP_OK_CODE = 200;
    public static final String CATEGORY_URL_FORMAT = "http://www.360buy.com/products/1713-3269-000.html";
}
Verdict:
Jsoup is very convenient to work with.