Anyone who parses HTML documents in Java has probably heard of the open-source htmlparser project. I have used it too, but it has not been updated since 2006. My fundamentals are weak, so I never really understood how to extend it with custom tags, and I kept running into timeout problems. Then I happened to come across jsoup, which had already reached version 1.7.2 and turned out to be easy to pick up. Here are some notes from using it:

"jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods."

Features: 1. parse an HTML document; 2. parse a body fragment.
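- import org.jsoup.Jsoup;
- import org.jsoup.nodes.Document;
- import org.jsoup.nodes.Element;
- String html = "<html><head><title>First parse</title></head>"
-   + "<body><p>Parsed HTML into a doc.</p></body></html>";
- Document doc = Jsoup.parse(html); // parse the document; doc.toString() returns it as text
- Element body = doc.body(); // get the body element; body.toString() returns it as text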
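The listing above parses a full document and then reads its body. For the second feature, parsing a bare body fragment, jsoup also offers `Jsoup.parseBodyFragment`. The following is only a minimal sketch, assuming jsoup is on the classpath; the class name `ParseSketch` and its two helper methods are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class ParseSketch {

    /** Parses a complete HTML document and returns the text of its <title>. */
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    /** Parses a body fragment and returns the text of its first <p>, or "" if there is none. */
    static String firstParagraph(String fragment) {
        // parseBodyFragment wraps the fragment in an empty document shell
        Document doc = Jsoup.parseBodyFragment(fragment);
        Element p = doc.body().select("p").first(); // null when nothing matches
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
        System.out.println(firstParagraph("<p>Just a fragment</p><p>second</p>"));
    }
}
```

The CSS-style `select("p")` call is the jQuery-like part of the API mentioned in the quote; any other selector could be used in its place.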
Ways to obtain a document: 1. load it from a local file; 2. fetch it from a URL.
- import java.io.File;
- /**
-  * Use the static Jsoup.parse(File in, String charsetName, String baseUri) method.
-  * The baseUri parameter is used to resolve relative URLs found in the file.
-  * If you do not need it, pass an empty string.
-  */
- File input = new File("/tmp/input.html");
- // throws IOException if the file cannot be read
- Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
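To make the role of `baseUri` concrete, here is a small sketch of what "resolving a relative URL against a base" means. It uses only `java.net.URI` from the standard library and is not a description of how jsoup implements this internally; the class name `BaseUriSketch` and the example links are assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
final class BaseUriSketch {

    /** Resolves a possibly relative link against the base URI of the document. */
    static String resolve(String baseUri, String link) {
        // URI.create throws IllegalArgumentException for malformed input (e.g. raw spaces)
        return URI.create(baseUri).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/";
        System.out.println(resolve(base, "images/logo.png"));            // relative path
        System.out.println(resolve(base, "/about.html"));                // root-relative path
        System.out.println(resolve(base, "http://other.example.org/x")); // already absolute
    }
}
```

With an empty base URI there is nothing to resolve against, so relative links in the file stay relative; that is the trade-off of passing an empty string.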
- /**
-  * Fetch the content directly from a URL; a timeout can be added. If get() does not work, try post().
-  * In real use I ran into errors such as 404, 405, and 504.
-  * Changing get to post fixed them, or sometimes the other way round.
-  * Once I understand why, I will come back and explain it properly.
-  */
- Document doc1 = Jsoup.connect("http://www.hao123.com/").get();
- String title = doc1.title(); // get the page title
- String content = doc1.toString(); // convert the page to text
- Document doc2 = Jsoup.connect("http://www.hao123.com")
-   .data("query", "Java") // request parameter
-   .userAgent("