Date: 2014-05-20

Java regular expressions: how to strip HTML comments such as <!--abc-->
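I want to strip all tags, scripts, and comments from the HTML below and keep only the actual text content. I have also posted the Java method I am using, but the result is not what I expected: only a few characters are left after stripping. After some digging I found that the cause is regEx_o = "<\\!--.*-->". Because there are <!----> comments both before and after <body></body>, the whole body gets cut out along with them. I tried writing it as regEx_o = "<\\!--[^(*-->*)]-->"; but that still has problems, so I can only ask the experts here for help!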
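For what it's worth, the behaviour described above looks like greedy matching: `.*` takes as much as it can, so the match runs from the first `<!--` to the last `-->` and swallows everything in between, body included. That assumes the pattern was compiled with `Pattern.DOTALL` or the page was read into a single line, since `.` does not cross line breaks by default. The second attempt does not help because `[^...]` is a character class and matches exactly one character; it cannot express "anything except the sequence `-->`". A minimal sketch of one possible fix, using the reluctant quantifier `.*?` so that each match stops at the nearest `-->`, is below. The class and method names (`CommentStripper`, `stripComments`) are made up for illustration and are not from the original post.

```java
import java.util.regex.Pattern;

public class CommentStripper {
    // Reluctant ".*?" stops at the first "-->" it can reach;
    // DOTALL lets "." match line breaks so multi-line comments are covered.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Hypothetical helper: removes every <!-- ... --> block from the input.
    public static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Comments before and after <body>, as in the page from the question.
        String html = "<!--[if !mso]>\n<style>x</style>\n<![endif]-->"
                + "<body>text</body>"
                + "<!--[if gte mso 9]><xml></xml><![endif]-->";
        System.out.println(stripComments(html)); // prints <body>text</body>
    }
}
```

Note that markers such as `<![if !vml]>` and `<![endif]>` in Word-exported pages are not `<!-- -->` comments, so this pattern leaves them alone; they would have to be caught by whatever expression strips ordinary tags.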
Contents of the HTML page:
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=gb2312">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 12">
<meta name=Originator content="Microsoft Word 12">
<link rel=File-List href="54-01-01_3_2.files/filelist.xml">
<link rel=Edit-Time-Data href="54-01-01_3_2.files/editdata.mso">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<title>气发〔2001〕×号</title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>admin</o:Author>
 </o:DocumentProperties>
</xml><![endif]-->
<link rel=themeData href="54-01-01_3_2.files/themedata.thmx">
<link rel=colorSchemeMapping href="54-01-01_3_2.files/colorschememapping.xml">
<!--[if gte mso 9]><xml>
 <w:WordDocument>
</xml><![endif]-->
<style>
<!--
 /* Font Definitions */
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
 /* Style Definitions */
</style>
<![endif]--><!--[if gte mso 9]><xml>
 </o:shapelayout></xml><![endif]-->
</head>

<body lang=ZH-CN style='tab-interval:21.0pt;text-justify-trim:punctuation'>

<div class=Section1 style='layout-grid:30.8pt -.2pt;mso-layout-grid-char-alt:
-849'>

<p class=MsoNormal style='line-height:28.3pt;mso-line-height-rule:exactly'><!--[if gte vml 1]><v:line
 id="_x0000_s1029" style='position:absolute;left:0;text-align:left;z-index:-3;
 visibility:visible;mso-position-vertical-relative:page' from="-20.95pt,783.3pt"
 to="460.95pt,783.3pt" strokecolor="red" strokeweight="4.5pt">
 <v:stroke linestyle="thinThick"/>
</v:line><![endif]-->

</span><![endif]><!--[if gte vml 1]><v:line id="_x0000_s1027" style='position:absolute;
 left:0;text-align:left;z-index:-5;visibility:visible;
 mso-position-vertical-relative:page' from="-20.75pt,140.6pt" to="461.15pt,140.6pt"
 strokecolor="red" strokeweight="4.5pt">
 <v:stroke linestyle="thickThin"/>
</v:line><![endif]--><![if !vml]><span style='mso-ignore:vglayout;position:
absolute;z-index:-5;left:0px;margin-left:-31px;margin-top:184px;width:649px;
height:7px'><img width=649 height=7 src="54-01-01_3_2.files/image002.gif"
v:shapes="_x0000_s1027"></span><![endif]></p>

</span><![endif]></p>