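Help needed: filtering text with Java
I'm new to text processing.
The file I need to process is a log of products sold on Amazon.
A single product record has the format shown below. The whole log contains several hundred thousand such records (the file is about 1 GB).
I want to read this file in Java and keep only each record's ID (e.g. 15) and its corresponding group (e.g. Book),
then write each ID (e.g. 15) and its group (e.g. Book) to a new file.
I don't know how to go about this. Any guidance would be much appreciated.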
Id: 15
ASIN: 1559362022
title: Wake Up and Smell the Coffee
group: Book
salesrank: 518927
similar: 5 1559360968 1559361247 1559360828 1559361018 0743214552
categories: 3
|Books[283155]|Subjects[1000]|Literature & Fiction[17]|Drama[2159]|United States[2160]
|Books[283155]|Subjects[1000]|Arts & Photography[1]|Performing Arts[521000]|Theater[2154]|General[2218]
|Books[283155]|Subjects[1000]|Literature & Fiction[17]|Authors, A-Z[70021]|( B )[70023]|Bogosian, Eric[70116]
reviews: total: 8 downloaded: 8 avg rating: 4
2002-5-13 cutomer: A2IGOA66Y6O8TQ rating: 5 votes: 3 helpful: 2
2002-6-17 cutomer: A2OIN4AUH84KNE rating: 5 votes: 2 helpful: 1
2003-1-2 cutomer: A2HN382JNT1CIU rating: 1 votes: 6 helpful: 1
2003-6-7 cutomer: A2FDJ79LDU4O18 rating: 4 votes: 1 helpful: 1
2003-6-27 cutomer: A39QMV9ZKRJXO5 rating: 4 votes: 1 helpful: 1
2004-2-17 cutomer: AUUVMSTQ1TXDI rating: 1 votes: 2 helpful: 0
2004-2-24 cutomer: A2C5K0QTLL9UAT rating: 5 votes: 2 helpful: 2
2004-10-13 cutomer: A5XYF0Z3UH4HB rating: 5 votes: 1 helpful: 1
------Solution--------------------Use Pattern and Matcher to find the lines you want, then write them to a new file.
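A minimal sketch of what this suggestion might look like for a single line. The class name `FieldMatcher` and the method `valueOf` are made up for illustration and do not come from the thread; the regular expression assumes the field name is followed by a colon and a non-empty value, as in the sample record.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldMatcher {
    // Matches a whole line such as "Id: 15" or "  group: Book".
    // Group 1 is the field name, group 2 is the value without surrounding whitespace.
    private static final Pattern FIELD =
            Pattern.compile("^\\s*(Id|group):\\s*(.+?)\\s*$");

    // Returns the value if the line is an Id or group line, otherwise null.
    static String valueOf(String line) {
        Matcher m = FIELD.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOf("Id:   15"));
        System.out.println(valueOf("  group: Book"));
        System.out.println(valueOf("  title: Wake Up and Smell the Coffee"));
    }
}
```

Compiling the pattern once and reusing it for every line matters for a file of this size, since `Pattern.compile` is comparatively expensive.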
------Solution--------------------Combine regular expressions with some of the String class's methods.
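One possible reading of this suggestion, sketched with `String.trim`, `String.matches` and `String.split` only. The class name `RegexWithStringMethods` and the method `valueOf` are hypothetical, and the expression assumes every wanted line has a non-empty value after the colon.

```java
class RegexWithStringMethods {
    // Returns the value of an "Id:" or "group:" line, or null for any other line.
    static String valueOf(String line) {
        String trimmed = line.trim();
        // String.matches tests the whole string against the regular expression.
        if (!trimmed.matches("(Id|group):\\s*\\S.*")) {
            return null;
        }
        // Limit 2 splits only at the first colon, so a value containing ':' stays intact.
        return trimmed.split(":\\s*", 2)[1];
    }

    public static void main(String[] args) {
        System.out.println(valueOf("Id: 15"));
        System.out.println(valueOf("  group: Book"));
        System.out.println(valueOf("  salesrank: 518927"));
    }
}
```

`String.matches` recompiles the expression on every call, so for a 1 GB file a precompiled `Pattern` is likely to be the faster choice; this form is just shorter to write.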
------Solution--------------------I'm not sure whether every ID in your log file has a corresponding group. If it does, suppose the source data file looks like this:
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
Id: 3
ASIN: 0486287785
title: World War II Allied Fighter Planes Trading Cards
group: Book
salesrank: 1270652
similar: 0
The rest of the content is omitted for brevity. The data is saved as data.txt on drive D. The program is then as follows:
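import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; the original post showed only the main method.
public class TextFilter {
    public static void main(String[] args) throws IOException {
        File inFile = new File("D:" + File.separator + "data.txt");
        File outFile = new File("D:" + File.separator + "data2.txt");
        // A line is kept only if, after trimming, it is exactly "Id: <word>" or "group: <word>".
        Pattern pattern = Pattern.compile("Id:\\s*\\w+|group:\\s*\\w+");
        // try-with-resources flushes and closes both streams, even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inFile)));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outFile)))) {
            String str;
            // Read one line at a time so the 1 GB file is never held in memory.
            while ((str = reader.readLine()) != null) {
                str = str.trim();
                Matcher matcher = pattern.matcher(str);
                if (matcher.matches()) {
                    // The value is the part after the colon.
                    String[] parts = str.split(":\\s*");
                    String value = parts[parts.length - 1];
                    if (str.startsWith("Id")) {
                        writer.write(value + "\t");   // ID, followed by a tab
                    } else {
                        writer.write(value + "\n");   // group ends the output line
                    }
                }
            }
        }
        // This assumes every Id line is followed by a group line before the next Id.
        System.out.println("Text filtering finished");
    }
}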
The result you want will be written to data2.txt.
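The answer above depends on every ID having a group. If some records turn out to lack a group line, a variant along the following lines might help. It is only a sketch under stated assumptions: the class name `IdGroupFilter`, the `filter` method and the command-line paths are hypothetical, a record without a group is written with an empty second column, and the file is decoded as ISO-8859-1 on the assumption that it uses a single-byte or ASCII-compatible encoding (only the ASCII Id and group fields are kept).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class IdGroupFilter {
    // Streams the input line by line and writes "id<TAB>group" for each record.
    // A record with no group line gets an empty group column (an assumption).
    static void filter(BufferedReader in, Writer out) throws IOException {
        String pendingId = null;   // Id seen, group not yet seen
        String line;
        while ((line = in.readLine()) != null) {
            String t = line.trim();
            if (t.startsWith("Id:")) {
                if (pendingId != null) {
                    out.write(pendingId + "\t\n");   // previous record had no group
                }
                pendingId = t.substring("Id:".length()).trim();
            } else if (t.startsWith("group:") && pendingId != null) {
                out.write(pendingId + "\t" + t.substring("group:".length()).trim() + "\n");
                pendingId = null;
            }
        }
        if (pendingId != null) {
            out.write(pendingId + "\t\n");           // last record had no group
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java IdGroupFilter <input file> <output file>");
            return;
        }
        // ISO-8859-1 decodes any byte sequence without error (encoding is an assumption).
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.ISO_8859_1)) {
            filter(in, out);
        }
    }
}
```

Taking the whole remainder of the line as the group, rather than matching `\w+`, also keeps group names that contain a space, should the data contain any.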