Date: 2014-05-16  Views: 20577
Database table description:

Table Name: NewsFromWeb
Column Name 1: _id
Column Name 2: url
Column Name 3: title
Other Columns ....
Premise: if the table contains records with the same url, those records are considered duplicates.

Approach: run a Group-by-style query to find the urls that have more than one record, build a list of them, and then delete every record after the first, keeping only the first one.

In SQL terms the grouping step looks like: select url, count(url) as urlCount from NewsFromWeb group by url.
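As an illustrative aside, the group-by count above can be mirrored with plain Java streams; the `UrlCount` class and `countByUrl` helper below are hypothetical, not part of the MongoDB code:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class UrlCount {
    // Count how many records share each url, mirroring
    // "select url, count(url) as urlCount ... group by url".
    public static Map<String, Long> countByUrl(List<String> urls) {
        return urls.stream()
                   .collect(Collectors.groupingBy(Function.identity(),
                                                  Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            countByUrl(List.of("http://a", "http://b", "http://a"));
        System.out.println(counts.get("http://a")); // prints 2
    }
}
```

Any url whose count is greater than 1 is a duplicate candidate.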
The implementation is as follows:
@Test
public void checkRepeat() {
    // 1. Group by url and count how many records share each url.
    BasicDBObject key = new BasicDBObject();
    key.put("url", true);
    BasicDBObject initial = new BasicDBObject();
    initial.put("urlCount", 0);
    String reduce = "function(obj, prev) { prev.urlCount++; }";
    DBObject objMap = mongoDAO.getCollection()
                              .group(key, new BasicDBObject(), initial, reduce);

    // 2. For every url that appears more than once, keep the first record
    //    and remove the rest.
    Set<String> keys = objMap.keySet();
    Iterator<String> it = keys.iterator();
    while (it.hasNext()) {
        String strKey = it.next();
        String recordStr = objMap.get(strKey).toString();
        DBObject doc = (DBObject) JSON.parse(recordStr);
        if (Double.valueOf(doc.get("urlCount").toString()) > 1) {
            BasicDBObject cond = new BasicDBObject();
            cond.put("url", doc.get("url").toString());
            DBCursor cursor = mongoDAO.getCollection().find(cond);
            int j = 0;
            while (cursor.hasNext()) {
                DBObject obj = cursor.next();
                if (j > 0) { // skip the first match, delete the rest
                    System.out.println("Deleting: " + obj);
                    mongoDAO.getCollection().remove(obj);
                }
                j++;
            }
        }
    }
}
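The keep-the-first, delete-the-rest logic in the loop above can be sketched in isolation, without a MongoDB connection. The `UrlDedup` class below is a hypothetical in-memory model of that pass, not the actual database code:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UrlDedup {
    // Records are modeled as maps with a "url" field. The first record seen
    // for each url is kept; every later record with the same url is dropped,
    // matching the j > 0 deletion branch in the MongoDB loop.
    public static List<Map<String, String>> dedupByUrl(List<Map<String, String>> docs) {
        Set<String> seen = new LinkedHashSet<>();     // urls already kept
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (seen.add(doc.get("url"))) {           // add() is false for a repeat url
                kept.add(doc);                        // first occurrence: keep it
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("url", "http://a", "title", "first"),
            Map.of("url", "http://b", "title", "second"),
            Map.of("url", "http://a", "title", "duplicate of first"));
        System.out.println(dedupByUrl(docs).size()); // prints 2
    }
}
```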
In my own test, processing 100,000 records took roughly 2 minutes. That is only a ballpark figure, since the time depends heavily on the hardware and on how many duplicates there are, but the speed feels acceptable (an old Celeron 1.6 CPU, Windows environment).