Date: 2014-05-16

MongoDB operations notes
Some problems encountered in day-to-day use of MongoDB are recorded here.

1. After setting up replica sets, a secondary node gets stuck in the RECOVERING state
The official explanation:

You don't need to repair, simply perform a full resync.

On the secondary, you can:

    stop the failed mongod
    delete all data in the dbpath (including subdirectories)
    restart it and it will automatically resynchronize itself

Follow the instructions here.

What's happened in your case is that your secondaries have become stale, i.e. there is no common point in their oplog and that of the oplog on the primary. Look at this document, which details the various statuses. The writes to the primary member have to be replicated to the secondaries and your secondaries couldn't keep up until they eventually went stale. You will need to consider resizing your oplog.

Regarding oplog size, it depends on how much data you insert/update over time. I would choose a size which allows you many hours or even days of oplog.

Additionally, I'm not sure which O/S you are running. However, for 64-bit Linux, Solaris, and FreeBSD systems, MongoDB will allocate 5% of the available free disk space to the oplog. If this amount is smaller than a gigabyte, then MongoDB will allocate 1 gigabyte of space. For 64-bit OS X systems, MongoDB allocates 183 megabytes of space to the oplog and for 32-bit systems, MongoDB allocates about 48 megabytes of space to the oplog.
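As a rough illustration of the defaults quoted above, the following Python sketch computes the oplog size from the platform and the free disk space. The function name `default_oplog_bytes`, the platform labels, and the choice of 1 GB = 1024^3 bytes are assumptions made for this example; the sketch encodes only the rules stated in the paragraph above and is not taken from MongoDB itself.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def default_oplog_bytes(platform, free_disk_bytes):
    """Return the default oplog size per the rules quoted above (hypothetical helper)."""
    if platform == "32-bit":
        return 48 * MB           # "about 48 megabytes" on 32-bit systems
    if platform == "osx-64":
        return 183 * MB          # 64-bit OS X
    # 64-bit Linux, Solaris, FreeBSD: 5% of free disk space, but at least 1 GB
    return max(free_disk_bytes * 5 // 100, 1 * GB)


if __name__ == "__main__":
    # Example: a 64-bit Linux host with 100 GB of free disk
    print(default_oplog_bytes("linux-64", 100 * GB) / GB, "GB")
```

Comparing this figure against the actual write rate shows quickly whether the default is large enough or whether the oplog should be resized.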

How big are records and how many do you want? It depends on whether this data insertion is something typical or something abnormal that you were merely testing.

For example, at 2000 documents per second for documents of 1KB, that would net you 120MB per minute and your 5GB oplog would last about 40 minutes. This means if the secondary ever goes offline for 40 minutes or falls behind by more than that, then you are stale and have to do a full resync.
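The arithmetic in this example can be reproduced with a short Python sketch. The function name `oplog_window_minutes` is hypothetical, and decimal units (1 MB = 1000 KB, 1 GB = 1000 MB) are assumed so that the numbers match the 120MB-per-minute figure above. It also treats each 1KB document as 1KB of oplog, which is a simplification.

```python
def oplog_window_minutes(docs_per_second, doc_size_kb, oplog_size_gb):
    """Estimate how many minutes of writes fit in the oplog (hypothetical helper)."""
    # Write volume per minute in MB, assuming 1 MB = 1000 KB
    mb_per_minute = docs_per_second * doc_size_kb * 60 / 1000
    # Oplog capacity in MB, assuming 1 GB = 1000 MB
    return oplog_size_gb * 1000 / mb_per_minute


minutes = oplog_window_minutes(2000, 1, 5)
print(f"{minutes:.1f} minutes")  # 41.7 minutes, i.e. "about 40 minutes"
```

Plugging in the real insert rate and document size gives a first estimate of how long a secondary can be offline before it goes stale.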

I recommend reading the Replica Set Internals document here. You have 4 members in your replica set, which is not recommended. You should have an odd number for the voting election (of primary) process, so you either need to add an arbiter, another secondary or remove one of your secondaries.

Finally, here's a detailed document on RS administration.

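In other words, the repair takes three steps. The main cause of the problem is that the secondary cannot replicate the oplog as fast as the primary writes it, so it stays in the RECOVERING state. The fix is to stop the mongod process, delete all data under the data directory, and restart the mongod process. One thing to note: the arbiter's mongod process must be stopped as well. When starting back up, start the replSet mongod processes first and the arbiter's mongod process afterwards. Once the resync finishes, the node switches from RECOVERING to SECONDARY automatically.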
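The "delete all data in the dbpath" step can be sketched in Python as follows. This is a minimal sketch, not an official procedure: the function name `clear_dbpath` is hypothetical, it assumes mongod on that node has already been stopped, and it must only be pointed at the dbpath of the stale secondary.

```python
import shutil
from pathlib import Path


def clear_dbpath(dbpath):
    """Delete everything inside dbpath, including subdirectories, but keep dbpath itself."""
    root = Path(dbpath)
    removed = []
    # Materialize the listing first so entries are not deleted while iterating
    for entry in list(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # subdirectory: remove recursively
        else:
            entry.unlink()         # regular file or symlink
        removed.append(entry.name)
    return sorted(removed)
```

After the directory is empty, restarting mongod with the same replSet options triggers the automatic resynchronization described above.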


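2. The log file under dbpath grows too large
The mongodb.log file in the log directory under dbpath keeps growing over time, and MongoDB itself does nothing to limit its size, so we have to manage it ourselves. The way to do it: switch to the admin database (use admin) and run db.runCommand("logRotate"). MongoDB renames the current mongodb.log to a name of the form mongodb.log.2012-10-10T01-11-01 and creates a new mongodb.log to write to. The rotated files can then be archived or deleted, which solves the problem of the log taking up too much disk space.
Alternatively, you can stop the mongod process, delete mongodb.log, and restart the process, but the service is unavailable for a while.
Official documentation: http://www.mongodb.org/display/DOCS/Logging#Logging-Rotatingthelogfiles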
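Rotation by itself only renames the file, so the rotated copies still need to be cleaned up. The following Python sketch picks out rotated logs older than a retention period, based on the mongodb.log.2012-10-10T01-11-01 naming shown above. The function name `old_rotated_logs` and the `keep_days` parameter are hypothetical, and the sketch assumes the log file is called mongodb.log as in this article.

```python
import re
from datetime import datetime, timedelta

# Matches names such as mongodb.log.2012-10-10T01-11-01
ROTATED = re.compile(r"^mongodb\.log\.(\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})$")


def old_rotated_logs(filenames, now, keep_days):
    """Return rotated log names whose timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    old = []
    for name in filenames:
        match = ROTATED.match(name)
        if match is None:
            continue  # skips the live mongodb.log and unrelated files
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H-%M-%S")
        if stamp < cutoff:
            old.append(name)
    return sorted(old)
```

The returned names could be passed to a compression or deletion step in a scheduled job; the live mongodb.log is never selected because it carries no timestamp suffix.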


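3. Too many data files under dbpath trigger disk space alerts
MongoDB stores data across multiple files. A namespace called $freelist records extents that are no longer in use (from dropped collections or indexes). So if a collection is never dropped and data is only deleted with remove, the disk becomes fragmented until the drive fills up. There are several solutions. The first is to use a capped collection. The second is to run repairDatabase. The third is to export the db or collection with mongodump, drop the db or collection, and then restore the data with mongorestore. The second and third methods both take a long time; in fact, the second method also frees disk space by exporting and re-importing the data.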
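The third method can be laid out as three commands. The Python sketch below only builds and prints them and does not run anything. The function name `dump_drop_restore_plan`, the database name mydb, and the path /backup/dump are hypothetical, and the sketch assumes the legacy mongo shell and the mongodump/mongorestore tools of that era are on the PATH.

```python
import shlex


def dump_drop_restore_plan(db_name, dump_dir):
    """Build (but do not run) the dump, drop, and restore commands for one database."""
    return [
        # 1. Export the database; mongodump writes it to <dump_dir>/<db_name>
        ["mongodump", "--db", db_name, "--out", dump_dir],
        # 2. Drop the database so its data files are released
        ["mongo", db_name, "--eval", "db.dropDatabase()"],
        # 3. Restore the database from the dump directory
        ["mongorestore", "--db", db_name, f"{dump_dir}/{db_name}"],
    ]


if __name__ == "__main__":
    for argv in dump_drop_restore_plan("mydb", "/backup/dump"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

It is worth checking that the dump completed successfully before running the drop step, since the dump is the only copy of the data between steps 2 and 3.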