日期:2014-05-16 浏览次数:21704 次
最近,线上一些内存比较内存占用敏感的应用,在访问峰值的时候,偶尔会被kill掉,导致服务重启。这是Linux一个叫out-of-memory kiiler的机制:
http://linux-mm.org/OOM_Killer
oom kiiler会在内存紧张的时候,会依次kill内存占用较高的进程,发送Signal 15(SIGTERM)。并在/var/log/message中进行记录。里面会记录一些如pid,process name,cpu mask,trace等信息,通过监控可以发现类似问题。今天特意分析了一下oom killer相关的选择机制,挖了一下代码,感觉该机制简单粗暴,不过效果还是挺明显的,给大家分享出来。
#define block (1024L*1024L*MB) #define MB 64L unsigned long total = 0L; for(;;) { // malloc big block memory and ZERO it !! char* mm = (char*) malloc(block); usleep(100000); if (NULL == mm) continue; bzero(mm,block); total += MB; fprintf(stdout,"alloc %lum mem\n",total); }
/proc/sys/vm/lowmem_reserve_ratio来查看当前low大小和阀值low大小。低于阀值时候才会触发oom killer,所以这里block的分配小雨默认的256M,否则如果每次申请512M(大于128M),malloc可能会被底层的brk这个syscall阻塞住,内核触发page cache回写或slab回收。
测试:
gcc big_mm.c -o big_mm ; ./big_mm & ./big_mm & ./big_mm &
(同时启动多个big_mm进程争抢内存)
启动后,部分big_mm被killed,在/var/log/message下tail -n 1000 | grep -i oom 看到:
Apr 18 16:56:16 v125000100.bja kernel: : [22254383.898423] Out of memory: Kill process 24894 (big_mm) score 277 or sacrifice child Apr 18 16:56:16 v125000100.bja kernel: : [22254383.899708] Killed process 24894, UID 55120, (big_mm) total-vm:2301932kB, anon-rss:2228452kB, file-rss:24kB Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738942] big_mm invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738947] big_mm cpuset=/ mems_allowed=0 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738950] Pid: 24893, comm: big_mm Not tainted 2.6.32-220.23.2.ali878.el6.x86_64 #1 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738952] Call Trace: Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738961] [<ffffffff810c35e1>] ? cpuset_print_task_mems_allowed+0x91/0xb0 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738968] [<ffffffff81114d70>] ? dump_header+0x90/0x1b0 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738973] [<ffffffff810e1b2e>] ? __delayacct_freepages_end+0x2e/0x30 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738979] [<ffffffff81213ffc>] ? security_real_capable_noaudit+0x3c/0x70 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738982] [<ffffffff811151fa>] ? oom_kill_process+0x8a/0x2c0 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738985] [<ffffffff81115131>] ? select_bad_process+0xe1/0x120 Apr 18 16:56:18 v125000100.bja kernel: : [22254386.738989] [<ffffffff811