Date: 2014-05-16

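The JOB Suddenly Stopped Working
  This article walks through the diagnosis of an Oracle job failure, covering the cause and the resolution, and uses the case to look inside Oracle's job scheduling and its internal timing mechanism.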
  
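  Problem and environment
  
  The development team reported that the database's scheduled jobs were not running as expected, which caused some operations to fail.
  
  I stepped in to handle the incident.
  
  System environment: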
  
  SunOS DB 5.8 Generic_108528-21 sun4u sparc SUNW,Ultra-4
  Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
  
  Resolution process
  
  The first step was to check the database jobs:
  
  $ sqlplus "/ as sysdba"
  SQL*Plus: Release 9.2.0.3.0 - Production on Wed Nov 17 20:23:53 2004
  Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
  Connected to:
  Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
  With the Partitioning, OLAP and Oracle Data Mining options
  JServer Release 9.2.0.3.0 - Production
  SQL> select job,last_date,last_sec,next_date,next_sec,broken,failures,interval
  from dba_jobs;
         JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC B   FAILURES INTERVAL
  ---------- --------- -------- --------- -------- - ---------- --------------------------------------
          31 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N          0 trunc(sysdate+1)+1/24
          27 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N          0 TRUNC(SYSDATE) + 1
          35 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N          0 trunc(sysdate+1)+1/24
          29 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N          0 TRUNC(SYSDATE) + 1
          30 01-NOV-04 06:00:01 01-DEC-04 06:00:00 N          0 trunc(add_months(sysdate,1),'MM')+6/24
          65 16-NOV-04 04:00:03 17-NOV-04 04:00:00 N          0 trunc(sysdate+1)+4/24
          46 16-NOV-04 02:14:27 17-NOV-04 02:14:27 N          0 sysdate+1
          66 16-NOV-04 03:00:02 17-NOV-04 18:14:49 N          0 trunc(sysdate+1)+3/24
  8 rows selected.
  
  None of the jobs had run as scheduled. The earliest one should have executed at 17-NOV-04 01:00:00, but it never did.
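The same conclusion can be reached without reading every row by eye. The following is a minimal sketch, assuming a session that can read DBA_JOBS: it lists jobs that are not marked broken but whose NEXT_DATE is already in the past. The column aliases are illustrative only.

```sql
-- Jobs that are not broken yet are already overdue.
-- A healthy job queue should leave this list empty or nearly so.
select job,
       to_char(next_date, 'YYYY-MM-DD HH24:MI:SS') as due_at,
       round((sysdate - next_date) * 24, 1)        as hours_overdue,
       failures,
       what
from   dba_jobs
where  broken = 'N'
and    next_date < sysdate
order  by next_date;
```

In the situation described here, every job due on 17-NOV-04 would be expected to show up in this list, with zero failures and BROKEN = N, which suggests the problem lies with the scheduler rather than with the jobs themselves.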
  
  Create a test job
  
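  create or replace procedure pining
  is
  begin
    null;  -- no-op body: the test job only needs something harmless to call
  end;
  /
  variable jobno number;
  variable instno number;
  begin
    select instance_number into :instno from v$instance;
    -- first run about five minutes from now (1/288 day), truncated to the minute,
    -- then repeat on the same five-minute interval
    dbms_job.submit(:jobno, 'pining;', trunc(sysdate+1/288,'MI'),
                    'trunc(SYSDATE+1/288,''MI'')', TRUE, :instno);
    commit;  -- a submitted job is only visible to the job queue after commit
  end;
  /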
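The schedule of the test job relies on Oracle date arithmetic, where 1 is one day, so 1/288 of a day is 86400/288 = 300 seconds, or five minutes. As a quick check, the interval expression can be evaluated directly against DUAL; the column aliases below are illustrative only.

```sql
-- Evaluate the test job's interval expression by hand.
-- next_run should be about five minutes after now_time, with seconds truncated to :00.
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS')                       as now_time,
       to_char(trunc(sysdate + 1/288, 'MI'), 'YYYY-MM-DD HH24:MI:SS')  as next_run
from   dual;
```

A five-minute interval keeps the wait short, so it quickly becomes clear whether the job queue picks up the job at all.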
  
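  The test job behaved the same way: it did not run.
  
  Running it manually with dbms_job.run(<job>), however, worked without any problem.
  
  Recovery attempts
  
  Suspecting that the CJQ0 process had failed, the first step was to set JOB_QUEUE_PROCESSES to 0, which makes Oracle stop CJQ0 and the associated job processes: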
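A manual run of this kind might look like the sketch below, assuming the `:jobno` bind variable from the test job is still set in the same SQL*Plus session. DBMS_JOB.RUN executes the job in the current session, so it does not depend on the job queue processes, which is why it can succeed while scheduled execution is stuck.

```sql
-- Run the test job in the foreground session, bypassing the job queue.
exec dbms_job.run(:jobno);

-- LAST_DATE and LAST_SEC should now reflect the manual run,
-- and NEXT_DATE is recomputed from the job's interval.
select job, last_date, last_sec, next_date, next_sec, failures
from   dba_jobs
where  job = :jobno;
```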
  
  SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 0;
  
  Wait 2 to 3 minutes, then set it again:
  
  SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 5;
  
  At this point PMON restarts the CJQ0 process, as the alert log shows:
  
  Thu Nov 18 11:59:50 2004
  
  ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
  Thu Nov 18 12:01:30 2004
  ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
  Thu Nov 18 12:01:30 2004
  Restarting dead background process CJQ0
  CJQ0 started with pid=8
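Besides the alert log, the state of the coordinator can be checked from inside the database. This is a minimal sketch, assuming access to the V$ views; it is commonly stated that a PADDR of 00 in V$BGPROCESS means the background process is not currently running, so treat that interpretation as an assumption to verify on the version in use.

```sql
-- Current value of the parameter that controls the job queue.
select name, value
from   v$parameter
where  name = 'job_queue_processes';

-- Is the job queue coordinator attached to a process?
-- A PADDR of 00 is usually taken to mean CJQ0 is not running.
select name, description, paddr
from   v$bgprocess
where  name = 'CJQ0';
```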
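  However, the jobs still did not run, and on later changes to the parameter CJQ0 simply died; the alert log no longer shows it being restarted: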
  Thu Nov 18 13:52:05 2004
  ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
  Thu Nov 18 14:09:30 2004
  ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
  Thu Nov 18 14:10:27 2004
  ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
  Thu Nov 18 14:10:42 2004
  ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
  Thu Nov 18 14:31:07 2004
  ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
  Thu Nov 18 14:40:14 2004
  ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
  Thu Nov 18 14:40:28 2004
  ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
  Thu Nov 18 14:40:33 2004
  ALTER SYSTEM SET job_queue_processes=1 SCOPE=MEMORY;
  Thu Nov 18 14:40:40 2004
  ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
  Thu Nov 18 15:00:42 2004
  ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
  
  Thu Nov 18 15:01:36 2004
  ALTER SYSTEM SET job_queue_processes=15 SCOPE=MEMORY;
  
  The next step was to try restarting the database, which had to be done at night: