FastStartFailoverPmyShutdown参数测试
top测试环境
SQL> SELECT * FROM v$version; BANNER -------------------------------------------------------------------------------- Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production PL/SQL Release 11.1.0.7.0 - Production CORE 11.1.0.7.0 Production TNS for Linux: Version 11.1.0.7.0 - Production NLSRTL Version 11.1.0.7.0 - Production
top参数FastStartFailoverPmyShutdown的作用
在10G的时候,当Primary与Standby和Observer同时失去联系了之后(Primary被隔离了),它自己会自动关闭数据库,因为这个时候Primary并不知道这个Standby是否已经自动转换为了新的Primary,自动关闭目的是减少同时两个Primary存在所造成的数据不一致。这个行为在10G的时候是不可控的,从11G开始,Oracle提供了参数FastStartFailoverPmyShutdown来确定在Primary被隔离之后Primary应该采取什么操作,下面是文档中关于FastStartFailoverPmyShutdown的说明:
9.2.13 FastStartFailoverPmyShutdown
The FastStartFailoverPmyShutdown configuration property causes the primary database to shut down if Fast-start failover is enabled and V$DATABASE.FS_FAILOVER_STATUS indicates the primary has been STALLED for longer than FastStartFailoverThreshold seconds. A value of TRUE helps to ensure that an isolated primary database cannot satisfy user queries.
This property cannot be used to prevent the primary database from shutting down if a Fast-start failover occurred because a user configuration condition was detected or was requested by an application by calling the DBMS_DG.INITIATE_FS_FAILOVER function
第一句话说的是当FastStartFailoverPmyShutdown设置为true时,当Fast-start Failover启用了之后,如果Primary数据库的V$DATABASE.FS_FAILOVER_STATUS显示的状态为STALLED的时间超过FastStartFailoverThreshold所指定的时间之后,Primary将会自动的关闭。
FastStartFailoverThreshold比较好理解,关键是这里的STALLED状态是什么意思呢?关于V$DATABASE.FS_FAILOVER_STATUS的STALLED状态,文档在这里是这么描述的:
- STALLED
- Displays on the primary database after loss of connectivity to the target standby database and the change to the UNSYNCHRONIZED state (maximum availability mode) or to the TARGET OVER LAG LIMIT state (maximum performance mode) cannot be confirmed by either the target standby database or the observer. Note that the value of the FastStartFailoverPmyShutdown configuration property must be FALSE for the primary to stall indefinitely under these conditions. With a value of TRUE for this property, the primary will shut down after being stalled for the number of seconds specified by the FastStartFailoverThreshold property.
It shuts down or stalls because it is likely a failover has occurred.
Note: this state also occurs on the primary during startup when Fast-start failover is possible and neither the target standby database nor the observer are present to confirm it is okay to continue opening the database.
从描述可以看到,当Primary数据库与Standby数据库失去联系之后,如果Primary数据库不能得到Standby数据库或Observer数据的确认而将V$DATABASE.FS_FAILOVER_STATUS设置为UNSYNCHRONIZED状态(最大可用模式)或是TARGET OVER LAG LIMIT状态(最大性能模式)时,V$DATABASE.FS_FAILOVER_STATUS将会被设置为STALLED状态。
现在我们可以得到下面的结论:当参数FastStartFailoverPmyShutdown设置为true时,如果Primary数据库失去与Standby数据库联系之后,如果它不能通过Observer来确定V$DATABASE.FS_FAILOVER_STATUS的时间超过FastStartFailoverThreshold秒之后,Primary数据库会自动关闭,接下就来测试下。
top测试
测试之前,Fast-start Failover的设置如下:
DGMGRL> SHOW FAST_START FAILOVER; Fast-Start Failover: ENABLED Threshold: 60 seconds Target: orcl_a Observer: observer01.mycompany.com Lag Limit: 30 seconds (not in use) Shutdown Primary: TRUE Auto-reinstate: TRUE
当前Primary的Fast-start Failover的状态是SYNCHRONIZED:
SQL> set lines 145 SQL> col FS_FAILOVER_STATUS for a20 SQL> col FS_FAILOVER_CURRENT_TARGET for a20 SQL> col FS_FAILOVER_THRESHOLD for 99999 col FS_FAILOVER_OBSERVER_PRESENT for a10 SQL> SQL> col FS_FAILOVER_OBSERVER_HOST for a40 SQL> select FS_FAILOVER_STATUS,FS_FAILOVER_CURRENT_TARGET, 2 FS_FAILOVER_THRESHOLD,FS_FAILOVER_OBSERVER_PRESENT, 3 FS_FAILOVER_OBSERVER_HOST 4 from v$database; FS_FAILOVER_STATUS FS_FAILOVER_CURRENT_ FS_FAILOVER_THRESHOLD FS_FAILOVE FS_FAILOVER_OBSERVER_HOST -------------------- -------------------- --------------------- ---------- ---------------------------------------- UNSYNCHRONIZED orcl_b 60 YES observer01.mycompany.com
现在先等到Standby数据库上,发出shutdown abort命令,接着将Observer停止,然后通过查看Primary数据库的Fast-start Failover的状态变成了STALLED,
SQL> / FS_FAILOVER_STATUS FS_FAILOVER_CURRENT_ FS_FAILOVER_THRESHOLD FS_FAILOVE FS_FAILOVER_OBSERVER_HOST -------------------- -------------------- --------------------- ---------- ---------------------------------------- STALLED orcl_b 60 YES observer01.mycompany.com
再查看机器上面的drcorcl.log可以看到下面的内容:
DG 2010-03-12-06:34:39 0 2 0 FSFP: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:34:40 0 2 0 LGWR: FSFO SetState(2, 0x2) operation requires an ack DG 2010-03-12-06:34:40 0 2 0 Primary shutdown is possible if ack is not satisfied within 57 seconds DG 2010-03-12-06:34:54 0 2 0 FSFP: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:34:54 0 2 0 FSFP: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:34:55 0 2 0 LGWR: still awaiting FSFO ack after 15 seconds DG 2010-03-12-06:35:02 0 2 0 RSM0: HEALTH CHECK ERROR: ORA-16827: Flashback Database is disabled DG 2010-03-12-06:35:02 0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16817: unsynchronized Fast-start failover configuration DG 2010-03-12-06:35:02 0 2 304017975 Operation CTL_GET_STATUS continuing with warning, status = ORA-16817 DG 2010-03-12-06:35:02 0 2 304017975 Operation CTL_GET_STATUS continuing with warning, status = ORA-16817 DG 2010-03-12-06:35:03 0 2 0 NSV1: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:35:03 0 2 0 NSV1: Failed to send message to site orcl_b. Error code is ORA-12514. DG 2010-03-12-06:35:03 0 2 304017975 DMON: Database orcl_b returned ORA-12514 DG 2010-03-12-06:35:03 0 2 304017975 for opcode = CTL_GET_STATUS, phase = BEGIN, req_id = 1.1.304017975 DG 2010-03-12-06:35:09 0 2 0 FSFP: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:35:09 0 2 0 FSFP: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:35:11 0 2 0 LGWR: still awaiting FSFO ack after 30 seconds DG 2010-03-12-06:35:24 0 2 0 FSFP: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:35:24 0 2 0 FSFP: Failed to connect to remote database orcl_b. Error is ORA-12514 DG 2010-03-12-06:35:26 0 2 0 LGWR: still awaiting FSFO ack after 45 seconds DG 2010-03-12-06:35:38 0 2 0 LGWR: Notifying CRS to disable services and monitoring for Primary Shutdown on Failover DG 2010-03-12-06:35:38 0 2 0 LGWR: Primary has heard from neither observer nor target standby DG 2010-03-12-06:35:38 0 2 0 within FastStartFailoverThreshold seconds. It is DG 2010-03-12-06:35:38 0 2 0 likely an automatic failover has already occurred. DG 2010-03-12-06:35:38 0 2 0 The primary is shutting down. DG 2010-03-12-06:35:38 0 2 0 LGWR: FSFO storing state flags=0x40023, version=138, obid=0x716797a1 (1902614433), threshold=60, laglim=30, obslim=30 DG 2010-03-12-06:35:38 0 2 0 LGWR: Notifying CRS to teardown database
得到这些信息出现之后,再看看Primary数据库,已经关掉了。
SQL> / select FS_FAILOVER_STATUS,FS_FAILOVER_CURRENT_TARGET, * ERROR at line 1: ORA-03135: connection lost contact Process ID: 4262 Session ID: 534 Serial number: 1
这里在注意下日志文件中输出的倒数第二行
LGWR: FSFO storing state flags=0x40023, version=138, obid=0x716797a1 (1902614433), threshold=60, laglim=30, obslim=30
这里的threshold对应FastStartFailoverThreshold,laglim对应FastStartFailoverLagLimit,obslim没有找到对应的参数,但是应该值得是Observer和数据库通讯的时间间隔,说这个的原因是做这个实验并不一定总会成功,因为很有可能Observer在你准备停止之前已经应答了,所以做这个实验还是要手快点,当然如果有条件拔网线测试最理想了。
另外我们先停止Observer然后在短时间内关闭Standby也能得到同样的效果,就是Primary也会因为(基本)同时无法和Standby数据库和Observer通讯而自动关闭,这个各位可以自行测试。
top后记
上面说的FastStartFailoverPmyShutdown参数所控制的行为是我们可以测试到的,但实际中还碰到过在Fast-start启用的情况下,因为意外导致Standby数据库在Broker中显示为DISABLED状态,同时Observer也因为意外停止了,然后就是Primary数据库起来之后就会自动关闭,这里根本就没有什么通讯的时间限制,每次Primary数据库一启动就会自动关掉,最后只有使用DISABLE FAST_START FAILOVER FORCE强行停掉Fast-start Failover之后才停止。后来也因为没有模拟出类似的情形而未能找出根本原因。
作者:马齿苋 | 链接:http://www.dbabeta.com/2010/test-on-faststartfailoverpmyshutdown.html
‘FastStartFailoverPmyShutdown’属性在10.2.0.4以上版本就支持了, 不是到11g版本才支持的.