Arriving at my client's site on Monday, I found the local DBA with two people behind him, all concerned about something. He was performing a new failover of a critical database that had had problems during the weekend due to a firewall change. The timeline went roughly like this:
- Friday afternoon – firewall change
- Saturday morning – many syslog errors and the DB almost impossible to use (due to auditing at OS level)
- Saturday afternoon – failover of the database to secondary site + set syslogd to restart every 5 minutes
- Sunday morning – reinstate old primary as standby (firewall change was fixed/rolled back)
- Monday morning – failover of database to primary site + reinstate DB on secondary site
Everything looked fine again by the end of the morning. We were happy that the reinstate worked, even though it had to be done with sqlplus, as the broker was giving errors.
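For reference, a manual reinstate without the broker looks roughly like this (a sketch of the documented procedure; it requires flashback database to be enabled, and the SCN comes from the new primary):
-- On the new primary, find the SCN to flash the old primary back to:
SQL> select standby_became_primary_scn from v$database;
-- On the old primary, mounted:
SQL> flashback database to scn <standby_became_primary_scn>;
SQL> alter database convert to physical standby;
SQL> shutdown immediate
SQL> startup mount
SQL> alter database recover managed standby database disconnect from session;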
At Monday midday the users complained that they were receiving ORA-01555 for some queries:
Mon Aug 20 11:21:35 2018
ORA-01555 caused by SQL statement below (SQL ID: cg61q9avtk3ta, Query Duration=1 sec, SCN: 0x01a9.3ae6da9e):
delete from child_table where parent_id = :parent and child_id in (:child1)
We check the alert log and see the ORA-01555 on two queries filtering on the same key-value pair. My first thought: probably a bad plan was generated and undo usage exploded. But the undo tablespace is not even half used. And the query duration reported in the alert log message is 1 second.
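Checking undo usage is quick; something along these lines (a sketch using DBA_UNDO_EXTENTS; v$undostat would also do):
SQL> select tablespace_name, status, round(sum(bytes)/1024/1024) as mb
       from dba_undo_extents
      group by tablespace_name, status
      order by tablespace_name, status;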
This is strange. Even stranger: a short while later we also get an ORA-00600 in the alert log:
Mon Aug 20 16:16:54 2018
Errors in file /u00/app/oracle/diag/rdbms/db1_site1/DB1/trace/DB1_ora_13217.trc (incident=1543593):
ORA-00600: internal error code, arguments: [ktbdchk1: bad dscn], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u00/app/oracle/diag/rdbms/db1_site1/DB1/incident/incdir_1543593/DB1_ora_13217_i1543593.trc
This is a case for Metalink. And there we find the reason:
ALERT Bug 22241601 ORA-600 [kdsgrp1] / ORA-1555 / ORA-600 [ktbdchk1: bad dscn] / ORA-600 [2663] due to Invalid Commit SCN in INDEX (Doc ID 1608167.1)
Index corruption, nice. It could be worse: the workaround is just to recreate the problematic indexes.
OK, then we should run DB Verify to validate all indexes. There is a relatively new option to run DBV on specific segments, which is great, and DBV was even enhanced to find this specific corruption:
Bug 7517208 – DBV enhanced to identify Logical SCN Block corruptions (Doc ID 7517208.8)
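Running DBV on a single segment looks like this (a sketch; the segment_id triplet is tablespace_id.file.block taken from sys.sys_dba_segs, and the values below are just an illustration):
-- Find the segment identifier (tsn.file.block) of the index:
SQL> select tablespace_id, header_file, header_block
       from sys.sys_dba_segs
      where owner = 'USER1' and segment_name = 'IX_ABC';
-- Run DBV on that single segment (USERID is required with SEGMENT_ID):
$ dbv userid=system/*** segment_id=4.12.1234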
But trying to do it on datafiles on ASM is not easy; it does not seem to work… until we discover:
Bug 18353157 – DBVERIFY (DBV) does not show SCN mismatch for datafiles in ASM (Doc ID 18353157.8)
The next step is to try with a simple ANALYZE INDEX. After some minutes the alert log starts showing a lot of:
Tue Aug 21 10:36:43 2018
ORA-01555 caused by SQL statement below (SQL ID: 6c3gh913k55pr, SCN: 0x01a9.3b4ea9e1):
ANALYZE INDEX USER1.IX_ABC VALIDATE STRUCTURE ONLINE
ORA-01555 caused by SQL statement below (SQL ID: 6c3gh913k55pr, SCN: 0x01a9.3b4ea9ee):
ANALYZE INDEX USER1.IX_ABC VALIDATE STRUCTURE ONLINE
…
Unfortunately the errors do not show up in sqlplus, so we need to keep a tail -f on the alert log, keep our eyes open, and then cancel the ANALYZE INDEX.
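Something like this does the trick (the alert log path assumes the usual alert_<SID>.log naming in the trace directory seen above; adapt as needed):
$ tail -f /u00/app/oracle/diag/rdbms/db1_site1/DB1/trace/alert_DB1.log | grep --line-buffered ORA-01555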
As there seemed to be corruption in several indexes of USER1, we decided it was easier to directly rebuild all of them, generating the statements with a script as sketched below. The index rebuilds need to be ONLINE. Not because we can't support locks on the application tables, but because only an ONLINE rebuild does a full table scan to recreate the index. A normal rebuild reads the index itself, which is not good when it is corrupt.
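Generating the rebuild statements is a simple spool (a sketch; filter on index_type as needed and review the spooled file before running it):
SQL> set pages 0 lines 200
SQL> spool /tmp/rebuild_user1_indexes.sql
SQL> select 'ALTER INDEX '||owner||'.'||index_name||' REBUILD ONLINE;'
       from dba_indexes
      where owner = 'USER1';
SQL> spool off
$ vi /tmp/rebuild_user1_indexes.sql
SQL> @/tmp/rebuild_user1_indexes.sql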
One mistake we made at some point was to ANALYZE the primary key of one table while starting an INDEX REBUILD ONLINE of another index on the same table. The ANALYZE and the INDEX REBUILD were locking each other (or advancing very slowly), so it is better not to run both at the same time.
When we tried to rebuild online the primary key, we got the error:
ORA-08108: may not build or rebuild this type of index online
First we thought it was due to the fact that it is a primary key, but soon we stumbled upon Jonathan Lewis' blog entry about this: https://jonathanlewis.wordpress.com/2012/09/04/online-rebuild-2/
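A quick look at DBA_CONSTRAINTS confirms the situation (a sketch using our example names):
SQL> select constraint_name, deferrable, deferred
       from dba_constraints
      where owner = 'USER1' and constraint_name = 'PK_CHILD_TABLE';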
Our primary key is DEFERRABLE! Oops! The only option now is to disable the constraint, rebuild the index and enable it again. But this is not possible with the application running, so it implies an emergency change and quite a few actions. Here is the technical summary on this Oracle Restart environment:
-- Stop service:
$ srvctl stop service -d db1_site1 -s db1_service_rw
$ srvctl status service -d db1_site1
-- Kill sessions using service and/or from users:
SQL> set pages 0
SQL> spool /tmp/to_kill.sql
SQL> select 'alter system disconnect session '''||sid||','||serial#||''' immediate;' from v$session where service_name='DB1_SERVICE_RW';
SQL> spool off
$ vi /tmp/to_kill.sql
SQL> @/tmp/to_kill.sql
-- Confirm no other sessions:
SQL> select username, program,machine from v$session where service_name='DB1_SERVICE_RW';
-- Recreate index:
SET TIME ON
SET TIMING ON
ALTER SYSTEM SET job_queue_processes=0;
ALTER SESSION ENABLE PARALLEL DDL;
ALTER TABLE user1.child_table DISABLE CONSTRAINT pk_child_table KEEP INDEX;
ALTER INDEX user1.pk_child_table REBUILD ONLINE PARALLEL 12;
ALTER TABLE user1.child_table ENABLE CONSTRAINT pk_child_table;
ALTER INDEX user1.pk_child_table NOPARALLEL;
ALTER SYSTEM SET job_queue_processes=10;
-- Restart service:
$ srvctl start service -d db1_site1 -s db1_service_rw
$ srvctl status service -d db1_site1
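To confirm the rebuild really cleaned things up, we can re-validate the index while watching the alert log as before (a sketch):
SQL> ANALYZE INDEX user1.pk_child_table VALIDATE STRUCTURE ONLINE;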
After all this we finally hoped that our database was back to normal. The application reconnected, we crossed our fingers, and since then there have been no more errors. We are happy that this was just a light corruption.