Oracle


Impact of Azure-Oracle OCI interconnect on a multi-cloud solution 3

Introduction

Oracle and Microsoft announced in June 2019 a cloud interoperability partnership which enables workloads across Microsoft Azure and Oracle Cloud. By creating a first joint multi-cloud solution, the software giants can each continue to provide the best of their services. At the same time customers do not need to decide which vendor they opt-out when moving their on-premises constructs.

Being myself an Oracle Database Administrator, this article aims to check what is the impact of distributing resources in multiple clouds, with databases remaining on Oracle Cloud Infrastructure.

It is not aim of the article to discuss the costs of resources on any of the clouds.

(more…)

OCI / Azure and Oracle DB: firewall and security rules

Many of the problems with VMs on Clouds are related to firewall or security rules. Here some elements that might help.

These examples are based on Oracle Linux 7.7 virtual machines.

Firewall: check if it is running

When is loaded and running (active) is shows like:

[bash][opc@alfama ~]$ sudo systemctl status firewalld 
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-05-15 05:19:12 GMT; 15min ago
Docs: man:firewalld(1)
Main PID: 1640 (firewalld)
Tasks: 2
Memory: 33.2M
CGroup: /system.slice/firewalld.service
└─1640 /usr/bin/python2 -Es /usr/sbin/firewalld --nofork --nopid

When it is loaded but stopped:

[opc@alfama ~]$ sudo systemctl status firewalld -l
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2020-05-15 05:36:09 GMT; 7s ago
Docs: man:firewalld(1)
Process: 1640 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 1640 (code=exited, status=0/SUCCESS)

Firewall: check ports open

[opc@boavista ~]$ sudo firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: dhcpv6-client ssh
  ports: 1521/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Firewall: open Oracle listener port

If we want to open the default listener port, two commands are needed. One for adding the rule and another to enable the new rule:

[opc@mouraria bin]$ sudo firewall-cmd --zone=public --permanent --add-port=1521/tcp
[opc@mouraria bin]$ sudo firewall-cmd --reload

Firewall: allow ping reponses

Ping uses a separate protocol. To allow the VMs to answer pings, this rule needs to be added to the firewall:

[opc@mouraria bin]$ sudo firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT 0 -p icmp -s 0.0.0.0/0 -d 0.0.0.0/0 -j ACCEPT
[opc@mouraria bin]$ sudo systemctl restart firewalld.service

On OCI, to be able to ping between hosts, it might be necessary to add an Ingress rule that allows pings. The rule should look like below. The most important is the Protocol ICMP and type 8. The source CIDR should not be larger than the VCN CIDR.

Test if port is open:

To check if the remote port is open, I use the follow commands. This works both with IPs or hostnames and also lets you know if port is open, but listener is not running (on this port):

[opc@alfama ~]$ ### PORT UNREACHEABLE, SERVER DOWN/WRONG? ###
[opc@alfama ~]$ export CHECK_IP="10.1.2.4/1521"
[opc@alfama ~]$ timeout 1 bash -c '</dev/tcp/${CHECK_IP} && echo Port ${CHECK_IP} is open || echo Port ${CHECK_IP} is closed' || echo Connection timeout
Connection timeout

[opc@alfama ~]$ ### PORT OPEN, USING HOSTNAME ###
[opc@alfama ~]$ export CHECK_IP="luz.subnetpriv2/1521"
[opc@alfama ~]$ timeout 1 bash -c '</dev/tcp/${CHECK_IP} && echo Port ${CHECK_IP} is open || echo Port ${CHECK_IP} is closed' || echo Connection timeout
Port luz.subnetpriv2/1521 is open

[opc@alfama ~]$ ### PORT OPEN, USING IP ###
[opc@alfama ~]$ timeout 1 bash -c '</dev/tcp/${CHECK_IP} && echo Port ${CHECK_IP} is open || echo Port ${CHECK_IP} is closed' || echo Connection timeout
Port 10.1.5.3/1521 is open

[opc@alfama ~]$ ### PORT CLOSED ###
[opc@alfama ~]$ export CHECK_IP="luz.subnetpriv2/1522"
[opc@alfama ~]$ timeout 1 bash -c '</dev/tcp/${CHECK_IP} && echo Port ${CHECK_IP} is open || echo Port ${CHECK_IP} is closed' || echo Connection timeout
bash: connect: No route to host
bash: /dev/tcp/luz.subnetpriv2/1522: No route to host
Port luz.subnetpriv2/1522 is closed

[opc@alfama ~]$ ### PORT OPEN but LISTENER DOWN ###
[opc@alfama ~]$ timeout 1 bash -c '</dev/tcp/${CHECK_IP} && echo Port ${CHECK_IP} is open || echo Port ${CHECK_IP} is closed' || echo Connection timeout
bash: connect: Connection refused
bash: /dev/tcp/10.1.5.3/1521: Connection refused
Port 10.1.5.3/1521 is closed

Bonus

At Oracle OCI, one can access other VMs of the VCN using the <hostname>.<subnet>

On Azure, the <hostname> is enough to access other host of the VNet.


CRS-6706: Oracle Clusterware Release patch level (‘2565072065’) does not match Software patch level (‘277169099’)

Today on a dirt and quick patching session I end up with the following error when restarting the Oracle Restart Grid Infratructure.

The complete error was:

oracle@anjovm1: [+ASM] crsctl start has
CRS-6706: Oracle Clusterware Release patch level ('2565072065') does not match Software patch level ('277169099'). Oracle Clusterware cannot be started.
CRS-4000: Command Start failed, or completed with errors.

This was because I’ve installed the OJVM RU to the Grid home. The solution was simply:

oracle@anjovm1: ~oracle/ [+ASM] /u00/app/grid/18.0.0/OPatch/opatch rollback -id 28790647

And then I could restart:

oracle@anjovm1:~oracle/ [+ASM] crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Index columns query

Query to show all indexes, its characteristics and respective columns from a table.
Note: It does not filter on owner.

select c.index_name,
       c.table_name||'('||listagg(column_name,',') within group (order by column_position)||')' as columns,
       decode(status,'VALID',null,status||' ')
       ||decode(visibility,'VISIBLE',null,visibility||' ')
       ||decode(segment_created,'YES',null,'EMPTY ')
       ||decode(uniqueness,'NONUNIQUE',null,uniqueness||' ')
       ||funcidx_status as STATUS
 from dba_ind_columns c, dba_indexes i
where i.table_name=c.table_name and i.index_name=c.index_name
  and c.table_name='&TABLE_NAME'
group by c.table_name,c.index_name,i.status,visibility, segment_created,uniqueness,funcidx_status
order by c.table_name,c.index_name;

Example of output:

INDEX_NAME                     COLUMNS                                 STATUS  
------------------------------ --------------------------------------- ----------
IDX_LIB_VERSION_PROFILE_ID     LIB_VERSION(PROFILE_ID)                       
INX_LIB_VERSION_SRC            LIB_VERSION(DATA_RESOURCE,SOURCE)             
LIB_VERSION_NAME_IDX           LIB_VERSION(VERSION_NAME)                     
LIB_VERSION_PK                 LIB_VERSION(ID)                         UNIQUE
LIB_VERSION_UK1                LIB_VERSION(PROFILE_ID,VERSION_NAME)    UNIQUE

Find NOSEGMENT indexes that you forgot to delete

Err is human and I, as human, I do forget sometimes to clean-up some tests.

Recently this happen with NOSEGMENT indexes. Yesterday I received an email from a colleague, reporting some errors he received with Datapump:

ORA-08114: can not alter a fake index
Failing sql is:
ALTER INDEX "SUPERAPP"."IDX_POSITION_TEST" NOPARALLEL
ORA-39083: Object type INDEX failed to create with error:
ORA-08114: can not alter a fake index

The indexes created with NOSEGMENT do not appear in DBA_INDEXES (don’t get fool by the SEGMENT_CREATED column).
They do appear on DBA_OBJECTS, but there they look like any other index.

The solution is to match the two tables. So to find the NOSEGMENT indexes that you forgot to delete you can use the following query:

SELECT 'DROP INDEX '||owner||'.'||object_name||';' cmd, created FROM dba_objects 
WHERE object_type='INDEX' 
AND object_name NOT IN (SELECT index_name FROM  dba_indexes);

If you wonder what are NOSEGMENT indexes, a good source is Oracle-Base.


Query DGMGRL silently from observer site

When the observer is “checking” several dataguard configurations, we can use this server to easily make queries on all of them, for instance using:

for db in `ps -ef | grep observer | awk -F '/' '{print $6}' | sort`; do
conn=`sed -n 1p /u00/app/oracle/admin/${db}/observer/fsfo_${db}.conf`
conndb=`echo $conn | cut -d "@" -f2`
dgmgrl -silent ${conn} "show configuration" | grep -E 'Primary'
done;

It does for every observed configuration:

  1. gets the DB name
  2. find the connection string in the config file
  3. connects to DGMGRL in silent more and runs a command

Index Logical Corruption!

Arriving to my client on Monday, I see the local DBA already with two people behind and him concerned about something. He was doing a new failover of a critical database that had problems during the weekend due to a change on a firewall. The timeline was somehow like this:

  • Friday afternoon – firewall change
  • Saturday morning – many syslog errors and almost impossible to use the DB (due to audit to OS level)
  • Saturday afternoon – failover of the database to secondary site + set syslogd to restart every 5 minutes
  • Sunday morning – reinstate old primary as standby (firewall change was fixed/rolledback)
  • Monday morning – failover of database to primary site + reinstate DB on secondary site

All looked back fine by end of the morning, we were happy that the reinstate worked fine, even though it was necessary to do it using sqlplus, as the broker was giving errors.

Monday midday the users complaint they are receiving ORA-01555 for some queries.

Mon Aug 20 11:21:35 2018
ORA-01555 caused by SQL statement below (SQL ID: cg61q9avtk3ta, Query Duration=1 sec, SCN: 0x01a9.3ae6da9e):
delete from child_table where parent_id = :parent and child_id in (:child1)

We check alertlog and see the ORA-01555 on two queries filtering on the same key-value pair. Probably a wrong plan was generated and undo exploded, I think. But the undo tablespace is not even half used. And query duration is 1 second, on the alertlog message.

This is strange. Also more strange is that few time after we also get ORA-00600 on the alertlog:

Mon Aug 20 16:16:54 2018
Errors in file /u00/app/oracle/diag/rdbms/db1_site1/DB1/trace/DB1_ora_13217.trc  (incident=1543593):
ORA-00600: internal error code, arguments: [ktbdchk1: bad dscn], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u00/app/oracle/diag/rdbms/db1_site1/DB1/incident/incdir_1543593/DB1_ora_13217_i1543593.trc

This is case for Metalink. And there we find the reason:

ALERT Bug 22241601 ORA-600 [kdsgrp1] / ORA-1555 / ORA-600 [ktbdchk1: bad dscn] / ORA-600 [2663] due to Invalid Commit SCN in INDEX (Doc ID 1608167.1)

Index corruption, nice. It could be worse, the workaround is to recreate the problematic indexes.

Ok, then we should run DB Verify to validate all indexes. There is this new option to run DBV on specific segments, which is great, and DBV was even enhanced to find this specific corruption:

Bug 7517208 – DBV enhanced to identify Logical SCN Block corruptions (Doc ID 7517208.8)

But trying to do on datafiles on ASM is not easy, does not seem to work… until we discover that:

Bug 18353157 – DBVERIFY (DBV) does not show SCN mismatch for datafiles in ASM (Doc ID 18353157.8)

Next step is to try to do with a simple ANALYZE INDEX. After some minutes the alert log starts showing a lot of:

Tue Aug 21 10:36:43 2018
ORA-01555 caused by SQL statement below (SQL ID: 6c3gh913k55pr, SCN: 0x01a9.3b4ea9e1):
ANALYZE INDEX USER1.IX_ABC VALIDATE STRUCTURE ONLINE
ORA-01555 caused by SQL statement below (SQL ID: 6c3gh913k55pr, SCN: 0x01a9.3b4ea9ee):
ANALYZE INDEX USER1.IX_ABC VALIDATE STRUCTURE ONLINE
…

Unfortunately the errors do not appear on sqlplus, so we need to have a tail -f on the alertlog, keep the eyes open and then cancel the ANALYZE INDEX.

As there seemed to be corruption to several indexes of USER1, we decided that it was just easier to directly rebuild all of them. The index rebuild need to be ONLINE. Not because we can’t support having locks on the tables of the application, but because only the ONLINE rebuild will do a full table scan to recreate the index. A normal rebuild uses the index itself, which is not good when it has some corruption.

One mistake we did at some point was to ANALYZE the primary key of one table and start an INDEX REBUILD ONLINE of an index on the same table. The ANALYZE and INDEX REBUILD, were locking each other (or advancing very slowly). So it is better not to perform both at the same time.

When we tried to rebuild online the primary key, we got the error:

ORA-08108: may not build or rebuild this type of index online

First we tought it would be due the fact it is a Primary Key, but soon we tumbled upon Jonnathan Lewis blog entry about this: https://jonathanlewis.wordpress.com/2012/09/04/online-rebuild-2/

Our Primary key is DEFERRABLE! Oups! The only option is now to change or deactivate the constraint, rebuild the index and go on. But this is not possible with the application running. So it implies an emergency change and a lot of actions. Here the technical summary on this Oracle Restart environment:

-- Stop service:
$ srvctl stop service -d db1_site1 -s db1_service_rw
$ srvctl status service -d db1_site1

-- Kill sessions using service and/or from users:
set pages 0
spool /tmp/to_kill.sql
select 'alter system disconnect session '''||sid||','||serial#||''' immediate;' from v$session where service_name='DB1_SERVICE_RW';
spool off

$ vi /tmp/to_kill.sql

SQL> @/tmp/to_kill.sql

-- Confirm no other sessions:
SQL> select username, program,machine from v$session where service_name='DB1_SERVICE_RW';

-- Recreate index:
SET TIME ON
SET TIMING ON
ALTER SYSTEM SET job_queue_processes=0;
ALTER SESSION ENABLE PARALLEL DDL;
ALTER TABLE user1.child_table DISABLE CONSTRAINT pk_child_table KEEP INDEX;
ALTER INDEX user1.pk_child_table REBUILD ONLINE PARALLEL 12;
ALTER TABLE user1.child_table ENABLE CONSTRAINT pk_child_table;
ALTER INDEX user1.pk_child_table NOPARALLEL;
ALTER SYSTEM SET job_queue_processes=10;

-- Restart service:
$ srvctl start service -d db1_site1 -s db1_service_rw
$ srvctl status service -d db1_site1

After all this finally we hoped that our database was back to normal. Application reconnects, we cross our fingers and since then no more errors. We are happy that this was just a light corruption.


Oracle 12.1 – Find OCR Master Node

I’ve been dealing with a problem where, sometimes, a rebooted RAC node is unable to join back the cluster. The issue seems to be with the “Master Node”, which refuses to accept the node.

So I’ve to know which is the “Master Node” (the current known solution is to reboot it, and then all nodes join the cluster).

There is the Oracle note: How to Find OCR Master Node (Doc ID 1281982.1)

And there is this blog entry: 11G R2 RAC: How to identify the master node in RAC

In my case I’m using Oracle 12.1.0.2, the location of the files is a bit different. The location of the OCR Master Node can be found on this version using one of the following ways:

  • Check the crsd logs for “OCR MASTER”
grep "OCR MASTER" ${ORACLE_BASE}/diag/crs/`hostname`/crs/trace/crsd*

and, if the logs did not rotate too much yet, you should see one of the two below:

/u00/app/oracle/diag/crs/anjovm1/crs/trace/crsd_73.trc:2018-01-13 14:05:30.535186 :  OCRMAS:3085: th_master:13: I AM THE NEW OCR MASTER at incar 2. Node Number 1

/u00/app/oracle/diag/crs/anjovm2/crs/trace/crsd_71.trc:2018-01-13 14:05:32.823231 :  OCRMAS:3085: th_master: NEW OCR MASTER IS 1
  • Check the location of the OCR automatic backups

the cluster node currently keeping the backups, is the OCR master node. If you see older backups on other nodes, it was when they were OCR master nodes on its turn.

ls -l /u00/app/12.1.0.2/grid/cdata/<cluster_name>
-rw-r--r--    1 root     system      943266 Jan  14 00:01 backup00.ocr
-rw-r--r--    1 root     system      943266 Jan 13 20:01 backup01.ocr
-rw-r--r--    1 root     system      943266 Jan 13 16:01 backup02.ocr
-rw-r--r--    1 root     system      943266 Jan 13 00:00 day.ocr
-rw-r--r--    1 root     system      943266 Jan  14 00:01 day_.ocr
-rw-r--r--    1 root     system      943266  Dec 31 23:55 week.ocr
-rw-r--r--    1 root     system      943266 Jan 07 23:59 week_.ocr

Note: do not confuse the OCR master node with the Cluster Health Monitor repository master node, which you get using the command:

/u00/app/12.1.0.2/grid/bin/oclumon manage -get MASTER

 


Check Advanced Compression option usage on Oracle database

To check if the “advanced compression” Oracle database option was used on your environment is not so easy, as there are several activities that can activate this option.

For example for a mix Oracle 11.2.0.4 and 12.1.0.2 environment we can use the following query:

select * from DBA_FEATURE_USAGE_STATISTICS
where ((name like '%Compress%' and name not in ('Backup BASIC Compression','Backup BZIP2 Compression','SecureFile Compression (system)')) 
  and detected_usages>0)
or (name='Data Guard' and FEATURE_INFO like '%Compression used: TRUE%')
or (name in ('Oracle Utility Metadata API','Oracle Utility Datapump (Export)','Oracle Utility Datapump (Import)') 
  and version like '11.2.%' and feature_info not like '%compression used: 0 times%')
or (name in ('Oracle Utility Metadata API','Oracle Utility Datapump (Import)') 
  and version like '12.%' and feature_info like '%compression%')
or (name in ('Oracle Utility Datapump (Export)') 
  and version like '12.%' and feature_info not like '%compression used: 0 times%');

Why do we need so many filters? Because the text that shows if compression was used changes depending on the utility and Oracle version!
Below 3 different examples when compression was not used:

Version Utilities FEATURE_INFO text
12.1 Oracle Utility Datapump (Export) invoked: 1 times, compression used: 0 times (BASIC algorithm used: 0 times, LOW algorithm used: 0 times, MEDIUM algorithm used: 0 times, HIGH algorithm used: 0 times), encryption used: 0 times (AES128 algorithm used: 0 times, AES192 algorithm used: 0 times, AES256 algorithm used: 0 times, PASSWORD mode used: 0 times, DUAL mode used: 0 times, TRANSPARENT mode used: 0 times), parallel used: 0 times, full transportable used: 0 times
12.1 Oracle Utility Datapump (Import)

Oracle Utility Metadata API

invoked: 1 times, parallel used: 0 times, full transportable used: 0 times
11.2 Oracle Utility Metadata API invoked: 1 times, compression used: 0 times, encryption used: 0 times

Also, seems that for “Data Guard” utility the text has capital letters “%Compression used: TRUE%”.

More information about which features enable the usage of Advanced Compression option can be find here.

And how to avoid the usage of the Advanced Compression option by error on Mathias Zarick blog.


Back to the basics: correct parameter from SPFILE in ASM

When you change the size of memory and forget to update SGA_MAX_SIZE it might happen:

oracle@anjo01:/home/oracle [] srvctl start database -d p05ltm02
PRCR-1079 : Failed to start resource ora.p05ltm02.db
CRS-5017: The resource action "ora.p05ltm02.db start" encountered the following error:
ORA-01078: failure in processing system parameters
ORA-00823: Specified value of sga_target greater than sga_max_size

As this is a clustered database with ASM, here the easiest way to solve:

1. Find the full path of the SPFILE on ASM

oracle@anjo01:/home/oracle [p05ltm02_1] srvctl config database -d p05ltm02 -a
Database unique name: p05ltm02
Database name: ltm02
Oracle home: /oracle/app/product/11.2.0.4
Oracle user: oracle
Spfile: +DGDATA/p05ltm02/spfileltm02.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: sp_ha01
Database instances:
Disk Groups: DGDATA,DGFRA
Mount point paths:
Services: p05ltm02_app.anjo01
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: p05ltm02
Candidate servers: anjo01,anjo02
Database is enabled
Database is policy managed

2. Create a PFILE from SPFILE on the local filesystem

oracle@anjo01:/home/oracle [p05ltm02_1] sqh
SQL*Plus: Release 11.2.0.4.0 Production on Tue Feb 2 09:33:47 2016
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Connected to an idle instance.

SQL> create pfile='/tmp/pfile.txt' from spfile='+DGDATA/p05ltm02/spfileltm02.ora';
File created.

SQL> exit
Disconnected

3. Correct the damn parameter

oracle@anjo01:/home/oracle [p05ltm02_1] vi /tmp/pfile.txt

4. Recreate the SPFILE on ASM:

oracle@anjo01:/home/oracle [p05ltm02_1] sqh
SQL*Plus: Release 11.2.0.4.0 Production on Tue Feb 2 09:34:44 2016
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Connected to an idle instance.

SQL> create spfile='+DGDATA/p05ltm02/spfileltm02.ora' from pfile='/tmp/pfile.txt';
File created.

SQL> exit
Disconnected

5. And finally restart the database:

oracle@anjo01:/home/oracle [p05ltm02_1] srvctl start database -d p05ltm02