Script to check Data Guard status from SQL

In a previous blog post I have explained how to get the basic configuration from x$drc and display something like:

There are other possibilities, by using the DBMS_DRS PL/SQL package.

The package is quite rich. In order to get more details, I use CHECK_CONNECT to check the connectivity to the member databases:

example:

In the first case I get no exceptions, that means that the database is reachable using the DGConnectIdentifier specified in the configuration (‘TOOLCDB1_SITE2’ is my database name in the configuration, it is NOT a TNS entry. I use EZConnect in my lab).

In the second case I specify a database that is not in the configuration.

In the third case, it looks like the database is down (no service), or the DGConnectIdentifier is not correct.

 

GET_PROPERTY_OBJ is useful to get e single property of a database/instance:

Example:

Here I have, for the primary (the object_id from x$drc), a TransportLagThreshold of 30 seconds.

DO_CONTROL does a specific check and returns a document with the results:

The problem is… what’s the format for indoc?

To get the correct format, I have enabled sql trace to get the executions, with bind variables, of the dgmgrl commands. It happens that the input format is XML and the output format is HTML.

This is how you can get the LogXptStatus, for example:

 

The big script

So I said… why not trying to have a comprehensive SQL script to check a few vital statuses of Data Guard?

This is the script that came out:

Of course, it is not perfect (many checks missing: FSFO readiness, observer checks, etc.), but it is good enough for base monitoring. Also, it’s faster than a normal shell+dgmgrl script.

Output on a Primary database:

Output on a standby database:

In case of errors (e.g. standby listener stopped), I would get:

So easy to spot the error and use a shell wrapper to grep ^ERROR or similar.

Be careful, the script is not RAC aware, and it lacks some checks, so you might want to reuse it and extend it to fit your exact configuration.

Hope you like it!

Ludovico

Real-Time Cascade Standby Container Databases without Oracle Managed Files

OK, the title might not be the best… I just would like to add more detail to content you can already find in other blogs (E.g. this nice one from Philippe Fierens http://pfierens.blogspot.com/2020/04/19c-data-guard-series-part-iii-adding.html).

I have this Cascade Standby configuration:

Years ago I wrote this whitepaper about cascaded standbys:
https://fr.slideshare.net/ludovicocaldara/2014-603-caldarappr
While it is still relevant for non-CDBs, things have changed with Multitenant architecture.

In my config, the Oracle Database version is 19.7 and the databases are actually CDBs. No Grid Infrastructure, non-OMF datafiles.
It is important to highlight that a lot of things have changed since 12.1. And because 19c is the LTS version now, it does not make sense to try anything older.

First, I just want to make sure that my standbys are aligned.

Primary:

1st Standby alert log:

2nd Standby alert log:

Then, I create a pluggable database (from PDB$SEED):

On the first standby I get:

On the second:

So, yeah, not having OMF might get you some warnings like: WARNING: File being created with same name as in Primary
But it is good to know that the cascade standby deals well with new PDBs.

Of course, this is not of big interest as I know that the problem with Multitenant comes from CLONING PDBs from either local or remote PDBs in read-write mode.

So let’s try a relocate from another CDB:

This is what I get on the first standby:

and this is on the cascaded standby:

So absolutely the same behavior between the two levels of standby.
According to the documentation: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-PLUGGABLE-DATABASE.html#GUID-F2DBA8DD-EEA8-4BB7-A07F-78DC04DB1FFC
I quote what is specified for the parameter STANDBYS={ALL|NONE|…}:
“If you include a PDB in a standby CDB, then during standby recovery the standby CDB will search for the data files for the PDB. If the data files are not found, then standby recovery will stop and you must copy the data files to the correct location before you can restart recovery.”

“Specify ALL to include the new PDB in all standby CDBs. This is the default.”

Specify NONE to exclude the new PDB from all standby CDBs. When a PDB is excluded from all standby CDBs, the PDB’s data files are unnamed and marked offline on all of the standby CDBs. Standby recovery will not stop if the data files for the PDB are not found on the standby. […]”

So, in order to avoid the MRP to crash, I should have included STANDBYS=NONE
But the documentation is not up to date, because in my case the PDB is skipped automatically and the recovery process DOES NOT STOP:

However, the recovery is marked ENABLED for the PDB on the standby, while usind STANDBYS=NONE it would have been DISABLED.

So, another difference with the doc who states:
“You can enable a PDB on a standby CDB after it was excluded on that standby CDB by copying the data files to the correct location, bringing the PDB online, and marking it as enabled for recovery.”

This reflects the findings of Philippe Fierens in his blog (http://pfierens.blogspot.com/2020/04/19c-data-guard-series-part-iii-adding.html).

This behavior has been introduced probably between 12.2 and 19c, but I could not manage to find exactly when, as it is not explicitly stated in the documentation.
However, I remember well that in 12.1.0.2, the MRP process was crashing.

In my configuration, not on purpose, but interesting for this article, the first standby has the very same directory structure, while the cascaded standby has not.

In any case, there is a potentially big problem for all the customers implementing Multitenant on Data Guard:

With the old behaviour (MRP crashing), it was easy to spot when a PDB was cloned online into a primary database, because a simple dgmgrl “show configuration” whould have displayed a warning because of the increasing lag (following the MRP crash).

With the current behavior, the MRP keeps recovering and the “show configuration” displays “SUCCESS” despite there is a PDB not copied on the standby (thus not protected).

Indeed, this is what I get after the clone:

I can see that the Data Guard Broker is completely silent about the missing PDB. So I might think my PDB is protected while it is not!

I actually have to add a check on the standby DBs to check if I have any missing datafiles:

This check should be implemented and put under monitoring (custom metrics in OEM?)

The missing PDB is easy to spot once I know that I have to do it. However, for each PDB to recover (I might have many!), I have to prepare the rename of datafiles and creation of directory (do not forget I am using non-OMF here).

Now, the datafile names on the standby got changed to …/UNNAMEDnnnnn.

So I have to get the original ones from the primary database and do the same replace that db_file_name_convert would do:

and put this in a rman script (this will be for the second standby, the first has the same name so same PATH):

Then, I need to stop the recovery, start it and stopping again, put the datafiles online and finally restart the recover.
These are the same steps used my Philippe in his blog post, just adapted to my taste 🙂

For the second part, I use this HEREDOC to online all offline datafiles:

and finally:

Now, I do not have anymore any datafiles offline on the standby:

I will not publish the steps for the second standby, they are exactly the same (same output as well).

At the end, for me it is important to highlight that monitoring the OFFLINE datafiles on the standby becomes a crucial point to guarantee the health of Data Guard in Multitenant. Relying on the Broker status or “PDB recovery disabled” is not enough.

On the bright side, it is nice to see that Cascade Standby configurations do not introduce any variation, so cascaded standbys can be threated the same as “direct” standby databases.

HTH

Ludovico

Duplicating a DB and setting up Data Guard through CMAN and SSH tunnel

I am fascinated about the new Zero Downtime Migration tool that has been available since November 28th. Whilst I am still in the process of testing it, there is one big requirement that might cause some headache to some customers. It is about network connectivity:

Configuring Connectivity Between the Source and Target Database Servers

The source database server […] can connect to target database instance over target SCAN through the respecitve scan port and vice versa.
The SCAN of the target should be resolvable from the source database server, and the SCAN of the source should resolve from the target server.
Having connectivity from both sides, you can synchronize between the source database and target database from either side. […]

If you are taking cloud migrations seriously, you should have either a VPN site-to-site to the cloud, or a Fast Connect link. At CERN we are quite lucky to have a high bandwidth Fast Connect to OCI Frankfurt.

This requirement might be missing for many customers, so what is a possible solution to setup connectivity for database duplicates and Data Guard setups?

In the picture above you can see a classic situation, that usually has two problems that must be solved:

  • the SCAN addresses are private: not accessible from internet
  • there are multiple SCAN addresses, so tunneling through all of them might be complex

Is it possible to configure CMAN in front of the SCAN listener as a single IP entry and tunnel through SSH to this single IP?

I will show now how to achieve this configuration.

For sake of simplicity, I have put two single instances without SCAN and a CMAN installation on the database servers, but it will work with little modification using SCAN and RAC setups as well. Note that in a Cloud Infrastructure setup, this will require a correct setup of the TDE wallet on both the source and the destination.

Because I put everything on s single host, I have to setup CMAN to listen to another port, but having a separate host for CMAN is a better practice when it has to proxy to SCAN listeners.

Installing and configuring CMAN

The most important part of the whole setup is that the CMAN on the standby site must have a public IP address and open SSH port so that we can tunnel through it.

The on-premises CMAN must have open access to the standby CMAN port 22.

For both primary and standby site you can follow the instructions of my blog post: Install and configure CMAN 19c in the Oracle Cloud, step by step.

In my example, because I install CMAN on the same host of the DB, I configure CMAN to run on port 1522.

CMAN primary config:

CMAN standby config:

This configuration is not secure at all, you might want to secure it further in order to allow only the services needed for setting up Data Guard.

The registration of database services to CMAN through the the remote_listener parameter is optional, as I will register the entries statically in the listener and use a routed connection through CMAN.

Listener configuration

The listener must have a static entry for the database, so that duplicate and switchover work properly.

On primary add to listener.ora:

On standby:

In a RAC config, all the local listeners must be configured with the correct SID_NAME running on the host. Make sure to reload the listeners 😉

Creating the SSH tunnels

There must be two tunnels open: one that tunnels from on-premises to the cloud and the other that tunnels from the cloud to on-premises.

However, such tunnels can both be created from the on-premises CMAN host that has access to the cloud CMAN host:

in my case, the hostnames are:

Important: with CMAN on a host other than the DB server, the CMAN sshd must be configured to have GatewayPorts set to yes:

After the tunnels are open, any connections to the local CMAN server port 1523 will be forwarded to the remote CMAN port 1522.

Configuring the TNSNAMES to hop through CMAN and SSH tunnel

Both servers must have now one entry for the local database pointing to the actual SCAN (or listener for single instances) and one entry for the remote database pointing to local port 1523 and routing to the remote scan.

On-premises tnsnames.ora:

Cloud tnsnames.ora:

After copying the passwordfile and starting nomount the cloud database, it should be possible from both sides to connect as SYSDBA to both DB_CLOUD and DB_ONPREM.

This configuration is ready for both duplicate from active database and for Data Guard.
I still have to figure out if it works with ZDM, but I think it is a big step towards establishing connection between on-premises and the Oracle Cloud when no VPN or Fast Connect are available.

Duplicate from active database

If the setup is correct, this should work:

Setting up Data Guard

  • Configure broker config files
  • Add and clear the standby logs
  • Start the broker
  • Create the configuration:

    The static connect identifier here is better if it uses the TNSNAMES resolution because each database sees each other differently.

Checking the DG config

A validate first:

Than a switchover, back and forth:

Conclusion

Yes, it is possible to setup a Data Guard between two sites that have no connections except mono-directional SSH. The SSH tunnels allow SQL*Net communication to a remote endpoint. CMAN allows to proxy through a single endpoint to multiple SCAN addresses.

However, do not forget about the ultimate goal that is to migrate your BUSINESS to the cloud, not just the database. Therefore, having a proper communication to the cloud with proper performance, architecture and security is crucial. Depending on your target Cloud database, Zero Downtime Migration or MV2ADB should be the correct and supported solutions.

Install and configure CMAN 19c in the Oracle Cloud, step by step

Installing and configuring CMAN is a trivial activity, but having the steps in one place is better than reinventing the wheel.

Prepare for the install

Download the Oracle Client 19.3.0.0 in the Oracle Database 19c download page.

Choose this one: LINUX.X64_193000_client.zip (64-bit) (1,134,912,540 bytes) , not the one named “LINUX.X64_193000_client_home.zip” because it is a preinstalled home that does not contain the CMAN tools.

Access the OCI Console and create a new Compute instance. The default  configuration is OK, just make sure that it is Oracle Linux 7 🙂

Do not forget to add your SSH Public Key to access the VM via SSH!

Access the VM using

Copy the Oracle Client zip in /tmp using your favorite scp program.

Install CMAN

Follow these steps to install CMAN:

 

Basic configuration

This will create a CMAN configuration named cman-test. Beware that it is very basic and insecure. Please read the CMAN documentation if you want something more secure or sophisticated.

The advantage of having the TNS_ADMIN outside the Oracle Home is that if you need to patch CMAN, you can do it out-of-place without the need to copy the configuration files somewhere else.

The advantage of using IFILE inside cman.ora, is that you can manage easily different CMAN configurations in the same host without editing directly cman.ora, with the risk of messing it up.

Preparing the start/stop script

Create a file /u01/app/oracle/scripts/cman_service.sh with this content:

This is at the same time ORACLE_HOME agnostic and configuration agnostic.

Make it executable:

and try to start CMAN:

Stop should work as well:

Add the service in systemctl

Open firewall ports

By default, new OL7 images use firewalld. Just open the port 1521 from the public zone:

 

Bonus: have a smart environment!

Ludo

How to install and access Oracle Weblogic 12.2 in the Oracle Cloud Infrastructure

I put here the steps required to install and access Weblogic in the OCI (mostly for me in case I need to do it again 😉 ). The assumptions are:

  • you already have an account for the Oracle Cloud Infrastructure and you can access the OCI console
  • you already have a Compartment with a VCN and a subnet configured (for test purposes, a VCN created with the default values will be just fine)
  • you already have a keypair for your SSH client (id_rsa, id_rsa.pub)
  • you have an X server on your laptop (if you have Windows, I recommend MobaXTerm, but Xming or other servers are just fine)

Create the compute instance

  • Menu -> Core Infrastructure -> Compute -> Instances -> Create Instance
  • Choose a name for the Instance, all the other fields defaults are fine for test (Oracle Linux 7.6, VM.Standard2.1, etc.)
  • Paste your SSH public key
  • Optionally, under advanced/network, specify a different name for the VM
  • Click on Create to complete the creation

At some point you will have an instance “Green” ready to access:

Click on it and get the public address:

Using your SSH keypair, you can now access the instance with:

 

Setup sshd for SSH tunneling and X11 forwarding

Edit as root the sshd_config:

Modify it so that the following lines are present with these values:

Those values are required for X11 forwarding (required for the graphical installation) and for SSH tunneling (required to access  the Weblogic ports without exposing them over internet).

Then restart sshd:

Install the packages for X11 

At this point, it should be possible to forward X11. You can test by reconnecting with:

and then:

Create the oracle user

At this point, you can reconnect using oracle directly, so X11 forward will work for the oracle user without any additional setup:

 

Follow the canonical steps to install weblogic

If you do not know how to do that, follow this good tutorial by Tim Hall (oracle-base):

Oracle WebLogic Server (WLS) 12cR2 (12.2.1) Installation on Oracle Linux 6 and 7

 

Access the Weblogic console from outside Oracle Cloud

If you configured correctly sshd, once the Oracle Weblogic instance is configured and started, you can tunnel to the port (it should be 7001):

And be able to browse from your laptop using localhost:7001:

HTH

Ludovico

First draft of a Common Oracle Environment… for the Cloud Database (and not only)

I have just published on GitHub a draft of a common Oracle environment scripts that make the shell environment a little bit smarter than what it is by default. It uses some function and aliases that I have published during the past years.

You can start playing with:

Ideal for the Oracle Cloud Infrastructure

If you are new to the Oracle Cloud, probably you do not have environment scripts that makes it easy to interact with the database.

The environment scripts that I have published work out-of the box in the cloud (just make sure that you have rlwrap installed so that you can have a better CLI experience).

Actually, they work great as well on-premises, but I assume that you already have something automatic there.

Some examples

  • My famous Smart Prompt 😉 (including version, edition, exit code, etc)

  • u : gets the status of the databases

  • pmon: just displays the running pmon processes

  • db : sets the environment for a specific DB_NAME, DB_UNIQUE_NAME or SID

  • svcstat : shows the running services (and the corresponding pdb, host, etc) as I described in my previous post

  • s_ : smart alias for sqlplus: connects as sysdba/sysasm by default, or with any arguments that you pass:

  • adr_, dg_ rman_, cm_, lsn_ : aliases for common oracle binaries
  • genpasswd : generates random passwords (default length 30)

  • lsoh: lists the Oracle Homes attached to the inventory

  • setoh: sets the Oracle Home given its name in the inventory

 

You might want to install the same environment for oracle, grid (if you have role separation, it should be the case for Cloud DB Systems) and (eventually) root.

I am curious to know if it works well for your environment.

Cheers

Ludo

Oracle Clusterware Services Status at a glance, fast!

If you use Oracle Clusterware or you deploy your databases to the Oracle Cloud, you probably have some application services defined with srvctl for your database.

If you have many databases, services and nodes, it might be annoying, when doing maintenance or service relocation, to have a quick overview about how services are distributed across the nodes and what’s their status.

With srvctl (the official tool for that), it is a per-database operation:

If you have many databases, you have to run db by db.

It is also slow! For example, this database has 20 services. Getting the status takes 27 seconds:

Instead of operating row-by-row (get the status for each service), why not relying on the cluster resources with crsctl and get the big picture once?

crsctl stat res -f  returns a list of ATTRIBUTE_NAME=value for each service, eventually more than one if the service is not singleton/single instance  but uniform/multi instance.

By parsing them with some awk code can provide nice results!

STATE, INTERNAL_STATE and TARGET are useful in this case and might be used to display colours as well.

  • Green: Status ONLINE, Target ONLINE, STABLE
  • Black: Status OFFLINE, Target OFFLNE, STABLE
  • Red: Status ONLINE, Target OFFLINE, STABLE
  • Yellow: all other cases

Here’s the code:

Here’s what you can expect, for 92 services distributed on 4 nodes and a dozen of databases (the output is snipped and the names are masked):

I’d be curious to know if it works well for your environment, please comment here. 🙂

Thanks

Ludo

Oracle Grid Infrastructure 18c patching part 3: Executing out-of-place patching with the local-mode automaton

I wish I had more time to blog in the recent weeks. Sorry for the delay in this blog series 🙂

If you have not read the two previous blog posts, please do it now. I suppose here that you have the Independent Local-Mode Automaton already enabled.

What does the Independent Local-mode Automaton?

The automaton automates the process of moving the active Grid Infrastructure Oracle Home from the current one to a new one. The new one can be either at a higher patch level or at a lower one. Of course, you will probably want to patch your grid infrastructure, going then to a higher level of patching.

Preparing the new Grid Infrastructure Oracle Home

The GI home, starting from 12.2, is just a zip that is extracted directly in the new Oracle Home. In this blog post I suppose that you want to patch your Grid Infrastructure from an existing 18.3 to a brand new 18.4 (18.5 will be released very soon).

So, if your current OH is /u01/app/grid/crs1830, you might want to prepare the new home in /u01/app/grid/crs1840 by unzipping the software and then patching using the steps described here.

If you already have a golden image with the correct version, you can unzip it directly.

Beware of four important things: 

  1. You have to register the new Oracle home in the Central Inventory using the SW_ONLY install, as  described here.
  2. You must do it for all the nodes in the cluster prior to upgrading
  3. The response file must contain the same groups (DBA, OPER, etc) as the current active Home, otherwise errors will appear.
  4. You must relink by hand your Oracle binaries with the RAC option:
    $ cd /u01/app/grid/1crs1840/rdbms/lib
    $ make -f ins_rdbms.mk rac_on ioracle

In fact, after every attach to the central inventory the binaries are relinked without RAC option, so it is important to activate RAC again to avoid bad problems when upgrading the ASM with the new Automaton.

Executing the move gihome

If everything is correct, you should have now the current and new Oracle Homes, correctly registered in the Central Inventory, with the RAC option activated.

You can now do a first eval to check if everything looks good:

My personal suggestion at least at your first experiences with the automaton, is to move the Oracle Home on one node at a time. This way, YOU control the relocation of the services and resources before doing the actual move operation.

Here is the execution for the first node:

From this output you can see at line 15 that the cluster status is NORMAL, then the cluster is stopped on node 1 (lines 16 to 100), then the active version is modified in the oracle-ohasd.service file (line 101), then started back with the new version (lines 102 to 171). The cluster status now is ROLLING PATCH (line 172). The TFA and the node list are updated. 

Before continuing with the other(s) node(s), make sure that all the resources are up & running:

You might want as well to relocate manually your resources back to node 1 prior to continuing on node 2.

After that, node 2 can follow the very same procedure:

As you can see, there are two differencse here: the second node was in this case the last one, so the cluster status gets back to NORMAL, and the GIMR is patched with datapatch (lines 176-227).

At this point, the cluster has been patched. After some testing, you can safely remove the inactive version of Grid Infrastructure using the deinstall binary ($OLD_OH/deinstall/deinstall).

Quite easy, huh?

If you combine the Independent Local-mode Automaton with a home-developed solution for the creation and the provisioning of Grid Infrastructure Golden Images, you can easily achieve automated Grid Infrastructure patching of a big, multi-cluster environment.

Of course, Fleet Patching and Provisioning remains the Rolls-Royce: if you can afford it, GI patching and much more is completely automated and developed by Oracle, so you will have no headaches when new versions are released. But the local-mode automaton might be enough for your needs.

— 

Ludo

Oracle Grid Infrastructure 18c patching part 2: Independent Local-mode Automaton architecture and activation

The first important step before starting using the new Independent Local-mode Automaton is understanding which are its components inside a cluster.

Resources

Here’s the list of service that you will find when you install a Grid Infrastructure 18c: