Video: Where should I put the Observer in a Fast-Start Failover configuration?

The video explains best practices and the failure scenarios associated with different observer placements. It also shows how to configure high availability for the observer.

Here’s the summary:

  • Always try to put the observer(s) on an external site.
  • If you don’t have an external site, put the observer where the primary database is, and have another one ready on the secondary site to take over after a role transition.
  • Don’t put the observer together with the standby database!
  • Configure multiple observers for high availability, and use the PreferredObserverHosts Data Guard member property to ensure you never run the observer where the standby database is.

 

Far Sync and Fast-Start Failover Protection modes

Oracle advertises Far Sync as a solution for “Zero Data Loss at any distance”. This is because the primary sends its redo stream synchronously to the Far Sync, which relays it to the remote physical standby.

There are many reasons why Far Sync is an optimal solution for this use case, but that’s not the topic of this post 🙂

Some customers ask: Can I configure Far Sync to receive the redo stream asynchronously?

Although a direct standby receiving asynchronously would be a better idea, Far Sync can receive asynchronously as well.

One reason might be sending asynchronously to a single Far Sync instance that redistributes the redo locally to many standby databases.

It is very simple to achieve: just change the RedoRoutes property on the primary.
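As a sketch, with hypothetical member names (not taken from a real configuration):

    DGMGRL> EDIT DATABASE mydb_prim SET PROPERTY RedoRoutes = '(LOCAL : myfarsync ASYNC)';
    DGMGRL> EDIT FAR_SYNC myfarsync SET PROPERTY RedoRoutes = '(mydb_prim : mydb_stby ASYNC)';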

This will work seamlessly. The v$dataguard_process view will show the asynchronous redo transport process:
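A minimal query to check it (the TT processes handle ASYNC redo transport):

    SQL> SELECT name, pid, role FROM v$dataguard_process;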

 

What about Fast-Start Failover?

Up to and including 19c, ASYNC transport to Far Sync will not work with Fast-Start Failover (FSFO).

ASYNC redo transport mandates the Maximum Performance protection mode, and FSFO supports that mode in conjunction with Far Sync only starting with 21c.

Before 21c, trying to enable FSFO with a Far Sync in Maximum Performance mode fails with an error.

So if you want FSFO with Far Sync in 19c, it has to be MaxAvailability (with SYNC redo transport to the Far Sync).
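A sketch of that 19c combination (member names are hypothetical):

    DGMGRL> EDIT DATABASE mydb_prim SET PROPERTY RedoRoutes = '(LOCAL : myfarsync SYNC)';
    DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;
    DGMGRL> ENABLE FAST_START FAILOVER;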


If you don’t need FSFO, as we have seen, there is no problem. The only protection mode that will not work with Far Sync is Maximum Protection.

If FSFO is required, and you want Maximum Performance before 21c, or Maximum Protection, you have to remove Far Sync from the redo route.

Ludovico

Can a physical standby database receive the redo SYNC if the Far Sync instance fails?

The answer is YES.

In the following configuration, cdgsima_lhr1pq (primary) sends synchronously to cdgsima_farsync1 (far sync), which forwards the redo stream asynchronously to cdgsima_lhr1bm (physical standby):

But if cdgsima_farsync1 is not available, I want the primary to send synchronously to the physical standby database. I accept a performance penalty, but I do not want to compromise my data protection.

I just need to set up the RedoRoutes as follows:
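Using the member names above, the properties should look something like this (a sketch based on the documented RedoRoutes priority syntax):

    DGMGRL> EDIT DATABASE cdgsima_lhr1pq SET PROPERTY RedoRoutes = '(LOCAL : (cdgsima_farsync1 SYNC PRIORITY=1, cdgsima_lhr1bm SYNC PRIORITY=2))';
    DGMGRL> EDIT FAR_SYNC cdgsima_farsync1 SET PROPERTY RedoRoutes = '(cdgsima_lhr1pq : cdgsima_lhr1bm ASYNC)';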

This fallback is defined by the second part of the RedoRoutes rule: the PRIORITY=2 member (cdgsima_lhr1bm SYNC PRIORITY=2) makes the primary ship redo synchronously and directly to the physical standby whenever the PRIORITY=1 destination is unavailable.

Let’s test. If I shutdown abort the farsync instance:
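For instance, on the Far Sync host:

    $ sqlplus / as sysdba
    SQL> shutdown abort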

I can see the new SYNC destination being opened almost instantaneously (because the old destination fails immediately with ORA-03113).

Indeed, I can see the new NSS process (synchronous redo transport) spawn at that time:
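A quick way to check is querying v$dataguard_process (a sketch; NSS process names follow the NSSn pattern):

    SQL> SELECT name, pid, role FROM v$dataguard_process WHERE name LIKE 'NSS%';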

Ludo

Can I rename a PDB in a Data Guard configuration?

Someone asked me this question recently.

The answer is: yes!

Let’s see it in action.

On the primary I have:
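For example, with show pdbs in SQL*Plus (output omitted here):

    SQL> show pdbs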

And of course the same PDBs exist on the standby.

Let’s rename the PDB RED to TOBY. The rename operation is straightforward, but it requires a brief downtime. It must be done on the primary:
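The steps are something like the following (a sketch of the standard rename procedure: the PDB must be opened in restricted mode before renaming):

    SQL> alter pluggable database RED close immediate;
    SQL> alter pluggable database RED open restricted;
    SQL> alter session set container=RED;
    SQL> alter pluggable database rename global_name to TOBY;
    SQL> alter pluggable database close immediate;
    SQL> alter pluggable database open;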

On the standby, I can see that the PDB has changed its name.

The PDB name change is propagated transparently with the redo apply.

Ludo

rhpctl addnode gihome: specify HUB or LEAF when adding new nodes to a Flex Cluster

I have a customer trying to add a new node to a cluster using Fleet Patching and Provisioning.

The error in the command output is not very friendly.

The function name “RHPHELP_preNodeAddVal” in the error already gives an idea of the cause: something related to the “cluvfy stage -pre nodeadd” checks that we normally run when adding a node by hand. FPP does not actually run cluvfy, but it calls the same primitives that cluvfy is based on.

In FPP, when the error does not give any useful information, this is the flow to follow:

  • Use “rhpctl query audit” to get the date and time of the failing operation.
  • Open “rhpserver.log.0” and look for the operation log in that time frame.
  • Get the UID of the operation from the log line (e.g., “1556344143”).
  • Isolate the log for the operation: grep $UID rhpserver.log.0 > $UID.log
  • Locate the trace file of the rhphelper remote execution.
  • Find the root cause in the rhphelper trace.

In this case, the target cluster is a Flex Cluster, so the command must be run specifying the node_role.

The documentation is not clear (we will fix it soon):

node_role must be specified for Flex Clusters, and it must be either HUB or LEAF.
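The correct command line looks something like this (working copy and node names are hypothetical):

    $ rhpctl addnode gihome -workingcopy WC_gi19 -newnodes node5:node5-vip:HUB -root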

After using the correct command line, the command succeeded.

HTH

Ludovico

Changing FPP temporary directory (/tmp in noexec and other issues)

When using FPP, you might experience error PRVF-7546.

This is often related to the /tmp filesystem being mounted with the “noexec” option:
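You can check it like this (the output line is illustrative):

    $ mount | grep ' /tmp '
    tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec)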

Although it is tempting to just remount the filesystem with “exec”, you might be in this situation because your systems are configured to adhere to the STIG recommendations:

The noexec option must be added to the /tmp partition (https://www.stigviewer.com/stig/red_hat_enterprise_linux_6/2016-12-16/finding/V-57569)

FPP 19.9 contains fix 30885598, which allows specifying the temporary location for FPP operations.

After that, the operation should run smoothly.

HTH

Ludo

Oracle Fleet Patching and Provisioning (FPP): My new role as PM and a brand new series of blog posts

It’s been six years since I first tried FPP (formerly Rapid Home Provisioning, or RHP).

Rapid Home Provisioning

FPP was still young and lacking many features at that time, but it already changed the way I worked in the following years. I embraced out-of-place patching, developed some basic scripts to install Oracle Homes, and sought automation and standardization at all costs:

Oracle Home Management – part 7: Putting all together

When 18c came with the FPP local-mode automaton, I implemented it for the Grid Infrastructure patching strategy at CERN:

Oracle Grid Infrastructure 18c patching part 3: Executing out-of-place patching with the local-mode automaton

And I discovered that, in the meantime, FPP had made giant steps, with many new features and fixes for quite a few usability and performance problems.

Last year, when joining the Oracle Database High Availability (HA), Scalability, and Maximum Availability Architecture (MAA) Product Management team at Oracle, I took on (among other things) the Product Manager role for FPP.

Becoming an Oracle employee after 20 years of working with Oracle technology is a big leap. It allows me to understand how big the company is, and how collaborative and friendly Oracle employees are. (Yes, I was used to marketing nonsense, insistent salespeople, and unfriendly license auditors. This is slowly changing with Oracle embracing the Cloud, but it is still a fresh wound for many customers. Expect this to change even more! As for me… I’ll be the same I’ve always been 🙂 )

Now I have daily meetings with big customers (bigger than any I have worked with in the past), development teams, other product managers, Oracle consultants, and community experts. My primary goal is to make the product better, increase its adoption, and help customers have the best experience with it. This includes testing the product myself, writing specs, preparing presentations and videos, collecting feedback from customers, tracking bugs, and managing escalations.

I am a Product Manager for other products as well, but I have to admit that FPP is the product that takes most of my Product Manager time. Why?

I will give a few reasons in my next blog post(s).

Ludo

The fear of (availability) loss is a path to the dark side.

I have been a DBA/consultant for customers and big production environments for over twenty years. I have explained my career path, more or less, in this blog post.

Database (and application) high availability has always been one of my favorite areas. Over the years I have become a high availability expert (my many blog posts are there to confirm it), and I have spent a lot of time building, troubleshooting, teaching, presenting, and advocating for those gems of technology that are RAC, Data Guard, Application Continuity, and the many other products that are part of the Oracle Maximum Availability Architecture solution. Customers fear downtime, and I have always been with them on that. But in my case, it looks like Yoda’s famous quote worked well for me (in a good way):

I’ll be joining the Oracle Maximum Availability Architecture Product Management team as MAA Product Manager (or rather Cloud MAA, but I will not explain that here ;-)) next November.

(for those who are not familiar with the joke, the “Dark Side” is how we often refer to the Oracle employees in the Oracle Community 😉 )

I remember, as if it were yesterday, presenting some Data Guard 12c new features in front of a big audience at Collaborate 2014. There I met two incredible people from the MAA Product Management team: Larry Carpenter and Markus Michalewicz. Larry has been a great source of inspiration for improving my confidence and ease when presenting in public, while Markus has become a friend over the years, in addition to being one of the most influential people in my professional network.

Now I have the opportunity to join that team, and it feels like the most natural move in my career.

And because I imagine some of you will have questions, here are the answers to those I’ve been asked most frequently so far:

  • MAA PM does not mean becoming a team lead or supervising other colleagues; I’ll be a “regular” PM
  • I will stay in Switzerland and work remotely from here
  • I will stay in “the conference circus” and keep presenting as soon as the COVID-19 situation allows it
  • Yes, I was VERY happy at Trivadis, and it will always have a special place in my heart
  • Yep, that means no ACE Director award anymore 😉

Exciting times ahead! 🙂

Data Guard, Easy Connect and the Observer for multiple configurations

EZConnect

One of the challenges of automation in big Oracle environments is dealing with tnsnames.ora files.
These files might grow big and are sometimes hard to distribute and maintain properly.
The worst case is when manual modifications are needed: manual operations, if not done carefully, can break the connection to the databases.
The best solution is always using LDAP naming resolution. I have seen customers using OID, OUD, Active Directory, and OpenLDAP, all with a great level of control and automation. However, some customers don’t have (or don’t want) this possibility and keep relying on TNS naming resolution.
When Data Guard (and eventually RAC) is in place, the tnsnames.ora gets filled with entries for the DGConnectIdentifier and StaticConnectIdentifier properties. If I add the observer, an additional entry is required to access the dbname_CFG service created by Fast-Start Failover.

Actually, none of these entries are required if I use Easy Connect.

My friend Franck Pachot wrote a couple of nice blog posts about Easy Connect while working with me at CERN:
https://medium.com/@FranckPachot/19c-easy-connect-e0c3b77968d7

https://medium.com/@FranckPachot/19c-ezconnect-and-wallet-easy-connect-and-external-password-file-8e326bb8c9f5

Basic Data Guard configuration

The basic configuration with Data Guard is quite simple to achieve with Easy Connect. In this example I have:
– The primary database TOOLCDB1_SITE1
– The duplicated database for standby TOOLCDB1_SITE2

After setting up the static registration (no Grid Infrastructure in my lab):
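In my lab that means static entries in the listener.ora of each site, along these lines (the Oracle Home path is illustrative; the _DGMGRL service is the one the broker expects):

    SID_LIST_LISTENER =
      (SID_LIST =
        (SID_DESC =
          (GLOBAL_DBNAME = TOOLCDB1_SITE1_DGMGRL)
          (SID_NAME = TOOLCDB1)
          (ORACLE_HOME = /u01/app/oracle/product/19.0.0/dbhome_1)
        )
      )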

and copying the password file, the configuration can be created with:
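With Easy Connect, the commands could look like this (host names are hypothetical):

    DGMGRL> CREATE CONFIGURATION toolcdb1 AS PRIMARY DATABASE IS TOOLCDB1_SITE1 CONNECT IDENTIFIER IS 'site1-host:1521/TOOLCDB1_SITE1';
    DGMGRL> ADD DATABASE TOOLCDB1_SITE2 AS CONNECT IDENTIFIER IS 'site2-host:1521/TOOLCDB1_SITE2';
    DGMGRL> ENABLE CONFIGURATION;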

That’s it.

Now, if I want to have the configuration observed, I need to activate Fast-Start Failover:
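A sketch:

    DGMGRL> EDIT DATABASE TOOLCDB1_SITE1 SET PROPERTY FastStartFailoverTarget = 'TOOLCDB1_SITE2';
    DGMGRL> EDIT DATABASE TOOLCDB1_SITE2 SET PROPERTY FastStartFailoverTarget = 'TOOLCDB1_SITE1';
    DGMGRL> ENABLE FAST_START FAILOVER;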

With just two databases, FastStartFailoverTarget is not strictly needed, but I usually set it anyway, as other databases might be added to the configuration in the future.
After that, the broker complains that FSFO is enabled but there is no observer yet:
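SHOW CONFIGURATION reports a warning along these lines (illustrative output):

    DGMGRL> SHOW CONFIGURATION;
    ...
    Warning: ORA-16819: fast-start failover observer not started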

 

Observer for multiple configurations

This feature was introduced in 12.2, but it is still not widely used.
Before 12.2, the observer was a foreground process: the DBAs had to start it in a wrapper script executed with nohup in order to keep it alive.
Since 12.2, the observer can run as a background process, as long as there is a valid wallet for the connection to the databases.
Also, 12.2 introduced the capability of starting multiple configurations with a single dgmgrl command: “START OBSERVING”.

For more information about it, you can check the documentation here:
https://docs.oracle.com/en/database/oracle/oracle-database/19/dgbkr/using-data-guard-broker-to-manage-switchovers-failovers.html#GUID-BC513CDB-1E06-4EB3-9FE1-E1331E15E492

How to set it up with Easy Connect?

First, I need a wallet. And here comes the first compromise:
Having a single dgmgrl session to start all my configurations means that I have a single wallet for all the databases that I want to observe.
Fair enough, all the DBs (CDBs?) are managed by the same team in this case.
If I have only observers on this host, I can easily point to the wallet from my central sqlnet.ora:
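For example (the wallet directory is illustrative):

    WALLET_LOCATION =
      (SOURCE =
        (METHOD = FILE)
        (METHOD_DATA = (DIRECTORY = /u01/app/oracle/admin/observers/wallet))
      )
    SQLNET.WALLET_OVERRIDE = TRUE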

Otherwise I need to create a separate TNS_ADMIN for my observer management environment.
Then, I create the wallet:
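For example, with mkstore (same illustrative directory as above):

    $ mkstore -wrl /u01/app/oracle/admin/observers/wallet -create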

Now I need to add the connection descriptors.

Which connection descriptors do I need?
The Observer uses the DGConnectIdentifier to keep observing the databases, but it also needs a connection to both of them using the TOOLCDB1_CFG service (unless I specify something different with the broker configuration property ConfigurationWideServiceName) in order to connect to the configuration and get the DGConnectIdentifier information. Again, you can check this in the documentation or in the note Oracle 12.2 – Simplified OBSERVER Management for Multiple Fast-Start Failover Configurations (Doc ID 2285891.1).

So I need to specify three secrets for three connection descriptors:
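With mkstore, using the SYSDG user and hypothetical host names (mkstore prompts for each password):

    $ mkstore -wrl /u01/app/oracle/admin/observers/wallet -createCredential site1-host,site2-host:1521/TOOLCDB1_CFG sysdg
    $ mkstore -wrl /u01/app/oracle/admin/observers/wallet -createCredential site1-host:1521/TOOLCDB1_SITE1 sysdg
    $ mkstore -wrl /u01/app/oracle/admin/observers/wallet -createCredential site2-host:1521/TOOLCDB1_SITE2 sysdg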

The first one will be used for the initial connection. The other two to observe the Primary and Standby.
I need to be careful that the first EZConnect descriptor matches EXACTLY what I put in observer.ora (see the next step) and that the last two match my DGConnectIdentifier (unless I specify something different with ObserverConnectIdentifier); otherwise I will get errors and the observer will not observe correctly (or will not start at all).

dgmgrl then needs a file named observer.ora.
$ORACLE_BASE/admin/observers or the central TNS_ADMIN would be good locations, but what if I have observers that must be started from multiple Oracle Homes?
In that case, having an observer.ora in $ORACLE_HOME/network/admin (or $ORACLE_BASE/homes/{OHNAME}/network/admin/ if Read-Only Oracle Home is enabled) would be a better solution: in this case I would need to start one session per Oracle Home.

The content of my observer.ora must be something like:
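Following the syntax from the note mentioned above, in my case it would be something like:

    (CONFIG=(NAME=toolcdb1)(CONNECT_IDENTIFIER=site1-host,site2-host:1521/TOOLCDB1_CFG))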

This is the example for my configuration, but I can put as many (CONFIG=…) as I want in order to observe multiple configurations.
Then, if everything is configured properly, I can start all the observers with a single command:
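For example:

    $ dgmgrl
    DGMGRL> START OBSERVING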

Troubleshooting

If the observer does not work, sometimes it is not easy to understand the cause.

  • Has the SYSDG privilege been granted to the SYSDG user? Is the SYSDG account unlocked?
  • Does sqlnet.ora contain the correct wallet location?
  • Is the wallet accessible in autologin?
  • Are the entries in the wallet correct? (check with “sqlplus /@connstring as sysdg”)

Missing pieces

Here are a few features that I think would be nice additions in the future:

  • Awareness for the ORACLE_HOME to be used for each observer
  • Possibility to specify a different TNS_ADMIN per observer (different wallets)
  • Integration with Grid Infrastructure (srvctl add observer…) and support for multiple observers

Ludovico