My own Dbvisit Replicate integration with Grid Infrastructure

I am helping a customer with a PoC of Dbvisit Replicate as a logical replication tool. I will not discuss (at least, not in this post) the capabilities of the tool itself, its configuration, or the caveats to beware of when doing logical replication. Instead, I will concentrate on how we will likely integrate it into the current environment.

My role in this PoC is to make sure that the tool will be easy to operate from an operational point of view; database operations here rely on Oracle Grid Infrastructure and cold failover clusters.

Note: there are official Dbvisit online resources about how to configure Dbvisit Replicate in a cluster. I aim to complement that information, not copy it.

Quick overview

If you already know Dbvisit Replicate, skip this paragraph.

There are three main components in Dbvisit Replicate: the FETCHER, MINE and APPLY processes. The FETCHER gets the redo stream from the source and sends it to the MINE process. The MINE process parses the redo stream and converts it into proprietary transaction log files (named plogs). The APPLY process gets the plog files and applies the transactions to the destination database.

From an architectural point of view, MINE and APPLY do not need to run close to the databases that are part of the configuration. The FETCHER process, by contrast, needs local access to the source database online redo logs (and archived logs).

Because the MINE process is the most resource-intensive one, it is not convenient to run it where the databases reside, as it might consume precious CPU resources on servers licensed for Oracle Database. So, first decision in this PoC: the FETCHER processes will run on the cluster, while MINE and APPLY will run on a dedicated virtual machine.


Clustering considerations

  • the FETCHER does NOT need to run on the server of the source database: having access to the online logs through the ASM instance is enough
  • to avoid SPoF, the fetcher should be a cluster resource that can relocate without problems
  • to simplify the configuration, the FETCHER configuration and the Dbvisit binaries should be on a shared filesystem (the FETCHER does not persist any data, just the logs)
  • the destination database might be literally anywhere: the APPLY connects via SQL*Net, so a correct name resolution and routing to the destination database are enough

So the implementation steps are:

  1. create a shared filesystem
  2. install dbvisit in the shared filesystem
  3. create the Dbvisit Replicate configuration on the dedicated VM
  4. copy the configuration files on the cluster
  5. prepare an action script
  6. configure the resource
  7. test!

Convention over configuration: the importance of a strong naming convention

Before starting the implementation, I decided to put on paper all the caveats related to FETCHER resource relocation:

  • Where will the configuration files reside? Dbvisit has an important variable: the Configuration Name. All the operations are done by passing a configuration file named /{PATH}/{CONFIG_NAME}/{CONFIG_NAME}-{PROCESS_TYPE}.ddc to the dbvrep binary. So, I decided to put ALL the configuration directories under the same path: given the Configuration Name, I will always be able to derive the configuration file path.
  • How will the configuration files relocate from one node to the other? Easy here: they won’t. I will use an ACFS filesystem.
  • How can I link the cluster resource to its configuration name? Easy again: I call my resources dbvrep.CONFIGNAME.PROCESS_TYPE, e.g. dbvrep.FROM_A_TO_B.fetcher.
  • How will I manage the need to use a new version of Dbvisit in the future? Old and new versions must coexist, so instead of using external configuration files, I will just use a custom resource attribute named DBVREP_HOME inside my resource type definition (see the sketch after this list).
  • What port number should I use? Of course, fetchers started on different servers must not conflict with each other. This can either be planned or made dynamic; I will opt for the first option. But instead of putting the port number inside the Dbvisit configuration, I will use another custom resource attribute: DBVREP_PORT.
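
For example, a resource type carrying the two custom attributes could be defined more or less like this. It is only a sketch: the type name, the default values and the exact syntax for declaring several attributes in one crsctl call are assumptions to verify against the crsctl documentation.

# Sketch: custom resource type with the two attributes discussed above.
# Defaults are placeholders; check the "crsctl add type" documentation for the
# multi-attribute syntax (a -file definition file is also possible).
crsctl add type dbvrep.fetcher.type -basetype cluster_resource \
  -attr "ATTRIBUTE=DBVREP_HOME,TYPE=string,DEFAULT_VALUE=/dbvisit/replicate" \
  -attr "ATTRIBUTE=DBVREP_PORT,TYPE=int,DEFAULT_VALUE=7901"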

Considerations on the FETCHER listen address

This requires a dedicated paragraph. The Dbvisit documentation suggests creating a VIP, binding to the VIP address and creating a dependency between the FETCHER resource and the VIP. This is where my configuration differs.

Having a separate VIP per FETCHER resource might potentially lead to dozens of VIPs in the cluster: everything depends on the success of the PoC and on how many internal clients decide to ask for such an implementation. Many VIPs == many interactions with the network admins for address reservations, DNS configuration, etc. Long story short, it might slow down the creation and maintenance of new configurations.

Instead, each FETCHER will listen on the local server address, and the action script (a sketch of it follows the list) will take care of:

  • getting the current host name
  • getting the current ASM instance
  • changing the settings of the specific Dbvisit Replicate configuration (ASM instance and FETCHER listen address)
  • starting the FETCHER
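
A minimal sketch of such an action script follows. The base path, the way the configuration is updated and, above all, the dbvrep start syntax are assumptions of mine (the exact commands must be taken from the Dbvisit Replicate documentation); the custom attributes follow the convention described above.

#!/bin/bash
# Sketch of a Grid Infrastructure action script for the FETCHER resource.
# The script agent exports the resource name as _CRS_NAME; the custom
# attributes are read back with crsctl. Paths are placeholders.

RES_NAME=${_CRS_NAME:?expected from the GI script agent}
CONFIG_NAME=$(echo "$RES_NAME" | cut -d. -f2)        # dbvrep.FROM_A_TO_B.fetcher -> FROM_A_TO_B

get_attr() { crsctl stat res "$RES_NAME" -p | awk -F= -v a="$1" '$1==a {print $2}'; }
DBVREP_HOME=$(get_attr DBVREP_HOME)
DBVREP_PORT=$(get_attr DBVREP_PORT)

DDC_FILE=/dbvisit/conf/${CONFIG_NAME}/${CONFIG_NAME}-FETCHER.ddc   # placeholder base path
LOCAL_HOST=$(hostname -s)
# local ASM instance name (e.g. +ASM2), derived from its pmon process
ASM_SID=$(ps -eo cmd | awk -F'pmon_' '/^asm_pmon_\+ASM/ {print $2}')

case "$1" in
  start)
    # Here: update the FETCHER listen address/port and the ASM instance in the
    # configuration of $CONFIG_NAME, using $LOCAL_HOST, $DBVREP_PORT and $ASM_SID
    # (exact dbvrep commands omitted -- see the Dbvisit Replicate documentation).
    # NOTE: the start syntax below is an assumption, not the official CLI.
    "$DBVREP_HOME"/dbvrep --ddcfile "$DDC_FILE" start FETCHER
    ;;
  stop|clean)
    pkill -f "dbvrep.*${CONFIG_NAME}-FETCHER.ddc"
    ;;
  check)
    # online if a FETCHER process for this configuration is running
    pgrep -f "dbvrep.*${CONFIG_NAME}-FETCHER.ddc" >/dev/null
    exit $?
    ;;
esac
exit 0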

Implementation

Now that all the caveats and steps are clear, I can show how I implemented it:

Create a shared filesystem
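
Roughly (disk group, size, volume device and mount point below are placeholders), an ACFS filesystem can be created and registered with Grid Infrastructure like this:

# Sketch: create an ADVM volume, format it as ACFS and register it with
# Grid Infrastructure so that it is mounted on all nodes.
asmcmd volcreate -G DATA -s 10G dbvisit          # placeholder disk group and size
asmcmd volinfo -G DATA dbvisit                   # note the device, e.g. /dev/asm/dbvisit-123
mkfs -t acfs /dev/asm/dbvisit-123                # as root
mkdir -p /dbvisit
srvctl add filesystem -device /dev/asm/dbvisit-123 -path /dbvisit -user oracle
srvctl start filesystem -device /dev/asm/dbvisit-123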

Install dbvisit in the shared filesystem

Create the Dbvisit Replicate configuration on the dedicated VM

Copy the configuration files from the Dbvisit VM to the cluster

Prepare an action script

Configure the resource
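
Following the naming convention above, the resource registration could then look like this (the script path, home path and port are placeholders):

# Sketch: register the FETCHER resource using the custom type and action script.
crsctl add resource dbvrep.FROM_A_TO_B.fetcher -type dbvrep.fetcher.type \
  -attr "ACTION_SCRIPT=/dbvisit/scripts/fetcher_action.sh,DBVREP_HOME=/dbvisit/replicate,DBVREP_PORT=7901,CHECK_INTERVAL=30,RESTART_ATTEMPTS=2"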

Test!
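
A basic test is then just a start, a status check and a relocation (resource name as per the convention above):

crsctl start resource dbvrep.FROM_A_TO_B.fetcher
crsctl status resource dbvrep.FROM_A_TO_B.fetcher
# relocate to another node to verify that the action script re-resolves
# the local host name and ASM instance correctly
crsctl relocate resource dbvrep.FROM_A_TO_B.fetcher -f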

 

The relocation also worked as expected. When the settings are modified this way:

the MINE process gets the change dynamically, so there is no need to restart it.

Last consideration

Adding a hard dependency between the DB and the FETCHER will require stopping the DB with the force option, or always stopping the fetcher before the database. Also, starting the DB will pull up the FETCHER (pullup:always), and vice versa. We will evaluate further whether to use this dependency or manage it differently (e.g. through the action script).

A hard dependency declared without the global keyword will always start the fetcher on the server where the database runs. This is not required, but it might be nice to have the fetcher on the same node. Again, a consideration that we will discuss further.
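
For reference, if we go that way, such a dependency could be declared more or less like this (ora.mydb.db is a placeholder for the real database resource):

# Sketch: hard dependency + pullup between the fetcher and its source DB.
# Without the global: modifier this also forces the fetcher to run on the
# same server as the database.
crsctl modify resource dbvrep.FROM_A_TO_B.fetcher -attr \
  "START_DEPENDENCIES='hard(ora.mydb.db) pullup:always(ora.mydb.db)',STOP_DEPENDENCIES='hard(ora.mydb.db)'"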

HTH

Ludovico

Get the Most out of Oracle Data Guard – The material

Here we go: as usual, the feedback that I get after my talks (specifically, after the POUG High Five conference) is the question of whether I will share my demo scripts and material.

Sadly, the demos I am doing for my presentation “Get the most out of Oracle Data Guard” are quite tied to an environment built specifically for them. So, do not expect scripts that are easy to use as-is, but rather some ideas to take beyond the demos themselves.

I hope they will help you get the whole picture.

Of course, if you need to implement a cloning strategy based on Data Guard or any other solution that I describe in this post, feel free to contact me: I will be glad to help you implement it in your environment.

Slides

Demo 1

Video:

Scripts:

 

Demo 2

Video:


Scripts:

 

Demo 3

Video:

Scripts:

Preparation:

snap_acfs.pl

 

snap_database.pl

clone_from_snap.pl

Cheers

Ludovico

12.1.0.2 Bundle Patch 170718 breaks Data Guard and Duplicate from active database

Recently my customer patched its 12.1.0.2 databases with Bundle Patch 170718 on the new servers (half of the customer’s environment). The old servers are still on the 161018 Bundle Patch.

We realized that we could no longer move the databases from the old servers to the new ones, because the duplicate from active database was failing with this error:

The last lines show the same error that Franck blogged about some months ago.

Oracle 12.2 introduced an incompatibility with previous releases in remote file transfers via SQL*Net. At least, this is how it looks. According to Oracle, it is due to a bugfix present in Oracle 12.2.

Now, the bundle patch that we installed, BP 170718, contains the same bugfix (the patch for bug 18633374).

So, the incompatibility happens now between databases of the same “Major Release” (12.1.0.2).

There are two possible workarounds:

  1. Apply the same patch level on both sides (BP 170718 in my case)
  2. Apply just patch 18633374 on top of your current PSU/DBBP (a merge might be necessary; see the sketch below)
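
For the second option, the conflict check and the apply are standard OPatch operations; here is a sketch (the patch location is a placeholder):

# Sketch: check for conflicts between the one-off and the installed PSU/DBBP,
# then apply it (a merge patch is needed if a conflict is reported).
cd /path/to/18633374                         # unzipped one-off patch directory (placeholder)
$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./
$ORACLE_HOME/OPatch/opatch apply
$ORACLE_HOME/OPatch/opatch lspatches         # verify the patch is listed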

We used the second approach, and now we can set up Data Guard again to move our databases without downtime.
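
For the record, a minimal sketch of the kind of active duplicate that works again once both sides are at a compatible patch level (TNS aliases and credentials are placeholders):

# Sketch: recreate the standby with a duplicate from active database.
rman <<'EOF'
connect target sys/***@PRIMARY_OLD
connect auxiliary sys/***@STANDBY_NEW
duplicate target database for standby from active database nofilenamecheck;
EOF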

HTH

Ludovico

 

 

Another problem with “KSV master wait” and “ASM file metadata operation”

Today my customer tried to do a duplicate on a cluster. When preparing the auxiliary instance, she noticed that the startup nomount was hanging forever: nothing in the alert log, nothing in the trace files.

Because the database and the spfile were stored inside ASM, I got quite suspicious…

The ASM trace files had the following entries:

The ASM instance had the following sessions waiting:

OMS?

Around 12:38:56, another colleague in the office added a disk to one of the disk groups, through Enterprise Manager 12c!

But there were no rebalance operations:

It’s not the first time that I have hit this kind of problem. Sadly, it sometimes requires a full restart of the cluster or of ASM (because of different bugs).

This time, however, I tried to kill only the foreground sessions waiting on “ASM file metadata operation”, starting with the one coming from the OMS.
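
A sketch of the kind of check and kill I did on the ASM instance (SID, serial# and instance number below are placeholders):

# Sketch: list the sessions waiting on the two events, then kill the
# offending foreground session on the ASM instance.
sqlplus -s "/ as sysasm" <<'EOF'
set lines 200
select inst_id, sid, serial#, program, event
from   gv$session
where  event in ('ASM file metadata operation', 'KSV master wait');

-- sid, serial# and instance id below are placeholders
alter system kill session '123,45,@1' immediate;
EOF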

Surprisingly, after killing that session, everything was fine again:

I never add disks via the OMS (I’m a sqlplus guy ;-) ), so I wonder what went wrong with it 🙂

Ludovico

RMAN Catalog Housekeeping: how to purge the old incarnations

First, let me apologize because every post in my blog starts with a disclaimer… but sometimes it is really necessary. 😉

Disclaimer: this blog post contains PL/SQL code that deletes incarnations from your RMAN recovery catalog. Please DON’T use it unless you deeply understand what you are doing, as it can compromise your backup and recovery strategy.

Small introduction

You may have a central RMAN catalog that stores all the backup metadata for your databases. If that is the case, you will have a database entry for each of your databases and a new incarnation entry for each duplicate, incomplete recovery or flashback (or whatever).

You should also have a delete strategy that removes the obsolete backups from your DISK or SBT_TAPE media. If you have old incarnations, however, after some time you will notice that their information never goes away from the catalog, and sooner or later you will end up having to do some housekeeping. But there is nothing more tedious than checking and deleting the incarnations one by one, especially if you have numbers as big as in this catalog:

Where db, dbinc, bdf and brl contain, respectively, the registered databases, the incarnations, the datafile backups and the archived log backups.

Different incarnations?

Consider the following query:

You can run it safely: it returns the list of incarnations hierarchically connected to their parent, by database name, key and level.
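
As a rough equivalent, here is a sketch written against the documented RC_DATABASE_INCARNATION view instead of the base tables (the catalog credentials are placeholders):

# Sketch: incarnations per database, hierarchically linked to their parents.
sqlplus -s rcat_owner/***@RCAT <<'EOF'
select name, db_key, dbid, dbinc_key, parent_dbinc_key,
       current_incarnation, level
from   rc_database_incarnation
start with parent_dbinc_key is null
connect by prior dbinc_key = parent_dbinc_key
order siblings by name;
EOF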

Then you will see several types of behavior:

  • Normal databases (created once, never restored or flashed back) will have just one or two incarnations (it depends on how they are created):

These are usually the ones that you want to keep in your catalog, unless the database no longer exists: in that case, perhaps you forgot to remove it from the catalog when you dropped the database?

  • Flashed back databases (flashed back multiple times) will have as many incarnations as the number of flashbacks, but all connected with the incarnation prior to the flashback:

Here, although there are several incarnations, they all belong to the same database (same DB_KEY and DBID), so you must keep it inside the recovery catalog as well.

  • Non-production databases that are frequently refreshed from the production database (via duplicate) will have several incarnations with different DBIDs and DB_KEYs:

This is usually the most frequent case: here you want to delete the old incarnations, but only as long as no backups attached to them are still within the recovery window.

  • You may also have orphaned incarnations:

In this case, again, it depends on whether the DBID and DB_KEY are the same as those of the current incarnation or not.

What do you need to delete?

Basically:

  • Incarnations of databases that no longer exist
  • Incarnations of existing databases that have a more recent current incarnation, but only if no backups attached to the old incarnation are still within the retention window

How to do it?

In order to be 100% sure that you can delete an incarnation, you have to verify that there are no recent backups (for instance, no backups more recent than the current recovery window for that database). If the database does not have a specific recovery window but rather the default “CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default”, it is a bit more problematic… in this case let’s assume that we consider “old” an incarnation that has not taken a backup for one year (365 days), ok?

Getting the last backup of each database

Sadly, there is no single table where you can check that: you have to collect the information from several tables. I think bdf, al, cdf and bs would suffice in most cases.

When you delete an incarnation you specify a db_key: you have to get the last backup for each db_key, with queries like this:
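
For instance, as a sketch against the documented RC_BACKUP_SET view rather than the base tables (catalog credentials are placeholders):

# Sketch: last backup completion time for each db_key.
sqlplus -s rcat_owner/***@RCAT <<'EOF'
select db_key, max(completion_time) last_backup
from   rc_backup_set
group by db_key
order by db_key;
EOF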

Putting together all the tables:

Getting the recovery window

The configuration information for each database is stored inside the conf table, but the retention information is stored in a VARCHAR2, either ‘TO RECOVERY WINDOW OF % DAYS’ or ‘TO REDUNDANCY %’.

You need to convert it to a number when the retention policy is a recovery window; otherwise, you default it to 365 days when redundancy is used. You can add a column and a join to the query:
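
A sketch of that conversion, here against the documented RC_RMAN_CONFIGURATION view; the 365-day fallback is the arbitrary assumption discussed above, and the catalog credentials are placeholders:

# Sketch: recovery window in days per database; REDUNDANCY (or no explicit
# policy) falls back to the 365-day assumption.
sqlplus -s rcat_owner/***@RCAT <<'EOF'
select d.db_key, d.name,
       nvl(to_number(regexp_substr(c.value,
             'TO RECOVERY WINDOW OF (\d+) DAYS', 1, 1, null, 1)), 365) recovery_window_days
from   rc_database d
       left join rc_rman_configuration c
              on c.db_key = d.db_key
             and c.name = 'RETENTION POLICY'
order by d.db_key;
EOF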

and finally, either display whether the incarnation is no longer used, or filter by usage:

Delete the incarnations!

You can delete the incarnations with this procedure:

This procedure will raise an exception (-20001, ‘Database not found’) when a database does not exist anymore (either already deleted by this procedure or by another session), so you need to handle it.
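
The core of it is a call to the catalog owner’s dbms_rcvcat.unregisterdatabase procedure; here is only a minimal sketch of wrapping that call and handling the exception. The db_key/dbid values are placeholders taken from the queries above, and the exact procedure signature should be checked in your catalog version.

# Sketch: unregister one old registration from the catalog, ignoring the
# -20001 "Database not found" error if it is already gone.
sqlplus -s rcat_owner/***@RCAT <<'EOF'
set serveroutput on
declare
  db_not_found exception;
  pragma exception_init(db_not_found, -20001);
begin
  -- db_key and dbid of the old registration (placeholders)
  dbms_rcvcat.unregisterdatabase(12345, 987654321);
exception
  when db_not_found then
    dbms_output.put_line('Already unregistered, skipping.');
end;
/
EOF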

Putting it all together:

I have used this procedure today for the first time and it worked like a charm.

However, if you have any adjustments or suggestions, don’t hesitate to leave a comment 🙂

HTH

DBMS_QOPATCH, datapatch, rollback, apply force

I am working for a customer on quite a big implementation of cold failover clusters with Oracle Grid Infrastructure on Linux. I hope to have some material about it to publish soon! In this post, however, I will talk about patching the database in a cold-failover environment.

DISCLAIMER: I make massive use of the scripts provided in this great blog post by Simon Pane:

https://www.pythian.com/blog/oracle-database-12c-patching-dbms_qopatch-opatch_xml_inv-and-datapatch/

Thank you Simon for sharing this 🙂

Intro

We are not yet in the process of doing out-of-place patching; at the moment the customer prefers to do in-place patching:

  • evacuate a node by relocating all the databases to other nodes
  • patch the node binaries
  • move the databases back and patch them with datapatch
  • do the same for the remaining nodes

I beg to differ with this method, being a fan of having many patched golden copies distributed on all servers, and of patching the databases by just changing the ORACLE_HOME and running datapatch (like Rapid Home Provisioning does). But this is the situation today, and we have to live with it.
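
Just to illustrate the alternative (database name, instance name and home paths are placeholders), out-of-place patching would essentially be a change of ORACLE_HOME followed by datapatch:

# Sketch: switch the database to an already patched golden home, then
# patch the data dictionary with datapatch.
srvctl stop database -db MYDB
srvctl modify database -db MYDB -oraclehome /u01/app/oracle/product/12.1.0.2_BP170718
# adapt /etc/oratab and copy the password file / SQL*Net config to the new home as needed
srvctl start database -db MYDB
export ORACLE_HOME=/u01/app/oracle/product/12.1.0.2_BP170718
export ORACLE_SID=MYDB      # placeholder instance name
$ORACLE_HOME/OPatch/datapatch -verbose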

Initial situation

  • Server 1, 2 and 3: one-off 20139391 applied
  • New database created

When the DBCA creates a new database in 12.1.0.2, it does not run datapatch by default; thus, the database does not have any patches installed.

However, this specific one-off patch does not modify anything in the database (sql_patch=false):

and datapatch runs without touching the database:

Next step: I evacuate server 2 and patch it, then I relocate my database onto it.


Now the database is not at the same level as the binaries and needs to be patched:

The column CONSTITUENT is important here because it tells us what the parent patch_id is. This is the column that we have to check when we want to know whether the patch has been applied to the database.

Now the patch is visible inside the dba_registry_sqlpatch:

Notice that the child patches are not listed in this view.
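
The check itself is just a query on that view; a minimal sketch, connecting as sysdba:

# Sketch: check which patches have been applied/rolled back inside the DB.
sqlplus -s "/ as sysdba" <<'EOF'
set lines 200
col description format a60
select patch_id, patch_uid, action, status, action_time, description
from   dba_registry_sqlpatch
order by action_time;
EOF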

Rolling back

Now, one node is patched, but the others are not. What happens if I relocate the patched database to a non-patched node?


The patch is applied inside the database but not in the binaries!

If I run datapatch again, the patch is rolled back:

The patch has been rolled back according to datapatch, and the action is shown in dba_registry_sqlpatch:

But if I look at the logfile, the patch had some errors:

Indeed, the patch still seems to be there:

If I try to run it again, it either does nothing or fails, saying that the patch is not there:

What does it say on the patched node?

Whaaat? There, datapatch says that the patch IS in the registry and that there is nothing to do. Let’s try to force the apply again:
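
The attempt looks more or less like this; whether -apply and -force behave exactly this way depends on the OPatch/datapatch version, so the options should be verified with datapatch -help first.

# Sketch: force the re-apply of the one-off SQL changes in the database
# (verify the exact options with "datapatch -help" for your OPatch version).
cd $ORACLE_HOME/OPatch
./datapatch -verbose -apply 20139391 -force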