FPP local-mode: Steps to remove/add node from a cluster if RHP fails to move gihome

I am getting more and more experience with patching clusters with the local-mode automaton. The whole process would otherwise be quite complex, but the local-mode automaton makes it really easy.

I have had nevertheless a couple of clusters where the process did not work:

#1: The very first cluster that I installed in 18c

This cluster “kind of failed” while patching the first node: the rhpctl command exited with an error:

But actually, the helper kept running and configured everything properly:

The cluster was OK on the first node, with the correct patch level. The second node, however, was failing with:

I am not sure about the cause, but let’s assume it is irrelevant for the moment.

#2: A cluster with new GI home not properly linked with RAC

This was another funny case: the first node patched successfully, but the second one failed in the middle of the upgrade process with a Java NullPointerException. We made a few bad attempts with prePatch and postPatch to fix it, but after that the second node of the cluster was in an inconsistent state: in ROLLING_UPGRADE mode and impossible to patch anymore.

Common solution: removing the node from the cluster and adding it back

In both cases we were in the following situation:

  • one node was successfully patched to 18.6
  • one node was not patched and could not be patched anymore (at least not without heavy intervention)

So, for me, the easiest solution has been removing the failing node and adding it back with the new patched version.

Steps to remove the node

Although the steps are described here: https://docs.oracle.com/en/database/oracle/oracle-database/18/cwadd/adding-and-deleting-cluster-nodes.html#GUID-8ADA9667-EC27-4EF9-9F34-C8F65A757F2A, there are a few differences that I will highlight:

Stopping the cluster:

The official procedure to remove a node asks you to deconfigure the databases and managed homes from the active cluster version. But as we manage our homes with golden images, we do not need this; we would rather keep all the entries in the OCR so that, when we add the node back, everything is already in place.
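A typical way to stop the stack on the failing node is (as root; -f forces the stop if some resources complain):

    # crsctl stop crs -f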

Once the CRS was stopped, we deinstalled the CRS home on the failing node:
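Something along these lines, using the deinstall tool of the old Grid Infrastructure home ($OLD_GRID_HOME is just a placeholder for its path; -local restricts the operation to the current node):

    $ $OLD_GRID_HOME/deinstall/deinstall -local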

This complained about the CRS being down, but it continued and asked for this script to be executed:

We got errors for this script as well, but the removal process was OK after all.

Then, from the surviving node:
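These are essentially the standard commands from the documentation; a sketch (the node name is a placeholder; crsctl runs as root, olsnodes as the grid owner):

    # crsctl delete node -n failingnode01
    $ olsnodes -s -t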

Adding the node back

From the surviving node, we ran gridSetup.sh and followed the steps to add the node.

Wait before running root.sh.

In our case, we had originally installed the cluster starting with a SW_ONLY install. This type of installation leaves some leftovers in the configuration files that prevent root.sh from configuring the cluster, so we had to modify rootconfig.sh:

Then, after running root.sh and the config tools, everything was back as it was before removing the node from the cluster.

For one of the clusters, both nodes were at the same patch level, but the cluster was still in ROLLING_PATCH mode, so we had to stop the rolling patch state explicitly:
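The command that does it, once all the nodes are at the same patch level, should be this one (as root):

    # crsctl stop rollingpatch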

Ludo

How to install and access Oracle Weblogic 12.2 in the Oracle Cloud Infrastructure

I put here the steps required to install and access Weblogic in the OCI (mostly for me in case I need to do it again 😉 ). The assumptions are:

  • you already have an account for the Oracle Cloud Infrastructure and you can access the OCI console
  • you already have a Compartment with a VCN and a subnet configured (for test purposes, a VCN created with the default values will be just fine)
  • you already have a keypair for your SSH client (id_rsa, id_rsa.pub)
  • you have an X server on your laptop (if you have Windows, I recommend MobaXTerm, but Xming or other servers are just fine)

Create the compute instance

  • Menu -> Core Infrastructure -> Compute -> Instances -> Create Instance
  • Choose a name for the Instance, all the other fields defaults are fine for test (Oracle Linux 7.6, VM.Standard2.1, etc.)
  • Paste your SSH public key
  • Optionally, under advanced/network, specify a different name for the VM
  • Click on Create to complete the creation

At some point you will have an instance in “Green” state, ready to access:

Click on it and get the public address:

Using your SSH keypair, you can now access the instance with:
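With the default Oracle Linux image the user is opc; assuming the keypair mentioned above (the IP address is a placeholder):

    $ ssh -i ~/.ssh/id_rsa opc@<instance_public_ip>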

 

Setup sshd for SSH tunneling and X11 forwarding

Edit as root the sshd_config:

Modify it so that the following lines are present with these values:
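These are the directives I mean (in /etc/ssh/sshd_config; only these two lines matter here):

    X11Forwarding yes
    AllowTcpForwarding yes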

Those values are required for X11 forwarding (needed for the graphical installation) and for SSH tunneling (needed to access the Weblogic ports without exposing them over the internet).

Then restart sshd:
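On Oracle Linux 7, as root:

    # systemctl restart sshd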

Install the packages for X11 
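A minimal set for a quick X11 test (package names for Oracle Linux 7; xorg-x11-apps provides xclock):

    # yum install -y xorg-x11-xauth xorg-x11-apps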

At this point, it should be possible to forward X11. You can test by reconnecting with:

and then:
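For example (again, the IP is a placeholder):

    $ ssh -X -i ~/.ssh/id_rsa opc@<instance_public_ip>
    $ xclock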

Create the oracle user
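The user creation itself is nothing fancy; a minimal sketch (as root, reusing the same SSH public key so that the oracle user can log in directly):

    # useradd oracle
    # mkdir -p ~oracle/.ssh
    # cp ~opc/.ssh/authorized_keys ~oracle/.ssh/
    # chown -R oracle:oracle ~oracle/.ssh
    # chmod 700 ~oracle/.ssh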

At this point, you can reconnect using oracle directly, so X11 forwarding will work for the oracle user without any additional setup:
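For example:

    $ ssh -X -i ~/.ssh/id_rsa oracle@<instance_public_ip>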

 

Follow the canonical steps to install weblogic

If you do not know how to do that, follow this good tutorial by Tim Hall (oracle-base):

Oracle WebLogic Server (WLS) 12cR2 (12.2.1) Installation on Oracle Linux 6 and 7

 

Access the Weblogic console from outside Oracle Cloud

If you configured sshd correctly, once the Oracle Weblogic instance is configured and started, you can tunnel to the WebLogic port (it should be 7001):
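A local port forward from your laptop does the job (placeholder address again):

    $ ssh -i ~/.ssh/id_rsa -L 7001:localhost:7001 oracle@<instance_public_ip>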

And be able to browse from your laptop using localhost:7001:

HTH

Ludovico

Oracle SW_ONLY install leads to relink with rac_off at every attachHome

OK, I really do not know what other title I should use for this post.

I have developed and presented a few times my personal approach to Oracle Home provisioning and patching. You can read more in this series.

With this approach:

  • I install the software (either GI or RDBMS) with the option SW_ONLY once
  • I patch it to the last version
  • I create a golden image that I evolve for the rest of the release lifecycle

When I need to install it, I just unzip the golden image and attach it to the Central Inventory.

I discovered quite a long time ago that, every time I attached the home to the inventory, the binaries were relinked with rac_off, disregarding the fact that the home I had zipped actually had RAC enabled. This is quite annoying at my work at CERN, as all our databases are RAC.

So my solution to the problem is to detect whether the server is part of a cluster and relink on the fly.

My actual code is a bit more involved, but a simplified sketch gives the idea.
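Here the presence of /etc/oracle/olr.loc is used just as a rough indicator that Grid Infrastructure is configured on the node:

    # simplified sketch: relink the freshly attached home with the RAC option if the node runs clusterware
    if [ -f /etc/oracle/olr.loc ] ; then
        cd ${ORACLE_HOME}/rdbms/lib
        make -f ins_rdbms.mk rac_on ioracle
    fi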

What causes the relink with rac_off?

I have discovered recently that the steps used by the runInstaller process to attach the Oracle Home are described in this file:

and in my case, for all my golden images, it contains:

So, it does not matter how I prepare my images: unless I change this file and put rac_on, the runInstaller keeps relinking with rac_off.

I have thought about changing the file, but then realized that I prefer to check and recompile at runtime, so I can reuse my images also for standalone servers (in case we need them).

Just to avoid surprises, it is convenient to check whether an ORACLE_HOME is linked with RAC, with a small function like this:
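A possible implementation (it just checks whether the kcsm.o object, part of the RAC option, is included in libknlopt.a):

    # sketch: report whether the given home (default: $ORACLE_HOME) is linked with the RAC option
    function is_rac_on {
        local OH=${1:-$ORACLE_HOME}
        if ar -t ${OH}/rdbms/lib/libknlopt.a | grep -q '^kcsm\.o$' ; then
            echo "RAC enabled"
        else
            echo "RAC disabled"
        fi
    }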

This is true especially for Grid Infrastructure golden images, as they have the very same behavior as RDBMS homes, with the exception that they might break out-of-place patching if RAC is not enabled: the second ASM instance will not mount because the first one will be mounted exclusively, without the RAC option.

 

HTH.

Ludovico

Oracle Grid Infrastructure 19c does not configure FPP in local-mode by default. How to add it?

I have been installing Grid Infrastructure 18c for a while, then switched to 19c when it became GA.

At the beginning I have been overly enthusiastic about the shorter installation time:

The GIMR is now optional: deciding to install it is a choice of the customer, who might like to keep it or not, depending on their practices.

Not having the GIMR by default means not having the local-mode automaton. This is also not a problem at all. The default configuration is good for most customers and works really well.

This new simplified configuration reduces some maintenance effort at the beginning, but personally I use the local-mode automaton a lot for out-of-place patching of Grid Infrastructure (read my blog posts to know why I really love the local-mode automaton), so it is something that I definitely need in my clusters.

A choice that makes sense for Oracle and most customers

Oracle's vision regarding Grid Infrastructure consists of central management of clusters, using the Oracle Domain Services Cluster. In this kind of deployment, the Management Repository, TFA, and many other services are centralized. All the clusters use those services remotely instead of having them configured locally. The local-mode automaton is no exception: the full, enterprise-grade version of Fleet Patching and Provisioning (FPP, formerly Rapid Home Provisioning or RHP) allows much more than just out-of-place patching of Grid Infrastructure, so it makes perfect sense to avoid those local configurations everywhere if you use a Domain Cluster architecture. Read more here.

Again, as I said many times in the past, doing out-of-place patching is the best approach in my opinion, but if you keep doing in-place patching, not having the local-mode automaton is not a problem at all and the default behavior in 19c is a good thing for you.

I need the local-mode automaton on 19c: what do I need to do at install time?

If you have many clusters, you are not installing them by hand with the graphical interface (hopefully!). In the responseFile for the 19c Grid Infrastructure installation, this is all you need to change compared to 18c:
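I am not pasting the whole responseFile; the key parameter is the one that enables the GIMR (the parameter name below is taken from the 19c gridsetup.rsp template, double-check it against yours):

    oracle.install.crs.configureGIMR=true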

Note also that Flex ASM is not part of the game by default in 19c.

Once you specify in the responseFile that you want the GIMR, the local-mode automaton is installed as well by default.

I installed GI 19c without GIMR and local-mode automaton. How can I add them to my new cluster?

First, recreate the empty MGMTDB CDB by hand:
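The exact dbca syntax depends on the version and on your storage; a heavily trimmed sketch, with a placeholder diskgroup, looks like this (please refer to the relevant MOS note on recreating the MGMTDB for the authoritative steps):

    $ $ORACLE_HOME/bin/dbca -silent -createDatabase \
        -createAsContainerDatabase true \
        -templateName MGMTSeed_Database.dbc \
        -sid -MGMTDB -gdbName _mgmtdb \
        -storageType ASM -diskGroupName +MGMT \
        -characterset AL32UTF8 -autoGeneratePasswords -skipUserTemplateCheck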

Then, configure the PDB for the cluster. Pay attention to the -local switch, which is not documented (or at least it does not appear in the inline help):
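Something like this should do it (mgmtca is the assistant that configures the Management Repository; treat the exact invocation as a sketch):

    $ $ORACLE_HOME/bin/mgmtca -local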

After that, you might check that you have the PDB for your cluster inside the MGMTDB; I will skip this step.

Before creating the rhpserver (local-mode automaton resource), we need the volume and filesystem to make it work (read here for more information).

The volume:
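For example, with asmcmd (diskgroup name and size are placeholders; GHCHKPT is the volume name the default resources expect):

    $ asmcmd volcreate -G MGMT -s 1536M GHCHKPT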

The filesystem:
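A sketch with srvctl (mount path and user are placeholders; check srvctl add filesystem -help for the exact options of your version):

    $ srvctl add filesystem -volume GHCHKPT -diskgroup MGMT \
        -path /opt/oracle/rhp_images/chkbase -user oracle -fstype ACFS
    $ srvctl start filesystem -volume GHCHKPT -diskgroup MGMT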

Finally, create the local-mode automaton resource:
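A sketch of the command (the storage path is a placeholder and must reside on the ACFS filesystem created above):

    $ srvctl add rhpserver -storage /opt/oracle/rhp_images -diskgroup MGMT -local
    $ srvctl start rhpserver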

Again, note that there is a -local switch that is not documented. Specifying it will create the resource as a local-mode automaton and not as a full FPP Server (or RHP Server, damn, this change of name gets me mad when I write blog posts about it 🙂 ).

HTH

Ludovico

(unsupported) DST_UPGRADE_STATE is DATAPUMP(1) but no data pump jobs are running. How to fix?

This blog post contains unsupported commands. Do not try to use them without the supervision of Oracle Support!

I have just run across an Oracle Database that had a broken configuration in DATABASE_PROPERTIES.

The database was in the process of being upgraded to 18c, but the DST upgrade step was not working because of wrong entries in the view DATABASE_PROPERTIES:
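The entries in question are the DST_* properties, which can be checked with:

    SQL> select property_name, property_value from database_properties where property_name like 'DST_%';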

The MOS note Updating the RDBMS DST version in 12c Release 1 (12.1.0.1 and up) using DBMS_DST (Doc ID 1509653.1) states that I had to check note How To Cleanup Orphaned DataPump Jobs In DBA_DATAPUMP_JOBS ? (Doc ID 336014.1) to solve the problem.

In fact, there should have been an orphan data pump job trying to import the timezone file. But in my case there were no jobs at all, and no data pump job tables.

Second, the secondary time zone version being lower than the primary one was, to me, a sign of an old upgrade gone wrong.

Trying to begin a new prepare phase was failing with:

Trying to end the old one was failing as well:

Trying to unload the secondary was failing as well:

I double-checked ALL the notes to clean up the situation and made sure that there was nothing actually running regarding a DST upgrade.

I am pretty evil at trying unsupported stuff, so I decided to check the underlying table:
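DATABASE_PROPERTIES is built on top of sys.props$, so the same rows can be seen with:

    SQL> select name, value$ from sys.props$ where name like 'DST_%';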

Fixed tables are not writable, but sys.props$ is, and it contained the same bad data:

So I did what I knew was wrong, after taking a guaranteed restore point. Do not try this at home without the supervision of Oracle Support!
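In my case it boiled down to resetting the bad rows in sys.props$; something along these lines (unsupported, and the target values depend on your situation):

    SQL> update sys.props$ set value$='NONE' where name='DST_UPGRADE_STATE';
    SQL> update sys.props$ set value$='0' where name='DST_SECONDARY_TT_VERSION';
    SQL> commit;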

Trying again:

The rest of the upgrade procedure went smoothly.

Ludovico

First draft of a Common Oracle Environment… for the Cloud Database (and not only)

I have just published on GitHub a draft of common Oracle environment scripts that make the shell environment a little bit smarter than it is by default. They use some functions and aliases that I have published during the past years.

You can start playing with:

Ideal for the Oracle Cloud Infrastructure

If you are new to the Oracle Cloud, you probably do not have environment scripts that make it easy to interact with the database.

The environment scripts that I have published work out of the box in the cloud (just make sure that you have rlwrap installed so that you can have a better CLI experience).

Actually, they work great as well on-premises, but I assume that you already have something automatic there.

Some examples

  • My famous Smart Prompt 😉 (including version, edition, exit code, etc)

  • u : gets the status of the databases

  • pmon: just displays the running pmon processes

  • db : sets the environment for a specific DB_NAME, DB_UNIQUE_NAME or SID

  • svcstat : shows the running services (and the corresponding pdb, host, etc) as I described in my previous post

  • s_ : smart alias for sqlplus: connects as sysdba/sysasm by default, or with any arguments that you pass:

  • adr_, dg_, rman_, cm_, lsn_ : aliases for common oracle binaries
  • genpasswd : generates random passwords (default length 30)

  • lsoh: lists the Oracle Homes attached to the inventory

  • setoh: sets the Oracle Home given its name in the inventory

 

You might want to install the same environment for oracle, grid (if you have role separation, which should be the case for Cloud DB Systems) and, possibly, root.

I am curious to know if it works well for your environment.

Cheers

Ludo

The new era of spam? Technical-related comments that defeat anti-spam filters

Part of the tasks of every blogger is to update the anti-spam filters (a WordPress plugin in my case, Akismet), read the comments in quarantine and either approve them, delete them or mark them as spam (so that the anti-spam plugin can enrich its spam detection algorithms).

So far, the main factor for me when deciding whether to approve a comment has been to read its content and find it appropriate (or not) for the blog post.

Until today.

Some time ago I wrote this blog post about diagnosing problems in deleting the Audit Trail in Oracle 12c:

DBMS_AUDIT_MGMT.CLEAN_AUDIT_TRAIL not working on 12c? Here’s why…

Last week I got this comment:

Technical. Talking about Oracle Audit (which is relevant to my blog post), and with technical details that only an Oracle professional might know.

So I approved it and read my blog post again to find a proper answer.

Except that my blog post was not talking about Unified Audit, relink of binaries or exports!

I then realized that the “link” of the blogger was in fact a website selling sex toys… (yeah, writing this here will bring even more spam… I take the risk 🙂 )

So where does it come from? Actually, the comment had been copied from another blog:

https://petesdbablog.wordpress.com/2013/07/20/12c-new-feature-unified-auditing/

A new type of spam (maybe it has existed for a while? I am not a spam expert).

The spammer (bot?) copies technical comments from blog posts with similar keywords and just changes the name and the link of the author.

Beware, bloggers.

Ludo

Oracle Clusterware Services Status at a glance, fast!

If you use Oracle Clusterware or you deploy your databases to the Oracle Cloud, you probably have some application services defined with srvctl for your database.

If you have many databases, services and nodes, it might be annoying, when doing maintenance or service relocation, to get a quick overview of how services are distributed across the nodes and what their status is.

With srvctl (the official tool for that), it is a per-database operation:
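For example:

    $ srvctl status service -d <db_unique_name>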

If you have many databases, you have to run it database by database.

It is also slow! For example, this database has 20 services. Getting the status takes 27 seconds:

Instead of operating row by row (getting the status of each service), why not rely on the cluster resources with crsctl and get the big picture at once?

crsctl stat res -f returns a list of ATTRIBUTE_NAME=value pairs for each service, possibly more than one set if the service is not singleton/single instance but uniform/multi instance.

Parsing them with some awk code can provide nice results!

STATE, INTERNAL_STATE and TARGET are useful in this case and might be used to display colours as well.

  • Green: Status ONLINE, Target ONLINE, STABLE
  • Black: Status OFFLINE, Target OFFLINE, STABLE
  • Red: Status ONLINE, Target OFFLINE, STABLE
  • Yellow: all other cases

Here’s the idea, in a stripped-down form:
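This sketch (not the full script, which also adds the colours described above) prints NAME, STATE, TARGET and INTERNAL_STATE for every ora.service.type resource:

    # sketch: one line per service resource, built from the full crsctl attribute dump
    crsctl stat res -f -w "TYPE = ora.service.type" | awk -F'=' '
        function dump() {
            printf "%-55s %-28s %-10s %s\n", attr["NAME"], attr["STATE"], attr["TARGET"], attr["INTERNAL_STATE"]
            delete attr
        }
        /^[A-Z_]+=/              { attr[$1] = $2 }
        /^$/ && ("NAME" in attr) { dump() }
        END { if ("NAME" in attr) dump() }
    '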

Here’s what you can expect, for 92 services distributed over 4 nodes and a dozen databases (the output is snipped and the names are masked):

I’d be curious to know if it works well for your environment; please comment here. 🙂

Thanks

Ludo

Oracle Grid Infrastructure 18c patching part 3: Executing out-of-place patching with the local-mode automaton

I wish I had had more time to blog in recent weeks. Sorry for the delay in this blog series 🙂

If you have not read the two previous blog posts, please do it now. I suppose here that you have the Independent Local-Mode Automaton already enabled.

What does the Independent Local-mode Automaton do?

The automaton automates the process of moving the active Grid Infrastructure Oracle Home from the current one to a new one. The new one can be either at a higher patch level or at a lower one. Of course, you will probably want to patch your Grid Infrastructure, thus going to a higher patch level.

Preparing the new Grid Infrastructure Oracle Home

Starting from 12.2, the GI home is just a zip that is extracted directly into the new Oracle Home path. In this blog post I suppose that you want to patch your Grid Infrastructure from an existing 18.3 to a brand new 18.4 (18.5 will be released very soon).

So, if your current OH is /u01/app/grid/crs1830, you might want to prepare the new home in /u01/app/grid/crs1840 by unzipping the software and then patching using the steps described here.

If you already have a golden image with the correct version, you can unzip it directly.

Beware of four important things: 

  1. You have to register the new Oracle home in the Central Inventory using the SW_ONLY install, as described here.
  2. You must do it for all the nodes in the cluster prior to upgrading
  3. The response file must contain the same groups (DBA, OPER, etc) as the current active Home, otherwise errors will appear.
  4. You must relink by hand your Oracle binaries with the RAC option:
    $ cd /u01/app/grid/crs1840/rdbms/lib
    $ make -f ins_rdbms.mk rac_on ioracle

In fact, after every attach to the central inventory the binaries are relinked without the RAC option, so it is important to activate RAC again to avoid bad problems when upgrading ASM with the new automaton.

Executing the move gihome

If everything is correct, you should now have the current and new Oracle Homes, correctly registered in the Central Inventory, with the RAC option activated.

You can now do a first eval to check if everything looks good:
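The eval uses the same rhpctl syntax as the actual move; something like this (home paths as in the example above, node name is a placeholder; double-check the switches with rhpctl move gihome -help):

    $ rhpctl move gihome -sourcehome /u01/app/grid/crs1830 \
        -destinationhome /u01/app/grid/crs1840 -node node1 -eval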

My personal suggestion, at least for your first experiences with the automaton, is to move the Oracle Home on one node at a time. This way, YOU control the relocation of the services and resources before doing the actual move operation.

Here is the execution for the first node:
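The command is the same as the eval, just without the -eval switch:

    $ rhpctl move gihome -sourcehome /u01/app/grid/crs1830 \
        -destinationhome /u01/app/grid/crs1840 -node node1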

From this output you can see at line 15 that the cluster status is NORMAL; then the cluster is stopped on node 1 (lines 16 to 100), the active version is modified in the oracle-ohasd.service file (line 101), and the stack is started back with the new version (lines 102 to 171). The cluster status is now ROLLING PATCH (line 172). TFA and the node list are updated.

Before continuing with the other node(s), make sure that all the resources are up & running:
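For example:

    $ crsctl stat res -t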

You might also want to relocate your resources back to node 1 manually prior to continuing on node 2.

After that, node 2 can follow the very same procedure:

As you can see, there are two differences here: the second node was in this case the last one, so the cluster status goes back to NORMAL, and the GIMR is patched with datapatch (lines 176-227).

At this point, the cluster has been patched. After some testing, you can safely remove the inactive version of Grid Infrastructure using the deinstall binary ($OLD_OH/deinstall/deinstall).

Quite easy, huh?

If you combine the Independent Local-mode Automaton with a home-developed solution for the creation and the provisioning of Grid Infrastructure Golden Images, you can easily achieve automated Grid Infrastructure patching of a big, multi-cluster environment.

Of course, Fleet Patching and Provisioning remains the Rolls-Royce: if you can afford it, GI patching and much more is completely automated and developed by Oracle, so you will have no headaches when new versions are released. But the local-mode automaton might be enough for your needs.

— 

Ludo

Oracle Grid Infrastructure 18c patching part 2: Independent Local-mode Automaton architecture and activation

The first important step before starting to use the new Independent Local-mode Automaton is understanding its components inside a cluster.

Resources

Here’s the list of services that you will find when you install Grid Infrastructure 18c:

As you can see, there are 4 components that are OFFLINE by default:

Three local resources (that are present on each node):

  • ora.MGMT.GHCHKPT.advm
  • ora.mgmt.ghchkpt.acfs
  • ora.helper

One cluster resource (active on only one server at a time, it can relocate):

  • ora.rhpserver

If you have ever worked with 12c Rapid Home Provisioning, those names should sound familiar.

The GHCHKPT filesystem (and its related volume) is used to store some data regarding the ongoing operations across the cluster during the GI home move.

The ora.helper is the process that actually does the operations. It is a local resource because each node needs it to execute some actions at some point.

The rhpserver is the server process that coordinates the operations and delegates them to the helpers.

All those services compose the independent local-mode automaton, which is the default deployment. The full RHP framework (RHP Server and RHP Client) can be configured instead with some additional work.

Important note: Just a few weeks ago Oracle changed the name of Rapid Home Provisioning (RHP) to Fleet Patching and Provisioning (FPP). The name is definitely more appealing now, but it generates again some confusion about product names and acronyms, so beware that in this series sometimes I refer to RHP, sometimes to FPP, but actually it is the same thing.

Tomcat?

You might have noticed that Tomcat is now deployed in the GI home, as there are patches specific to it (here I paste the 18.4 version):

 

Indeed Tomcat is registered in the inventory and patched just like any other product inside the OH:

 

Out of the box, Tomcat is used for the Quality of Services Management (ora.qosmserver resource):

But it is used for the Independent Local Mode Automaton as well, when it is started.

Enabling and starting the independent local-mode automaton

The resources are started using the following commands (as root, the order is quite important):
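A sketch of the sequence (resource names as listed above; if the resources are disabled, they may need a crsctl modify res ... -attr ENABLED=1 -unsupported first):

    # crsctl start res ora.MGMT.GHCHKPT.advm
    # crsctl start res ora.mgmt.ghchkpt.acfs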

Before continuing with the rhpserver resource, you might want to check if the filesystem is mounted:
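For example:

    $ srvctl status filesystem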

Now the rhpserver should start without problems, as oracle:
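For example:

    $ srvctl start rhpserver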

Please note that if you do not activate the filesystem first, the rhpserver will fail to start.

As you can see, now both rhpserver and the helper are online:

Now all is set to start using it!

We’ll see how to use it in the next posts.

Ludo