Steps to remove/add node from a cluster if RHP fails to move gihome

I am getting more and more experience patching clusters with the local-mode automaton. The whole process would be very complex, but the local-mode automaton makes it really easy.

I have had nevertheless a couple of clusters where the process did not work:

#1: The very first cluster that I installed in 18c

This cluster has “kind of failed” patching the first node. Actually, the rhpctl command exited with an error:

But actually, the helper lept running and configured everything properly:

The cluster was OK on the first node, with the correct patch level. The second node, however, was filing with:

I am not sure about the cause, but let’s assume it is irrelevant for the moment.

#2: A cluster with new GI home not properly linked with RAC

This was another funny case, where the first node patched successfully, but the second one failed upgrading in the middle of the process with a java NullPointer exception. We did a few bad tries of prePatch and postPatch to solve, but after that the second node of the cluster was in an inconsistent state: in ROLLING_UPGRADE mode and not possible to patch anymore.

Common solution: removing the node from the cluster and adding it back

In both cases we were in the following situation:

  • one node was successfully patched to 18.6
  • one node was not patched and was not possible to patch it anymore (at least without heavy interventions)

So, for me, the easiest solution has been removing the failing node and adding it back with the new patched version.

Steps to remove the node

Although the steps are described here: https://docs.oracle.com/en/database/oracle/oracle-database/18/cwadd/adding-and-deleting-cluster-nodes.html#GUID-8ADA9667-EC27-4EF9-9F34-C8F65A757F2A, there are a few differences that I will highlight:

Stop of the cluster:

The actual procedure to remove a node asks to deconfigure the databases and managed homes from the active cluster version. But as we manage our homes with golden images, we do not need this; we rather want to keep all the entries in the OCR so that when we add it back, everything is in place.

Once stopped the CRS, we have deinstalled the CRS home on the failing node:

ThisĀ  complained about the CRS that was down, but continued and ask for this script to be executed:

We’ve got errors also for this script, but the remove process was OK afterall.

Then, from the surviving node:

Adding the node back

From the surviving node, we ran gridSetup.sh and followed the steps to ad the node.

Wait before running root.sh.

In our case, we have originally installed the cluster starting with a SW_ONLY install. This type of installation keeps some leftovers in the configuration files that prevent the root.sh from configuring the cluster…we have had to modify rootconfig.sh:

then, after running root.sh and the config tools, everything was back as before removing the node form the cluster.

For one of the clusters , both nodes were at the same patch level, but the cluster was still in ROLLING_PATCH mode. So we have had to do a

Ludo

The following two tabs change content below.

Ludovico

Oracle ACE Director and Computing Engineer at CERN
Ludovico is an Oracle ACE Director, frequent speaker and community contributor, working as Computing Engineer at CERN, the European Organization for Nuclear Research, in Switzerland.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.