DBA survival BLOG

DBA stuff and Oracle Data Guard

rhpctl addnode gihome: specify HUB or LEAF when adding new nodes to a Flex Cluster

Posted on July 29, 2021 by Ludovico

I have a customer trying to add a new node to a cluster using Fleet Patching and Provisioning.

The error in the command output is not very friendly:

[grid@fpps ~]$ rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 \
  -newnodes fppc3:fppc3-vip  -cred fppc-cred
fpps: Audit ID: 269
PRCT-1003 : failed to run "rhphelper" on node "fppc2"
PRCT-1014 : Internal error: RHPHELP_preNodeAddVal-05null

[grid@fpps ~]$ rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 \

-newnodes fppc3:fppc3-vip -cred fppc-cred

fpps: Audit ID: 269

PRCT-1003 : failed to run "rhphelper" on node "fppc2"

PRCT-1014 : Internal error: RHPHELP_preNodeAddVal-05null

The “RHPHELP_preNodeAddVal” might already give an idea of the cause: something related to the “cluvfy stage -pre nodeadd” evaluation that we normally do when adding a node by hand. FPP does not really run cluvfy, but it calls the same primitives cluvfy is based on.

In FPP, when the error does not give any useful information, this is the flow to follow:

use “rhpctl query audit” to get the date and time of the failing operation
open the “rhpserver.log.0” and look for the operation log in that time frame
get the UID of the operation e.g., in the following line it is “1556344143”:

[UID:-1556344143] [RMI TCP Connection(153)-192.168.1.151] [ 2021-07-27 00:25:20.741 KST ]
  [ServerCommon.processParameters:485]  before parsing: params = 
  {-methodName=addnodesWorkingCopy, -userName=grid, -version=19.0.0.0.0, -auditId=-1556344143,
  -auditCli=rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 -newnodes fppc3:fppc3-vip -cred cred_fppc,
  -plsnrPort=31605, -noun=gihome, -isSingleNodeProv=FALSE, -nls_lang=AMERICAN_AMERICA.AL32UTF8,
  -clusterName=fpps-cluster, -plsnrHost=fpps, -SA11204ClusterName=null,
  -lang=en_US, -clientNode=fpps, -verb=addnode, -ghopuid=-1556344143}

[UID:-1556344143] [RMI TCP Connection(153)-192.168.1.151] [ 2021-07-27 00:25:20.741 KST ]

[ServerCommon.processParameters:485] before parsing: params =

{-methodName=addnodesWorkingCopy, -userName=grid, -version=19.0.0.0.0, -auditId=-1556344143,

-auditCli=rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 -newnodes fppc3:fppc3-vip -cred cred_fppc,

-plsnrPort=31605, -noun=gihome, -isSingleNodeProv=FALSE, -nls_lang=AMERICAN_AMERICA.AL32UTF8,

-clusterName=fpps-cluster, -plsnrHost=fpps, -SA11204ClusterName=null,

-lang=en_US, -clientNode=fpps, -verb=addnode, -ghopuid=-1556344143}

Isolate the log for the operation: grep $UID rhpserver.log.0 > $UID.log
Locate the trace file of the rhphelper remote execution:

[UID:-1556344143] [RMI TCP Connection(153)-192.168.1.151] [ 2021-07-27 00:26:07.031 KST ] [RHPHELPERUtil.getTraceEnvs:4386] 
  TraceFileLocEnv is :RHPHELPER_TRACEFILE=/u01/app/grid/crsdata/fppc2/rhp/rhphelp_20210727002603.trc

1 2	[UID:-1556344143] [RMI TCP Connection(153)-192.168.1.151] [ 2021-07-27 00:26:07.031 KST ] [RHPHELPERUtil.getTraceEnvs:4386] TraceFileLocEnv is :RHPHELPER_TRACEFILE=/u01/app/grid/crsdata/fppc2/rhp/rhphelp_20210727002603.trc

Find the root cause in the rhphelper trace:

[main] [ 2021-07-27 00:27:02.600 KST ] [reflect.GeneratedMethodAccessor1.invoke:-1]  PRVG-11406 : API with node roles argument must be called for Flex Cluster

1	[main] [ 2021-07-27 00:27:02.600 KST ] [reflect.GeneratedMethodAccessor1.invoke:-1] PRVG-11406 : API with node roles argument must be called for Flex Cluster

In this case, the target cluster is a Flex Cluster, so the command must be run specifying the node_role.

The documentation is not clear (we will fix it soon):

rhpctl addnode gihome {-workingcopy workingcopy_name | -client cluster_name}
  -newnodes node_name:node_vip[:node_role][,node_name:node_vip[:node_role]...]

1 2	rhpctl addnode gihome {-workingcopy workingcopy_name \| -client cluster_name} -newnodes node_name:node_vip[:node_role][,node_name:node_vip[:node_role]...]

node_role must be specified for Flex Clusters, and it must be either HUB or LEAF.

After using the correct command line, the command succeeded.

rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 \
 -newnodes fppc3:fppc3-vip:HUB  -cred fppc-cred

1 2	rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 \ -newnodes fppc3:fppc3-vip:HUB -cred fppc-cred

HTH

—

Ludovico

Changing FPP temporary directory (/tmp in noexec and other issues)

Posted on July 13, 2021 by Ludovico

When using FPP, you might experience the following error (PRVF-7546):

$ rhpctl add workingcopy -workingcopy WC_db_19_11_FPPC -image db_19_11 -path /u01/app/oracle/product/WC_db_19_11_FPPC -client fppc -oraclebase /u01/app/oracle
fpps01: Audit ID: 121
PRGO-1260 : Cluster Verification checks for database home provisioning  failed for the specified working copy WC_db_19_11_FPPC.
PRCR-1178 : Execution of command failed on one or more nodes
 
PRVF-7546 : The work directory "/tmp/CVU_19.0.0.0.0_oracle/" cannot be used on node "fppc02"

$ rhpctl add workingcopy -workingcopy WC_db_19_11_FPPC -image db_19_11 -path /u01/app/oracle/product/WC_db_19_11_FPPC -client fppc -oraclebase /u01/app/oracle

fpps01: Audit ID: 121

PRGO-1260 : Cluster Verification checks for database home provisioning failed for the specified working copy WC_db_19_11_FPPC.

PRCR-1178 : Execution of command failed on one or more nodes

PRVF-7546 : The work directory "/tmp/CVU_19.0.0.0.0_oracle/" cannot be used on node "fppc02"

This is often related to the filesystem /tmp that has the “noexec” option:

$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec)

1 2	$ mount \| grep /tmp tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec)

Although it is tempting to just remount the filesystem with “exec”, you might be in this situation because your systems are configured to adhere to the STIG recommendations:

The noexec option must be added to the /tmp partition (https://www.stigviewer.com/stig/red_hat_enterprise_linux_6/2016-12-16/finding/V-57569)

FPP 19.9 contains fix 30885598 that allows specifying the temporary location for FPP operations:

$ srvctl modify rhpserver  -tmploc <new_tmp>

1	$ srvctl modify rhpserver -tmploc <new_tmp>

After that, the operation should run smoothly:

fppc02: Successfully executed clone operation.
fppc02: Executing root script on nodes ltora401,ltora402.
fppc02: Successfully executed root script on nodes fppc01,fppc02.
fppc02: Working copy creation completed.
fppc02: Oracle home provisioned.
fpps01: Client-side action completed.

fppc02: Successfully executed clone operation.

fppc02: Executing root script on nodes ltora401,ltora402.

fppc02: Successfully executed root script on nodes fppc01,fppc02.

fppc02: Working copy creation completed.

fppc02: Oracle home provisioned.

fpps01: Client-side action completed.

HTH

—

Ludo

Oracle Fleet Patching and Provisioning (FPP): My new role as PM and a brand new series of blog posts

Posted on May 4, 2021 by Ludovico

It’s been 6 years since I’ve tried FPP for the first time (formerly Rapid Home Provisioning, or RHP).

Rapid Home Provisioning

FPP was still young and lacking many features at that time, but it already changed the way I’ve worked during the next years. I embraced the out of place patching, developed some basic scripts to install Oracle Homes, and sought automation and standardization at all costs:

Oracle Home Management – part 7: Putting all together

When 18c came with the FPP local-mode automaton, I have implemented it for the Grid Infrastructure patching strategy at CERN:

Oracle Grid Infrastructure 18c patching part 3: Executing out-of-place patching with the local-mode automaton

And discovered that meanwhile, FPP did giant steps, with many new features and fixes for quite a few usability and performance problems.

Last year, when joining the Oracle Database High Availability (HA), Scalability, and Maximum Availability Architecture (MAA) Product Management Team at Oracle, I took (among others) the Product Manager role for FPP.

Becoming an Oracle employee after 20 years of working with Oracle technology is a big leap. It allows me to understand how big the company is, and how collaborative and friendly the Oracle employees are (Yes, I was used to marketing nonsense, insisting salesmen, and unfriendly license auditors. This is slowly changing with Oracle embracing the Cloud, but it is still a fresh wound for many customers. Expect this to change even more! Regarding me… I’ll be the same I’ve always been 🙂 ).

Now I have daily meetings with big customers (bigger than the ones I have ever had in the past), development teams, other product managers, Oracle consultants, and community experts. My primary goal is to make the product better, increasing its adoption, and helping customers having the best experience with it. This includes testing the product myself, writing specs, presentations, videos, collecting feedback from the customers, tracking bugs, and manage escalations.

I am a Product Manager for other products as well, but I have to admit that FPP is the product that takes most of my Product Manager time. Why?

I will give a few reasons in my next blog post(s).

—

Ludo

Oracle Grid Infrastructure 19c does not configure FPP in local-mode by default. How to add it?

Posted on June 13, 2019 by Ludovico

I have been installing Grid Infrastructure 18c for a while, then switched to 19c when it became GA.

At the beginning I have been overly enthusiast by the shorter installation time:

Grid Infra 19c install process is MUCH faster than 18c/12cR2. Mean time for 2 node clusters @ CERN (incl. volumes, puppet runs, etc.) lowered from 1h30 to 45mins. No GIMR anymore by default!

— Ludovico Caldara (@ludodba) 5 maggio 2019

The GIMR is now optional, that means that deciding to install it is a choice of the customer, and a customer might like to keep it or not, depending on its practices.

Not having the GIMR by default means not having the local-mode automaton. This is also not a problem at all. The default configuration is good for most customers and works really well.

This new simplified configuration reduces some maintenance effort at the beginning, but personally I use a lot the local-mode automaton for out-of-place patching of Grid Infrastructure (read my blog posts to know why I really love the local-mode automaton), so it is something that I definitely need in my clusters.

A choice that makes sense for Oracle and most customers

Oracle vision regarding Grid Infrastructure consists of a central management of clusters, using the Oracle Domain Services Cluster. In this kind of deployment, the Management Repository, TFA, and many other services, are centralized. All the clusters use those services remotely instead of having them configured locally. The local-mode automaton is no exception: the full, enterprise-grade version of Fleet Patching and Provisioning (FPP, formerly Rapid home provisioning or RHP) allows much more than just out-of-place patching of Grid Infrastructure, so it makes perfectly sense to avoid those configurations everywhere, if you use a Domain Cluster architecture. Read more here.

Again, as I said many times in the past, doing out-of-place patching is the best approach in my opinion, but if you keep doing in-place patching, not having the local-mode automaton is not a problem at all and the default behavior in 19c is a good thing for you.

I need local-mode automaton on 19c, what I need to do at install time?

If you have many clusters, you are not installing them by hand with the graphic interface (hopefully!). In the responseFile for 19c Grid Infrastructure installation, this is all you need to change comparing to a 18c:

$ diff grid_install_template_18.rsp grid_install_template_19.rsp
1c1
< oracle.install.responseFileVersion=/oracle/install/rspfmt_crsinstall_response_schema_v18.0.0
---
> oracle.install.responseFileVersion=/oracle/install/rspfmt_crsinstall_response_schema_v19.0.0
25a26
> oracle.install.crs.configureGIMR=true
27c28
< oracle.install.crs.config.storageOption=
---
> oracle.install.crs.config.storageOption=FLEX_ASM_STORAGE

$ diff grid_install_template_18.rsp grid_install_template_19.rsp

1c1

< oracle.install.responseFileVersion=/oracle/install/rspfmt_crsinstall_response_schema_v18.0.0

---

> oracle.install.responseFileVersion=/oracle/install/rspfmt_crsinstall_response_schema_v19.0.0

25a26

> oracle.install.crs.configureGIMR=true

27c28

< oracle.install.crs.config.storageOption=

---

> oracle.install.crs.config.storageOption=FLEX_ASM_STORAGE

as you can see, also Flex ASM is not part of the game by default in 19c.

Once you specify in the responseFile that you want GIMR, then the local-mode automaton is installed as well by default.

I installed GI 19c without GIMR and local-mode automaton. How can I add them to my new cluster?

First, recreate the empty MGMTDB CDB by hand:

$ dbca -silent -createDatabase -sid -MGMTDB -createAsContainerDatabase true \
 -templateName MGMTSeed_Database.dbc -gdbName _mgmtdb \
 -storageType ASM -diskGroupName +MGMT \
 -datafileJarLocation $OH/assistants/dbca/templates \
 -characterset AL32UTF8 -autoGeneratePasswords -skipUserTemplateCheck

Prepare for db operation
10% complete
Registering database with Oracle Grid Infrastructure
14% complete
Copying database files
43% complete
Creating and starting Oracle instance
45% complete
49% complete
54% complete
58% complete
62% complete
Completing Database Creation
66% complete
69% complete
71% complete
Executing Post Configuration Actions
100% complete
Database creation complete. For details check the logfiles at:
 /u01/app/oracle/cfgtoollogs/dbca/_mgmtdb.
Database Information:
Global Database Name:_mgmtdb
System Identifier(SID):-MGMTDB
Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/_mgmtdb/_mgmtdb2.log" for further details.

$ dbca -silent -createDatabase -sid -MGMTDB -createAsContainerDatabase true \

-templateName MGMTSeed_Database.dbc -gdbName _mgmtdb \

-storageType ASM -diskGroupName +MGMT \

-datafileJarLocation $OH/assistants/dbca/templates \

-characterset AL32UTF8 -autoGeneratePasswords -skipUserTemplateCheck

Prepare for db operation

10% complete

Registering database with Oracle Grid Infrastructure

14% complete

Copying database files

43% complete

Creating and starting Oracle instance

45% complete

49% complete

54% complete

58% complete

62% complete

Completing Database Creation

66% complete

69% complete

71% complete

Executing Post Configuration Actions

100% complete

Database creation complete. For details check the logfiles at:

/u01/app/oracle/cfgtoollogs/dbca/_mgmtdb.

Database Information:

Global Database Name:_mgmtdb

System Identifier(SID):-MGMTDB

Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/_mgmtdb/_mgmtdb2.log" for further details.

Then, configure the PDB for the cluster. Pay attention to the -local switch that is not documented (or at least it does not appear in the inline help):

$ mgmtca -local

1	$ mgmtca -local

After that, you might check that you have the PDB for your cluster inside the MGMTDB, I’ll skip this step.

Before creating the rhpserver (local-mode automaton resource), we need the volume and filesystem to make it work (read here for more information).

The volume:

ASMCMD> volcreate -G MGMT -s 1536M --column 8 --width 1024k --redundancy unprotected GHCHKPT

ASMCMD> volinfo --all
Diskgroup Name: MGMT

         Volume Name: GHCHKPT
         Volume Device: /dev/asm/ghchkpt-303
         State: ENABLED
         Size (MB): 1536
         Resize Unit (MB): 64
         Redundancy: UNPROT
         Stripe Columns: 8
         Stripe Width (K): 1024
         Usage:
         Mountpath:

ASMCMD> volcreate -G MGMT -s 1536M --column 8 --width 1024k --redundancy unprotected GHCHKPT

ASMCMD> volinfo --all

Diskgroup Name: MGMT

Volume Name: GHCHKPT

Volume Device: /dev/asm/ghchkpt-303

State: ENABLED

Size (MB): 1536

Resize Unit (MB): 64

Redundancy: UNPROT

Stripe Columns: 8

Stripe Width (K): 1024

Usage:

Mountpath:

The filesystem:

(oracle)$ mkfs -t acfs /dev/asm/ghchkpt-303

(root)# $CRS_HOME/bin/srvctl add filesystem -d /dev/asm/ghchkpt-303 -m /opt/oracle/rhp_images/chkbase -u oracle -fstype ACFS
(root)# $CRS_HOME/bin/srvctl enable filesystem -volume ghchkpt -diskgroup MGMT
(root)# $CRS_HOME/bin/srvctl start filesystem -volume ghchkpt -diskgroup MGMT

(oracle)$ mkfs -t acfs /dev/asm/ghchkpt-303

(root)# $CRS_HOME/bin/srvctl add filesystem -d /dev/asm/ghchkpt-303 -m /opt/oracle/rhp_images/chkbase -u oracle -fstype ACFS

(root)# $CRS_HOME/bin/srvctl enable filesystem -volume ghchkpt -diskgroup MGMT

(root)# $CRS_HOME/bin/srvctl start filesystem -volume ghchkpt -diskgroup MGMT

Finally, create the local-mode automaton resource:

(root)# $CRS_HOME/bin/srvctl add rhpserver -local -storage /opt/oracle/rhp_images

1	(root)# $CRS_HOME/bin/srvctl add rhpserver -local -storage /opt/oracle/rhp_images

Again, note that there is a -local switch that is not documented. Specifying it will create the resource as a local-mode automaton and not as a full FPP Server (or RHP Server, damn, this change of name gets me mad when I write blog posts about it 🙂 ).

HTH

—

Ludovico