DBA survival BLOG

DBA stuff and Oracle Data Guard

Checking usage of HugePages by Oracle databases in Linux environments

Posted on November 22, 2019 by Ludovico

Yesterday several databases on one server started logging errors in the alert log:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

That means there is not enough contiguous free memory in the OS. The first things I checked were, of course, the memory and the huge page usage:

# [ oracle@oraserver1:/home/oracle [10:45:46] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ free
              total        used        free      shared  buff/cache   available
Mem:      528076056   398142940     3236764   119855448   126696352     5646964
Swap:      16760828    11615324     5145504

# [ oracle@oraserver1:/home/oracle [10:46:47] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ cat /proc/meminfo | grep Huge
HugePages_Total:   180000
HugePages_Free:    86029
HugePages_Rsvd:    11507
HugePages_Surp:        0
Hugepagesize:       2048 kB

The available memory (last column of the free output) was indeed quite low, but there was still plenty of space in the huge pages (86k pages free out of 180k).
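
The same reading of /proc/meminfo can be scripted. The following is only a sketch: the function name hugepage_summary is made up for this example, and it assumes the usual HugePages_* field names. Keep in mind that HugePages_Free still includes the pages counted in HugePages_Rsvd, so the pages that nobody has claimed yet are Free minus Rsvd.

```shell
#!/bin/bash
# Summarise huge page usage from a meminfo-style file (defaults to /proc/meminfo).
# hugepage_summary is a hypothetical helper, not part of any standard tool.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free  = $2 }
        /^HugePages_Rsvd:/  { rsvd  = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            used  = total - free     # pages already faulted in
            avail = free - rsvd      # free pages that are not reserved yet
            printf "used=%d pages (%.1f GiB) available=%d pages (%.1f GiB)\n",
                   used, used * size_kb / 1048576, avail, avail * size_kb / 1048576
        }
    ' "${1:-/proc/meminfo}"
}
```

With the values shown above, this reports 93971 pages in use (about 183.5 GiB) and 74522 pages neither used nor reserved.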

The usage by Oracle instances:

# [ oracle@oraserver1:/home/oracle [10:45:39] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ sh mem.sh
DB12 : 54081544
DB22 : 37478820
DB32 : 67970828
DB42 : 14846552
DB52 : 16326380
DB62 : 15122048
DB82 : 56900472
DB92 : 14401080
DBA2 : 12622736
DBB2 : 14379916
DBC2 : 46078336
DBD2 : 46137728
DB72 : 37351336
total :  433697776

You can get the code of mem.sh in this post.

Regarding pure shared memory usage, the situation was what I was expecting:

$ ipcs -m | awk 'BEGIN{a=0} {a+=$5} END{print a}'
369394520064

Roughly 369 GB (344 GiB) of shared memory in use, much more than what was allocated in the huge pages (180000 - 86029 = 93971 pages of 2 MB, about 184 GiB).
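
To avoid converting the byte count by hand, the sum can be printed directly in GiB. This is a small sketch with a made-up function name (shm_total_gib); it assumes the Linux ipcs -m layout, where the segment size in bytes is the fifth column.

```shell
#!/bin/bash
# Sum the "bytes" column (5th field) of `ipcs -m` output read from stdin
# and print the total in GiB. Header lines are skipped because their
# fifth field is not a number.
shm_total_gib() {
    awk '$5 ~ /^[0-9]+$/ { sum += $5 }
         END { printf "%.1f\n", sum / (1024 * 1024 * 1024) }'
}

# Usage on a live system: ipcs -m | shm_total_gib
```

Filtering on a numeric fifth field is a slightly stricter variant of the one-liner above, which relies on awk treating the header text as zero.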

I compared the situation with the other node in the cluster: it had more memory allocated by the databases (because of the higher load on it), more huge pages in use and less consumption of 4k pages overall.

$ sh mem.sh
DB12 : 78678000
DB22 : 14220000
DB32 : 14287528
DB42 : 12369352
DB52 : 14868596
DB62 : 14633984
DB82 : 54316104
DB92 : 86148332
DBA2 : 61473288
DBB2 : 68678788
DBC2 : 9831288
DBD2 : 64759352
DB72 : 68114604
total :  562379216

$ free
              total        used        free      shared  buff/cache   available
Mem:      528076056   402288800    17100464     5818032   108686792   114351784
Swap:      16760828       47360    16713468

$ cat /proc/meminfo | grep Huge
AnonHugePages:     10240 kB
HugePages_Total:   176654
HugePages_Free:    15557
HugePages_Rsvd:    15557
HugePages_Surp:        0
Hugepagesize:       2048 kB

So I was wondering whether all the DBs were properly allocating the SGA in huge pages or not.

This Red Hat page was quite useful to create a quick snippet to check the huge page memory allocation per process:

# [ oracle@oraserver1:/home/oracle [10:55:27] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ cat /proc/707/numa_maps | grep -i hug
60000000 default file=/SYSV00000000\040(deleted) huge dirty=1 mapmax=57 N0=1 kernelpagesize_kB=2048
70000000 default file=/SYSV00000000\040(deleted) huge dirty=1525 mapmax=57 N0=743 N1=782 kernelpagesize_kB=2048
c60000000 interleave:0-1 file=/SYSV0b46df00\040(deleted) huge dirty=1 mapmax=57 N0=1 kernelpagesize_kB=2048


# [ oracle@oraserver1:/home/oracle [10:56:39] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ function pshugepage () {
> HUGEPAGECOUNT=0
> for num in `grep 'huge.*dirty=' /proc/$@/numa_maps | awk '{print $5}' | sed 's/dirty=//'` ; do
> HUGEPAGECOUNT=$((HUGEPAGECOUNT+num))
> done
> echo process $@ using $HUGEPAGECOUNT huge pages
> }

# [ oracle@oraserver1:/home/oracle [10:57:09] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ pshugepage 707
process 707 using 1527 huge pages


# [ oracle@oraserver1:/home/oracle [10:57:11] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ for pid in `ps -eaf | grep [p]mon | awk '{print $2}'` ; do pshugepage $pid ; done
process 707 using 1527 huge pages
process 3685 using 2409 huge pages
process 16092 using 3056 huge pages
process 55718 using 0 huge pages
process 58490 using 0 huge pages
process 70583 using 0 huge pages
process 94479 using 1135 huge pages
process 98216 using 0 huge pages
process 98755 using 0 huge pages
process 100245 using 0 huge pages
process 100265 using 0 huge pages
process 100270 using 0 huge pages
process 101681 using 0 huge pages
process 179079 using 1699 huge pages
process 189585 using 14566 huge pages
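
The pshugepage function picks the fifth field of each line, which works for the numa_maps layout shown above. As a variation, here is a sketch that looks for the dirty= field wherever it appears on a line flagged as huge. The name hugepages_in_maps is invented for this example, and it keeps the same counting rule as the original (only the dirty= counters are summed).

```shell
#!/bin/bash
# Count the huge pages touched by a process, given a numa_maps-style file.
# Assumption: huge mappings carry the "huge" flag and report their page
# count in a dirty=N field, as in the output above.
hugepages_in_maps() {
    awk '/ huge / {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^dirty=[0-9]+$/) { split($i, kv, "="); sum += kv[2] }
         }
         END { print sum + 0 }' "$1"
}

# Usage on a live system: hugepages_in_maps /proc/707/numa_maps
```

Taking the file as an argument also makes it possible to run the function against a saved copy of numa_maps.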

It was easy to spot the databases not using huge pages at all:

# [ oracle@oraserver1:/home/oracle [10:58:26] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ ps -eaf | grep [p]mon
oracle      707      1  0 Sep30 ?        00:23:55 ora_pmon_DB12
oracle     3685      1  0 Nov01 ?        00:09:17 ora_pmon_DB22
oracle    16092      1  0 Oct15 ?        00:04:15 ora_pmon_DB32
oracle    55718      1  0 Aug12 ?        00:08:25 asm_pmon_+ASM2
oracle    58490      1  0 Aug12 ?        00:08:24 apx_pmon_+APX2
oracle    70583      1  0 Aug12 ?        00:57:55 ora_pmon_DB42
oracle    94479      1  0 Oct02 ?        00:32:03 ora_pmon_DB52
oracle    98216      1  0 Aug12 ?        00:58:36 ora_pmon_DB62
oracle    98755      1  0 Aug12 ?        00:59:27 ora_pmon_DB82
oracle   100245      1  0 Aug12 ?        00:56:52 ora_pmon_DB92
oracle   100265      1  0 Aug12 ?        00:51:54 ora_pmon_DBA2
oracle   100270      1  0 Aug12 ?        00:54:57 ora_pmon_DBB2
oracle   101681      1  0 Aug12 ?        00:56:55 ora_pmon_DBC2
oracle   179079      1  0 Sep10 ?        00:35:17 ora_pmon_DBD2
oracle   189585      1  0 Nov01 ?        00:09:34 ora_pmon_DB72
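
Matching PIDs to instance names by eye works, but the two steps can also be combined. A possible sketch follows; pmon_instances is a hypothetical helper that expects ps -eo pid,args style input and only knows the three process prefixes visible in the listing above.

```shell
#!/bin/bash
# Read `ps -eo pid,args`-style lines from stdin and print "PID INSTANCE"
# for every process named ora_pmon_*, asm_pmon_* or apx_pmon_*.
pmon_instances() {
    awk '$2 ~ /^(ora|asm|apx)_pmon_/ {
             name = $2
             sub(/^[a-z]+_pmon_/, "", name)   # keep only the instance name
             print $1, name
         }'
}

# Usage on a live system, together with the pshugepage function defined earlier:
#   ps -eo pid,args | pmon_instances | while read -r pid sid ; do
#       echo "$sid: $(pshugepage "$pid")"
#   done
```

Anchoring the pattern to the start of the command name also removes the need for the grep [p]mon trick.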

Indeed, after stopping them, the huge page usage did not change:

# [ oracle@oraserver1:/home/oracle [11:01:52] [11.2.0.4.0 [DBMS EE] SID=DB62] 1 ] #
$ srvctl stop instance -d DB6_SITE1 -i DB62

# [ oracle@oraserver1:/home/oracle [11:02:24] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DB4_SITE1 -i DB42

# [ oracle@oraserver1:/home/oracle [11:03:29] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DB8_SITE1 -i DB82

# [ oracle@oraserver1:/home/oracle [11:06:36] [11.2.0.4.0 [DBMS EE] SID=DB62] 130 ] #
$ srvctl stop instance -d DB9_SITE1 -i DB92

# [ oracle@oraserver1:/home/oracle [11:07:16] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DBA_SITE1 -i DBA2

# [ oracle@oraserver1:/home/oracle [11:07:56] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DBB_SITE1 -i DBB2

# [ oracle@oraserver1:/home/oracle [11:08:42] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DBC_SITE1 -i DBC2

# [ oracle@oraserver1:/home/oracle [11:09:16] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ cat /proc/meminfo | grep Huge
HugePages_Total:   180000
HugePages_Free:    86029
HugePages_Rsvd:    11507
HugePages_Surp:        0
Hugepagesize:       2048 kB

But after starting them again, I could see the new huge pages reserved/allocated:

# [ oracle@oraserver1:/home/oracle [11:10:35] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB6_SITE1 -i DB62

# [ oracle@oraserver1:/home/oracle [11:12:14] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB4_SITE1 -i DB42

# [ oracle@oraserver1:/home/oracle [11:12:54] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB8_SITE1 -i DB82

# [ oracle@oraserver1:/home/oracle [11:13:41] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB9_SITE1 -i DB92

# [ oracle@oraserver1:/home/oracle [11:14:43] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DBA_SITE1 -i DBA2

# [ oracle@oraserver1:/home/oracle [11:15:25] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DBB_SITE1 -i DBB2

# [ oracle@oraserver1:/home/oracle [11:15:54] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DBC_SITE1 -i DBC2

# [ oracle@oraserver1:/home/oracle [11:17:49] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ cat /proc/meminfo | grep Huge
HugePages_Total:   180000
HugePages_Free:    72820
HugePages_Rsvd:    68961
HugePages_Surp:        0
Hugepagesize:       2048 kB

# [ oracle@oraserver1:/home/oracle [11:17:54] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ free
              total        used        free      shared  buff/cache   available
Mem:      528076056   392011828   123587116     5371848    12477112   126250868
Swap:      16760828      587308    16173520

The reason was that the server had first been started without huge pages, and the huge pages were only configured after a few instances had already started.
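
One way to catch this earlier is a check before starting an instance. The sketch below is an illustration only: hugepages_fit_sga is a made-up function, and the SGA size in MiB is an input you would have to supply yourself (for example from sga_max_size).

```shell
#!/bin/bash
# Return 0 if the unreserved free huge pages can hold an SGA of the given size.
#   $1 = SGA size in MiB (hypothetical input, supplied by the caller)
#   $2 = meminfo-style file, defaults to /proc/meminfo
hugepages_fit_sga() {
    awk -v sga_mib="$1" '
        /^HugePages_Free:/ { free = $2 }
        /^HugePages_Rsvd:/ { rsvd = $2 }
        /^Hugepagesize:/   { size_kb = $2 }
        END {
            if (size_kb == 0) exit 1     # no huge page size reported at all
            # round the SGA size up to whole huge pages
            needed = int((sga_mib * 1024 + size_kb - 1) / size_kb)
            if (free - rsvd >= needed) exit 0
            exit 1
        }
    ' "${2:-/proc/meminfo}"
}

# Usage: hugepages_fit_sga 16384 || echo "not enough huge pages for a 16 GiB SGA"
```

As far as I know, Oracle also offers the USE_LARGE_PAGES initialization parameter, and setting it to ONLY makes an instance refuse to start when its SGA does not fit in huge pages. Check the documentation of your release before relying on it.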

HTH

—

Ludovico

Basic Vagrantfile for multiple groups of VMs

Posted on April 5, 2018 by Ludovico

In case you want to prepare multiple sets of machines quickly using Vagrant, ready for different setups, this might be something for you:

## -*- mode: ruby -*-
## vi: set ft=ruby :

require 'ipaddr'

###############################
# CUSTOM CONFIGURATION START
###############################

# lab_name is the name of the lab where all the files will be organized.
lab_name = "lab_bigdata"

# here is where you download your software, so it will be available to the VMs.
sw_path  = "C:\\Users\\ludov\\Downloads\\Software"

# cluster(s) definition
clusters = [
  {
  :prefix  => "hadoop", 				# prefix: VMs will be named prefix01, prefix02, etc
  :domain  => "ludovicocaldara.net",	# domain name
  :box     => "ludodba/ol7.3-base",		# base box, either "ludodba/ol7.3-base" or "ludodba/ubu1604"
  :nodes   => 3,						# number of nodes for this cluster
  :cpu     => 1,
  :mem     => 2048,
  :publan  => IPAddr.new("192.168.56.0/24"), 	# public lan for the cluster
  :publan_start => 121							# starting IP, each VM will increment it by one
  },
  {
  :prefix  => "kafka",							# eventually, continue with another cluster!
  :domain  => "ludovicocaldara.net",
  :box     => "ludodba/ol7.3-base",
  :nodes   => 1,
  :cpu     => 1,
  :mem     => 2048,
  :publan  => IPAddr.new("192.168.56.0/24"),
  :publan_start => 131
  },
  {
  :prefix  => "postgres",
  :domain  => "ludovicocaldara.net",
  :box     => "ludodba/ubu1604",
  :nodes   => 1,
  :cpu     => 1,
  :mem     => 2048,
  :publan  => IPAddr.new("192.168.56.0/24"),
  :publan_start => 141
  }
]

###############################
# CUSTOM CONFIGURATION END
###############################

######################################################
# Extending Class IPAddr to add the CIDR to the lan
class IPAddr
  def to_cidr_s
    if @addr
      mask = @mask_addr.to_s(2).count('1')
      "#{to_s}/#{mask}"
    else
      nil
    end
  end
end # extend class IPAddr

########
# MAIN #
########

Vagrant.configure(2) do |config|
  config.ssh.username = "root"  	# my boxes are password based for simplicity
  config.ssh.password = "vagrant"
  config.vm.graceful_halt_timeout = 360	# in case you install grid infra... do not force shutdown after a few seconds

  if File.directory?(sw_path)
    # our shared folder for oracle 12c installation files (uid 54320 is grid, uid 54321 is oracle)
    config.vm.synced_folder sw_path, "/media/sw", :mount_options => ["dmode=775","fmode=775","uid=54322","gid=54328"]
  end

  # looping through each cluster
  (0..(clusters.length-1)).each do |cluid|

    # assign variable clu to current cluster, for convenience
    clu = clusters[cluid]
      
    # looping through each node in the cluster
    (1..(clu[:nodes])).each do |nid|

      # let's start from the last node (see RAC Attack automation for the reason) :-)
      nid = clu[:nodes]+1-nid
      config.vm.define vm_name = "#{clu[:prefix]}%02d" % nid do |cnf|
	  
        # set the right box for the VM
        cnf.vm.box = clu[:box]
        if (clu[:box_version]) then
          cnf.vm.box_version = clu[:box_version]
        end #if

        # the new vm name
        vm_name = "#{clu[:prefix]}%02d" % nid
        fqdn = "#{vm_name}.#{clu[:domain]}"
        cnf.vm.hostname = "#{fqdn}"

        # incrementing public ip for the cluster
        pubip = clu[:publan].|(clu[:publan_start]+nid-1).to_s

        cnf.vm.provider :virtualbox do |vb|
          #vb.linked_clone = true  # in case you want thin provisioning. read the vagrant doc before setting it
          vb.name = vm_name
          vb.gui = false
          vb.customize ["modifyvm", :id, "--memory", clu[:mem]]
          vb.customize ["modifyvm", :id, "--cpus",   clu[:cpu]]
          vb.customize ["modifyvm", :id, "--groups", "/#{lab_name}/#{clu[:prefix]}"]
        end #config.vm.provider
		
        # Configuring virtualbox network for #{pubip}
        cnf.vm.network :private_network, ip: pubip

      end #config.vm.define
    end #loop nodes
  end  #loop clusters
end #Vagrant.configure

The nice thing (besides speeding up the creation and basic configuration) is the organization of the directories. The configuration at the beginning of the script will result in 5 virtual machines:

your VM directory
        |- lab_bigdata 
                |- hadoop
                        |- hadoop01  (ol7)
                        |- hadoop02  (ol7)
                        |- hadoop03  (ol7)
                |- kafka
                        |- kafka01   (ol7)
                |- postgres
                        |- postgres01  (ubuntu 16.04)
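
To double-check names and addresses before running vagrant up, the naming rule (prefix plus a two-digit node number) and the IP rule (publan_start + node number - 1) can be reproduced in a few lines of shell. This is just a sketch: plan_cluster is a made-up helper, and the 192.168.56.0/24 network is hard-coded as in the configuration above.

```shell
#!/bin/bash
# Print the VM names and IPs that one cluster definition would produce.
#   $1 = prefix, $2 = number of nodes, $3 = publan_start
# Assumption: the public LAN is 192.168.56.0/24, as in the configuration above.
plan_cluster() {
    local prefix=$1 nodes=$2 start=$3 nid
    # The Vagrantfile defines the last node first, so count down.
    for (( nid = nodes; nid >= 1; nid-- )) ; do
        printf '%s%02d 192.168.56.%d\n' "$prefix" "$nid" $(( start + nid - 1 ))
    done
}

plan_cluster hadoop 3 121
plan_cluster kafka 1 131
plan_cluster postgres 1 141
```

For the three clusters defined above, this lists hadoop03 down to hadoop01 on .123 to .121, kafka01 on .131 and postgres01 on .141.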

It is based in part on the RAC Attack automation scripts by Alvaro Miranda, heavily modified and simplified.

I have a more complex version that automates all the tasks for a full multi-cluster RAC environment, but if that is your requirement, I would rather check the oravirt scripts on GitHub (https://github.com/oravirt). They are much more powerful and complete (and complex…) than my Vagrantfile. 🙂

Cheers

Bash tips & tricks [ep. 7]: Cleanup on EXIT with a trap

Posted on March 24, 2016 by Ludovico

This is the seventh episode of a small series.

Description:

Pipes, temporary files, lock files, processes spawned in background, rows inserted in a status table that need to be updated… Everything needs to be cleaned up when the script exits, even when the exit condition is not triggered inside the script.

BAD:

The worst practice is, of course, to forget to clean up the temp files, leaving my output and temporary directories full of *.tmp, *.pipe, *.lck files, etc. I will not show the code because the list of bad practices is quite long…

Better than forgetting to clean up, but still very bad, is to clean up everything just before triggering the exit command (in the following example, F_check_exit is a function that exits the script if the first argument is non-zero, as defined in the previous episode):

...
some_command_that_must_succeed
EXITCODE=$?
if [ $EXITCODE -ne 0 ] ; then
    # Need to exit here, but F_check_exit function does not cleanup correctly
    [[ $TEMPFILE ]] && [[ -f $TEMPFILE ]] && rm $TEMPFILE
    [[ $EXP_PIPE ]] && [[ -p $EXP_PIPE ]] && rm $EXP_PIPE
    if [ $CHILD_PID ] ; then
        ps --pid $CHILD_PID >/dev/null
        if [ $? -eq 0 ] ; then
            kill $CHILD_PID # or wait, or what?
        fi
    fi
    F_check_exit $EXITCODE "Some command that must succeed"
fi

A better approach would be to put all the cleanup tasks in a Cleanup() function and then call this function instead of duplicating all the code everywhere:

...
some_command_that_must_succeed
EXITCODE=$?
[[ $EXITCODE -eq 0 ]] || Cleanup
F_check_exit $EXITCODE "Some command that must succeed"

But still, I need to make sure that I insert this piece of code everywhere. Not optimal yet.

I could call the Cleanup function from inside the F_check_exit function, but that has two drawbacks:
1 – I need to define the Cleanup function in every script that sources my include file
2 – there will still be exit conditions that are not trapped

GOOD:

The good approach is to trap the EXIT pseudo-signal with the Cleanup function:

Cleanup() {
  # cleanup your stuff here
}

trap Cleanup EXIT

do_something
F_check_exit $? "Something"
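
To see the trap at work, here is a small self-contained sketch. The demo_job function and its "fail" argument are invented for this illustration; the job runs in a subshell so that its exit does not end the calling shell.

```shell
#!/bin/bash
# Hypothetical job, run in a subshell so that `exit` does not end the caller.
demo_job() (
    tmpfile=$(mktemp) || exit 1
    # Registered once; runs when this (sub)shell exits, whatever the exit code.
    trap 'rm -f "$tmpfile"' EXIT
    echo "$tmpfile"                  # print the path so the caller can check it afterwards
    [ "$1" = "fail" ] && exit 3      # simulate an unexpected failure
    exit 0
)

path=$(demo_job fail) || echo "job failed with exit code $?"
[ -e "$path" ] || echo "temporary file removed by the trap"
```

The cleanup is declared in one place, right after the resource is created, and no exit path has to remember it.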

Much better! But what if my include script has some logic that also creates some temporary files?

I can create a global F_cleanup function that executes the local Cleanup function, if one is defined. Let me show this:

Include script:

# this is the include file (e.g. $BASEBIN/Init_Env.sh)
function F_cleanup() {
        EXITCODE=$?
        if [ `typeset -F Cleanup` ] ; then
                edebug "Cleanup function defined. Executing it..."
                Cleanup $EXITCODE
                edebug "Cleanup function executed with return code $?"
        else
                edebug "No cleanup function defined."
        fi
        # do other global cleanups
}

### Register the cleanup function
trap F_cleanup EXIT

Main script:

# Cleanup: If any function named Cleanup is defined, it will automatically be executed
# upon the EXIT signal.
Cleanup () {
    if [ "$1" -eq 0 ] ; then
        # exit 0 trapped
        :
    else
        # exit !0 trapped
        # report the error
        :
    fi
    # remove pipes, temporary files etc
}

. $BASEBIN/Init_Env.sh

do_something
F_check_exit $? "Something"

The Cleanup function will be executed only if defined.

If there is no Cleanup function, no worries: the F_cleanup function can still do some global cleanup that is not specific to the main script.
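
The interaction between the two functions can be tried out in a single file. This is a condensed sketch, not the real include file: edebug is replaced by plain echo, run_main is a hypothetical stand-in for the main script, and declare -F is used as an equivalent of the typeset -F test.

```shell
#!/bin/bash
# Global handler, as it would live in the include file.
F_cleanup() {
    local exitcode=$?                        # must come first: the status that triggered the trap
    if declare -F Cleanup >/dev/null ; then  # true only if a function named Cleanup exists
        Cleanup "$exitcode"
    fi
    echo "global cleanup done (exit code $exitcode)"
}

# Hypothetical main script, run in a subshell so the demo does not end the caller.
run_main() (
    Cleanup() { echo "local cleanup, exit code $1" ; }
    trap F_cleanup EXIT
    exit "${1:-0}"
)

run_main 2 || echo "main script exited with $?"
```

The local function runs first and receives the exit code, then the global part runs, and the script still ends with the original exit code.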

Bash tips & tricks [ep. 6]: Check the exit code

Posted on March 23, 2016 by Ludovico

This is the sixth episode of a small series.

Description:

Every command in a script may fail due to external reasons. Bash programming is not functional programming! 🙂

After running a command, make sure that you check the exit code and either raise a warning or exit with an error, depending on how a failure can impact the execution of the script.

BAD:

The worst example is not to check the exit code at all:

#!/bin/bash -l

recover -a -f -c ${NWCLIENT} -d ${DEST_FILE_PATH} $BASEBCK_FILENAME
# what if recover fails?

do_something_with_recovered_files

The next one is better, but you end up typing a lot of additional code:

#!/bin/bash -l

recover -a -f -c ${NWCLIENT} -d ${DEST_FILE_PATH} $BASEBCK_FILENAME

#---------
# the following piece of code is frequently copied&pasted 
ERR=$?
if [ $ERR -ne 0 ] ; then
    # I've got an error with the recovery
    eerror "The recovery failed with exit code $ERR"
    Log_Close
    exit $ERR
else
    eok "The recovery succeeded."
fi
#---------

do_something_with_recovered_files

Again, Log_Close, eok, eerror, etc are functions defined using the previous Bash Tips & Tricks in this series.

GOOD:

Define the check functions once and use them after every command:

# F_check_warn raises a warning if the command failed, but lets the script continue
function F_check_warn() {
        EXITCODE=$1
        shift
        if [ $EXITCODE -eq 0 ] ; then
                eok $@ succeeded with exit code $EXITCODE
        else
                ewarn $@ failed with exit code $EXITCODE. The script will continue.
        fi
        # return the same code so other checks can follow this one inside the script
        return $EXITCODE
}

# F_check_exit raises an error and exits if the command failed
function F_check_exit() {
        EXITCODE=$1
        shift
        if [ $EXITCODE -eq 0 ] ; then
                eok $@ succeeded with exit code $EXITCODE
        else
                eerror $@ failed with exit code $EXITCODE. The script will exit.
                Log_Close
                exit $EXITCODE
        fi
}

CMD="recover -a -f -c ${NWCLIENT} -d ${DEST_FILE_PATH} $BASEBCK_FILENAME"
enotify "Recover command: $CMD"
eval $CMD
F_check_exit $? "Recovery from networker"

do_something_with_the_recovered_files
F_check_warn $? "Non-blocking operation with recovered files"
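
One case deserves attention when using these functions: after a pipeline, $? holds the status of the last command only. The sketch below shows how the status of the first command can be passed to the check instead, using bash's PIPESTATUS array. The eok and ewarn functions here are simplified stand-ins for the helpers of the earlier episodes, and F_check_warn is repeated with quoted arguments so that the example runs on its own.

```shell
#!/bin/bash
# Hypothetical stand-ins for the logging helpers of the earlier episodes.
eok()   { echo "OK: $*" ; }
ewarn() { echo "WARN: $*" ; }

# Same logic as F_check_warn above, with the arguments quoted.
F_check_warn() {
    local exitcode=$1 ; shift
    if [ "$exitcode" -eq 0 ] ; then
        eok "$* succeeded with exit code $exitcode"
    else
        ewarn "$* failed with exit code $exitcode. The script will continue."
    fi
    return "$exitcode"
}

# $? after this pipeline would be the status of cat (success), so the
# failure of the first command has to be read from PIPESTATUS.
( exit 3 ) | cat
F_check_warn "${PIPESTATUS[0]}" "First command of the pipeline" || true  # non-blocking step
```

PIPESTATUS is overwritten by every command, so it has to be read on the line immediately after the pipeline.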

Bash tips & tricks [ep. 5]: Write the output to a logfile

Posted on March 22, 2016 by Ludovico

This is the fifth episode of a small series.

Description:

Logging the output of your scripts to a file is very important. There are several ways to achieve it; I will just show one of my favorites.

BAD:

You can log badly either from the script to a log file:

#!/bin/bash -l

TODAY=`date +"%Y%m%d"`
LOGDIR='/path/to/log'
OUTPUT="${LOGDIR}/output_${TODAY}.log"

# create the empty file or overwrite the existing one
> $OUTPUT

echo "Writing to the logfile" | tee -a $OUTPUT
command | tee -a $OUTPUT

echo "oops, this message and command will not be logged"
command
exit $?

or by badly redirecting the standard output of the script:

$ crontab -l
0 * * * * /path/to/script.sh > /path/to/always_the_same_log.out 2>&1

GOOD:

My favorite solution is to automatically open a pipe that receives the standard output and redirects it to the logfile. With this solution, I can programmatically define the logfile name inside the script (based on the script name and input parameters, for example) and forget about redirecting the output every time I run a command.

export LOGDIR=/path/to/logfiles
export DATE=`date +"%Y%m%d"`
export DATETIME=`date +"%Y%m%d_%H%M%S"`

ScriptName=`basename $0`
Job=`basename $0 .sh`"_whatever_I_want"
JobClass=`basename $0 .sh`

function Log_Open() {
        if [ $NO_JOB_LOGGING ] ; then
                einfo "Not logging to a logfile because -Z option specified." #(*)
        else
                [[ -d $LOGDIR/$JobClass ]] || mkdir -p $LOGDIR/$JobClass
                Pipe=${LOGDIR}/$JobClass/${Job}_${DATETIME}.pipe
                mkfifo -m 700 $Pipe
                LOGFILE=${LOGDIR}/$JobClass/${Job}_${DATETIME}.log
                exec 3>&1
                tee ${LOGFILE} <$Pipe >&3 &
                teepid=$!
                exec 1>$Pipe
                PIPE_OPENED=1
                enotify Logging to $LOGFILE  # (*)
                [ $SUDO_USER ] && enotify "Sudo user: $SUDO_USER" #(*)
        fi
}

function Log_Close() {
        if [ ${PIPE_OPENED} ] ; then
                exec 1<&3
                sleep 0.2
                ps --pid $teepid >/dev/null
                if [ $? -eq 0 ] ; then
                        # a wait $teepid would be better but some
                        # commands leave file descriptors open
                        sleep 1
                        kill  $teepid
                fi
                rm $Pipe
                unset PIPE_OPENED
        fi
}

OPTIND=1
while getopts ":Z" opt ; do
        case $opt in
                Z)
                        NO_JOB_LOGGING="true"
                        ;;
        esac
done

Log_Open
echo "whatever I execute here will be logged to $LOGFILE"
command
Log_Close

(*) The functions edebug, einfo, etc. have to be created following the guidelines from this post: Bash tips & tricks [ep. 4]: Use logging levels

The -Z parameter can be used to intentionally avoid logging.

Again, all this stuff (function definitions and variables) should be put in a global include file.

If I execute it:

# [ ludo@testsrv:/scripts [21:10:17] [not set env:"not set"] 0 ] #
# sudo -u oracle ./myscript.sh
2016-03-16 21:10:20 - Logging to /path/to/logfiles/myscript/myscript_whatever_I_want_20160316_211020.log
2016-03-16 21:10:20 - Sudo user: ludo
whatever I execute here will be logged to /path/to/logfiles/myscript/myscript_whatever_I_want_20160316_211020.log

# [ ludo@testsrv:/scripts [21:10:20] [not set env:"not set"] 0 ] #
# sudo -u oracle ./myscript.sh -Z
2016-03-16 21:15:18 - INFO ---- Not logging to a logfile because -Z option specified.
whatever I execute here will be logged to

# [ ludo@testsrv:/scripts [21:10:20] [not set env:"not set"] 0 ] #
# cat /path/to/logfiles/myscript/myscript_whatever_I_want_20160316_211020.log
2016-03-16 21:10:20 - Logging to /path/to/logfiles/myscript/myscript_whatever_I_want_20160316_211020.log
2016-03-16 21:10:20 - Sudo user: ludo
whatever I execute here will be logged to /path/to/logfiles/myscript/myscript_whatever_I_want_20160316_211020.log
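To isolate the mechanism used by Log_Open and Log_Close, here is a stripped-down sketch of the same named-pipe technique. The function name demo_fifo_logging and the temporary paths are made up for the illustration. Unlike Log_Close, it simply waits for tee, which is only safe because nothing else keeps the pipe open here:

```bash
#!/bin/bash
# Minimal sketch of the named-pipe logging technique (illustrative names and paths)
demo_fifo_logging() {
    local logfile=$1
    local tmpdir pipe teepid
    tmpdir=$(mktemp -d)
    pipe="$tmpdir/log.pipe"
    mkfifo -m 600 "$pipe"

    exec 3>&1                       # keep a copy of the original stdout on fd 3
    tee "$logfile" <"$pipe" >&3 &   # tee copies the pipe to the logfile and to fd 3
    teepid=$!
    exec 1>"$pipe"                  # from now on, stdout feeds the pipe

    echo "this line reaches both the terminal and $logfile"

    exec 1>&3 3>&-                  # restore stdout; tee then sees end-of-file
    wait "$teepid"                  # fine here: no child process holds the pipe open
    rm -rf "$tmpdir"
}

demo_fifo_logging "${TMPDIR:-/tmp}/fifo_demo_$$.log"
```

Everything echoed between the two exec calls lands in the logfile and on the terminal, without a single explicit redirection on the commands themselves.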

Bash tips & tricks [ep. 4]: Use logging levels

Posted on March 21, 2016 by Ludovico

This is the fourth episode of a small series.

Description:

Support different logging levels natively in your scripts so that your code will be more stable and maintainable.

BAD:

#!/bin/bash -l
...
# for debug only, comment out when OK
echo $a 
do_something $a

# echo $? # sometimes does not work?

GOOD:

There is nothing to invent: there are already a few blog posts around about best practices for log messages. I personally like the one from Michael Wayne Goodman:

http://www.goodmami.org/2011/07/04/Simple-logging-in-BASH-scripts.html

I have reused his code in my scripts with very few modifications to fit my needs:

### verbosity levels
silent_lvl=0
crt_lvl=1
err_lvl=2
wrn_lvl=3
ntf_lvl=4
inf_lvl=5
dbg_lvl=6

## esilent prints output even in silent mode
function esilent () { verb_lvl=$silent_lvl elog "$@" ;}
function enotify () { verb_lvl=$ntf_lvl elog "$@" ;}
function eok ()    { verb_lvl=$ntf_lvl elog "SUCCESS - $@" ;}
function ewarn ()  { verb_lvl=$wrn_lvl elog "${colylw}WARNING${colrst} - $@" ;}
function einfo ()  { verb_lvl=$inf_lvl elog "${colwht}INFO${colrst} ---- $@" ;}
function edebug () { verb_lvl=$dbg_lvl elog "${colgrn}DEBUG${colrst} --- $@" ;}
function eerror () { verb_lvl=$err_lvl elog "${colred}ERROR${colrst} --- $@" ;}
function ecrit ()  { verb_lvl=$crt_lvl elog "${colpur}FATAL${colrst} --- $@" ;}
function edumpvar () { for var in $@ ; do edebug "$var=${!var}" ; done }
function elog() {
        if [ $verbosity -ge $verb_lvl ]; then
                datestring=`date +"%Y-%m-%d %H:%M:%S"`
                echo -e "$datestring - $@"
        fi
}

The edumpvar function is handy for dumping the values of several variables at once:

#!/bin/bash -l
# code
#...

verbosity=6

edumpvar ORACLE_SID ORACLE_HOME

<output>
2016-03-15 23:06:10 - DEBUG --- ORACLE_SID=orcl12c
2016-03-15 23:06:10 - DEBUG --- ORACLE_HOME=/u01/app/oracle/product/12.1.0.2
</output>

If you couple the verbosity level with input parameters you can have something quite clever (e.g. -s for silent, -V for verbose, -G for debug). I’m putting everything into one single snippet just as an example, but as you can imagine, you should seriously put all the fixed variables and functions inside an external file that you systematically include in your scripts:

#!/bin/bash -l

colblk='\033[0;30m' # Black - Regular
colred='\033[0;31m' # Red
colgrn='\033[0;32m' # Green
colylw='\033[0;33m' # Yellow
colpur='\033[0;35m' # Purple
colwht='\033[0;37m' # White (used by einfo)
colrst='\033[0m'    # Text Reset

verbosity=4

### verbosity levels
silent_lvl=0
crt_lvl=1
err_lvl=2
wrn_lvl=3
ntf_lvl=4
inf_lvl=5
dbg_lvl=6

## esilent prints output even in silent mode
function esilent () { verb_lvl=$silent_lvl elog "$@" ;}
function enotify () { verb_lvl=$ntf_lvl elog "$@" ;}
function eok ()    { verb_lvl=$ntf_lvl elog "SUCCESS - $@" ;}
function ewarn ()  { verb_lvl=$wrn_lvl elog "${colylw}WARNING${colrst} - $@" ;}
function einfo ()  { verb_lvl=$inf_lvl elog "${colwht}INFO${colrst} ---- $@" ;}
function edebug () { verb_lvl=$dbg_lvl elog "${colgrn}DEBUG${colrst} --- $@" ;}
function eerror () { verb_lvl=$err_lvl elog "${colred}ERROR${colrst} --- $@" ;}
function ecrit ()  { verb_lvl=$crt_lvl elog "${colpur}FATAL${colrst} --- $@" ;}
function edumpvar () { for var in $@ ; do edebug "$var=${!var}" ; done }
function elog() {
        if [ $verbosity -ge $verb_lvl ]; then
                datestring=`date +"%Y-%m-%d %H:%M:%S"`
                echo -e "$datestring - $@"
        fi
}

OPTIND=1
while getopts ":sVG" opt ; do
        case $opt in
        s)
                verbosity=$silent_lvl
                edebug "-s specified: Silent mode"
                ;;
        V)
                verbosity=$inf_lvl
                edebug "-V specified: Verbose mode"
                ;;
        G)
                verbosity=$dbg_lvl
                edebug "-G specified: Debug mode"
                ;;
        esac
done

ewarn "this is a warning"
eerror "this is an error"
einfo "this is an information"
edebug "debugging"
ecrit "CRITICAL MESSAGE!"
edumpvar ORACLE_SID

Example:

$ example.sh -s

$ example.sh

$ example.sh -V

$ example.sh -G
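As a rough guide to what each switch lets through, here is a simplified sketch of the comparison done in elog, without timestamps or colours. The visible_levels helper and the level labels are hypothetical names for this illustration; the numeric values are the ones defined in the snippet above:

```bash
#!/bin/bash
# Lists the message levels that pass the "verbosity >= level" test used by elog
visible_levels() {   # usage: visible_levels <verbosity>
    local verbosity=$1 entry shown=""
    for entry in crit:1 error:2 warning:3 notify:4 info:5 debug:6; do
        if [ "$verbosity" -ge "${entry##*:}" ]; then
            shown="$shown ${entry%%:*}"
        fi
    done
    echo "${shown# }"
}

for v in 0 4 5 6; do    # -s, default, -V, -G
    echo "verbosity=$v: $(visible_levels "$v")"
done
```

With the default verbosity of 4, critical, error, warning and notify messages pass the test, while info and debug messages (including the edumpvar output) are filtered out.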

It does not take into account the output file. That will be part of the next tip 🙂

Bash tips & tricks [ep. 3]: Colour your terminal!

Posted on March 18, 2016 by Ludovico

This is the third episode of a small series.

Description:

The days of monochrome green-on-black screens are over: in a remote shell terminal you can have something fancier!

BAD:

GOOD:

Define a series of variables as shortcuts for the color escape codes. There are plenty of examples on the internet.

        colblk='\033[0;30m' # Black - Regular
        colred='\033[0;31m' # Red
        colgrn='\033[0;32m' # Green
        colylw='\033[0;33m' # Yellow
        colblu='\033[0;34m' # Blue
        colpur='\033[0;35m' # Purple
        colcyn='\033[0;36m' # Cyan
        colwht='\033[0;37m' # White
        colbblk='\033[1;30m' # Black - Bold
        colbred='\033[1;31m' # Red
        colbgrn='\033[1;32m' # Green
        colbylw='\033[1;33m' # Yellow
        colbblu='\033[1;34m' # Blue
        colbpur='\033[1;35m' # Purple
        colbcyn='\033[1;36m' # Cyan
        colbwht='\033[1;37m' # White
        colublk='\033[4;30m' # Black - Underline
        colured='\033[4;31m' # Red
        colugrn='\033[4;32m' # Green
        coluylw='\033[4;33m' # Yellow
        colublu='\033[4;34m' # Blue
        colupur='\033[4;35m' # Purple
        colucyn='\033[4;36m' # Cyan
        coluwht='\033[4;37m' # White
        colbgblk='\033[40m'   # Black - Background
        colbgred='\033[41m'   # Red
        colbggrn='\033[42m'   # Green
        colbgylw='\033[43m'   # Yellow
        colbgblu='\033[44m'   # Blue
        colbgpur='\033[45m'   # Purple
        colbgcyn='\033[46m'   # Cyan
        colbgwht='\033[47m'   # White
        colrst='\033[0m'    # Text Reset

Use them whenever you need to highlight the output of a script, and possibly integrate them in a smart prompt (like the one I’ve blogged about some time ago).

The echo builtin command requires -e in order to make the colours work. When reading files, cat works, while less requires -r. vi may work with some hacking, but it’s not worth spending too much time on it, IMHO.
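A short sketch of the difference, assuming bash with its default echo behaviour and using three of the variables defined above (the messages are just examples):

```bash
#!/bin/bash
colred='\033[0;31m' # Red
colgrn='\033[0;32m' # Green
colrst='\033[0m'    # Text Reset

# -e makes echo interpret the \033 escape, so ERROR is shown in red
echo -e "${colred}ERROR${colrst} something went wrong"

# printf with %b interprets the same escapes
printf '%b\n' "${colgrn}OK${colrst} all good"

# without -e, bash prints the escape codes literally
echo "${colred}no colour here${colrst}"
```

When paging a coloured logfile, less -R is an alternative to -r that passes through only the ANSI colour sequences.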

Bash tips & tricks [ep. 2]: Have a smart environment for personal accounts

Posted on March 17, 2016 by Ludovico

This is the second episode of a small series.

Description:

The main technical account (oracle here) usually has the smart environment, with aliases, scripts available at your fingertips, correct environment variables and functions.

When working with personal accounts, it may be tedious to set up the environment again at each login, copy it from a golden copy or reinvent the wheel every time.

BAD:

Login: ludo
Password:

-bash-4.1$  env
HOSTNAME=testsrv
TERM=xterm
SHELL=/bin/bash
SSH_CLIENT=w.x.y.z 65373 22
OLDPWD=/home/ludo
SSH_TTY=/dev/pts/0
USER=ludo
LS_COLORS=...
MAIL=/var/spool/mail/ludo
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
PWD=/home/ludo
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
SHLVL=1
HOME=/home/ludo
LOGNAME=ludo
LESSOPEN=||/usr/bin/lesspipe.sh %s
_=/bin/env

-bash-4.1$ typeset -f | grep '()'
_module ()
    COMPREPLY=();
_module_avail ()
_module_long_arg_list ()
_module_not_yet_loaded ()
module ()

-bash-4.1$ vi .bash_profile
... damn, let's make this environment smarter
...

GOOD:

Distribute a standard .bash_profile that calls a central profile script valid for all the users:

# [ ludo@testsrv:/home/ludo [15:53:18] [12.1.0.2 env:orcl12c] 0 ] #
# cat .bash_profile
# .bash_profile

#################################################
# WARNING: This script is controlled by puppet.
# If you need to override or add something
# please use ~/.bash_profile_local
#################################################

if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# load oracle common environment
. /u01/app/oracle/scripts/sbin/ora_profile

[ -f $HOME/.bash_profile_local ] && . $HOME/.bash_profile_local

# [ ludo@testsrv:/home/ludo [15:53:21] [12.1.0.2 env:orcl12c] 0 ] #
#

Make your common environment as smart as possible. If any commands need to be run differently depending on the user (oracle or not oracle), just use a simple if:

if [ "$USER" != "oracle" ] ; then
        alias vioratab='sudoedit -u oracle $ORATAB'
else
        alias vioratab='vi $ORATAB'
fi

The goal of course is to avoid as much typing as you can, and to let all your colleagues benefit from the smart environment.

Bash tips & tricks [ep. 1]: Deal with personal accounts and file permissions

Posted on March 16, 2016 by Ludovico

This is the first episode of a mini series of Bash tips for Linux (in case you are wondering, yes, they are respectively my favorite shell and my favorite OS 😉 ).

Episode 1: Deal with personal accounts and file permissions
Episode 2: Have a smart environment for personal accounts
Episode 3: Colour your terminal!
Episode 4: Use logging levels
Episode 5: Write the output to a logfile
Episode 6: Check the exit code
Episode 7: Cleanup on EXIT with a trap

Description:

Nowadays it is mandatory at many companies to log in on Linux servers with a personal account (integrated with LDAP, Kerberos or whatever else) to comply with strict auditing rules.

I need to be sure that I have an environment where my modifications do not conflict with my colleagues’ environment.

BAD:

-bash-4.1$ id
uid=20928(ludo) gid=200(dba) groups=200(dba)
-bash-4.1$ ls -lia
total 8
8196 drwxrwxr-x   2 oracle dba  4096 Mar 15 15:14 .
   2 drwxrwxrwt. 14 root   root 4096 Mar 15 15:15 ..
-bash-4.1$ vi script.sh
... edit here...
-bash-4.1$ ls -l
total 4
-rw-r--r-- 1 ludo  dba 8 Mar 15 15:15 script.sh
-bash-4.1$

The script has been created by me, but my colleagues may need to modify it! So I need to change the ownership:

$ chown oracle:dba script.sh
chown: changing ownership of `script.sh': Operation not permitted
$

But I can only change the permissions:

$ chmod 775 script.sh
$

If I really want to change the owner, I have to ask someone who has root privileges, or delete the file with my account and create it again with the correct one (oracle or something else).

GOOD:

Set the setgid bit at the directory level
Define an alias for my favorite editor that uses sudoedit instead:

$ chmod 2751 .
$ ls -lia
total 4
8196 drwxr-s--x 2 oracle dba  4096 Mar 15 15:26 .
$ alias vi='SUDO_EDITOR=/usr/bin/vim sudoedit -u oracle '
$ vi script.sh
[sudo] password for ludo:
... edit here ...
$ ls -l script.sh
total 8
-rw-r--r-- 1 oracle dba 6 Mar 15 15:24 script.sh
$
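To see the setgid part on its own, here is a small sketch that can be run in a scratch directory. It assumes GNU stat (standard on Linux); the make_shared_dir helper and the paths are hypothetical, and the mode is the same 2751 used above:

```bash
#!/bin/bash
# Create a directory with the setgid bit: new files inherit the directory's group
make_shared_dir() {   # usage: make_shared_dir <directory>
    mkdir -p "$1" && chmod 2751 "$1"
}

demo=$(mktemp -d)
make_shared_dir "$demo/scripts"
touch "$demo/scripts/script.sh"

# %A = permissions, %U = owner, %G = group, %n = name
stat -c '%A %U %G %n' "$demo/scripts" "$demo/scripts/script.sh"

rm -rf "$demo"
```

The directory shows up as drwxr-s--x, and the file created inside it carries the directory's group. The setgid bit takes care of the group only; the owner is still whoever created the file, which is why the sudoedit alias is needed as well.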

In case I need to modify other files with MY account, I can either use the full path (/usr/bin/vim) or define another alias:

alias vime="/usr/bin/vim"

Migrating Oracle RAC from SuSE to OEL (or RHEL) live

Posted on November 10, 2015 by Ludovico

I have a customer that needs to migrate its Oracle RAC cluster from SuSE to OEL.

I know, I know, there is a paper from Dell and Oracle named:

How Dell Migrated from SUSE Linux to Oracle Linux

It explains how Dell migrated its many RAC clusters from SuSE to OEL. The problem is that they used a different strategy:

– backup the configuration of the nodes
– then for each node, one at a time
– stop the node
– reinstall the OS
– restore the configuration and the Oracle binaries
– relink
– restart

What I want to achieve instead is:
– add one OEL node to the SuSE cluster as new node
– remove one SuSE node from the now-mixed cluster
– install/restore/relink the RDBMS software (RAC) on the new node
– move the RAC instances to the new node (taking care to NOT run more than the number of licensed nodes/CPUs at any time)
– repeat (for the remaining nodes)

because the customer will also migrate to new hardware.

In order to test this migration path, I’ve set up a SINGLE NODE cluster (if it works for one node, it will work for two or more).

oracle@sles01:~> crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       sles01                   STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       sles01                   STABLE
ora.asm
               ONLINE  ONLINE       sles01                   Started,STABLE
ora.net1.network
               ONLINE  ONLINE       sles01                   STABLE
ora.ons
               ONLINE  ONLINE       sles01                   STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       sles01                   STABLE
ora.cvu
      1        ONLINE  ONLINE       sles01                   STABLE
ora.oc4j
      1        OFFLINE OFFLINE                               STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       sles01                   STABLE
ora.sles01.vip
      1        ONLINE  ONLINE       sles01                   STABLE
--------------------------------------------------------------------------------
oracle@sles01:~> cat /etc/issue

Welcome to SUSE Linux Enterprise Server 11 SP4  (x86_64) - Kernel \r (\l).

I have to set up the new node addition carefully, mainly as I would do with a traditional node addition:

Add new ip addresses (public, private, vip) to the DNS/hosts
Install the new OEL server
Keep the same user and groups (uid, gid, etc)
Verify the network connectivity and setup SSH equivalence
Check that the multicast connection is ok
Add the storage, configure persistent naming (udev) and verify that the disks (major, minor, names) are the very same
The network cards also must be the very same
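The "same user and groups" item in the list above is easy to verify before running cluvfy. The following is only a sketch: it assumes SSH equivalence is already in place, the two function names are hypothetical, and the node names are the ones from my test setup:

```bash
#!/bin/bash
# Compare the output of `id <user>` collected from two nodes
same_identity() {   # usage: same_identity <id output node A> <id output node B>
    [ "$1" = "$2" ]
}

check_user_on_nodes() {   # usage: check_user_on_nodes <user> <node A> <node B>
    local user=$1 a b
    a=$(ssh "$2" id "$user") || return 2
    b=$(ssh "$3" id "$user") || return 2
    if same_identity "$a" "$b"; then
        echo "OK: $user is identical on $2 and $3"
    else
        echo "MISMATCH for $user:"
        echo "  $2: $a"
        echo "  $3: $b"
        return 1
    fi
}

# Example call (needs working SSH, so it is not executed here):
# check_user_on_nodes oracle sles01 rhel01
```

A plain string comparison is strict: it also flags a different ordering of the supplementary groups, so a mismatch deserves a look at the two lines before concluding that the uid or gid really differ.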

Once the new host is ready, the cluvfy stage -pre nodeadd check will likely fail due to:

Kernel release mismatch
Package mismatch

Here’s an example of output:

oracle@sles01:~> cluvfy stage -pre nodeadd -n rhel01

Performing pre-checks for node addition

Checking node reachability...
Node reachability check passed from node "sles01"


Checking user equivalence...
User equivalence check passed for user "oracle"
Package existence check passed for "cvuqdisk"

Checking CRS integrity...

CRS integrity check passed

Clusterware version consistency passed.

Checking shared resources...

Checking CRS home location...
Location check passed for: "/u01/app/12.1.0/grid"
Shared resources check for node addition passed


Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Check: Node connectivity using interfaces on subnet "192.168.56.0"
Node connectivity passed for subnet "192.168.56.0" with node(s) sles01,rhel01
TCP connectivity check passed for subnet "192.168.56.0"


Check: Node connectivity using interfaces on subnet "172.16.100.0"
Node connectivity passed for subnet "172.16.100.0" with node(s) rhel01,sles01
TCP connectivity check passed for subnet "172.16.100.0"

Checking subnet mask consistency...
Subnet mask consistency check passed for subnet "192.168.56.0".
Subnet mask consistency check passed for subnet "172.16.100.0".
Subnet mask consistency check passed.

Node connectivity check passed

Checking multicast communication...

Checking subnet "172.16.100.0" for multicast communication with multicast group "224.0.0.251"...
Check of subnet "172.16.100.0" for multicast communication with multicast group "224.0.0.251" passed.

Check of multicast communication passed.
Total memory check passed
Available memory check passed
Swap space check passed
Free disk space check passed for "sles01:/usr,sles01:/var,sles01:/etc,sles01:/u01/app/12.1.0/grid,sles01:/sbin,sles01:/tmp"
Free disk space check passed for "rhel01:/usr,rhel01:/var,rhel01:/etc,rhel01:/u01/app/12.1.0/grid,rhel01:/sbin,rhel01:/tmp"
Check for multiple users with UID value 1101 passed
User existence check passed for "oracle"
Run level check passed
Hard limits check passed for "maximum open file descriptors"
Soft limits check passed for "maximum open file descriptors"
Hard limits check passed for "maximum user processes"
Soft limits check passed for "maximum user processes"
System architecture check passed

WARNING:
PRVF-7524 : Kernel version is not consistent across all the nodes.
Kernel version = "3.0.101-63-default" found on nodes: sles01.
Kernel version = "3.8.13-16.2.1.el6uek.x86_64" found on nodes: rhel01.
Kernel version check passed
Kernel parameter check passed for "semmsl"
Kernel parameter check passed for "semmns"
Kernel parameter check passed for "semopm"
Kernel parameter check passed for "semmni"
Kernel parameter check passed for "shmmax"
Kernel parameter check passed for "shmmni"
Kernel parameter check passed for "shmall"
Kernel parameter check passed for "file-max"
Kernel parameter check passed for "ip_local_port_range"
Kernel parameter check passed for "rmem_default"
Kernel parameter check passed for "rmem_max"
Kernel parameter check passed for "wmem_default"
Kernel parameter check passed for "wmem_max"
Kernel parameter check passed for "aio-max-nr"
Package existence check passed for "make"
Package existence check passed for "libaio"
Package existence check passed for "binutils"
Package existence check passed for "gcc(x86_64)"
Package existence check passed for "gcc-c++(x86_64)"
Package existence check passed for "glibc"
Package existence check passed for "glibc-devel"
Package existence check passed for "ksh"
Package existence check passed for "libaio-devel"
Package existence check failed for "libstdc++33"
Check failed on nodes:
        rhel01
Package existence check failed for "libstdc++43-devel"
Check failed on nodes:
        rhel01
Package existence check passed for "libstdc++-devel(x86_64)"
Package existence check failed for "libstdc++46"
Check failed on nodes:
        rhel01
Package existence check failed for "libgcc46"
Check failed on nodes:
        rhel01
Package existence check passed for "sysstat"
Package existence check failed for "libcap1"
Check failed on nodes:
        rhel01
Package existence check failed for "nfs-kernel-server"
Check failed on nodes:
        rhel01
Check for multiple users with UID value 0 passed
Current group ID check passed

Starting check for consistency of primary group of root user

Check for consistency of root user's primary group passed
Group existence check passed for "asmadmin"
Group existence check passed for "asmoper"
Group existence check passed for "asmdba"

Checking ASMLib configuration.
Check for ASMLib configuration passed.

Checking OCR integrity...

OCR integrity check passed

Checking Oracle Cluster Voting Disk configuration...

Oracle Cluster Voting Disk configuration check passed
Time zone consistency check passed

Starting Clock synchronization checks using Network Time Protocol(NTP)...

NTP Configuration file check started...
No NTP Daemons or Services were found to be running

Clock synchronization check using Network Time Protocol(NTP) passed


User "oracle" is not part of "root" group. Check passed
Checking integrity of file "/etc/resolv.conf" across nodes

"domain" and "search" entries do not coexist in any  "/etc/resolv.conf" file
All nodes have same "search" order defined in file "/etc/resolv.conf"
PRVF-5636 : The DNS response time for an unreachable node exceeded "15000" ms on following nodes: sles01,rhel01

Check for integrity of file "/etc/resolv.conf" failed


Checking integrity of name service switch configuration file "/etc/nsswitch.conf" ...
Check for integrity of name service switch configuration file "/etc/nsswitch.conf" passed


Pre-check for node addition was unsuccessful on all the nodes.

So the question is not whether the check succeeds (it will not), but what exactly fails.
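To focus on what fails, the saved cluvfy output can be reduced to its findings. The following is a minimal sketch, not part of cluvfy: the function name `cluvfy_findings` and the log file name are hypothetical, and the patterns are only inferred from the output shown above (lines reporting a failed check, the node list that follows them, and `PRVF-` messages).

```bash
#!/usr/bin/env bash
# Reduce saved cluvfy output (read on stdin) to its findings.
# Patterns are assumptions based on the 12.1 output shown in this post.
cluvfy_findings() {
    awk '
        # Coded warnings and errors such as PRVF-7524 or PRVF-5636
        /^PRVF-[0-9]+/ { print; next }

        # The node list is printed on the line after this marker
        /Check failed on nodes:/ {
            getline node
            gsub(/^[ \t]+/, "", node)
            print "    on node(s): " node
            next
        }

        # Any other line that reports a failed check
        / failed( |$)/ { print }
    '
}

# Example usage (hypothetical log file name):
#   cluvfy stage -pre nodeadd -n rhel01 > /tmp/cluvfy_nodeadd.log
#   cluvfy_findings < /tmp/cluvfy_nodeadd.log
```

With the findings isolated, it is easier to separate the failures that come from the SLES/OEL difference (here, the SLES-specific package names) from those that have to be fixed before adding the node.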

Solving all the problems that are not related to the SuSE/OEL difference is crucial, because addnode.sh will fail with the same errors. For the remaining ones, I need to run it with the -ignorePrereq and -ignoreSysPrereqs switches. Let’s see how it works:

oracle@sles01:/u01/app/12.1.0/grid/addnode> ./addnode.sh -silent "CLUSTER_NEW_NODES={rhel01}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={rhel01-vip}" -ignorePrereq -ignoreSysPrereqs
Starting Oracle Universal Installer...

Checking Temp space: must be greater than 120 MB.   Actual 27479 MB    Passed
Checking swap space: must be greater than 150 MB.   Actual 2032 MB    Passed

Prepare Configuration in progress.

Prepare Configuration successful.
..................................................   9% Done.
You can find the log of this install session at:
 /u01/app/oraInventory/logs/addNodeActions2015-11-09_09-57-16PM.log

Instantiate files in progress.

Instantiate files successful.
..................................................   15% Done.

Copying files to node in progress.

Copying files to node successful.
..................................................   79% Done.

Saving cluster inventory in progress.
..................................................   87% Done.

Saving cluster inventory successful.
The Cluster Node Addition of /u01/app/12.1.0/grid was successful.
Please check '/tmp/silentInstall.log' for more details.

As a root user, execute the following script(s):
        1. /u01/app/oraInventory/orainstRoot.sh
        2. /u01/app/12.1.0/grid/root.sh

Execute /u01/app/oraInventory/orainstRoot.sh on the following nodes:
[rhel01]
Execute /u01/app/12.1.0/grid/root.sh on the following nodes:
[rhel01]

The scripts can be executed in parallel on all the nodes. If there are any policy managed databases managed by cluster, proceed with the addnode procedure without executing the root.sh script. Ensure that root.sh script is executed after all the policy managed databases managed by clusterware are extended to the new nodes.
..........
Update Inventory in progress.
..................................................   100% Done.

Update Inventory successful.
Successfully Setup Software.

Then, as instructed by addnode.sh, I run root.sh on the new node and expect it to work:

[oracle@rhel01 install]$ sudo /u01/app/12.1.0/grid/root.sh
Performing root user operation for Oracle 12c

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /u01/app/12.1.0/grid
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Relinking oracle with rac_on option
Using configuration parameter file: /u01/app/12.1.0/grid/crs/install/crsconfig_params
2015/11/09 23:18:42 CLSRSC-363: User ignored prerequisites during installation

OLR initialization - successful
2015/11/09 23:19:08 CLSRSC-330: Adding Clusterware entries to file 'oracle-ohasd.conf'

CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Oracle High Availability Services has been started.
CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Oracle High Availability Services has been started.
CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'rhel01'
CRS-2672: Attempting to start 'ora.evmd' on 'rhel01'
CRS-2676: Start of 'ora.mdnsd' on 'rhel01' succeeded
CRS-2676: Start of 'ora.evmd' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rhel01'
CRS-2676: Start of 'ora.gpnpd' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'rhel01'
CRS-2676: Start of 'ora.gipcd' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rhel01'
CRS-2676: Start of 'ora.cssdmonitor' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rhel01'
CRS-2672: Attempting to start 'ora.diskmon' on 'rhel01'
CRS-2676: Start of 'ora.diskmon' on 'rhel01' succeeded
CRS-2789: Cannot stop resource 'ora.diskmon' as it is not running on server 'rhel01'
CRS-2676: Start of 'ora.cssd' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rhel01'
CRS-2672: Attempting to start 'ora.ctssd' on 'rhel01'
CRS-2676: Start of 'ora.ctssd' on 'rhel01' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rhel01'
CRS-2676: Start of 'ora.asm' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rhel01'
CRS-2676: Start of 'ora.storage' on 'rhel01' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rhel01'
CRS-2676: Start of 'ora.crsd' on 'rhel01' succeeded
CRS-6017: Processing resource auto-start for servers: rhel01
CRS-2672: Attempting to start 'ora.ons' on 'rhel01'
CRS-2676: Start of 'ora.ons' on 'rhel01' succeeded
CRS-6016: Resource auto-start has completed for server rhel01
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.
2015/11/09 23:22:06 CLSRSC-343: Successfully started Oracle clusterware stack

clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 12c Release 1.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Preparing packages for installation...
cvuqdisk-1.0.9-1
2015/11/09 23:22:23 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded

Bingo! Let’s check if everything is up and running:

[oracle@rhel01 ~]$ /u01/app/12.1.0/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       rhel01                   STABLE
               ONLINE  ONLINE       sles01                   STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       rhel01                   STABLE
               ONLINE  ONLINE       sles01                   STABLE
ora.asm
               ONLINE  ONLINE       rhel01                   Started,STABLE
               ONLINE  ONLINE       sles01                   Started,STABLE
ora.net1.network
               ONLINE  ONLINE       rhel01                   STABLE
               ONLINE  ONLINE       sles01                   STABLE
ora.ons
               ONLINE  ONLINE       rhel01                   STABLE
               ONLINE  ONLINE       sles01                   STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       sles01                   STABLE
ora.cvu
      1        ONLINE  ONLINE       sles01                   STABLE
ora.oc4j
      1        OFFLINE OFFLINE                               STABLE
ora.rhel01.vip
      1        ONLINE  ONLINE       rhel01                   STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       sles01                   STABLE
ora.sles01.vip
      1        ONLINE  ONLINE       sles01                   STABLE
--------------------------------------------------------------------------------
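Reading the table by eye is fine for two nodes, but the same check can be scripted. This is a minimal sketch under stated assumptions: the function name `crs_not_online` is hypothetical, and the parsing relies only on the column layout of the `crsctl stat res -t` output shown above (resource name on its own line, then Target and State, with an instance number in front for cluster resources). It prints every resource whose target is ONLINE but whose state is not.

```bash
#!/usr/bin/env bash
# List resources whose Target is ONLINE but whose State is something else.
# Reads `crsctl stat res -t` output on stdin; layout assumed as shown above.
crs_not_online() {
    awk '
        # Resource names start the line, e.g. ora.asm
        /^ora\./ { res = $1; next }

        {
            # Cluster resources have an instance number in the first column,
            # so Target/State are shifted by one field.
            i = ($1 ~ /^[0-9]+$/) ? 2 : 1
            if ($i == "ONLINE" && $(i + 1) != "ONLINE")
                print res ": target ONLINE, state " $(i + 1)
        }
    '
}

# Example usage:
#   /u01/app/12.1.0/grid/bin/crsctl stat res -t | crs_not_online
```

For the output above it prints nothing: ora.oc4j is OFFLINE, but its target is OFFLINE as well, so it is not reported.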

[oracle@rhel01 ~]$ olsnodes -s
sles01  Active
rhel01  Active

[oracle@rhel01 ~]$ ssh rhel01 uname -r
3.8.13-16.2.1.el6uek.x86_64
[oracle@rhel01 ~]$ ssh sles01 uname -r
3.0.101-63-default

[oracle@rhel01 ~]$ ssh rhel01 cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[oracle@rhel01 ~]$ ssh sles01 cat /etc/issue
Welcome to SUSE Linux Enterprise Server 11 SP4  (x86_64) - Kernel \r (\l).

So yes, it works, but remember that it’s not a supported long-term configuration.

In my case, I expect to migrate the whole cluster from SLES to OEL within a single day.

NOTE: using OEL6 as the new target is easy because the interface names do not change. OEL7 uses a new interface naming scheme, so if you need to migrate without cluster downtime, set up the new OEL7 nodes so that their interface names match the existing ones by following this post: http://ask.xmodulo.com/change-network-interface-name-centos7.html

Otherwise, you need to configure a new interface name for the cluster with oifcfg.
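As a rough illustration of the oifcfg route, the sketch below only prints the commands instead of running them. The helper name `oifcfg_rename_plan` and the interface names (eth1, ens224) are hypothetical, and treating 172.16.100.0 as the interconnect subnet is an assumption based on the multicast check shown earlier. It is not the complete procedure: changing a cluster interface usually involves further steps (OS network configuration, restarting the stack, and possibly the network resource for the public interface), so check the Oracle documentation for your version first.

```bash
#!/usr/bin/env bash
# Print (do NOT run) the oifcfg commands that would register a new interface
# name for a subnet and then drop the old one. All names are illustrative.
oifcfg_rename_plan() {
    old_if=$1    # current interface name, e.g. eth1
    new_if=$2    # interface name on the new OS, e.g. ens224
    subnet=$3    # subnet served by the interface
    if_type=$4   # public or cluster_interconnect

    echo "oifcfg setif -global ${new_if}/${subnet}:${if_type}"
    echo "oifcfg delif -global ${old_if}/${subnet}"
    echo "oifcfg getif"
}

# Hypothetical example for the private network of this cluster
oifcfg_rename_plan eth1 ens224 172.16.100.0 cluster_interconnect
```

Printing the plan first keeps the change reviewable. The final `oifcfg getif` is there to confirm what the cluster has registered once the commands have actually been run.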

HTH

—

Ludovico