Clustering the RMAN catalog on a RAC environment

January 13th, 2009

You have your brand new RAC deployed on a cluster and you want to manage your backups through a recovery catalog.
Suppose you don’t have a dedicate server to host your catalog, perhaps you wouldn’t configure your catalog as a RAC database: so why don’t you use Clusterware to configure your catalog as a single instance in cold failover?

OTN has a very nice whitepaper describing how to protect a single instance database. This can be nicely applied on 10g, 10gR2 or 11g: Using Oracle Clusterware to Protect A Single Instance Oracle Database 11g.

Clusterware is appealing also for traditional cold failover clusters. Licensing allows you to use Clusterware as far as you protect Oracle software or 3rd party software that use Oracle as database backend.

Quick Oracle Dataguard check script

January 5th, 2009

Oracle Dataguard has his own command-line dgmgrl to check the whole dataguard configuration status.
At least you should check that the show configuration command returns SUCCESS.

This is an hypothetic script:

#!/bin/bash
export ORACLE_HOME=/u1/app/oracle/product/10.2.0
export ORACLE_SID=orcldg
result=`echo "show configuration;" | \
  $ORACLE_HOME/bin/dgmgrl sys/strongpasswd | \
  grep -A 1 "Current status for" | grep -v "Current status for"`
if [ "$result" = "SUCCESS" ] ; then
    exit 0
else
    exit 1
fi

Another script should check for the gap between production online log and the log stream received by the standby database. This can be accomplished with v$managed_standby view.
The Total Block Gap between production and standby can be calculated this way:
Sum all blocks from v$archived_logs where seq# between Current Standby Seq# and Current Production Seq#. Then add current block# of the production LGWR process and subtract current block# from RFS standby process. This gives you total blocks even if there is a log sequence gap between sites.
This is NOT the gap of online log APPLIED to the standby database. THIS IS THE GAP OF ONLINE LOG TRANSMITTED TO THE STANDBY RFS PROCESS and can be used to monitor your dataguard transmission from production to disaster recovery environment.

This is an excerpt of such script (please take care that it does not check against RFS failures, so it can fails when RFS is not alive):

#!/u1/app/oracle/product/10.2.0/perl/bin/perl -w
use DBI;
use DBD::Oracle qw(:ora_session_modes);
# DB connection #
my $prod  = "orclprod";
my $stby = "orcldr";
my $prodh;
unless ($prodh = DBI->connect('dbi:Oracle:'.$prod,
  'sys', 'strongpassword',
  {PrintError=>0, AutoCommit => 0,
  ora_session_mode => ORA_SYSDBA}))  {
    print "Error connecting to DB: $DBI::errstr\n";
        exit(1);
}
$prodh->{RaiseError}=1;
 
my $stbyh;
unless ($stbyh = DBI->connect('dbi:Oracle:'.$stby,
  'sys', 'strongpassword',
  {PrintError=>0, AutoCommit => 0,
  ora_session_mode => ORA_SYSDBA}))  {
    print "Error connecting to DB: $DBI::errstr\n";
        $prodh->disconnect;
        exit(1);
}
$stbyh->{RaiseError}=1;
 
my $sth;
### query prod
$sth = $prodh->prepare( < <EOSQL );
        select SEQUENCE#, BLOCK# from v\$managed_standby
        where process='LGWR'
EOSQL
$sth->execute();
my ($psequence, $pblock) = $sth->fetchrow_array();
$sth->finish();
### query stdby
$sth = $stbyh->prepare( < <EOSQL );
        select SEQUENCE#, BLOCK# from v\$managed_standby
        where process='RFS' and client_process='LGWR'
EOSQL
$sth->execute();
my ($ssequence, $sblock) = $sth->fetchrow_array();
$sth->finish();
 
printf ("PROD   : %10d %10d\n", $psequence, $pblock);
printf ("STANDBY: %10d %10d\n", $ssequence, $sblock);
 
$sth = $stbyh->prepare( < <EOSQL );
        select nvl(sum(blocks),0)
        + $pblock - $sblock as BLOCK_GAP
    from v\$archived_log
        where sequence# between $ssequence and $psequence
EOSQL
$sth->execute();
my ($blockgap) = $sth->fetchrow_array();
$sth->finish();
printf ("%-10d blocks gap\n", $blockgap);
 
$stbyh->disconnect;
$prodh->disconnect;

Any comment is appreciated!

Tips: Bash Prompt and Oracle

December 30th, 2008
export PS1=\u@\h:\w\$

I disagree with default bash prompt. Do you? It’s quote common to work with long paths:

ludovico@host:/u01/app/oracle/product/10.2.0/network/admin$ \
/nooo/this/command/line/is/really/long/and/offcourse -I \
-will -wrap -my -command -line

and, when working on multi-database environments I need to check my environment:

env | grep -i oracle
#or
echo $ORACLE_SID
echo $ORACLE_HOME

I currently use this prompt, instead:

export PS1=$'\\n# [ $LOGNAME@\h:$PWD [\\t] [`ohvers` SID:${ORACLE_SID:-"no sid"}] ]\\n# '
 
# [ ludovico@caldara_2k:/u01/app/oracle/product/10.2.0/db_1/network/admin [23:15:58] [10.2.0 SID:orcl] ]
#

What is `ohvers`?? I defined this function to get the version of oracle from my ORACLE_HOME variable:

ohvers ()
{
echo -n $ORACLE_HOME | sed -n 's/.*\/\([[:digit:].]\+\)\/.*/\1/p'
}

Pros:

  • I have a blank line that separate my prompt from previous output
  • I get the system clock (useful when saving my konsole history. Did I say konsole?)
  • I can see my Oracle Environment before launching dangerous commands
  • I have an empty line to start my endless commands
  • I have a lot of sharps “#” : they are fine against wrong copy&paste operations…

Suggestions?

JBoss Portal and MySQL scalability: What The…???

December 10th, 2008

I found several queries running on a MySQL 5.0 database like this one:

SELECT PATH, NAME FROM JBP_OBJECT_NODE  WHERE PK IN (
SELECT NODE_KEY FROM JBP_OBJECT_NODE_SEC WHERE ROLE IN (
SELECT jr.jbp_name FROM jbp_users ju, jbp_role_membership jrm,
jbp_roles jr
WHERE jrm.jbp_uid = ju.jbp_uid
AND jr.jbp_rid = jrm.jbp_rid
AND ju.jbp_uname = 'LUDOVICO'
AND ju.jbp_enabled = 1));

This query is related to JBoss Portal and does a full scan on table JBP_OBJECT_NODE.

It has bad performances (>0.8 sec) with just a few records:

mysql> select count(*) from JBP_OBJECT_NODE;
+———-+
| count(*) |
+———-+
|    33461 |
+———-+

If I rewrite the query using an inner join (à la Oracle, please forgive me) instead of a subquery I get an index scan:

SELECT DISTINCT a.PATH, a.NAME /* , b.NODE_KEY */ FROM  JBP_OBJECT_NODE
a, JBP_OBJECT_NODE_SEC b
WHERE a.pk=b.NODE_KEY
AND b.ROLE IN (
SELECT jr.jbp_name FROM jbp_users ju, jbp_role_membership jrm,
jbp_roles jr
WHERE jrm.jbp_uid = ju.jbp_uid
AND jr.jbp_rid = jrm.jbp_rid
AND ju.jbp_uname = 'UTDEMO'
AND ju.jbp_enabled = 1);

With 30k records the execution time falls down from 0.8 secs to 0.01 secs…
That’s NOT all! I found this open bug:

https://jira.jboss.org/jira/browse/JBPORTAL-2040

With many users registered in, the JBoss Portal Admin console tooks over a minute to show a single page…

I don’t like portals…

MySQL 5.1 GA by December 6th!

November 28th, 2008

Ohhh, I can confirm this man is Giuseppe Maxia! and we can trust him if he say that MySQL 5.1 will be GA by December 6th!

http://www.flickr.com/photos/datacharmer/3031124529

More than one year in release candidate, I think it is enough!

Happy partitioning, happy a lot of things!!

Oracle RAC Standard Edition to achieve low cost and high performance

November 28th, 2008

I finished today to create a new production environment based on 2 Linux serverX86_64 and running Oracle RAC 10gR2. (I know, there is 11g right now, but I’m a conservative!)
Wheeew, I just spent a couple of hours applying all the recommended patches!
We choosed 2 nodes with a maximum of 2 multi-core processors each one so we can license Standard Edition instead of Enterprise Edition. 64bits addressing allow us to allocate many gigabytes of SGA. I’m starting with 5Gb but I think we’ll need more. And a set of 6x300Gb 15krpms disks (it can be expanded with more disks and more shelves).
This configuration keeps low the total cost of ownership but achieves best performance.
Due to disks layout, costs and needed usable storage, we had to configure one huge RAID5 on the SAN with multi-path. I decided anyway to create 2 ASM disk groups (ASM is mandatory for Standard Edition RAC), one for the DB, the second one for the recovery area. With spare disks we should have enough availability and even if it’s a RAID5 I saw good write performances (>150M/s).

Welcome new RAC, I hope we’ll feel good together!

JBOSS Cluster isolation and multicasting

November 24th, 2008

I configured two JBoss clusters in the same LAN: a production and a test environment.
I decided to configure every single cluster with a dedicate private LANs using a restricted netmask to isolate production and test connectivity, so I assigned
192.168.100.0/255.255.255.0 to test and
192.168.200.0/255.255.255.0 to production.
I configured Apache and mod_jk to loadbalance activities between cluster instances.

The page UsingMod_jk1.2WithJBoss (http://www.jboss.org/community/docs/DOC-12525) is a good tutorial to achieve this.

What problems should I expect?
JBoss uses UDP multicasting to replicate informations across cluster nodes: even if I isolate TCP traffic, JBoss will “ear” messages sent from other clusters and will log a lot of warnings like the following:

… WARN [NAKACK] [...] discarded message from non-member ….

I had to change BOTH multicast ip address and port (attributes mcast_addr and mcast_port) in the following configuration files:

  • ./deploy/jboss-web-cluster.sar/META-INF/jboss-service.xml
  • ./deploy/jmx-console.war/WEB-INF/web.xml
  • ./deploy/cluster-service.xml
  • ./deploy/ejb3-clustered-sfsbcache-service.xml
  • ./deploy/ejb3-entity-cache-service.xml

Good luck!

It’s time to trouble…

November 21st, 2008

Sometimes it’s hard to find enough time to write something or even to only THINK about writing something

The following are the projects I have to complete before the deadline of December 17th (at least if I still want to go on vacation…)

  • A totally new Oracle 10gR2 RAC SE on Linux (OCFS2, ASM) including jboss frontends, backups, monitoring, documentation. (Servers are ready today).
  • A Disaster recovery architecture based on Dataguard with scripts based on rsync to do filesystem replication, with failover and failback, including backups, monitoring, documentation. (The server in DR site is reachable via network today).
  • A 17 server infrastructure (among others a RAC 10gR2 on linux) transfer from Milan datacenter to here. It’s planned for december 11th but I have to crosscheck backup and contingency requirements.
  • A 14 server infrastructure (based on Windows and SqlServer) transfer from Milan datacenter to here. To be planned in december.
  • A totally new cold failover cluster based on linux with Oracle DBMS and E-business suite (Servers will be provided soon, I hope!).
  • A new standalone Windows Server 64bit to outstand the 32bit allocation bottleneck for a 500Gb oracle database (Server will be provided not before december 10th).
  • Normally manage the day-by-day work, including replying to e-mails and answering the phone.

AARGH!!

System triggers, stats$user_log and side effects

October 9th, 2008

Sometimes people get advice from internet: both Metalink or well-known consulting sites.
If people need a fix or a feature, they use to trust advices.

Last week I heard a collegue about a 10g RAC database with performance problems and, since I never lay on my chair, I probed both AWR and ADDM . I suddenly recognized heavy enqueues and physical reads
over a segment named STATS$USER_LOG. “Strange”, I said, “I cannot remember this name in neither perfstat or catalog segments”.
Then I searched the Internet and the Metalink and found the same thing in BOTH metalink.oracle.com and www.dba-oracle.com: a trick to trace logon and logoffs into a table using system triggers.

Look at this code:

create or replace trigger
logon_audit_trigger
AFTER LOGON ON DATABASE
BEGIN
insert into stats$user_log values(
user,
sys_context('USERENV','SESSIONID'),
sys_context('USERENV','HOST'),
sysdate,
to_char(sysdate, 'hh24:mi:ss'),
[...]
);
COMMIT;
END;
/

Cool, every single access is kept into stats$user_log.

Let’s see the logoff trigger:

create or replace trigger
logoff_audit_trigger
BEFORE LOGOFF ON DATABASE
BEGIN
-- ***************************************************
-- Update the last action accessed
-- ***************************************************
update stats$user_log [...]
--***************************************************
-- Update the last program accessed
-- ***************************************************
update stats$user_log [...]
-- ***************************************************
[ ... many, many updates ...]
-- ***************************************************
update stats$user_log [...]
-- ***************************************************
-- Compute the elapsed minutes
-- ***************************************************
update stats$user_log set elapsed_minutes =
round((logoff_day - logon_day)*1440)
where
sys_context('USERENV','SESSIONID') = session_id;
COMMIT;
END;
/

That’s all. It inserts a row when someone logons. It updates MANY rows when someone logoffs.
There is no match between the record inserted and the records updated (but the session_id).
Neither indexes or constraints.

What’s the matter?

What happens if we have many logons?

SQL> select num_rows from dba_tables where table_name='STATS$USER_LOG';

NUM_ROWS
———-
3053931

What happens if the execution plan does a full scan?

SQL> explain plan for update stats$user_log [...]

Explained.

SQL> @?/rdbms/admin/utlxpls
---------------------------------------------
| Id | Operation | Name |
---------------------------------------------
| 0 | UPDATE STATEMENT | |
| 1 | UPDATE | STATS$USER_LOG |
| 2 | TABLE ACCESS FULL| STATS$USER_LOG |
---------------------------------------------

How many reads should it take?

SQL> select bytes/1024/1024 Mb from dba_Segments where segment_name='STATS$USER_LOG';

MB
———-
237

The database performace will decrease constantly and very slowly…..
Remember: never trust a solution if it involves a change on the system.

Sometimes they come back

October 7th, 2008

A collegue called me yesterday.
Collegue: “Hi, I cannot acces a customer database, it’s urgent. Can you help me?”
Me: “Did you ever connect to the database?”
Collegue: “Off course, but last monday netadmins reconfigured a router: that’s when the problem started.”
Me: “Reconfigured what?
Collegue: “A router: the router lost the config.”

I tried the connection to the database: network timeout received.

Me: “Can the customer connect from his network?”
Collegue: “Yes, it doesn’t work only from our network.”
Me: “Is there a NAT?”
Collegue: “Yes, the real address is different.”
Me: “Is the db server running Windows”?
Collegue: “I forgot! It’s a Windows NT.”
Me: “Mmh, I should ask netadmins. Perhaps you can workaround putting “USE_SHARED_SOCKET” in the registry… Is the release at least 9i?”
Collegue: “It’s a 7.3″
Me: “@#”&%!?”