How to collect Oracle Application Server performance data with DMS and RRDtool

RRDize everything, chapter 1

If you are managing some Application Server deployments, you have probably wondered how to check and collect performance data.
As stated in the documentation, you can gather performance metrics with the dmstool utility.
AFAIK this can be done from the 9.0.2 release onwards, but I'm afraid DMS will not work on WebLogic.

Mainly, you need an external server that acts as a collector (it could be a server in the Oracle AS farm as well): copy the dms.jar library from an Oracle AS installation to your collector and use it as you would use dmstool:

There are three basic ways to get data:

Get all metrics at once:
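From memory, the call looks like this (the -dump flag is the same you would pass to dmstool; the address syntax is discussed below):

java -jar dms.jar -a youraddress:// -dump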

Get only the interesting metrics:
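Same call, but passing the full DMS paths of the metrics you care about (the paths below are hypothetical):

java -jar dms.jar -a youraddress:// /path/to/metric1.value /path/to/metric2.value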

Get metrics included into specific DMS tables:
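For example, assuming ohs_server is the DMS table that holds the Apache server metrics:

java -jar dms.jar -a youraddress:// -table ohs_server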

What youraddress:// should be depends on the component you are trying to connect to:
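these are the typical forms, from memory (double-check the ports for your release):

opmn://hostname:6003        for OPMN-managed components (OC4J and friends)
http://hostname:7200/dms0   for the OHS, through the Location defined in dms.conf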

If you are trying to connect to the OHS (Apache), make sure to allow remote access from the collector by editing the dms.conf file.

Now that you can query DMS data, you need to store it somewhere.
Personally, I made a first attempt with dmstool -dump format=xml: I wrote a parser in PHP with the SimpleXML extension and did a lot of inserts into a MySQL database. After a few months, the data collected from tens of servers was too much to maintain…
To avoid maintaining a DWH-grade database I investigated and found RRDtool. Now I wonder how I ever lived without it!

I then wrote an awk parser that reads the output of the dms.jar call and invokes an rrdtool update command.
I always use the dms.jar -table command, so the output format is always the same.

So I wrote an awk script that works for me. Use it this way:
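a hypothetical invocation (the script name and the DMS table are mine):

java -jar dms.jar -a youraddress:// -table ohs_server | awk -f update_metric_rrd.awk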

And this is the code for update_metric_rrd:
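Or rather, a minimal sketch of it: it assumes that the first field of each metric line is the metric name, that the last field is its numeric value, and that one .rrd file per metric already exists under /var/rrd.

#!/usr/bin/awk -f
# update_metric_rrd.awk -- feed the dms.jar -table output to rrdtool
BEGIN { RRDPATH = "/var/rrd" }
# skip headers and separators: keep only lines ending with a number
$NF ~ /^[0-9]+(\.[0-9]+)?$/ {
    metric = $1
    value  = $NF
    # one update per metric line, timestamped with the current time (N:)
    system("rrdtool update " RRDPATH "/" metric ".rrd N:" value)
}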

Once you have all your rrd files populated, it's easy to script automatic reporting. You will probably want a graph of the request count served by your Apache cluster, along with its linear regression:
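A sketch of the graph command (the file names and the DS name "completed" are my own; the trend line uses rrdtool's documented least-squares VDEF functions LSLSLOPE and LSLINT):

rrdtool graph ohs_requests.png --start -1w --title "OHS request completed" \
  DEF:r1=ohs1_requests.rrd:completed:AVERAGE \
  DEF:r2=ohs2_requests.rrd:completed:AVERAGE \
  CDEF:total=r1,r2,+ \
  VDEF:slope=total,LSLSLOPE \
  VDEF:offset=total,LSLINT \
  CDEF:trend=total,POP,slope,COUNT,*,offset,+ \
  AREA:total#00CC66:"requests/s" \
  LINE2:trend#FF0000:"linear regression"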

This is the result:
[graph: OHS request completed]
OHHHHHHHHHHHH!!!! COOL!!!!

That’s all for DMS capacity planning. Stay tuned, more about rrdtool is coming!

More about Dataguard and how to check it

After my post Quick Oracle Dataguard check script, I have some considerations to add:
to check the gap of the log stream applied by the MRP0 process, it's sufficient to replace this query in the perl script I posted:
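(from memory, the old query tracked the last sequence# received by the RFS process, something like this:)

select max(sequence#) from v$managed_standby where process = 'RFS';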

with this new one:
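a sketch of it, using the same view (the view and column names are real; the exact text may differ):

select process, status, sequence#, block#
  from v$managed_standby
 where process = 'MRP0';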

For this check to be meaningful, the following condition must be met: you need real-time apply enabled (and possibly the NODELAY clause specified in your recover statement). Check it with this query:
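On the standby, something along these lines (v$archive_dest_status is the view that exposes the recovery mode):

select dest_id, recovery_mode
  from v$archive_dest_status
 where recovery_mode <> 'IDLE';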

It should be “MANAGED REAL TIME APPLY”.
If you are not using real-time apply, your MRP0 process will wait until a new archived log is available, so even if your redo transport mode is set to LGWR you'll wait for the standby log completion. Your gap of applied redo stream will be at least one sequence#.

With the transport mode set to LGWR and real-time apply enabled, the whole gap between your primary and standby database, as reported by the perl script, should be LOW.

Clustering the RMAN catalog on a RAC environment

You have your brand-new RAC deployed on a cluster and you want to manage your backups through a recovery catalog.
Suppose you don't have a dedicated server to host your catalog, and you probably wouldn't configure the catalog itself as a RAC database: so why not use Clusterware to run your catalog as a single instance in cold failover?

OTN has a very nice whitepaper describing how to protect a single-instance database; it applies nicely to 10g, 10gR2 and 11g: Using Oracle Clusterware to Protect A Single Instance Oracle Database 11g.
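The gist of it, roughly (the resource name, action script path and option values are placeholders; the action script must implement start, stop and check for the catalog instance):

crs_profile -create rcat_db -t application \
            -a /u01/app/crs/public/act_rcat.scr \
            -o ci=60,ra=5,st=300
crs_register rcat_db
crs_start rcat_db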

Clusterware is appealing for traditional cold failover clusters, too. The licensing allows you to use Clusterware as long as it protects Oracle software, or third-party software that uses Oracle as its database backend.

Quick Oracle Dataguard check script

Oracle Dataguard has its own command-line utility, dgmgrl, to check the status of the whole Dataguard configuration.
At the very least, you should check that the show configuration command returns SUCCESS.

This is a hypothetical script:
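or rather, a minimal sketch of one (credentials and connect string are placeholders):

#!/bin/bash
# exit 0 if dgmgrl reports SUCCESS, 1 otherwise
if dgmgrl -silent sys/"${SYSPWD}"@"${DB}" "show configuration" | grep -q SUCCESS
then
    echo "Dataguard configuration: OK"
else
    echo "Dataguard configuration: ERROR"
    exit 1
fi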

Another script should check the gap between the production online log and the log stream received by the standby database. This can be accomplished with the v$managed_standby view.
The total block gap between production and standby can be calculated this way:
sum all the blocks from v$archived_log where the sequence# is between the current standby sequence# and the current production sequence#; then add the current block# of the production LGWR process and subtract the current block# of the standby RFS process. This gives you the total blocks even if there is a log sequence gap between the sites.
This is NOT the gap of the online log APPLIED to the standby database. THIS IS THE GAP OF THE ONLINE LOG TRANSMITTED TO THE STANDBY RFS PROCESS, and it can be used to monitor the Dataguard transmission from the production to the disaster recovery environment.

This is an excerpt of such a script (please note that it does not check for RFS failures, so it can fail when RFS is not alive):
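The queries behind the calculation look more or less like this (run the LGWR one on the primary and the RFS one on the standby; the bind names are mine):

-- on the standby: last sequence#/block# received by RFS
select sequence#, block# from v$managed_standby where process = 'RFS';

-- on the primary: current sequence#/block# written by LGWR
select sequence#, block# from v$managed_standby where process = 'LGWR';

-- on the primary: blocks of the archived sequences in between
select nvl(sum(blocks), 0)
  from v$archived_log
 where sequence# >= :standby_seq
   and sequence# < :primary_seq;

-- total block gap = sum(blocks) + primary block# - standby block#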

Any comment is appreciated!

Tips: Bash Prompt and Oracle

You may want to check the NEW VERSION of this prompt here.

I disagree with the default bash prompt. Do you? It's quite common to work with long paths:
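something like this (a made-up but realistic example):

[oracle@myserver /u01/app/oracle/product/10.2.0/db_1/network/admin]$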

and, when working in multi-database environments, I need to check my environment:
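usually with a couple of extra commands, like these:

$ echo $ORACLE_SID
PROD1
$ echo $ORACLE_HOME
/u01/app/oracle/product/10.2.0/db_1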

I currently use this prompt, instead:
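A reconstruction from the “Pros” list below (the exact string may differ):

PS1='\n### \t ### ${ORACLE_SID:-nosid} ### $(ohvers) ### \u@\h:\w ###\n# '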

What is ohvers? I defined this function to get the Oracle version from my ORACLE_HOME variable:
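Not necessarily the original function, but a guess that behaves as described: it extracts a version-looking token from the path.

ohvers () {
    # e.g. /u01/app/oracle/product/10.2.0/db_1 -> 10.2.0
    echo "${ORACLE_HOME}" | grep -o '[0-9]\+\(\.[0-9]\+\)\+' | head -1
}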

Pros:

  • I have a blank line that separates my prompt from the previous output
  • I get the system clock (useful when saving my konsole history. Did I say konsole?)
  • I can see my Oracle Environment before launching dangerous commands
  • I have an empty line to start my endless commands
  • I have a lot of sharps “#”: the shell treats them as comments, so they are fine against wrong copy&paste operations…

Suggestions?

JBoss Portal and MySQL scalability: What The…???

I found several queries like this one running on a MySQL 5.0 database:
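The shape was an IN-subquery on the same table, which MySQL 5.0 executes as a DEPENDENT SUBQUERY; something like this (column names are from the JBoss Portal schema, the predicate value is made up):

SELECT o.PK_, o.NAME_, o.PATH_
  FROM JBP_OBJECT_NODE o
 WHERE o.PARENT_KEY IN
       (SELECT p.PK_ FROM JBP_OBJECT_NODE p WHERE p.PATH_ = 'default.default');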

This query is related to JBoss Portal and does a full scan on table JBP_OBJECT_NODE.

Its performance is bad (>0.8 sec) even with just a few records:

mysql> select count(*) from JBP_OBJECT_NODE;
+----------+
| count(*) |
+----------+
|    33461 |
+----------+

If I rewrite the query with an inner join (à la Oracle, please forgive me) instead of a subquery, I get an index scan:
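Using the same hypothetical shape as above:

SELECT o.PK_, o.NAME_, o.PATH_
  FROM JBP_OBJECT_NODE o
  JOIN JBP_OBJECT_NODE p ON p.PK_ = o.PARENT_KEY
 WHERE p.PATH_ = 'default.default';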

With 30k records, the execution time drops from 0.8 secs to 0.01 secs…
That’s NOT all! I found this open bug:

https://jira.jboss.org/jira/browse/JBPORTAL-2040

With many registered users, the JBoss Portal admin console takes over a minute to show a single page…

I don’t like portals…

Oracle RAC Standard Edition to achieve low cost and high performance

Today I finished creating a new production environment based on 2 Linux x86_64 servers running Oracle RAC 10gR2. (I know, 11g is out right now, but I'm a conservative!)
Wheeew, I just spent a couple of hours applying all the recommended patches!
We chose 2 nodes with a maximum of 2 multi-core processors each, so we can license Standard Edition instead of Enterprise Edition. 64-bit addressing allows us to allocate many gigabytes of SGA: I'm starting with 5 GB, but I think we'll need more. And there is a set of 6x300 GB 15k rpm disks (it can be expanded with more disks and more shelves).
This configuration keeps the total cost of ownership low while achieving very good performance.
Due to the disk layout, the costs and the usable storage we needed, we had to configure one huge RAID5 on the SAN, with multipathing. I decided anyway to create 2 ASM disk groups (ASM is mandatory for Standard Edition RAC): one for the DB, the second one for the recovery area. With spare disks we should have enough availability, and even though it's a RAID5 I saw good write performance (>150 MB/s).
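Creating the disk groups is plain ASM (the discovery paths are placeholders; redundancy is EXTERNAL because the SAN already provides RAID5):

CREATE DISKGROUP DATA  EXTERNAL REDUNDANCY DISK '/dev/mapper/asmdata*';
CREATE DISKGROUP FLASH EXTERNAL REDUNDANCY DISK '/dev/mapper/asmflash*';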

Welcome new RAC, I hope we’ll feel good together!

It’s time to trouble…

Sometimes it's hard to find enough time to write something, or even to just THINK about writing something.

The following are the projects I have to complete before the deadline of December 17th (at least, if I still want to go on vacation…):

  • A totally new Oracle 10gR2 RAC SE on Linux (OCFS2, ASM), including JBoss frontends, backups, monitoring and documentation. (Servers are ready today.)
  • A disaster recovery architecture based on Dataguard, with rsync-based scripts for filesystem replication, with failover and failback, including backups, monitoring and documentation. (The server in the DR site is reachable via network today.)
  • The transfer of a 17-server infrastructure (among others, a RAC 10gR2 on Linux) from the Milan datacenter to here. It's planned for December 11th, but I have to crosscheck the backup and contingency requirements.
  • The transfer of a 14-server infrastructure (based on Windows and SQL Server) from the Milan datacenter to here. To be planned in December.
  • A totally new cold failover cluster based on Linux, with the Oracle DBMS and E-Business Suite. (Servers will be provided soon, I hope!)
  • A new standalone 64-bit Windows server to overcome the 32-bit memory allocation bottleneck for a 500 GB Oracle database. (The server will be provided no earlier than December 10th.)
  • Managing the normal day-by-day work, including replying to e-mails and answering the phone.

AARGH!!

System triggers, stats$user_log and side effects

Sometimes people get advice from the internet: from Metalink as well as from well-known consulting sites.
If people need a fix or a feature, they tend to trust that advice.

Last week a colleague told me about a 10g RAC database with performance problems and, since I never sit still in my chair, I probed both AWR and ADDM. I immediately recognized heavy enqueues and physical reads
over a segment named STATS$USER_LOG. “Strange”, I said, “I don't remember this name among either the perfstat or the catalog segments”.
Then I searched the internet and Metalink, and found the same thing on BOTH metalink.oracle.com and www.dba-oracle.com: a trick to trace logons and logoffs into a table using system triggers.

Look at this code:

create or replace trigger
logon_audit_trigger
AFTER LOGON ON DATABASE
BEGIN
insert into stats$user_log values(
user,
sys_context('USERENV','SESSIONID'),
sys_context('USERENV','HOST'),
sysdate,
to_char(sysdate, 'hh24:mi:ss'),
[...]
);
COMMIT;
END;
/

Cool, every single access is kept in stats$user_log.

Let’s see the logoff trigger:

create or replace trigger
logoff_audit_trigger
BEFORE LOGOFF ON DATABASE
BEGIN
-- ***************************************************
-- Update the last action accessed
-- ***************************************************
update stats$user_log [...]
--***************************************************
-- Update the last program accessed
-- ***************************************************
update stats$user_log [...]
-- ***************************************************
[ ... many, many updates ...]
-- ***************************************************
update stats$user_log [...]
-- ***************************************************
-- Compute the elapsed minutes
-- ***************************************************
update stats$user_log set elapsed_minutes =
round((logoff_day - logon_day)*1440)
where
sys_context('USERENV','SESSIONID') = session_id;
COMMIT;
END;
/

That's all. It inserts one row when someone logs on. It updates MANY rows when someone logs off.
There is no match between the record inserted and the records updated, except for the session_id.
And there are neither indexes nor constraints.

What’s the matter?

What happens if we have many logons?

SQL> select num_rows from dba_tables where table_name='STATS$USER_LOG';

  NUM_ROWS
----------
   3053931

What happens if the execution plan does a full scan?

SQL> explain plan for update stats$user_log […]

Explained.

SQL> @?/rdbms/admin/utlxpls
---------------------------------------------
| Id | Operation | Name |
---------------------------------------------
| 0 | UPDATE STATEMENT | |
| 1 | UPDATE | STATS$USER_LOG |
| 2 | TABLE ACCESS FULL| STATS$USER_LOG |
---------------------------------------------

How many reads should it take?

SQL> select bytes/1024/1024 Mb from dba_Segments where segment_name='STATS$USER_LOG';

        MB
----------
       237

The database performance will decrease constantly and very slowly…
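If you really must keep those triggers, the least invasive mitigation I can think of is an index on the only column that matches the updates with the inserted row (a sketch, untested against the original scripts):

create index stats$user_log_i1 on stats$user_log (session_id);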
Remember: never trust a solution if it involves a change on the system.

Plot Oracle historical statistics within SQL*Plus

More often than not, I'm asked to investigate “what happened yesterday when the performance problems appeared”.

Sometimes I have the Enterprise Manager DB Console licensed, sometimes not. Sometimes I have direct SQL*Net access to the database, which I can use to produce custom reports with my self-developed LAMP application. But it may happen that only ssh access to the db server is granted.

That’s why I started to develop some little scripts to plot the trends of database timed statistics.

Let’s see this one:

SQL> @sysstat.sql
Enter a sysstat to search for: physical reads


    STAT_ID STAT_NAME
----------- ------------------------------------------
 2263124246 physical reads
 4171507801 physical reads cache
  297908839 physical reads cache prefetch
 2589616721 physical reads direct
 2564935310 physical reads direct (lob)
 2663793346 physical reads direct temporary tablespace
  473165409 physical reads for flashback new
 3102888545 physical reads prefetch warmup
  531193461 physical reads retry corrupt

9 rows selected.

Enter the desired stat_id: 2263124246
Enter the start date (YYYYMMDD) [defaults today] : 20080922
Enter the end date (YYYYMMDD) [defaults today] : 20080922

STAT_NAME        START    END
---------------- -------- --------
physical reads   20080922 20080922

BEGIN_INTERVAL_TIME           VALORE PLOTTED_VALUE
------------------------- ---------- -------------------------
22-SEP-08 12.00.12.122 AM          0
22-SEP-08 01.00.28.253 AM     120092
22-SEP-08 02.00.05.039 AM      35780
22-SEP-08 03.00.55.595 AM       4792
22-SEP-08 04.00.43.725 AM       4905
22-SEP-08 05.00.31.855 AM       7300
22-SEP-08 06.00.17.017 AM     234596
22-SEP-08 07.00.08.132 AM      24651
22-SEP-08 08.00.50.936 AM     481884
22-SEP-08 09.00.33.488 AM     130201
22-SEP-08 10.00.03.805 AM    1300306 **
22-SEP-08 11.00.07.764 AM     491857
22-SEP-08 12.00.31.548 PM     304702
22-SEP-08 01.01.04.880 PM    1023664 *
22-SEP-08 02.00.17.822 PM    8588180 ************
22-SEP-08 03.00.36.969 PM    2201615 ***
22-SEP-08 04.01.01.397 PM   17237098 *************************
22-SEP-08 05.00.39.262 PM    1606300 **
22-SEP-08 06.00.03.829 PM     451568
22-SEP-08 07.00.31.461 PM     137684
22-SEP-08 08.00.05.966 PM     203803
22-SEP-08 09.00.24.829 PM     536394
22-SEP-08 10.00.12.945 PM   10209783 **************
22-SEP-08 11.00.35.123 PM    6151663 *********

24 rows selected.

Oh! At 4.00 PM we had a lot of physical reads. Nice.

This is the code:
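Or rather, a compact sketch that produces the same kind of report (the variable and alias names are mine; it reads the AWR views, so the Diagnostics Pack license is required):

set verify off pagesize 100 linesize 120
accept search prompt 'Enter a sysstat to search for: '

select stat_id, stat_name
  from dba_hist_sysstat
 where stat_name like '%&search%'
 group by stat_id, stat_name;

accept statid prompt 'Enter the desired stat_id: '
column today new_value today noprint
select to_char(sysdate, 'YYYYMMDD') today from dual;
accept begindate default &today prompt 'Enter the start date (YYYYMMDD) [defaults today] : '
accept enddate   default &today prompt 'Enter the end date (YYYYMMDD) [defaults today] : '

with deltas as (
  select sn.begin_interval_time,
         st.value - lag(st.value) over (order by sn.snap_id) valore
    from dba_hist_sysstat  st
    join dba_hist_snapshot sn
      on  sn.snap_id         = st.snap_id
     and  sn.dbid            = st.dbid
     and  sn.instance_number = st.instance_number
   where st.stat_id = &statid
     and sn.begin_interval_time >= to_date('&begindate', 'YYYYMMDD')
     and sn.begin_interval_time <  to_date('&enddate',   'YYYYMMDD') + 1
)
select begin_interval_time,
       valore,
       -- scale the bar to 25 stars at the maximum delta in the period
       rpad('*', round(valore / nullif(max(valore) over (), 0) * 25), '*') plotted_value
  from deltas
 where valore is not null
 order by begin_interval_time;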

Ciao

Ludovico