How to fix CPU usage problem in 12c due to DBMS_FEATURE_AWR

I love my job because I always have suprises. This week’s surprise has been another problem related to SQL Plan Directives in 12c. Because it is a common problem that potentially affects ALL the customers, I am glad to share the solution on my blog 😀

Symptom of the problem: High CPU usage on the server

My customer’s DBA team has spotted a consistent high CPU utilisation on its servers:

spd_awr_high_cpu_sar

Everyday, at the same time, and for 20-40 minutes, the servers hosting the Oracle databases run literally out of CPU.

spd_awr_high_cpu_em

 

Troubleshooting

Ok, it would be too easy to give the solution now. If you cannot wait, jump at the end of this post. But what I like more is to explain how I came to it.

First, I gave a look at the processes consuming CPU. Most of the servers have many consolidated databases on them. Surprisingly, this is what I have found:

spd_awr_high_cpu_m001It seems that the source of the problem is not a single database, but all of them. Isn’t it? And I see another pattern here: the CPU usage comes always from the [m001] process, so it is not related to a user process.

My customer has Diagnostic Pack so it is easy to go deeper, but you can get the same result with other free tools like s-ash, statspack and snapper. However, this is what I have found in the Instance Top Activity:

spd_awr_high_cpu_instOk, everything comes from a single query with sql_id auyf8px9ywc6j. This is the full sql_text:

It looks like something made by a DBA, but it comes from the MMON.

Looking around, it seems closely related to two PL/SQL calls that I could find in the SQL Monitor and that systematically fail every day:

spd_cpu_sql_monitorDBMS_FEATURE_AWR function calls internally the SQL auyf8px9ywc6j.

The MOS does not know anything about that query, but the internet does:

spd_awr_franckOh no, not Franck again! He always discovers new stuff and blogs about it before I do 🙂

In his blog post, he points out that the query fails because of error ORA-12751 (resource plan limiting CPU usage) and that  it is a problem of Adaptive Dynamic Sampling. Is it true?

What I like to do when I have a problematic sql_id, is to run sqld360 from Mauro Pagano, but the resulting zip file does not contain anything useful, because actually there are no executions and no plans.

During the execution of the statement (or better, during the period with high CPU usage), there is an entry in v$sql, but no plans associated:

And this is very likely because the statement is still parsing, and all the time is due to the Dynamic Sampling. But because the plan is not there yet, I cannot check it in the DBMS_XPLAN.DISPLAY_CURSOR.

I decided then to trace it with those two statements:

At the next execution I see indeed the Adaptive Dynamic Sampling in the trace file, the errror due to the exhausted CPU in the resource plan, and the directives that caused the Adaptive Dynamic Sampling:

 

 

So, there are some SQL Plan Directives that force the CBO to run ADS for this query.

This query touches three tables, so instead of relying on the DIRECTIVE_IDs, it’s better to get the directives by object name:

Solution

At this point, the solution is the same already pointed out in one of my previous blog posts: disable the directives individually!

This very same PL/SQL block must be run on ALL the 12c databases affected by this Adaptive Dynamic Sampling problem on the sql_id auyf8px9ywc6j.

If you have just migrated the database to 12c, it would make even more sense to programmatically “inject” the disabled SQL Plan Directives into every freshly created or upgraded 12c database (until Oracle releases a patch for this non-bug).

It comes without saying that the next execution has been very quick, consuming almost no CPU and without using ADS.

HTH

Ludovico

 

Cloning a PDB with ASM and Data Guard (no ADG) without network transfer

Ok, if you’re reading this post, you may want to read also the previous one that explains something more about the problem.

Briefly said, if you have a CDB running on ASM in a MAA architecture and you do not have Active Data Guard, when you clone a PDB you have to “copy” the datafiles somehow on the standby. The only solution offered by Oracle (in a MOS Note, not in the documentation) is to restore the PDB from the primary to the standby site, thus transferring it over the network. But if you have a huge PDB this is a bad solution because it impacts your network connectivity. (Note: ending up with a huge PDB IMHO can only be caused by bad consolidation. I do not recommend to consolidate huge databases on Multitenant).

So I’ve worked out another solution, that still has many defects and is almost not viable, but it’s technically interesting because it permits to discover a little more about Multitenant and Data Guard.

The three options

At the primary site, the process is always the same: Oracle copies the datafiles of the source, and it modifies the headers so that they can be used by the new PDB (so it changes CON_ID, DBID, FILE#, and so on).

On the standby site, by opposite, it changes depending on the option you choose:

Option 1: Active Data Guard

If you have ADG, the ADG itself will take care of copying the datafile on the standby site, from the source standby pdb to the destination standby pdb. Once the copy is done, the MRP0 will continue the recovery. The modification of the header block of the destination PDB is done by the MRP0 immediately after the copy (at least this is what I understand).

ADG_PDB_copy

Option 2: No Active Data Guard, but STANDBYS=none

In this case, the copy on the standby site doesn’t happen, and the recovery process just add the entry of the new datafiles in the controlfile, with status OFFLINE and name UNKNOWNxxx.  However, the source file cannot be copied anymore, because the MRP0 process will expect to have a copy of the destination datafile, not the source datafile. Also, any tentative of restore of the datafile 28 (in this example) will give an error because it does not belong to the destination PDB. So the only chance is to restore the destination PDB from the primary.
NOADG_PDB_STANDBYS_NONE_copy

Option 3: No Active Data Guard, no STANDBYS=none

This is the case that I want to explain actually. Without the flag STANDBYS=none, the MRP0 process will expect to change the header of the new datafile, but because the file does not exist yet, the recovery process dies.
We can then copy it manually from the source standby pdb, and restart the recovery process, that will change the header. This process needs to be repeated for each datafile. (that’s why it’s not a viable solution, right now).

NOADG_PDB_copy

Let’s try it together:

The Environment

Primary

Standby

The current user PDB (any resemblance to real people is purely coincidental 😉 #haveUSeenMaaz):

Cloning the PDB on the primary

First, make sure that the source PDB is open read-only

Then, clone the PDB on the primary without the clause STANDBYS=NONE:

Review the clone on the Standby

At this point, on the standby the alert log show that the SYSTEM datafile is missing, and the recovery process stops.

One remarkable thing, is that in the standby controlfile, ONLY THE SYSTEM DATAFILE exists:

We need to fix the datafiles one by one, but most of the steps can be done once for all the datafiles.

Copy the source PDB from the standby

What do we need to do? Well, the recovery process is stopped, so we can safely copy the datafiles of  the source PDB from the standby site because they have not moved yet. (meanwhile, we can put the primary source PDB back in read-write mode).

Copy the datafiles:

Do the magic

Now there’s the interesting part: we need to assign the datafile copies of the maaz PDB to LUDO.

Sadly, the OMF will create the copies on the bad location (it’s a copy, to they are created on the same location as the source PDB).

We cannot try to uncatalog and recatalog the copies, because they will ALWAYS be affected to the source PDB. Neither we can use RMAN because it will never associate the datafile copies to the new PDB. We need to rename the files manually.

It’s better to uncatalog the datafile copies before, so we keep the catalog clean:

Then, because we cannot rename files on a standby database with standby file management set to AUTO, we need to put it temporarily to MANUAL.

standby_file_management is not PDB modifiable, so we need to do it for the whole CDB.

then we need to set back the standby_file_management=auto or the recover will not start:

We can now restart the recovery.

The recovery process will:
– change the new datafile by modifying the header for the new PDB
– create the entry for the second datafile in the controlfile
– crash again because the datafile is missing

We already have the SYSAUX datafile, right? So we can alter the name again:

This time all the datafiles have been copied (no user datafile for this example) and the recovery process will continue!! 🙂 so we can hit ^C and start it in background.

The Data Guard configuration reflects the success of this operation.

Do we miss anything?

Of course, we do!! The datafile names of the new PDB reside in the wrong ASM path. We need to fix them!

 

I know there’s no practical use of this procedure, but it helps a lot in understanding how Multitenant has been implemented.

I expect some improvements in 12.2!!

Cheers

Ludo