Oracle Home Management – part 4: Challenges and Opportunities of the New Release Model

Starting with the upcoming next release (18c), the Oracle Database will be a yearly release. (18c, 19c, etc). New yearly releases will contain only new features ready to go, and eventually some new features for performance improvements (plus bug fixes and security fixes from the previous version.)

Quarterly, instead of Patch Set Updates (PSU) and Bundle Patches (BP), there will be the new Release Updates (RU). They will contain critical fixes, optimizer changes, minor functional enhancements, bug fixes, security fixes. The new Release Updates will be equivalent to what we have now with Bundle Patches.

The Release Updates will be released during the whole lifetime of the feature release, according to the roadmap (2 years or 5 years depending on whether the release is in Long Term Support (LTS) or not). There will be a Long Term Support release every few years. The first two will probably be Oracle 19c and Oracle 23c (I am deliberately supposing that the c will still be relevant 🙂 ).

Beside Release Updates, there will be the new Release Update Revisions (RUR), that according to what I have read until now, will be released “at least” quarterly. Release Update Revisions will contain only regression fixes for bugs introduced by RUs and new security fixes, very close to what we have now with Patch Set Updates.

Release Update Revisions will cover ONLY 6 months, after that it will be necessary to upgrade to a newer Release Update or to a newer major release. Oracle introduced this change to reduce the complexity of their release management.

This leads to a few important things:

  • There will be no more than two RURs for each RU (e.g. 18.2 will have only 18.2.1 and 18.2.2)
  • If applying a RUR, after 6 months at latest, the DBs must be patched to a greater level of RU.
  • Applying the second RUR of each RU (e.g. 18.2.2 -> 18.3.2 -> 18.4.2) is the most conservative approach whilst keeping up to date with the latest critical fixes.

On top of that, one-off patches will still exist. For more information,  please read the note Release Update Introduction and FAQ (Doc ID 2285040.1)

new-release-modelHow will the new release model impact the patching strategy?

It is clear that it will be complex to keep the same major upgrade frequency as today (I expect it to increase). There have been from 3 to 5 years between each major release so far, and switching to a yearly release is a big change.

But the numbering will be easier: 18.3.2 is much more readable/maintainable than 12.2.0.3.BP180719 and, despite it does not contain an explicit date, it keeps it easy to understand the “distance” with the latest release.

So we will have, on one side, the need to upgrade more frequently. But on the other side, the upgrades might be easier than how they are now. One thing is sure, however: we will deal with many more Oracle Homes with different patch levels.

The new release model will bring us a unique opportunity to reinvent our procedures and scripts for Oracle Home management, to achieve a standardized and automated way to solve common problems like:

  • Multiple Oracle Homes coexistence (environment, naming conventions)
  • Automated binaries setup (via golden images or other automatic provisioning)
  • Database patches
  • Database upgrades

In the next post, I will show my idea of how Oracle Homes could be managed (with either the current or the new release model), making their coexistence easier for the DBAs.

Bonus: calculating the distance between releases

For a given release YY.x.z, the distance from its first release is ( x + z -1 ) quarters.

E.g.18.3.2 will be ( 3 + 2 – 1 ) = 4 quarters after the initial release date.

Across versions, assuming that each yearly release will be released in the same quarter, the distance between versions YY1.x1.z1 and YY2.x2.z2  is:

( YY2 – YY1 ) * 4 + ( x2 + z2 ) – ( x1 + z1 ) quarters

E.g. : between 18.4.1 and 20.1.2 the distance will be:

( 20 – 18 ) * 4 + ( 1 + 2 ) – ( 4 + 1 ) = 6 quarters

 

Oracle Home Management – part 3: Strengths and limitations of Rapid Home Provisioning

In the previous post I mentioned that having a central repository storing the Golden Images would be the best solution for the Oracle Home provisioning.

In this context, Oracle provides Rapid Home Provisioning: a product included in Oracle Grid Infrastructure that automates home provisioning and patching of Oracle Database and Grid Infrastructure Homes, databases and also generic software.

rhp-conceptOracle Rapid Home Provisioning simplifies tremendously the software provisioning: you can use it to create golden images starting from existing installations and then deploy them locally, across different nodes, on local or remote clusters, standalone servers, etc.

Having a central store with enforced naming conventions ensures software standardization across the whole Oracle farm, and makes patching easier with less risks. Also, it allows to patch existing databases, moving them to Oracle Homes with a higher patch level, and taking care of service draining and rolling upgrades when RAC or RAC One Node deployments exist. Multiple databases can be patched in a single batch using one single rhpctl command.

I will not explain the technical details of Rapid Home Provisioning implementation operation. I already did a webinar a couple of years ago for the RAC SIG:

Burt Clouse, the RHP product manager, did a presentation as well about Rapid Home Provisioning 12c Release 2, that highlights some new features that the product was missing in the first release:

More details about the new features can be found here:

https://blogs.oracle.com/db_maintenance/whats-new-in-122-for-rapid-home-provisioning-and-maintenance

Close to be the perfect product, but…

If rapid home provisioning is so powerful, what makes it less appealing for most users?

In my opinion (read: very own personal opinion 🙂 ), there are two main factors:

First: The technology stack RHP is relying on is quite complex

Although Rapid Home Provisioning 12c Release 2 allows Oracle Home deployments on standalone servers (it was not the case with 12c Release 1), the Rapid Home Provisioning sever itself relies on Oracle Grid Infrastructure 12cR2. That means that there must be skills in the company to manage the full stack: Clusterware, ASM, ACFS, NFS, GNS, SCAN, etc. as well as the RHP Server itself.

Second: remote provisioning requires Lifecycle Management Pack (extra-cost) option licensed on all the RHP targets

If Oracle Homes are deployed on the same cluster that hosts the RHP Server, the product can be used at no extra cost. But if you have many clusters, or using standalone servers for your Oracle databases, then RHP can become pricey very quickly: the price per processor for Lifecycle Management Pack is 12’000$, plus support (pricelist April 2018). So, buying this management pack just to introduce Rapid Home Provisioning in your company might be an excessive investment.

Of course, depending on your needs, you can evaluate it and leverage its full potential and make a bigger return of investment.

Or, you might explore if it is viable to configure each cluster as Rapid Home Provisioning Server: in this case it would be free, but it will have the additional complexity layer on all your clusters.

For small companies, simple architectures and especially where Standard Edition is deployed (no Management Pack for Standard Edition!), a self-made, simpler solution might be a better choice.

In the next post, before going into the details of a hypothetical self-made implementation, I will introduce my thoughts about the New Oracle Database Release Model.

 

Oracle Home Management – part 2: Common patching patterns

(*) Multiple times in this blog post I refer to a problem with new Oracle Home installs and rollback scripts. The problem has been fixed with PSU Jan 2017, I did not notice it before, sorry. Thanks to Martin Berger for the information

Let’s see some common approaches to Oracle Home patching.

First, how patches are applied

No, I will not talk about how to use opatch 🙂 It is an overview of the “high-level” methods… when you have multiple servers and (eventually) multiple databases per server.

Worst approach (big bang)

1.Stop everything

2.In-place binaries patching

3.Database patching, “big bang” mode

4.Start everything

With this approach, you have a big downtime, a maintenance window hard to get (all applications are down at the same time), no control over a single database and no easy rollback in case your binaries get compromised/corrupted by the patch apply.

in-place-patching

Another bad approach (new install and out-of-place patching)

1.Re-install binaries manually in a new path

2.Patch the new binaries

3.Stop, change OH, patch databases one by one

4.Decommission old binaries

out-of-place-patchingThis approach is much better than the previous one, but still has some pitfalls:

  • If you have many servers and environments: doing it frequently might be a challenge
  • Rollback scripts are not copied automatically: the datapatch will fail unless you copy them by hand (*)
  • New installs introduce potential human error, unless you use unattended install with your own scripts
  • Do you like to run opatch apply all the time, after all?

Better approach (software cloning)

This approach is very close to the previous one, with the exception that the new Oracle Home is not installed from scratch, but rather cloned from an existing one. This way, the rollback scripts used by the datapatch binary will be there and there will be no errors when patching the databases. (*)

The procedure for Oracle Home cloning is described in the Oracle Documentation, here.

Another cool thing is that you can clone Oracle Homes across different nodes, so that you might have the same patch level everywhere without repeating the tedious tasks of upgrading the opatch, patching the binaries, etc. etc.

But still, you have to identify which Oracle Home you have to clone and keep track of the latest version.

Best approach (Golden Images)

The best approach would consist in having a central repository for your software, where you store every version of your Oracle Homes, one for each patch level.

Having a central repository allows to install the software ONCE and use a “clone, patch and store it” strategy. You can, for example, use only one server to do all the patching and then distribute your software images to the different database servers.

This is the concept of Golden Images used by Rapid Home Provisioning that will be in the scope of my next blog post.

 

Second, which patches are applied

Now that we have seen some Oracle Home patching approaches, is it worth to know which patches are important in a patching strategy.

It is better that you get used to the differences between PSU/BP and RU/RUR, by reading this valuable post from Mike Dietrich:

Differences between PSU / BP and RU / RUR

I will make the assumption that in every case, the critical patches should be applied quarterly, or at least once per year, in order to fix security bugs.

The conservative approach (stability and performance over improvements)

Prior to 12.2, in order to guarantee security and stability, the best approach was to apply only PSUs each quarter.

From 12.2, the most conservative approach is to apply the latest Release Update Review on top of the oldest as possible Release Update. Confusing? Things will be clearer when I’ll write about the 18c New Release Model in a few days…

The cowboy approach (improvements over stability and performance)

Sometimes Bundle Patches and Release Updates contain cool backports from the new releases; sometimes they contain just more bug fixes than the PSUs and RURs; sometimes they fix important stuff like disabling bad transformations that lead to wrong result bugs or other annoying bugs.

Personally, I prefer to include such improvements in my patching strategy: I regularly apply RU for releases >=12.2 and BP for releases <=12.1. Don’t call me cowboy, however 🙂

The incumbent approach (or why you cannot avoid one-offs)

It does not matter your patch frequency: sometimes you hit a bug, and the only solution is either to apply the one-off patch or the workaround, if available.

If you apply the one-off patch for a specific bug, from an Oracle Home maintenance point of view, it would be better to

  • apply the same one-off everywhere (read, all your Oracle Homes with the very same release), this makes your environment homogeneous.

or

  • use a clone of the Oracle Home with the one-off as basis to apply the release update and distribute it to the other servers.

Why?

Again, it is a problem with rollback scripts (*), with patch conflicts and also, of number of versions to maintain:2018-05-03 16_26_38-Diaporama PowerPoint - [Présentation1]Less paths, less error-prone!

There is, however, the alternative to one-offs: implementing the workaround instead of applying the patch. Most of the time the workaround consist in disabling “something” through parameters, or worse, hidden parameters (the underscore parameters that the support says you should not set, but advise to do  all the time as workaround :-))

It might be a good idea to use the workaround instead of apply tha patch if you already know that the bug will be fixed in the next Release Update (for example), or that the workaround is so easy to implement that it is not worth to create another version of Oracle Home that will require special attention at the next quarter.

If you apply workarounds, anyway, be sure that you comment EXACTLY why, when and who, so you can decide to unset it at the next parameter review or maintenance… e.g.

Makes sense?

 

Oracle Home Management – part 1: “Patch soon, patch often” vs. reality

With this post, I am starting a new blog series about Oracle Database home management, provisioning, patching… Best (and worst) practices, common practices and blueprints from my point of view as consultant and, sometimes, as operational DBA.

I hope to find the time to continue (and finish) it 🙂

How often should you upgrade/patch?

Database patching and upgrading is not an easy task, but it is really important.

Many companies do not have a clear patching strategy, for several reasons.

  • Patching is time consuming
  • It is complex
  • It introduces some risks
  • It is not always really necesary
  • It leads to human errors

Oracle, of course, recommends to apply the patches quarterly, as soon as they are released. But the reality is that it is (still) very common to find customers that do not apply patches regularly.

Look at this:

With January 2018 Bundle Patch, you can fix 1883 bugs, including 56 “wrong results” bugs! I hope I will talk more about this kind of bugs, but for now consider that if you are not patching often, you are taking serious risks, including putting at risk your data consistency.

I will not talk about bugs, upgrade procedures, new releases here… For this, I recommend to follow Mike Dietrich’s blog: Upgrade your Database – NOW!

I would like rather to talk, as the title of this blog series states, about the approaches of maintaining the Oracle Homes across your Oracle server farm.

Common worst practices in maintaining homes

Maintaining a plethora of Oracle Homes across different servers requires thoughtful planning. This is a non-exhaustive list of bad practices that I see from time to time.

  • Installing by hand every new Oracle Home
  • Applying different patch levels on Oracle Homes with the same path
  • Not tracking the installed patches
  • Having Oracle Home paths hard-coded in the operational scripts
  • Not minding about Oracle Home path naming convention
  • Not minding about Oracle Home internal names
  • Copying Oracle Homes without minding about the Central Inventory

All these worst practices lead to what I like to call “patching madness”… that monster that makes regular patching very difficult / impossible.

THIS IS A SITUATION THAT YOU NEED TO AVOID:

A better approach, would be starting having some naming conventions, e.g.:

In the next blog post, I will talk about common patching patterns and their pitfalls.

 

DBMS_AUDIT_MGMT.CLEAN_AUDIT_TRAIL not working on 12c? Here’s why…

It is bad to realize, after a few years, that my customer’s Audit Cleanup procedures are not working properly for every database…

NOTE: The post is based on standard audit, not unified audit.

My customer developed a quite nice procedure for database housekeeping (including diag dest, OS audit trail, recyclebin, DB audit…)

But after some performance problems, I have come across the infamous sql_id 4ztz048yfq32s:

This SQL comes from the “Failed Logon Attempts” metric in Enterprise Manager.

I’ve checked the specific database, and the table SYS.AUD$ was containing way too many rows, dating before our purge time:

The cleanup procedure does basically this:

But despite a retention window of 31 days, the rows are still there:

(today is 27.04.2018, so the oldest records are more than 1 year old)

I’ve checked with ASH, the actual delete statement executed by the clean_audit_trail procedure is:

So, the DBID clause is OK, but the NTIMESTAMP# clause is  not!

Why?

Long story long (hint, it’s a bug: 19958239)
Update 30.05.2018 the solution is explained in this Doc: 2068066.1, thanks John)

The cleanup metadata is stored into the view DBA_AUDIT_MGMT_LAST_ARCH_TS. Its structure in 11g was:

But in 12c, there are 2 new columns:

When the database is upgraded from 11g to 12c, the two new columns are set to “0” by default.

But when the procedure DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP is executed, the actual dbid is used, and new lines appear:

It is clear now that the DELETE statement is not constructed properly. It should get the LAST_ARCHIVE_TS of the actual DBID being purged… but it takes the other one.

According to my tests, it does not use neither the correct timestamp for the dbid, nor get the oldest timestamp: it uses instead the timestamp of the first record found by the clause “WHERE AUDIT_TRAIL=’STANDARD AUDIT TRAIL'”. It depends on the physical location of the row in the table! Clearly a big mess… (PS, not sure 100%, but this is what I suppose)

So, I have tried to modify the archive time for DBID 0:

Trying to execute the cleanup again, now leads to a better timestamp:

I have then tried to play a little bit with the DBA_AUDIT_MGMT_LAST_ARCH_TS view (and the underlying table DAM_LAST_ARCH_TS$).

First, I’ve faked the DBID:

Then, I have tried to increase the retention timestamp (500 days):

Finally, I have tried to purge the audit trail with both DBIDs:

As I expected, in both cases the the cleanup generated the delete with the timestamp of the fake DBID:

Is it possible to delete the unwanted records from the view DBA_AUDIT_MGMT_LAST_ARCH_TS?

Not only is possible, but I recommend it:

Afterwards, the timestamp in the where condition is correct and remains correct after subsequent executions of DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP.

Conclusions, IMPORTANT FOR THE DATABASE OPERATIONS:

The upgrade causes the unwanted lines with DBID=0 in the DBA_AUDIT_MGMT_LAST_ARCH_TS view.

Moreover, any duplicate changes the DBID: any subsequent execution of DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP in the duplicated database will lead to additional lines in the view.

This is what I plan to do now:

  • Whenever I upgrade from 11g to 12c, cleanup the data from DBA_AUDIT_MGMT_LAST_ARCH_TS and schedule the cleanup for DBID 0 as well
  • Whenever I duplicate a database, I execute a DELETE (without clauses) from DBA_AUDIT_MGMT_LAST_ARCH_T and a truncate of the table SYS.AUD$ (it is a duplicate, after all!)

HTH

Basic Vagrantfile for multiple groups of VMs

In case you want to prepare multiple sets of machines quickly using Vagrant, ready for different setups, this might be something for you:

The nice thing, (beside speeding up the creation and basic configuration) is the organization of the directories. The configuration at the beginning of the script will result in 5 virtual machines:

It is based, in part (but modified and simplified a lot), from  the RAC Attack automation scripts by Alvaro Miranda.

I have a more complex version that automates all the tasks for a full multi-cluster RAC environment, but if this is your requirement, I would rather check oravirt scripts on github (https://github.com/oravirt) . They are much more powerful and complete (and complex…) than my Vagrantfile. 🙂

Cheers

BP and Patch 22652097: set optimizer_adaptive_statistics to FALSE explicitly or it might not work!

Update 14.03.2018: After some exchanges with Nigel Bayliss, the behaviour described here has been filed as unpublished bug 27626925: OPTIMIZER ADAPTIVE STATS DEFAULT FALSE NOT HONORED WHEN ENABLED IN OCT OR JAN BP. It will be fixed starting with April’s bundle patch.

 

According to Nigel’s blog post:

The Oracle 12.1.0.2 October 2017 BP and the Adaptive Optimizer

if you installled the patch 22652097 prior to apply the Bundle Patch 171018, the BP apply in the database should recognize that the patch was already in place and keep it activated. This is done through the fix control 26664361.

When fix_control 26664361:0 -> Patch 22652097 is not enabled: the parameter optimizer_adaptive_features (OAF) works

When fix_control 26664361:1 -> Patch 22652097 is enabled; optimizer_adaptive_features is discarded and the two new parameters have the priority: optimizer_adaptive_plans (OAP) and optimizer_adaptive_statistics (OAS).

But at my customer, I had another behavior.

My patching story might be very similar to yours!

When I started upgrading my customer’s database to 12c in early 2015, I experienced very soon the infamous problems with SQL Plan Directives (SPD) and Adaptive Dynamic Sampling (ADS) that I described in my paper: ADAPTIVE FEATURES OR: HOW I LEARNED TO STOP WORRYING AND TROUBLESHOOT THE BOMB .

Early fixes

When I was new to the problem, the quick fix for the problematic applications was to set OAF to FALSE.

Later, I discovered some more details and decided to opt for setting:

In other cases, I disabled the specific directives that were causing problems.

But many databases did not have so many problems, and I left the defaults.

Patch 22652097 on top of BP170718 

At some point, me and my customer decided to apply the fix 22652097, on top of BP170718 that was our current patch level at that time.

The patch installation on a test database was complaining about the optimizer_adaptive_feature set: this parameter was not used anymore. This issue is nicely explained by Flora in her post Patch 22652097 in 12.1 makes optimizer_adaptive_features parameter obsolete.

In order to apply that patch on the remaining databases, we did:

  • alter system reset optimizer_adaptive_features;
  • alter system reset “_optimizer_dsdir_usage_control”;
  • Applied the patch on binaries and datapatch on the databases.

The result at this point was that:

  • optimizer_adaptive_features was not set
  • optimizer_adaptive_plans was set to true
  • optimizer_adaptive_statistics was set to false.

It might seems superflous to say, but it’s not, the SQL Plan Directives were not used anymore: no Adaptice Dynamic Sampling and no performance problems.

Bundle Patch 180116

Three weeks ago, we installled the last Bundle Patch in order to fix some Grid Infrastructure problems, and the BP, as described in Nigel’s note (and Mike Dietrich and many other bloggers :-)) contains the patch 22652097.

According to Nigel’s post, the patch installation should have detected that the patch 22652097 was already there and activate it.

And indeed, after we applied the BP, the fix_control 26664361 was set to 1 (that means that the patch 22652097 is enabled). So we went live with this setup without additional checks.

One week later, we started experiencing performance problems again. I noticed immediately that the Adaptive Dynamic Sampling was very aggressive again, and the SQL Plan Directives used again.

But the fix was there AND ENABLED!

After a few tests, I realized that the SPD is not used anymore if I set optimizer_adaptive_statistics EXPLICITLY to false.

optimizer_adaptive_statistics must be set explicitly, the default does not work

And here’s the proof:

I use once again the great SPD example by Tim Hall (sorry Tim, it’s not the first time that I steal your work 🙂 ) . You can find here:

SQL Plan Directives in Oracle Database 12c Release 1 (12.1)

After applying the BP, I have the default parameter, not set explicitly, and the fix_control enabled:


If I run the test statement (again, find it here https://oracle-base.com/articles/12c/sql-plan-directives-12cr1) the directives are used:


but then I set the parameter explicitly:

and the SPD usage (and consequently, ADS), are gone:

Conclusion

Set the parameter EXPLICITLY when you apply the BP that contains the fix.

And ALWAYS test the behavior!

You can check how many statements use the dynamic sampling by following this short blog post by Dominic Brooks:

Which of my sql statements are using dynamic sampling?

HTH

The story of ACME and its CRM with serious SQL injections problems

Preface/Disclaimer

This story is real, but I had to mask some names and introduce some minor changes so that real people are not easy to recognize and the whole story does not sound offensive to anyone. This post is not technic, so my non-technical English will be fully exposed. Sorry for the many errors 🙂

ACME, The Company

ACME is a big, global company. It has a huge revenue and there are almost no competitors on the market that get close to it in terms of fame and earnings.

Its core business is heavily supported by its CRM system, that holds all the customers, contracts, prospects, suppliers…

FOOBAR CRM, The CRM system

Despite the CRM is not ACME’s core business, the data in there is really, really precious. Without prospects and customer data, the sales cannot close the deals.

The CRM application (let’s call it FOOBAR CRM) runs on a legacy architecture and it is as old as the company itself.

The architecture is the “old good style” web application that was common in the early 2000’s… : browser front-end (OK, you might think that it is not so old, huh?) , PHP application backed by Apache, MySQL database.

As you can see, quite old but not so uncommon.

One of the big concerns, as in every application lifecycle, is to maintain good code quality. At the beginning of the PHP era, when PHP was still popular, there was a lack of good frameworks (I’m not even sure if there are now, I’m sure Zend Framework was a cool MVC framework but it came out many years later). The result is that now the code maintenance of the application is literally a pain in the a**.

The customer is a noob in development, so when it has been founded and needed a CRM system, the management delegated the development to an external company (let’s call it FOOBAR).

FOOBAR, The software house

The company FOOBAR is as old as the ACME company. Respective founders were relatives: they started the business together and now that the founders left, the partnership is working so well that FOOBAR is also one the biggest resellers of ACME products (despite its business is loosely related to ACME’s business). FOOBAR is also at the same time a partner and a customer, and some member of its board are also part of ACME’s board.

What is important here, is that the advices coming from the “common board members” are considered much more important than the advices coming from ACME’s employees, customers and marketing department.

The code maintenability

ACME has started small, with a small “oldish” CRM system. But some years later ACME experienced a huge increase of customers, product portfolio, employees, revenues etc..

In order to cope with the increasing workload of the application, they scaled everything up/out: there are now tens of web servers nicely load balanced, some webcache servers, and they introduced Galera cluster in conjunction with some replicated servers to scale out the database workload.

The global business of ACME also required to open the FOOBAR CRM application to the internet, exposing it to a wide range of potential attacks.

In order to cope with increasing needs, FOOBAR proposed an increasing number of modules, pieces of code, tools to expand the CRM system. To maximize the profits, FOOBAR decided to employ only junior developers, unexperienced and not familiar at all with development of applications using big RDBMS systems and a very scarse sense of secure programming.

That’s not all!

In order to develop new features faster, ACME and FOOBAR have an agreement that let the end users develop their own modules in PHP code and plug them in the application, most of the times directly in production (you may think: that’s completely crazy, this should NEVER happen in a serious company! You know what? I agree 100%).

Uh, I forgot to mention, the employees that use the CRM application and have some development skills are VERY, VERY happy to have the permission to code on their own, because they can develop features or solve bugfixes on their own, depending on their needs.

Result: the code is completely out of control: few or no unit tests, no integration tests at all, poor security, tons of bugs.

The big SQL Injection problem

Among many bugs, the SQL injection is the most common. It started with some malicious users trying to play around with injection techniques, but now the attacks are happening more and more frequently:

  • The attacks come from many hackers (not related to each other)
  • Some hackers try to get money for that, some other just steal data, some other want just to mess up and low down ACME’s reputation…

everytime an attack is successful, ACME looses more and more contracts (and money).

The fix, up to now, was to track the hacker IP address AFTER the attack and add it to the firewall blacklist (not so clever, huh?).

Possible Solutions (according to the security experts)

ACME mandated an external company to do an assessment. The external company proposed a few things:

  • SOLUTION 1: Change completely the CRM software and use something more modern, modular, secure and developed by a company that hires top talents. There are tons of cloud vendors that offer CRM software as a Service, and other big companies with proven on-premises CRM solutions.
  • SOLUTION 2: Keep the current solution, but with a few caveats:
    • All the code accessing the database must be reviewed to avoid injections
    • only the experienced developers should have the right to write new code (possibly employees of the software house, that will be accountable for new vulnerabilities)
  • SOLUTION 3: Install content-sensitive firewalls and IDS that detect SQL Injection patterns and block them before they reach the web server and/or the database layer.

What the CRM users think

User ALPHA (the shadow IT guy): “We cannot afford to implement any of the solutions: we, as users, need the agility to develop new things for ourselves! And what if there is a bug? If I have to wait a fix from the software house, I might loose customers or contracts before the CRM is available again!”

User BRAVO (the skeptical): “SQL Injection is a complex problem, you cannot solve it just by fixing the current bugs and revoke the grants to develop new code to the non-developers”

User CHARLIE (the lawyer): “When I’ve been hired, I’ve been told that I had the right to drink coffee and develop my own modules. I would never work for a company that would not allow me to drink coffee! Drinking coffee and creating vulnerabilities, are both rights!”

User DELTA (the average non-sense): “The problem is not the vulnerable code, but all those motherf****** of hackers that try to inject malicious code. We should cure mental illness of geeks so they do not transform themselves in hackers.”

User ECHO (the hacker specialist): “If we ask stackoverflow to provide the IP addresses of the people that search for SQL injection code examples, we might preventively block their IP addresses on our external firewall!”

User FOXTROT (the false realist): “Hacker attacks happen, and there’s not much we can do against them. If we fix the code and implement security constraints, there will always be hackers trying to find vulnerabilities. You miss the real problem! We must cure this geeks/hackers insanity first!”

User GOLF (the non-sense paragon): “You concentrate on contracts lost because of SQL Injections, but the food in our restaurant sucks, and our sales also lose contracts because they struggle to fight stomach ache”.

User HOTEL (the denier): “I’ve never seen the logs that show the SQL Injections, I am sure it is a complot of the no-code organizations meant to sell us some WYSIWIG products”.

User INDIA (the unheard): “Why can’t we just follow what the Security Experts suggest and see if it fixes the problem?”

What the management thinks

“We send thought and prayers to all our sales, you are not alone and you’ll never be. (… and thanks for the amazing party, FOOBAR, the wine was delicious!)”

What ACME did to solve the problem

Absolutely nothing.

Forecast

More SQL Injections.

 

UPDATE 20.02.2018

Many people asked me who was the ACME customer that had the SQL injection problem. None. It is an analogy to the US mass shootings that happen more and more frequently, the last one at the time of writing: https://en.wikipedia.org/wiki/Stoneman_Douglas_High_School_shooting

This post is intended to show that, if explained as it was an IT problem, the solution would sound so easy that nobody would have any doubts about the steps that must be done.

Unfortunately, it is not the case, and the US is condamned to have more and more mass shootings because nobody wants to fix the problem. 🙁

My own Dbvisit Replicate integration with Grid Infrastructure

I am helping my customer for a PoC of Dbvisit Replicate as a logical replication tool. I will not discuss (at least, not in this post) about the capabilities of the tool itself, its configuration or the caveats that you should beware of when you do logical replication. Instead, I will concentrate on how we will likely integrate it in the current environment.

My role in this PoC is to make sure that the tool will be easy to operate from the operational point of view, and the database operations, here, are supported by Oracle Grid Infrastructure and cold failover clusters.

Note: there are official Dbvisit  online resources  about how to configure Dbvisit Replicate in a cluster. I aim to complement those informations, not copy them.

Quick overview

If you know Dbvisit replicate, skip this paragraph.

There are three main components of Dbvisit Replicate: The FETCHER, the MINE and the APPLY processes. The FETCHER gets the redo stream from the source and sends it to the MINE process. The MINE process elaborates the redo streams and converts it in proprietary transaction log files (named plog). The APPLY process gets the plog files and applies the transactions on the destination database.

From an architectural point of view, MINE and APPLY do not need to run close to the databases that are part of the configuration. The FETCHER process, by opposite, needs to be local to the source database online log files (and archived logs).

Because the MINE process is the most resource intensive, it is not convenient to run it where the databases reside, as it might consume precious CPU resources that are licensed for Oracle Database. So, first step in this PoC: the FETCHER processes will run on the cluster, while MINE and APPLY will run on a dedicated Virtual Machine.

dbvisit_gi_overview

Clustering considerations

  • the FETCHER does NOT need to run on the server of the source database: having access to the online logs through the ASM instance is enough
  • to avoid SPoF, the fetcher should be a cluster resource that can relocate without problems
  • to simplify the configuration, the FETCHER configuration and the Dbvisit binaries should be on a shared filesystem (the FETCHER does not persist any data, just the logs)
  • the destination database might be literally anywhere: the APPLY connects via SQL*Net, so a correct name resolution and routing to the destination database are enough

so the implementation steps are:

  1. create a shared filesystem
  2. install dbvisit in the shared filesystem
  3. create the Dbvisit Replicate configuration on the dedicated VM
  4. copy the configuration files on the cluster
  5. prepare an action script
  6. configure the resource
  7. test!

Convention over configuration: the importance of a strong naming convention

Before starting the implementation, I decided to put all the caveats related to the FETCHER  resource relocation on paper:

  • Where will the configuration files reside? Dbvisit has an important variable: the Configuration Name. All the operations are done by passing a configuration file named /{PATH}/{CONFIG_NAME}/{CONFIG_NAME}-{PROCESS_TYPE}.ddc to the dbvrep binary. So, I decided to put ALL the configuration directories under the same path: given the Configuration Name, I will always be able to get the configuration file path.
  • How will the configuration files relocate from one node to the other? Easy here: they won’t. I will use an ACFS filesystem
  • How can I link the cluster resource with its configuration name? Easy again: I call my resources dbvrep.CONFIGNAME.PROCESS_TYPE. e.g. dbvrep.FROM_A_TO_B.fetcher
  • How will I manage the need to use a new version of dbvisit in the future? Old and new versions must coexist: Instead of using external configuration files, I will just use a custom resource attribute named DBVREP_HOME inside my resource type definition. (see later)
  • What port number should I use? Of course, many fetchers started on different servers should not have conflicts. This is something that might be either planned or made dynamic. I will opt for the first one. But instead of getting the port number inside the Dbvisit configuration, I will use a custom resource attribute: DBVREP_PORT.

Considerations on the FETCHER listen address

This requires a dedicated paragraph. The Dbvisit documentation suggest to  create a VIP, bind on the VIP address and create a dependency between the FETCHER resource and the VIP. Here is where my configuration will differ.

Having a separate VIP per FETCHER resource might, potentially, lead to dozens of VIPs in the cluster. Everything will depend on the success of the PoC and on how many internal clients will decide to ask for such implementation. Many VIPs == many interactions with network admins for address reservation, DNS configurations, etc. Long story short, it might slow down the creation and maintenance of new configurations.

Instead, each FETCHER will listen to the local server address, and the action script will take care of:

  • getting the current host name
  • getting the current ASM instance
  • changing the settings of the specific Dbvisit Replicate configuration (ASM instance and FETCHER listen address)
  • starting the FETCHER

Implementation

Now that all the caveats and steps are clear, I can show how I implemented it:

Create a shared filesystem

Install dbvisit in the shared filesystem

Create the Dbvisit Replicate configuration on the dedicated VM

Copy the configuration files from the Dbvisit VM to the cluster

Prepare an action script

Configure the resource

Test!

 

Also the relocation worked as expected: when the settings are modified through:

The MINE process get the change dynamically, so no need to restart it.

Last consideration

Adding a hard dependency between the DB and the FETCHER will require to stop the DB with the force option or to always stop the fetcher before the database. Also, the start of the DB will pullup the FETCHER (pullup:always) and the opposite as well. We will consider furtherly if we will use this dependency or if we will manage it differently (e.g. through the action script).

The hard dependency declared without the global keyword, will always start the fetcher on the server where the database runs. This is not required, but it might be nice to see the fetcher on the same node. Again, a consideration that we will discuss furtherly.

HTH

Ludovico

Get the Most out of Oracle Data Guard – The material

Here we go: as usual, the feedback that I usually get after my talks (specifically, after POUG High Five conference), is if I will share my demo scripts and material.

Sadly, the demos I am doing for my presentation “Get the most out of Oracle Data Guard” are quite tied to an environment built for the purpose of the demos. So, do not expect to get scripts easy to use as is, but rather to get some ideas beyond the demo themselves.

I hope they will help to get the whole picture.

Of course, if you need to implement a cloning strategy based on Data Guard or any other solution that I describe in this post, please feel free to contact me, I will be glad to help you implement it in your environment.

Slides

Demo 1

Video:

Scripts:

 

Demo 2

Video:


Scripts:

 

Demo 3

Video:

Scripts:

Preparation:

snap_acfs.pl

 

snap_databasae.pl

clone_from_snap.pl

Cheers

Ludovico