June 27, 2016

Oracle GoldenGate EventMarker System (EMS) to help facilitate Oracle Real Application Testing (RAT)

In this blog I want to share some neat tricks available to help facilitate Oracle RAT testing on the new target system or new target DR system when using Oracle GoldenGate as the migration replication tool.  We don't need to cover in detail why Oracle GoldenGate is a tool that every IT shop should have in its IT tool box, as I have already beaten that topic to death...it is simple, Oracle GoldenGate is a tool to help make many impossible migration projects possible. This tool and its capabilities have helped make tens of thousands of migrations possible and successful.

Now, onto the topic of interest. When doing Oracle-to-Oracle migrations, we all know how important it is to validate the new target platform, and validation should be done before cutover, not post cutover.

If using Oracle GoldenGate as the migration replication tool (which you should be!), you have access to some lesser-known but helpful features of Oracle GoldenGate that can help with the data synchronization requirement for Oracle RAT (Real Application Testing) capture and replay on the new target state environment.  RAT is best run when the data is in sync between source and target at the time of capture. We must remember, RAT captures the source's transactions with the data/bind values used for replay. To avoid a high replay error percentage (or a complete replay failure), the data should be in sync at the time of replay (in sync as of the time the capture started).

With the full tool set from Oracle: Oracle RDBMS, Oracle GoldenGate, Oracle RAT and Oracle Data Guard (not required), you can satisfy the data-in-sync requirement and validate the new target state platform, tuning and tweaking as much as needed before cutover (as many times as you like to prove performance and scalability). The end result.....an extremely successful cutover for your customers and application owners (and usually a very short, if any at all, war room presence), and hopefully IT will be looked at as a hero instead of a villain.

Oracle GoldenGate includes a little-known feature called the EventMarker System (EMS). In Oracle GoldenGate, this is nothing more than a set of parameters added to existing mappings (or to the mapping configuration) that can be used to watch for events in the replication stream and respond to those events.
There are lots and lots of uses for EMS, but let's stick with using it to control our replication to the target, allowing us to stop replication at the right time so the replay can occur against data that is in sync as of the time of capture.

Usually the triggering event for EMS is some type of DML transaction that occurred in the source and that matches criteria we specify. The criteria are quite flexible with EMS (you code them using parameters and options), but for simplicity, let's say our criterion is a DML transaction against a table in the source that has certain column values in the transaction. We will look for this type of transaction on the table, evaluate the data in the transaction, and respond (or not respond) when the column values match our criteria.

In our case, when the RAT capture on the source starts, we also want to insert a record to the control table that will be extracted and pumped over to the target system. The target system will be applying data to keep the target in sync and once it sees this EMS transaction, it will respond by stopping all apply processes (replicats). This will ensure that our target is at the same point in time data-wise as the source system so our replay will have very few errors due to out of sync data on the target.
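For illustration only, a minimal sketch of what that control-table insert might look like (the schema, table and column names here are hypothetical, not a required layout):

-- Issued on the source at (or just before) RAT capture start
INSERT INTO GGADMIN.EMS_CONTROL (EVENT_NAME, EVENT_ACTION, EVENT_TS)
VALUES ('RAT_CAPTURE_START', 'STOP_REPLICAT', SYSTIMESTAMP);
COMMIT;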

We will also use Flashback features of the Oracle RDBMS on the primary target system or on the target's DR environment, which is already running and being synced with Data Guard. We can choose to run the replay on the primary or on a snapshot standby depending on our needs and the lag threshold for the target itself. If data must be kept in sync regularly due to high volume, we can run the replay on the target system's DR standby.
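A rough sketch of the database-side commands involved, assuming a guaranteed restore point named BEFORE_RAT_REPLAY and a standby named 'targetdr' (both hypothetical), with the snapshot standby conversion shown via the Data Guard broker:

-- Replay on the primary target: take a guaranteed restore point first, stop the replicats, run the replay
SQL> CREATE RESTORE POINT before_rat_replay GUARANTEE FLASHBACK DATABASE;
-- ...after the replay, rewind the database and resume the OGG apply
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT
SQL> FLASHBACK DATABASE TO RESTORE POINT before_rat_replay;
SQL> ALTER DATABASE OPEN RESETLOGS;

-- Replay on the target's DR standby instead: convert it to a snapshot standby for the replay
DGMGRL> CONVERT DATABASE 'targetdr' TO SNAPSHOT STANDBY;
-- ...run the replay against the snapshot standby, then convert back (this discards the replay changes)
DGMGRL> CONVERT DATABASE 'targetdr' TO PHYSICAL STANDBY;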


The entire process from a high level looks like this:

1) Build the new target state environment
    - with a shell DB on the new target platform/version as you like
    - configure the shell DB as you like (physical and storage layout for tablespaces)
2) Implement the requirements for Oracle GoldenGate capture on the legacy source
    - database level supplemental logging
    - userid with privs for Oracle GoldenGate
   - correct init.ora parameters needed for replication
3) Install Oracle GoldenGate on the source system, configure and start extraction
    - create subdirs
    - create configuration files
    - schematrandata (or trandata for all schemas/tables in replication scope)
4) Start up capture and validate extraction is:
    -  running and stable
    -  no long-running transactions existed before the extract started that are still running (they
       would be skipped)
5) Install Oracle GoldenGate on the target system and configure for target apply
     - start the manager process for communications
6) Create the Pump on the source system and start pumping the CDC data to the target system
7) Start a consistent export on the source system (using flashback_scn; see the sketch after this list)
    - for all schemas/tables in replication scope (this step may be in phases depending on issues)
    - also, a full metadata export may be needed
8) Import the consistent exported data into the target system
9) Create replicats to start applying the data, starting after the export SCN (CSN in GoldenGate terms)
     - special command to start the replicats from a particular CSN (see the sketch after this list)
10) Apply any RAT patches needed to the source or target database platform/versions
11) Create the table needed for EventMarker system (a simple control table in the source / target)
      - and add table-level supplemental logging (not required since it is not a critical table,
        but a good practice)
12) Implement the Oracle GoldenGate EventMarker system mappings
      -  in the source extract, pump(s), replicat(s)
13) Restart all the Oracle GoldenGate processes to reload the mappings/configuration now
      including EMS parameter/mappings
14) Let replication catch up on the target (if you desire to validate replication is working)
      -  not required at all, but a good idea at first to ensure all replication is working
15) Test EMS a time or two just to be sure the mappings work as configured
     - mappings can stop extract, pump or replicats or all
     - suggest only stopping replicats, or pumps but leave extracts running at all times
16) Start RAT capture and insert the control table DML on the source
      -  a simple insert with 4-5 column values as part of it
          - the insert uses key words for certain columns to instruct replication to stop
          - the column values will match criteria we setup in the mappings for evaluation and matching
      - can do the insert before the RAT command or after
17) While RAT capture is running, verify the replicats have all seen the DML and acted upon it by stopping
18) Once RAT capture is complete, do the remaining steps needed to process the capture
19) Now we are ready (or should be) for replay:
       - replicats should all be stopped at the same point in time due to our EMS and control DML
20) Decide where to run the replay, on the primary or the standby system
      - for primary, stop all OGG processes, create a GRP (guaranteed restore point) on the target and replay
      - for standby, stop DG shipping and apply, create a GRP and open as a snapshot standby
        - perform the replay on the standby

Note: Flashback DB is our friend here. We can flashback and replay the capture as many times as needed to tune/tweak the target system to our liking

21) Flashback the DB to the same state as before the replay (the point where the OGG apply left off)
22) Restart the Oracle GoldenGate processes on the target to catch up
23) Repeat as necessary to make your migration a complete success.
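As a rough sketch of steps 7 and 9 above (directory, dump file, SCN and process names are all hypothetical):

-- Step 7: consistent export on the source, pinned to a single SCN
SQL> SELECT CURRENT_SCN FROM V$DATABASE;      -- say it returns 123456789

$ expdp system DIRECTORY=dp_dir DUMPFILE=app_%U.dmp SCHEMAS=APP FLASHBACK_SCN=123456789 PARALLEL=4

-- Step 9: start the replicat so it applies only changes committed after that SCN (CSN)
GGSCI> START REPLICAT rtgt1, AFTERCSN 123456789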


Sample EMS mappings and control table:

....still updating....have to run off to do a project (coming soon, 7-10-2016)
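In the meantime, here is a rough sketch of the kind of control table and EMS mappings involved (all names and the filter criteria are hypothetical examples; the key parameters to study are FILTER and EVENTACTIONS):

-- Control table, created on the source and target
CREATE TABLE GGADMIN.EMS_CONTROL (
  EVENT_NAME    VARCHAR2(30),
  EVENT_ACTION  VARCHAR2(30),
  EVENT_TS      TIMESTAMP DEFAULT SYSTIMESTAMP
);

-- Extract and pump: simply capture and pass the control table through
TABLE GGADMIN.EMS_CONTROL;

-- Replicat: when a matching control row arrives, log it and stop the apply process
MAP GGADMIN.EMS_CONTROL, TARGET GGADMIN.EMS_CONTROL, &
  FILTER (@STRFIND (EVENT_ACTION, 'STOP_REPLICAT') > 0), &
  EVENTACTIONS (LOG INFO, STOP);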

Good luck and let me know if you have any questions or feedback.

Gregory Thomas



June 21, 2016

Oracle GoldenGate Active-Active for Oracle-to-Oracle (Part 1)

Since the acquisition of GoldenGate by Oracle in 2009, Oracle has replaced the Streams product with the GoldenGate product as the highest tech stack/configuration on the Oracle MAA chart. Using GoldenGate for active-active provides the highest level of availability for a system/application. It is strongly suggested that you first work through the Oracle MAA tiers (starting at the bottom) with your tech stack to ensure that the lower levels of MAA cannot, or will not, completely meet your HA/DR requirements. Moving to an active-active architecture with Oracle GoldenGate is certainly possible (this product makes the impossible possible in many cases); however, moving to an active-active solution is best viewed as a strategic initiative (you are driving/working towards the final solution), and not simply a quick investment you turn on and your active-active solution is ready to go.

Active-Active HA/DR solutions are certainly more complex and can lead to higher administration effort than lower-level MAA tier solutions, such as RAC with Active Data Guard on both sides.

With that said, Oracle GoldenGate is a truly amazing, can't-live-without-it product for every IT shop. In my humble opinion, every IT shop should have the GoldenGate product as part of its IT tool belt; its uses range from data integration of any kind, to migrations/upgrades/hardware refreshes, and finally to the ultimate in HA/DR for systems/applications.

In this blog, we will cover a simple setup for implementing an active-active solution using GoldenGate, replicating a test schema bi-directionally and using GoldenGate's built-in CDR functionality for time-based conflict detection and resolution. CDR is formally known as "Conflict Detection and Resolution", and is a pre-defined set of conflict detection and resolution capabilities built into the Oracle GoldenGate product itself. There are several options to CDR; however, time-based detection and resolution is by far the most common and the one we will use, since it is simple to understand (which record is older and which is newer is simple to determine using a timestamp).

To get started, we must define active-active, as this term can be widely interpreted in the wild, and, in Oracle GoldenGate fashion, we must define the systems as a topology, as there are limitations and non-supported objects/functionality with various topologies. These limitations are not specific to Oracle GoldenGate, as they are generally limitations of all products that provide logical replication. To be transparent, logical replication is not a completely fool-proof science yet; therefore, there are limitations and areas of concern (or redesign) that must be addressed to determine if the solution will work and if the solution is manageable (cost effective).

Active-Active can mean different things to different people (gotta love us humans). Therefore, we need to define what active-active means to us in terms of a GoldenGate topology and the solution we have in mind. The classification (what topology we call it) is determined by the number of systems involved, and their roles in the replication setup:

bi-directional          --> two systems only, both act as a source and target, both allow transactions
n-tier (peer-to-peer) --> three or more systems, all acting as sources and targets, all allow transactions

In this blog, to keep it shorter, we will demonstrate active-active bi-directional; this topology offers the most functionality because we can capture and replicate both DML and DDL on both systems and in both directions. We will also focus only on Oracle-to-Oracle replication using modern versions of Oracle (11.2.0.3 or higher on either side).  One of Oracle GoldenGate's strongest features is that it is heterogeneous, always has been, and it keeps getting better and better thanks to Oracle and their great people. This means it can not only replicate data between different relational platforms (SQLServer to Oracle, DB2 to Oracle, MySQL to Oracle, and many others, and vice versa), but also between different Oracle versions.  For example, it is possible (and common, we do this all the time) to replicate Oracle 9i/10g to a new hardware platform and Oracle version, such as 11.2.0.4 or 12c (single instance or PDB).  This type of migration work is going on all the time around the world and GoldenGate is the market leader in this space for these projects (see how GoldenGate is the "swiss army knife" in your IT tool box).

Now on to the requirements for an active-active bi-directional setup. There are requirements for logical replication in general, and data model issues we must address to ensure a successful bi-directional implementation. We must keep in mind that each side (both systems) will act as both a source and a target in the solution. Therefore, we must address the requirements for both a source system and a target system in the solution setup. In this blog we will use modern versions of the Oracle RDBMS; let me state, it is possible in most cases to do bi-directional using older Oracle versions for one side or both, but these solutions usually require much more configuration and tweaking for bi-directional to work (and have more limitations as well, which does not mean it will not work, just that more workarounds or detours are needed).

Source requirements for logical replication (we will use the requirements for integrated capture):

1) Source Database (capture requirements)
    a) RDBMS version supported by GoldenGate Integrated capture
    b) Database is in archive log mode (LogMiner requires it)
    c) Database level supplemental logging has been enabled (minimum)
    d) Init.ora parameters for GoldenGate set
         - enable_goldengate_replication=true
         - streams_pool_size = 1.6G
    e) One-off patches for GoldenGate applied to database for integrated processes
        - PSU level specific one-off for Integrated processes in the databases
    f) Oracle sequences addressed for uniqueness between the sites
        - common to use odd/even strategy (odd for one site, even for the other)
        - there are other configurations as well that support n-tier
    g) Oracle user created as the extract user and privileges for extract granted
        - privileges are specific to the Oracle RDBMS and GoldenGate version (check the install guide)
    h) Table level supplemental logging on all schemas/tables involved in replication
        - SCHEMATRANDATA (or TRANDATA)
        - SCHEMATRANDATA is recommended over TRANDATA with modern Oracle RDBMS
        - once the schemas/objects exist

2) Target Database requirements (apply requirements)
    a) Init.ora parameters for GoldenGate set
         - enable_goldengate_replication=true
         - streams_pool_size = 1.6G
    b) Oracle sequences addressed for uniqueness between the sites
        - common to use odd/even strategy (odd for one site, even for the other)
        - there are other configurations as well that support n-tier
    c) One-off patches for GoldenGate applied to database for integrated processes
        - PSU level specific one-off for Integrated processes in the databases
    d) Oracle user created as the apply (replicat) user and privileges for apply granted
        - privileges are specific to the Oracle RDBMS and GoldenGate version (check the install guide)
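A minimal sketch of the database-side prerequisites above, run on both databases since each acts as a source and a target (the OGG user matches the example used later in this post; sizes are only placeholders):

SQL> ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;                        -- minimum database-level supplemental logging
SQL> ALTER SYSTEM SET ENABLE_GOLDENGATE_REPLICATION=TRUE SCOPE=BOTH;
SQL> ALTER SYSTEM SET STREAMS_POOL_SIZE=1600M SCOPE=SPFILE;           -- sized to your extract/apply needs
SQL> EXEC DBMS_GOLDENGATE_AUTH.GRANT_ADMIN_PRIVILEGE('OGG');          -- capture/apply privileges for the OGG user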

The above are hard requirements. As mentioned, logical replication is not a perfect science yet (getting closer); as such, there will likely be other changes/tweaks needed, based on the data and data model in use, that will need to be addressed in the setup.

For active-active (or any logical replication), there are requirements to meet for a source to capture and a target to apply, and then there are areas of concern to address as a result of logical replication (SQL apply). When setting up active-active, we want to address and prevent as many replication issues as we can with our configuration/setup (this includes any changes to the data model); most of our time will be spent configuring (changing) to avoid issues completely, as well as setting up the configuration to handle the remaining issues automatically (such as CDR parameters).

Logical replication issues that must be addressed for performance or data integrity:

1) All tables need a PK or UI constraint
    - for row uniqueness
    - for apply performance (WHERE clause used on apply)
2) Triggers cannot double fire on the target
    - the resulting data changes from triggers firing were most likely replicated as well
3) Cascading constraint operations are most likely replicated (in most cases)
4) Error handling (conflict detection) and reporting
    - how to handle apply errors
    - how to report and monitor apply errors
    - each time CDR is invoked, this is a performance hit to the apply
       - it works well, but we need (and should want) to know when, where and how frequently it occurs

Now let's get into the GoldenGate configuration pieces for both systems; remember, both systems will function as both a source and a target.

GoldenGate is a very extensible/configurable product. We will not go into the details here; we will simply look at a simple example and will not cover all the potential ways of doing things in GoldenGate (such as initial loads or data movement for instantiation).  The product can do, or be configured to do, almost (I said almost) anything you need.
Once the initial configuration is shown, we will look at using other product features to streamline a configuration, for issues such as hundreds or thousands of schemas or tables, or different date columns used across different schemas' tables. We will look at how macros/variables in GoldenGate can help save you a tremendous amount of configuration time when needing to make changes to hundreds or thousands of mappings in an implementation. There is also a blog post on these topics as well.

In this example, we have created the test schema on both sides before replication (extract/pump/replicat) was setup, just to keep it simple. There are many ways to do this online and in-place with an existing replication implementation that is up and running (again, the powerful capabilities of GoldenGate). Our schema also included a common TIMESTAMP column in each table with the same name to make the mappings easier and the ability to use a wildcard mapping with the CDR parameters. If the date columns to be used have different names, then you will need separate mappings for each, or use macros and pass the column name as a variable (more on that much later in the blog), or see the blog post on OGG include files, macros and variables.

Our test schema is called "GGTEST" and our GoldenGate user in both databases is "OGG".
Our GoldenGate user is used for both capture and apply in this simple setup in both databases.

To keep it simple, we will not use or show many fancy features or capabilities that are possible, such as Eventmarker, heartbeat, etc.; we will also use the same process names for simplicity.

Create the GGTEST schema and objects on both sides. We used a script that creates the user and objects and loads a small amount of seed data into the schema. After the script runs, the schema and data on both sides are identical.  Since they are identical on both sides, and no transactions are occurring, we can set up the replication from this point forward, which is simple to do.

Now the GoldenGate steps:

1) Create the GGTEST schema on both systems

 SQL> @create-GGTEST-objects.sql  (creates GGTEST and then connects as GGTEST)

2) Enable table-level supplemental logging on the schemas in both databases
    - requirement for logical replication (and must be done via GGSCI)

    # ./ggsci
    GGSCI> DBLOGIN USERID ogg PASSWORD oggpwd
    GGSCI> ADD SCHEMATRANDATA GGTEST

3) Change the sequences on each system (in the GGTEST schema) so generated values do not collide (see the sketch after this step)
    - on the first system, to start at 1 and increment by 2 (results in "odd" numbers only)
         - run ALTER SEQUENCE on each sequence in the application schema
    - on the second system, to start at 2 and increment by 2 (results in "even" numbers only)
         - run ALTER SEQUENCE on each sequence in the application schema
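Since GGTEST is a throwaway test schema, one simple way to get the odd/even split is to recreate each sequence (ORDERS_SEQ is a hypothetical name; on a real system you would adjust the existing sequences instead):

-- System one (odd values only)
DROP SEQUENCE GGTEST.ORDERS_SEQ;
CREATE SEQUENCE GGTEST.ORDERS_SEQ START WITH 1 INCREMENT BY 2;

-- System two (even values only)
DROP SEQUENCE GGTEST.ORDERS_SEQ;
CREATE SEQUENCE GGTEST.ORDERS_SEQ START WITH 2 INCREMENT BY 2;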

4) Create the GoldenGate PRM files (configuration files) for each process type on both sides

Manager PRM (mgr.prm):

PORT 7809
DYNAMICPORTLIST 7810-7820

PURGEOLDEXTRACTS ./dirdat/e1*, USECHECKPOINTS, MINKEEPDAYS 7
PURGEOLDEXTRACTS ./dirdat/r1*, USECHECKPOINTS, MINKEEPDAYS 7
AUTOSTART JAGENT*
AUTORESTART ER *, RETRIES 10 , WAITMINUTES 5, RESETMINUTES 60

LAGCRITICALMINUTES 10
LAGREPORTMINUTES 10


GLOBALS:

GGSCHEMA OGG
CHECKPOINTTABLE OGG.CHECKPOINTS

-- File creation security file mask (for trail files and log files)
OUTPUTFILEUMASK 0002

-- SYSLOG facility logging level (logging to SYSLOG facility)
SYSLOG NONE


Extract PRM (extest.prm):

EXTRACT extest

SETENV (ORACLE_HOME="/u01/app/ora11g/product/11.2.0/dbhome_3")
SETENV (ORACLE_SID="DBM1")
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")

USERID ogg PASSWORD oggpwd

DISCARDFILE ./dirrpt/extest.dsc, APPEND, MEGABYTES 10
DISCARDROLLOVER AT 00:01 ON SUNDAY

TRANLOGOPTIONS INTEGRATEDPARAMS (max_sga_size 1024, parallelism 2)
TRANLOGOPTIONS EXCLUDEUSER OGG
GETAPPLOPS
IGNOREREPLICATES

DDL &
   INCLUDE OTHER &
   INCLUDE MAPPED

-- CDR parameter
LOGALLSUPCOLS
-- NOCOMPRESSUPDATES
-- NOCOMPRESSDELETES
-- GETUPDATEBEFORES

STATOPTIONS RESETREPORTSTATS, REPORTFETCH, REPORTDETAIL
REPORTCOUNT EVERY 30 MINUTES, RATE
REPORTROLLOVER AT 00:01 ON SUNDAY
ROLLOVER AT 00:00
WARNLONGTRANS 4H, CHECKINTERVAL 4H

EXTTRAIL ./dirdat/e1

-- TABLE/SEQUENCE Mappings

-- TABLE GGTEST.*;
TABLE GGTEST.*, GETBEFORECOLS &
(ON UPDATE KEYINCLUDING(OGGTS), &
 ON DELETE KEYINCLUDING(OGGTS));


Pump PRM:

EXTRACT pmtest

SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV (ORACLE_HOME="/u01/app/ora11g/product/11.2.0/dbhome_3")
SETENV (ORACLE_SID="DBM1")

USERID ogg PASSWORD oggpwd

DISCARDFILE ./dirrpt/pmtest.dsc, APPEND, MEGABYTES 10
DISCARDROLLOVER AT 00:01 ON SUNDAY

RMTHOST  targetb, MGRPORT 7809, COMPRESS
RMTTRAIL ./dirdat/r1

-- TABLE/SEQUENCE Include/Exclude
PASSTHRU

TABLE GGTEST.*;


Replicat PRM:

REPLICAT rptest

SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV (ORACLE_HOME="/u01/app/ora11g/product/11.2.0/dbhome_3")
SETENV (ORACLE_SID="DBM1")

USERID ogg PASSWORD oggpwd

DISCARDFILE ./dirrpt/rptest.dsc, APPEND, MEGABYTES 10
DISCARDROLLOVER AT 00:01 ON SUNDAY

DBOPTIONS INTEGRATEDPARAMS(parallelism 8)
-- DBOPTIONS SUPPRESSTRIGGERS  (this is done by default in OGG 12c)

DDL &
  INCLUDE OTHER &
  INCLUDE MAPPED

DDLOPTIONS UPDATEMETADATA
DDLOPTIONS REPORT

STATOPTIONS RESETREPORTSTATS, REPORTFETCH, REPORTDETAIL
REPORTCOUNT EVERY 30 MINUTES, RATE
REPORT AT 00:00
REPORTROLLOVER AT 00:00 ON SUNDAY

ASSUMETARGETDEFS

-- APPLYNOOPUPDATES
-- ALLOWNOOPUPDATES
-- BATCHSQL

-- MAP Mappings

-- MAP GGTEST.*, TARGET GGTEST.*;
MAP GGTEST.*, TARGET GGTEST.*, &
COMPARECOLS( &
   ON UPDATE KEYINCLUDING (OGGTS), &
   ON DELETE KEYINCLUDING (OGGTS)), &
 RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAX (OGGTS))), &
 RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGGTS))), &
 RESOLVECONFLICT (UPDATEROWMISSING,(DEFAULT, OVERWRITE)), &
 RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)), &
 RESOLVECONFLICT (DELETEROWMISSING,(DEFAULT, DISCARD));


5) Create the GoldenGate processes on both sides (using GGSCI)

# ./ggsci
GGSCI> start mgr   (manager must be started and running)

Extract:

DBLOGIN USERIDALIAS ogguser
-- DBLOGIN USERID ogg PASSWORD oggpwd
register extract extest database
ADD EXTRACT extest, INTEGRATED TRANLOG, BEGIN NOW
ADD EXTTRAIL ./dirdat/e1, EXTRACT extest, MEGABYTES 1000

Pump (no need to log in to the source db):

ADD EXTRACT pmtest, EXTTRAILSOURCE ./dirdat/e1
ADD RMTTRAIL ./dirdat/r1, EXTRACT pmtest, MEGABYTES 1000

Replicat:

--ADD REPLICAT rptest, EXTTRAIL ./dirdat/r1, CHECKPOINTTABLE OGG.CHECKPOINTS
ADD REPLICAT rptest, INTEGRATED, EXTTRAIL ./dirdat/r1

We now have our schemas created and in sync between the two systems (same number of objects in the schema and each table with the same number of rows and column values - critical to have an in-sync starting point), have met all the logical replication requirements on both sides (so each can be a source and a target), and we have created our Oracle GoldenGate processes and configuration files; now we are ready to start the processes and validate they are running.  Once that is done we will do some test DML and DDL on each side (one at a time at this point) to validate replication is working in both directions.

6) On each system, start the Oracle GoldenGate processes
    -  recall, no transactions are occurring against our test schema yet

GGSCI> start mgr
GGSCI> start er *
GGSCI> info all

Once started, verify using "info all" that all processes are running and nothing is in an "ABEND" state.  If any process has ABENDED, we need to check the ggserr.log file and possibly the report file for the process and fix the issue.

Our Oracle GoldenGate processes are now running on both sides. Each system is running one extract and one pump to capture and send the CDC data to the other system. In addition, each system is running one replicat to apply transactions coming from the other system.  Each system will have the following processes running (Manager must always be running):

manager (mgr)
extract (extest)
pump (pmtest)
replicat (rptest)

At this point, we can test replication, and we will do so from each side. Pick a particular side and do a few inserts into a table in the GGTEST schema. Then verify the inserts (rows) exist on the other system. Now, update and delete some rows from the first system and validate those changes have been applied to the second system.

Now, create a new table in the GGTEST schema and verify it replicated.  You can do this via CREATE TABLE or via various CTAS commands, both will work.
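For example (GGTEST.ORDERS and its columns are hypothetical; any table in the schema works):

-- On system A: basic DML test
INSERT INTO GGTEST.ORDERS (ORDER_ID, STATUS, OGGTS) VALUES (1, 'NEW', SYSTIMESTAMP);
COMMIT;
UPDATE GGTEST.ORDERS SET STATUS = 'SHIPPED', OGGTS = SYSTIMESTAMP WHERE ORDER_ID = 1;
COMMIT;
DELETE FROM GGTEST.ORDERS WHERE ORDER_ID = 1;
COMMIT;

-- On system A: basic DDL test (plain CREATE TABLE and CTAS should both replicate)
CREATE TABLE GGTEST.DDL_TEST (ID NUMBER PRIMARY KEY, NOTE VARCHAR2(50), OGGTS TIMESTAMP(6) DEFAULT SYSTIMESTAMP);
CREATE TABLE GGTEST.DDL_TEST_COPY AS SELECT * FROM GGTEST.DDL_TEST;

-- Then query the same objects on system B to confirm the changes arrived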

Repeat all the above transactions on the second system and validate all changes are replicated to the first system.

If all goes well, we have bi-directional up and running for this schema.

Now to cover, in a bit more depth, some things we have not covered yet that are worth mentioning and may be needed for replication and CDR to work effectively:
  1. Oracle GoldenGate 12c will suppress triggers from firing on the target side by default (remember our need to prevent double firing of triggers)
    - no need to disable or modify triggers with newer GoldenGate and RDBMS versions
  2. Oracle sequences were changed to an odd/even strategy because we cannot replicate sequences in a bi-directional configuration. However, Oracle GoldenGate can replicate sequence changes in other topologies, which is especially helpful for migrations using the uni-directional topology
  3. Oracle GoldenGate will defer cascading constraint checks until the end of the transaction, as cascading DML will be replicated in most cases (unless going to another schema not in the config)
  4. The OGGTS column in each table, which we are using to detect and resolve conflicts, must be maintained on every insert (default value) and update transaction for each table, or we must force the column value to be logged along with the data for every transaction. This can be accomplished via one of the following:
      - add a default value on the OGGTS column as a timestamp(6) data type
      - modify the application to update the OGGTS column with every UPDATE
      - force the column value to be logged along with the other transaction data by creating a custom supplemental logging group, forcing the database to log the OGGTS column value in the redo
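A sketch of those three options against a hypothetical GGTEST.ORDERS table:

-- Option 1: default value on the CDR column (covers inserts that omit the column)
ALTER TABLE GGTEST.ORDERS MODIFY (OGGTS DEFAULT SYSTIMESTAMP);

-- Option 2: the application (or a trigger) sets OGGTS on every UPDATE
UPDATE GGTEST.ORDERS SET STATUS = 'SHIPPED', OGGTS = SYSTIMESTAMP WHERE ORDER_ID = 1;

-- Option 3: force OGGTS into the redo on every change with a custom supplemental log group
ALTER TABLE GGTEST.ORDERS ADD SUPPLEMENTAL LOG GROUP ogg_cdr_cols (OGGTS) ALWAYS;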
Now that replication has been verified in both directions (both DML and DDL are captured and applied in both directions), we can set up a few scenarios to test our CDR (Conflict Detection and Resolution) parameters/configuration and validate that resolution works as expected. It is highly recommended to test using a test schema, as you can create the conflict scenarios much more easily than you can using your application.  Remember, we are using time-based conflict detection and time-based resolution.

CDR is a large topic (with different types of conflict detection and resolution parameters) so we will not cover it in detail, but just know that when a row is updated on the source, that row change is captured and propagated to the target. On the target, with time-based CDR configured, Oracle GoldenGate will pull the existing value of the row as it exists on the target and compare that value to the value the row had on the source before the update (the "before" value, as it is called). If they match, there is no conflict; if they do not match, a conflict has been detected, our CDR rules/processing kick in, and Oracle GoldenGate tries to resolve the conflict automatically according to our CDR setup.

There are basically 3 types of conflicts which are common in logical replication:
1) Insert fails due to a PK/UI conflict (ORA-00001)
2) Update fails due to a "no data found" error (ORA-01403)
3) Delete fails due to a "no data found" error (ORA-01403)

In bi-directional replication, we still have to address these basic conflict types, but with CDR, we have to deal with a new scenario:
     an update fails due to CDR columns not matching

In this scenario, the update that is captured on the source contains two sets of values. These values are called the BEFORE and AFTER images. The BEFORE values are the values of all changed columns before the update on the source, and the AFTER values are the new column values on the source.
For CDR to work in our case, we have parameters specified so that both the BEFORE and AFTER values are transferred to the target. When the replicat goes to apply the update, it checks the values of the CDR columns in the target before the update and compares them with the BEFORE values in the transaction; if they match, the update is applied (no conflict). If they do not match, a conflict has occurred and CDR will process it as we have it configured. In our case, the update with the later time will win.

In order to see if CDR is kicking in or not, we have to look at the process report files generated by the replicat process. They contain CDR counters as well as other counters per table being replicated, so there is no shortage of information. The only issue is that report files are text files and hard to parse to get a big-picture view, and report files can be overwritten as time goes on depending on the activity and configuration settings being used (you might lose some time periods of reporting).
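For a quick look without digging through the report files, the replicat's CDR counters can also be pulled interactively from GGSCI (process name from our example; the REPORTCDR option should be available in recent GoldenGate releases):

GGSCI> STATS REPLICAT rptest, REPORTCDR

This reports the per-table CDR conflict and resolution counters that would otherwise only appear in the report file.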

There are other ways to load these statistics into a database table for real time updating, and real-time viewing/analyzing. This is done via exception mappings along with CDR mappings to report this information to a database table for easy querying and analyzing. Exception reporting is a larger topic and has a blog for it as well. I will update this blog with a part 2 and 3 to cover using these additional mapping features.  Using additional features in the mappings starts to make for very long mappings, which also become harder to maintain if you have lots of mappings. This is where the other GoldenGate features using variables, macros and include files come in handy.  These features tend to make the configuration a bit more complex, but once understood, they can save you many hours of configuration change typing and avoid syntax errors.

Now, how to create conflicts with our test schema. Hopefully you noticed, but in the extract config for each system, we are excluding a user: OGG. This is the extract user and the apply user for both systems.  To create a situation where a conflict is detected and resolved by CDR, we will log onto one system (system B) as "OGG" and update the timestamp (OGGTS) to a newer value for a few rows in a table. Since "OGG" is not captured by the extract (due to our EXCLUDEUSER), these updates to the few rows will not be captured and sent to the other system (system A) as part of replication (we could also tag the redo as well, but let's keep it simple for now). At this point we have the same number of rows in the table, but the OGGTS value is different for a few rows on system B.
Now, get on the other system (system A) as the test user "GGTEST" and update the timestamp for all records in the table to the current value.  The updates for the rows will be captured on system A and sent to system B to be applied. While system B is applying the updates, it notices that the BEFORE value for OGGTS from system A does not match the current value for OGGTS in system B for a few records (as it processes them).  This is the "conflict detection" piece, and as a result, our CDR parameters (rules) will kick in to try to resolve the conflict according to our parameters.  In our case, we use a simple time-based check and resolution method.
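A sketch of that conflict setup against a hypothetical GGTEST.ORDERS table:

-- System B, connected as OGG (excluded from capture, so these changes do NOT replicate)
UPDATE GGTEST.ORDERS SET OGGTS = SYSTIMESTAMP WHERE ORDER_ID <= 5;
COMMIT;

-- System A, connected as GGTEST (captured and replicated to system B)
UPDATE GGTEST.ORDERS SET STATUS = 'TOUCHED', OGGTS = SYSTIMESTAMP;
COMMIT;

-- When the replicat on B applies the rows with ORDER_ID <= 5, the BEFORE image of OGGTS from A
-- no longer matches the current OGGTS on B: a conflict is detected and USEMAX(OGGTS) resolves it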

Below is a recap of our parameters (mappings) for the apply; the pieces used for the check and the resolution in this scenario are the COMPARECOLS and UPDATEROWEXISTS entries:

MAP GGTEST.*, TARGET GGTEST.*, COMPARECOLS(
   ON UPDATE KEYINCLUDING (OGGTS),  
   ON DELETE KEYINCLUDING (OGGTS)),
 RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAX (OGGTS))),
 RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGGTS))),
 RESOLVECONFLICT (UPDATEROWMISSING,(DEFAULT, OVERWRITE)),
 RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)),
 RESOLVECONFLICT (DELETEROWMISSING,(DEFAULT, DISCARD));

A brief recap of what each parameter is:

"ON UPDATE KEYINCLUDING (OGGTS),"   
  --> on updates, compare OGGTS values

"RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGGTS))),"
   If the row to be updated exists in the target, and if the BEFORE value for OGGTS (from system A) does not match the existing value in the target for OGGTS (system B), then a conflict has been detected, resolve by using the change (or no change) with a greater OGGTS value.
In our case, the timestamp from system A is greater than the existing value in system B, so the update will be applied to system B and the CDR counters will be updated to reflect a conflict and resolution.

If our CDR parameters (rules) had the following resolution instead, the update would not have been applied to the target, and Oracle GoldenGate would update the counters for CDR and move on to the next operation or transaction:

"RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMIN (OGGTS))),"

To continue testing, we can log on and delete records from one side, and then try to update and/or delete those records from the other side; we can also delete records from one side, insert the same records back into the table, and create duplicate-row conflicts on the other side.

This is a very easy way to test your CDR rules and results before rolling to production and finding out that they did not work at all, or not as expected, when it really matters. All you have to do is be careful which userid you use to make data changes that are not replicated back to the other system, so that you can create various scenarios of data divergence.


Closing!

A word of caution: logical replication is not a perfect science yet, and it is also extremely situational.
Regardless, you will have to meet the requirements for logical replication. However, the replication implementation may require tweaking based on the actual data and transactions occurring in the environment. This comes to light many times in implementations we encounter with customers. The implementation is set up in a lower environment and tested and tweaked until it is working well with a very small set of transaction test cases and/or volume. Then, this same configuration is rolled out to the production environment and major (or minor) issues are encountered immediately or soon after go-live. This is a case of the sensitivity of Oracle GoldenGate to the actual environment it is running in, the transaction types, the data used and the transactional volume.

Also, please get help if you can; you will be much more successful with experienced assistance that has context around what to do, and what to avoid, to minimize issues.

Good luck and I hope you push your investment in Oracle GoldenGate to the max to achieve its tremendous return on investment.










April 8, 2016

Migrating SAP Using Oracle GoldenGate


SAP: Making The Impossible Project...Possible!

Recently I got involved in a large and very critical project to lay out and implement both a platform migration and Oracle RDBMS upgrades for a large SAP implementation.  The project goal was to use Oracle GoldenGate (CDC/logical replication) to migrate all the SAP databases to a new hardware platform (complete hardware stack refresh), upgrade the RDBMS versions and implement newer RDBMS features, all in the same project.

Sounds complex, and trust me, it is, but this is exactly what Oracle GoldenGate can help you achieve. When coupled with a few other essential Oracle products/features (Flashback, Data Guard and RAT), the project risks can be managed and mitigated by thorough testing and validation before any cutover.  Actually, it can be tested and validated until all the application owners/teams are absolutely comfortable and ready for the cutover, more so than the DBAs (who were already anxious, with some hesitation).

As with most critical enterprise applications, the SAP eco system was absolutely business critical across the enterprise and therefore the project had the following requirements to mitigate project risks and reduce downtime for the cutover event(s):

  1)  a significantly reduced downtime window/duration for the cutover to the new environment
       - less than 24 hours for the cutovers (all in one 24 hour period, or multiple cut over events)

  2) ability to migrate the databases independently to the new environment (split migrations/cutovers)
       - split the database migrations over multiple periods, each several weeks apart
       - validate application functionality with a split environment (some still legacy and some new)

  3) "fall back" or "fail back" replication from the new system back to the legacy environment
       -  any database could fail back independently if needed

  4) performance validation of the complete stack for each system before cutover
      - avoid WAR room as much as possible: bad sql, communication issues, general DB performance

  5) backup/recovery and DR setup and validated before production cutover
      - all aspects of backup/recovery and DR tested and validated before final cutover

Fallback ("fail back") replication is intended mainly to:
  1) reduce migration risks of any data loss in case of a fail back to the legacy environment
       - all transactions in the new environment are captured and replicated back to the legacy system
  2) reduce the fail back window in case of fall back to the legacy system production environment
       - fail back replication for up to 30 days

One last fall back requirement was that any of the databases of the SAP ecosystem could fall back
on its own, independently of the others (cross-environment and platform compatibility was verified and tested).


The migration of the entire SAP ecosystem consisted of migrating to a new hardware platform/OS (Endian conversion needed) and a new/upgraded Oracle RDBMS version, with the goal
of migrating to a newly sized/designed and supported hardware/OS platform and RDBMS
version; while also utilizing many of the new features available in the new RDBMS/RAC
version for both database performance,  management, HA and DR.

I am happy to report (very happy and relieved that we did all the extra work/diligence to make it
a huge success) that we have just completed the production migration to the new environment
and the entire project was a complete and overwhelming success.  The post cut over WAR room
on Monday lasted about 5-6 hours to address less than a half dozen issues that popped up, all of
which required tweaks to the SAP application configuration as a result of moving to Oracle
RAC from a big iron and single instance configuration in legacy.  A few other errors popped
up that already existed in the legacy environment and the app team decided to fix at that time.

The cut over itself took 6 hours of technical time (detailed below), which was more than we
wanted for the technical pieces, but the BASIS / other app teams needed lots of time to reload
the configuration and do their validations for the new app servers once the databases were
officially cut over to the new Oracle RAC environment.  The database cut over for the Oracle GoldenGate replicated databases (6 of them totaling 70TB of data) was 3.0 hours for the
databases. For the database cut over, 95% of the cut over time was spent running business validation/compare scripts and table row count scripts against old and new databases. The
remaining time in the cut over was for the application reconfiguration and validation steps.
The application reconfiguration and technical validation was another 4 hours, after which both functional/interface validations and business validations were performed for a total of 13 hours
of down time. During the functional and business validations the SAP applications were live
and running and any interfaces could have connected.

The real value of using Oracle GoldenGate for the database migration is that all work can be completed and validated ahead of time and with no downtime of the existing production systems,
even DR was setup and validated before final cut over.  We used DR for RAT testing before cut
over which is highly recommended.

Major work stream durations for the production cut over:

3 hours for database cut over and data validation for all databases
   - databases were cutover in 10-15 minutes (thanks to Oracle GoldenGate)
   - remaining time was for static row counts and business queries used
     to validate data for a static period
4 hours of application re-configuration and validation
3 hours of functional/interface validation for critical processes
2 hours of business validation of critical processes
-----------------------------------------------------------------------------
12 hours (SAP applications open and live in the new eco system)

From a high level, the migration process consisted of the following (of course done in lower environments first to work out the processes, tasks and ordering of tasks, and then a few dry
runs before doing the production migration to fine tune the steps and timings):

1) Sizing, design and acquisition of the new hardware (servers, storage and networking)

2) Building of the new database servers (RAC cluster) and app server cloud environments
     - all Oracle RAC systems and new app servers on VMWare

3) Building the new databases (shell DBs) on the new environment
    - patching to the latest SAP patches (which include Oracle RDBMS patches)
    - install Oracle GoldenGate on source and ensure it is running

4) Installation and configuration of Oracle GoldenGate
     - on both source and target systems for CDC (for both DDL and DML)
     - GoldenGate configured with EMS mappings as well as application mappings

5) Online data migration, in phases and with no downtime of legacy databases
     (platform conversion needed - moving from Solaris to Linux)
      - Oracle DataPump of selected sets of tables with same flashback SCN for consistency
        - legacy DR could not be used (across the WAN and was 8-10 times slower - very old)
        - no storage cloning capabilities and no spare servers to restore production to
          in the same data center location
        - data migration was completed with no downtime for the existing production systems
          and was done in pieces (chunks) as to not impact existing production

6) Replicating the source transactions (CDC data) from legacy to the newly staged databases
     - using Oracle GoldenGate (real time CDC) to keep them synchronized with legacy production
       for a significantly reduced migration window
      - using GoldenGate and its EMS capabilities enhanced our ability to use
        Real Application Testing (RAT), highly recommended

7) Validating both data and metadata for all databases as a result of the platform migration
     and the use of logical replication for each database
     - validation of all database objects in the new database
        -  users, profiles, views, synonyms, dblinks, directories, tables, indexes, sequences, etc
          - all applicable to SAP SR3 schema and data model
     - row counts (static row counts for all application tables both before cut over and at cut over)
     - data validation using Oracle Veridata leading up to cut over day (bit for bit comparison)
     - scripts to validate business data in each SAP module database
       (provided by business analysts and application teams)

8) Re-configuration of the applications for the new app servers and new database servers

9) Technical validation of the new environment for all applications

10) Performance validation of the new environment (RAT captures on legacy production)
     - legacy production captures (multiple time periods and durations of RAT captures)
       - used GoldenGate's EMS capabilities to insert markers to stop replication to the target
         at specific time periods corresponding to the different RAT capture periods
     - database performance testing and validation using Oracle Real Application Test (RAT)
        replays
        - Oracle GoldenGate, Data Guard (snapshot standby) and Oracle Real Application Testing
           are invaluable for these tasks (make it much easier to do and can be done in parallel)
     - load stress testing  and validation using Oracle Real Application Test on both the new
       production and DR systems (invaluable again)
    - validate load balancing
    - SQL tuning, patching and tweaks for performance at OS, network, RDBMS, etc

11) Functional testing of applications, interfaces, external loads, etc
       - validate major application components are functioning and all integration end points
          are functioning across the new Linux platform for all databases

12) HA/DR testing and validation of the new environment (end to end validation)
      - DataGuard setup for each database, and both role switches and forced fail overs tested
         - backup/restores tested using fail overs and restoring lost primary databases
      - application app servers failed over and tested against DR environment
      - cluster/database HA tests of the new environment, many different HA tests completed
        for the entire stack

13) Fall back or "fail back" logical replication implemented and tested
      (from new production back to legacy production)
      - tested only in Pre-prod (this is the one piece that could only be tested with pre-prod legacy)

14) Fall back of independent components (database and app servers in the SAP ecosystem)
       - validate major application components are functioning with cross platform components
       - different pieces of the eco system failed back to legacy with other remaining future state

15) Business validation
      - ensuring application works and critical processes all work

16) Return databases back to replicated state (Oracle Flashback Database) in staged production
      to resume Oracle GoldenGate catch up CDC replication
      - sync CDC from legacy production to staged production to keep synchronized
      - during testing periods only (at cut over this was the go live starting point)
         - OGG has kept all the databases in sync with legacy production, waiting for the final cut over

17) Everything validated and verified before cut over, wait for cut over time!
      - repeat this process as many times as possible to ensure a near flawless migration
        (with a large SAP eco system with many integration/communication points, some issues will
         come up....remember, not for the faint of heart...stay cool and more importantly,
         have the right team in place)

That covers the high level process we went through for the SAP migration to the new environment
and new Oracle database environment. Definitely, testing and dry runs in lower environments are invaluable for your success. All tasks get documented and timings established before you talk to
the business about an appropriate outage and the overall risks.

This was an awesome project. I must warn, replicating SAP with any type of logical replication is probably not for the faint of heart. You will need to tweak the configuration to get apply
performance, and issues will come up that must be dealt with (especially with the SAP BW
system).

Overall, using Oracle RDBMS/RAC, Oracle GoldenGate, Oracle DataGuard and Oracle Real Application Testing (RAT) made this project a huge success with a significantly reduced total
effort and resources required (RAT is a life saver, to be blunt). These products made an
"impossible project possible" within the time frame/budget left in the overall transformation project, and made some 30+ GoldenGate migrations a huge success.

By using Oracle GoldenGate in the project, we were able to consolidate 3-4 different projects into one: moving to a new hardware platform, a new Oracle RDBMS version (upgraded to a supported version), a new Grid version, Oracle RAC/ASM, and enhancements to the physical and logical database design (new storage, partitioning, BIGFILE tablespaces, Secure Files, some different indexing strategies for specific tables due to RAC, etc.).

This new consolidated project was now more complex and a bit longer than any of the 3-4 individual projects by themselves and a bit riskier due to the added scope and complexity of all the moving parts and using logical replication for the data migration. However, do not let this discourage you, as these risks (all of them) can be managed and mitigated. In addition, with the use of the Oracle products mentioned, you can help reduce the risks and effort involved.  Also, by consolidating the projects into one, we reduced the amount of testing time significantly as compared to 3 or 4 projects; basically we were able to test once for all pieces.

For now, thank you Oracle, for the Oracle GoldenGate product (this is an awesome product and a market leading product), and thank you SAP for allowing the use of Oracle GoldenGate to migrate SAP to a new environment (of course with a certified SAP migration specialist on hand, as well as Oracle GoldenGate expert).

Using Oracle GoldenGate to migrate SAP to a new environment can literally make the "impossible project possible".  Many of the risks mitigated would otherwise still be present and would be nearly impossible to mitigate and would require a large and lengthy WAR room at post cut over. And, GoldenGate allowed for a significantly reduced outage window for the database migration pieces, typically from 16-48 hours to 1-2 hours, which is significant to the business and critical processes supported by the SAP implementation and the potential cost of downtime to any or all parts of the SAP eco system.

Replicating the SAP data model using logical replication is both easy (no foreign keys or triggers and only one sequence, and most tables have a PK or UI on them) and also a bit complicated at the same time, due to the fact that SAP uses a single schema for all application data (SAPSR3).  In a nutshell, all the data is in one schema for each database. When using Oracle GoldenGate, having multiple schemas makes instantiating and establishing apply at different time periods a bit easier, because you can do it in pieces (by schema in most cases) rather than customizing the mappings for each table or piece with mappings to evaluate the CSN (it can be done, but we needed a simple config). Instantiating an SAP database from production, without any downtime and with little impact on it, makes for some challenges when all the data is in one schema in each database.  In this case, we used a static table listing to pull over all the tables in pieces (grouped by SCN), set up apply for each piece, then added an apply that used a wildcard to catch any new tables and other things such as the OTHER DDL category.  If using 11.2.0.4+, definitely use the integrated apply if at all possible and patch SAP to the highest level you can, because the RDBMS patches are included and integrated apply requires RDBMS patches in addition to using Oracle GoldenGate 12c. However, please please, be aware that if doing fall back replication to an older Oracle version, integrated apply likely will not work for you and you will need a different apply configuration for the fall back replication (if using it).  In our case, integrated apply worked really well, but we needed classic replicats (coordinated) for fallback replication. Therefore, instead of having two apply configurations and having to tune each strategy, we stuck with the classic replicat approach to allow us to easily and quickly reverse the replication with no additional configuration needed for the apply on the older 11.2.0.2 databases.  We used integrated capture where we could.

I will continue to update this post to also include some additional details regarding how others do it (Oracle ACS and their O2O, or "triple O", service) and how we deviated from it slightly in the initial load portion. We did not use the GoldenGate initial load for the initial data migration and instantiation; we used other Oracle approaches instead (a scripted framework using DataPump).  The initial load capabilities in Oracle GoldenGate are definitely worth using; they work really well and performance is acceptable in most cases.

More to come on this in late May or June.....trying to relax a bit now and take advantage of some free time.

Further reading...white paper from Oracle on SAP Migration:

http://www.oracle.com/technetwork/articles/systems-hardware-architecture/sap-migration-163923.pdf


Take care, thank you for reading my two cents worth and good luck in your migration projects.

To end this blog, I want to throw out a leadership quote by Red Adair: "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur."  Read more at: http://www.brainyquote.com/quotes/quotes/r/redadair195665.html