June 27, 2016

Oracle GoldenGate EventMarker System (EMS) to help facilitate Oracle Real Application Testing (RAT)

In this blog I want to share some neat tricks available to help facilitate Oracle RAT testing on the new target system or new target DR system when using Oracle GoldenGate as the migration replication tool.  We don't need to cover in detail why Oracle GoldenGate is a tool that every IT shop should have in its IT tool box, as I have already beat that topic to death...it is simple, Oracle GoldenGate is a tool that helps make many impossible migration projects possible. This tool and its capabilities have helped make tens of thousands of migrations possible and successful.

Now, onto the topic of interest. When doing Oracle-to-Oracle migrations, we all know how important it is to validate the new target platform, and validation should be done before cutover, not post cutover.

If using Oracle GoldenGate as the migration replication tool (which you should be!), you have access to some less common/unknown but helpful features of Oracle GoldenGate that can help with the data synchronization requirement for Oracle RAT (Real Application Testing) capture and replay on the new target state environment.  RAT is best run when the data is in sync between source and target at time of capture. We must remember, RAT captures the source's transactions with the data/binds used for replay. To avoid a high replay error percentage (or complete replay failure), the data should be in sync at time of replay (in sync as of the time of the capture start).

With the full tool set from Oracle: Oracle RDBMS, Oracle GoldenGate, Oracle RAT and Oracle Data Guard (not required), you can satisfy the data-in-sync issue and can validate the new target state platform and tune and tweak as much as needed before cutover (as many times as you like to prove performance and scalability). The end result.....an extremely successful cutover for your customers and application owners (and usually a very short, if any at all, war room presence), and hopefully IT will be looked at as a hero instead of a villain.

Oracle GoldenGate includes a little known feature called the EventMarker System (EMS). This is nothing more than a set of parameters added to existing mappings or the mapping configuration that can be used to make replication react to events and respond to those events.
There are lots and lots of uses for EMS, but let's stick with using it to control our replication to the target, allowing us to stop replication at the right time so the replay can occur with the data in sync as of the time of capture.

Usually the triggering event for EMS is some type of DML transaction that occurred in the source and that matches criteria we specify. The criteria are quite flexible with EMS (you code them using parameters and options), but for simplicity, let's say our criteria is a DML transaction against a table in the source that has certain column values in the transaction. We will look for this type of transaction on the table, evaluate the data in the transaction, and respond or not respond when the column values match our criteria.

In our case, when the RAT capture on the source starts, we also want to insert a record to the control table that will be extracted and pumped over to the target system. The target system will be applying data to keep the target in sync and once it sees this EMS transaction, it will respond by stopping all apply processes (replicats). This will ensure that our target is at the same point in time data-wise as the source system so our replay will have very few errors due to out of sync data on the target.
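As a sketch of what that control table might look like (the table name and columns here are illustrative assumptions, not a fixed requirement):

```sql
-- Hypothetical EMS control table on the source; the row inserted at RAT
-- capture start flows through extract/pump to the target, where the
-- replicat mappings evaluate it and stop apply
CREATE TABLE ogg.ems_control (
  event_id    NUMBER         PRIMARY KEY,
  event_type  VARCHAR2(30),              -- e.g. 'STOP_REPLICAT'
  event_ts    TIMESTAMP(6)   DEFAULT SYSTIMESTAMP,
  comments    VARCHAR2(200)
);
```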

We will also use the Flashback features of the Oracle RDBMS on the primary target system or the target's DR environment, which is already running and being synced with Data Guard. We can choose to run the replay on the primary or on a snapshot standby depending on our needs and the lag threshold for the target itself. If data must be kept in sync regularly due to high volume, we can run the replay on the target system's DR standby.


The entire process from a high level looks like this:

1) Build the new target state environment
    - with a shell DB on the new target platform/version as you like
    - configure the shell DB as you like (physical and storage layout for tablespaces)
2) Implement the requirements for Oracle GoldenGate capture on the legacy source
    - database level supplemental logging
    - userid with privs for Oracle GoldenGate
   - correct init.ora parameters needed for replication
3) Install Oracle GoldenGate on the source system, configure and start extraction
    - create subdirs
    - create configuration files
    - schematrandata (or trandata for all schemas/tables in replication scope)
4) Start up capture and validate extraction is:
    -  running and stable
    -  no long running transactions existed before the extract started that are still running
       (they would be skipped)
5) Install Oracle GoldenGate on the target system and configure for target apply
     - start the manager process for communications
6) Create the Pump on the source system and start pumping the CDC data to the target system
7) Start a consistent export on the source system (using flashback_scn)
    - for all schemas/tables in replication scope (this step may be in phases depending on issues)
    - also, a full metadata export may be needed
8) Import the consistent exported data into the target system
9) Create replicats to start applying the data and starting after the SCN (CSN in GoldenGate)
     - special command to start the replicats from a particular CSN
10) Apply any RAT patches needed to the source or target database platform/versions
11) Create the table needed for EventMarker system (a simple control table in the source / target)
      - and add table-level supplemental logging (not required since it is not a critical table,
        but a good practice)
12) Implement the Oracle GoldenGate EventMarker system mappings
      -  in the source extract, pump(s), replicat(s)
13) Restart all the Oracle GoldenGate processes to reload the mappings/configuration now
      including EMS parameter/mappings
14) Let replication catch up on the target (if you desire to validate replication is working)
      -  not required at all, but a good idea at first to ensure all replication is working
15) Test EMS a time or two just to be sure the mappings work as configured
     - mappings can stop extract, pump or replicats or all
     - suggest only stopping replicats, or pumps but leave extracts running at all times
16) Start RAT capture and insert the control table DML on the source
      -  a simple insert with 4-5 column values as part of it
          - the insert uses key words for certain columns to instruct replication to stop
          - the column values will match criteria we set up in the mappings for evaluation and matching
      - the insert can be done before the RAT command or after
17) While RAT capture is running, verify the replicats have all seen the DML and acted upon it by stopping

18) Once RAT capture is complete, do the remaining steps needed to process the capture
19) Now we are ready for replay, or should be:
       - replicats should all be stopped at the same point in time due to our EMS and control DML
20) Decide where to run the replay, on the primary or standby system
      - for primary, stop all OGG processes, create a GRP on the target and replay
      - for standby, stop DG shipping and apply, stop DG, create a GRP and open as snapshot standby
        - perform the replay on the standby

Note: Flashback DB is our friend here. We can flashback and replay the capture as many times as needed to tune/tweak the target system to our liking.

21) Flashback DB to the same state as before the replay (where OGG was in its apply state)
22) Restart the Oracle GoldenGate processes on the target to catch up
23) Repeat as necessary to make your migration a complete success.
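A few of the steps above boil down to one-liners worth having on hand; a hedged sketch (process name, restore point name, and SCN are illustrative assumptions):

```sql
-- Start a replicat from the instantiation SCN (run in GGSCI, not SQL*Plus):
--   GGSCI> START REPLICAT rptest, AFTERCSN 123456789

-- Before a replay on the primary target: create a guaranteed restore point
CREATE RESTORE POINT before_replay GUARANTEE FLASHBACK DATABASE;

-- After the replay: rewind the target to its pre-replay (OGG apply) state
SHUTDOWN IMMEDIATE
STARTUP MOUNT
FLASHBACK DATABASE TO RESTORE POINT before_replay;
ALTER DATABASE OPEN RESETLOGS;
```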


Sample EMS mappings and control table:

....still updating....have to run off to do a project (coming soon, 7-10-2016)
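Until that update lands, here is a minimal sketch of the kind of replicat mapping EMS uses (the table, column, and keyword values below are my assumptions, not the final version):

```sql
-- Replicat PRM fragment: when a control-table row arrives whose EVENT_TYPE
-- column matches our keyword, EVENTACTIONS tells the replicat to stop
MAP ogg.ems_control, TARGET ogg.ems_control, &
  FILTER (@STREQ(event_type, 'STOP_REPLICAT') > 0), &
  EVENTACTIONS (STOP);
```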

Good luck and let me know if you have any questions or feedback.

Gregory Thomas



June 21, 2016

Oracle GoldenGate Active-Active for Oracle-to-Oracle (Part 1)

Since the acquisition of GoldenGate by Oracle in 2009, Oracle has replaced the Streams product with the GoldenGate product as the highest tech stack/configuration on the Oracle MAA chart. Using GoldenGate for active-active allows for the highest level of availability for a system/application. It is strongly suggested that you first work through the Oracle MAA tiers (starting at the bottom) with your tech stack to ensure that lower levels of MAA cannot, or will not, completely meet your HA/DR requirements. Moving to an active-active architecture with Oracle GoldenGate is certainly possible (this product makes the impossible possible in many cases); however, moving to an active-active solution is best when viewed as a strategic initiative (you are driving/working towards the final solution), and not simply a quick investment you switch on and your active-active solution is ready to go.

Active-Active HA/DR solutions are certainly more complex and can lead to higher administration effort than lower-tier MAA solutions, such as those with RAC and Active Data Guard on both sides.

With that said, Oracle GoldenGate is a truly amazing, can't-live-without-it product for every IT shop. In my humble opinion, every IT shop should have the GoldenGate product as part of its IT tool belt; its uses range from data integration of any kind, to migrations/upgrades/hardware refreshes, and finally the ultimate in HA/DR for systems/applications.

In this blog, we will cover a simple setup for implementing an active-active solution using GoldenGate for replicating a test schema bi-directionally and using GoldenGate's built-in CDR functionality for time-based conflict detection and resolution. CDR is formally known as "Conflict Detection and Resolution", and is a pre-defined set of conflict detection and resolution capabilities built into the Oracle GoldenGate product itself. There are several options for CDR; however, time-based detection and resolution is by far the most common and the one we will use, since it is simple to understand (which record is older and which is newer is simple using a timestamp).

To get started, we must define active-active, as this term can be widely interpreted in the wild, and, in Oracle GoldenGate fashion, we must define the systems as a topology, as there are limitations and non-supported objects/functionality with various topologies. These limitations are not specific to Oracle GoldenGate, as they are generally limitations of all products that provide logical replication. To be transparent, logical replication is not a completely foolproof science yet; therefore, there are limitations and areas of concern (or redesign) that must be addressed to determine if the solution will work and if the solution is manageable (cost effective).

Active-Active can mean different things to different people (gotta love us humans). Therefore, we need to define what active-active means to us in terms of a GoldenGate topology and the solution we have in mind. The classification (what topology we call it) is determined by the number of systems involved, and their roles in the replication setup:

bi-directional          --> two systems only, both act as a source and target, both allow transactions
n-tier (peer-to-peer) --> three or more systems, all acting as sources and targets, all allow transactions

In this blog, to keep it shorter, we will demonstrate active-active bi-directional; this topology offers the most functionality because we can capture and replicate both DML and DDL to both systems and in both directions. We will also focus on only Oracle-to-Oracle replication using modern versions of Oracle (at least 11.2.0.3 or higher on either side).  One of Oracle GoldenGate's strongest features is that it is heterogeneous, and it always has been, and it keeps getting better and better thanks to Oracle and their great people. This means it can not only replicate data between different relational platforms (SQLServer to Oracle, DB2 to Oracle, MySQL to Oracle, and many others, and vice versa), but also between different Oracle versions.  For example, it is possible (and common, we do this all the time) to replicate Oracle 9i/10g to a new hardware platform and Oracle version, such as 11.2.0.4 or 12c (single instance or PDB).  This type of migration work is going on all the time around the world, and GoldenGate is the market leader in this space for these projects (see how GoldenGate is the "swiss army knife" in your IT tool box).

Now on to the requirements for an active-active bi-directional setup. There are requirements for logical replication in general, and data model issues we must address/deal with to ensure a successful bi-directional implementation. And we must keep in mind that each side (both systems) will act as both a source and a target in the solution. Therefore, we must address the requirements for both a source system and a target system in the solution setup. In this blog we will use modern versions of the Oracle RDBMS; let me state that it is possible in most cases to do bi-directional using older Oracle versions on one side or both, but these solutions usually require much more configuration and tweaking for bi-directional to work (and come with more limitations as well, which does not mean it will not work, just that more workarounds or detours are needed).

Source requirements for logical replication (we will use the requirements for integrated capture):

1) Source Database (capture requirements)
    a) RDBMS version supported by GoldenGate Integrated capture
    b) Database is in archive log mode (LogMiner requires it)
    c) Database level supplemental logging has been enabled (minimum)
    d) Init.ora parameters for GoldenGate set
         - enable_goldengate_replication=true
         - streams_pool_size=1600M (or larger, per the integrated capture sizing guidance)
    e) One-off patches for GoldenGate applied to database for integrated processes
        - PSU level specific one-off for Integrated processes in the databases
    f) Oracle sequences addressed for uniqueness between the sites
        - common to use odd/even strategy (odd for one site, even for the other)
        - there are other configurations as well that support n-tier
    g) Oracle user created as the extract user and privileges for extract granted
        - privileges are specific to the Oracle RDBMS and GoldenGate version (check the install guide)
    h) Table level supplemental logging on all schemas/tables involved in replication
        - SCHEMATRANDATA (or TRANDATA)
        - SCHEMATRANDATA is recommended over TRANDATA with modern Oracle RDBMS
        - once the schemas/objects exist

2) Target Database requirements (apply requirements)
    a) Init.ora parameters for GoldenGate set
         - enable_goldengate_replication=true
         - streams_pool_size=1600M (or larger, per the integrated capture sizing guidance)
    b) Oracle sequences addressed for uniqueness between the sites
        - common to use odd/even strategy (odd for one site, even for the other)
        - there are other configurations as well that support n-tier
    c) One-off patches for GoldenGate applied to database for integrated processes
        - PSU level specific one-off for Integrated processes in the databases
    d) Oracle user created as the replicat (apply) user with the required privileges granted
        - privileges are specific to the Oracle RDBMS and GoldenGate version (check the install guide)

The above are hard requirements, as mentioned, logical replication is not a perfect science yet (getting closer), as such, there will likely be other changes/tweaks needed based on the data and data model in use that will need to be addressed in the setup.

For active-active (or any logical replication), there are requirements to meet for a source to capture and a target to apply, and then there are areas of concern to address as a result of logical replication (SQL apply). When setting up active-active, we want to address and prevent as many replication issues as we can with our configuration/setup (this includes any changes to the data model); most of our time will be spent configuring (changing) to avoid issues completely, as well as addressing setup/configuration to handle the issues automatically (such as CDR parameters).

Logical replication issues that must be addressed for performance or data integrity:

1) All tables need a PK or UI constraint
    - for row uniqueness
    - for apply performance (WHERE clause used on apply)
2) Triggers cannot double fire on the target
    - the resulting data changes from triggers firing on the source are most likely replicated as well
3) Cascading constraint operations are most likely replicated as well (in most cases)
4) Error handling (conflict detection) and reporting
    - how to handle apply errors
    - how to report and monitor apply errors
    - each time CDR is invoked, there is a performance hit to the apply
       - it works well, but we need (and should want) to know when, where and how frequently it occurs
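For the error handling and reporting piece, one common pattern is REPERROR with an exceptions table; a hedged sketch (the exceptions table name and columns are my assumptions):

```sql
-- Replicat PRM fragment: on any apply error, insert a row into an
-- exceptions table instead of abending, capturing the error and op type
REPERROR (DEFAULT, EXCEPTION)
MAP GGTEST.*, TARGET GGTEST.*, &
  MAPEXCEPTION (TARGET OGG.EXCEPTIONS, INSERTALLRECORDS, &
    COLMAP (USEDEFAULTS, &
            err_num = @GETENV('LASTERR', 'DBERRNUM'), &
            op_type = @GETENV('LASTERR', 'OPTYPE')));
```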

Now let's get into the GoldenGate configuration pieces for both systems; remember, both systems will function as a source and a target.

GoldenGate is a very extensible/configurable product; we will not go into the details here, we will simply look at a simple example and will not cover all the potential ways of doing things in GoldenGate (such as initial loads or data movement for instantiation).  The product can do, or be configured to do, almost (I said almost) anything you need.
Once the initial configuration is shown, we will look at using other product features to streamline a configuration for issues such as hundreds or thousands of schemas or tables, or different date columns used for different schemas/tables. We will look at how macros/variables in GoldenGate can help save you a tremendous amount of configuration time when needing to make changes to hundreds or thousands of mappings in an implementation. There is also a blog post on these topics as well.

In this example, we have created the test schema on both sides before replication (extract/pump/replicat) was setup, just to keep it simple. There are many ways to do this online and in-place with an existing replication implementation that is up and running (again, the powerful capabilities of GoldenGate). Our schema also included a common TIMESTAMP column in each table with the same name to make the mappings easier and the ability to use a wildcard mapping with the CDR parameters. If the date columns to be used have different names, then you will need separate mappings for each, or use macros and pass the column name as a variable (more on that much later in the blog), or see the blog post on OGG include files, macros and variables.

Our test schema is called "GGTEST" and our GoldenGate user in both databases is "OGG".
Our GoldenGate user is used for both capture and apply in this simple setup in both databases.

To keep it simple, we will not use or show many fancy features or capabilities that are possible, such as Eventmarker, heartbeat, etc.; we will also use the same process names for simplicity.

Create the GGTEST schema and objects on both sides. We used a script that creates the user and objects and loads a small amount of seed data in the schema. After run, the schema and data on both sides are identical.  Since they are identical on both sides, and no transactions are occurring, we can setup the replication from this point forward which is simple to do.
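For reference, a table in a schema like this might look as follows (the table and column names are illustrative; the key point is the shared OGGTS timestamp column in every table):

```sql
CREATE TABLE ggtest.accounts (
  acct_id   NUMBER        PRIMARY KEY,
  acct_name VARCHAR2(100),
  oggts     TIMESTAMP(6)  DEFAULT SYSTIMESTAMP NOT NULL  -- shared CDR column
);
```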

Now the GoldenGate steps:

1) Create the GGTEST schema on both systems

 SQL> @create-GGTEST-objects.sql  (creates GGTEST and then connects as GGTEST)

2) Enable table-level supplemental logging on the schemas in both databases
    - requirement for logical replication (and must be done via GGSCI)

    # ./ggsci
    # GGSCI> DBLOGIN USERID ogg PASSWORD oggpwd
    # GGSCI> ADD SCHEMATRANDATA GGTEST
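You can then confirm the logging is in place (still in GGSCI):

```
    # GGSCI> INFO SCHEMATRANDATA GGTEST
```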

3) Change the sequences on each system (in the GGTEST schema)
    - on the first system, to start at 1 and increment by 2 (results in "odd" numbers only)
         - run ALTER SEQUENCE on each sequence in the application schema
    - on the second system, to start at 2 and increment by 2 (results in "even" numbers only)
         - run ALTER SEQUENCE on each sequence in the application schema
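For example (the sequence name is an assumption); note that the odd/even parity comes from the starting value combined with the increment:

```sql
-- Side A: increment by 2 from an odd starting value -> odd numbers only
ALTER SEQUENCE ggtest.accounts_seq INCREMENT BY 2;
-- Side B does the same, but its sequences start from an even value
```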

4) Create the GoldenGate PRM files (configuration files) for each process type on both sides

Manager PRM (mgr.prm):

PORT 7809
DYNAMICPORTLIST 7810-7820

PURGEOLDEXTRACTS ./dirdat/e1*, USECHECKPOINTS, MINKEEPDAYS 7
PURGEOLDEXTRACTS ./dirdat/r1*, USECHECKPOINTS, MINKEEPDAYS 7
AUTOSTART JAGENT*
AUTORESTART ER *, RETRIES 10 , WAITMINUTES 5, RESETMINUTES 60

LAGCRITICALMINUTES 10
LAGREPORTMINUTES 10


GLOBALS:

GGSCHEMA OGG
CHECKPOINTTABLE OGG.CHECKPOINTS

-- File creation security file mask (for trail files and log files)
OUTPUTFILEUMASK 0002

-- SYSLOG facility logging level (logging to SYSLOG facility)
SYSLOG NONE


Extract PRM (extest.prm):

EXTRACT extest

SETENV (ORACLE_HOME="/u01/app/ora11g/product/11.2.0/dbhome_3")
SETENV (ORACLE_SID="DBM1")
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")

USERID ogg PASSWORD oggpwd

DISCARDFILE ./dirrpt/extest.dsc, APPEND, MEGABYTES 10
DISCARDROLLOVER AT 00:01 ON SUNDAY

TRANLOGOPTIONS INTEGRATEDPARAMS (max_sga_size 1024, parallelism 2)
TRANLOGOPTIONS EXCLUDEUSER OGG
GETAPPLOPS
IGNOREREPLICATES

DDL &
   INCLUDE OTHER &
   INCLUDE MAPPED

-- CDR parameter
LOGALLSUPCOLS
-- NOCOMPRESSUPDATES
-- NOCOMPRESSDELETES
-- GETUPDATEBEFORES

STATOPTIONS RESETREPORTSTATS, REPORTFETCH, REPORTDETAIL
REPORTCOUNT EVERY 30 MINUTES, RATE
REPORTROLLOVER AT 00:01 ON SUNDAY
ROLLOVER AT 00:00
WARNLONGTRANS 4H, CHECKINTERVAL 4H

EXTTRAIL ./dirdat/e1

-- TABLE/SEQUENCE Mappings

-- TABLE GGTEST.*;
TABLE GGTEST.*, GETBEFORECOLS &
(ON UPDATE KEYINCLUDING(OGGTS), &
 ON DELETE KEYINCLUDING(OGGTS));


Pump PRM:

EXTRACT pmxtest

SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV (ORACLE_HOME="/u01/app/ora11g/product/11.2.0/dbhome_3")
SETENV (ORACLE_SID="DBM1")

USERID ogg PASSWORD oggpwd

DISCARDFILE ./dirrpt/pmtest.dsc, APPEND, MEGABYTES 10
DISCARDROLLOVER AT 00:01 ON SUNDAY

RMTHOST  targetb, MGRPORT 7809, COMPRESS
RMTTRAIL ./dirdat/r1

-- TABLE/SEQUENCE Include/Exclude
PASSTHRU

TABLE GGTEST.*;


Replicat PRM:

REPLICAT rptest

SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV (ORACLE_HOME="/u01/app/ora11g/product/11.2.0/dbhome_3")
SETENV (ORACLE_SID="DBM1")

USERID ogg PASSWORD oggpwd

DISCARDFILE ./dirrpt/rptest.dsc, APPEND, MEGABYTES 10
DISCARDROLLOVER AT 00:01 ON SUNDAY

DBOPTIONS INTEGRATEDPARAMS(parallelism 8)
-- DBOPTIONS SUPPRESSTRIGGERS  (this is done by default in OGG 12c)

DDL &
  INCLUDE OTHER &
  INCLUDE MAPPED

DDLOPTIONS UPDATEMETADATA
DDLOPTIONS REPORT

STATOPTIONS RESETREPORTSTATS, REPORTFETCH, REPORTDETAIL
REPORTCOUNT EVERY 30 MINUTES, RATE
REPORT AT 00:00
REPORTROLLOVER AT 00:00 ON SUNDAY

ASSUMETARGETDEFS

-- APPLYNOOPUPDATES
-- ALLOWNOOPUPDATES
-- BATCHSQL

-- MAP Mappings

-- MAP GGTEST.*, TARGET GGTEST.*;
MAP GGTEST.*, TARGET GGTEST.*, &
COMPARECOLS( &
   ON UPDATE KEYINCLUDING (OGGTS), &
   ON DELETE KEYINCLUDING (OGGTS)), &
 RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAX (OGGTS))), &
 RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGGTS))), &
 RESOLVECONFLICT (UPDATEROWMISSING,(DEFAULT, OVERWRITE)), &
 RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)), &
 RESOLVECONFLICT (DELETEROWMISSING,(DEFAULT, DISCARD));


5) Create the GoldenGate processes on both sides (using GGSCI)

# ./ggsci
GGSCI> start mgr   (manager must be started and running)

Extract:

DBLOGIN USERIDALIAS ogguser
-- DBLOGIN USERID ogg PASSWORD oggpwd
register extract extest database
ADD EXTRACT extest, INTEGRATED TRANLOG, BEGIN NOW
ADD EXTTRAIL ./dirdat/e1, EXTRACT extest, MEGABYTES 1000

Pump (no need to log in to the source DB):

ADD EXTRACT pmtest, EXTTRAILSOURCE ./dirdat/e1
ADD RMTTRAIL ./dirdat/r1, EXTRACT pmtest, MEGABYTES 1000

Replicat:

--ADD REPLICAT rptest, EXTTRAIL ./dirdat/r1, CHECKPOINTTABLE OGG.CHECKPOINTS
ADD REPLICAT rptest, INTEGRATED, EXTTRAIL ./dirdat/r1


We now have our schemas created and in sync between the two systems (same number of objects in the schema and each table with the same number of rows and column values - critical to have an in-sync starting point), have met all the logical replication requirements on both sides (so each can be a source and target), and have created our Oracle GoldenGate processes and configuration files; now we are ready to start the processes and validate they are running.  Once that is done, we will do some test DML and DDL on each side (one at a time at this point) to validate replication is working in both directions.

6) On each system, start the Oracle GoldenGate processes
    -  recall, no transactions are occurring against our test schema yet

GGSCI> start mgr
GGSCI> start er *
GGSCI> info all

Once started, verify using "info all" that all processes are running and nothing is in an "ABEND" state.  If any process has ABENDED, we need to check the ggserr.log file and possibly the report file for the process and fix the issue.

Our Oracle GoldenGate processes are now running on both sides. Each system is running one extract and one pump to capture and send the CDC data to the other system. In addition, each system is running one replicat to apply transactions coming from the other system.  Each system will have the following processes running (Manager must always be running):

manager (mgr)
extract (extest)
pump (pmtest)
replicat (rptest)

At this point, we can test replication, and we will do so from each side. Pick a particular side and do a few inserts into a table in the GGTEST schema. Then verify the inserts (rows) exist on the other system. Now, update and delete some rows on the first system and validate those changes have been replicated to the second system.
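For example (table and values are illustrative):

```sql
-- On system A
INSERT INTO ggtest.accounts (acct_id, acct_name) VALUES (101, 'from side A');
COMMIT;

-- On system B, a moment later, verify the row arrived
SELECT acct_name FROM ggtest.accounts WHERE acct_id = 101;
```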

Now, create a new table in the GGTEST schema and verify it replicated.  You can do this via CREATE TABLE or via various CTAS commands, both will work.

Repeat all the above transactions on the second system and validate all changes are replicated to the first system.

If all goes well, we have bi-directional up and running for this schema.

Now to cover some things we have not yet covered in depth that are worth mentioning, and that may be needed for replication and CDR to work effectively:
  1. Oracle GoldenGate 12c will suppress triggers from firing on the target side by default (remember our need to prevent double firing of triggers)
    - no need to disable or modify triggers with newer versions of GoldenGate and the RDBMS
  2. Oracle sequences were changed to an odd/even strategy because we cannot replicate sequences in a bi-directional configuration. However, Oracle GoldenGate can replicate sequence changes in other topologies, which is especially helpful for migrations using the uni-directional topology
  3. Oracle GoldenGate will defer cascading constraint checks until the end of the transaction, as cascading DML will be replicated in most cases (unless going to another schema not in the config)
  4. The OGGTS column in each table, which we are using to detect and resolve conflicts, must be populated on each insert (default value) and update transaction for each table, or we must force the column value to be logged along with the data for every transaction. This can be accomplished via one of the following:
      - add a default value on the OGGTS column as a timestamp(6) data type
      - modify the application to update the OGGTS column with every UPDATE
      - force the column value to be logged along with the other transaction data by creating a custom supplemental logging group, forcing the database to log the OGGTS column value in the redo
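The first and third options translate to DDL along these lines (table and log group names are assumptions):

```sql
-- Option 1: default value so every insert populates OGGTS
ALTER TABLE ggtest.accounts MODIFY (oggts TIMESTAMP(6) DEFAULT SYSTIMESTAMP);

-- Option 3: custom supplemental log group forcing OGGTS into the redo
-- for every change to the row
ALTER TABLE ggtest.accounts
  ADD SUPPLEMENTAL LOG GROUP ogg_cdr_cols (oggts) ALWAYS;
```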

Now that replication has been verified in both directions (both DML and DDL are captured and applied in both directions), we can set up/create a few scenarios to test our CDR (Conflict Detection and Resolution) parameters/configuration, to see if it works and validate that resolution works as expected. It is highly recommended to test using a test schema, as you can create the conflict scenarios much more easily than you can using your application.  Remember, we are using time-based conflict detection and time-based resolution.

CDR is a large topic (with different types of conflict detection and resolution parameters), so we will not cover it in detail; just know that when a row is updated on the source, that row change is captured and propagated to the target. On the target, with time-based CDR configured, Oracle GoldenGate compares the row's current value on the target against the BEFORE value captured from the source (the value the row had on the source before the update). If they match, there is no conflict; if they do not match, a conflict has been detected, our CDR rules/processing kick in, and Oracle GoldenGate tries to resolve the conflict automatically according to our CDR setup.

There are basically 3 types of conflicts which are common in logical replication:
1) Insert fails due to a PK/UI violation (ORA-1)
2) Update fails due to a "no data found" error (ORA-1403)
3) Delete fails due to a "no data found" error (ORA-1403)

In bi-directional replication, we still have to address these basic conflict types, but with CDR, we have to deal with a new scenario:
     an update fails due to CDR columns not matching

In this scenario, the update that is captured on the source contains two sets of values, called the BEFORE and AFTER images. The BEFORE values are the values of the changed columns before the update on the source; the AFTER values are the new column values on the source.
For CDR to work in our case, we have parameters specified so that both the BEFORE and AFTER values are transferred to the target. When the replicat goes to apply the update, it checks the values of the CDR columns in the target before the update and compares them with the BEFORE values in the transaction; if they match, the update is applied (no conflict). If they do not match, a conflict has occurred and CDR will process it as we have it configured. In our case, the update with the later time wins.

In order to see if CDR is kicking in or not, we have to look at the process report files generated by the replicat process. They will have CDR counters as well as other counters per table being replicated, so there is no shortage of information. The only issue is that report files are text files that are hard to parse to get a big-picture view, and report files can be overwritten as time goes on depending on the activity and configuration settings being used (you might lose some time periods of reporting).

There are other ways to load these statistics into a database table for real time updating, and real-time viewing/analyzing. This is done via exception mappings along with CDR mappings to report this information to a database table for easy querying and analyzing. Exception reporting is a larger topic and has a blog for it as well. I will update this blog with a part 2 and 3 to cover using these additional mapping features.  Using additional features in the mappings starts to make for very long mappings, which also become harder to maintain if you have lots of mappings. This is where the other GoldenGate features using variables, macros and include files come in handy.  These features tend to make the configuration a bit more complex, but once understood, they can save you many hours of configuration change typing and avoid syntax errors.

Now, how to create conflicts with our test schema. Hopefully you noticed that in the extract config for each system we are excluding a user: OGG. This is the extract user and the apply user for both systems. To create a situation where a conflict is detected and resolved by CDR, we will log on to one system (system B) as "OGG" and update the timestamp (OGGTS) to a newer value for a few rows in a table. Since "OGG" is not captured by extract (due to our EXCLUDEUSER), these updates will not be captured and sent to the other system (system A) as part of replication (we could also tag the redo, but let's keep it simple for now). At this point we have the same number of rows in the table, but the OGGTS value is different for a few rows on system B.
Now, get on the other system (system A) as the test user "GGTEST" and update the timestamp for all records in the table to a current value. The updates will be captured on system A and sent to system B to be applied. While system B is applying the updates, it notices that the BEFORE values for OGGTS from system A do not match the current values for OGGTS on system B for a few records (as it processes them). This is the conflict detection piece, and as a result our CDR parameters (rules) will kick in to try to resolve the conflict according to our configuration. In our case, we use a simple time-based check and resolution method.
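To make the steps above concrete, here is what the conflict setup might look like in SQL. The table GGTEST.T_TEST and its ID key are hypothetical stand-ins for our test schema; the rows chosen are arbitrary:

```sql
-- On system B, connected as OGG (excluded from capture via EXCLUDEUSER),
-- bump the timestamp on a few rows; these changes are NOT replicated.
UPDATE ggtest.t_test SET oggts = SYSTIMESTAMP WHERE id <= 3;
COMMIT;

-- On system A, connected as GGTEST (captured and replicated),
-- update the timestamp on ALL rows.
UPDATE ggtest.t_test SET oggts = SYSTIMESTAMP;
COMMIT;

-- When system B applies the replicated updates, the BEFORE image of
-- OGGTS from system A no longer matches the current value on B for
-- ids 1-3, so a conflict is detected and USEMAX (OGGTS) resolves it.
```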

Below is a recap of our parameters (mappings) for the apply; the check and resolution rules used in this scenario are the ON UPDATE KEYINCLUDING and UPDATEROWEXISTS entries:

MAP GGTEST.*, TARGET GGTEST.*, COMPARECOLS(
   ON UPDATE KEYINCLUDING (OGGTS),  
   ON DELETE KEYINCLUDING (OGGTS)),
 RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAX (OGGTS))),
 RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGGTS))),
 RESOLVECONFLICT (UPDATEROWMISSING,(DEFAULT, OVERWRITE)),
 RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)),
 RESOLVECONFLICT (DELETEROWMISSING,(DEFAULT, DISCARD));

A brief recap of what each parameter is:

"ON UPDATE KEYINCLUDING (OGGTS),"   
  --> on updates, compare OGGTS values

"RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGGTS))),"
   If the row to be updated exists in the target, and the BEFORE value for OGGTS (from system A) does not match the existing value for OGGTS in the target (system B), then a conflict has been detected; resolve it by keeping whichever version (the incoming change or the existing row) has the greater OGGTS value.
In our case, the timestamp from system A is greater than the existing value in system B, so the update will be applied to system B and the CDR counters will be updated to reflect a conflict and resolution.

If our CDR parameters (rules) had the following resolution instead, the update would not have been applied to the target; Oracle GoldenGate would update the CDR counters and move on to the next operation or transaction:

"RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMIN (OGGTS))),"

To continue testing, we can log on and delete records from one side, then try to update and/or delete those records from the other side. We can also delete records from one side, insert the same records back into the table, and create duplicate-row conflicts on the other side.

This is a very easy way to test your CDR rules and results before rolling to production and finding out that they did not work at all, or not as expected, when it really matters. All you have to do is be careful which user ID you use to make data changes that are not replicated back to the other system, so that you can create various scenarios of data divergence.


Closing!

A word of caution: logical replication is not a perfect science yet, and it is extremely situational.
Regardless, you will have to meet the requirements for logical replication. However, the replication implementation may require tweaking based on the actual data and transactions occurring in the environment. This comes to light many times in implementations we encounter with customers. The implementation is set up in a lower environment, then tested and tweaked until it works well with a very small set of transaction test cases and/or volume. Then this same configuration is rolled out to the production environment, and major (or minor) issues are encountered immediately or soon after go-live. This illustrates the sensitivity of Oracle GoldenGate to the actual environment it is running in: the transaction types, the data used and the transactional volume.

Also, please get help if you can; you will be much more successful with experienced assistance that has context on what to do, and what to avoid, to minimize issues.

Good luck and I hope you push your investment in Oracle GoldenGate to the max to achieve its tremendous return on investment.










April 8, 2016

Migrating SAP Using Oracle GoldenGate


SAP: Making The Impossible Project...Possible!

Recently I got involved in a large and very critical project to lay out and implement both a platform migration and Oracle RDBMS upgrades for a large SAP implementation. The project goal was to use Oracle GoldenGate (CDC/logical replication) to migrate all the SAP databases to a new hardware platform (a complete hardware stack refresh), upgrade the RDBMS versions and implement newer RDBMS features, all in the same project.

Sounds complex, and trust me, it is, but this is exactly what Oracle GoldenGate can help you achieve. When coupled with a few other essential Oracle products/features (Flashback, Data Guard and RAT), the project risks can be managed and mitigated through thorough testing and validation before any cutover. In fact, it can be tested and validated until all the application owners/teams are absolutely comfortable and ready for the cutover, even more so than the DBAs (who were already anxious and somewhat hesitant).

As with most critical enterprise applications, the SAP ecosystem was absolutely business critical across the enterprise, and therefore the project had the following requirements to mitigate project risks and reduce downtime for the cutover event(s):

  1)  a significantly reduced downtime window/duration for the cutover to the new environment
       - less than 24 hours for the cutovers (all in one 24 hour period, or multiple cut over events)

  2) ability to migrate the databases independently to the new environment (split migrations/cutovers)
       - split the database migrations over multiple periods, each several weeks apart
       - validate application functionality with a split environment (some still legacy and some new)

  3) "fall back" or "fail back" replication from the new system back to the legacy environment
       -  any database could fail back independently if needed

  4) performance validation of the complete stack for each system before cutover
      - avoid WAR room as much as possible: bad sql, communication issues, general DB performance

  5) backup/recovery and DR setup and validated before production cutover
      - all aspects of backup/recovery and DR tested and validated before final cutover

Fallback ("fail back") replication is intended mainly to:
  1) reduce migration risks of any data loss in case of a fail back to the legacy environment
       - all transactions in the new environment are captured and replicated back to the legacy system
  2) reduce the fail back window in case of fall back to the legacy system production environment
       - fail back replication for up to 30 days

One last fall back requirement was that any of the databases of the SAP ecosystem could fall back on its own, independently of the others (cross-environment and cross-platform compatibility tested and proven).


The migration of the entire SAP ecosystem consisted of migrating to a new hardware platform/OS (Endian conversion needed) and a new/upgraded Oracle RDBMS version, with the goal of moving to a newly sized/designed and supported hardware/OS platform and RDBMS version, while also utilizing many of the new features available in the new RDBMS/RAC version for database performance, management, HA and DR.

I am happy to report (very happy, and relieved that we did all the extra work/diligence to make it a huge success) that we have just completed the production migration to the new environment, and the entire project was a complete and overwhelming success. The post cut over WAR room on Monday lasted about 5-6 hours to address fewer than half a dozen issues that popped up, all of which required tweaks to the SAP application configuration as a result of moving to Oracle RAC from a big-iron, single-instance configuration in legacy. A few other errors popped up that already existed in the legacy environment, and the app team decided to fix them at that time.

The cut over itself took 6 hours of technical time (detailed below), which was more than we wanted for the technical pieces, but the BASIS and other app teams needed lots of time to reload the configuration and do their validations for the new app servers once the databases were officially cut over to the new Oracle RAC environment. The database cut over for the Oracle GoldenGate replicated databases (6 of them, totaling 70TB of data) took 3 hours; 95% of that time was spent running business validation/compare scripts and table row count scripts against the old and new databases. The application reconfiguration and technical validation took another 4 hours, after which functional/interface validations and business validations were performed, for a total of 12 hours of downtime. During the functional and business validations the SAP applications were live and running, and any interfaces could have connected.

The real value of using Oracle GoldenGate for the database migration is that all work can be completed and validated ahead of time with no downtime of the existing production systems; even DR was set up and validated before the final cut over. We used DR for RAT testing before cut over, which is highly recommended.

Major work stream durations for the production cut over:

3 hours for database cut over and data validation for all databases
   - databases were cutover in 10-15 minutes (thanks to Oracle GoldenGate)
   - remaining time was for static row counts and business queries used
     to validate data for a static period
4 hours of application re-configuration and validation
3 hours of functional/interface validation for critical processes
2 hours of business validation of critical processes
-----------------------------------------------------------------------------
12 hours (SAP applications open and live in the new eco system)

From a high level, the migration process consisted of the following (of course done in lower environments first to work out the processes, tasks and ordering of tasks, and then a few dry runs before doing the production migration to fine tune the steps and timings):

1) Sizing, design and acquisition of the new hardware (servers, storage and networking)

2) Building of the new database servers (RAC cluster) and app server cloud environments
     - all Oracle RAC systems and new app servers on VMWare

3) Building the new databases (shell DBs) on the new environment
    - patching to the latest SAP patches (which include Oracle RDBMS patches)
    - install Oracle GoldenGate on source and ensure it is running

4) Installation and configuration of Oracle GoldenGate
     - on both source and target systems for CDC (for both DDL and DML)
     - GoldenGate configured with EMS mappings as well as application mappings

5) Online data migration, in phases and with no downtime of legacy databases
     (platform conversion needed - moving from Solaris to Linux)
      - Oracle DataPump of selected sets of tables with same flashback SCN for consistency
        - legacy DR could not be used (across the WAN and was 8-10 times slower - very old)
        - no storage cloning capabilities and no spare servers to restore production to
          in the same data center location
        - data migration was completed with no downtime for the existing production systems
          and was done in pieces (chunks) as to not impact existing production

6) Replicating the source transactions (CDC data) from legacy to the newly staged databases
     - using Oracle GoldenGate (real time CDC) to keep them synchronized with legacy production
       for a significantly reduced migration window
      - using GoldenGate and its EMS capabilities enhanced our ability to use
        Real Application Testing (RAT), highly recommended

7) Validating both data and metadata for all databases as a result of the platform migration
     and the use of logical replication for each database
     - validation of all database objects in the new database
        -  users, profiles, views, synonyms, dblinks, directories, tables, indexes, sequences, etc
          - all applicable to SAP SR3 schema and data model
     - row counts (static row counts for all application tables both before cut over and at cut over)
     - data validation using Oracle Veridata leading up to cut over day (bit for bit comparison)
     - scripts to validate business data in each SAP module database
       (provided by business analysts and application teams)

8) Re-configuration of the applications for the new app servers and new database servers

9) Technical validation of the new environment for all applications

10) Performance validation of the new environment (RAT captures on legacy production)
     - legacy production captures (multiple time periods and durations of RAT captures)
       - used GoldenGate's EMS capabilities to insert markers to stop replication to the target
         at specific time periods corresponding to the different RAT capture periods
     - database performance testing and validation using Oracle Real Application Test (RAT)
        replays
        - Oracle GoldenGate, Data Guard (snapshot standby) and Oracle Real Application Testing
           are invaluable for these tasks (make it much easier to do and can be done in parallel)
     - load stress testing  and validation using Oracle Real Application Test on both the new
       production and DR systems (invaluable again)
    - validate load balancing
    - SQL tuning, patching and tweaks for performance at OS, network, RDBMS, etc

11) Functional testing of applications, interfaces, external loads, etc
       - validate major application components are functioning and all integration end points
          are functioning across the new Linux platform for all databases

12) HA/DR testing and validation of the new environment (end to end validation)
      - DataGuard setup for each database, and both role switches and forced fail overs tested
         - backup/restores tested using fail overs and restoring lost primary databases
      - application app servers failed over and tested against DR environment
      - cluster/database HA tests of the new environment, many different HA tests completed
        for the entire stack

13) Fall back or "fail back" logical replication implemented and tested
      (from new production back to legacy production)
      - tested only in Pre-prod (this is the one piece that could only be tested with pre-prod legacy)

14) Fall back of independent components (database and app servers in the SAP ecosystem)
       - validate major application components are functioning with cross platform components
       - different pieces of the ecosystem failed back to legacy while the others remained future state

15) Business validation
      - ensuring application works and critical processes all work

16) Return databases back to replicated state (Oracle Flashback Database) in staged production
      to resume Oracle GoldenGate catch up CDC replication
      - sync CDC from legacy production to staged production to keep synchronized
      - during testing periods only (at cut over this was the go live starting point)
         - OGG kept all the databases in sync with legacy production, waiting for final cut over

17) Everything validated and verified before cut over, wait for cut over time!
      - repeat this process as many times as possible to ensure a near flawless migration
        (with a large SAP ecosystem with many integration/communication points, some issues will
         come up....remember, not for the faint of heart...stay cool and, more importantly,
         have the right team in place)
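The EMS markers mentioned in steps 4 and 10 boil down to a dedicated marker table, replicated like any other, plus an EVENTACTIONS clause in the replicat. A minimal sketch, assuming a hypothetical marker table GGTEST.OGG_EVENT_MARKER with an EVENT_NAME column (verify the EVENTACTIONS options available in your OGG version):

```
-- Replicat parameter file: when a marker row inserted on the source
-- (at the exact moment the RAT capture starts) arrives in the trail,
-- log it and stop the apply cleanly, freezing the target at the
-- capture-start point.
MAP GGTEST.OGG_EVENT_MARKER, TARGET GGTEST.OGG_EVENT_MARKER,
  FILTER (@STREQ (EVENT_NAME, 'RAT_CAPTURE_START')),
  EVENTACTIONS (LOG INFO, STOP);
```

The target is then at a known, consistent point corresponding to the RAT capture start, which is exactly the data-in-sync condition the replay needs.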

That covers the high level process we went through for the SAP migration to the new environment and the new Oracle database environment. Testing and dry runs in lower environments are definitely invaluable for your success. All tasks get documented and timings established before you talk to the business about an appropriate outage window and the overall risks.

This was an awesome project. I must warn you, though: replicating SAP with any type of logical replication is probably not for the faint of heart. You will need to tweak the configuration to get apply performance, and issues will come up that must be dealt with (especially with the SAP BW system).

Overall, using Oracle RDBMS/RAC, Oracle GoldenGate, Oracle Data Guard and Oracle Real Application Testing (RAT) made this project a huge success with a significantly reduced total effort and resources required (RAT is a life saver, to be blunt). These products made an "impossible project possible" within the time frame/budget left in the overall transformation project, which included some 30+ GoldenGate migrations.

By using Oracle GoldenGate in the project, we were able to consolidate 3-4 different projects into one: moving to a new hardware platform, a new Oracle RDBMS version (upgraded to a supported version), a new Grid version, Oracle RAC/ASM, and enhancements to the physical and logical database design (new storage, partitioning, BIGFILE tablespaces, SecureFiles, and some different indexing strategies for specific tables due to RAC, etc.).

This new consolidated project was more complex and a bit longer than any of the 3-4 individual projects by themselves, and a bit riskier due to the added scope and complexity of all the moving parts and the use of logical replication for the data migration. However, do not let this discourage you, as these risks (all of them) can be managed and mitigated, and the Oracle products mentioned help reduce the risks and effort involved. Also, by consolidating the projects into one, we reduced the amount of testing time significantly compared to 3 or 4 separate projects; basically, we were able to test once for all the pieces.

For now, thank you Oracle, for the Oracle GoldenGate product (this is an awesome product and a market leading product), and thank you SAP for allowing the use of Oracle GoldenGate to migrate SAP to a new environment (of course with a certified SAP migration specialist on hand, as well as Oracle GoldenGate expert).

Using Oracle GoldenGate to migrate SAP to a new environment can literally make the "impossible project possible". Many of the risks we mitigated would otherwise still be present, would be nearly impossible to mitigate, and would require a large and lengthy WAR room post cut over. GoldenGate also allowed a significantly reduced outage window for the database migration pieces, typically from 16-48 hours down to 1-2 hours, which is significant to the business and the critical processes supported by the SAP implementation, given the potential cost of downtime to any or all parts of the SAP ecosystem.

Replicating the SAP data model using logical replication is both easy (no foreign keys or triggers and only one sequence, and most tables have a PK or UI on them) and a bit complicated at the same time, because SAP uses a single schema for all application data (SAPSR3). In a nutshell, all the data is in one schema in each database. When using Oracle GoldenGate, having multiple schemas makes instantiating and establishing apply at different time periods easier, because you can do it in pieces (by schema in most cases) rather than customizing the mappings for each table or piece with mappings that evaluate the CSN (it can be done, but we needed a simple config). Instantiating an SAP database from production, with no downtime and little impact on it, makes for some challenges when all the data is in one schema in each database. In this case, we used a static table listing to pull over all the tables in pieces (grouped by SCN), set up an apply for each piece, then added an apply that used a wildcard to catch any new tables and other things such as the OTHER DDL category.

If using 11.2.0.4+, definitely use the integrated apply if at all possible, and patch SAP to the highest level you can, because the RDBMS patches are included, and integrated apply requires RDBMS patches in addition to Oracle GoldenGate 12c. However, please be aware that if doing fall back replication to an older Oracle version, integrated apply likely will not work for you, and you will need a different apply configuration for the fall back replication (if using it). In our case, integrated apply worked really well, but we needed classic (coordinated) replicats for fallback replication. Therefore, instead of having two apply configurations and having to tune each strategy, we stuck with the classic replicat approach so we could easily and quickly reverse the replication with no additional apply configuration needed on the older 11.2.0.2 databases. We used integrated capture where we could.

I will continue to update this post to include some additional details regarding how others do it (Oracle ACS and their O2O, or "triple O", service) and how we deviated from it slightly in the initial load portion. We did not use GoldenGate's initial load for the initial data migration and instantiation; we used other Oracle approaches instead (a scripted framework built around Data Pump). The initial load capabilities in Oracle GoldenGate are definitely worth using; they work really well, and performance is acceptable in most cases.

More to come on this in late May or June.....trying to relax a bit now and take advantage of some free time.

Further reading...white paper from Oracle on SAP Migration:

http://www.oracle.com/technetwork/articles/systems-hardware-architecture/sap-migration-163923.pdf


Take care, thank you for reading my two cents worth and good luck in your migration projects.

To end this blog, I want to throw out a leadership quote by Red Adair: "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur."  Read more at: http://www.brainyquote.com/quotes/quotes/r/redadair195665.html



June 4, 2015

GoldenGate Exception Table Exception Handling and Reporting


Update to the blog content/functionality (Aug 2016):

A quick update on exception mappings. Exception mappings/parameters are not working with Oracle RDBMS 11.2.0.4 when using integrated replicats with OGG 12c (either 12.1.x or 12.2.x); the issue is specific to RDBMS 11.2.0.4. The update from Oracle confirms that they do not work with 11.2.0.4: if your mappings use exception parameters and they must work, you must either upgrade the RDBMS to 12c or use classic replicats. The issue is not with OGG but with the database, and there is no fix back-ported to 11.2.0.4; you must upgrade to the 12c RDBMS or use classic/coordinated replicats.

Another side note, OGG 12.2 looks incredible, we have tested it thoroughly at many customer sites and it is rock solid. We still recommend you always download the latest PSU for OGG and get them applied.


-- Original Blog Content

If using any form of logical replication, it is very likely that at some point you will run into an "out of sync" issue during the apply on the target side.  In most cases you will encounter errors for duplicate keys (for inserts) or missing data (for updates and deletes) and will get the error numbers corresponding to those errors as implemented by the database platform being used.  You can get other errors resulting from foreign key constraints, trigger errors, etc., but for now we will focus on the most common cases.  The downside to this "out of sync" issue is that it is rarely a product issue and most likely a human error issue.  Yes, I said it: it is those of us who work on the replication implementation that cause the most issues with it.  I would say 90% of issues are a result of human error versus 10% or less resulting from a product issue.

When using logical replication, "everything is situational" becomes a common phrase.  This tends to drive customers nuts; however, it is the absolute truth.  What you do, and how you do it, depends on the exact situation at hand and must factor in what is currently configured, and how, and what you are trying to accomplish or fix in terms of the changes to be implemented.  In most replication implementations, when this "out of sync" issue creeps up, the technical team supporting the replication would like to continue the apply for the data sets that are not out of sync, somehow deal with the errors that come up automatically, and get a good feel for how widespread the issue might be (the number of tables/rows potentially out of sync).  Continuing the replication (apply) for the remaining data sets is usually critical to the business, rather than stopping the apply altogether until all apply errors are analyzed and repaired.

The problem with the "out of sync" issue and continuing the apply is the result of multiple issues to be dealt with:

  • the potential volume of transactions that need to be applied to the target and the volume of those transactions that will encounter errors, or have been encountering apply errors, and must be dealt with manually
  • how the replication is currently configured to either handle or not handle apply errors
    In most replication products, the default error mode is to stop the apply (ABEND) whenever an apply error has occurred.  This works on a small scale (a low number of errors to be dealt with manually), as it allows you to fix the error (the affected rows) and continue the apply.
    This manual process works fine if there are a few hundred or fewer apply errors, but it is usually not feasible if there are potentially thousands, millions or billions of apply errors to come that you have not seen yet.
  • the configuration now needs to be split, i.e., a replicat (or a few of them) includes other tables which do not have issues and need to continue applying. This brings up the need to handle these apply errors programmatically while still applying changes to all the other tables
In this blog, I will cover some things you can do to make it easier to automate the error handling and significantly increase the accuracy and efficiency of analyzing the apply errors, so you can track/report on the number of errors, the types of errors and where the errors are (tables and/or rows).  Your implementation will still be out of sync (refreshing data is another topic); however, you will be able to continue the apply for the in-sync data and record all apply errors for analysis, giving you detailed information about what needs to be repaired to get back in sync.  As mentioned, repairing is another topic, and there are many ways to repair out-of-sync data.

If you are using Oracle GoldenGate (a powerful, flexible, amazing and market leading product), it has built-in features and functionality to help deal with apply errors automatically and programmatically, using its exception handling and CDR parameters.  In this blog, I will cover the exception handling parameters and the resulting mappings that can be used to automate the skipping and logging of apply errors for analysis. We can also use CDR (Conflict Detection and Resolution) parameters/mappings by themselves and/or together with exception logging parameters/mappings; I will cover these in the second part/blog on active-active, as CDR requires upfront configuration and analysis of how to handle data apply errors, and which data to deal with up front, so the parameters/mappings can be configured to handle the errors in those defined detection and resolution cases.

If your replication implementation does not have exception handling or CDR mappings in place, you can use these mappings to quickly get replication up and running while logging all apply errors that occur during the apply. The next step would be to evaluate the implementation to see what needs to be repaired, and to start discussing and preparing to configure Oracle GoldenGate's CDR mappings.  Oracle GoldenGate has most of the functionality built in to meet most (if not all) of your needs with the CDR parameters available in GoldenGate 11.2+.

I regularly implement this exception handling process quickly at customer sites when doing what I call "OGG 911".  The story usually goes like this: a customer calls and needs help because the apply has encountered multiple errors and has stopped all or some of the apply processes.  The technical team fixed the few records in error and restarted the apply.  After starting the apply, the processes got another error immediately; it was fixed, then another error occurred, and another.  After doing this 100-200 times, they throw up their hands, because they cannot predict how many more apply errors may be waiting in the trail files (CDC transactions) for which apply has not yet been attempted.  Nonetheless, they feel stuck.  They are in luck, however, as Oracle GoldenGate has several methods for handling this, and any of them could be used to get replication back up and running. The issue is that they still need to do root cause analysis as to why the apply errors occurred and repair the out-of-sync data. The method discussed here will help get you back up and running and, just as important, give you a good process and information regarding the "what" and "how much" of the new apply errors occurring as the apply runs.  The method I am going to detail offers quick and easy analysis (it uses a database table to log all errors for SQL analysis) and also captures the ROWID of the row from the source system that is now out of sync on the target side, for which changes could not be applied (remember, the ROWID is from the source, not the target).  Having the ROWID makes it easy to fix only the affected set of rows (and just those rows) quickly or programmatically, rather than re-syncing a 1-billion-row table all over again because you had no way of quickly knowing which rows were affected.
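As an illustration of such a logging table, a minimal exceptions table might look like the following. The table name and column set are illustrative, not a fixed OGG requirement; capture whatever helps your analysis:

```sql
CREATE TABLE ggadmin.ogg_exceptions (
  excp_date    TIMESTAMP DEFAULT SYSTIMESTAMP, -- when the error was logged
  rep_name     VARCHAR2(30),  -- replicat group that hit the error
  table_name   VARCHAR2(61),  -- owner.table of the failed operation
  errno        NUMBER,        -- database error number (e.g. 1 or 1403)
  optype       VARCHAR2(20),  -- INSERT / UPDATE / DELETE
  transind     VARCHAR2(20),  -- position of the operation in the transaction
  src_rowid    VARCHAR2(30),  -- ROWID of the row on the SOURCE system
  src_csn      VARCHAR2(30)   -- commit SCN/CSN on the source
);
```

Querying this table with plain SQL (for example, grouping by table_name and errno) immediately shows how widespread the divergence is and which rows need repair.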
Again, there are lots of ways to fix/repair data while replication is running, but I will not cover that here, to keep this shorter and to the point on exception handling; if you are interested, you should look into EMS/EMI and the mapping parameter FILTER to resync data inline.  This method is meant to help a customer that is in outage mode get back up and running quickly, while being able to analyze the impact of the errors.

Now, on to how to address the "out of sync" apply error issue. To do this, I will cover how Oracle GoldenGate works in default mode/default parameters and what parameters are available to help address and handle the "out of sync" issue.

First, we must cover what an apply error is to Oracle GoldenGate, i.e., the unit of work in Oracle GoldenGate when running with default values and default parameters (no exception handling or CDR parameters/mappings), just plain mappings and default values for the other parameters in the replicat.

By default, Oracle GoldenGate's error mode is to throw an error at the operation level and per operation if any operation results in an error code being returned.  An operation in Oracle GoldenGate is the basic unit of work: a single row in a table. This does not mean a transaction, the transaction could be 1 row or 100 rows.  That single row operation in the transaction could be a new record for a table or a change to an existing row in a table (such as an update or delete to a row).
If a transaction on the source changed 10 rows, GoldenGate captures 10 operations to be applied on the target.  When you insert 10 rows on the source, that results in 10 insert operations on the target. When you issue a single UPDATE statement on the source that updates 10 rows in a table, that results in 10 update operations (statements) on the target.  The same goes for deletes: a single DELETE statement that removes 10 rows results in 10 delete operations on the target (one for each row).  This also explains why logical apply is slower and often takes some tuning on the apply side. This is not unusual; all logical replication (SQL apply) products work this way. GoldenGate, however, is very fast at it and is well tuned to use native methods to make the apply extremely efficient.
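To make that concrete, here is a sketch (using a hypothetical hr.employees table) of how one multi-row statement on the source becomes one single-row operation per changed row on the target:

```sql
-- On the source: one UPDATE statement that happens to touch 10 rows.
UPDATE hr.employees SET salary = salary * 1.10 WHERE department_id = 10;

-- On the target: GoldenGate applies 10 separate single-row operations,
-- each qualified by the row's key, conceptually like:
UPDATE hr.employees SET salary = :new_salary WHERE employee_id = :key;
-- ...repeated once for each of the 10 changed rows.
```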

Now, what about transactional boundaries from source to target? Oracle GoldenGate does honor transactional order (SCN order). However, depending on how it is configured, it may not honor transactional boundaries when it comes to automated error handling. This is actually a good thing, as it gives you flexibility you may need in some cases.

By default, Oracle GoldenGate apply (replicats, whether classic, coordinated or integrated) will ABEND on any operational error (any single row change in a transaction that returns an error code). This means the replicat stops applying data at that spot whenever any single operation returns an error code.  This behavior is controlled by the REPERROR parameter, which defaults to ABEND on any operation that returns an error code. Even if you do not have REPERROR specified in your replicat parameter file, the default value is used at runtime; all Oracle GoldenGate parameters have default values that apply unless changed via the parameter file.  When an error occurs during apply, the replicat ABENDs, writes a message to the event log ("ggserr.log"), and writes information to a discard file if one has been configured.  At this point, you either have to fix the row in error on the target so the apply can succeed, or you must skip that row change somehow and move on to the next operation in the trail to keep the apply working for the remaining schemas/tables in this replicat's configuration.
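To make the default explicit, the runtime behavior is the same as if the replicat parameter file spelled it out. A sketch (hypothetical replicat name, credential alias and discard file settings):

```
REPLICAT RPTEST1
USERIDALIAS oggtgt
DISCARDFILE ./dirrpt/rptest1.dsc, APPEND, MEGABYTES 100
-- The next line is implicit; it is the built-in default error mode:
REPERROR (DEFAULT, ABEND)
MAP SCHEMA1.*, TARGET SCHEMA1.*;
```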

I will lay out your options here to continue the apply, basically skipping the operations that result in errors (my preference among them is discussed afterward):

1) SKIP the entire transaction in the trail (start replicat with SKIPTRANSACTION)

     - GGSCI> start replicat RPTEST1 SKIPTRANSACTION

     - skips all operations in the transaction (may skip more than you want or need)
2) Skip the one operation in the trail by repositioning replicat to the next operation in the trail
    (classic only)

     - GGSCI> DBLOGIN userid ogguser
     - GGSCI> alter replicat RPTEST1, extseqno ??, extrba ??
     - this is for advanced usage only

3) Use REPERROR with the discard or ignore options
     - writes the error and some column values to the text-based discard file
     - LOB and advanced/raw data types are not written because it is a text file

        REPERROR (-1403, DISCARD)   (use -<error_number> for integrated apply)
        REPERROR (-1403, IGNORE)    (use -<error_number> for integrated apply)

4) Use the HANDLECOLLISIONS parameter (very basic CDR-type functionality for duplicate rows)
      - skips any operations that result in missing-data errors (updates and deletes)
      - turns an insert into an update for duplicate-row errors
      - collisions are reported in the text-based report files
         - hard to analyze as a whole; it records the number of insert/update/delete
           collisions per table, but you do not know which row was involved or see any of the
           column/transactional information
         - if the report files roll over often, you may lose some of this information

       HANDLECOLLISIONS (put it before any table mappings - as a global parameter)

       Or, you can invoke HANDLECOLLISIONS on a per-mapping basis:

       MAP SCHEMA1.*, TARGET SCHEMA1.*, HANDLECOLLISIONS;           (for all tables)
       MAP SCHEMA1.TEST, TARGET SCHEMA1.TEST, HANDLECOLLISIONS;     (per table)

5) Use exception mappings to skip the operations in error, optionally log all the error
    information to a global exception table and/or a per-base table exception table,
    and continue the apply
 
       ** examples below

As you can see, my preference is #5, then #4, then #3. Again, your choice is often situational; what you need to know and do affects which option you use.  Of all the options, #5 requires changes to the mappings and requires creating a global exception table, and sometimes per-base table exception tables as well, depending on whether you want to see/preserve the transactional data (especially LOBs and other advanced data types).
Options #3 and #4 are very easy to implement by adding the parameter to the replicat PRM file.
Option #2 requires altering the replicat's position in the trail (for advanced users only), and option #1 is a parameter used when starting the replicat.

I do not like to use option #1, because all operations in the transaction are skipped (whether they would have resulted in an error or not); that is too risky unless it is your last and only option.
Using it can create more data needing repair, because you could skip changes for tables that were in sync, putting them out of sync.

The rest of this posting covers option #5: using exception mappings with a global and optionally per base table exception tables.

Exception mappings are implemented by adding a few parameters and changing the mappings in the replicat. No changes are required for the source configuration to use the exception mappings.

One thing to point out: there are two ways of implementing exception mappings, one for individual table mappings and one for mappings that involve wildcards. When using wildcard mappings, the MAPEXCEPTION option of the MAP parameter should be used.  With individual table mappings, the exception mapping is a second MAP statement that must come immediately after the initial MAP statement.

We will look at each example below.

To implement exception mappings for either method do the following:

1) Create the global exception table as the GoldenGate apply user in the target database
   (in that schema)

    Note: This is a global exception table; all errors for all tables/mappings are recorded here.
              It records metadata about the failed operation, including the ROWID, but not the
              transactional data itself; that is what per-base table exception tables are for.
              Per-base table exception tables are exact copies of the base tables, but with no
              constraints or indexes; they preserve the transactional data in the database instead
              of leaving it in the trail files.  You cannot process data inside the trails; only
              GoldenGate can. There are ways of getting the data out of the trail files (using
              adapters), but that is not covered here, as the adapters are powerful/flexible and
              need a lot of space to explain in detail.

-- Create the Global Exception Table:

CREATE TABLE EXCEPTIONS
(
 excp_date timestamp(6) default systimestamp,
 rep_name varchar2(10),
 table_name varchar2(56),
 errno number,
 errtype varchar2(6),
 optype varchar2(24),
 transind varchar2(12),
 transimgind varchar2(8),
 committimestamp varchar2(26),
 reccsn number,
 recseqno number,
 recrba number,
 rectranspos number,
 reclength number,
 logrba number,
 logposition number,
 grouptype varchar2(12),
 filename varchar2(50),
 fileno number,
 srcrowid varchar2(40),
 srcdbcharset varchar2(40),
 replag number,
 cnt_cdr_conflicts number,
 cnt_cdr_resolutions number,
 cnt_cdr_failed number 
);
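Not part of the required setup, but my suggestion: if error volume is high, a couple of indexes on the exception table can speed up the analysis queries shown later (index names are hypothetical):

```sql
CREATE INDEX exceptions_ix1 ON exceptions (table_name, errno, excp_date);
CREATE INDEX exceptions_ix2 ON exceptions (table_name, srcrowid);
```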


2) Change the replicat parameters to force exception processing when operations return an error code during apply, and map those errors to exception processing:

        Note: We can choose which error codes result in exception logging.
                  Also note, we are not logging all operations in the transaction, only the
                  ones that return an error code (the commented-out parameter below changes
                  this to transaction-level exception processing).

     -- Set error mode for GoldenGate (changing the default of "ABEND")
     REPERROR (DEFAULT, EXCEPTION)
     -- REPERROR (DEFAULT, TRANSEXCEPTION)
     REPERROR (DEFAULT2, ABEND)

     -- Set error codes that cause exception processing
     REPERROR (-1, EXCEPTION)
     REPERROR (-1403, EXCEPTION)
     REPERROR (-2291, EXCEPTION)

3) Change the mappings to include exception processing (parameters) and map the transaction
    metadata to columns in the exception table:

    -- Wildcard mappings: (uses the MAPEXCEPTION parameter to MAP)

           Note: Notice there is a MAPEXCEPTION option instead of a separate MAP statement.
                     The EXCEPTIONSONLY and INSERTALLRECORDS options are used within it.


MAP XTEST.*, TARGET XTEST.*, MAPEXCEPTION (TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS, COLMAP (
excp_date = @DATENOW(),
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'),
table_name = @GETENV ('GGHEADER', 'TABLENAME'),
errno = @GETENV ('LASTERR', 'DBERRNUM'),
errtype = @GETENV ('LASTERR', 'ERRTYPE'),
optype = @GETENV ('LASTERR', 'OPTYPE'),
transind = @GETENV ('GGHEADER', 'TRANSACTIONINDICATOR'),
transimgind = @GETENV ('GGHEADER', 'BEFOREAFTERINDICATOR'),
committimestamp = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'),
reccsn = @GETENV ('TRANSACTION', 'CSN'),
recseqno = @GETENV ('RECORD', 'FILESEQNO'),
recrba = @GETENV ('RECORD', 'FILERBA'),
rectranspos = @GETENV ('RECORD', 'RSN'),
reclength = @GETENV ('GGHEADER', 'RECORDLENGTH'),
logrba = @GETENV ('GGHEADER', 'LOGRBA'),
logposition = @GETENV ('GGHEADER', 'LOGPOSITION'),
grouptype = @GETENV ('GGENVIRONMENT', 'GROUPTYPE'),
filename = @GETENV ('GGFILEHEADER', 'FILENAME'),
fileno = @GETENV ('GGFILEHEADER', 'FILESEQNO'),
srcrowid = @GETENV ('RECORD', 'ROWID'),
srcdbcharset = @GETENV ('GGFILEHEADER', 'DBCHARSET'),
replag = @GETENV ('LAG', 'SEC'),
cnt_cdr_conflicts = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_CONFLICTS'),
cnt_cdr_resolutions = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_SUCCEEDED'),
cnt_cdr_failed = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_FAILED')
));

   -- Individual table mappings: (uses a separate MAP statement for exception processing)

         Note: Notice there is no MAPEXCEPTION option; instead, there is a separate MAP
                   statement with the EXCEPTIONSONLY and INSERTALLRECORDS options, placed
                   immediately after the base mapping.

MAP XTEST.TAB1, TARGET XTEST.TAB1;
MAP XTEST.TAB1, TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS,
COLMAP (
excp_date = @DATENOW(),
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'),
table_name = @GETENV ('GGHEADER', 'TABLENAME'),
errno = @GETENV ('LASTERR', 'DBERRNUM'),
errtype = @GETENV ('LASTERR', 'ERRTYPE'),
optype = @GETENV ('LASTERR', 'OPTYPE'),
transind = @GETENV ('GGHEADER', 'TRANSACTIONINDICATOR'),
transimgind = @GETENV ('GGHEADER', 'BEFOREAFTERINDICATOR'),
committimestamp = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'),
reccsn = @GETENV ('TRANSACTION', 'CSN'),
recseqno = @GETENV ('RECORD', 'FILESEQNO'),
recrba = @GETENV ('RECORD', 'FILERBA'),
rectranspos = @GETENV ('RECORD', 'RSN'),
reclength = @GETENV ('GGHEADER', 'RECORDLENGTH'),
logrba = @GETENV ('GGHEADER', 'LOGRBA'),
logposition = @GETENV ('GGHEADER', 'LOGPOSITION'),
grouptype = @GETENV ('GGENVIRONMENT', 'GROUPTYPE'),
filename = @GETENV ('GGFILEHEADER', 'FILENAME'),
fileno = @GETENV ('GGFILEHEADER', 'FILESEQNO'),
srcrowid = @GETENV ('RECORD', 'ROWID'),
srcdbcharset = @GETENV ('GGFILEHEADER', 'DBCHARSET'),
replag = @GETENV ('LAG', 'SEC'),
cnt_cdr_conflicts = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_CONFLICTS'),
cnt_cdr_resolutions = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_SUCCEEDED'),
cnt_cdr_failed = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_FAILED')
);


-- Example of wildcard exception mapping with CDR rules:

   When CDR is invoked, the conflict will be handled automatically (in most cases)
    and the exception logged, so you know where conflicts occur and how often.

MAP XTEST.*, TARGET XTEST.*,
COMPARECOLS (ON UPDATE KEYINCLUDING (OGG_TS), ON DELETE KEYINCLUDING (OGG_TS)),
RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAX (OGG_TS))),
RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGG_TS))),
RESOLVECONFLICT (UPDATEROWMISSING, (DEFAULT, OVERWRITE)),
RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)),
RESOLVECONFLICT (DELETEROWMISSING, (DEFAULT, DISCARD)),
MAPEXCEPTION (TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS, COLMAP (
excp_date = @DATENOW(),
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'),
table_name = @GETENV ('GGHEADER', 'TABLENAME'),
errno = @GETENV ('LASTERR', 'DBERRNUM'),
errtype = @GETENV ('LASTERR', 'ERRTYPE'),
optype = @GETENV ('LASTERR', 'OPTYPE'),
transind = @GETENV ('GGHEADER', 'TRANSACTIONINDICATOR'),
transimgind = @GETENV ('GGHEADER', 'BEFOREAFTERINDICATOR'),
committimestamp = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'),
reccsn = @GETENV ('TRANSACTION', 'CSN'),
recseqno = @GETENV ('RECORD', 'FILESEQNO'),
recrba = @GETENV ('RECORD', 'FILERBA'),
rectranspos = @GETENV ('RECORD', 'RSN'),
reclength = @GETENV ('GGHEADER', 'RECORDLENGTH'),
logrba = @GETENV ('GGHEADER', 'LOGRBA'),
logposition = @GETENV ('GGHEADER', 'LOGPOSITION'),
grouptype = @GETENV ('GGENVIRONMENT', 'GROUPTYPE'),
filename = @GETENV ('GGFILEHEADER', 'FILENAME'),
fileno = @GETENV ('GGFILEHEADER', 'FILESEQNO'),
srcrowid = @GETENV ('RECORD', 'ROWID'),
srcdbcharset = @GETENV ('GGFILEHEADER', 'DBCHARSET'),
replag = @GETENV ('LAG', 'SEC'),
cnt_cdr_conflicts = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_CONFLICTS'),
cnt_cdr_resolutions = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_SUCCEEDED'),
cnt_cdr_failed = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_FAILED')
));

One thing to note: since the mappings are very long due to the transformations (COLMAP) to the exception table, they are normally put into a macro library containing macros for both wildcard mappings and individual table mappings. Calling the macro keeps the PRM files short, clean and standardized.
Example of calling the macros to do the mappings for a schema or a single table:

   #replicateschemawithGlobalExcp (XTEST,XTEST);
   #replicatetablewithGlobalExcp (XTEST.TAB1);
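The macro library itself is not shown in this post; below is a minimal sketch of what such a schema-level macro could look like (hypothetical file name, COLMAP abbreviated to a few columns). It would be pulled into the replicat PRM with INCLUDE:

```
-- ./dirprm/excp_macros.prm (hypothetical library file)
-- In the replicat PRM: INCLUDE ./dirprm/excp_macros.prm
MACRO #replicateschemawithGlobalExcp
PARAMS (#src_schema, #tgt_schema)
BEGIN
MAP #src_schema.*, TARGET #tgt_schema.*, MAPEXCEPTION (TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS, COLMAP (
excp_date = @DATENOW(),
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'),
table_name = @GETENV ('GGHEADER', 'TABLENAME'),
errno = @GETENV ('LASTERR', 'DBERRNUM'),
srcrowid = @GETENV ('RECORD', 'ROWID')
));
END;
```

The remaining exception columns from the full mappings shown earlier would be added to the COLMAP in the same way.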

You can easily create additional macros that do both CDR and exception logging; CDR tries to resolve the conflict, but the exception is still logged:

   #replicateschemawithCDRandGlobalExcp (XTEST,XTEST);
   #replicatetablewithCDRandGlobalExcp (XTEST.TAB1);

When the mappings need to change, or we need several different mappings to switch between, we can create multiple macros and simply change the macro name in the PRM file to get the desired functionality. See my blog post on macros, variables and include files for simple examples. Macros can also be nested, but in that post I keep them simple to understand.

See the following post for how to use macros and variables to streamline and standardize a configuration. They make the config a bit more complicated, but they also make it much faster to change the configuration globally when needed. Macros are a lifesaver when you have many long mappings; they basically enable reuse, which is always efficient. Hopefully, someday, Oracle will expand the macro capabilities with some looping and evaluation constructs.

        Include files, variables and macros in GoldenGate

When all this is done and any operation returns error -1 or error 1403 (Oracle errors in this case) during apply, the errors are logged to the exception table on the target database for analysis and review. You can use SQL to analyze them in many different ways, making it easy to see the big picture of the errors, error types, and tables involved that need to be repaired (resynced somehow).  Fixing the errors is again situational; there are many ways to do it, and that is not covered here. I prefer to configure the EMS in OGG so I can sync tables inline with a small outage for the apply, without having to create new replicats and later merge them back into one.  More on that in other blog posts.

You can now query the exception table for errors, using SQL to drill down or summarize as needed. This is much easier than writing scripts to parse text files; plus, the big advantage is that you know more about the transaction, including the ROWIDs of the source rows involved. The ROWIDs are invaluable and can be used to fix selected rows.



REP_NAME  TABLE_NAME                        ERRNO   COUNT(*)
--------- ---------------------------- ---------- ----------
RDEV1001  GGTEST.TAB1                           1         20
RDEV1001  GGTEST.TAB1                        1403      16896
RDEV1001  GGTEST.TAB9                        1403         16
RDEV2001  GGTEST.TAB8                        1403         32
4 rows selected.
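The summary above can be produced with a simple aggregate query against the exception table; a sketch:

```sql
SELECT rep_name, table_name, errno, COUNT(*)
  FROM exceptions
 GROUP BY rep_name, table_name, errno
 ORDER BY rep_name, table_name, errno;
```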


  EXCP_DATE REP_NAME  TABLE_NAME                   ERRNO   COUNT(*)
----------- --------- ----------------------- ---------- ----------
11-NOV-2013 RDEV1001  GGTEST.TAB1                      1         16
11-NOV-2013 RDEV1001  GGTEST.TAB1                   1403         16
14-NOV-2013 RDEV1001  GGTEST.TAB1                      1          4
14-NOV-2013 RDEV1001  GGTEST.TAB1                   1403         12
15-NOV-2013 RDEV1001  GGTEST.TAB1                   1403          6
16-NOV-2013 RDEV1001  GGTEST.TAB1                   1403         19
20-NOV-2013 RDEV1001  GGTEST.TAB1                   1403      13200
20-NOV-2013 RDEV2001  GGTEST.TAB8                   1403          2
21-NOV-2013 RDEV1001  GGTEST.TAB1                   1403       3643
21-NOV-2013 RDEV1001  GGTEST.TAB9                   1403         16
21-NOV-2013 RDEV2001  GGTEST.TAB8                   1403         30


11 rows selected.
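Adding the day to the grouping gives the per-date breakdown above (TRUNC strips the time portion of the timestamp); a sketch:

```sql
SELECT TRUNC(excp_date) AS excp_date, rep_name, table_name, errno, COUNT(*)
  FROM exceptions
 GROUP BY TRUNC(excp_date), rep_name, table_name, errno
 ORDER BY 1, 2, 3, 4;
```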


Here is a look at more columns in the exceptions table. Remember, there are about 25 additional columns in the exception table; you certainly do not need all of them, but about 6-8 are very handy, and since this is an exception table, it can be cleaned out when the data is no longer needed.
I take the view that it is better to have more information to troubleshoot with than not enough.

EXCPDATE   REP_NAME TABLE_NAME              ERRNO ERRTYP OPTYPE       SRCROWID
---------- -------- ---------------------- ------ ------ ------------ ---------------------
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARksAAJ
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARksAAA
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARksAAB
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARkuAAB
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARkuAAA
2014-11-13 RPZTEST  ZTEST.TAB8                 1  DB     INSERT       AAAfkRAAEAAARgVAAK
2014-11-13 RPZTEST  ZTEST.TAB8                 1  DB     INSERT       AAAfkRAAEAAARgVAAA
2014-11-13 RPZTEST  ZTEST.TAB8              1403  DB     SQL COMPUPDA AAAfkRAAEAAARgUAAA
2014-11-13 RPZTEST  ZTEST.TAB8              1403  DB     SQL COMPUPDA AAAfkRAAEAAARgUAAB
2015-05-14 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0WAAA
2015-05-14 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0WAAe
2015-05-14 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0WAAi
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAD
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAm
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAP
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAm
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAP
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAD
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAe
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAA
2015-05-20 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0TAAF
2015-05-20 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0TAAG
2015-05-20 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0TAAN


Below is an example of logging the transactional information to a global exception table (all exceptions are logged to this generic summary table) as well as to a per-base table exception table. The per-base table exception table is an exact replica of the base target table, with additional columns added to store the metadata about the transaction along with the transactional data itself. The additional columns are not strictly needed, per se, as the metadata is also stored in the global exception table, but it is best to include the source ROWID in both exception tables to link the rows together when the metadata is not included in the per-base table.  This is an advanced configuration for environments where the transactional data must be retained at all times, either for auditing and/or for programmatic reprocessing of the apply errors (if you had millions of apply errors, you would not want to fix or summarize them by hand, so this offers an easy way to do the analysis).
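A per-base exception table can be created from the base table itself. A sketch for a hypothetical TEST.TAB1 (the CTAS copies the table's shape but none of its indexes or constraints other than NOT NULL):

```sql
CREATE TABLE test.tab1_ex AS SELECT * FROM test.tab1 WHERE 1 = 0;

-- Optional metadata columns (the same values are also in the global table):
ALTER TABLE test.tab1_ex ADD (
  excp_date   TIMESTAMP(6) DEFAULT SYSTIMESTAMP,
  rep_name    VARCHAR2(10),
  table_name  VARCHAR2(56),
  errno       NUMBER,
  errtype     VARCHAR2(6),
  optype      VARCHAR2(24),
  srcrowid    VARCHAR2(40)
);
```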

In the output below, the first columns (COL1, COL2, COL3, CR_DATE, MOD_DATE) are the transactional data from the source; the remaining columns are the transaction metadata we added.

COL1 COL2     COL3 CR_DATE   MOD_DATE  EXCP_DATE REP_NAME  TABLE_NAME    ERRNO ERRT OPTYPE         TRANSIND COMMITTIM    RECCSN RECSEQNO RECRBA RECTRANSPOS LOGRBA LOGPOSITION GROUPTYPE  FILENAME       FILENO
----- -------- ---- --------- --------- --------- --------- ------------ ------ ---- -------------- -------- --------- --------- -------- ------ ----------- ------ ----------- ---------- -------------- ------
   15 change3       07-SEP-13 07-SEP-13 07-SEP-13 RPRTEST   RTEST.TAB1     1403 DB   SQL COMPUPDATE END      07-SEP-13   6728670       22   2204                  0      546832 REPLICAT   ./dirdat/z1000 22
   15 change4  C4   07-SEP-13 07-SEP-13 07-SEP-13 RPRTEST   RTEST.TAB1     1403 DB   SQL COMPUPDATE END      07-SEP-13   6729169       22   2542                  0     2554384 REPLICAT   ./dirdat/z1000 22
   13 row13    XX   01-SEP-13 01-SEP-13 07-SEP-13 RPRTEST   RTEST.TAB1     1403 DB   DELETE         WHOLE    07-SEP-13   6730273       24   2491                389     3254288 REPLICAT   ./dirdat/z1000 24
   13 row13    XX   01-SEP-13 01-SEP-13 01-SEP-13 RPRTEST   RTEST.TAB1        1 DB   INSERT         WHOLE    01-SEP-13   6378047        6   2229                365    15399440 REPLICAT   ./dirdat/z1000  6
   14 row14a   C1   01-SEP-13 01-SEP-13 01-SEP-13 RPRTEST   RTEST.TAB1        1 DB   INSERT         WHOLE    01-SEP-13   6378120        6   2426                365    15831348 REPLICAT   ./dirdat/z1000  6
   15 row15b   C1   01-SEP-13 01-SEP-13 01-SEP-13 RPRTEST   RTEST.TAB1        1 DB   INSERT         WHOLE    01-SEP-13   6378215        6   2624                365    15872820 REPLICAT   ./dirdat/z1000  6


6 rows selected.


Below are the sample mappings to log the exception data to a "global" exception table (all processes report errors there) and to a per-base table exception table. The example uses a procedure call to log to the global table, while the exception mapping maps to the per-base table to preserve the transactional data. This example also uses CDR; it has been used in many implementations, both CDR and non-CDR (just remove the CDR portions). And lastly, I show the full mapping list here, although in practice the mappings are invoked via a macro call with parameters.

Also note, this example comes from a bi-directional setup between a 12c pluggable database (PDB) and an 11.2.0.4 single instance. The customer was upgrading to 12c in phases on their active-active system, and this was used to POC the active-active implementation and all the features before implementing. It was a huge success and a value add for the customer, who went active-active and upgraded along the way:


2016-08-30 07:31:53  INFO    OGG-06506  Wildcard MAP resolved (entry PDB2.TEST.*):

MAP "PDB2"."TEST"."TAB1", TARGET TEST."TAB1",
COMPARECOLS(ON UPDATE KEY, ON DELETE KEY), 

  RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAXEQ (OGGTS))),     
  RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAXEQ (OGGTS))), 
  RESOLVECONFLICT (UPDATEROWMISSING,(DEFAULT, OVERWRITE)),
  RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)), 
  RESOLVECONFLICT (DELETEROWMISSING,(DEFAULT, DISCARD)), 
MAPEXCEPTION (TARGET TEST.*_EX, 
SQLEXEC (SPNAME OGG.LOGEXCEPTION, ID @DATE('YYYY-MM-DD HH:MI:SS.FFFFFF','JTS',@GETENV('JULIANTIMESTAMP')), 
PARAMS ( 
rep_name=@GETENV ('GGENVIRONMENT','GROUPNAME'), 
table_name=@GETENV ('GGHEADER','TABLENAME'), errno=@GETENV('LASTERR','DBERRNUM'), 
errtype=@GETENV('LASTERR','ERRTYPE'), 
optype=@GETENV('LASTERR','OPTYPE'), transind=@GETENV('GGHEADER','TRANSACTIONINDICATOR'), transimgind=@GETENV('GGHEADER','BEFOREAFTERINDICATOR'), 
committimestamp= @STREXT(@GETENV ('GGHEADER','COMMITTIMESTAMP'),1,22), reccsn=@GETENV('TRANSACTION','CSN'), 
recseqno=@GETENV('RECORD','FILESEQNO'), 
recrba=@GETENV('RECORD','FILERBA'), 
rectranspos=@GETENV('RECORD','RSN'), reclength=@GETENV('GGHEADAER','RECORDLENGTH'), logrba=@GETENV('GGHEADER','LOGRBA'), logposition=@GETENV('GGHEADER','LOGPOSITION'), grouptype=@GETENV('GGENVIRONMENT','GROUPTYPE'), filename=@GETENV('GGFILEHEADER','FILENAME'), fileno=@GETENV('GGFILEHEADER','FILESEQNO'), srcrowid=@GETENV('RECORD','ROWID'), 
srcdbcharset =@GETENV('GGFILEHEADER','DBCHARSET'), 
replag=@GETENV('LAG','SEC'), 
cnt_cdr_conflicts=@GETENV ('DELTASTATS','TABLE',@GETENV ('GGHEADER','TABLENAME'),'CDR_CONFLICTS'), 
cnt_cdr_resolutions=@GETENV ('DELTASTATS','TABLE', @GETENV ('GGHEADER','TABLENAME'),'CDR_RESOLUTIONS_SUCCEEDED'), cnt_cdr_failed=@GETENV ('DELTASTATS','TABLE', @GETENV ('GGHEADER','TABLENAME'),'CDR_RESOLUTIONS_FAILED') ), ALLPARAMS OPTIONAL, ERROR REPORT, DBOP, TRACE ALL), 
EXCEPTIONSONLY, INSERTALLRECORDS, 
COLMAP (USEDEFAULTS, 
excp_date=@DATENOW(), 
rep_name=@GETENV ('GGENVIRONMENT', 'GROUPNAME'), 
table_name=@GETENV ('GGHEADER', 'TABLENAME'), 
errno= @GETENV ('LASTERR', 'DBERRNUM'), 
errtype=@GETENV ('LASTERR', 'ERRTYPE'), 
optype=@GETENV ('LASTERR', 'OPTYPE'), transind=@GETENV('GGHEADER','TRANSACTIONINDICATOR'), 
committimestamp = @GETENV ('GGHEADER','COMMITTIMESTAMP'), 
reccsn=@GETENV ('TRANSACTION','CSN'), 
recseqno=@GETENV ('RECORD','FILESEQNO'), 
recrba=@GETENV ('RECORD','FILERBA'), 
rectranspos=@GETENV ('RECORD','RSN'), 
logrba=@GETENV ('GGHEADER','LOGRBA'), 
logposition=@GETENV ('GGHEADER','LOGPOSITION'), 
grouptype=@GETENV ('GGENVIRONMENT','GROUPTYPE'), 
filename=@GETENV ('GGFILEHEADER','FILENAME'), 
fileno=@GETENV ('GGFILEHEADER', 'FILESEQNO'), 
srcdbcharset=@GETENV ('GGFILEHEADER', 'DBCHARSET') ));
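The OGG.LOGEXCEPTION procedure itself is not shown in this post. Below is a minimal sketch of what it could look like, assuming it simply inserts the passed values into the global EXCEPTIONS table (parameter list abbreviated; the real procedure would accept every name in the PARAMS clause above):

```sql
CREATE OR REPLACE PROCEDURE ogg.logexception (
  rep_name    IN VARCHAR2,
  table_name  IN VARCHAR2,
  errno       IN NUMBER,
  errtype     IN VARCHAR2,
  optype      IN VARCHAR2,
  srcrowid    IN VARCHAR2
) AS
BEGIN
  -- Log the failed operation's metadata to the global exception table.
  INSERT INTO ogg.exceptions
    (excp_date, rep_name, table_name, errno, errtype, optype, srcrowid)
  VALUES
    (SYSTIMESTAMP, rep_name, table_name, errno, errtype, optype, srcrowid);
END logexception;
/
```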

Below is the traced call to the stored procedure, with TRACE ALL enabled so you can see the values being passed to the procedure:

SQLEXEC @DATE('YYYY-MM-DD HH:MI:SS.FFFFFF','JTS',@GETENV('JULIANTIMESTAMP')) succeeded:
>            REP_NAME: RPTEST2
>          TABLE_NAME: PDB2.TEST.TAB1
>               ERRNO: 1403
>             ERRTYPE: DB
>              OPTYPE: SQL COMPUPDATE
>            TRANSIND: MIDDLE
>         TRANSIMGIND: AFTER
>     COMMITTIMESTAMP: 2016-08-29 15:35:14.12
>              RECCSN: 1972727
>            RECSEQNO: 28
>              RECRBA: 15529
>         RECTRANSPOS:
>           RECLENGTH:
>              LOGRBA: 48
>         LOGPOSITION: 24156860
>           GROUPTYPE: REPLICAT
>            FILENAME: ./dirdat/r2000000028
>              FILENO: 28
>            SRCROWID: AAAWcxAAKAAAADNAAM
>        SRCDBCHARSET: 0
>              REPLAG: 57399
>   CNT_CDR_CONFLICTS: 0
>   CNT_CDR_RESOLUTIONS: 0
>      CNT_CDR_FAILED: 0

As mentioned, you certainly do not want to type all of this in for each table/schema; the mappings are therefore implemented via a macro call with parameters to localize the mappings for the schema:

#replicateschemaCDROnKeyandDate (TEST,TEST,TEST.*_EX, OGGTS);


CDR has a few requirements that must be met, and it requires discussion of the business logic that would be used to automatically handle errors, to see whether that logic fits how the CDR built into OGG works.

In another blog post I will show how you can use this metadata about the transactions to produce a transactions table showing all table activity. This is sometimes very helpful when you need to profile an application (see everything about the transactions) or understand how much data in the trail may still be problematic and has not yet been processed.


Good luck and wishing you the best in all your replication endeavors!

Gregory Thomas