June 4, 2015

GoldenGate Exception Table Exception Handling and Reporting


Update to the blog content/functionality (Aug 2016):

A quick update on exception mappings.  Exception mappings/parameters do not work with integrated replicats when using OGG 12c (either 12.1.x or 12.2.x) against Oracle RDBMS 11.2.0.4.  Oracle has confirmed that they do not work with 11.2.0.4: if your mappings rely on exception parameters, you must either upgrade the RDBMS to 12c or use classic replicats.  The issue is not with OGG but with the database, and there is no fix back-ported to 11.2.0.4; you must upgrade to 12c RDBMS or use classic/coordinated replicats.

On another side note, OGG 12.2 looks incredible; we have tested it thoroughly at many customer sites and it is rock solid. We still recommend that you always download and apply the latest PSU for OGG.


-- Original Blog Content

If you use any form of logical replication, it is very likely that at some point you will run into an "out of sync" issue during the apply on the target side.  In most cases you will encounter errors for duplicate keys (for inserts) or missing data (for updates and deletes), with the error numbers corresponding to those errors on the database platform being used.  You can get other errors resulting from foreign key constraints, triggers, etc., but for now we will focus on the most common cases.  The downside to this "out of sync" issue is that it is rarely a product issue and most likely a human error issue.  Yes, I said it: it is those of us who work on the replication implementation that cause most of the issues with it.  I would say 90% are the result of human error versus 10% or less the result of a product issue.

When using logical replication, "everything is situational" becomes a common phrase.  This tends to drive customers nuts; however, it is the absolute truth.  What you do, and how you do it, depends on the exact situation at hand and must factor in what is currently configured, and how and what you are trying to accomplish or fix with the changes to be implemented.  In most replication implementations, when this "out of sync" issue creeps up, the technical team supporting the replication wants to continue the "apply" for the data or data sets that are not out of sync, deal with the errors that come up automatically, and get a good feel for how widespread the issue might be (the number of tables/rows potentially out of sync).  Continuing the replication ("apply") for the remaining data sets is usually critical to the business, rather than stopping the apply altogether until all apply errors are analyzed and repaired.

Continuing the apply when an "out of sync" issue occurs means dealing with several problems at once:

  • the potential volume of transactions that still need to be applied to the target, and the volume of those that will encounter errors (or already have) and must be dealt with manually
  • how the replication is currently configured to either handle or not handle apply errors
    In most replication products, the default error mode is to stop the apply (ABEND) whenever an apply error occurs.  This works on a small scale (a lower number of errors to deal with manually), as it allows you to fix the error (the affected rows) and continue the apply.
    This manual process works fine if there are a few hundred or fewer apply errors, but it is usually not feasible if there are potentially thousands, millions or billions of apply errors still to come that you have not seen yet. 
  • the configuration may now need to be split, i.e., a replicat (or a few of them) also includes tables that do not have issues and need to continue applying; this brings up the need to handle the apply errors programmatically while still applying changes to all the other tables

In this blog, I will cover some things you can do to make it easier to automate the error handling and significantly improve the accuracy and efficiency of analyzing the apply errors, so you can track and report on the number of errors, the types of errors and where the errors are (tables and/or rows).  Your implementation will still be out of sync (refreshing data is another topic); however, you will be able to continue the apply for the in-sync data and record all apply errors for analysis, giving you detailed information about what needs to be repaired to get back in sync.  As mentioned, repairing is another topic and there are many ways to repair the out-of-sync data.

If you are using Oracle GoldenGate (a powerful, flexible, amazing and market-leading product), it has built-in features and functionality to help deal with apply errors automatically and programmatically using its exception handling and CDR parameters.  In this blog, I will cover the exception handling parameters and the resulting mappings that can be used to automate the skipping and logging of apply errors for analysis. We can also use CDR (Conflict Detection and Resolution) parameters/mappings by themselves and/or together with exception logging parameters/mappings; I will cover those in a second blog on active-active, since CDR requires upfront analysis and configuration to decide how data apply errors should be detected and resolved, and which data to handle up front, so the parameters/mappings can be configured to handle the errors in those defined detection and resolution cases.

If your replication implementation does not have exception handling or CDR mappings in place, you can use these mappings to quickly get replication up and running while logging all apply errors that occur during the apply. The next step would be to evaluate the implementation to see what needs to be repaired, and to start discussing and preparing to configure Oracle GoldenGate's CDR mappings.  Oracle GoldenGate has most of the functionality built in to meet most (if not all) of your needs with the CDR parameters available in GoldenGate 11.2 and later.

I regularly implement this exception handling process quickly at customer sites when doing what I call "OGG 911".  The story usually goes like this: a customer calls and needs help because the apply has encountered multiple errors and has stopped all or some of the apply processes.  The technical team fixed the few records in error and restarted the apply.  After starting the apply, the processes immediately hit another error; it was fixed, then another error, and another.  After doing this 100-200 times, they throw up their hands because they cannot predict how many more apply errors may be waiting in the trail files (CDC transactions) that have not yet been applied.  They feel stuck.  They are in luck, however, as Oracle GoldenGate has several methods for handling this, and any of them could be used to get replication back up and running. They still need to do root cause analysis as to why the apply errors occurred and repair the out-of-sync data. The method discussed here will help get you back up and running and, just as important, give you a good process and information regarding what and how many new apply errors are occurring as the apply runs.  The method I am going to detail makes analysis quick and easy (it uses a database table to log all errors for SQL analysis) and also captures the ROWID of the row on the source system whose change could not be applied on the target (remember, the ROWID is from the source, not the target).  Having the ROWID makes it easy to fix only the affected rows (and just those rows) quickly or programmatically, rather than re-syncing a 1-billion-row table all over again because you had no way of quickly knowing which rows are affected. Again, there are lots of ways to fix/repair data while replication is running, but I will not cover that here to keep this post shorter and focused on exception handling; if you are interested, look into EMS/EMI and the mapping parameter FILTER to resync data inline.  This method is meant to help a customer that is in outage mode get back up and running quickly and be able to analyze the impact of the errors.

Now, on to how to address the "out of sync" apply error issue. To do this, I will first cover how Oracle GoldenGate behaves with its default parameters, and then what parameters are available to help handle the "out of sync" issue.

First, we must cover what an apply error is to Oracle GoldenGate, i.e., what the unit of work is in Oracle GoldenGate when running with default values and default parameters (no exception handling or CDR parameters/mappings), just plain mappings and default values for the other parameters in the replicat.

By default, Oracle GoldenGate's error mode is to throw an error at the operation level, per operation, whenever an operation returns an error code.  An operation in Oracle GoldenGate is the basic unit of work: a single row change in a table. This is not the same as a transaction; a transaction could be 1 row or 100 rows.  That single row operation in the transaction could be a new record for a table or a change to an existing row in a table (such as an update or delete of a row).
If a transaction on the source changed 10 rows, GoldenGate captures 10 operations to be applied on the target.  When you insert 10 rows on the source, that results in 10 operations (10 inserts) to be applied on the target. When you issue an update DML on the source that updates 10 rows in a table (a single UPDATE statement), that results in 10 update operations (statements) to be applied on the target.  The same goes for deletes: a single delete DML on the source that deletes 10 rows from a table results in 10 delete operations on the target (one for each row).  This also explains why logical apply is slower and often takes some tuning on the apply side. This is not unusual; all logical replication (SQL apply) products work this way. GoldenGate, however, is very fast at it and is well tuned to use native methods to make the apply extremely efficient.
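For example, assuming a hypothetical table where the predicate matches 10 rows on the source:

   UPDATE scott.emp SET sal = sal * 1.1 WHERE deptno = 10;   -- one statement, 10 rows changed on the source

GoldenGate captures ten separate update operations for this single statement, and the replicat applies ten single-row UPDATE statements on the target.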

Now, what about transactional boundaries from source to target? Oracle GoldenGate does honor transactional order (SCN order). Depending on how it is configured, however, it may not honor transactional boundaries when it comes to handling errors in an automated manner. This is actually a good thing, as it gives you flexibility you may need in some cases.

By default, Oracle GoldenGate apply (replicats, whether classic, coordinated or integrated) will ABEND on any operational error (any single row change in the transaction that returns an error code). This means the replicat will stop applying data at that spot whenever any single operation returns an error code.  This is controlled by the REPERROR parameter.  By default, REPERROR is set to ABEND on any operation that returns an error code. Even if you do not have REPERROR specified in your replicat parameter file, the default value for the parameter is used at runtime; all Oracle GoldenGate parameters have default values, and those values are used unless changed in a parameter file.  When an error occurs during apply, the replicat will ABEND and write a message to the event log ("ggserr.log"), and also write information to a discard file if one has been configured.  At this point, you either have to fix the row in error on the target so the apply can complete without returning an error code, or you must skip this row change somehow and move on to the next operation in the trail to keep the apply working for the remaining schemas/tables in this replicat's configuration.
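In other words, a replicat parameter file with no REPERROR entry at all behaves as if the implicit default were present:

   REPERROR (DEFAULT, ABEND)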

I will lay out your options here to continue the apply, basically skipping the operations that result in errors (my preference among them is discussed after the list):

1) SKIP the entire transaction in the trail (start replicat with SKIPTRANSACTION)

     - GGSCI> start replicat RPTEST1 SKIPTRANSACTION

     - skips all operations in the transaction (may skip more than you want or need)
2) Skip the one operation in the trail by repositioning the replicat to the next operation in the trail
    (classic replicats only)

     - GGSCI> DBLOGIN userid ogguser
     - GGSCI> alter replicat RPTEST1, extseqno ??, extrba
     - this is for advanced usage only

3) Use REPERROR with the discard or ignore options
     - writes the error and some column values to the text-based discard file
     - LOBs and other advanced/raw data types are not written due to the text file format

        REPERROR (-1403, DISCARD)   (use -<error_number> for integrated apply)
        REPERROR (-1403, IGNORE)      (use -<error_number> for integrated apply)

4) Use the HANDLECOLLISIONS parameter (very basic CDR type functionality for duplicate rows)
      - skips any operations that result in missing-data errors (updates and deletes)
      - turns an insert into an update for duplicate-row errors
      - collisions are reported in the text-based report files
         - hard to analyze as a whole: the report records the number of insert/update/delete
           collisions per table, however, you do not know which rows were involved or see any of the
           column/transactional information
         - if the report files roll over often, you may lose some of this information as the
           report files age out

       HANDLECOLLISIONS (put it before any table mappings - as a global parameter)

       Or, you can invoke HANDLECOLLISIONS on a per mapping basis:

       MAP SCHEMA1.*, TARGET SCHEMA1.*, HANDLECOLLISIONS;            (for all tables)
       MAP SCHEMA1.TEST, TARGET SCHEMA1.TEST, HANDLECOLLISIONS;   (per table)

5) Use exception mappings to skip the operations in error and, optionally, log all the error
    information to a global exception table and/or per-base-table exception tables,
    and continue the apply
 
       ** examples below

As you can see, my preference is #5, then #4 and then #3. Again, your choice is often situational and depends on what you need to know and do.  Of all the options, #5 requires some changes to the mappings and requires creating a global exception table, and sometimes per-base-table exception tables as well, depending on whether you want to see/preserve the transactional data (especially LOBs and other advanced data types).
Options 3 and 4 are very easy to implement by adding the parameter to the replicat PRM file.
Option 2 requires altering the replicat's position in the trail (for advanced users only), and option 1 is a parameter used when starting the replicat.

I do not like to use option #1, because all operations in the transaction are skipped (whether they would have resulted in an error or not); that is too risky to use unless it is your last and only option.
Using it will result in more data needing repair, because you could skip changes to tables that are currently in sync and cause them to go out of sync.

The rest of this posting covers option #5: using exception mappings with a global exception table and, optionally, per-base-table exception tables.

Exception mappings are implemented by adding a few parameters and changing the mappings in the replicat. No changes are required for the source configuration to use the exception mappings.

One thing to point out: there are two ways of implementing exception mappings. One method applies to individual table mappings and the other to mappings that use wildcards. When using wildcard mappings, the MAPEXCEPTION option of the MAP parameter should be used.  When using individual table mappings, the exception mapping is a second MAP statement placed immediately after the initial MAP statement (it must come immediately after it).

We will look at each example below.

To implement exception mappings for either method do the following:

1) Create the global exception table as the GoldenGate apply user in the target database
   (in that schema)

    Note: This is a global exception table; all errors for all tables/mappings are reported here.
              It records metadata about the operation, including the source ROWID, but does not include
              any of the transactional data in the operation; that is what per-base-table exception tables are for.
              Per-base-table exception tables are exact copies of the base tables, but with no constraints or
              indexes, and they are used to preserve the transactional data in the database instead of leaving it
              in the trail files.   You cannot process data inside the trails, only GoldenGate can. There are
              ways of getting the data out of the trail files (using adapters), but that is not covered here as the
              adapters are powerful/flexible and need a lot of space to explain in detail.

-- Create the Global Exception Table:

CREATE TABLE EXCEPTIONS
(
 excp_date timestamp(6) default systimestamp,
 rep_name varchar2(10),
 table_name varchar2(56),
 errno number,
 errtype varchar2(6),
 optype varchar2(24),
 transind varchar2(12),
 transimgind varchar2(8),
 committimestamp varchar2(26),
 reccsn number,
 recseqno number,
 recrba number,
 rectranspos number,
 reclength number,
 logrba number,
 logposition number,
 grouptype varchar2(12),
 filename varchar2(50),
 fileno number,
 srcrowid varchar2(40),
 srcdbcharset varchar2(40),
 replag number,
 cnt_cdr_conflicts number,
 cnt_cdr_resolutions number,
 cnt_cdr_failed number 
);


2) Change the replicat parameters to force exception processing when operations return an error code during apply, and map those errors to exception processing:

        Note: We can choose which error codes result in exception logging.
                  Also note, we are not logging all operations in the transaction, only the ones that return
                  an error code (notice the one parameter commented out; that changes to transaction-level
                  exception processing).

     -- Set error mode for GoldenGate (changing default of "ABEND")

      REPERROR (DEFAULT, EXCEPTION)
      -- REPERROR (DEFAULT, TRANSEXCEPTION) 
     REPERROR (DEFAULT2, ABEND)
     -- Set error codes that cause exception processing
     REPERROR (-1, EXCEPTION)
     REPERROR (-1403, EXCEPTION)
     REPERROR (-2291, EXCEPTION)

3) Change the mappings to include exception processing (parameters) and map the transaction
    metadata to columns in the exception table:

    -- Wildcard mappings: (uses the MAPEXCEPTION parameter to MAP)

           Note: Notice there is a MAPEXCEPTION clause used, instead of a separate MAP statement.
                     The EXCEPTIONSONLY and INSERTALLRECORDS options are used within it.


MAP XTEST.*, TARGET XTEST.*, MAPEXCEPTION (TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS, COLMAP (
excp_date = @DATENOW(), 
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'), 
table_name = @GETENV ('GGHEADER', 'TABLENAME'), 
errno = @GETENV ('LASTERR', 'DBERRNUM'), 
errtype = @GETENV ('LASTERR', 'ERRTYPE'), 
optype = @GETENV ('LASTERR', 'OPTYPE'), 
transind = @GETENV ('GGHEADER', 'TRANSACTIONINDICATOR'), 
transimgind = @GETENV ('GGHEADER', 'BEFOREAFTERINDICATOR'), 
committimestamp = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'), 
reccsn = @GETENV ('TRANSACTION', 'CSN'), 
recseqno = @GETENV ('RECORD', 'FILESEQNO'), 
recrba = @GETENV ('RECORD', 'FILERBA'), 
rectranspos = @GETENV ('RECORD', 'RSN'), 
reclength = @GETENV ('GGHEADER', 'RECORDLENGTH'), 
logrba = @GETENV ('GGHEADER', 'LOGRBA'), 
logposition = @GETENV ('GGHEADER', 'LOGPOSITION'), 
grouptype = @GETENV ('GGENVIRONMENT', 'GROUPTYPE'), 
filename = @GETENV ('GGFILEHEADER', 'FILENAME'), 
fileno = @GETENV ('GGFILEHEADER', 'FILESEQNO'), 
srcrowid = @GETENV ('RECORD', 'ROWID'), 
srcdbcharset = @GETENV ('GGFILEHEADER', 'DBCHARSET'), 
replag = @GETENV ('LAG', 'SEC'), 
cnt_cdr_conflicts = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_CONFLICTS'), 
cnt_cdr_resolutions = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_SUCCEEDED'), 
cnt_cdr_failed = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_FAILED')
));

   -- Individual table mappings: (uses a separate MAP statement for exception processing)

         Note: Notice there is no MAPEXCEPTION clause; instead there is a separate MAP
                   statement for exception processing, with the EXCEPTIONSONLY and
                   INSERTALLRECORDS parameters.

MAP XTEST.TAB1, TARGET XTEST.TAB1;
MAP XTEST.TAB1, TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS, &
COLMAP (
excp_date = @DATENOW(), 
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'), 
table_name = @GETENV ('GGHEADER', 'TABLENAME'), 
errno = @GETENV ('LASTERR', 'DBERRNUM'), 
errtype = @GETENV ('LASTERR', 'ERRTYPE'), 
optype = @GETENV ('LASTERR', 'OPTYPE'), 
transind = @GETENV ('GGHEADER', 'TRANSACTIONINDICATOR'), 
transimgind = @GETENV ('GGHEADER', 'BEFOREAFTERINDICATOR'), 
committimestamp = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'), 
reccsn = @GETENV ('TRANSACTION', 'CSN'), 
recseqno = @GETENV ('RECORD', 'FILESEQNO'), 
recrba = @GETENV ('RECORD', 'FILERBA'), 
rectranspos = @GETENV ('RECORD', 'RSN'), 
reclength = @GETENV ('GGHEADER', 'RECORDLENGTH'), 
logrba = @GETENV ('GGHEADER', 'LOGRBA'), 
logposition = @GETENV ('GGHEADER', 'LOGPOSITION'), 
grouptype = @GETENV ('GGENVIRONMENT', 'GROUPTYPE'), 
filename = @GETENV ('GGFILEHEADER', 'FILENAME'), 
fileno = @GETENV ('GGFILEHEADER', 'FILESEQNO'), 
srcrowid = @GETENV ('RECORD', 'ROWID'), 
srcdbcharset = @GETENV ('GGFILEHEADER', 'DBCHARSET'), 
replag = @GETENV ('LAG', 'SEC'), 
cnt_cdr_conflicts = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_CONFLICTS'), 
cnt_cdr_resolutions = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_SUCCEEDED'), 
cnt_cdr_failed = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_FAILED')
);


-- Example of wildcard exception mapping with CDR rules:

   When CDR is invoked, the conflict will be resolved (in most cases) and the exception
    logged, so you know where conflicts are occurring and how often.

MAP XTEST.*, TARGET XTEST.*, 
COMPARECOLS (ON UPDATE KEYINCLUDING (OGG_TS), ON DELETE KEYINCLUDING (OGG_TS)),
RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAX (OGG_TS))),
RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAX (OGG_TS))),
RESOLVECONFLICT (UPDATEROWMISSING,(DEFAULT, OVERWRITE)),
RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)),
RESOLVECONFLICT (DELETEROWMISSING,(DEFAULT, DISCARD)),
MAPEXCEPTION (TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS, COLMAP (
excp_date = @DATENOW(), 
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'), 
table_name = @GETENV ('GGHEADER', 'TABLENAME'), 
errno = @GETENV ('LASTERR', 'DBERRNUM'), 
errtype = @GETENV ('LASTERR', 'ERRTYPE'), 
optype = @GETENV ('LASTERR', 'OPTYPE'), 
transind = @GETENV ('GGHEADER', 'TRANSACTIONINDICATOR'), 
transimgind = @GETENV ('GGHEADER', 'BEFOREAFTERINDICATOR'), 
committimestamp = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'), 
reccsn = @GETENV ('TRANSACTION', 'CSN'), 
recseqno = @GETENV ('RECORD', 'FILESEQNO'), 
recrba = @GETENV ('RECORD', 'FILERBA'), 
rectranspos = @GETENV ('RECORD', 'RSN'), 
reclength = @GETENV ('GGHEADER', 'RECORDLENGTH'), 
logrba = @GETENV ('GGHEADER', 'LOGRBA'), 
logposition = @GETENV ('GGHEADER', 'LOGPOSITION'), 
grouptype = @GETENV ('GGENVIRONMENT', 'GROUPTYPE'), 
filename = @GETENV ('GGFILEHEADER', 'FILENAME'), 
fileno = @GETENV ('GGFILEHEADER', 'FILESEQNO'), 
srcrowid = @GETENV ('RECORD', 'ROWID'), 
srcdbcharset = @GETENV ('GGFILEHEADER', 'DBCHARSET'), 
replag = @GETENV ('LAG', 'SEC'), 
cnt_cdr_conflicts = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_CONFLICTS'), 
cnt_cdr_resolutions = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_SUCCEEDED'), 
cnt_cdr_failed = @GETENV ('DELTASTATS', 'TABLE', @GETENV ('GGHEADER', 'TABLENAME'), 'CDR_RESOLUTIONS_FAILED')
));

One thing to note: since these mappings are very long due to the transformations (COLMAP) to the exception table, they are normally put into a macro library that contains macros for both wildcard mappings and individual table mappings; calling the macro keeps the PRM files short, clean, neat and standardized.
Below is an example of calling the macros to do the mappings for a whole schema and for an individual table, followed by a sketch of what one of the macros might look like:

   #replicateschamewithGlobalExcp (XTEST,XTEST);
   #replicatetablewithGlobalExcp (XTEST.TAB1);
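A minimal sketch of what the wildcard macro in such a library might look like (the library file name is up to you; for brevity this version maps only a few of the exception columns shown in the full mapping above):

MACRO #replicateschamewithGlobalExcp
PARAMS (#src_schema, #tgt_schema)
BEGIN
MAP #src_schema.*, TARGET #tgt_schema.*, MAPEXCEPTION (TARGET OGG.EXCEPTIONS, EXCEPTIONSONLY, INSERTALLRECORDS, COLMAP (
excp_date = @DATENOW(),
rep_name = @GETENV ('GGENVIRONMENT', 'GROUPNAME'),
table_name = @GETENV ('GGHEADER', 'TABLENAME'),
errno = @GETENV ('LASTERR', 'DBERRNUM'),
srcrowid = @GETENV ('RECORD', 'ROWID')
));
END;

The macro library is then pulled into the replicat PRM file with the INCLUDE parameter, and each schema or table mapping becomes a one-line macro call like the ones above.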

You can easily create additional macros that do both CDR and exception logging. CDR tries to resolve the conflict, but the exception is still logged:

   #replicateschamewithCDRandGlobalExcp (XTEST,XTEST);
   #replicatetablewithCDRandGlobalExcp (XTEST.TAB1);

When the mappings need to change, or we need several different sets of mappings to switch back and forth between, we can create multiple macros for those mappings and then simply switch the macro name in the PRM file to get the desired functionality. See my blog on macros, variables and include files for simple examples. Macros can also be nested, but in that post I kept them simple to make them easy to understand.

See the following post for how to use macros and variables to streamline and standardize a configuration. They do make the config a bit more complicated, but they also make it much faster to change the configuration globally when needed. Macros are a lifesaver when you have long and/or many mappings; they basically enable reuse, which is always efficient, but again, they make the configuration a bit more complicated. Hopefully, someday, Oracle will expand the macro capabilities with some control looping and evaluation capabilities.

        Include files, variables and macros in GoldenGate

When all this is done, any error -1 or error 1403 (Oracle errors in this case) on any operation during apply will be logged to the exception table on the target database for analysis and review. You can use SQL to analyze it in many different ways, making it easy to see the big picture of errors, error types and the tables involved that need to be repaired (resynced somehow).  Fixing the errors, again, is situational; there are many ways to do it and that is not covered here. I prefer to configure EMS in OGG so I can sync tables inline with a small outage for the apply, without having to create new replicats and later merge them back into one.  More on that in other blog posts.

You can now query the exception table for errors, using SQL to drill down or summarize as needed. This is much easier than parsing through text files and writing scripts to do so; plus, the big advantage is that you know more about each transaction, including the ROWIDs of the source rows involved. The ROWIDs are invaluable and can be used to fix just the affected rows.
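For example, simple aggregate queries against the EXCEPTIONS table created above produce summaries like the two shown below:

-- Errors by replicat, table and error number
SELECT rep_name, table_name, errno, COUNT(*)
FROM   exceptions
GROUP  BY rep_name, table_name, errno
ORDER  BY rep_name, table_name, errno;

-- The same summary broken down by day
SELECT TRUNC(excp_date) AS excp_date, rep_name, table_name, errno, COUNT(*)
FROM   exceptions
GROUP  BY TRUNC(excp_date), rep_name, table_name, errno
ORDER  BY 1, 2, 3, 4;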



REP_NAME  TABLE_NAME                        ERRNO   COUNT(*)
--------- ---------------------------- ---------- ----------
RDEV1001  GGTEST.TAB1                           1         20
RDEV1001  GGTEST.TAB1                        1403      16896
RDEV1001  GGTEST.TAB9                        1403         16
RDEV2001  GGTEST.TAB8                        1403         32
4 rows selected.


  EXCP_DATE REP_NAME  TABLE_NAME                   ERRNO   COUNT(*)
----------- --------- ----------------------- ---------- ----------
11-NOV-2013 RDEV1001  GGTEST.TAB1                      1         16
11-NOV-2013 RDEV1001  GGTEST.TAB1                   1403         16
14-NOV-2013 RDEV1001  GGTEST.TAB1                      1          4
14-NOV-2013 RDEV1001  GGTEST.TAB1                   1403         12
15-NOV-2013 RDEV1001  GGTEST.TAB1                   1403          6
16-NOV-2013 RDEV1001  GGTEST.TAB1                   1403         19
20-NOV-2013 RDEV1001  GGTEST.TAB1                   1403      13200
20-NOV-2013 RDEV2001  GGTEST.TAB8                   1403          2
21-NOV-2013 RDEV1001  GGTEST.TAB1                   1403       3643
21-NOV-2013 RDEV1001  GGTEST.TAB9                   1403         16
21-NOV-2013 RDEV2001  GGTEST.TAB8                   1403         30


11 rows selected.


An additional look at more columns in the exceptions table. Remember, the exception table has about 25 columns; you certainly do not need all of them, however, about 6-8 are very handy, and since this is an exception table, it can be cleaned out when the data is no longer needed.
I take the view that it is better to have more information to troubleshoot with than not enough.
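One way to pull that handful of columns (matching the output below) is:

SELECT TO_CHAR(excp_date, 'YYYY-MM-DD') AS excpdate,
       rep_name, table_name, errno, errtype, optype, srcrowid
FROM   exceptions
ORDER  BY excp_date, rep_name, table_name;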

EXCPDATE   REP_NAME TABLE_NAME              ERRNO ERRTYP OPTYPE       SRCROWID
---------- -------- ---------------------- ------ ------ ------------ ---------------------
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARksAAJ
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARksAAA
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARksAAB
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARkuAAB
2014-11-13 RPXTEST  XTEST.TAB8              1403  DB     DELETE       AAAflqAAEAAARkuAAA
2014-11-13 RPZTEST  ZTEST.TAB8                 1  DB     INSERT       AAAfkRAAEAAARgVAAK
2014-11-13 RPZTEST  ZTEST.TAB8                 1  DB     INSERT       AAAfkRAAEAAARgVAAA
2014-11-13 RPZTEST  ZTEST.TAB8              1403  DB     SQL COMPUPDA AAAfkRAAEAAARgUAAA
2014-11-13 RPZTEST  ZTEST.TAB8              1403  DB     SQL COMPUPDA AAAfkRAAEAAARgUAAB
2015-05-14 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0WAAA
2015-05-14 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0WAAe
2015-05-14 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0WAAi
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAD
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAm
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAP
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAm
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAP
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAD
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAe
2015-05-14 RPTEST   TEST.TAB1               1403  DB     SQL COMPUPDA AAAiJjAAEAAAX0WAAA
2015-05-20 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0TAAF
2015-05-20 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0TAAG
2015-05-20 RPTEST   TEST.TAB1               1403  DB     DELETE       AAAiJjAAEAAAX0TAAN


Below is an example of logging the transactional information to a global exception table (all exceptions are logged to this generic summary table) as well as to a per-base-table exception table. The per-base-table exception table is an exact replica of the base target table, with additional columns added to store the metadata about the transaction along with the transactional data itself. The additional columns are not strictly needed per se, as the metadata is also stored in the global exception table. It is best to include the ROWID (from the source) in both exception tables to help link the rows together, so you can see the metadata and the transactional data even if the metadata is not included in the per-base table.  This is an advanced configuration for environments where the transactional data must be retained at all times, either for auditing and/or for programmatic reprocessing of the apply errors (if you had millions of apply errors, you would not want to fix or summarize them by hand, so this offers an easy way to do the analysis).
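For reference, a per-base-table exception table can be cloned from the base table itself. A minimal sketch, using the TAB1 table from the earlier examples and the *_EX naming convention used in the mapping below (names are illustrative):

-- Same columns as the base table, no rows; CTAS does not copy the primary key or indexes
CREATE TABLE TEST.TAB1_EX AS SELECT * FROM TEST.TAB1 WHERE 1 = 0;

-- Optional metadata columns to tie each row back to the global EXCEPTIONS table
ALTER TABLE TEST.TAB1_EX ADD (
 excp_date timestamp(6) default systimestamp,
 rep_name  varchar2(10),
 errno     number,
 optype    varchar2(24),
 srcrowid  varchar2(40)
);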

In the output below, the first few columns (COL1 through MOD_DATE) are the transactional data from the source row, and the remaining columns are the transaction metadata we added.

COL1 COL2     COL3 CR_DATE   MOD_DATE  EXCP_DATE REP_NAME  TABLE_NAME    ERRNO ERRT OPTYPE         TRANSIND COMMITTIM    RECCSN RECSEQNO RECRBA RECTRANSPOS LOGRBA LOGPOSITION GROUPTYPE  FILENAME       FILENO
----- -------- ---- --------- --------- --------- --------- ------------ ------ ---- -------------- -------- --------- --------- -------- ------ ----------- ------ ----------- ---------- -------------- ------
   15 change3       07-SEP-13 07-SEP-13 07-SEP-13 RPRTEST   RTEST.TAB1     1403 DB   SQL COMPUPDATE END      07-SEP-13   6728670       22   2204                  0      546832 REPLICAT   ./dirdat/z1000 22
   15 change4  C4   07-SEP-13 07-SEP-13 07-SEP-13 RPRTEST   RTEST.TAB1     1403 DB   SQL COMPUPDATE END      07-SEP-13   6729169       22   2542                  0     2554384 REPLICAT   ./dirdat/z1000 22
   13 row13    XX   01-SEP-13 01-SEP-13 07-SEP-13 RPRTEST   RTEST.TAB1     1403 DB   DELETE         WHOLE    07-SEP-13   6730273       24   2491                389     3254288 REPLICAT   ./dirdat/z1000 24
   13 row13    XX   01-SEP-13 01-SEP-13 01-SEP-13 RPRTEST   RTEST.TAB1        1 DB   INSERT         WHOLE    01-SEP-13   6378047        6   2229                365    15399440 REPLICAT   ./dirdat/z1000  6
   14 row14a   C1   01-SEP-13 01-SEP-13 01-SEP-13 RPRTEST   RTEST.TAB1        1 DB   INSERT         WHOLE    01-SEP-13   6378120        6   2426                365    15831348 REPLICAT   ./dirdat/z1000  6
   15 row15b   C1   01-SEP-13 01-SEP-13 01-SEP-13 RPRTEST   RTEST.TAB1        1 DB   INSERT         WHOLE    01-SEP-13   6378215        6   2624                365    15872820 REPLICAT   ./dirdat/z1000  6


6 rows selected.


Below are the sample mappings that log the exception data to a "global" exception table (all processes report errors there) and to a per-base-table exception table. The example uses a procedure call to log to the global table, and the exception mapping itself maps to the per-base table in order to preserve the transactional data. This example uses CDR as well; it has been used in many implementations, both CDR and non-CDR (for non-CDR, just remove the CDR portions). And lastly, I show the full mapping list here; in practice the mappings are done via a macro call with parameters.

Also note, this example comes from a bi-directional configuration between a 12c pluggable database (PDB) and an 11.2.0.4 single instance. The customer was upgrading to 12c in phases on their active-active system, and this setup was used to POC the active-active implementation and all of its features before implementing, which was a huge success and a value-add for the customer, allowing them to go active-active and upgrade along the way:


2016-08-30 07:31:53  INFO    OGG-06506  Wildcard MAP resolved (entry PDB2.TEST.*):

MAP "PDB2"."TEST"."TAB1", TARGET TEST."TAB1",
COMPARECOLS(ON UPDATE KEY, ON DELETE KEY), 

  RESOLVECONFLICT (INSERTROWEXISTS, (DEFAULT, USEMAXEQ (OGGTS))),     
  RESOLVECONFLICT (UPDATEROWEXISTS, (DEFAULT, USEMAXEQ (OGGTS))), 
  RESOLVECONFLICT (UPDATEROWMISSING,(DEFAULT, OVERWRITE)),
  RESOLVECONFLICT (DELETEROWEXISTS, (DEFAULT, OVERWRITE)), 
  RESOLVECONFLICT (DELETEROWMISSING,(DEFAULT, DISCARD)), 
MAPEXCEPTION (TARGET TEST.*_EX, 
SQLEXEC (SPNAME OGG.LOGEXCEPTION, ID @DATE('YYYY-MM-DD HH:MI:SS.FFFFFF','JTS',@GETENV('JULIANTIMESTAMP')), 
PARAMS ( 
rep_name=@GETENV ('GGENVIRONMENT','GROUPNAME'), 
table_name=@GETENV ('GGHEADER','TABLENAME'), errno=@GETENV('LASTERR','DBERRNUM'), 
errtype=@GETENV('LASTERR','ERRTYPE'), 
optype=@GETENV('LASTERR','OPTYPE'), transind=@GETENV('GGHEADER','TRANSACTIONINDICATOR'), transimgind=@GETENV('GGHEADER','BEFOREAFTERINDICATOR'), 
committimestamp= @STREXT(@GETENV ('GGHEADER','COMMITTIMESTAMP'),1,22), reccsn=@GETENV('TRANSACTION','CSN'), 
recseqno=@GETENV('RECORD','FILESEQNO'), 
recrba=@GETENV('RECORD','FILERBA'), 
rectranspos=@GETENV('RECORD','RSN'), reclength=@GETENV('GGHEADER','RECORDLENGTH'), logrba=@GETENV('GGHEADER','LOGRBA'), logposition=@GETENV('GGHEADER','LOGPOSITION'), grouptype=@GETENV('GGENVIRONMENT','GROUPTYPE'), filename=@GETENV('GGFILEHEADER','FILENAME'), fileno=@GETENV('GGFILEHEADER','FILESEQNO'), srcrowid=@GETENV('RECORD','ROWID'), 
srcdbcharset =@GETENV('GGFILEHEADER','DBCHARSET'), 
replag=@GETENV('LAG','SEC'), 
cnt_cdr_conflicts=@GETENV ('DELTASTATS','TABLE',@GETENV ('GGHEADER','TABLENAME'),'CDR_CONFLICTS'), 
cnt_cdr_resolutions=@GETENV ('DELTASTATS','TABLE', @GETENV ('GGHEADER','TABLENAME'),'CDR_RESOLUTIONS_SUCCEEDED'), cnt_cdr_failed=@GETENV ('DELTASTATS','TABLE', @GETENV ('GGHEADER','TABLENAME'),'CDR_RESOLUTIONS_FAILED') ), ALLPARAMS OPTIONAL, ERROR REPORT, DBOP, TRACE ALL), 
EXCEPTIONSONLY, INSERTALLRECORDS, 
COLMAP (USEDEFAULTS, 
excp_date=@DATENOW(), 
rep_name=@GETENV ('GGENVIRONMENT', 'GROUPNAME'), 
table_name=@GETENV ('GGHEADER', 'TABLENAME'), 
errno= @GETENV ('LASTERR', 'DBERRNUM'), 
errtype=@GETENV ('LASTERR', 'ERRTYPE'), 
optype=@GETENV ('LASTERR', 'OPTYPE'), transind=@GETENV('GGHEADER','TRANSACTIONINDICATOR'), 
committimestamp = @GETENV ('GGHEADER','COMMITTIMESTAMP'), 
reccsn=@GETENV ('TRANSACTION','CSN'), 
recseqno=@GETENV ('RECORD','FILESEQNO'), 
recrba=@GETENV ('RECORD','FILERBA'), 
rectranspos=@GETENV ('RECORD','RSN'), 
logrba=@GETENV ('GGHEADER','LOGRBA'), 
logposition=@GETENV ('GGHEADER','LOGPOSITION'), 
grouptype=@GETENV ('GGENVIRONMENT','GROUPTYPE'), 
filename=@GETENV ('GGFILEHEADER','FILENAME'), 
fileno=@GETENV ('GGFILEHEADER', 'FILESEQNO'), 
srcdbcharset=@GETENV ('GGFILEHEADER', 'DBCHARSET') ));

Below is the traced call to the stored procedure, with tracing enabled so you can see the values being passed to the procedure:

SQLEXEC @DATE('YYYY-MM-DD HH:MI:SS.FFFFFF','JTS',@GETENV('JULIANTIMESTAMP')) succeeded:
>            REP_NAME: RPTEST2
>          TABLE_NAME: PDB2.TEST.TAB1
>               ERRNO: 1403
>             ERRTYPE: DB
>              OPTYPE: SQL COMPUPDATE
>            TRANSIND: MIDDLE
>         TRANSIMGIND: AFTER
>     COMMITTIMESTAMP: 2016-08-29 15:35:14.12
>              RECCSN: 1972727
>            RECSEQNO: 28
>              RECRBA: 15529
>         RECTRANSPOS:
>           RECLENGTH:
>              LOGRBA: 48
>         LOGPOSITION: 24156860
>           GROUPTYPE: REPLICAT
>            FILENAME: ./dirdat/r2000000028
>              FILENO: 28
>            SRCROWID: AAAWcxAAKAAAADNAAM
>        SRCDBCHARSET: 0
>              REPLAG: 57399
>   CNT_CDR_CONFLICTS: 0
>   CNT_CDR_RESOLUTIONS: 0
>      CNT_CDR_FAILED: 0
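
The OGG.LOGEXCEPTION procedure itself is not shown in this post; a minimal version of it could simply insert the passed values into the global EXCEPTIONS table. A sketch only (every parameter is defaulted since the SQLEXEC uses ALLPARAMS OPTIONAL; the real procedure may do more):

CREATE OR REPLACE PROCEDURE ogg.logexception (
 rep_name            IN VARCHAR2 DEFAULT NULL,
 table_name          IN VARCHAR2 DEFAULT NULL,
 errno               IN NUMBER   DEFAULT NULL,
 errtype             IN VARCHAR2 DEFAULT NULL,
 optype              IN VARCHAR2 DEFAULT NULL,
 transind            IN VARCHAR2 DEFAULT NULL,
 transimgind         IN VARCHAR2 DEFAULT NULL,
 committimestamp     IN VARCHAR2 DEFAULT NULL,
 reccsn              IN NUMBER   DEFAULT NULL,
 recseqno            IN NUMBER   DEFAULT NULL,
 recrba              IN NUMBER   DEFAULT NULL,
 rectranspos         IN NUMBER   DEFAULT NULL,
 reclength           IN NUMBER   DEFAULT NULL,
 logrba              IN NUMBER   DEFAULT NULL,
 logposition         IN NUMBER   DEFAULT NULL,
 grouptype           IN VARCHAR2 DEFAULT NULL,
 filename            IN VARCHAR2 DEFAULT NULL,
 fileno              IN NUMBER   DEFAULT NULL,
 srcrowid            IN VARCHAR2 DEFAULT NULL,
 srcdbcharset        IN VARCHAR2 DEFAULT NULL,
 replag              IN NUMBER   DEFAULT NULL,
 cnt_cdr_conflicts   IN NUMBER   DEFAULT NULL,
 cnt_cdr_resolutions IN NUMBER   DEFAULT NULL,
 cnt_cdr_failed      IN NUMBER   DEFAULT NULL
) AS
BEGIN
 -- Insert one audit row per exception; no COMMIT here, the DBOP option on the
 -- SQLEXEC makes the insert part of the replicat's own transaction
 INSERT INTO ogg.exceptions (
  excp_date, rep_name, table_name, errno, errtype, optype, transind, transimgind,
  committimestamp, reccsn, recseqno, recrba, rectranspos, reclength, logrba,
  logposition, grouptype, filename, fileno, srcrowid, srcdbcharset, replag,
  cnt_cdr_conflicts, cnt_cdr_resolutions, cnt_cdr_failed)
 VALUES (
  systimestamp, logexception.rep_name, logexception.table_name, logexception.errno,
  logexception.errtype, logexception.optype, logexception.transind, logexception.transimgind,
  logexception.committimestamp, logexception.reccsn, logexception.recseqno, logexception.recrba,
  logexception.rectranspos, logexception.reclength, logexception.logrba, logexception.logposition,
  logexception.grouptype, logexception.filename, logexception.fileno, logexception.srcrowid,
  logexception.srcdbcharset, logexception.replag, logexception.cnt_cdr_conflicts,
  logexception.cnt_cdr_resolutions, logexception.cnt_cdr_failed);
END;
/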

As mentioned, you certainly do not want to type all of this in for each table/schema; therefore, the mappings are implemented via a macro call with parameters to localize the mappings for the schema:

#replicateschemaCDROnKeyandDate (TEST,TEST,TEST.*_EX, OGGTS);


CDR has a few requirements that must be met, and it requires discussion of the business logic that will be used to automatically handle errors, to see whether that logic fits into how the CDR built into OGG works.

In another blog I will show how you can use this metadata about the transactions to produce a transactions table showing all table activity. This is sometimes very helpful when you need to profile an application (see everything about its transactions) or understand how much data may still be in the trail that is potentially problematic and has not yet been processed.


Good luck and wishing you the best in all your replication endeavors!

Gregory Thomas