snapshot replication - corrupted BCP files - Andy M Mallon

Question

Each night my SQLServer replication failed on a big table with the following error. It is failing on a different table each night but always on a relatively big table ( > 200,000 rows)

The process could not bulk copy into table ‘"dbo"."Table01"’. (Source: MSSQL_REPL, Error number: MSSQL_REPL20037)
Get help: http://help/MSSQL_REPL20037
Field size too large
Batch send failed
Violation of PRIMARY KEY constraint ‘PK_Table01’. Cannot insert duplicate key in object ‘dbo.Table01’. The duplicate key value is (0). (Source: MSSQLServer, Error number: 2627)
Get help: http://help/2627
Failed to send batch after max errors (Source: MSSQLServer, Error number: 3621)
Get help: http://help/3621

Publication is configured on a SQL Server 2017 CU29 instance on a FCI Cluster which is hosted on one node (the errors were also occurring with CU27)
"Push" Subscription is configured on a SQL Server 2017 CU29 instance on the same node of the cluster
Distributor is configured on the same instance of the subscription

Each morning I trigger the snapshot and all get fixed automatically.
I made the following tests:

move the schedule of my snapshot from 0:30AM to 2:25AM -> same problem
disable antivirus scan -> same problem
disable filesystem backup on the drive used by the datafiles -> same problem
one night the replication failed for one database which had 2 subscribers. the 2 subscribers were in failure using the same snapshot -> problem is not on the subscriber side
insert the BCP files manually into the destination/subscriber database -> did not work

sometimes the error message varies :

The process could not bulk copy into table ‘"dbo"."Table02"’. (Source: MSSQL_REPL, Error number: MSSQL_REPL20037)
Get help: http://help/MSSQL_REPL20037
Batch send failed
Violation of PRIMARY KEY constraint ‘PK_Table02’. Cannot insert duplicate key in object ‘dbo.Table02’. The duplicate key value is (0). (Source: MSSQLServer, Error number: 2627)
Get help: http://help/2627
To obtain an error file with details on the errors encountered when initializing the subscribing table, execute the bcp command that appears below. Consult the BOL for more information on the bcp utility and its supported options. (Source: MSSQLServer, Error number: 20253)
Get help: http://help/20253
bcp "DB02"."dbo"."Table02" in "\SERV02\SQL_Repldata\unc\SQL02_DB02_PUB_DB02_SN\20220727003003\Table0227f2adc_608#36.bcp" -e "errorfile" -t"\n<x$3>\n" -r"\n<,@g>\n" -m10000 -SLIRIO\ISTORE -T -w (Source: MSSQLServer, Error number: 20253)
Get help: http://help/20253

I am pretty convinced that something occurs during the creation of the BCP files.
Can you help me?

Do you know a way to get more information during the creation of the BCP files or monitor the snapshot folder?

Thank you,
Vince

asked 2022-08-18 by Vince

Answer

I think it’s your isolation level

Since you are using the default READ_COMMITTED isolation level, and not READ_COMMITTED_SNAPSHOT isolation (RCSI) level, the snapshot process is likely taking an inconsistent snapshot as it reads data out to create the file. The snapshot isn’t corrupt–SQL Server is just beholden to the rules of the isolation level it is running under as it pulls the data. You can read more about Read Committed and RSCI, but pay attention to the "Locking Read Committed Behaviours" section:

The short-term shared locks used by the SQL Server locking read committed implementation provide very few of the guarantees commonly expected of a database transaction by T-SQL programmers. In particular, a statement running under locking read committed isolation:

Can encounter the same row multiple times;

Can miss some rows completely; and

Does not provide a point-in-time view of the data

That list might seem more like a description of the weird behaviours you might associate more with the use of NOLOCK hints, but all these things really can, and do happen when using locking read committed isolation.

What is likely happening is that as your replication snapshot runs, it is "encountering the same row multiple times", because it "does not provide a point-in-time view of the data."

During the time that you are generating the replication snapshot, if there are processes that modify data (including index maintenance), you will be more likely to see these odd behaviors than generating the replication snapshot during a time of no/low data modification.

Confirming the problem

There are two ways that come to mind as easy ways to confirm that this is the problem.

If you drop all constraints (including the PK) from Table02 and manually load the snapshot files using bcp.exe from the command line, you should be able to confirm if there are duplicates in the file. \
You can pre-create the table on the subscriber (without the PK or constraints), and set the article in your publication to truncate–rather than drop/recreate–when loading the data on the subscriber. You can do this with sp_changearticle to change this, or you can change it in SSMS on the article properties. After changing, you will need to generate a new snapshot.

EXEC sp_changearticle 
    @publication            = 'MyPublicationName',
    @article                = 'Table02',
    @property               = 'pre_creation_cmd',
    @value                  = 'truncate'; -- or 'drop' or 'delete' or 'none'

Possible fixes

The most reliable solution is to switch your publisher to use Read Committed Snapshot. This should ensure that the snapshot agent will properly bcp out a consistent snapshot that does not show the same row multiple times. Making this change can change application behavior subtly, so be sure you research & understand the change and impact before proceeding.

A second option (as you’ve found) is that you can just generate the snapshot at a different time when there is less data movement, so that you are less likely to run into this behavior where a row is read multiple times during replication snapshot creation.

answered 2022-08-25 by Andy Mallon