User Tools

Site Tools


guides:reference:database:database_sync_repair

NetYCE Database Re-synchronization procedure

Refer to the two NetYCE database servers as PRI and SEC although this assignment has nothing to do with the actual organization and hierarchy. Think of it as the local and remote NetYCE database server respectively.

Only execute this procedure if the db-sync issues could not be resolved using ‘skip errors’ and a database re-merge is required. I.e, the normal restore master archive on both servers is not possible since there is no clear master

Step 1

This procedure will use cli commands to be issued by an application manager familiar with he NetYCE setup and UNIX commands

ALL tasks have to be executed as ‘yce’ user

Do not permit user activity or API activity on any NetYCE system while executing this procedure

Step 2

Set the ignore flag for the skulker and kill it on both servers

$ go stop skulker

Check both servers are running on their preferred database.

$ go jobs
$ hostname -i
$ cat DSN.dat

If ip-addresses does not match, edit DSN.dat and update the address

This step ensures no database failovers will take place during execution

Step 3

Ensure we have sufficient workspace in /var/tmp and /var/opt/mysql The free disk space required is somewhat more than the largest *.MYD file in the databases

Check both servers:

$ df -h /var/opt/mysql
$ df -h /var/tmp

If disk space is low, see what files could be removed from the archives directory

$ go backup
$ ls -l

Or empty debugging logs

$ go clr dev
$ go logs
$ cp /dev/null *debug*

Or excessive database (bin)log files

$ go mysql
$ ls -l

If very large files like ‘mysql-bin.000nnn’ or ‘<hostname>-relay-bin.000nnn’ are found, those will be removed in step 4, other large log files can safely be deleted.

Should the mysql-bin and relay-bin files not prove to be removable, stop the database, delete ALL these files and restart it.

Step 4

On both servers ensure no table errors exist and the table fragmentation is reduced to within 5%:

$ ck_dbsync.pl -x -c
This command MUST be executed locally on both NetYCE database servers since the '-c' command cannot perform the disk space requirement verifications remotely.

Because the '-c' option was added in later revisions of ck_dbsync.pl, use the log_maint.pl script instead if the -c option is not available:

$ log_maint.pl

Check for errors If in doubt, repeat

As with the ck_dbsync '-c' option , this command must be executed locally on both NetYCE database servers

Step 5

Steps 5 through 8 use the 'ck_dbsync.pl' script. The command sequences for these steps are also available in all-in-one options ('-R' and '-Q' options). Familiarise yourself with the 'ck_dbsync.pl' command and its options in the paragraphs at the bottom of this page.

Reset the dbsync for both replication directions:

$ ck_dbsync.pl -x -F
$ ck_dbsync.pl -x -F -r

These commands will first repair the master-master setup between the PRI and SEC databases if it is needed.

Then the master and slave replication backlogs will be reset at both ends. This will also get rid the possibly by now very large binary log files in the var/opt/mysql directory.

Verify no large mysql-bin or relay-bin files remain after this step.

The synchronization may seem to be functional again at this stage, but since data-inconsistencies are not yet removed, the out-of-sync errors will soon reappear.

Step 6a

Steps 6a and 6b perform the actual database table comparison and merging. This could be a very slow process since all column values of all rows in all the tables need to be compared and were needed inserted or updated.

Therefore the merging can be executed in two phases. The 6a) step is faster and focuses on table mismatches where rows are missing. It ensures that the the rows (using the same keys) of a table exist in both databases. It does not verify that all rows have the same content, only the added missing rows will have that guarantee.

The second phase of this step, 6b), is significantly slower and will perform the full table field verification AND insert any missing rows as well. In situations where the time required for a full table verification is not (yet) available, the operator could choose to execute step 6a) and skip 6b). When such time restraints are no factor, 6a) is optional and could be skipped.

Skip to step 6b) if the fast-but-incomplete key-only compare is not desired

The ck_dbsync.pl has many options. Below is fastest method to merge only those tables with obvious row count and size differences. In this case we demonstrate the manual table-by-table approach where the operator selects the tables with an obvious row count difference. This involves using the -t table_name option. Without this option all tables will have their rows checked.

We need to compare each table twice. Once to compare the PRI database with the SEC database, and once to compare the SEC database to the PRI. To accomplish this reversal the -r option is used.

Execute on PRI only (the SEC gives the same results)

$ ck_dbsync.pl -x -s

Write down the table names listing issues in records or size like the Task_log table in the example below:

$ ck_dbsync.pl -x -s
2017-12-19 13:40:13 === Started ===
2017-12-19 13:40:13 Defaulting to -s (summary option)
2017-12-19 13:40:13 Using local PRI and remote SEC setup:
2017-12-19 13:40:13   PRImary: 172.17.10.28
2017-12-19 13:40:13   SECondary: 172.17.10.22
2017-12-19 13:40:13 Connected to databases
2017-12-19 13:40:13 Table: YCE.Action_log                 |   254870 rows|    57.1 Mb|
2017-12-19 13:40:13 Table: YCE.auth_permissions           |      400 rows|     0.0 Mb|
2017-12-19 13:40:13 Table: YCE.City_codes                 |     4038 rows|     0.2 Mb|
2017-12-19 13:40:13 Table: YCE.Client                     |      122 rows|     0.0 Mb|
2017-12-19 13:40:13 Table: YCE.Client_types               |      224 rows|     0.0 Mb|
2017-12-19 13:40:13 Table: YCE.Config_log                 |     5983 rows|    14.4 Mb|
:::
2017-12-19 13:40:13 Table: YCE.Task_log                   |    14533 rows|   297.0 Mb|
2017-12-19 13:40:13   Rows PRI: 14533
2017-12-19 13:40:13   Rows SEC: 37266
:::
2017-12-19 13:40:13 Table: NMS.Service_status             |       13 rows|     0.0 Mb|
2017-12-19 13:40:13 Table: NCCM.Nccm_node_data            |     1178 rows|     3.8 Mb|
2017-12-19 13:40:13 Table: NCCM.Nccm_parameters           |        5 rows|     0.0 Mb|
2017-12-19 13:40:13 Table: NCCM.Nccm_selection            |       96 rows|     0.0 Mb|
2017-12-19 13:40:13 Replication status
2017-12-19 13:40:13   PRI slave IO : No
2017-12-19 13:40:13   PRI slave SQL: No
2017-12-19 13:40:13   SEC slave IO : No  Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position; the first event 'mysql-bin.000001' at 153395, the last event read from 'mysql-bin.000001' at 4, the last byte read from 'mysql-bin.000001' at 4.'
2017-12-19 13:40:13   SEC slave SQL: Yes
2017-12-19 13:40:13 === Completed ===

For each of these tables execute these two commands:

$ ck_dbsync.pl -x -t <table> -k
$ ck_dbsync.pl -x -t <table> -k -r

When done, check the summary again:

$ ck_dbsync.pl -x -s

Alternatively, if not just the row-count must be fixed, but also the presence of each row must be verified, then execute these two commands once. The number of rows may be equal, but both tables may have (an equal number) of missing rows. The commands below will verify all tables have no missing rows.

$ ck_dbsync.pl -x -k
$ ck_dbsync.pl -x -k -r

Step 6b

Once the row count of all tables are equal on the PRI and the SEC, a reasonable level of synchronization is achieved. However there could be many differences in the values of each row. To resolve these conflicts all tables need to be compared row-by-row, field-by-field.

This step (6b) is intended to do just that, but requires some time to execute. On a large database hours could be required. Therefore it is suggested to execute this step only if such time is available.

The level of synchronization achieved so far is likely sufficient to prevent synchronization failures due to primary-key violations, but an equal row-count does not imply that both tables have entries for the same keys. To ensure proper synchronization, this step is required at first opportunity. And if they do, rows using the same key could have fields with different values.

The -T option is used to decide which record is the ‘latest’ should it exist on both servers but with field differences. In those cases the -T uses the record timestamps to determine, otherwise the PRI is assumed to be the valid record.

So in cases where the SEC is expected to have most of the valid records, it advisable to use the command with the reversed roles (’-T -r’) options before the regular (‘-T’) command.

Before initiating, ensure that the databases will not be halted for backup. The default backup times are 23:00 and 23:20 and are initiated by crontab (/opt/yce/bin/dbarchive.pl).

$ ck_dbsync.pl -x -u -T
$ ck_dbsync.pl -x -u -T -r

Whether this step is skipped or executed later, the remaining steps MUST be completed in either case.

Step 7

Reset some important ID's like Job-id and Task-id:

$ ck_dbsync.pl -x -L
$ ck_dbsync.pl -x -L -r

Step 8

Reset the dbsync, again on PRI only:

$ ck_dbsync.pl -x -F
$ ck_dbsync.pl -x -F -r

Step 9

Check the NetYCE front-end Admin - System’ tool also reports a healthy synchronization status on both servers Test that several times because other tables might need merging too, or the initial updates cause a conflict.

In that case try the ‘Skip error’ button (only a few times should suffice). If that is not the case, as shown by the backlog counters in the message, then reset the sync again as in step 8.

By this time it should be safe to allow user and API operations to resume.

Also the skulkers can be restarted on each server:

go clr skulker

ck_dbsync.pl command

The latest revision of ck_dbsync has additional options to simplify, speedup and group actions together. It also has an option to check, defragment and repair corrupt tables.

The usage message:

Compare data between synced YCE databases
usage: /opt/yce/system/ck_dbsync.pl options
  options -x -s
          -x -c
          -x -k [-r]      [-t table_name]
          -x -u [-r] [-T] [-t table_name]
          -x -L [-r]
          -x -F [-r]
          -x -R
          -x -Q
   -x        execute: mandatory option. Just to prevent accicental use
   -s        summary: return row count and size of each table
   -c        check tables: defragment and/or repair. Only on local database
   -k        keys only. Insert missing records from PRI into SEC
   -u        update full. Insert and update all SEC records from PRI
   -T        update using timestamps. If PRI timestamp is more recent than SEC (use with -u)
   -t table  table-name only. Use 'db.table_name' or 'table_name' format
   -r        reverse: switch PRI (default local) and SEC (remote) databases
   -L        lookup: reset Job, Task and other ID's (after finishing table updates!)
   -F        force: reset the master and slave configuration for PRI <- SEC
   -R        repair-all: perform a sequence to full-repair using cycles
             of the various options for both databases (-F, -uT, -L, -F).
   -Q        repair-quick: perform a sequence to key-repair using cycles
             of the various options for both databases (-F, -k, -L, -F).
   -h        help: this help message

   The '-x' option is mandatory

The new options are '-c' for the table check and repair, the '-R' to perform the sequence for a full re-sync, and the '-Q' to perform the sequence for a quick keys-only re-sync.

The output generated while executing is also logged in the file /var/opt/yce/logs/ck_dbsync.log. Each invocation of the command rotates this logfile using extensions '.0' thru '.9'.

Skipping in-sync tables

A significant speedup in the overall process was achieved by determining if tables are already in-sync or not. Tables already in-sync are skipped in the comparison. The sync status is determined by calculating checksums over the data and is thereby not influenced by fragmentation or row-orders.

The output for the summary takes a little longer though. Sample output of the summary (-s) option:

$ ck_dbsync.pl -x -s
2018-01-04 10:53:05 === Started ===
2018-01-04 10:53:05 Using local PRI and remote SEC setup:
2018-01-04 10:53:05   PRImary: 172.17.10.28
2018-01-04 10:53:05   SECondary: 172.17.10.22
2018-01-04 10:53:05 Connected to databases
2018-01-04 10:53:05 Table summary
2018-01-04 10:53:06 Table: YCE.Action_log                 |   251493 rows|    56.5 Mb|    sync
2018-01-04 10:53:06 Table: YCE.auth_permissions           |      400 rows|     0.0 Mb|    sync
2018-01-04 10:53:06 Table: YCE.City_codes                 |     4038 rows|     0.2 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Client                     |      122 rows|     0.0 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Client_types               |      224 rows|     0.0 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Config_log                 |     2580 rows|     9.9 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Domain                     |       32 rows|     0.0 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Graph                      |      455 rows|     0.0 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Images                     |       21 rows|     0.9 Mb|    sync
2018-01-04 10:53:06 Table: YCE.IpBKrouter                 |        0 rows|     0.0 Mb|    sync
2018-01-04 10:53:06 Table: YCE.IpCNrouterIN               |        1 rows|     0.0 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Ipsec_gre                  |     7302 rows|     0.5 Mb|    sync
2018-01-04 10:53:06 Table: YCE.Ipsec_map                  |     1378 rows|     0.1 Mb|    sync
2018-01-04 10:53:06 Table: YCE.IpSupernet                 |     5566 rows|     0.5 Mb|    sync
:::
2018-01-04 10:53:13 Table: YCE.Task_log                   |    33057 rows|   981.3 Mb|out-sync
:::
2018-01-04 10:53:14 Table: NCCM.Nccm_parameters           |        5 rows|     0.0 Mb|    sync
2018-01-04 10:53:14 Table: NCCM.Nccm_selection            |       96 rows|     0.0 Mb|    sync
2018-01-04 10:53:14 Replication status
2018-01-04 10:53:14   PRI slave IO : Yes
2018-01-04 10:53:14   PRI slave SQL: Yes
2018-01-04 10:53:14   SEC slave IO : Yes
2018-01-04 10:53:14   SEC slave SQL: Yes
2018-01-04 10:53:14 === Completed ===

Table check, defragment and repair

The '-c' option to check, defragment and repair tables will not support the reverse (-r) option. Due to the verification of free disk space and the detection of crashed repair attempts, the '-c' wil only work on the local server. In order to check both databases, the ck_dbsync.pl -x -c command needs to be executed on both NetYCE database servers.

The output of the '-c' option. In this case no repairs or defragmentation was required:

$ ck_dbsync.pl -x -c 
2018-01-04 11:25:09 === Started ===
2018-01-04 11:25:09 Using local PRI and remote SEC setup:
2018-01-04 11:25:09   PRImary: 172.17.10.28
2018-01-04 11:25:09   SECondary: 172.17.10.22
2018-01-04 11:25:09 Connected to databases
2018-01-04 11:25:09 LOCAL Database table check (defragment, repair)
2018-01-04 11:25:09 Table check
2018-01-04 11:25:09 Flushing database tables to disk
2018-01-04 11:25:09 Checking fragmentation database 'YCE'...
2018-01-04 11:25:09  Table: YCE.IPv6_subnet
2018-01-04 11:25:09  Table: YCE.IPv6_prefix
2018-01-04 11:25:09  Table: YCE.IPv6_plan_client_type
2018-01-04 11:25:09  Table: YCE.IPv6_map
2018-01-04 11:25:09  Table: YCE.Ip_dhcp
2018-01-04 11:25:09  Table: YCE.Port_scan
2018-01-04 11:25:09  Table: YCE.IPv6_plans
2018-01-04 11:25:09  Table: YCE.IPv6_plan
:::
2018-01-04 11:25:12  Table: NMS.Nmmi_nodes
2018-01-04 11:25:12  Table: NMS.CmdbNodes
2018-01-04 11:25:12  Table: NMS.Dhcp_history
2018-01-04 11:25:12 Checking fragmentation database 'NCCM'...
2018-01-04 11:25:12  Table: NCCM.Nccm_parameters
2018-01-04 11:25:12  Table: NCCM.Nccm_selection
2018-01-04 11:25:12  Table: NCCM.Nccm_node_data
2018-01-04 11:25:12 Checking fragmentation database 'CMDB'...
2018-01-04 11:25:12 Database table checking complete
2018-01-04 11:25:12 === Completed ===

When including the '-r' option, the action is denied:

$ ck_dbsync.pl -x -c -r
2018-01-04 11:26:46 === Started ===
2018-01-04 11:26:46 Using reversed remote PRI and local SEC setup:
2018-01-04 11:26:46   PRImary: 172.17.10.22
2018-01-04 11:26:46   SECondary: 172.17.10.28
2018-01-04 11:26:46 Connected to databases
2018-01-04 11:26:46 LOCAL Database table check (defragment, repair)
2018-01-04 11:26:46 Table check
2018-01-04 11:26:46   Cannot defragment/repair remote tables
2018-01-04 11:26:46 === Completed ===

All-in-one options

The '-R' option combines the majority of steps involved with a full resynchronisation. Using the '-R' option is equivalent to ck_dbsync invocations in the following sequence:

  -x -s
  -x -F
  -x -F -r
  -x -u -T
  -x -u -T -r
  -x -L
  -x -L -r
  -x -F
  -x -F -r
  -x -s
Please note that the '-c' option is NOT included. Since the table repairs can only be executed locally, these steps should be performed BEFORE using the '-R' or '-Q' sequences.

Likewise, the '-Q' option will execute in succession all steps needed to perform key-only comparisons. This is less accurate than the full re-sync, but it is the fastest way to get synchronisation up and running again. The '-R' can then be performed at a later, less busy moment.

Using the '-R' option is equivalent to ck_dbsync invocations in the following sequence:

  -x -s
  -x -F
  -x -F -r
  -x -k
  -x -k -r
  -x -L
  -x -L -r
  -x -F
  -x -F -r
  -x -s

A sample session summary of the '-R' usage:

$ ck_dbsync.pl -x -R
2018-01-04 11:40:21 === Started ===
2018-01-04 11:40:21 Using local PRI and remote SEC setup:
2018-01-04 11:40:21   PRImary: 172.17.10.28
2018-01-04 11:40:21   SECondary: 172.17.10.22
2018-01-04 11:40:21 Connected to databases
2018-01-04 11:40:21 Doing full repair using a sequence of options
2018-01-04 11:40:21 Table status
2018-01-04 11:40:22 Table: YCE.Action_log                 |   251493 rows|    56.5 Mb|    sync
2018-01-04 11:40:22 Table: YCE.auth_permissions           |      400 rows|     0.0 Mb|    sync
2018-01-04 11:40:22 Table: YCE.City_codes                 |     4038 rows|     0.2 Mb|    sync
2018-01-04 11:40:22 Table: YCE.Client                     |      122 rows|     0.0 Mb|    sync
::
2018-01-04 11:40:29 Table: YCE.Task_log                   |    33066 rows|   981.3 Mb|out-sync
::
2018-01-04 11:40:29 Table: NCCM.Nccm_selection            |       96 rows|     0.0 Mb|    sync
2018-01-04 11:40:29 Reset dbsync PRI <- SEC
2018-01-04 11:40:29   PRI stop slave
2018-01-04 11:40:30   SEC reset master
2018-01-04 11:40:32   PRI reset slave
2018-01-04 11:40:33   PRI start slave
2018-01-04 11:40:34 Reset dbsync SEC <- PRI
2018-01-04 11:40:34   PRI stop slave
2018-01-04 11:40:35   SEC reset master
2018-01-04 11:40:36   PRI reset slave
2018-01-04 11:40:37   PRI start slave
2018-01-04 11:40:38 Sync table records PRI -> SEC
2018-01-04 11:40:38 Table: YCE.Action_log                 |   251493 rows|    56.5 Mb|    sync
::
2018-01-04 11:40:45 Table: YCE.Task_log                   |    33066 rows|   981.3 Mb|out-sync
2018-01-04 11:40:45   keys: Task_id
2018-01-04 11:40:45   page: 1 / 34
2018-01-04 11:40:46   page: 2 / 34
2018-01-04 11:40:47   page: 3 / 34
2018-01-04 11:40:49   page: 4 / 34
::
2018-01-04 11:42:06   page: 34 / 34
2018-01-04 11:42:06   Record mismatch: Task_id = [0103_0017]
2018-01-04 11:42:06     Task_type [Request: '1' action: '' ] != [Request: '1' action: 'done' ]
2018-01-04 11:42:06     Task_request [<task response=""> ] != [<task response="completed"> ]
2018-01-04 11:42:06   Skipping update, PRI timestamp is older than SEC
2018-01-04 11:42:06 Table: YCE.Template                   |     1134 rows|     0.1 Mb|    sync
::
2018-01-04 11:42:07 Table: NCCM.Nccm_selection            |       96 rows|     0.0 Mb|    sync
2018-01-04 11:42:07 Sync table records SEC -> PRI
2018-01-04 11:42:08 Table: YCE.Action_log                 |   251493 rows|    56.5 Mb|    sync
::
2018-01-04 11:42:14 Table: YCE.Task_log                   |    33066 rows|   981.3 Mb|out-sync
2018-01-04 11:42:14   keys: Task_id
2018-01-04 11:42:14   page: 1 / 34
2018-01-04 11:42:16   page: 2 / 34
2018-01-04 11:42:17   page: 3 / 34
2018-01-04 11:42:18   page: 4 / 34
::
2018-01-04 11:43:31   page: 34 / 34
2018-01-04 11:43:31   Record mismatch: Task_id = [0103_0017]
2018-01-04 11:43:31     Task_type [Request: '1' action: 'done' ] != [Request: '1' action: '' ]
2018-01-04 11:43:31     Task_request [<task response="completed"> ] != [<task response=""> ]
2018-01-04 11:43:31     Updating record in SEC
::
2018-01-04 11:43:32 Table: NCCM.Nccm_selection            |       96 rows|     0.0 Mb|    sync
2018-01-04 11:43:32 Reset PRI Job-, Task- and other ID's
2018-01-04 11:43:32   Task ID: 40
2018-01-04 11:43:32   Job ID: 1
2018-01-04 11:43:32   Service key: 28893
2018-01-04 11:43:32   Topo ID: 34251
2018-01-04 11:43:32   Crypt update: mirage
2018-01-04 11:43:32   Action_log pos: mirage, yceseven
2018-01-04 11:43:32   Config_log pos: mirage, yceseven
2018-01-04 11:43:32   Task_log pos: mirage, yceseven
2018-01-04 11:43:32 Reset SEC Job-, Task- and other ID's
2018-01-04 11:43:32   Task ID: 40
2018-01-04 11:43:32   Job ID: 1
2018-01-04 11:43:32   Service key: 28893
2018-01-04 11:43:32   Topo ID: 34251
2018-01-04 11:43:32   Crypt update: mirage
2018-01-04 11:43:32   Action_log pos: mirage, yceseven
2018-01-04 11:43:32   Config_log pos: mirage, yceseven
2018-01-04 11:43:32   Task_log pos: mirage, yceseven
2018-01-04 11:43:32 Reset dbsync PRI <- SEC
2018-01-04 11:43:32   PRI stop slave
2018-01-04 11:43:33   SEC reset master
2018-01-04 11:43:34   PRI reset slave
2018-01-04 11:43:35   PRI start slave
2018-01-04 11:43:36 Reset dbsync SEC <- PRI
2018-01-04 11:43:36   PRI stop slave
2018-01-04 11:43:37   SEC reset master
2018-01-04 11:43:38   PRI reset slave
2018-01-04 11:43:39   PRI start slave
2018-01-04 11:43:40 Table status
2018-01-04 11:43:41 Table: YCE.Action_log                 |   251493 rows|    56.5 Mb|    sync
::
2018-01-04 11:43:47 Table: YCE.Task_log                   |    33066 rows|   981.3 Mb|    sync
::
2018-01-04 11:43:48 Table: NCCM.Nccm_selection            |       96 rows|     0.0 Mb|    sync
2018-01-04 11:43:48 Sync status
2018-01-04 11:43:48 Replication status
2018-01-04 11:43:48   PRI slave IO : Yes
2018-01-04 11:43:48   PRI slave SQL: Yes
2018-01-04 11:43:48   SEC slave IO : Yes
2018-01-04 11:43:48   SEC slave SQL: Yes
2018-01-04 11:43:48 === Completed ===
LDAP: couldn't connect to LDAP server
guides/reference/database/database_sync_repair.txt · Last modified: 2020/01/22 07:14 by bdorlandt