Background
Recently, I encountered a problem where a user wanted to synchronize data from PostgreSQL to TiDB (which uses the same protocol as MySQL) and wanted to know whether the data after synchronization is consistent. I hadn’t dealt with this kind of issue before, so I did a bit of research.
Typically, to verify data consistency, you compute a checksum on both sides and compare them.
TiDB (MySQL) Side
For the verification of a specific table, the following SQL is used:
SELECT bit_xor(
CAST(crc32(
concat_ws(',',
col1, col2, col3, …, colN,
concat(isnull(col1), isnull(col2), …, isnull(colN))
)
) AS UNSIGNED)
)
FROM t;
Let’s look at a specific example:
DROP TABLE IF EXISTS t;
CREATE TABLE t (i INT, j INT);
INSERT INTO t VALUES (2, 3), (NULL, NULL);
SELECT bit_xor(
CAST(crc32(
concat_ws(',',
i, j,
concat(isnull(i), isnull(j))
)
) AS UNSIGNED)
)
FROM t;
The result is:
+-------------------------------------------------------------------------------------------------------------------------------------------+
| bit_xor(
CAST(crc32(
concat_ws(',',
i, j,
concat(isnull(i), isnull(j))
)
) AS UNSIGNED)
) |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| 5062371 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
PostgreSQL Side
The goal is simply to write the same SQL as above, but PostgreSQL does not support bit_xor
, crc32
, isnull
, nor does it have unsigned types. Therefore, the solution is relatively straightforward—relying on UDFs (User-Defined Functions).
After some research, the main missing functions can be addressed with a few custom implementations.
bit_xor
:
CREATE OR REPLACE AGGREGATE bit_xor(IN v bigint) (SFUNC = int8xor, STYPE = bigint);
crc32
:
CREATE OR REPLACE FUNCTION crc32(text_string text) RETURNS bigint AS $$
DECLARE
tmp bigint;
i int;
j int;
byte_length int;
binary_string bytea;
BEGIN
IF text_string = '' THEN
RETURN 0;
END IF;
i = 0;
tmp = 4294967295;
byte_length = bit_length(text_string) / 8;
binary_string = decode(replace(text_string, E'\\', E'\\\\'), 'escape');
LOOP
tmp = (tmp # get_byte(binary_string, i))::bigint;
i = i + 1;
j = 0;
LOOP
tmp = ((tmp >> 1) # (3988292384 * (tmp & 1)))::bigint;
j = j + 1;
IF j >= 8 THEN
EXIT;
END IF;
END LOOP;
IF i >= byte_length THEN
EXIT;
END IF;
END LOOP;
RETURN (tmp # 4294967295);
END
$$ IMMUTABLE LANGUAGE plpgsql;
isnull
:
CREATE OR REPLACE FUNCTION isnull(anyelement) RETURNS int AS $$
BEGIN
RETURN CAST(($1 IS NULL) AS INT);
END
$$ LANGUAGE plpgsql;
After creating the three UDFs above, let’s test the previous example. Note that UNSIGNED
should be changed to BIGINT
.
DROP TABLE IF EXISTS t;
CREATE TABLE t (i INT, j INT);
INSERT INTO t VALUES (2, 3), (NULL, NULL);
SELECT bit_xor(
CAST(crc32(
concat_ws(',',
i, j,
concat(isnull(i), isnull(j))
)
) AS BIGINT)
)
FROM t;
The result:
bit_xor
---------
5062371
(1 row)
It’s exactly the same as on the TiDB (MySQL) side.
Postscript
- I haven’t tested more extensively; this is just a simple test.
- UDFs are indeed a great feature that greatly enhance flexibility.