Pantek Library
Hosting Provided By
CybrHost
High Speed Hosting

RE: Verifing Data Integrity

From: Rick James <rjames(at)yahoo-inc.com>
Date: Tue Jul 17 2007 - 11:20:33 EDT


We are mixing several goals...

  • good enuf hash function
  • fast enuf (hence discussion of indexes and LIMIT)

If you could extract
  SELECT MD5(concat(all the fields)) FROM foo ORDER BY 1; would be a nearly perfect hash. It would require a table scan (indexes useless). Then take the output, store it in a file, and take the MD5 of the file. This gives you a single, fairly short string that is quite adequate for comparing between master and slave.

Anything less than all columns, or less than a table scan (or index scan) may miss diffs.

The ORDER BY I mention is in case the data is not in the identical order in MyISAM on slave and master. SQL can't see the diff unless you do a table scan, at which point the semantics say that you can't trust the order. ORDER BY should be unnecessary for InnoDB, because the data is in PK order. (The ORDER BY would unnecessarily slow down the hash process.)

> -----Original Message-----
> From: Bob Bankay X-AST: 7731^29u18e3 [mailto:bobb@ssl.berkeley.edu]
> Sent: Monday, July 16, 2007 1:46 PM
> To: Rick James
> Cc: 'Darpan Dinker'; 'R Bankay'; 'Andrew Brosnan'; 'Paul
> Sindelar'; replication@lists.mysql.com
> Subject: Re: Verifing Data Integrity
>
> I did not mean just to limit by a single column.
>
> the following would verify a hash of several(all) numeric columns:
>
> select sum(col1 + col2 + col3 + col4.........) as hash order by col1;
> [limit 10000]
>
> On my server the above sql runs in much less than 10 sec for 10000
> rows. Obviously it needs to access the table rows.
>
> Cheers
>
> Bob
> PS. Perhaps there is another function for alpha data but we
> find that
> hashing the numeric data gives us the most critical
> validation that we need for replica accuracy.
>
>
>
> Rick James wrote:
> > It will use the index if all the SUMed values are in ONE
> index. Indexes are
> > usually smaller than table. EXPLAIN will say "Using index"
> to indicate
> > such.
> >
> > (Caveat: This does not apply to the PRIMARY KEY on InnoDB, which is
> > effectively the data, itself.)
> >
> >
> >> -----Original Message-----
> >> From: bob b [mailto:robertcb-1@comcast.net]
> >> Sent: Sunday, July 15, 2007 2:35 AM
> >> To: Darpan Dinker
> >> Cc: R Bankay; Andrew Brosnan; Paul Sindelar;
> >> replication@lists.mysql.com
> >> Subject: Re: Verifing Data Integrity
> >>
> >> What I meant was If you have an index ind1 that has a
> numeric key
> >> say key1 then
> >>
> >> select sum(key1) from table1 order by key1; (this is a
> special case
> >> and not general )
> >>
> >> This will likely not access the table and report the sum of key1
> >> using only the idex ind1. It is probably not a good idea to sum
> >> (primary key) but something else like transaction # which
> might be
> >> indexed.
> >>
> >> But any numeric item can be summed whether it is indexed
> or not and
> >> will normally scan the entire table.
> >>
> >>
> >>
> >>
> >>
> >> On Jul 15, 2007, at 12:52 AM, Darpan Dinker wrote:
> >>
> >>
> >>> R Bankay wrote:
> >>>
> >>>> To thoroughly verify data in a table slave/master, first
> stop the
> >>>> db applications.
> >>>>
> >>>> Then sum any numeric columns (I prefer a floating columns.) e.g.,
> >>>>
> >>>> select sum(columnfp1), sum(colmnfp2) from table1;
> >>>>
> >>>> Do this on both master and slave. Should point out that if you
> >>>> choose an index with numeric value it will be much faster.
> >>>>
> >>>>
> >>> When you have to scan the whole table, why will using the index
> >>> will be faster?
> >>>
> >>>> Bob B
> >>>>
> >>>> ----- Original Message ----
> >>>> From: Andrew Brosnan <andrew@broscom.com>
> >>>> To: Paul Sindelar <paul.lists@mnwebhost.net>
> >>>> Cc: replication@lists.mysql.com
> >>>> Sent: Saturday, July 14, 2007 6:14:55 PM
> >>>> Subject: Re: Verifing Data Integrity
> >>>>
> >>>> On 7/14/07 at 12:18 PM, paul.lists@mnwebhost.net (Paul
> Sindelar)
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> Does anyone know a reliable/efficient way of thoroughly
> >>>>>
> >> verifying
> >>
> >>>>> the integrity of a slave?
> >>>>>
> >>>>>
> >>>> Would this be done on a routine basis? I think it's more
> >>>>
> >> realistic
> >>
> >>>> to do
> >>>> routine sanity checks - perhaps you already do. What I do
> >>>>
> >> is check
> >>
> >>>> for
> >>>> same number of records on master and slave, thoroughly verify
> >>>> *some* of
> >>>> the data by checking random rows, compare file sizes,
> etc. I can't
> >>>> imagine going through bit by bit without suspecting a
> problem. I
> >>>> think
> >>>> I'd be more likely to reset the slave with a new
> snapshot of the
> >>>> masters
> >>>> data. :-)
> >>>>
> >>>> Regards,
> >>>> Andrew
> >>>>
> >>>>
> >>>> --
> >>>> MySQL Replication Mailing List
> >>>> For list archives: http://lists.mysql.com/replication
> >>>> To unsubscribe: http://lists.mysql.com/replication?
> >>>> unsub=robertcb@yahoo.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> _____________________________________________________________________
> >>
> >>>> _______________
> >>>> Park yourself in front of a world of choices in alternative
> >>>> vehicles. Visit the Yahoo! Auto Green Center.
> >>>> http://autos.yahoo.com/green_center/
> >>>>
> >>>>
> >> Robert Bankay -AST:7731^29u18e3
> >> robertcb-1@comcast.net
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
>
>
> --
> MySQL Replication Mailing List
> For list archives: http://lists.mysql.com/replication
> To unsubscribe:
> http://lists.mysql.com/replication?unsub=rjames@yahoo-inc.com
>
>

-- 
MySQL Replication Mailing List
For list archives: 
http://lists.mysql.com/replication
To unsubscribe:    
http://lists.mysql.com/replication?unsub=lists@pantek.com
Received on Tue Jul 17 11:21:10 2007

This archive was generated by hypermail 2.1.8 : Thu Aug 09 2007 - 19:30:25 EDT


Contact Us  Legal Notices  Order Services Online 
Pantek Home  Privacy Policy  IT news  Site Map  Pantek Library