Two Plus Two Publishing LLC Two Plus Two Publishing LLC
 

Go Back   Two Plus Two Poker Forums > >

Notices

Programming Discussions about computer programming

Reply
 
Thread Tools Display Modes
Old 05-16-2018, 05:17 PM   #1
kerowo
lolcat
 
kerowo's Avatar
 
Join Date: Nov 2005
Posts: 34,476
Brain Teaser

While I'm waiting for these extracts to get to where I can get some more test data I thought I'd start a thread and see if anyone has any thoughts on what's going on or how to find out what's going on.

We're migrating an app from one environment to another as part of a buyout. The first environment is a Centos box running in an Azure VM and is connecting to a SQL Server database. The data extract is written in R.

The second environment is an Ubuntu docker container running in an Azure VM connecting to a SQL Server database. The Senior guy has assured me the databases were copied correctly and they do appear to have the same number of rows.

When this extract runs on the original environment it returns about 750k rows, when it runs on the new environment it returns about 1million rows.

I've found the spot where the filtering results are different on the two boxes and the have been comparing one of the data files involved. At the start of the process it has the same number of lines in both environments but whenever I compare it; with winMerge on my windows machine or diff on the centos box or diff on my MacbookPro it shows differences in the files but the lines shown as having differences are exactly the same. I output the diff to a file and looked at it with hex dump and the lines are the same. I suppose I should find a line marked as different and look at those lines with hex dump from the original file?

I thought it might be differences in the order of the file so did all of the above after running sort on the files. Still getting differences. I also get different MD5s between all 4 sorted and unsorted input files.

Not sure what kind of character encodings SQL server could use that would result in different data between copies of the db. Any thoughts on how to find an actual difference I can show to someone would be appreciated.
kerowo is offline   Reply With Quote
Old 05-16-2018, 08:59 PM   #2
RustyBrooks
Carpal \'Tunnel
 
RustyBrooks's Avatar
 
Join Date: Feb 2006
Location: Austin, TX
Posts: 23,534
Re: Brain Teaser

I'm not sure I'm following, but, uh, if you want to see the differences, yeah you're going to have to compare the lines that have differences.

Are you also saying that one of the files has 25% fewer lines AND some of the lines are different? Or are all the differences missing lines? Or are you saying that the files have the same # of lines at the start, with differences on some lines and later in the process, the output from one of them has less lines?

If I was a gambling man I'd say you might have some control sequences in your text, that don't 'look' like anything in most text editors but that count as differences. Looking at the differing lines in hex should make it obvious.

Another thing to do is to convert a chunk of each file to hex first, and diff the hex output. That should make it a lot easier to see exactly which characters are different, esp if my theory is right.
RustyBrooks is offline   Reply With Quote
Old 05-16-2018, 09:22 PM   #3
kerowo
lolcat
 
kerowo's Avatar
 
Join Date: Nov 2005
Posts: 34,476
Re: Brain Teaser

Yeah, I was stream of conscious there a little. The rows are ~150 to ~250 characters long between the different files I'm looking at which makes getting them sorted properly a pita which makes diff with defaults kind of pointless. Using the -C 0 flag to strip out all the context helps and it looks like there is an unprintable character in one of the DBs.

We lose all the records on a merge between the two files that I'm looking into, we may have stripped periods and commas from addresses in one of the tables but not both, but I thought we did that before the tables were copied.

Explaining it helps, so thanks for responding.
kerowo is offline   Reply With Quote
Old 05-17-2018, 01:32 PM   #4
kerowo
lolcat
 
kerowo's Avatar
 
Join Date: Nov 2005
Posts: 34,476
Re: Brain Teaser

Turned out the data in the manufacturer column was cleaned in one of the tables and not in the new table or any of the 3 of the 4 tables involved. This meant we were effectively skipping 6 manufacturers data on the export because of a merge between a fixed table an an unfixed table. After fixing all the tables involved we're now down to under a thousand rows different out of a million row extract and that's good enough for the data we have.

The lesson is to have a better understanding of what your system is doing before making partial changes to the data.
kerowo is offline   Reply With Quote

Reply
      

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off


Forum Jump


All times are GMT -4. The time now is 07:34 PM.


Powered by vBulletin®
Copyright ©2000 - 2018, Jelsoft Enterprises Ltd.
Copyright © 2008-2017, Two Plus Two Interactive
 
 
Poker Players - Streaming Live Online