Brain Teaser

05-16-2018 , 05:17 PM
While I'm waiting for these extracts to get to the point where I can pull some more test data, I thought I'd start a thread and see if anyone has any thoughts on what's going on, or on how to find out.

We're migrating an app from one environment to another as part of a buyout. The first environment is a CentOS box running in an Azure VM, connecting to a SQL Server database. The data extract is written in R.

The second environment is an Ubuntu Docker container running in an Azure VM, connecting to a SQL Server database. The senior guy has assured me the databases were copied correctly, and they do appear to have the same number of rows.

When this extract runs in the original environment it returns about 750k rows; when it runs in the new environment it returns about 1 million rows.

I've found the spot where the filtering results differ between the two boxes and have been comparing one of the data files involved. At the start of the process the file has the same number of lines in both environments, but whenever I compare the two copies (with WinMerge on my Windows machine, diff on the CentOS box, or diff on my MacBook Pro) the tools show differences, yet the lines flagged as different look exactly the same. I output the diff to a file and looked at it with a hex dump, and the lines are the same. I suppose I should find a line marked as different and look at those lines with a hex dump of the original files?
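That kind of "identical-looking lines that diff as different" usually comes down to invisible bytes. A minimal sketch of the hex-dump idea, with made-up file names and a trailing carriage return standing in for whatever the invisible difference actually is:

```shell
# Two copies of the "same" line; the second has a trailing carriage return
printf 'ACME Corp,123 Main St\n'   > old_line.txt
printf 'ACME Corp,123 Main St\r\n' > new_line.txt

# diff flags the line, but printed side by side the text looks identical
diff old_line.txt new_line.txt || true

# od -c shows every byte; the \r at the end of the second file stands out
od -c old_line.txt
od -c new_line.txt
```

`od -c` is used here because it ships with coreutils everywhere; `hexdump -C` or `xxd` would show the same thing if they're installed.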

I thought it might be a difference in line order, so I did all of the above after running sort on the files. Still getting differences. I also get different MD5s across all four files, sorted and unsorted.
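Sorting first is a sound way to take ordering off the table: if the sorted checksums still differ, the content really differs. A toy version of that check (md5sum is the GNU coreutils name; on macOS the command is md5):

```shell
# Same content, different order: checksums differ raw but match after sorting
printf 'b\na\n' > old.txt
printf 'a\nb\n' > new.txt
md5sum old.txt new.txt                 # different sums
sort old.txt > old_sorted.txt
sort new.txt > new_sorted.txt
md5sum old_sorted.txt new_sorted.txt   # identical sums
```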

I'm not sure what kind of character encoding SQL Server could use that would result in different data between copies of the same database. Any thoughts on how to find an actual difference I can show to someone would be appreciated.
05-16-2018 , 08:59 PM
I'm not sure I'm following, but, uh, if you want to see the differences, yeah you're going to have to compare the lines that have differences.

Are you also saying that one of the files has 25% fewer lines AND some of the lines are different? Or are all the differences missing lines? Or are you saying that the files have the same number of lines at the start, with differences on some lines, and later in the process the output from one of them has fewer lines?

If I were a gambling man, I'd say you might have some control sequences in your text that don't 'look' like anything in most text editors but that count as differences. Looking at the differing lines in hex should make it obvious.

Another thing to do is to convert a chunk of each file to hex first and diff the hex output. That should make it a lot easier to see exactly which characters are different, especially if my theory is right.
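Something along these lines, using GNU od's one-byte-per-line output (`-w1` is a GNU extension); the é-vs-e pair below is just a stand-in for whatever byte actually differs in the real files:

```shell
# Two strings that can render alike but differ by one byte sequence
printf 'cafe\n'        > a.txt
printf 'caf\303\251\n' > b.txt    # \303\251 is UTF-8 for 'é'

# One hex byte per output line makes diff pinpoint the exact bytes
od -An -v -tx1 -w1 a.txt > a.hex
od -An -v -tx1 -w1 b.txt > b.hex
diff a.hex b.hex || true
```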
05-16-2018 , 09:22 PM
Yeah, I was a little stream-of-consciousness there. The rows are ~150 to ~250 characters long across the different files I'm looking at, which makes getting them sorted properly a pita, which in turn makes diff with defaults kind of pointless. Using the -C 0 flag to strip out all the context helps, and it looks like there is an unprintable character in one of the DBs.
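For what it's worth, one quick way to hunt down that unprintable character is a byte-range grep in the C locale. A sketch with a made-up sample file, where the \240 plays the part of the stray byte:

```shell
printf 'clean line\nbad\240line\n' > sample.txt

# In the C locale, [^ -~] matches any byte outside printable ASCII,
# so this prints the line numbers of lines with hidden bytes
# (note: tabs get flagged too, so strip those first if your data has them)
LC_ALL=C grep -n '[^ -~]' sample.txt
```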

We lose all the records on a merge between the two files I'm looking into. We may have stripped periods and commas from addresses in one of the tables but not both, though I thought we did that before the tables were copied.
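If that's what happened, normalizing both sides the same way before the merge should bring the records back. A minimal sketch with a made-up address:

```shell
# Same address, but only one copy had its periods stripped
printf '123 Main St.\n' > old_addr.txt
printf '123 Main St\n'  > new_addr.txt

# Strip periods and commas from BOTH sides, then compare
tr -d '.,' < old_addr.txt > old_norm.txt
tr -d '.,' < new_addr.txt > new_norm.txt
cmp old_norm.txt new_norm.txt && echo 'match after normalizing'
```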

Explaining it helps, so thanks for responding.
05-17-2018 , 01:32 PM
Turned out the data in the manufacturer column had been cleaned in one of the tables but not in the new table, nor in 3 of the 4 tables involved. This meant we were effectively skipping six manufacturers' data on the export because of a merge between a fixed table and an unfixed table. After fixing all the tables involved we're now down to under a thousand differing rows out of a million-row extract, and that's good enough for the data we have.
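That failure mode can be sketched with join and made-up manufacturer names, a stray period standing in for the uncleaned data: an inner join silently drops every key that was cleaned on only one side.

```shell
# Key cleaned in one file ("ACME") but not the other ("ACME.");
# both files are already sorted on the join field
printf 'ACME 100\nGLOBEX 200\n'  > fixed.txt
printf 'ACME. 100\nGLOBEX 200\n' > unfixed.txt

# join keeps only keys present in BOTH files: ACME vanishes silently
LC_ALL=C join -j1 fixed.txt unfixed.txt
```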

The lesson is to have a better understanding of what your system is doing before making partial changes to the data.
