** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD **

12-10-2018 , 01:06 AM
I don't quite understand what you mean by them seeing 2x everything.

Source 1
record A, B, C

Source 2
record A2, B2, D2

Your app should show?
merged(A,A2), merged(B, B2), C, D2

Can you do some more fuzzy matching on another field? Or randomly pick one of the two sources instead of merging them? Doing anything non-trivial at this stage, like machine learning, carries huge risk. Would need more detail here.
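
A minimal sketch of the merge laid out above, assuming both sources share a reliable key. The `key` field and the sample values are made up for illustration and are not from the actual project. Records found in both sources are merged; records found in only one pass through unchanged.

```python
def merge_sources(source1, source2, key="key"):
    """Merge two lists of record dicts on a shared key.

    Records with the same key are combined (source1 values win on conflict);
    records that appear in only one source are kept as they are.
    """
    by_key = {rec[key]: dict(rec) for rec in source2}
    merged = []
    for rec in source1:
        other = by_key.pop(rec[key], None)
        if other is None:
            merged.append(dict(rec))          # only in source 1
        else:
            merged.append({**other, **rec})   # in both: merge the fields
    merged.extend(by_key.values())            # whatever is left is only in source 2
    return merged


# Hypothetical sample data mirroring A, B, C and A2, B2, D2.
source1 = [{"key": "A", "title": "a"}, {"key": "B", "title": "b"}, {"key": "C", "title": "c"}]
source2 = [{"key": "A", "price": 1}, {"key": "B", "price": 2}, {"key": "D", "price": 4}]
result = merge_sources(source1, source2)
```

The whole problem in this thread is what to do when no such trustworthy key exists, so treat this as the easy case only.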
12-10-2018 , 01:09 AM
My approach would be to look for the next best uniquely identifiable field (i.e. which fields in your data have the largest set of possible values?). In the book case it might be author, or author + date. Then match on that, and limit how many items you show of a given category.
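
One rough way to find that field is to rank fields by how many distinct values they hold relative to the number of records. This is only a sketch: the field names and the book data are invented, and a high ratio says nothing about whether the values are reported correctly.

```python
def rank_fields_by_uniqueness(records):
    """Return (field, distinct_ratio) pairs, most distinguishing field first."""
    fields = sorted({field for rec in records for field in rec})
    ranked = []
    for field in fields:
        values = [rec[field] for rec in records if rec.get(field) is not None]
        # Ratio of distinct values to populated values; 1.0 means every value is unique.
        ratio = len(set(values)) / len(values) if values else 0.0
        ranked.append((field, ratio))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical sample records.
books = [
    {"author": "Le Guin", "year": 1969, "format": "paperback"},
    {"author": "Herbert", "year": 1965, "format": "paperback"},
    {"author": "Gibson", "year": 1984, "format": "paperback"},
    {"author": "Le Guin", "year": 1974, "format": "hardcover"},
]
ranking = rank_fields_by_uniqueness(books)
```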
12-10-2018 , 01:33 AM
How did nobody ever test on real data until 6+ months into the project?
12-10-2018 , 01:50 AM
Quote:
Originally Posted by ChrisV
How did nobody ever test on real data until 6+ months into the project?
That's not what management wants to hear at this point 😁 You need to provide solutions.
12-10-2018 , 01:57 AM
LOL
12-10-2018 , 02:18 AM
Quote:
Originally Posted by suzzer99
Is 3rd party not involved with the project, and you're just consuming some public API?
Both of them collect this data but the data sometimes isn't reported correctly.


Quote:
Originally Posted by muttiah
That's not what management wants to hear at this point 😁 You need to provide solutions.
LOL yea

Quote:
Originally Posted by muttiah
I don't quite understand what you mean by them seeing 2x everything.

Source 1
record A, B, C

Source 2
record A2, B2, D2

Your app should show?
merged(A,A2), merged(B, B2), C, D2
Yep that is correct

Quote:
Originally Posted by muttiah
Can you do some more fuzzy matching on another field? Or randomly pick one of the two sources instead of merging them? Doing anything non-trivial at this stage, like machine learning, carries huge risk. Would need more detail here.
Yea, we are looking into whether we can do fuzzy matching on a different field. Problem is that even something like date ordered might be incorrectly reported. One person might say it was ordered 12/1/2018, another might say 12/2/2018 because of reporting standards, or someone entered the wrong date due to time zone, or someone forgot the date it was ordered, etc. The thing is that we can't really trust the data :\
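
For the off-by-a-day case specifically, a date can still narrow candidates if it is compared with a tolerance instead of exactly. A small sketch, where the `title` and `ordered` fields and the one-day tolerance are assumptions and not the real schema:

```python
from datetime import date


def dates_compatible(d1, d2, tolerance_days=1):
    """True when two reported dates are within the allowed number of days."""
    return abs((d1 - d2).days) <= tolerance_days


def candidate_matches(record, others, tolerance_days=1):
    """Records from the other source with the same title and a nearby order date."""
    return [
        other for other in others
        if other["title"].casefold() == record["title"].casefold()
        and dates_compatible(other["ordered"], record["ordered"], tolerance_days)
    ]


# Hypothetical sample data.
ours = {"title": "Dune", "ordered": date(2018, 12, 1)}
theirs = [
    {"title": "dune", "ordered": date(2018, 12, 2)},  # one day off: still a candidate
    {"title": "Dune", "ordered": date(2018, 12, 9)},  # too far apart
    {"title": "Emma", "ordered": date(2018, 12, 1)},  # different title
]
candidates = candidate_matches(ours, theirs)
```

A wider tolerance catches more sloppy dates but also lets in more false positives, so it only helps alongside another field.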

Quote:
Originally Posted by ChrisV
How did nobody ever test on real data until 6+ months into the project?
Hah it was a goof up by all of us. Funny thing is that we did an ETL of this data 6 months back but no one bothered to look at the results. We all just assumed it was good. My account in particular didn't match with anything.
12-10-2018 , 02:33 AM
This all sounds a lot like this: (worth watching)

12-10-2018 , 02:44 AM
Quote:
Originally Posted by Barrin6
Hah it was a goof up by all of us. Funny thing is that we did an ETL of this data 6 months back but no one bothered to look at the results. We all just assumed it was good. My account in particular didn't match with anything.
You did an ETL of this 3rd party data, got no matches by this unique user ID you were expecting, and no one noticed? Or was the match never attempted?
12-10-2018 , 02:47 AM
FWIW our version of this was getting prototypes from designers with names like John Smith and street addresses like 123 Elm St., developing the app with similar dummy data, then realizing our fields and columns were way too narrow when actually hooking up to real users' names and addresses. Same thing happened with real movie names and actor names.
12-10-2018 , 02:58 AM
Quote:
Originally Posted by suzzer99
You did an ETL of this 3rd party data, got no matches by this unique user ID you were expecting, and no one noticed? Or was the match never attempted?
Match was done by the service for almost a year now I think. We ETLed the data but no one looked at the dataset to see how the matching was going.

We had stats around match numbers. But honestly, it has been 6+ months and none of us knows how we missed this.
12-10-2018 , 03:10 AM
Maybe you can find out the obfuscation method then apply it to your good column and match on the result?
12-10-2018 , 10:40 AM
Yeah you're boned.
12-10-2018 , 12:25 PM
Can you hire a bunch of Venezuelan data entry temps to manually match until you figure this out?
12-10-2018 , 01:33 PM
Quote:
Originally Posted by suzzer99
Can you hire a bunch of Venezuelan data entry temps to manually match until you figure this out?
I mean I guess question one is how the obfuscation is done. Can a human decode it? Or is it some opaque thing like "rustybrooks => 1 and suzzer99 => 2" and so forth?

If it's reversible you can try to reverse it. If it isn't but you know the method, and the method is independent of other data you don't have, then you can obfuscate the other source too. Otherwise, you're probably boned.
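
To make the second case concrete: if the method is known and deterministic, nothing needs to be reversed, you just run the same transform over your clean IDs and join on the result. In this sketch the salted SHA-256 is purely a stand-in, since nobody here knows what the third party actually does, and the `user_id` and `user_token` field names are invented.

```python
import hashlib


def obfuscate(user_id, salt="example-salt"):
    """HYPOTHETICAL stand-in for the third party's one-way obfuscation."""
    normalized = user_id.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def match_on_obfuscated(clear_records, obfuscated_records):
    """Pair each obfuscated record with the clear record that hashes to its token."""
    index = {obfuscate(rec["user_id"]): rec for rec in clear_records}
    return [
        (index[rec["user_token"]], rec)
        for rec in obfuscated_records
        if rec["user_token"] in index
    ]


# Hypothetical sample data.
ours = [{"user_id": "rustybrooks"}, {"user_id": "suzzer99"}]
theirs = [
    {"user_token": obfuscate("RustyBrooks "), "order": 17},
    {"user_token": "not-a-known-token", "order": 18},
]
pairs = match_on_obfuscated(ours, theirs)
```

If the method depends on something you don't have (a secret salt, a lookup table on their side), this does not work, which is the "otherwise" branch above.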
12-10-2018 , 01:56 PM
oh my god my literal worst nightmare is something huge like that slipping past me.
12-10-2018 , 02:14 PM
Quote:
Originally Posted by jmakin
oh my god my literal worst nightmare is something huge like that slipping past me.
I once worked on a financial optimization engine, and for like 3 or 4 months the tax was calculated backwards (as a positive instead of a negative), and no one noticed. No one was really sure who initially broke it, and we just quietly fixed it and went on with life.
12-10-2018 , 02:54 PM
Quote:
Originally Posted by RustyBrooks
I mean I guess question one is how the obfuscation is done. Can a human decode it? Or is it some opaque thing like "rustybrooks => 1 and suzzer99 => 2" and so forth?

If it's reversible you can try to reverse it. If it isn't but you know the method, and the method is independent of other data you don't have, then you can obfuscate the other source too. Otherwise, you're probably boned.
Sounds to me more like you have some records that probably match, but w/o the ID it's a messy process. (I thought it's just the ID that's obfuscated?)

At the statistical consulting firm, we would get datasets of car accident reports from insurance companies that we had to match to actual police reports and other sources of data. We'd run a VIN match first - but that missed so many due to 0 being recorded as O, 1 for L, etc.

So I just started chipping away by making those replacements or by matching 16 out of 17 VIN characters, etc. Then you'd still have to eyeball all the data from each side to make sure you had a match. Messy stuff.
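
Roughly what that chipping away looks like in code. This is a sketch and not the firm's actual process: the replacement table covers the O/0 and L/1 swaps mentioned above plus I/1 as my own assumption, and the sample VINs are only examples. Collapsing L into 1 is lossy because L is a legitimate VIN character, so the key is for comparison only and anything it matches still needs eyeballing.

```python
# Characters that tend to get swapped during data entry, folded to one form.
CONFUSABLE = str.maketrans({"O": "0", "I": "1", "L": "1"})


def vin_key(vin):
    """Comparison key: uppercase, trimmed, look-alike characters folded together."""
    return vin.strip().upper().translate(CONFUSABLE)


def mismatches(a, b):
    """Number of positions where two strings differ (all of them if lengths differ)."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def vin_match(a, b, max_mismatches=1):
    """Treat two VINs as a probable match if at most one character differs."""
    return mismatches(vin_key(a), vin_key(b)) <= max_mismatches


exact_after_cleanup = vin_match("1HGCM82633A004352", "1hgcm82633aoo4352")
sixteen_of_seventeen = vin_match("1HGCM82633A004352", "1HGCM82633A004353")
too_different = vin_match("1HGCM82633A004352", "1HGCM82633A004399")
```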
12-10-2018 , 04:03 PM
Anyone use a pi-hole or something similar? https://pi-hole.net/
12-10-2018 , 04:55 PM
Quote:
Originally Posted by Grue
Anyone use a pi-hole or something similar? https://pi-hole.net/
Yeah I have it set up here in the house
12-12-2018 , 09:52 PM
Quote:
Originally Posted by _dave_
Maybe you can find out the obfuscation method then apply it to your good column and match on the result?
It comes in obfuscated and there is no way to reverse that.

Quote:
Originally Posted by suzzer99
Can you hire a bunch of Venezuelan data entry temps to manually match until you figure this out?
Haha yea that would be great. Except we are dealing with hundreds of millions of records.


We are working on improving our ETL and are going to run some queries on that data to get some kind of matching rate with different strategies. At this moment, I'm not too confident that this will get us to some threshold like 90%. Even then, we have to think about false positives on matches. This whole thing is ugly and we are probably boned.
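
A sketch of how the strategies could be compared on both counts at once, assuming a small hand-labeled sample of record pairs exists (it may not). The strategies, field names and sample pairs are all made up for illustration.

```python
def evaluate(strategy, labeled_pairs):
    """Score a matching strategy against hand-labeled (rec1, rec2, is_match) pairs."""
    tp = fp = fn = 0
    for rec1, rec2, is_match in labeled_pairs:
        predicted = strategy(rec1, rec2)
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1          # strategy claims a match that is not real
        elif is_match:
            fn += 1          # strategy missed a real match
    return {
        "match_rate": tp / (tp + fn) if tp + fn else 0.0,  # share of true matches found
        "precision": tp / (tp + fp) if tp + fp else 0.0,   # share of claimed matches that are right
        "false_positives": fp,
    }


def by_title(a, b):
    return a["title"].casefold() == b["title"].casefold()


def by_title_and_author(a, b):
    return by_title(a, b) and a["author"] == b["author"]


# Hypothetical hand-labeled sample.
labeled = [
    ({"title": "Dune", "author": "Herbert"}, {"title": "dune", "author": "Herbert"}, True),
    ({"title": "Dune", "author": "Herbert"}, {"title": "Dune", "author": "Smith"}, False),
    ({"title": "Emma", "author": "Austen"}, {"title": "Emma", "author": "Austen"}, True),
    ({"title": "Emma", "author": "Austen"}, {"title": "Ulysses", "author": "Joyce"}, False),
]
report = {
    "title": evaluate(by_title, labeled),
    "title+author": evaluate(by_title_and_author, labeled),
}
```

Looking at match rate alone would rate both strategies the same here; the false positive count is what separates them.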
12-13-2018 , 06:46 AM
yesterday i found a decent sized memory leak in my main project.

Except it's the kind that's super hard to find. every time a certain pointer is used and "released" the software is holding on to some resource somewhere and eventually finally releasing it when the process exits. so valgrind reports no errors. no one has any idea where to look and I have a chance to be a hero but I tried for a good 4 hours and can't even really explain the problem well enough to someone who could figure it out. I showed my boss and he agrees.

I wrote a test that'll eventually crash a client and/or server process with OOM error. Idk what tool to use to try to nail this down, any suggestions? (C btw)
12-13-2018 , 07:52 AM
Quote:
Originally Posted by jmakin
yesterday i found a decent sized memory leak in my main project.

Except it's the kind that's super hard to find. every time a certain pointer is used and "released" the software is holding on to some resource somewhere and eventually finally releasing it when the process exits. so valgrind reports no errors. no one has any idea where to look and I have a chance to be a hero but I tried for a good 4 hours and can't even really explain the problem well enough to someone who could figure it out. I showed my boss and he agrees.

I wrote a test that'll eventually crash a client and/or server process with OOM error. Idk what tool to use to try to nail this down, any suggestions? (C btw)
So when you say the pointer is released, do you mean the memory allocated at the address the pointer is referencing is deallocated, or the pointer itself is just deleted or removed?

Does whatever the pointer is referencing contain another pointer that is being removed without properly cleaning up the memory it references?

Edit: Pardon my ignorance if released is a well-known term for manual memory management.
12-13-2018 , 08:53 AM
No, I was not clear - when the pointer is “let go” (not freed, just as far as the API is concerned the client no longer needs to worry about the pointer) the memory seems to still be referenced somewhere in the server and is not deallocated even though it’s essentially never needed again. It eventually gets deallocated, but far after the point where anything matters anymore. So valgrind reports no leaks.

Memory usage steadily creeps up til something crashes.
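
The real project is C, so the following Python toy only models the pattern, and every name in it is invented. It shows why an exit-time leak check stays quiet (the memory is still reachable) and how comparing two heap snapshots points at the allocation site that keeps growing. That is the same idea a heap profiler such as Valgrind's Massif applies to a C process, though whether it fits this codebase is a guess.

```python
import tracemalloc


class Server:
    """Toy stand-in for the server: release() never drops the server's own reference."""

    def __init__(self):
        self._buffers = {}
        self._next_id = 0

    def acquire(self):
        handle = self._next_id
        self._next_id += 1
        self._buffers[handle] = bytearray(1024)
        return handle

    def release(self, handle):
        # Bug being modeled: the client is told it is done with the handle, but the
        # buffer stays reachable from self._buffers until the server itself goes away.
        pass


tracemalloc.start()
server = Server()
before = tracemalloc.take_snapshot()
for _ in range(2000):
    server.release(server.acquire())
after = tracemalloc.take_snapshot()

# Allocation sites sorted by how much they grew between the two snapshots.
growth = after.compare_to(before, "lineno")
top = growth[0]
live_buffers = len(server._buffers)
tracemalloc.stop()
```

Printing `top` shows the file and line responsible for most of the growth, which in this toy is the `bytearray(1024)` allocation inside `acquire`.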
12-13-2018 , 08:56 AM
This is no technical help, but just about any time I fail for hours on end, I come back the next day and figure it out right away.
12-13-2018 , 10:05 AM
Quote:
Originally Posted by jmakin
yesterday i found a decent sized memory leak in my main project.

Except it's the kind that's super hard to find. every time a certain pointer is used and "released" the software is holding on to some resource somewhere and eventually finally releasing it when the process exits. so valgrind reports no errors. no one has any idea where to look and I have a chance to be a hero but I tried for a good 4 hours and can't even really explain the problem well enough to someone who could figure it out. I showed my boss and he agrees.

I wrote a test that'll eventually crash a client and/or server process with OOM error. Idk what tool to use to try to nail this down, any suggestions? (C btw)
What language are we talking about here? How is the release happening? Is it one of the magic "clean up on dealloc" types like std::shared_ptr?