Quote:
Originally Posted by adios
Windows and Linux system level APIs are C language interfaces more or less. Applications making Windows and/or system level calls would be my guess as to why.
I'm self-taught and will admit to knowing very little about system-level calls.
The program I wrote in Perl ran on Ubuntu on AWS and did the following:
pull data from the data supplier's S3 bucket, in the form of a gzip'd file of single-line JSON records
unzip the file
parse the lines of JSON
look for keywords
if keywords match, grab other data
compare the data to our database
if a match is found in our database, put it into a "good" file and make an entry in the database
zip up the good file
put the good file in our S3 bucket
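As a rough illustration of the middle steps above (not the original Perl), here is a minimal Python sketch. It assumes one JSON object per line, and the names `filter_records`, `keywords`, `known_ids` and the `id` and `text` fields are all made up for the example. The S3 transfers are left out, and a plain set stands in for the database lookup.

```python
import gzip
import io
import json


def filter_records(gz_bytes, keywords, known_ids):
    """Return a gzip'd "good" file of records that match a keyword and our data.

    gz_bytes  -- contents of the supplier's gzip'd file (hypothetical input)
    keywords  -- lowercase strings to look for in the assumed "text" field
    known_ids -- set standing in for the database comparison
    """
    good = []
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a malformed line instead of aborting the file
            if not isinstance(record, dict):
                continue
            text = str(record.get("text", "")).lower()
            if not any(k in text for k in keywords):
                continue  # no keyword hit
            if record.get("id") in known_ids:
                good.append(json.dumps(record, sort_keys=True))
    payload = "".join(row + "\n" for row in good)
    return gzip.compress(payload.encode("utf-8"))


sample = gzip.compress(
    b'{"id": 1, "text": "Red widget"}\n'
    b'{"id": 2, "text": "blue gadget"}\n'
    b'{"id": 3, "text": "red gadget"}\n'
)
print(gzip.decompress(filter_records(sample, ["red"], {1, 2})).decode("utf-8"))
```

In a real run the set would be replaced by a database query, and that lookup is probably where most of the time goes rather than the JSON parsing.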
The reason they had abandoned it a few years before I got there is that there was so much data no one had figured out a way to process it faster than it came in. We got 100 million rows a night and it took them a week to process.
When I started, my script took 2 weeks to process one day. By using the SSDs on AWS, optimizing the databases and the SQL, and threading the hell out of the process, I got it down to 4 hours. I also set it up to run on as many servers as you wanted and farm the work out to them, which made it as fast as you wanted it to be. The fact that AWS had just released 96-CPU boxes and I could run 50 threads at once was a big help too. And yes, before I was fired I was looking into Lambda to do this even faster.
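To picture the "farm the work out" idea, here is a minimal Python sketch that splits the lines into chunks and hands each chunk to a worker. It is only an illustration under assumptions: `chunked`, `count_matches`, `process_in_parallel` and the `text` field are hypothetical names, and the per-chunk work is reduced to counting keyword hits.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_matches(lines, keyword):
    """Count JSON lines whose assumed "text" field contains the keyword."""
    hits = 0
    for line in lines:
        if keyword in json.loads(line).get("text", ""):
            hits += 1
    return hits


def process_in_parallel(lines, keyword, workers=4, chunk_size=1000):
    """Fan chunks out to a pool of workers and add up the per-chunk counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(count_matches, chunk, keyword)
            for chunk in chunked(lines, chunk_size)
        ]
        return sum(f.result() for f in futures)


lines = [json.dumps({"text": "red" if i % 3 == 0 else "blue"}) for i in range(10)]
print(process_in_parallel(lines, "red", workers=3, chunk_size=4))
```

One caveat: in standard CPython, threads mostly help when the work waits on I/O such as database or S3 calls. For CPU-bound parsing, `ProcessPoolExecutor` from the same module is the usual swap, and spreading chunks across servers follows the same split-and-collect shape.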
I'm guessing they never figured out some of that stuff in Python, as the last script, the one I didn't finish, was the one that controlled kicking off threads/servers and parceling out the work.
How much faster would a script running C be than a Python one for parsing JSON and looking for keywords? I know the Perl script was way faster than Python, but I can't see the savings being enough to justify a rewrite that's going to take months.
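One way to get at that question without guessing is to time the parse-and-match step on its own and compare languages on the same sample. Below is a minimal Python sketch of such a measurement. The function name `rows_per_second` and the `text` field are assumptions for the example, and the number it reports will vary by machine.

```python
import json
import time


def rows_per_second(lines, keywords):
    """Parse each JSON line, test for keywords, and report (hits, rows/sec)."""
    start = time.perf_counter()
    hits = 0
    for line in lines:
        text = json.loads(line).get("text", "")
        if any(k in text for k in keywords):
            hits += 1
    elapsed = time.perf_counter() - start
    rate = len(lines) / elapsed if elapsed > 0 else float("inf")
    return hits, rate


# Synthetic sample; a real test would use a slice of the supplier's file.
sample = [json.dumps({"id": i, "text": "red widget" if i % 10 == 0 else "other"})
          for i in range(100_000)]
hits, rate = rows_per_second(sample, ["red"])
print(f"{hits} hits, {rate:,.0f} rows/sec on one core")
```

For scale, 100 million rows in 4 hours works out to roughly 6,900 rows per second across the whole pipeline, so the single-core figure shows how much of the budget parsing alone would use.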
The total AWS bill for a month using my Perl scripts was going to be under $2,000. I set it up with Aurora, which would shut down when not running, and the 96-CPU boxes would fire up only when needed for processing and then shut down when done, so only one dinky little box would run 24/7.