11-10-2019 , 03:52 PM
copypasting this from elsewhere on the forum, pretty awesome:

** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-18-2019 , 09:53 PM
Is there something like Promise.all() in Python?

Code:
  reportData["isFirstTimeUser"] = isFirstTimeUser(event)
  reportData["paymentLinkClicked"] = giveAgainPayNowClicked(event)
  reportData["profileChangeMade"] = profileChangeMade(event)
  reportData["feedbackSubmitted"] = feedbackSubmitted(event)
Each of these runs a query against our logs and takes 5-10 seconds. It would be nice to run them in parallel.

Multiprocessing and Process look promising. But I can't figure out how to tell when all the calls are completed.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 12:15 AM
multiprocessing in python is literally going to use multiple processes - which may be fine but you should know that. You might be better off using something like gevent, which has the same basic interface as the python threading library, but everything runs in one thread. This works great for things that are not CPU bound, like stuff that's just waiting like in your case. gevent has the notion of "waiting" for a set of greenlets to finish, same as the threading library, and something like that would probably work for you.

The threading library itself may or may not work for you also in this case - it kind of depends. The problem with the threading library is that python code invokes a global lock so in practice pure python code can't run in parallel, one python command or another is always taking the lock. But many C/C++ libraries used in python don't invoke the lock and can be run in parallel. It's not uncommon for stuff like database libraries to have some multi-threadability. You'd have to try it to see.
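A minimal sketch of the threading route, using the standard library's concurrent.futures. The two check functions are hypothetical stand-ins that just sleep; the real ones would run the log queries. Whether this actually overlaps the waits depends on the query library releasing the lock while it blocks, as noted above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the slow log queries; each sleeps, then returns.
def is_first_time_user(event):
    time.sleep(0.2)
    return True

def profile_change_made(event):
    time.sleep(0.2)
    return False

def run_all(event, checks):
    """Run every check on its own thread and wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn, event) for name, fn in checks.items()}
        # result() blocks until that call finishes and re-raises any exception.
        return {name: fut.result() for name, fut in futures.items()}

checks = {
    "isFirstTimeUser": is_first_time_user,
    "profileChangeMade": profile_change_made,
}
report_data = run_all({"crmId": "123"}, checks)
```

Collecting the results through result() is what plays the role of "wait until everything is done" here.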

Finally there's a library called "twisted" which is used for asynchronous/parallel event processing. Of all the stuff mentioned so far, it's probably the most complicated. Just throwing it out there.

Also python3 has some async/await stuff, but my recollection of it is that it's mostly for generators, which might not fit the pattern of what you're working on. It would be good if each of the functions you had above produced an entry at regular intervals but maybe not great if each of them paused for a long time and then produced a lot of entries. I'm half talking out of my ass here, I've never really used it.
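For what it's worth, async/await isn't limited to generators: asyncio.gather() is probably the closest standard-library analogue to Promise.all(). A rough sketch, assuming the queries are ordinary blocking functions (slow_query is a hypothetical stand-in) that get pushed onto the default thread pool:

```python
import asyncio
import time

def slow_query(name):
    # Hypothetical blocking call standing in for a log query.
    time.sleep(0.2)
    return name.upper()

async def gather_all(names):
    loop = asyncio.get_running_loop()
    # run_in_executor hands each blocking call to the default thread pool.
    tasks = [loop.run_in_executor(None, slow_query, n) for n in names]
    # gather waits for every task and returns results in the order given.
    return await asyncio.gather(*tasks)

results = asyncio.run(gather_all(["a", "b", "c"]))
```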
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 12:23 AM
Of the ones I mentioned I think I'd try gevent first. I could probably help you set up an example if you get stuck
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 12:49 AM
I should really ask questions here instead of beating my head against the wall.

I had a process that I was trying to thread. Took me a while to figure out that the library I was using was already multithreaded, so threading didn't help.

I ended up having to kick off the same program multiple times rather than threading one to max out the cpus on the server.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 05:59 PM
Quote:
Originally Posted by RustyBrooks
multiprocessing in python is literally going to use multiple processes - which may be fine but you should know that. You might be better off using something like gevent, which has the same basic interface as the python threading library, but everything runs in one thread. This works great for things that are not CPU bound, like stuff that's just waiting like in your case. gevent has the notion of "waiting" for a set of greenlets to finish, same as the threading library, and something like that would probably work for you.

The threading library itself may or may not work for you also in this case - it kind of depends. The problem with the threading library is that python code invokes a global lock so in practice pure python code can't run in parallel, one python command or another is always taking the lock. But many C/C++ libraries used in python don't invoke the lock and can be run in parallel. It's not uncommon for stuff like database libraries to have some multi-threadability. You'd have to try it to see.

Finally there's a library called "twisted" which is used for asynchronous/parallel event processing. Of all the stuff mentioned so far, it's probably the most complicated. Just throwing it out there.

Also python3 has some async/await stuff, but my recollection of it is that it's mostly for generators, which might not fit the pattern of what you're working on. It would be good if each of the functions you had above produced an entry at regular intervals but maybe not great if each of them paused for a long time and then produced a lot of entries. I'm half talking out of my ass here, I've never really used it.
gevent looks interesting but unfortunately I can't get it working with lambda and I'm not sure adding the giant source binary is the way to go.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 06:55 PM
Could your lambda kick off other lambdas?
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 06:58 PM
Yeah but I still need to aggregate all the results somehow. So I could write the results for each one to S3 or something then constantly poll the directory to see if they're all there. But that seems suboptimal. Lambda kicking off other lambdas works great if the end result is say to send an email for just that report.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 07:01 PM
Anyway I got it working based on this example: https://aws.amazon.com/blogs/compute...th-aws-lambda/

Lambda doesn't support Queue or Pool, so you have to use Pipe.

Here's the main handler method:

Code:
from multiprocessing import Process, Pipe

def doProspectReport(event, context):
  reportData = {'crmId': event["crmId"], 'username': event["username"], 'logincount': event["logincount"] }
  processes = []

  parent_conn1, child_conn1 = Pipe()
  parent_conn2, child_conn2 = Pipe()
  parent_conn3, child_conn3 = Pipe()
  parent_conn4, child_conn4 = Pipe()

  processes.append(Process(target=isFirstTimeUser,    args=(event, child_conn1,)))
  processes.append(Process(target=paymentLinkClicked, args=(event, child_conn2,)))
  processes.append(Process(target=profileChangeMade,  args=(event, child_conn3,)))
  processes.append(Process(target=feedbackSubmitted,  args=(event, child_conn4,)))

  for p in processes:
    p.start()

  reportData["isFirstTimeUser"]    = parent_conn1.recv()
  reportData["paymentLinkClicked"] = parent_conn2.recv()
  reportData["profileChangeMade"]  = parent_conn3.recv()
  reportData["feedbackSubmitted"]  = parent_conn4.recv()

  for p in processes:
    p.join()

  return reportData
It seems to work if I set the reportData values before or after doing the p.join() - which I guess is needed to make sure everything is done. But if that's true how come I can set my variables before calling p.join (as in the example I linked)? Weird. Also apparently that extra trailing comma in the args group is important for some reason.
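On the trailing comma: in Python it only matters for a one-element tuple. With two arguments, as in the handler above, it is harmless but optional. A quick illustration (the string values are just placeholders):

```python
one_arg = ("event",)            # a tuple with one element
not_a_tuple = ("event")         # just a parenthesised string
two_args = ("event", "conn",)   # trailing comma changes nothing here
```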

Here is a function that it calls:

Code:
def paymentLinkClicked(event, conn): 
  logGroup = '/aws/lambda/dp-util-dev-getCookies'
  queryString = 'fields event.ConstituentLookupId as CrmId, event.target as target, event.cognitoEmail as email, event.cognitoPhone as phone, @timestamp' \
                '| filter (event.ConstituentLookupId=\'' + event["crmId"] + '\' and name like /\\w+/)' 
  
  queryResult = doQuery(logGroup, queryString)

  gapnResult = {"giveagain": 0, "paynow": 0}
  for r in queryResult:
    gapnResult[r["target"]] += 1

  conn.send(gapnResult) 
  conn.close()
doQuery is a synchronous function that does a cloudwatch logs query.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 07:33 PM
Gevent might not be suitable for lambda, no idea. I'll post a multiprocessing example later
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 07:34 PM
Oh, or maybe you got it working already
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 08:21 PM
Actually I read the example wrong. process.join() needs to come before parent_conn1.recv(), not after. Which makes more sense. I was getting some weird results.

But other than that yeah I think I got it working.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 09:00 PM
I was getting weird results (different input, same output) for my parallel calls to another lambda. But if I put a print statement on the output it worked fine. Apparently the boto3 lambda client is not threadsafe: https://stackoverflow.com/questions/...nt-thread-safe

So moving this inside my method instead of at the top of the file seems to have fixed the problem:

lambdaClient = boto3.client('lambda')

Edit: one of the first calls we do to set this whole thing up has a very large response (80k rows). Apparently I have reached the limit of Pipe() because that just times out after 2 minutes. It worked fine w/o using pipe. No parallel for that call I guess.
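A likely cause of that timeout, assuming the parent calls join() before recv(): a pipe only buffers a limited amount of data, so a child sending a large result blocks in send() until the parent reads it, while the parent is stuck in join() waiting for the child to exit. Reading first and joining afterwards avoids that. A minimal sketch, where big_query is a hypothetical stand-in and the fork context assumes a Unix host such as Lambda:

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: Unix host

def big_query(conn):
    # Hypothetical stand-in for a query with a very large response.
    rows = [{"row": i} for i in range(80_000)]
    conn.send(rows)
    conn.close()

def fetch_large():
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=big_query, args=(child_conn,))
    p.start()
    rows = parent_conn.recv()  # drain the pipe first...
    p.join()                   # ...then wait for the child to exit
    return rows

rows = fetch_large()
```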

Last edited by suzzer99; 11-19-2019 at 09:22 PM.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-19-2019 , 10:27 PM
Lol development in a modern world.

Quote:
Query concurrency

A maximum of 4 concurrent CloudWatch Logs Insights queries, including queries that have been added to dashboards. You can request a quota increase.
So much for all the parallel stuff I wanted to do.

But now I'm wondering does that mean it's going to crap out if my dashboard happens to be refreshing while this report runs. Argh.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 12:36 AM
If you're using the multiprocessing library, then you're not using multiple threads, you're using multiple processes. It forks the processes when you create the pool or whatever, so where you put init stuff determines whether it'll be separately inited in each process, or whether it'll inherit the inited stuff when it forks.

It's common to have to re-arrange stuff so that it gets done "right" - some stuff wants to be duped and some stuff needs to be separately initialized
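A small sketch of that difference, assuming the fork start method on a Unix host. The process id stands in for any "init stuff" (a client, a connection): the module-level value is inherited by the child as-is, while the value set inside the worker is created fresh in the child.

```python
import multiprocessing as mp
import os

ctx = mp.get_context("fork")  # assumption: Unix host

# Initialised once in the parent; a forked child inherits this same value.
INIT_PID = os.getpid()

def worker(conn):
    # Initialised inside the worker, so it belongs to the child process.
    local_pid = os.getpid()
    conn.send((INIT_PID, local_pid))
    conn.close()

def compare_init():
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=worker, args=(child_conn,))
    p.start()
    inherited, fresh = parent_conn.recv()
    p.join()
    return inherited, fresh

inherited, fresh = compare_init()
```

The same reasoning is why creating a client inside the worker function, rather than at the top of the file, gives each process its own instance.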

You can get your quota increased, we routinely request quotas at 10 or 100 times the defaults for various stuff
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 12:37 AM
Wait, are you trawling cloudwatch logs to get real time data? That seems like a weird and contraindicated thing to do. Cloudwatch is slow as balls, even for what it's for, which really kinda isn't that.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 01:43 AM
No it's a report that runs at 2am. They want a summary of our VIP users' activities to send to a prospect manager.

First I get the day's logins from the logs, then I compare to our full list of VIPs, then from that I run a full activity report for each match and consolidate them all into a CSV file at the end. The nice thing is that I didn't have to touch the application code at all.

I don't really care too much about whether it's a process or a thread. AWS can worry about that. Although I did find out I can't run multiple lambda calls in parallel with the same boto lambda client. Has to be a new instance each time. I assume it's the same for the logs client.

I'm just trying to use parallelism to efficiently use time while waiting for the cloudwatch queries to execute. I think the right balance is run the queries for each report (4-6) in parallel, but run each report in series - since I could have 100s of reports and no clean way to queue them up.
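One way to sketch that balance, using threads rather than processes for brevity: a single pool capped at the query quota, reports looped in series, and each report's queries mapped onto the pool. fake_query is a hypothetical stand-in, and the counter only exists to show the cap holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_QUERIES = 4  # the default quota mentioned above

_lock = threading.Lock()
_running = 0
peak = 0  # highest number of queries seen in flight at once

def fake_query(name):
    # Hypothetical stand-in for one CloudWatch Logs Insights query.
    global _running, peak
    with _lock:
        _running += 1
        peak = max(peak, _running)
    time.sleep(0.05)
    with _lock:
        _running -= 1
    return name

def run_reports(reports):
    results = []
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_QUERIES) as pool:
        for queries in reports:  # reports run one after another
            # this report's queries run in parallel, results kept in order
            results.append(list(pool.map(fake_query, queries)))
    return results

reports = [[f"r{r}q{q}" for q in range(6)] for r in range(3)]
all_results = run_reports(reports)
```

If the quota gets raised, only MAX_CONCURRENT_QUERIES needs to change.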

We're requesting an increase in concurrent queries from the default of 4. I told my boss to ask for 20. Maybe I should have asked for 100. It costs the same whether they're being run in series or concurrently right?
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 02:02 AM
Yeah, should be the same cost.

We had to scrape a shitload of websites at my last job. So I had my guys stick it into a lambda and fire them off. We figured out if we fired off 10,000 at a time we could be done in a week. So we did. Couldn't figure out why stuff was hanging.

Then we got a call from IT. Turns out our company had a limit of 7000 concurrent lambdas so no one else at the company could run lambda for a day because of us. I thought the whole idea of lambda was fire off as many as you wanted. Never occurred to me that there would be a limit.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 08:59 AM
Welcome to AWS. Where everything has a limit, many of which are invisible to the user, some of which are invisible to their own front line support, and a few of which seem to be hard coded values nobody knows about.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 12:41 PM
Quote:
Originally Posted by jjshabado
Welcome to AWS. Where everything has a limit, many of which are invisible to the user, some of which are invisible to their own front line support, and a few of which seem to be hard coded values nobody knows about.
My favorites are the ones that are rate limited at fairly low rates, like 1 or 2 per second.

Speaking of cloudwatch, there's an aws cli thing that will download cloudwatch logs to local disk. It is So Slow. I wrote my own version that's roughly 30-40x faster. To try to download our access logs, for example, took *roughly real time* i.e. if I wanted an hour of logs it was going to take an hour to get them.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 02:21 PM
Hah. Yeah, no kidding. The most painful limit I've ever had to get raised was a specific api call rate limit of like 5 requests/second. I assume there was some old shitty service somewhere that they were told to protect at all costs regardless of how big a customer was asking.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 04:30 PM
Just received an offer from my old company to move back to MN from SoCal.

It's a fairly significant paycut but cost of living offsets most of it and my wife wants to live in MN near family so happy wife happy life.

When I left 1.5 years ago I was a Senior Engineer and the offer is for Principal Engineer (which is two levels higher, woohoo!) I am almost certainly leap frogging a few people, but I'll try my best to make sure there's not too much animosity.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 05:12 PM
yeah, just fire them on day 1.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 05:16 PM
I'm digging python so far, everything seems pretty straightforward and well thought out. Like what node does except w/o having to wrap it around 25 years of backward-compatible Javascript evolution.

Code:
  for p in processes:
    p.start()
If I was king I would decree that all looping in all languages be this terse.

The only thing I got a bit stuck on was when passing an object to an async process, I can't mutate it and have that reflected in the parent process when the async job is done. I guess it makes a copy? Maybe because it's a new process and not a new thread. That's why all the Pipe() stuff is necessary.
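That guess can be checked with a short experiment: a thread shares the parent's memory, so it mutates the same dict, while a child process works on its own copy. A minimal sketch, assuming the fork start method on a Unix host (mark_done is a made-up function for illustration):

```python
import multiprocessing as mp
import threading

ctx = mp.get_context("fork")  # assumption: Unix host

def mark_done(report):
    report["done"] = True

def try_thread():
    report = {"done": False}
    t = threading.Thread(target=mark_done, args=(report,))
    t.start()
    t.join()
    return report["done"]  # the thread changed the parent's dict

def try_process():
    report = {"done": False}
    p = ctx.Process(target=mark_done, args=(report,))
    p.start()
    p.join()
    return report["done"]  # the child changed only its own copy

thread_saw_change = try_thread()
process_saw_change = try_process()
```

So with processes, anything the parent needs back has to be sent explicitly, which is what the Pipe() calls are doing.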
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
11-20-2019 , 05:57 PM
don't know much about python but yes, threads within a process communicate in an entirely different way than processes communicate with each other.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote