Quote:
Originally Posted by daveT
Those numbers are pretty sad for so-called "big data" software. I'm also curious why there was no index on this query in the first place.
The 100ms was for total page response time - most of our mongo queries are 5ms or under.
I didn't create the data structure, the queries, or the indexing, but quite possibly if I had, there wouldn't be an index there either. I add indexes sparingly, once it's proven they do some good.
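To illustrate what "proven" means to me in practice, here's a minimal pymongo sketch of the before/after check I'd do; the collection name, query shape, and field are all made up for illustration:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
coll = client["appdb"]["deployments"]  # hypothetical collection

# First, prove the problem: explain() on the slow query should show
# a COLLSCAN stage in the winning plan if no usable index exists.
before = coll.find({"account_id": 42}).explain()
print(before["queryPlanner"]["winningPlan"])

# Only then add the index, and confirm the winning plan flips to IXSCAN.
coll.create_index([("account_id", ASCENDING)])
after = coll.find({"account_id": 42}).explain()
print(after["queryPlanner"]["winningPlan"])
```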
The worst part, if you ask me, is that this endpoint is completely useless, and yet it's our most "used" API endpoint. But that's just because every deployment of our main product polls it every few minutes, I think. I have no idea what they do with that information.
Regarding why it took us so long to get there: the problem was that literally every metric was off the charts, and the nginx and uwsgi logs in particular had entries that looked really damning. Those things turned out to be symptoms, not causes.
Also, unfortunately, none of us were really empowered to plow through the system and do what needed to be done, so it involved a lot of corralling people and letting things get bad enough that everyone would agree to act.
For example, my first instinct was to drain the web queue and reject connections or redirect to a maintenance page. IMO this was a no-brainer: we were dropping 80% of traffic on the floor, and probably *all* of the frontend traffic, because our frontend was unusable (our API was very slow and you'd have to retry multiple times, but it stayed up).
It's kind of a long story, but if my operating theory was correct, this would actually have solved the problem by itself. It didn't, though, so we ended up eliminating the likely causes and moving on to the less likely ones.
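For what it's worth, the maintenance-page idea doesn't have to be fancy. A gate like this in front of the WSGI app (we were on uwsgi) is roughly what I meant; this is just a sketch, and the MAINTENANCE flag and the app object are assumptions, not what we actually ran:

```python
# Sketch of a maintenance gate wrapped around a WSGI app.
# In real life MAINTENANCE would be a flag you can flip without a
# deploy (e.g. the existence of a file on disk); hardcoded for brevity.
MAINTENANCE = True

def maintenance_gate(app):
    def wrapped(environ, start_response):
        if MAINTENANCE:
            # Shed load cheaply: tell clients to back off and retry later.
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", "120")])
            return [b"Down for maintenance, back shortly.\n"]
        return app(environ, start_response)
    return wrapped

# application = maintenance_gate(real_wsgi_app)  # hypothetical app object
```

Returning a 503 with Retry-After at least tells well-behaved clients to back off instead of hammering a backend that's already drowning.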