Quote:
Originally Posted by Sholar
I am probably missing something here, but is this just trying to trick SQL into doing a short-circuitable full-table scan? And trying to optimize average but not worst-case performance? It feels like the non-SQL code for this is pretty trivial so if this is a common task, maybe duplicating the storage and doing the retrieval in another way is the right approach.
You aren't missing anything at all. That's exactly what I'm proposing, though I'm trying to optimize against the worst case. It seems like the worst case (11 rows) is going to show up more often than the best case (< 11 rows). The average seems pretty much the same as the worst case, right?
The queries with LIMIT 11 are running at about 20ms, so it doesn't seem like a bad idea to just call from Python once per type and append the results to the output list, which is the same thing I'm trying to show in the procedural code; there's a sketch below. Rusty has said this query is supposed to run across many tables, though, so it may end up with a ton of back-and-forth.
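Something like this is what I have in mind. The table name, column name, connection details, and the set of type ids are all made up for illustration, and I'm assuming the mysql-connector-python driver:

[code]
import mysql.connector

# Hypothetical connection details.
conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="bigdata")
cur = conn.cursor()

TYPE_IDS = [1, 2, 3, 4, 5]  # hypothetical set of types to cover
SQL = "SELECT * FROM events WHERE type_id = %s LIMIT 11"

output = []
for type_id in TYPE_IDS:
    cur.execute(SQL, (type_id,))   # each call should short-circuit at 11 rows
    output.extend(cur.fetchall())  # ~20ms per round trip adds up across tables

cur.close()
conn.close()
[/code]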
The problem, as I'm seeing it, is that the large tables are using up all the allocated memory chunks just to do the work. A soaking-wet-napkin calculation:
400M rows.
Max memory chunk: 10K rows.
Memory chunks needed for the naive calculation: 40,000.
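Or, spelled out as arithmetic (same numbers as above, nothing new):

[code]
# Back-of-the-napkin arithmetic from the list above.
rows_total = 400_000_000             # 400M rows
rows_per_chunk = 10_000              # max memory chunk holds 10K rows
print(rows_total // rows_per_chunk)  # -> 40000 chunks for the naive scan
[/code]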
The query we're specifically running through is a single table. Doing the short-circuit caps the size of the seed data at 11 rows, which would free up a ton of memory chunks.
The problem is the outer query, which is still using up many memory chunks, though quite a bit fewer than 40,000. If this could be indexed, the database would know which chunks to look at and it would go much faster, but that depends on which tables the outer query is using.
I'm not sure about the procedural code because I don't know two basic facts about MySQL:
1- Can the query optimizer "see" inside the procedural code? If it can, other ideas may work, but the looping should use up the least memory and shouldn't depend on the query optimizer. I'm pretty confident about the basic idea.
2- How do arrays work in MySQL? Are they fast? Do they even exist? If not, what is the alternative, and how does it behave? I'm not confident at all about this.
I don't have a lot of domain knowledge about MySQL, so I was trying to show the more universal solutions first. The missing piece is CTEs and, in particular, recursive CTEs, which the procedure is attempting to replace. Making successive calls from Python and appending to a list is basically doing the same thing, albeit without the full benefit of MySQL's query optimizer.
This explains what I'm trying to do:
https://wiki.postgresql.org/wiki/Loose_indexscan
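To make that concrete, here's my untested attempt at translating the wiki's trick to MySQL 8.0+ (which supports recursive CTEs), wrapped in the same Python driver as above. The events table and type_id column are still hypothetical; the idea is that each recursive step does one index probe for the next distinct value instead of scanning all 400M rows, and the result can feed the per-type loop from earlier:

[code]
import mysql.connector

# Loose index scan emulated with a recursive CTE: the seed grabs the smallest
# type_id, then each recursive step jumps to the next distinct value via the
# index, stopping once the inner MIN() comes back NULL.
LOOSE_SCAN = """
WITH RECURSIVE types AS (
    SELECT MIN(type_id) AS type_id FROM events
    UNION ALL
    SELECT (SELECT MIN(type_id) FROM events
            WHERE type_id > types.type_id)
    FROM types
    WHERE types.type_id IS NOT NULL
)
SELECT type_id FROM types WHERE type_id IS NOT NULL
"""

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="bigdata")
cur = conn.cursor()
cur.execute(LOOSE_SCAN)
distinct_types = [row[0] for row in cur.fetchall()]  # one row per distinct type
cur.close()
conn.close()
[/code]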
The main problem is trying to find a small slice of data inside a very large table. Even with indexes, this is often very slow, so a workaround is probably going to be needed.