** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD **

04-21-2017 , 02:22 PM
As a general rule, "where in" is going to be much slower than "where exists." Where in generally requires the id to be a unique or PK index to work fast.

The "where exists" is more flexible on this rule.

I don't know too much about the MySQL processing language, but are you able to do something like this?

Code:
create function get_stuff (i_type)
returns table (type, cnt)

loop (x, i_type)
    select type, count(*)
    from tbl 
    where type = x
    group by type;
end loop;

---

get_stuff ('a', 'b', 'c', 'd');
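
If MySQL will take it, a concrete version might be a procedure over a comma-separated list; I don't think MySQL has arrays, and I don't think its functions can return result sets, so this is only a guess at the syntax:

Code:
delimiter //

create procedure get_stuff(in i_types text)   -- e.g. 'a,b,c,d'
begin
    declare v_type varchar(64);
    declare v_rest text;
    set v_rest = i_types;

    drop temporary table if exists stuff_counts;
    create temporary table stuff_counts (type varchar(64), cnt bigint);

    -- walk the comma-separated list, one count per type
    while length(v_rest) > 0 do
        set v_type = substring_index(v_rest, ',', 1);
        set v_rest = if(locate(',', v_rest) > 0,
                        substring(v_rest, locate(',', v_rest) + 1),
                        '');
        insert into stuff_counts
        select v_type, count(*) from tbl where type = v_type;
    end while;

    select * from stuff_counts;
end //

delimiter ;

call get_stuff('a,b,c,d');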
04-21-2017 , 02:25 PM
Quote:
Originally Posted by daveT
I don't know too much about the MySQL processing language, but are you able to do something like this?
Me neither. But would you actually expect that to perform better?
04-21-2017 , 02:48 PM
If the inner query is performing at 20ms, it seems the problem is the outer query. You said the unions are working a bit better.

It's kind of lower-level, but the general idea is to a) encapsulate the query in a function and/or b) do something strange with a while loop.

Something like this:

Code:
create function some_fun (type array)
returns table (type, cnt)

int counter = 0;

for (x as select type from tbl)
   counter += 1;
   if counter == 11:
       return (x, counter);
   end if;
end for;
I know I've done stuff like this in Postgres and it was way faster and handled larger data sets, but obviously I can't say whether it would be faster or slower here.
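
For what it's worth, the per-type short-circuit itself can be written without the procedure; the capped derived table is what lets the engine stop early (a sketch, assuming an index on type):

Code:
-- count at most 11 rows for one type, then stop scanning
select count(*) as cnt
from (select 1 from tbl where type = 'a' limit 11) t;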

edit to do a cleaner thing.

Last edited by daveT; 04-21-2017 at 02:54 PM.
04-21-2017 , 05:02 PM
Quote:
Originally Posted by RustyBrooks
I have half a mind to write all this **** to files and try grep on it. I'm not even kidding.
This is exactly what popped into my head when you got further into the details of the problem. Is this a single server running mysql or a mysql cluster?
04-21-2017 , 08:47 PM
Quote:
Originally Posted by blacklab
This is exactly what popped into my head when you got further into the details of the problem. Is this a single server running mysql or a mysql cluster?
It's a single instance of AWS RDS. Our webservers are many, though.

Sent from my Nexus 5X using Tapatalk
04-22-2017 , 09:47 AM
Quote:
Originally Posted by RustyBrooks
Haha I have had interviews that lasted over 8 hours. That is just the industry standard now. If it doesn't cost your interviewee $500 to come interview then you aren't doing your job!


Need to add the hours for your coding exercise that has you design and prototype twitter.

Quote:
Originally Posted by blacklab
This is exactly what popped into my head when you got further into the details of the problem. Is this a single server running mysql or a mysql cluster?
I am probably missing something here, but is this just trying to trick SQL into doing a short-circuitable full-table scan? And trying to optimize average but not worst-case performance? It feels like the non-SQL code for this is pretty trivial so if this is a common task, maybe duplicating the storage and doing the retrieval in another way is the right approach.
04-22-2017 , 11:25 AM
Quote:
Originally Posted by Sholar
I am probably missing something here, but is this just trying to trick SQL into doing a short-circuitable full-table scan? And trying to optimize average but not worst-case performance? It feels like the non-SQL code for this is pretty trivial so if this is a common task, maybe duplicating the storage and doing the retrieval in another way is the right approach.
You aren't missing anything at all. That's exactly what I'm proposing, though I'm trying to optimize against the worst case. It seems like the worst case (11 rows) is going to show up more often than the best case (< 11 rows). The average seems pretty much the same as the worst case, right?

The queries with limit 11 are running at 20ms, so it doesn't seem like a bad idea to just call from Python on each type then append the results to the output list instead, which is the same thing I'm trying to show in the procedural code. Rusty has said this query is supposed to run across many tables, so it may end up with a ton of back-and-forth.

The problem, as I'm seeing it, is that the large tables are using up all the allocated memory chunks to work. A soaking wet napkin calculation:

400M rows.

Max Memory Chunk: 10K rows

Needed Memory Chunks for naive calculation: 40,000.

The query we're specifically running through is a single table. Doing the short-circuit is pushing the size of the seed data to 11, which would free up a ton of memory chunks.

The problem is the outer query, which is still using up many memory chunks, but quite a bit fewer than 40,000. If this was able to be indexed, the database would know what chunks to look at and it would go much faster, but this depends on what tables the outer query is using.

I'm not sure about the procedural code because I don't know two basic facts about MySQL:

1. Can the query optimizer "see" inside the procedural code? If it can, other ideas may work, but the looping should use the least memory and shouldn't depend on the query optimizer. I'm pretty confident about the basic idea.

2. How do arrays work in MySQL? Are they fast, do they even exist? If not, what is the alternative and how does it behave? I'm not confident at all about this.

I don't have a lot of domain knowledge about MySQL. I was trying to show more universal solutions first. The missing piece is CTEs and, in particular, recursive CTEs, which the procedure is attempting to replace. Making successive calls from Python and appending to a list is basically doing the same thing, albeit without the full benefit of MySQL's query optimizer.

This explains what I'm trying to do:

https://wiki.postgresql.org/wiki/Loose_indexscan

The main problem is trying to find a small segment of data among a very large segment of data. Even with indexes, this is often very slow, so a work-around is probably going to be needed.
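
The loose index scan on that page is built out of a recursive CTE, roughly like this (Postgres syntax; MySQL doesn't have CTEs, which is why I keep reaching for procedural workarounds):

Code:
-- hop from one distinct value of col to the next via the index,
-- instead of scanning every row
with recursive t as (
    select min(col) as col from tbl
    union all
    select (select min(col) from tbl where col > t.col)
    from t
    where t.col is not null
)
select col from t where col is not null;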
04-22-2017 , 11:52 AM
Dave, the "worst case" under any short-circuit method is when there are less than 11 results, because we have to scan the entire database to fill up to 11. The best case is when our results are very dense in the result set because we then get to quit early.

In the non-short circuit method, i.e. find the real total count, the best case is when the results are small because then my group by is short/easy and the worst case is when the results are large.

At least, I think so.

Doing each query on its own is an option but I'd have to think that the union all method would be faster. There are about 20 types - 20ms is the time the query takes in mysql itself, it might be more like 30ms when you make all the python round trips. 600ms wouldn't be too bad - I think currently the average is 1000-1500ms, which I am finding acceptable.
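
(To be concrete, the union all method I mean is roughly one capped branch per type, with myquery standing in for the actual search:)

Code:
select 'a' as type, count(*) as cnt
from (select 1 from mytable where myquery and type = 'a' limit 11) t
union all
select 'b', count(*)
from (select 1 from mytable where myquery and type = 'b' limit 11) t
-- ...and so on, one branch per type, about 20 in total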

Using an entirely different method to find the counts than the working version of the tables is probably going to be the right answer (and/or restructuring the data)

There are not a lot of tables in the query - really just 2 in most cases. But both tables are very large and in the non-short circuit case we often end up doing a join of most of the rows. The tables were initially split for a few reasons. In a way one table is the "base" table and the other table is "observations" or "additions" to the base. Since every observation shares the same base data, it made sense to keep the base separate, to save space. But I think it's probably costing us a lot in these search queries, especially when the query involves a column from each table, which requires every row to be joined before it can be eliminated or chosen.

Approximate counts would be totally fine for us, if there is something simple out there, I'd try it. Like I would be ecstatic if there was an easy way to say, I want counts by type of all the things that match this query, where the counts value is one of:
<10
10-100
100-1000
>1000
That would be like 99.9% perfect.

No one looks past the first page or 2 of results except googlebot. So the counts are really more useful just to get an idea if you have narrowed down the search enough to return relevant results, or to find out if a search term is really common/popular.

(in my industry, if you get a ton of results, it's probably a false positive)
04-22-2017 , 11:54 AM
Actually, this just occurred to me - I wonder if there's any way to sort a table that gives a random or pseudorandom ordering? If so I could do one query, limit by like 10k, and get a probabilistic count
04-22-2017 , 11:54 AM
So in this bootcamp they are spending a lot of time teaching us about classes/inheritance/prototypes, etc. (Which are also a major source of wasted time as these concepts are pretty intuitive and shouldn't take much explanation, but people ask for this stuff to be repeated 10 times to them).

I mentioned this to my friend and he said he never uses them. He said this is basically what OOP is, and he doesn't do it, but I shouldn't take his word for it, I should ask someone who loves OOP and think for myself.

My instructor basically responded that there is a lot of functional programming vs OOP out there. Which makes sense, but it seems like the "best" pattern would be to do some of both. So surely there are times to use both?
04-22-2017 , 11:55 AM
Quote:
Originally Posted by RustyBrooks
Actually, this just occurred to me - I wonder if there's any way to sort a table that gives a random or pseudorandom ordering? If so I could do one query, limit by like 10k, and get a probabilistic count
order by rand() in mysql
04-22-2017 , 12:00 PM
I guess I could just add a column, seed it with random numbers, and whenever I add a row to the table, add a random number there too. That would have the advantage of having stable results. The counts wouldn't be random, they'd be the same query over query except when new rows were added with random numbers smaller than the current random set.

You'd have the "problem" that there might be clusters of results that randomly have high random numbers and never appear in the counts. Like if you were saying
select type, count(*) from mytable where myquery order by randcol limit 10000
and all the rows with the word "blue" in them were really high, you might get funny cases that return results, but the counts show zero results.
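
Setting that up would be something like this (made-up column and index names):

Code:
-- one-time setup; new rows just need randcol filled in on insert
alter table mytable add column randcol double;
update mytable set randcol = rand();
create index idx_randcol on mytable (randcol);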
04-22-2017 , 12:00 PM
Quote:
Originally Posted by iversonian
order by rand() in mysql
This requires you to compute rand() for every row in the table
04-22-2017 , 12:00 PM
imo, wrt oop/fp

1. learn the concepts by the book and see examples
2. write ****ty code
3. start to identify common patterns and how you might solve a problem applying one paradigm vs another depending on the situation, and start to understand their relative advantages
(repeat steps 1-3)
4. have an opinion on "oop vs fp"
04-22-2017 , 12:27 PM
Quote:
Originally Posted by RustyBrooks
This requires you to compute rand() for every row in the table
The first result in Google had an interesting solution to what I think is a similar problem to yours: if the primary key is an autoincrement int, then take the max(id) and do the rand in the application to fetch records one by one.
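
Something like this, I think (assumes the autoincrement id doesn't have huge gaps; the rand() bit can also live in the application as the article suggests):

Code:
-- grab one pseudo-random row via the primary key
select t.*
from tbl t
join (select floor(1 + rand() * (select max(id) from tbl)) as rid) r
  on t.id >= r.rid
order by t.id
limit 1;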
04-22-2017 , 12:28 PM
Larry, what do the best students in your class have in common? Something that I might be able to tell from talking with them for an hour?
04-22-2017 , 12:49 PM
Quote:
Originally Posted by RustyBrooks
Dave, the "worst case" under any short-circuit method is when there are less than 11 results, because we have to scan the entire database to fill up to 11. The best case is when our results are very dense in the result set because we then get to quit early.
where be the ninja edit button. Yes, totally non-thinking statement there.

Quote:
In the non-short circuit method, i.e. find the real total count, the best case is when the results are small because then my group by is short/easy and the worst case is when the results are large.

At least, I think so.
It's hard to say without measurement. You have to do a full table scan with each query w/o indexes.

I'd guess that, if you are looking at exactly one item, that item could either be kept in memory with a counter that gets incremented, or there is a temp table with col1 = item, col2 = 1, then sum(col2). Then again, they probably have some funky data structure that I'm not aware of, which probably makes raw N computation the only consideration. These people are smart, so who knows what magic they are doing.

Quote:
Doing each query on its own is an option but I'd have to think that the union all method would be faster. There are about 20 types - 20ms is the time the query takes in mysql itself, it might be more like 30ms when you make all the python round trips. 600ms wouldn't be too bad - I think currently the average is 1000-1500ms, which I am finding acceptable.

Using an entirely different method to find the counts than the working version of the tables is probably going to be the right answer (and/or restructuring the data)

There are not a lot of tables in the query - really just 2 in most cases. But both tables are very large and in the non-short circuit case we often end up doing a join of most of the rows. The tables were initially split for a few reasons. In a way one table is the "base" table and the other table is "observations" or "additions" to the base. Since every observation shares the same base data, it made sense to keep the base separate, to save space. But I think it's probably costing us a lot in these search queries, especially when the query involves a column from each table, which requires every row to be joined before it can be eliminated or chosen.
If these tables are nearly the same size, it is a good candidate for denormalization. If you end up with 25% null space or so, it may not make a real difference. If one table is more sparse with denormalization, I'd suggest finding a way to invert the query to break the larger table down first.

Quote:
Approximate counts would be totally fine for us, if there is something simple out there, I'd try it. Like I would be ecstatic if there was an easy way to say, I want counts by type of all the things that match this query, where the counts value is one of:
<10
10-100
100-1000
>1000
That would be like 99.9% perfect.
If approximate is good enough, why not build fake materialized views? Just run a cron every hour or whatever and have it compute the entire data set. You can then just query that table with a straightforward select where x between y and z. You can also index this table, and possibly do some processing on the data to reduce the like queries. The trade-off is that you end up taking a massive hit to compute the entire set each time. You could probably do the computation in one table, then after that computation is done, run a trigger to update all the values in the table you are querying.

https://dba.stackexchange.com/questi...-view-in-mysql
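
A minimal sketch of what I mean, with a made-up summary table (this only covers totals per type; folding in the per-search filtering is the harder part):

Code:
create table type_counts (
    type         varchar(64) primary key,
    cnt          bigint,
    refreshed_at datetime
);

-- the hourly cron (or a mysql EVENT): recompute everything in one shot
replace into type_counts
select type, count(*), now()
from mytable
group by type;

-- reads then become a cheap indexed lookup
select type, cnt
from type_counts
where cnt between 10 and 100;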
04-22-2017 , 12:59 PM
Quote:
Originally Posted by Larry Legend
So in this bootcamp they are spending a lot of time teaching us about classes/inheritance/prototypes, etc. (Which are also a major source of wasted time as these concepts are pretty intuitive and shouldn't take much explanation, but people ask for this stuff to be repeated 10 times to them).

I mentioned this to my friend and he said he never uses them. He said this is basically what OOP is, and he doesn't do it, but I shouldn't take his word for it, I should ask someone who loves OOP and think for myself.

My instructor basically responded that there is a lot of functional programming vs OOP out there. Which makes sense, but it seems like the "best" pattern would be to do some of both. So surely there are times to use both?
Classes and inheritance are very much a part of FP.

https://en.wikibooks.org/wiki/Haskell/Classes_and_types

FP is simply an approach to removing global mutation and side effects. If anyone gives you any definition that doesn't specifically talk about time and mutation, it is wrong. I don't even think people who are strong proponents of OOP will tell you to go hog-wild with global mutation.
04-22-2017 , 01:02 PM
I suspect maybe procedural programming was meant somewhere?
04-22-2017 , 01:35 PM
Quote:
Originally Posted by RustyBrooks
Dave, the "worst case" under any short-circuit method is when there are less than 11 results, because we have to scan the entire database to fill up to 11. The best case is when our results are very dense in the result set because we then get to quit early.
Yup.

Quote:
Approximate counts would be totally fine for us, if there is something simple out there, I'd try it
Quote:
Originally Posted by RustyBrooks
Actually, this just occurred to me - I wonder if there's any way to sort a table that gives a random or pseudorandom ordering? If so I could do one query, limit by like 10k, and get a probabilistic count
Yeah, if you don't mind approximate answers, then this is best. Especially since you can always enable the "fast approximate" thing first and do the slower check only if you have ~0 results and need to be sure that there's nothing.

If you have something like TABLESAMPLE then it's very easy. Otherwise, just use rand() and skip the sort. E.g.

Code:
select * from tbl where rand() < 0.001
is good enough and O(N) whereas sorting by rand() and limiting is O(N log N) and it sounds like you don't really care much about exactly how many rows come back. (Can always patch this up a bit to be based on count(1) for the table if you don't have an a priori sense of the size of the table etc.)
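
For the counts-by-type case that would be something like the following, scaled back up by the sampling rate (reusing the mytable / myquery placeholders from earlier in the thread):

Code:
-- keep ~0.1% of matching rows, then multiply the counts back up
select type, count(*) * 1000 as approx_cnt
from mytable
where myquery and rand() < 0.001
group by type;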
04-22-2017 , 02:29 PM
Quote:
Originally Posted by _dave_
I suspect maybe procedural programming was meant somewhere?
Yeah, back in DeVry when I was learning COBOL...
04-22-2017 , 04:08 PM
Quote:
Originally Posted by daveT
If these tables are nearly the same size, it is a good candidate for denormalization. If you end up with 25% null space or so, it may not make a real difference. If one table is more sparse with denormalization, I'd suggest finding a way to invert the query to break the larger table down first.
Every entry in table B has an entry in table A. But there can be many Bs for each A. In theory, each A could hold a large amount of data - but in practice they don't. That is, each entry in A has a description, title, and a few other things, but 99.9% of them have no description or title.

Quote:
If approximate is good enough, why not build fake materialized views? Just run a cron every hour or whatever and have it compute the entire data set. You can then just query that table with a straightforward select where x between y and z. You can also index this table, and possibly do some processing on the data to reduce the like queries. The trade-off is that you end up taking a massive hit to compute the entire set each time. You could probably do the computation in one table, then after that computation is done, run a trigger to update all the values in the table you are querying.
This would be easier than denormalizing and probably faster. But if I'm going to do this, I think it would be even easier to just make an index table. Like say there are 5 columns I search. For each row, I add 5 rows to the index table. Each row has a search value, type, and the id of the row it came from (so I can eliminate duplicates).

I'll have to still do full table scans of the index table, and it'll be 5x as many rows, but the data will be much more compact in the sense that it'll only contain fields that I'm searching, which might help.
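
Roughly this shape, I think (made-up names):

Code:
-- one row per (source row, searched column)
create table search_index (
    search_value varchar(255),
    type         varchar(64),
    source_id    bigint,    -- id of the row it came from, for deduping
    key idx_value_type (search_value, type, source_id)
);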


Quote:
Originally Posted by Sholar
Code:
select * from tbl where rand() < 0.001
is good enough and O(N) whereas sorting by rand() and limiting is O(N log N) and it sounds like you don't really care much about exactly how many rows come back. (Can always patch this up a bit to be based on count(1) for the table if you don't have an a priori sense of the size of the table etc.)
If I added my own randcol to the table, and indexed it, then finding randcol < whatever should be O(N) and much faster than scanning the whole table, I think. But actually let me just try it and see if it's a significant improvement, because that would not require me to change this table.
04-22-2017 , 04:28 PM
OK, so, this surprised me...

Code:
select type, count(*) from (
    select type from mytable where myquery order by rand() limit 10001
) t
group by 1
is twice as fast as my normal query, and also twice as fast as

Code:
select type, count(*) from (
    select type from mytable where rand() < 0.001 and myquery
) t
group by 1
and

Code:
select type, count(*) from mytable where rand() < 0.001 group by 1

Last edited by RustyBrooks; 04-22-2017 at 04:33 PM.
04-22-2017 , 04:33 PM
Agreed: it should be *much* faster if you can add and index a random number column (better than O(N)). I had assumed somewhere along the way that you couldn't make changes to the table.
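
E.g. something like this, with made-up names again; the index turns the sample into a range scan over roughly 0.1% of the table instead of evaluating rand() for every row:

Code:
select type, count(*) * 1000 as approx_cnt
from mytable
where randcol < 0.001 and myquery
group by type;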
04-22-2017 , 04:36 PM
Quote:
Originally Posted by Sholar
Agreed: it should be *much* faster if you can add and index a random number column (better than O(N)). I had assumed somewhere along the way that you couldn't make changes to the table.
Changing the table is a PITA but possible.