** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD **

04-21-2017 , 02:22 PM
As a general rule, "where in" is going to be much slower than "where exists." Where in generally requires the id to be a unique or PK index to work fast.

The "where exists" is more flexible on this rule.

I don't know too much about the MySQL processing language, but are you able to do something like this?

Code:
create function get_stuff (i_type)
returns table (type, cnt)

loop (x, i_type)
    select type, count(*)
    from tbl 
    where type = x
    group by type;
end loop;

---

get_stuff ('a', 'b', 'c', 'd');
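
If MySQL will take it, a concrete version might be a procedure over a comma-separated list; I don't think MySQL has arrays, and I don't think its functions can return result sets, so this is only a guess at the syntax:

Code:
delimiter //

create procedure get_stuff(in i_types text)   -- e.g. 'a,b,c,d'
begin
    declare v_type varchar(64);
    declare v_rest text;
    set v_rest = i_types;

    drop temporary table if exists stuff_counts;
    create temporary table stuff_counts (type varchar(64), cnt bigint);

    -- walk the comma-separated list, one count per type
    while length(v_rest) > 0 do
        set v_type = substring_index(v_rest, ',', 1);
        set v_rest = if(locate(',', v_rest) > 0,
                        substring(v_rest, locate(',', v_rest) + 1),
                        '');
        insert into stuff_counts
        select v_type, count(*) from tbl where type = v_type;
    end while;

    select * from stuff_counts;
end //

delimiter ;

call get_stuff('a,b,c,d');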
04-21-2017 , 02:25 PM
Quote:
Originally Posted by daveT
I don't know too much about the MySQL processing language, but are you able to do something like this?
Me neither. But would you actually expect that to perform better?
04-21-2017 , 02:48 PM
If the inner query is performing at 20ms, it seems the problem is the outer query. You said the unions are working a bit better.

It's kind of lower-level, but the general idea is to a) encapsulate the query in a function and/or b) do something strange with a while loop.

Something like this:

Code:
create function some_fun (type array)
returns table (type, cnt)

int counter = 0;

for (x as select type from tbl)
   counter += 1;
   if counter == 11:
       return (x, counter);
   end if;
end for;
I know I've done stuff like this in Postgres and it was way faster and handled larger data sets, but obviously I can't say whether it would be faster or slower here.
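
For what it's worth, the per-type short-circuit itself can be written without the procedure; the capped derived table is what lets the engine stop early (a sketch, assuming an index on type):

Code:
-- count at most 11 rows for one type, then stop scanning
select count(*) as cnt
from (select 1 from tbl where type = 'a' limit 11) t;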

edit to do a cleaner thing.

Last edited by daveT; 04-21-2017 at 02:54 PM.
04-21-2017 , 05:02 PM
Quote:
Originally Posted by RustyBrooks
I have half a mind to write all this **** to files and try grep on it. I'm not even kidding.
This is exactly what popped into my head when you got further into the details of the problem. Is this a single server running mysql or a mysql cluster?
04-21-2017 , 08:47 PM
Quote:
Originally Posted by blacklab
This is exactly what popped into my head when you got further into the details of the problem. Is this a single server running mysql or a mysql cluster?
It's a single instance of AWS RDS. Our webservers are many, though.

Sent from my Nexus 5X using Tapatalk
04-22-2017 , 09:47 AM
Quote:
Originally Posted by RustyBrooks
Haha I have had interviews that lasted over 8 hours. That is just the industry standard now. If it doesn't cost your interviewee $500 to come interview then you aren't doing your job!


Need to add the hours for your coding exercise that has you design and prototype twitter.

Quote:
Originally Posted by blacklab
This is exactly what popped into my head when you got further into the details of the problem. Is this a single server running mysql or a mysql cluster?
I am probably missing something here, but is this just trying to trick SQL into doing a short-circuitable full-table scan? And trying to optimize average but not worst-case performance? It feels like the non-SQL code for this is pretty trivial so if this is a common task, maybe duplicating the storage and doing the retrieval in another way is the right approach.
04-22-2017 , 11:25 AM
Quote:
Originally Posted by Sholar
I am probably missing something here, but is this just trying to trick SQL into doing a short-circuitable full-table scan? And trying to optimize average but not worst-case performance? It feels like the non-SQL code for this is pretty trivial so if this is a common task, maybe duplicating the storage and doing the retrieval in another way is the right approach.
You aren't missing anything at all. That's exactly what I'm proposing, though I'm trying to optimize against the worst case. It seems like the worst case (11 rows) is going to show up more often than the best case (< 11 rows). The average seems pretty much the same as the worst case, right?

The queries with limit 11 are running at 20ms, so it doesn't seem like a bad idea to just call from Python on each type then append the results to the output list instead, which is the same thing I'm trying to show in the procedural code. Rusty has said this query is supposed to run across many tables, so it may end up with a ton of back-and-forth.

The problem, as I'm seeing it, is that the large tables are using up all the allocated memory chunks to work. A soaking wet napkin calculation:

400M rows.

Max Memory Chunk: 10K rows

Needed Memory Chunks for naive calculation: 40,000.

The query we're specifically running through is a single table. Doing the short-circuit is pushing the size of the seed data to 11, which would free up a ton of memory chunks.

The problem is the outer query, which is still using up many memory chunks, but quite a bit fewer than 40,000. If this was able to be indexed, the database would know what chunks to look at and it would go much faster, but this depends on what tables the outer query is using.

I'm not sure about the procedural code because I don't know two basic facts about MySQL:

1. Can the query optimizer "see" inside the procedural code? If it can, other ideas may work, but the looping should use the least memory and shouldn't depend on the query optimizer. I'm pretty confident about the basic idea.

2. How do arrays work in MySQL? Are they fast, do they even exist? If not, what is the alternative and how does it behave? I'm not confident at all about this.

I don't have a lot of domain knowledge about MySQL. I was trying to show more universal solutions first. The missing piece is CTEs and, in particular, recursive CTEs, which the procedure is attempting to replace. Making successive calls from Python and appending to a list is basically doing the same thing, albeit without the full benefit of MySQL's query optimizer.

This explains what I'm trying to do:

https://wiki.postgresql.org/wiki/Loose_indexscan

The main problem is trying to find a small segment of data among a very large segment of data. Even with indexes, this is often very slow, so a work-around is probably going to be needed.
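
The loose index scan on that page is built out of a recursive CTE, roughly like this (Postgres syntax; MySQL doesn't have CTEs, which is why I keep reaching for procedural workarounds):

Code:
-- hop from one distinct value of col to the next via the index,
-- instead of scanning every row
with recursive t as (
    select min(col) as col from tbl
    union all
    select (select min(col) from tbl where col > t.col)
    from t
    where t.col is not null
)
select col from t where col is not null;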
04-22-2017 , 11:52 AM
Dave, the "worst case" under any short-circuit method is when there are less than 11 results, because we have to scan the entire database to fill up to 11. The best case is when our results are very dense in the result set because we then get to quit early.

In the non-short circuit method, i.e. find the real total count, the best case is when the results are small because then my group by is short/easy and the worst case is when the results are large.

At least, I think so.

Doing each query on its own is an option but I'd have to think that the union all method would be faster. There are about 20 types - 20ms is the time the query takes in mysql itself, it might be more like 30ms when you make all the python round trips. 600ms wouldn't be too bad - I think currently the average is 1000-1500ms, which I am finding acceptable.
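
(To be concrete, the union all method I mean is roughly one capped branch per type, with myquery standing in for the actual search:)

Code:
select 'a' as type, count(*) as cnt
from (select 1 from mytable where myquery and type = 'a' limit 11) t
union all
select 'b', count(*)
from (select 1 from mytable where myquery and type = 'b' limit 11) t
-- ...and so on, one branch per type, about 20 in total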

Using an entirely different method to find the counts than the working version of the tables is probably going to be the right answer (and/or restructuring the data)

There are not a lot of tables in the query - really just 2 in most cases. But both tables are very large and in the non-short circuit case we often end up doing a join of most of the rows. The tables were initially split for a few reasons. In a way one table is the "base" table and the other table is "observations" or "additions" to the base. Since every observation shares the same base data, it made sense to keep the base separate, to save space. But I think it's probably costing us a lot in these search queries, especially when the query involves a column from each table, which requires every row to be joined before it can be eliminated or chosen.

Approximate counts would be totally fine for us, if there is something simple out there, I'd try it. Like I would be ecstatic if there was an easy way to say, I want counts by type of all the things that match this query, where the counts value is one of:
<10
10-100
100-1000
>1000
That would be like 99.9% perfect.

No one looks past the first page or 2 of results except googlebot. So the counts are really more useful just to get an idea if you have narrowed down the search enough to return relevant results, or to find out if a search term is really common/popular.

(in my industry, if you get a ton of results, it's probably a false positive)
04-22-2017 , 11:54 AM
Actually, this just occurred to me - I wonder if there's any way to sort a table that gives a random or pseudorandom ordering? If so I could do one query, limit by like 10k, and get a probabilistic count
04-22-2017 , 11:54 AM
So in this bootcamp they are spending a lot of time teaching us about classes/inheritance/prototypes, etc. (Which are also a major source of wasted time as these concepts are pretty intuitive and shouldn't take much explanation, but people ask for this stuff to be repeated 10 times to them).

I mentioned this to my friend and he said he never uses them. He said this is basically what OOP is, and he doesn't do it, but I shouldn't take his word for it, I should ask someone who loves OOP and think for myself.

My instructor basically responded that there is a lot of functional programming vs OOP out there. Which makes sense, but it seems like the "best" pattern would be to do some of both. So surely there are times to use both?
04-22-2017 , 11:55 AM
Quote:
Originally Posted by RustyBrooks
Actually, this just occurred to me - I wonder if there's any way to sort a table that gives a random or pseudorandom ordering? If so I could do one query, limit by like 10k, and get a probabilistic count
order by rand() in mysql
04-22-2017 , 12:00 PM
I guess I could just add a column, seed it with random numbers, and whenever I add a row to the table, add a random number there too. That would have the advantage of having stable results. The counts wouldn't be random, they'd be the same query over query except when new rows were added with random numbers smaller than the current random set.

You'd have the "problem" that there might be clusters of results that randomly have high random numbers and never appear in the counts. Like if you were saying
select type, count(*) from mytable where myquery order by randcol limit 10000
and all the rows with the word "blue" in them were really high, you might get funny cases that return results, but the counts show zero results.
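
Setting that up would be something like this (made-up column and index names):

Code:
-- one-time setup; new rows just need randcol filled in on insert
alter table mytable add column randcol double;
update mytable set randcol = rand();
create index idx_randcol on mytable (randcol);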
04-22-2017 , 12:00 PM
Quote:
Originally Posted by iversonian
order by rand() in mysql
This requires you to compute rand() for every row in the table
04-22-2017 , 12:00 PM
imo, wrt oop/fp

1. learn the concepts by the book and see examples
2. write ****ty code
3. start to identify common patterns and how you might solve a problem applying one paradigm vs another depending on the situation, and start to understand their relative advantages
(repeat steps 1-3)
4. have an opinion on "oop vs fp"
04-22-2017 , 12:27 PM
Quote:
Originally Posted by RustyBrooks
This requires you to compute rand() for every row in the table
The first result in Google had an interesting solution to what I think is a similar problem to yours: if the primary key is an autoincrement int, then take the max(id) and do the rand in the application to fetch records one by one.
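
Something like this, I think (assumes the autoincrement id doesn't have huge gaps; the rand() bit can also live in the application as the article suggests):

Code:
-- grab one pseudo-random row via the primary key
select t.*
from tbl t
join (select floor(1 + rand() * (select max(id) from tbl)) as rid) r
  on t.id >= r.rid
order by t.id
limit 1;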
04-22-2017 , 12:28 PM
Larry, what do the best students in your class have in common? Something that I might be able to tell from talking with them for an hour?
04-22-2017 , 12:49 PM
Quote:
Originally Posted by RustyBrooks
Dave, the "worst case" under any short-circuit method is when there are less than 11 results, because we have to scan the entire database to fill up to 11. The best case is when our results are very dense in the result set because we then get to quit early.
where be the ninja edit button. Yes, totally non-thinking statement there.

Quote:
In the non-short circuit method, i.e. find the real total count, the best case is when the results are small because then my group by is short/easy and the worst case is when the results are large.

At least, I think so.
It's hard to say without measurement. You have to do a full table scan with each query w/o indexes.

I'd guess that, if you are looking at exactly one item, that item could either be kept in memory with a counter that gets incremented, or there is a temp table with col1 = item, col2 = 1, then sum(col2). Then again, they probably have some funky data structure that I'm not aware of, which probably makes raw N computation the only consideration. These people are smart, so who knows what magic they are doing.

Quote:
Doing each query on its own is an option but I'd have to think that the union all method would be faster. There are about 20 types - 20ms is the time the query takes in mysql itself, it might be more like 30ms when you make all the python round trips. 600ms wouldn't be too bad - I think currently the average is 1000-1500ms, which I am finding acceptable.

Using an entirely different method to find the counts than the working version of the tables is probably going to be the right answer (and/or restructuring the data)

There are not a lot of tables in the query - really just 2 in most cases. But both tables are very large and in the non-short circuit case we often end up doing a join of most of the rows. The tables were initially split for a few reasons. In a way one table is the "base" table and the other table is "observations" or "additions" to the base. Since every observation shares the same base data, it made sense to keep the base separate, to save space. But I think it's probably costing us a lot in these search queries, especially when the query involves a column from each table, which requires every row to be joined before it can be eliminated or chosen.
If these tables are nearly the same size, it is a good candidate for denormalization. If you end up with 25% null space or so, it may not make a real difference. If one table is more sparse with denormalization, I'd suggest finding a way to invert the query to break the larger table down first.

Quote:
Approximate counts would be totally fine for us, if there is something simple out there, I'd try it. Like I would be ecstatic if there was an easy way to say, I want counts by type of all the things that match this query, where the counts value is one of:
<10
10-100
100-1000
>1000
That would be like 99.9% perfect.
If approximate is good enough, why not build fake materialized views? Just run a cron every hour or whatever and have it compute the entire data set. You can then just query that table with a straightforward select where x between y and z. You can also index this table, and possibly do some processing on the data to reduce the like queries. The trade-off is that you end up taking a massive hit to compute the entire set each time. You could probably do the computation in one table, then after that computation is done, run a trigger to update all the values in the table you are querying.

https://dba.stackexchange.com/questi...-view-in-mysql
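
A minimal sketch of what I mean, with a made-up summary table (this only covers totals per type; folding in the per-search filtering is the harder part):

Code:
create table type_counts (
    type         varchar(64) primary key,
    cnt          bigint,
    refreshed_at datetime
);

-- the hourly cron (or a mysql EVENT): recompute everything in one shot
replace into type_counts
select type, count(*), now()
from mytable
group by type;

-- reads then become a cheap indexed lookup
select type, cnt
from type_counts
where cnt between 10 and 100;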
04-22-2017 , 12:59 PM
Quote:
Originally Posted by Larry Legend
So in this bootcamp they are spending a lot of time teaching us about classes/inheritance/prototypes, etc. (Which are also a major source of wasted time as these concepts are pretty intuitive and shouldn't take much explanation, but people ask for this stuff to be repeated 10 times to them).

I mentioned this to my friend and he said he never uses them. He said this is basically what OOP is, and he doesn't do it, but I shouldn't take his word for it, I should ask someone who loves OOP and think for myself.

My instructor basically responded that there is a lot of functional programming vs OOP out there. Which makes sense, but it seems like the "best" pattern would be to do some of both. So surely there are times to use both?
Classes and inheritance are very much a part of FP.

https://en.wikibooks.org/wiki/Haskell/Classes_and_types

FP is simply an approach to removing global mutation and side effects. If anyone gives you any definition that doesn't specifically talk about time and mutation, it is wrong. I don't even think people who are strong proponents of OOP will tell you to go hog-wild with global mutation.
04-22-2017 , 01:02 PM
I suspect maybe procedural programming was meant somewhere?
04-22-2017 , 01:35 PM
Quote:
Originally Posted by RustyBrooks
Dave, the "worst case" under any short-circuit method is when there are less than 11 results, because we have to scan the entire database to fill up to 11. The best case is when our results are very dense in the result set because we then get to quit early.
Yup.

Quote:
Approximate counts would be totally fine for us, if there is something simple out there, I'd try it
Quote:
Originally Posted by RustyBrooks
Actually, this just occurred to me - I wonder if there's any way to sort a table that gives a random or pseudorandom ordering? If so I could do one query, limit by like 10k, and get a probabilistic count
Yeah, if you don't mind approximate answers, then this is best. Especially since you can always enable the "fast approximate" thing first and do the slower check only if you have ~0 results and need to be sure that there's nothing.

If you have something like TABLESAMPLE then it's very easy. Otherwise, just use rand() and skip the sort. E.g.

Code:
select * from tbl where rand() < 0.001
is good enough and O(N) whereas sorting by rand() and limiting is O(N log N) and it sounds like you don't really care much about exactly how many rows come back. (Can always patch this up a bit to be based on count(1) for the table if you don't have an a priori sense of the size of the table etc.)
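
For the counts-by-type case that would be something like the following, scaled back up by the sampling rate (reusing the mytable / myquery placeholders from earlier in the thread):

Code:
-- keep ~0.1% of matching rows, then multiply the counts back up
select type, count(*) * 1000 as approx_cnt
from mytable
where myquery and rand() < 0.001
group by type;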
04-22-2017 , 02:29 PM
Quote:
Originally Posted by _dave_
I suspect maybe procedural programming was meant somewhere?
Yeah, back in DeVry when I was learning COBOL...
04-22-2017 , 04:08 PM
Quote:
Originally Posted by daveT
If these tables are nearly the same size, it is a good candidate for denormalization. If you end up with 25% null space or so, it may not make a real difference. If one table is more sparse with denormalization, I'd suggest finding a way to invert the query to break the larger table down first.
Every entry in table B has an entry in table A. But there can be many Bs for each A. In theory, each A could hold a large amount of data - but in practice they don't. That is, each entry in A has a description, title, and a few other things, but 99.9% of them have no description or title.

Quote:
If approximate is good enough, why not build fake materialized views? Just run a cron every hour or whatever and have it compute the entire data set. You can then just query that table with a straightforward select where x between y and z. You can also index this table, and possibly do some processing on the data to reduce the like queries. The trade-off is that you end up taking a massive hit to compute the entire set each time. You could probably do the computation in one table, then after that computation is done, run a trigger to update all the values in the table you are querying.
This would be easier than denormalizing and probably faster. But if I'm going to do this, I think it would be even easier to just make an index table. Like say there are 5 columns I search. For each row, I add 5 rows to the index table. Each row has a search value, type, and the id of the row it came from (so I can eliminate duplicates).

I'll have to still do full table scans of the index table, and it'll be 5x as many rows, but the data will be much more compact in the sense that it'll only contain fields that I'm searching, which might help.
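
Roughly this shape, I think (made-up names):

Code:
-- one row per (source row, searched column)
create table search_index (
    search_value varchar(255),
    type         varchar(64),
    source_id    bigint,    -- id of the row it came from, for deduping
    key idx_value_type (search_value, type, source_id)
);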


Quote:
Originally Posted by Sholar
Code:
select * from tbl where rand() < 0.001
is good enough and O(N) whereas sorting by rand() and limiting is O(N log N) and it sounds like you don't really care much about exactly how many rows come back. (Can always patch this up a bit to be based on count(1) for the table if you don't have an a priori sense of the size of the table etc.)
If I added my own randcol to the table, and indexed it, then finding randcol < whatever should be O(N) and much faster than scanning the whole table, I think. But actually let me just try it and see if it's a significant improvement, because that would not require me to change this table.
04-22-2017 , 04:28 PM
OK, so, this surprised me...

Code:
select type, count(*) from (
    select type from mytable where myquery order by rand() limit 10001
) t
group by 1
is twice as fast as my normal query, and also twice as fast as

Code:
select type, count(*) from (
    select type from mytable where rand() < 0.001 and myquery
) t
group by 1
and

Code:
select type, count(*) from mytable where rand() < 0.001 group by 1

Last edited by RustyBrooks; 04-22-2017 at 04:33 PM.
04-22-2017 , 04:33 PM
Agreed: it should be *much* faster if you can add and index a random number column (better than O(N)). I had assumed somewhere along the way that you couldn't make changes to the table.
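
E.g. something like this, with made-up names again; the index turns the sample into a range scan over roughly 0.1% of the table instead of evaluating rand() for every row:

Code:
select type, count(*) * 1000 as approx_cnt
from mytable
where randcol < 0.001 and myquery
group by type;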
04-22-2017 , 04:36 PM
Quote:
Originally Posted by Sholar
Agreed: it should be *much* faster if you can add and index a random number column (better than O(N)). I had assumed somewhere along the way that you couldn't make changes to the table.
Changing the table is a PITA but possible.