** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD **

04-20-2017 , 06:25 AM
Quote:
Originally Posted by kerowo
Couldn't you do something like
select type from table where count(type) > 10;

But you wanted all types, can you short circuit SQL and an OR count(type) <= 10?
Something like this seems like it would work, but Rusty seems to be asking about something different.

Something like count(type) <= 10 and type in ('a', 'b', 'c') would work in the base case, which short-circuits the upper bound and the types in a single where clause. But he does want to know about the things that have more than 10. He could possibly do an EXCEPT clause like so:

Code:
create temp table t_tbl as
select col, count(*) cnt
from tbl
where cnt <= 10;

select col, 11
from 
    (select col
     from tbl

     except

     select col
     from t_tbl)

union

select col, cnt
from t_tbl;
I hate this though.

I think this is different because he seems to be asking about a query that covers many large tables with no indexes. I don't know why they "can't" be indexed, unless he is dealing in JSON (does JSON index in MySQL?) or some other manually cooked-up data type. If it is the latter, there probably won't be anything that will make the query faster short of rebuilding the database correctly or pre-populating a materialized view. Granted, the memory blocks may be too large to index, but it's hard to judge without measuring, so I'm taking his word for this.

Since there is no indexing, the query is going to be n (size of the table) * m (number of types). The grouping and calculation will then be m*2, so it'll be very slow without some extra help from the query. The short-circuiting I'm suggesting takes the large n and m into consideration by initializing smaller numbers.

Since he is querying across many tables, I'm trying to show something that can be built up to that. Overall, the strategy would be to break the largest table into a smaller table, and query against it with the other tables.

I may be a bit off on my interpretation of the situation and query though...
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 08:47 AM
Quote:
Originally Posted by CyberShark93
So, some follow up, they sent me the schedule, basically I need to be there from 10:00 to 3:30? Which means 5 and a half hours?? How can an interview ever last that long? Has anyone had a marathon interview like this before?
In my experience they are this long or longer for most big companies.

It looks like your interview spans lunch, so you probably won't be getting grilled during that time. You'll probably talk to 3-4 people on the team for an hour each plus lunch.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 08:53 AM
Consider denormalising your data if queries to count things are way too slow. (Surprised you can't add an index though as a first step before doing that?)

I think designing queries that return whether there are more or fewer than x rows is perhaps starting to stray into mixing data and logic; I like to keep the DB as dumb as possible. E.g., it should always return the exact count you're querying, then let your app/reporting program decide how to display that data (e.g. broken down into ranges where x <= 10 or x > 10). Designing from the ground up with this in mind will make things easier in the future, I think.
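A minimal sketch of that split, assuming a hypothetical table tbl with a type column - the DB returns exact counts and the app does the bucketing:

Code:
-- DB side: exact counts only (names hypothetical)
SELECT type, COUNT(*) AS cnt
FROM tbl
GROUP BY type;

-- app side then buckets the rows, e.g. cnt <= 10 vs cnt > 10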
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 10:10 AM
The data can't be indexed because
a) it's too large (text fields that are above the limit of mysql's indexing)
b) the data must be glob searched, like '%foo%'
c) full-text searching is not sufficient, since often what is being searched for needs an exact match and is not english or other language data
d) there are actually several columns that meet this requirement
e) mysql actually *really* starts to suck at large table sizes. Like...
select count(*) from mytable
takes about 1s on a table with 10 million rows. How is this even possible? The size of the primary key index is surely stored somewhere - this should be a very fast constant-size lookup. It doesn't get faster on repeated queries either. Explain plan says "select_type=simple, Select tables optimized away"

I think if I can't get this under control I'll end up making an "index" table that collapses all the columns into one column. I may actually make several of these tables, partitioned by date, and query them in parallel (older data is much less useful in our case than newer data so I could stream results back as they come in, starting with newer data first)
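A minimal sketch of what that collapsed "index" table could look like, assuming MySQL (all names hypothetical):

Code:
-- one searchable blob per source row, one table per date range
CREATE TABLE search_idx_2017_04 (
    row_id    BIGINT     NOT NULL,
    blob_text MEDIUMTEXT NOT NULL  -- all searchable columns concatenated
);

INSERT INTO search_idx_2017_04 (row_id, blob_text)
SELECT id, CONCAT_WS(' ', col_a, col_b, col_c)
FROM mytable
WHERE created >= '2017-04-01' AND created < '2017-05-01';

-- each query then scans one narrow table per date range, newest first,
-- streaming results back as they arrive:
SELECT row_id FROM search_idx_2017_04 WHERE blob_text LIKE '%foo%';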
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 10:13 AM
Quote:
select col, count(*) cnt
from tbl
where cnt <= 10;
You actually mean 'having', not 'where', in your query, and it won't work because in order to answer the question of whether cnt <= 10, you have to perform the group by on the entire table - there's no short circuit
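For reference, the corrected form would presumably be something like this (it still groups the whole table, so it doesn't buy any speed):

Code:
create temporary table t_tbl as
select col, count(*) cnt
from tbl
group by col
having count(*) <= 10;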
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 10:29 AM
Dave, I was excited about your option but I think it won't work. I tried the smallest version of it:

Code:
select i.type, count(*)
from mytable out
where exists (select type from mytable in where  t.type='A' and t.type=i.type limit 11)
group by 1
it's been about 2 minutes and it hasn't finished yet. I suspect it's running this against the table for every row
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:15 AM
On the subject of tic-tac-toe, I've been helping a lot of people in the class set up their app to PATCH data to the provided API for each move and for the end of the game. Seeing how crazy some of the code they wrote is, and then trying to accurately pass each move to the API, has been really fun.

I was helping someone and asked the opinion of another strong person in the class, and she basically recommended re-writing a large part of the guy's app, which, yeah, is ideal, but he barely understands why what he wrote works, and lol at re-writing the whole thing with only a couple of days to pass the few requirements he is missing. I ended up figuring out how to make the code work in place and accurately pass the necessary data. It feels good to be able to look at someone else's monstrosity of if/else statements, with all kinds of function calls going every which way, and be able to understand all of it and make it work.

On a similar note, while 90-100% of the class has some super massive if/else statement to check if someone has won, this is my winning function, which I'm kinda proud of and going to continue to refactor (winner is an array of arrays of possible winning combos):

Code:
const playerWins = function (playerArray) {
  let won = false
  // winner holds every possible winning combo (an array of arrays)
  $.each(winner, function (index, value) {
    let winningCombos = 0
    // count how many of this combo's squares the player occupies
    $.each(playerArray, function (i, v) {
      if (value.includes(v)) {
        winningCombos++
      }
    })
    // occupying all three squares of a combo means a win
    if (winningCombos === 3) {
      won = true
    }
  })
  return won
}
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:18 AM
Not to be that guy, but avoid jQuery each in favor of the native array forEach. Also, if you are already using let, use arrow functions as well.
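For example, a sketch of the same playerWins with native array methods and arrows (assuming the same winner array of combos); some/every even make the explicit counter unnecessary:

Code:
// true if any winning combo is fully contained in the player's squares
const playerWins = (playerArray) =>
  winner.some((combo) =>
    combo.every((square) => playerArray.includes(square)))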
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:24 AM
No please give criticism!

I was wondering about jQuery each vs forEach. I did some googling the other day and it seems like the jQuery each might technically be "faster", but intuitively it didn't really make sense to me to use jQuery each there; even after my googling, though, I wasn't sure why I should change it.

Yea, I need to improve my use of arrow functions.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:36 AM
You should change it so that if/when you switch from jquery to something else, you don't have to refactor everything. I hate it when (in my languages of choice) someone uses a library function when there is an equally good builtin.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:38 AM
Yea, that makes sense. I was thinking (well, jquery is kinda standard), but I realize that is a lazy/bad thought process.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:11 PM
Rusty: this isn't actually helpful, but just FWIW, a couple years ago I was working on a project that involved running queries on a fairly large table where I wanted to get a lot of counts of sub-types (stuff like SUM(CASE WHEN a=b AND c=d THEN 1 ELSE 0 END) as col1, SUM(CASE WHEN ... END) as col2, ...). I had similar issues with mysql, but I ended up moving the large tables into an amazon redshift instance, which handled it quite a bit better because the data in each column is stored (and compressed) separately, and there are a lot of options for how the data is sorted, which makes it easier to search.
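Spelled out, that conditional-aggregation pattern looks something like this (table/column names hypothetical):

Code:
-- one pass over the table, many conditional counts at once
SELECT
    SUM(CASE WHEN a = b AND c = d THEN 1 ELSE 0 END) AS col1,
    SUM(CASE WHEN e = f THEN 1 ELSE 0 END)           AS col2,
    COUNT(*)                                         AS total
FROM big_table;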

I'm by no means an expert but it strikes me that you may be running up against some limitations in performance for that kind of aggregation in MySQL, especially absent indexes.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:15 PM
I would love to put this into redshift. I might get them to try that, but since I've gotten performance to reasonable levels in the last week, probably not. Queries were taking more than 60s; I have the 90th percentile under 1s and the 99th under 5s, so for now I guess we're going to live with it.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:27 PM
Quote:
Originally Posted by CyberShark93
So, some follow up, they sent me the schedule, basically I need to be there from 10:00 to 3:30? Which means 5 and a half hours?? How can an interview ever last that long? Has anyone had a marathon interview like this before?
This is pretty standard for big companies. You will probably interview with 3 to 5 devs for 45 to 60 minutes each. You may have a short sync with the recruiter at the beginning or end, and you may have a lunch break on your own or an interview lunch with a member of the team. Lunch interviews are usually more 'behavioral' rather than whiteboard/technical.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:30 PM
Haha I have had interviews that lasted over 8 hours. That is just the industry standard now. If it doesn't cost your interviewee $500 to come interview then you aren't doing your job!
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:33 PM
Quote:
Originally Posted by Larry Legend
Yea, that makes sense. I was thinking (well, jquery is kinda standard), but I realize that is a lazy/bad thought process.
I don't think it's a bad thought process per se - it's okay to use standard dependencies in lieu of built-ins - but the idea of jQuery as being standard is somewhat dated. It is only standard on the browser side in hybrid/server-side web apps. It's not standard in Node, for SPAs, or for building widgets/scripts/modules that have to work across different contexts.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:41 PM
You could do some pre-processing on the data (if you're only ever interested in, or frequently query, the entire data set) as a form of denormalisation, e.g.:

- Loop each record
- Extract the text
- Break into words
- Create a new table WordFrequencies (Word (PK), Count (int))
- Update WordFrequencies table

Every time you change a value in a record:

- Count words in current row, deduct from WordFrequencies
- Count words in new text, increment WordFrequencies

Then you can query:

SELECT Sum(Count) FROM WordFrequencies WHERE Word = 'This' OR Word = 'That'

And it will be super fast.
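A minimal sketch of that table and the incremental update, assuming MySQL (cnt instead of Count to dodge the reserved word):

Code:
CREATE TABLE WordFrequencies (
    Word VARCHAR(255) NOT NULL PRIMARY KEY,
    cnt  INT          NOT NULL DEFAULT 0
);

-- when (re)processing a record, bump each of its words:
INSERT INTO WordFrequencies (Word, cnt)
VALUES ('foo', 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;

-- the fast aggregate query:
SELECT SUM(cnt) FROM WordFrequencies WHERE Word IN ('This', 'That');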

Again, this is only good if you're querying the entire data set (if you want to add another condition to your where it won't work, or you need to work out a more sophisticated denormalisation design).

If you have multiple columns in your data set to query in this way, you can do something like:

WordFrequencies (Word (PK), ColumnID(PK), Count (int))

where ColumnID is a unique identifier for the column. Then you can query:

SELECT Sum(Count) FROM WordFrequencies WHERE (Word = 'This' AND ColumnID = 1) OR (Word = 'That' AND ColumnID = 2)

Combined with EXISTS, this can give you a pretty robust and fast solution.

Also, I think you can do something along the lines of:

SELECT Sum(Count) FROM WordFrequencies WHERE Word LIKE '%foo%'

to do partial matching on words (e.g. someone's email address tom@example.com would be considered a word, but you could query Word LIKE '%@example.com').

Adding full-text search to the word column might give better performance if you're doing this.

Last edited by Gullanian; 04-20-2017 at 12:47 PM.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:56 PM
Quote:
- Break into words
- Create a new table WordFrequencies (Word (PK), Count (int))
- Update WordFrequencies table
The data isn't words and you can't do word searches on it.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:57 PM
Also, a crazy idea that won't work for lots of reasons but is one of my favourite fun things to try and misapply wherever I can:

- Assign each word a unique prime number >= 3
- Multiply all the words' primes together to give a value for a chunk of text

Then you can query:

SELECT Count(*) FROM Table WHERE PrimeValue % WordPrimeValue = 0
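A tiny worked example of the trick, with made-up primes (the Texts table and the values are hypothetical):

Code:
-- say 'the' = 3, 'cat' = 5, 'dog' = 7
-- 'the cat' -> 3 * 5 = 15, 'the dog' -> 3 * 7 = 21
-- rows containing 'cat' are exactly those whose value is divisible by 5:
SELECT COUNT(*) FROM Texts WHERE PrimeValue % 5 = 0;
The product overflows any integer column almost immediately for real text, which is one of the lots of reasons.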
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:58 PM
Quote:
Originally Posted by RustyBrooks
The data isn't words and you can't do word searches on it.
What are you trying to search for exactly, and what is the data?
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:00 PM
I wonder what the largest search string someone has used is. I could maybe make an index of every suffix of the string, index that, and do prefix searches on it. Like if my column is abcdef I could put
abcdef
bcdef
cdef
def
ef
f
into the indexed column and search for index_col like 'foo%'
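A sketch of that suffix table, with hypothetical names - the point being that LIKE 'foo%' can use a btree index while LIKE '%foo%' can't:

Code:
CREATE TABLE suffixes (
    row_id BIGINT       NOT NULL,
    suffix VARCHAR(255) NOT NULL,
    INDEX (suffix)
);

-- for a row whose value is 'abcdef', insert the suffixes
-- 'abcdef', 'bcdef', 'cdef', 'def', 'ef', 'f'

SELECT DISTINCT row_id
FROM suffixes
WHERE suffix LIKE 'foo%';  -- prefix match -> index range scan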
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:01 PM
Quote:
Originally Posted by Gullanian
What are you trying to search for exactly, and what is the data?
It's fairly arbitrary. Some of it is paragraphs of text, some of it json or xml or other data formats. But if someone wants to search for "192.168.1.1/24:eth0" then I need to return matches. A lot of it is network security rules, so there is all kinds of crap in there. Most of it is not even particularly human readable.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:08 PM
If you want to improve performance significantly, I think you need to work out a way to tokenise the data somehow, and then do something similar to what I explained above.

Tokenising on newlines or spaces would work in most cases, I would have thought (my word example above should probably have been called token).

If you can tokenise it, you might find you have a token:
192.168.1.1/24:eth0

But you can still query:
SELECT Sum(Frequency) FROM TokenFrequencies WHERE Token LIKE '192.168%:e%'

against a table like:

TokenFrequencies (Token (string, PK), Frequency (int))
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:56 PM
Yeah, but it could be
'ip': 192.168.1.1
or <ip>192.168.1.1</ip>
or who knows what. And someone might want to match on "<ip>192.168.1.2</ip><domain>foo.co</domain>".
They also might want to look for ".1.1" or ".1/24" or "160.1" and so forth.


Whitespace or carriage returns might be nonexistent - aside from titles/descriptions, most of this is machine generated and/or scraped from APIs, web pages, and a whole host of 3rd-party software.

The industry, if it matters, is network security, and we are basically getting data from any source possible on a particular threat, and shoving it into this database.

We actually have a DSL for search, and that does pretty well, but people mostly don't use it; they just paste whatever they're searching for into the search bar and that's about it. I tried reducing the number of fields that are searched in the default search and people pitched a fit.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:58 PM
Quote:
Originally Posted by RustyBrooks
Dave, I was excited about your option but I think it won't work. I tried the smallest version of it:

Code:
select i.type, count(*)
from mytable out
where exists (select type from mytable in where  t.type='A' and t.type=i.type limit 11)
group by 1
it's been about 2 minutes and it hasn't finished yet. I suspect it's running this against the table for every row
I'm guessing this is syntax stuff, but this is what the query looks like expanded:

Code:
select i.type, count(*)
from mytable out
where exists
     (select type
      from mytable in
      where t.type='A' and t.type=i.type
      limit 11)
group by 1;
I'm not sure if this is just an error in your translation, but there don't seem to be any table aliases, which means there is no "t" or "i" to work with. It's a bit confusing to me because I'm used to IN and OUT being keywords.

What is the time on the subselect just running against type 'A' with limit 11? If that's slow, the rest is going to go down with it, no matter what. You may be able to speed up the query by putting an order by type clause in the subselect. You'd have to test against that as well.
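For what it's worth, with aliases actually declared (o/i instead of the keyword-looking out/in), the query as written would presumably be something like:

Code:
select o.type, count(*)
from mytable o
where exists
     (select 1
      from mytable i
      where i.type = 'A' and i.type = o.type)
group by 1;
Note that EXISTS stops at the first matching row anyway, so the limit 11 inside it does nothing.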
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote