** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD **

04-20-2017 , 06:25 AM
Quote:
Originally Posted by kerowo
Couldn't you do something like
select type from table where count(type) > 10;

But you wanted all types, can you short circuit SQL and an OR count(type) <= 10?
Something like this seems like it would work, but Rusty seems to be asking about something different.

Something like count(type) <= 10 and type in ('a', 'b', 'c') would work in the base case, which short-circuits the upper bound and the types in a single where clause. But he does want to know about the things that have more than 10. He could possibly do an EXCEPT clause like so:

Code:
create temp table t_tbl as
select col, count(*) cnt
from tbl
where cnt <= 10;

select col, 11
from 
    (select col
     from tbl

     except

     select col
     from t_tbl)

union

select col, cnt
from t_tbl;
I hate this though.

I think this is different because he seems to be asking about a query that covers many large tables with no indexes. I don't know why they "can't" be indexed, unless he is dealing in JSON (does JSON index in MySQL?) or some other manually cooked-up data type. If it is the latter, there probably won't be anything that will make the query faster short of rebuilding the database correctly or pre-populating a materialized view. Granted, the memory blocks may be too large to index, but it's hard to judge without measuring, so I'm taking his word for this.

Since there is no indexing, the query is going to be n (size of the table) * m (number of types). The grouping and calculation will then be m*2, so it'll be very slow without some extra help from the query. The short-circuiting I'm suggesting takes the large n and m into consideration by initializing smaller numbers.

Since he is querying across many tables, I'm trying to show something that can be built up to that. Overall, the strategy would be to break the largest table into a smaller table, and query against it with the other tables.

I may be a bit off on my interpretation of the situation and query though...
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 08:47 AM
Quote:
Originally Posted by CyberShark93
So, some follow up, they sent me the schedule, basically I need to be there from 10:00 to 3:30? Which means 5 and a half hours?? How can an interview ever last that long? Has anyone had a marathon interview like this before?
In my experience they are this long or longer for most big companies.

It looks like your interview spans lunch, so you probably won't be getting grilled during that time. You'll probably talk to 3-4 people on the team for an hour each plus lunch.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 08:53 AM
Consider denormalising your data if queries to count things are way too slow. (Surprised you can't add an index though as a first step before doing that?)

I think designing queries that return whether there are more or fewer than x rows is perhaps starting to stray into mixing data and logic; I like to keep the DB as dumb as possible. E.g., it should always return the exact count you're querying, then let your app/reporting program decide how to display that data (e.g. broken down into ranges where x <= 10 or x > 10). Designing from the ground up with this in mind will make things easier in the future, I think.
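A minimal sketch of that split, assuming a hypothetical table tbl with a type column - the DB returns exact counts and the app does the bucketing:

Code:
-- DB side: exact counts only (names hypothetical)
SELECT type, COUNT(*) AS cnt
FROM tbl
GROUP BY type;

-- app side then buckets the rows, e.g. cnt <= 10 vs cnt > 10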
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 10:10 AM
The data can't be indexed because
a) it's too large (text fields that are above the limit of mysql's indexing)
b) the data must be glob searched, like '%foo%'
c) full-text searching is not sufficient, since often what is being searched for needs an exact match and is not english or other language data
d) there are actually several columns that meet this requirement
e) mysql actually *really* starts to suck at large table sizes. Like...
select count(*) from mytable
takes about 1s on a table with 10 million rows. How is this even possible? The size of the primary key index is surely stored somewhere - this should be a very fast constant-size lookup. It doesn't get faster on repeated queries either. Explain plan says "select_type=simple, Select tables optimized away"

I think if I can't get this under control I'll end up making an "index" table that collapses all the columns into one column. I may actually make several of these tables, partitioned by date, and query them in parallel (older data is much less useful in our case than newer data so I could stream results back as they come in, starting with newer data first)
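A minimal sketch of what that collapsed "index" table could look like, assuming MySQL (all names hypothetical):

Code:
-- one searchable blob per source row, one table per date range
CREATE TABLE search_idx_2017_04 (
    row_id    BIGINT     NOT NULL,
    blob_text MEDIUMTEXT NOT NULL  -- all searchable columns concatenated
);

INSERT INTO search_idx_2017_04 (row_id, blob_text)
SELECT id, CONCAT_WS(' ', col_a, col_b, col_c)
FROM mytable
WHERE created >= '2017-04-01' AND created < '2017-05-01';

-- each query then scans one narrow table per date range, newest first,
-- streaming results back as they arrive:
SELECT row_id FROM search_idx_2017_04 WHERE blob_text LIKE '%foo%';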
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 10:13 AM
Quote:
select col, count(*) cnt
from tbl
where cnt <= 10;
You actually mean 'having', not 'where', in your query, and it won't work because in order to answer the question of whether cnt <= 10, you have to perform the group by on the entire table - there's no short circuit
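For reference, the corrected form would presumably be something like this (it still groups the whole table, so it doesn't buy any speed):

Code:
create temporary table t_tbl as
select col, count(*) cnt
from tbl
group by col
having count(*) <= 10;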
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 10:29 AM
Dave, I was excited about your option but I think it won't work. I tried the smallest version of it:

Code:
select i.type, count(*)
from mytable out
where exists (select type from mytable in where  t.type='A' and t.type=i.type limit 11)
group by 1
it's been about 2 minutes and it hasn't finished yet. I suspect it's running this against the table for every row
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:15 AM
On the subject of tic-tac-toe, I've been helping a lot of people in the class set up their app to PATCH data to the provided API for each move and for the end of the game. Seeing how crazy some of the code they wrote is, and then trying to accurately pass each move to the API, has been really fun.

I was helping someone and asked the opinion of another strong person in the class, and she basically recommended re-writing a large part of the guy's app, which, yeah, is ideal, but he barely understands why what he wrote works, and lol at re-writing the whole thing with only a couple of days to pass the few requirements he is missing. I ended up figuring out how to make the code work in place and accurately pass the necessary data. It feels good to be able to look at someone else's monstrosity of if/else statements, with all kinds of function calls going every which way, and be able to understand all of it and make it work.

On a similar note, while 90-100% of the class has some super massive if/else statement to check if someone has won, this is my winning function, which I'm kinda proud of and going to continue to refactor (winner is an array of arrays of possible winning combos):

Code:
const playerWins = function (playerArray) {
  let won = false
  // winner holds every possible winning combo (an array of arrays)
  $.each(winner, function (index, value) {
    let winningCombos = 0
    // count how many of this combo's squares the player occupies
    $.each(playerArray, function (i, v) {
      if (value.includes(v)) {
        winningCombos++
      }
    })
    // occupying all three squares of a combo means a win
    if (winningCombos === 3) {
      won = true
    }
  })
  return won
}
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:18 AM
Not to be that guy, but avoid jQuery each in favor of the native array forEach. Also, if you are already using let, use arrow functions as well.
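For example, a sketch of the same playerWins with native array methods and arrows (assuming the same winner array of combos); some/every even make the explicit counter unnecessary:

Code:
// true if any winning combo is fully contained in the player's squares
const playerWins = (playerArray) =>
  winner.some((combo) =>
    combo.every((square) => playerArray.includes(square)))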
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:24 AM
No please give criticism!

I was wondering about jQuery each vs forEach. I did some googling the other day and it seems like the jQuery each might technically be "faster", but intuitively it didn't really make sense to me to use jQuery each there; even after my googling, though, I wasn't sure why I should change it.

Yea, I need to improve my use of arrow functions.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:36 AM
You should change it so that if/when you switch from jquery to something else, you don't have to refactor everything. I hate it when (in my languages of choice) someone uses a library function when there is an equally good builtin.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 11:38 AM
Yea, that makes sense. I was thinking (well, jquery is kinda standard), but I realize that is a lazy/bad thought process.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:11 PM
Rusty: this isn't actually helpful, but just FWIW, a couple years ago I was working on a project that involved running queries on a fairly large table where I wanted to get a lot of counts of sub-types (stuff like SUM(CASE WHEN a=b AND c=d THEN 1 ELSE 0 END) as col1, SUM(CASE WHEN ... END) as col2, ...). I had similar issues with mysql, but I ended up moving the large tables into an amazon redshift instance, which handled it quite a bit better because the data in each column is stored (and compressed) separately, and there are a lot of options for how the data is sorted, which makes it easier to search.
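Spelled out, that conditional-aggregation pattern looks something like this (table/column names hypothetical):

Code:
-- one pass over the table, many conditional counts at once
SELECT
    SUM(CASE WHEN a = b AND c = d THEN 1 ELSE 0 END) AS col1,
    SUM(CASE WHEN e = f THEN 1 ELSE 0 END)           AS col2,
    COUNT(*)                                         AS total
FROM big_table;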

I'm by no means an expert but it strikes me that you may be running up against some limitations in performance for that kind of aggregation in MySQL, especially absent indexes.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:15 PM
I would love to put this into redshift. I might get them to try that, but since I've gotten performance to reasonable levels in the last week, probably not. Queries were taking more than 60s; I have the 90th percentile under 1s and the 99th under 5s, so for now I guess we're going to live with it.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:27 PM
Quote:
Originally Posted by CyberShark93
So, some follow up, they sent me the schedule, basically I need to be there from 10:00 to 3:30? Which means 5 and a half hours?? How can an interview ever last that long? Has anyone had a marathon interview like this before?
This is pretty standard for big companies. You will probably interview with 3 to 5 devs for 45 to 60 minutes each. You may have a short sync with the recruiter at the beginning or end, and you may have a lunch break on your own or an interview lunch with a member of the team. Lunch interviews are usually more 'behavioral' rather than whiteboard/technical.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:30 PM
Haha I have had interviews that lasted over 8 hours. That is just the industry standard now. If it doesn't cost your interviewee $500 to come interview then you aren't doing your job!
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:33 PM
Quote:
Originally Posted by Larry Legend
Yea, that makes sense. I was thinking (well, jquery is kinda standard), but I realize that is a lazy/bad thought process.
I don't think it's a bad thought process per se - it's okay to use standard dependencies in lieu of built-ins - but the idea of jQuery as being standard is somewhat dated. It is only standard on the browser side in hybrid/server-side web apps. It's not standard in Node, for SPAs, or for building widgets/scripts/modules that have to work across different contexts.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:41 PM
You could do some pre-processing on the data (if you're only ever interested in, or frequently query, the entire data set) as a form of denormalisation, e.g.:

- Loop each record
- Extract the text
- Break into words
- Create a new table WordFrequencies (Word (PK), Count (int))
- Update WordFrequencies table

Every time you change a value in a record:

- Count words in current row, deduct from WordFrequencies
- Count words in new text, increment WordFrequencies

Then you can query:

SELECT Sum(Count) FROM WordFrequencies WHERE Word = 'This' OR Word = 'That'

And it will be super fast.
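A minimal sketch of that table and the incremental update, assuming MySQL (cnt instead of Count to dodge the reserved word):

Code:
CREATE TABLE WordFrequencies (
    Word VARCHAR(255) NOT NULL PRIMARY KEY,
    cnt  INT          NOT NULL DEFAULT 0
);

-- when (re)processing a record, bump each of its words:
INSERT INTO WordFrequencies (Word, cnt)
VALUES ('foo', 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;

-- the fast aggregate query:
SELECT SUM(cnt) FROM WordFrequencies WHERE Word IN ('This', 'That');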

Again, this is only good if you're querying the entire data set (if you want to add another condition to your where it won't work, or you need to work out a more sophisticated denormalisation design).

If you have multiple columns in your data set to query in this way, you can do something like:

WordFrequencies (Word (PK), ColumnID(PK), Count (int))

where ColumnID is a unique identifier for the column. Then you can query:

SELECT Sum(Count) FROM WordFrequencies WHERE (Word = 'This' AND ColumnID = 1) OR (Word = 'That' AND ColumnID = 2)

Combined with EXISTS, this can give you a pretty robust and fast solution.

Also, I think you can do something along the lines of:

SELECT Sum(Count) FROM WordFrequencies WHERE Word LIKE '%foo%'

to do partial matching on words (e.g. someone's email address tom@example.com would be considered a word, but you could query Word LIKE '%@example.com').

Adding full-text search to the word column might give better performance if you're doing this.

Last edited by Gullanian; 04-20-2017 at 12:47 PM.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:56 PM
Quote:
- Break into words
- Create a new table WordFrequencies (Word (PK), Count (int))
- Update WordFrequencies table
The data isn't words and you can't do word searches on it.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:57 PM
Also, a crazy idea that won't work for lots of reasons but is one of my favourite fun things to try and misapply wherever I can:

- Assign each word a unique prime number >= 3
- Multiply all the words' primes together to give a value for a chunk of text

Then you can query:

SELECT Count(*) FROM Table WHERE PrimeValue % WordPrimeValue = 0
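A tiny worked example of the trick, with made-up primes (the Texts table and the values are hypothetical):

Code:
-- say 'the' = 3, 'cat' = 5, 'dog' = 7
-- 'the cat' -> 3 * 5 = 15, 'the dog' -> 3 * 7 = 21
-- rows containing 'cat' are exactly those whose value is divisible by 5:
SELECT COUNT(*) FROM Texts WHERE PrimeValue % 5 = 0;
The product overflows any integer column almost immediately for real text, which is one of the lots of reasons.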
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 12:58 PM
Quote:
Originally Posted by RustyBrooks
The data isn't words and you can't do word searches on it.
What are you trying to search for exactly, and what is the data?
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:00 PM
I wonder what the largest search string someone has used is. I could maybe make an index of every suffix of the string, index that, and do prefix searches on it. Like if my column is abcdef I could put
abcdef
bcdef
cdef
def
ef
f
into the indexed column and search for index_col like 'foo%'
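A sketch of that suffix table, with hypothetical names - the point being that LIKE 'foo%' can use a btree index while LIKE '%foo%' can't:

Code:
CREATE TABLE suffixes (
    row_id BIGINT       NOT NULL,
    suffix VARCHAR(255) NOT NULL,
    INDEX (suffix)
);

-- for a row whose value is 'abcdef', insert the suffixes
-- 'abcdef', 'bcdef', 'cdef', 'def', 'ef', 'f'

SELECT DISTINCT row_id
FROM suffixes
WHERE suffix LIKE 'foo%';  -- prefix match -> index range scan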
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:01 PM
Quote:
Originally Posted by Gullanian
What are you trying to search for exactly, and what is the data?
It's fairly arbitrary. Some of it is paragraphs of text, some of it json or xml or other data formats. But if someone wants to search for "192.168.1.1/24:eth0" then I need to return matches. A lot of it is network security rules, so there is all kinds of crap in there. Most of it is not even particularly human readable.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:08 PM
If you want to improve performance significantly, I think you need to work out a way to tokenise the data somehow, and then do something similar to what I explained above.

Tokenising on newlines or spaces would work in most cases, I would have thought (my word example above should probably have been called token).

If you can tokenise it, you might find you have a token:
192.168.1.1/24:eth0

But you can still query:
SELECT Sum(Frequency) FROM TokenFrequencies WHERE Token LIKE '192.168%:e%'

against a table like:

TokenFrequencies (Token (string, PK), Frequency (int))
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:56 PM
Yeah, but it could be
'ip': 192.168.1.1
or <ip>192.168.1.1</ip>
or who knows what. And someone might want to match on "<ip>192.168.1.2</ip><domain>foo.co</domain>".
They also might want to look for ".1.1" or ".1/24" or "160.1" and so forth.


Whitespace or carriage returns might be nonexistent - aside from titles/descriptions, most of this is machine generated and/or scraped from APIs, web pages, and a whole host of 3rd-party software.

The industry, if it matters, is network security, and we are basically getting data from any source possible on a particular threat, and shoving it into this database.

We actually have a DSL for search, and that does pretty well, but people mostly don't use it; they just paste whatever they're searching for into the search bar and that's about it. I tried reducing the number of fields that are searched in the default search and people pitched a fit.
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote
04-20-2017 , 01:58 PM
Quote:
Originally Posted by RustyBrooks
Dave, I was excited about your option but I think it won't work. I tried the smallest version of it:

Code:
select i.type, count(*)
from mytable out
where exists (select type from mytable in where  t.type='A' and t.type=i.type limit 11)
group by 1
it's been about 2 minutes and it hasn't finished yet. I suspect it's running this against the table for every row
I'm guessing this is syntax stuff, but this is what the query looks like expanded:

Code:
select i.type, count(*)
from mytable out
where exists
     (select type
      from mytable in
      where t.type='A' and t.type=i.type
      limit 11)
group by 1;
I'm not sure if this is just an error in your translation, but there don't seem to be any table aliases, which means there is no "t" or "i" to work with. It's a bit confusing to me because I'm used to IN and OUT being keywords.

What is the time on the subselect just running against type 'A' with limit 11? If that's slow, the rest is going to go down with it, no matter what. You may be able to speed up the query by putting an order by type clause in the subselect. You'd have to test against that as well.
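For what it's worth, with aliases actually declared (o/i instead of the keyword-looking out/in), the query as written would presumably be something like:

Code:
select o.type, count(*)
from mytable o
where exists
     (select 1
      from mytable i
      where i.type = 'A' and i.type = o.type)
group by 1;
Note that EXISTS stops at the first matching row anyway, so the limit 11 inside it does nothing.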
** UnhandledExceptionEventHandler :: OFFICIAL LC / CHATTER THREAD ** Quote