You could do some pre-processing on the data (if you only ever, or at least frequently, query the entire data set) as a form of denormalisation, e.g.:
- Create a new table WordFrequencies (Word (PK), Count (int))
- Loop over each record
- Extract the text
- Break it into words
- Update the WordFrequencies table with the counts
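The build steps above can be sketched roughly like this (a minimal Python/sqlite3 sketch against an in-memory database; the sample records are invented, and splitting on `\w+` is just one of many reasonable tokenisation choices):

```python
import re
import sqlite3
from collections import Counter

# Hypothetical sample records; in practice you'd loop over your real table.
records = [
    (1, "This is a test"),
    (2, "That was another test"),
    (3, "This and That"),
]

# Break each record's text into words and tally them.
freq = Counter()
for _id, text in records:
    freq.update(re.findall(r"\w+", text))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordFrequencies (Word TEXT PRIMARY KEY, Count INTEGER)")
conn.executemany("INSERT INTO WordFrequencies VALUES (?, ?)", freq.items())

total = conn.execute(
    "SELECT SUM(Count) FROM WordFrequencies WHERE Word = 'This' OR Word = 'That'"
).fetchone()[0]
print(total)  # This=2, That=2 -> 4
```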
Every time you change a value in a record:
- Count words in current row, deduct from WordFrequencies
- Count words in new text, increment WordFrequencies
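The two update steps can be sketched as a helper function (Python/sqlite3 again; `update_word_frequencies` is a hypothetical name, and in a real system this logic might instead live in a database trigger or your data-access layer):

```python
import re
import sqlite3

def update_word_frequencies(conn, old_text, new_text):
    """Deduct the old row's word counts, then add the new text's counts."""
    for word in re.findall(r"\w+", old_text):
        conn.execute(
            "UPDATE WordFrequencies SET Count = Count - 1 WHERE Word = ?", (word,)
        )
    for word in re.findall(r"\w+", new_text):
        cur = conn.execute(
            "UPDATE WordFrequencies SET Count = Count + 1 WHERE Word = ?", (word,)
        )
        if cur.rowcount == 0:  # first time we've seen this word
            conn.execute("INSERT INTO WordFrequencies VALUES (?, 1)", (word,))

# Demo with invented starting counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordFrequencies (Word TEXT PRIMARY KEY, Count INTEGER)")
conn.execute("INSERT INTO WordFrequencies VALUES ('This', 1), ('That', 1)")

update_word_frequencies(conn, old_text="This", new_text="That That")
this_count = conn.execute(
    "SELECT Count FROM WordFrequencies WHERE Word = 'This'").fetchone()[0]
that_count = conn.execute(
    "SELECT Count FROM WordFrequencies WHERE Word = 'That'").fetchone()[0]
print(this_count, that_count)  # 0 3
```

Doing both passes inside one transaction would keep the counts consistent if the update fails halfway.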
Then you can query:
SELECT Sum(Count) FROM WordFrequencies WHERE Word = 'This' OR Word = 'That'
And it will be super fast.
Again, this is only good if you're querying the entire data set (if you want to add another condition to your WHERE clause it won't work, or you'll need to work out a more sophisticated denormalisation design).
If you have multiple columns in your data set to query in this way, you can do something like:
WordFrequencies (Word (PK), ColumnID(PK), Count (int))
where ColumnID is a unique identifier for the column. Then you can query:
SELECT Sum(Count) FROM WordFrequencies WHERE (Word = 'This' AND ColumnID = 1) OR (Word = 'That' AND ColumnID = 2)
Combining this with EXISTS can give you a pretty robust and fast solution.
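A quick sketch of the two-column variant with a composite primary key (the counts are invented; ColumnID 1 and 2 stand in for two different text columns in the source table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE WordFrequencies (
           Word TEXT,
           ColumnID INTEGER,
           Count INTEGER,
           PRIMARY KEY (Word, ColumnID)
       )"""
)
# Invented per-column counts.
rows = [("This", 1, 3), ("That", 1, 1), ("This", 2, 2), ("That", 2, 5)]
conn.executemany("INSERT INTO WordFrequencies VALUES (?, ?, ?)", rows)

total = conn.execute(
    """SELECT SUM(Count) FROM WordFrequencies
       WHERE (Word = 'This' AND ColumnID = 1)
          OR (Word = 'That' AND ColumnID = 2)"""
).fetchone()[0]
print(total)  # 3 + 5 = 8
```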
Also, I think you can do something along the lines of:
SELECT Sum(Count) FROM WordFrequencies WHERE Word LIKE '%foo%'
to do partial matching on words (e.g. someone's email address tom@example.com would be counted as a single word, but you could still match it with Word LIKE '%@example.com').
Adding full-text search on the Word column might give better performance if you're doing this, since a LIKE pattern with a leading wildcard can't use a normal index.
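A sketch of the suffix match (the stored email "words" are invented; note this assumes your tokeniser keeps an address like tom@example.com together as one word rather than splitting it at the @ and dots):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordFrequencies (Word TEXT PRIMARY KEY, Count INTEGER)")
# Invented data: whole email addresses stored as single words.
conn.executemany(
    "INSERT INTO WordFrequencies VALUES (?, ?)",
    [("tom@example.com", 2), ("jane@example.com", 1), ("hello", 5)],
)

# Suffix match: sums the counts of every word ending in '@example.com'.
total = conn.execute(
    "SELECT SUM(Count) FROM WordFrequencies WHERE Word LIKE '%@example.com'"
).fetchone()[0]
print(total)  # 2 + 1 = 3
```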
Last edited by Gullanian; 04-20-2017 at 12:47 PM.