Open Side Menu Go to the Top
Register
Rule subset detection algorithm Rule subset detection algorithm

08-19-2014 , 07:30 PM
Say you've got a set of rules like so:

(1) if X > 2 then ...
(2) if X > 5 then ...
(3) if X > 4 AND Y > 3 then ...
(4) if X > 2 AND Y > 3 then ...
(5) if X > 1 AND Y = 2 then ...
(6) if Y > 1 AND Z < 8 then ...

and from these rules you can deduce:

(1) being true implies (2) being true
(2) being true implies nothing
(3) being true implies (2) being true
(4) being true implies (1), (2) and (3) being true
(5) being true implies (1) and (2) being true
(6) being true implies nothing

What algorithm do we need to automatically create these deductions?

In practice there will be 1000's of variables, connected together using any number of "AND" operations (the "rules" are actually paths in binary decision trees...).

Tried searching for stuff related to expert-system "specificity ordering", etc but couldn't find anything.

PS: I'm looking for an imperative algorithm as opposed to using PROLOG, etc.

Juk
Rule subset detection algorithm Quote
08-20-2014 , 08:36 AM
Would sorting the rules into some sort of a tree be a good starting point? Just thinking out loud here so could be wrong but that's probably what I might try starting on.

Good compilers will solve this sort of problem as well, might be worth Googling that sort of thing to find solutions.
Rule subset detection algorithm Quote
08-20-2014 , 11:49 AM
Here's a really simple example to show why you'd want to do this:

(1) if X > 2 then ...
(2) if X > 5 then ...

Now we can create two "dummy variables" from these rules, which are set to 1 if the rule matches, else set to 0:

(1) if X > 2 then V_1=1, else V_1=0
(2) if X > 5 then V_2=1, else V_2=0

The problem with this is that the two dummy variables are highly correlated and any iterative statistical and/or machine learning algorithm will take way more iterations to converge because of this (ie: it's much harder for an algorithm to set the weights for the two dummy variables independently, as they have to wait for one weight to be close to optimal before working on the next weight and so on...).

So the idea is to decorrelate the dummy variables like so:

By using the fact that "(1) being true implies (2) being true", we get:

(1) if X > 2 AND NOT X > 5 then V_1=1, else V_1=0
(2) if X > 5 AND then V_2=1, else V_2=0

So now you have two decorrelated dummy variables which carry the same information.

It's actually much simpler than having to add in a load of "NOT AND" rules like above, eg:

(A) being true implies (B), (C), ..., being true

simply means that if (A) is true you set all the dummy variables for (B), (C), ..., to zero (assuming you don't have cyclic implications, which I don't think will occur here anyway).

This is the simplest example I can think of, but in practice there will be 1000's of rules, using 100's of variables, with rule-depths ranging from a single comparison to a few 100 comparisons.

Hope this makes sense.

Juk
Rule subset detection algorithm Quote
08-20-2014 , 11:57 AM
Quote:
Originally Posted by Gullanian
Would sorting the rules into some sort of a tree be a good starting point? Just thinking out loud here so could be wrong but that's probably what I might try starting on.
The rules actually come from an ensemble of decision trees (see section 3.1 in this paper for an example of how this is done).

Quote:
Good compilers will solve this sort of problem as well, might be worth Googling that sort of thing to find solutions.
Yeah, been looking for anything that might help - the nearest I got was a paper on sanitizing sets of firewall rules, but not quite what I want.

Juk
Rule subset detection algorithm Quote
08-20-2014 , 12:51 PM
maybe try stackoverflow for this Q
Rule subset detection algorithm Quote
08-20-2014 , 02:22 PM
Quote:
Originally Posted by jukofyork
Say you've got a set of rules like so:

(1) if X > 2 then ...
(2) if X > 5 then ...
(3) if X > 4 AND Y > 3 then ...
(4) if X > 2 AND Y > 3 then ...
(5) if X > 1 AND Y = 2 then ...
(6) if Y > 1 AND Z < 8 then ...

and from these rules you can deduce:

(1) being true implies (2) being true
(2) being true implies nothing
(3) being true implies (2) being true
(4) being true implies (1), (2) and (3) being true
(5) being true implies (1) and (2) being true
(6) being true implies nothing

What algorithm do we need to automatically create these deductions?

In practice there will be 1000's of variables, connected together using any number of "AND" operations (the "rules" are actually paths in binary decision trees...).

Tried searching for stuff related to expert-system "specificity ordering", etc but couldn't find anything.

PS: I'm looking for an imperative algorithm as opposed to using PROLOG, etc.

Juk
The general idea is:

http://en.wikipedia.org/wiki/Unification_(computing)

In your case, something along the lines of the following may work:

Expression1 implies Expression1

(Var1 > Constant1) implies (Var2 > Constant2) if Constant1 >= Constant2 AND Var1 = Var2

(Expression1 AND Expression2) implies Expression3 if Expression1 implies Expression3 OR Expression2 implies Expression3

Expression1 implies (Expression2 OR Expression3) if Expression1 implies Expression2 OR Expression1 implies Expression3

Given this, you can take any two expressions and determine whether one includes the other by recursively trying all the possibilities

It will miss some cases, which will require additional rules:

((X > 5) OR (X > 2 AND X < 6)) implies X > 2 and vice versa.

Edit: I guess you also need (Exp1 AND Exp2) implies (Exp3 AND Exp4) if (Exp1 implies Exp3 AND Exp2 implies Exp4) OR (Exp1 implies Exp4 AND Exp2 implies Exp3)

Last edited by candybar; 08-20-2014 at 02:44 PM.
Rule subset detection algorithm Quote
08-20-2014 , 02:36 PM
Actually if numerical ranges are all you need, you can just probably write Range class that has union and intersection operations.

Range(-INF, +INF) intersect Range(5, 9) => [5, 9]
Range(6, 10) intersect Range(5, 9) => [6, 9]
Range(2, 6) union Range(7, 10) => [2,6], [7,10]

This way, you can solve the unresolved inference I pointed out earlier (though probably not all cases if multiple variables are involved) and reduce expressions into fewer terms.
Rule subset detection algorithm Quote
08-20-2014 , 06:17 PM
Thanks, that's just what I was looking for! I think I've got it now:

Assuming the only operation joining the expressions is "AND" and the only operators are ">" and "<" (lets ignore "=" for now...) then:

(Var1 > Constant1) implies (Var2 > Constant2) if Constant1 >= Constant2 AND Var1 = Var2
(Var1 < Constant1) implies (Var2 < Constant2) if Constant1 <= Constant2 AND Var1 = Var2

Code:
for each possible pairing of rules (R_i, R_j) {

  let T_i = total number of comparison operations in R_i
  let T_j = total number of comparison operations in R_j

  if T_i <= T_j {
    N = 0
    for each comparison operation O_i in R_i {
      if there is a comparison operation O_j in R_j such that O_j implies O_i
        N = N + 1
    }
    if N = T_i
      R_j implies R_i
  }

  if T_j <= T_i {
    N = 0
    for each comparison operation O_j in R_j {
      if there is a comparison operation O_i in R_i such that O_i implies O_j
        N = N + 1
    }
    if N = T_j
      R_i implies R_j
  }

}
I think this should work now (will try it on some real examples tomorrow).

Juk
Rule subset detection algorithm Quote
08-20-2014 , 06:51 PM
Thinking about this more then that won't work unless you happen to have each variable tested in a single direction only, eg:

(1) if X > 2 then ...
(2) if X > 1 AND X < 5 then ...

would falsely find (2) implies (1).

But your idea of creating a set of ranges for each variable in each rule and then testing if (1)∪(2)=(1) or (1)∪(2)=(2) should fix this!

Juk
Rule subset detection algorithm Quote
08-21-2014 , 12:05 PM
Quote:
Originally Posted by jukofyork
Thinking about this more then that won't work unless you happen to have each variable tested in a single direction only, eg:

(1) if X > 2 then ...
(2) if X > 1 AND X < 5 then ...

would falsely find (2) implies (1).
Oh yeah, you need a couple more rules to deal with inequality in both directions.

Quote:
But your idea of creating a set of ranges for each variable in each rule and then testing if (1)∪(2)=(1) or (1)∪(2)=(2) should fix this!
Yeah, if you have a robust range implementation, you just need to get the boolean inferences correctly. I think if you only have AND operators, this may be fairly simple.
Rule subset detection algorithm Quote
08-21-2014 , 01:04 PM
Quote:
Originally Posted by candybar
Yeah, if you have a robust range implementation, you just need to get the boolean inferences correctly. I think if you only have AND operators, this may be fairly simple.
Yeah looking at in now, it's just a case of each rule defining a hypercube in n-dimensional space (with the possibility of some sides extending to -INF and/or +INF, depending on if there are 0, 1 or 2 inequalities for a variable) and then testing if one hypercube fits completely inside of another.

Huge thanks for your input as until you mentioned thinking of it in terms of unions, I just couldn't see how to get anywhere with it!

Juk
Rule subset detection algorithm Quote
08-26-2014 , 10:09 AM
Quote:
Originally Posted by jukofyork
Yeah looking at in now, it's just a case of each rule defining a hypercube in n-dimensional space (with the possibility of some sides extending to -INF and/or +INF, depending on if there are 0, 1 or 2 inequalities for a variable) and then testing if one hypercube fits completely inside of another.

Huge thanks for your input as until you mentioned thinking of it in terms of unions, I just couldn't see how to get anywhere with it!

Juk
You're welcome - I'm glad I was helpful. Love how you got to hypercube in n-dimensional space from my ramblings!
Rule subset detection algorithm Quote
08-30-2014 , 08:05 AM
Quote:
Originally Posted by candybar
You're welcome - I'm glad I was helpful. Love how you got to hypercube in n-dimensional space from my ramblings!
I finally got around to coding it up today and it appears to be working beautifully!

Not sure if anybody is interested, but just in case here is a follow up to post #3:

1. You get better solutions in less iterations.

As you crank the iterations right up then both give roughly the same quality of models, but if you are limited to just doing a few iterations the "superset reduced" form of the input is vastly superior.

This is expected as the "non-superset reduced" version spends the early iterations fighting against the correlated variables, whereas the "superset reduced" version can get to work setting the weights independently right from the start (as was hoped/expected...).

2. You get much sparser input vectors.

This wasn't something I thought about when trying to do this, but for a lot of ML algorithms having a lot of zero-valued inputs is a big plus (ie: you can write optimized code that only deals with the non-zero inputs specifically).

From some simple tests I've run today, it appears that the reduction in non-zeroed inputs is as much as 50-60%!

Juk
Rule subset detection algorithm Quote

      
m