Rule subset detection algorithm - Computer Technical Help

Two Plus Two Forums Other Topics Computer and Technical Help

Rule subset detection algorithm

Post Reply Subscribe

...

08-19-2014 , 07:30 PM

jukofyork

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 11,749

Say you've got a set of rules like so:

(1) if X > 2 then ...
(2) if X > 5 then ...
(3) if X > 4 AND Y > 3 then ...
(4) if X > 2 AND Y > 3 then ...
(5) if X > 1 AND Y = 2 then ...
(6) if Y > 1 AND Z < 8 then ...

and from these rules you can deduce:

(1) being true implies (2) being true
(2) being true implies nothing
(3) being true implies (2) being true
(4) being true implies (1), (2) and (3) being true
(5) being true implies (1) and (2) being true
(6) being true implies nothing

What algorithm do we need to automatically create these deductions?

In practice there will be 1000's of variables, connected together using any number of "AND" operations (the "rules" are actually paths in binary decision trees...).

Tried searching for stuff related to expert-system "specificity ordering", etc but couldn't find anything.

PS: I'm looking for an imperative algorithm as opposed to using PROLOG, etc.

Juk

Quote

08-20-2014 , 08:36 AM

Gullanian

Carpal \'Tunnel

Join Date: Dec 2006 Posts: 14,014

Would sorting the rules into some sort of a tree be a good starting point? Just thinking out loud here so could be wrong but that's probably what I might try starting on.

Good compilers will solve this sort of problem as well, might be worth Googling that sort of thing to find solutions.

Quote

08-20-2014 , 11:49 AM

jukofyork

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 11,749

Here's a really simple example to show why you'd want to do this:

(1) if X > 2 then ...
(2) if X > 5 then ...

Now we can create two "dummy variables" from these rules, which are set to 1 if the rule matches, else set to 0:

(1) if X > 2 then V_1=1, else V_1=0
(2) if X > 5 then V_2=1, else V_2=0

The problem with this is that the two dummy variables are highly correlated and any iterative statistical and/or machine learning algorithm will take way more iterations to converge because of this (ie: it's much harder for an algorithm to set the weights for the two dummy variables independently, as they have to wait for one weight to be close to optimal before working on the next weight and so on...).

So the idea is to decorrelate the dummy variables like so:

By using the fact that "(1) being true implies (2) being true", we get:

(1) if X > 2 AND NOT X > 5 then V_1=1, else V_1=0
(2) if X > 5 AND then V_2=1, else V_2=0

So now you have two decorrelated dummy variables which carry the same information.

It's actually much simpler than having to add in a load of "NOT AND" rules like above, eg:

(A) being true implies (B), (C), ..., being true

simply means that if (A) is true you set all the dummy variables for (B), (C), ..., to zero (assuming you don't have cyclic implications, which I don't think will occur here anyway).

This is the simplest example I can think of, but in practice there will be 1000's of rules, using 100's of variables, with rule-depths ranging from a single comparison to a few 100 comparisons.

Hope this makes sense.

Juk

Quote

08-20-2014 , 11:57 AM

jukofyork

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 11,749

Quote:

Originally Posted by Gullanian

Would sorting the rules into some sort of a tree be a good starting point? Just thinking out loud here so could be wrong but that's probably what I might try starting on.

The rules actually come from an ensemble of decision trees (see section 3.1 in this paper for an example of how this is done).

Quote:

Good compilers will solve this sort of problem as well, might be worth Googling that sort of thing to find solutions.

Yeah, been looking for anything that might help - the nearest I got was a paper on sanitizing sets of firewall rules, but not quite what I want.

Juk

Quote

08-20-2014 , 12:51 PM

greg nice

Carpal \'Tunnel

Join Date: Aug 2004 Posts: 9,990

maybe try stackoverflow for this Q

Quote

08-20-2014 , 02:22 PM

candybar

Pooh-Bah

Join Date: Aug 2011 Posts: 4,234

Quote:

Originally Posted by jukofyork

The general idea is:

http://en.wikipedia.org/wiki/Unification_(computing)

In your case, something along the lines of the following may work:

Expression1 implies Expression1

(Var1 > Constant1) implies (Var2 > Constant2) if Constant1 >= Constant2 AND Var1 = Var2

(Expression1 AND Expression2) implies Expression3 if Expression1 implies Expression3 OR Expression2 implies Expression3

Expression1 implies (Expression2 OR Expression3) if Expression1 implies Expression2 OR Expression1 implies Expression3

Given this, you can take any two expressions and determine whether one includes the other by recursively trying all the possibilities

It will miss some cases, which will require additional rules:

((X > 5) OR (X > 2 AND X < 6)) implies X > 2 and vice versa.

Edit: I guess you also need (Exp1 AND Exp2) implies (Exp3 AND Exp4) if (Exp1 implies Exp3 AND Exp2 implies Exp4) OR (Exp1 implies Exp4 AND Exp2 implies Exp3)

Last edited by candybar; 08-20-2014 at 02:44 PM.

Quote

08-20-2014 , 02:36 PM

candybar

Pooh-Bah

Join Date: Aug 2011 Posts: 4,234

Actually if numerical ranges are all you need, you can just probably write Range class that has union and intersection operations.

Range(-INF, +INF) intersect Range(5, 9) => [5, 9]
Range(6, 10) intersect Range(5, 9) => [6, 9]
Range(2, 6) union Range(7, 10) => [2,6], [7,10]

This way, you can solve the unresolved inference I pointed out earlier (though probably not all cases if multiple variables are involved) and reduce expressions into fewer terms.

Quote

08-20-2014 , 06:17 PM

jukofyork

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 11,749

Thanks, that's just what I was looking for! I think I've got it now:

Assuming the only operation joining the expressions is "AND" and the only operators are ">" and "<" (lets ignore "=" for now...) then:

(Var1 > Constant1) implies (Var2 > Constant2) if Constant1 >= Constant2 AND Var1 = Var2
(Var1 < Constant1) implies (Var2 < Constant2) if Constant1 <= Constant2 AND Var1 = Var2

Code:

for each possible pairing of rules (R_i, R_j) {

  let T_i = total number of comparison operations in R_i
  let T_j = total number of comparison operations in R_j

  if T_i <= T_j {
    N = 0
    for each comparison operation O_i in R_i {
      if there is a comparison operation O_j in R_j such that O_j implies O_i
        N = N + 1
    }
    if N = T_i
      R_j implies R_i
  }

  if T_j <= T_i {
    N = 0
    for each comparison operation O_j in R_j {
      if there is a comparison operation O_i in R_i such that O_i implies O_j
        N = N + 1
    }
    if N = T_j
      R_i implies R_j
  }

}

I think this should work now (will try it on some real examples tomorrow).

Juk

Quote

08-20-2014 , 06:51 PM

jukofyork

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 11,749

Thinking about this more then that won't work unless you happen to have each variable tested in a single direction only, eg:

(1) if X > 2 then ...
(2) if X > 1 AND X < 5 then ...

would falsely find (2) implies (1).

But your idea of creating a set of ranges for each variable in each rule and then testing if (1)∪(2)=(1) or (1)∪(2)=(2) should fix this!

Juk

Quote

08-21-2014 , 12:05 PM

#10

candybar

Pooh-Bah

Join Date: Aug 2011 Posts: 4,234

Quote:

Originally Posted by jukofyork

Oh yeah, you need a couple more rules to deal with inequality in both directions.

Quote:

But your idea of creating a set of ranges for each variable in each rule and then testing if (1)∪(2)=(1) or (1)∪(2)=(2) should fix this!

Yeah, if you have a robust range implementation, you just need to get the boolean inferences correctly. I think if you only have AND operators, this may be fairly simple.

Quote

08-21-2014 , 01:04 PM

#11

jukofyork

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 11,749

Quote:

Originally Posted by candybar

Yeah, if you have a robust range implementation, you just need to get the boolean inferences correctly. I think if you only have AND operators, this may be fairly simple.

Yeah looking at in now, it's just a case of each rule defining a hypercube in n-dimensional space (with the possibility of some sides extending to -INF and/or +INF, depending on if there are 0, 1 or 2 inequalities for a variable) and then testing if one hypercube fits completely inside of another.

Huge thanks for your input as until you mentioned thinking of it in terms of unions, I just couldn't see how to get anywhere with it!

Juk

Quote

08-26-2014 , 10:09 AM

#12

candybar

Pooh-Bah

Join Date: Aug 2011 Posts: 4,234

Quote:

Originally Posted by jukofyork

You're welcome - I'm glad I was helpful. Love how you got to hypercube in n-dimensional space from my ramblings!

Quote

08-30-2014 , 08:05 AM

#13

jukofyork

Carpal \'Tunnel

Join Date: Sep 2004 Posts: 11,749

Quote:

Originally Posted by candybar

You're welcome - I'm glad I was helpful. Love how you got to hypercube in n-dimensional space from my ramblings!

I finally got around to coding it up today and it appears to be working beautifully!

Not sure if anybody is interested, but just in case here is a follow up to post #3:

1. You get better solutions in less iterations.

As you crank the iterations right up then both give roughly the same quality of models, but if you are limited to just doing a few iterations the "superset reduced" form of the input is vastly superior.

This is expected as the "non-superset reduced" version spends the early iterations fighting against the correlated variables, whereas the "superset reduced" version can get to work setting the weights independently right from the start (as was hoped/expected...).

2. You get much sparser input vectors.

This wasn't something I thought about when trying to do this, but for a lot of ML algorithms having a lot of zero-valued inputs is a big plus (ie: you can write optimized code that only deals with the non-zero inputs specifically).

From some simple tests I've run today, it appears that the reduction in non-zeroed inputs is as much as 50-60%!

Juk

Quote

Post Reply Subscribe

...