Документ взят из кэша поисковой машины. Адрес оригинального документа : http://sp.cs.msu.ru/courses/match/LEAPS/FiveCaseStudies91.pdf
Дата изменения: Tue Apr 8 17:51:01 2003
Дата индексирования: Mon Oct 1 23:10:58 2012
Кодировка:
Effects of DatabaseSize on Rule System Performance: Five CaseStudies
David A. Brant+, Timothy Grose+, Bernie Lofaso+, & Daniel P. Miranker$
`Applied ResearchLaboratories The University of Texasat Austin P-0. Box 8029 Austin, TX 787138029 *Department of Computer Sciences The University of Texasat Austin Austin, TX 78712

Abstract
Building practical expert database systems requires an effective inferencing capability over large data sets. rnferencing in this context means repeatedly executing a fixed set of queries, interleaved with update transactions, until a fix:3 pint is reached. The effectiveness of ,.lircncing mechanismis heavily depcnde; !:on :he amount of state space needed and the ability of the underlying algorithms to avoid unnecessary work. Common techniques used in the design of rule-basedsystemsstore large amountsof statein order to derive precise query support information that will enable better performance. These techniques were intended for use in main memory on small data setsand are not ncccssarily suited for a databaseenvironment. When confronted with a large databasethese techniques may experience severeperformance problems - severeenough to render them useless.In this paper we examine the effectsof database on live test cases.The use size of real programs with real data pr0vidc.sinsights mat are not to be found through analysis and simulation. We compare two different rule systems, one based on the TREAT match algorithm and the other on LEAPS, a lazy matching algorithm. The results show that state can be a problem in rule systemsand that by using lazy matching it is possible to eliminate some statewhile improving performance.

1 Introduction
A practical integrated database/rulesystemwill require an efhcient inferencing capability on large databases. Our research focuses on active forward chaining inferencing

systems and their behavior in a database environment. Many studies have been done on various aspectsof integrating rule and databasesystems.Several have dealt with performance issues [Tan. Maheshwari, & Srivastava, 1990,Cohen, 1989,Wang, 1989, Delcambre & Etheredge, 1988,Bein, King, & Kamel, 19871.Most of these are analytical in nature and/or simulate synthesized rules and data. While analytical and simulated approaches to this problem are valuable, important issues such as realistic query mixes, rule execution order, and join selectivity factors, are heavily influenced by both the rule program itself and the data on which it operates.Therefore, we have assembleda suite of live real world programswith automatically extensible databasesfor use in understanding rulebasedsystem behavior as the size of the databasegrows'. This work is being done in the context of developing DATEX, a tightly integrated expert databasesystem. In the interest of providing effective integration of rule execution with databasesystems, the behavior of incremental match techniques,first developed for main-memory AI systems, is investigated. These techniques take advantage of a space-time trade-off and store large amounts of state in order to derive precise query support information. We explore the utility of TREAT match [Miranker, 19901and compare it against LEAPS [Miranker & Brant, 19901,a slightly more complicated algorithm, with better spacecharacteristics,intended to support inferencing on databases. In effect, rule-based systemsexecute a set of queries, interleaved with updates,in a cycle until reaching a fixed point. In a naive implementation, this can easily imply executing ld - lo3 multiway joins eachcycle for thousands of cycles. In practice, that number is significantly smaller. This is due to the key observation that these systemsdisplay temporal redundancy, i.e., from one cycle to the next, there is very little change to the database.Thus, if knowl1, This suite is available from the authors and contributions of new programs and data generatorsare welcome.

This work was supported by the ARL:UT Internal Researchand Development Program,

Proceedings of the 17th International Conference on Very Large Data Bases

287

Barcelona, September, 1991


edge about the system state is retained across cycles, the amount of work done on any one cycle can becomequite manageableby just incrementally updating that state. In the AI community the use of incremental match algorithms forms the basis of effective main-memory rule execution environments [Forgy, 19821.The databaseproblem of maintaining a materialized view is an incremental match problem without negation. A large number of incrcmental match algorithms have been described. Many of the newer ones have been developed for databaseapplications IWang, 1990, Blakeley & Martin, 1990, Srivastava, Hwang, &Tan, 1990,Raschid, Sellis, & Lin, 19881. The problem with incremental matching is that the state for most algorithms has an exponential worst-case space requirement. For I tuples in a databaseand a rule with j joins, the worst-casestate is 0(/J). While real systems don't experience the worst case, any unexpected large polynomial spacewould still presentsignificant concerns. To explore the possibility that programs might display this type of bad behavior, we conducted a scaling study in which the databasesof five real programs were scaledup in size as their performancecharacteristicswere measured. goal was to escapethe temptation to usearThe bitrary and manufactured values of selectivity for the select and join operators of rule systems,and to provide for the interaction of many rules acting in concert. Achieving this level of realism through analysis and simulation is extremely difficult, if not impossible. The programs in the test suite are written in OPSS [Forgy, 19811.While OPSSmay he widely regardedas antiquated, it nonethelessembodies many of the underlying AI techniques used in other, more powerful, languages. Moreover, since OPS5 has been around for such a long time there exist public domain systemsand numerousprograms. Our empirical data was gathcrcd using two of the best performing OPSS-basedsystems in existcncc. Both are compiled systems.The first usesa TREAT match algorithm which has been shown to be superior to the more widely used Rete match for most programs. The second usesa relatively new technique, called LEAPS, basedon the idea of lazy mafching. Section 2 presentsa brief description of rule-basedsystems, the four knowledge types that can be employed by their incremental match algorithms, and an overview of someof the published algorithms. In section 3 we describe the metrics used in the study and the test cases.Section 4 presentsour empirical results. Section 5 contains a summary and our concluding remarks.

database,and Al,...,A, is a set of m actions or update transactionson the database.In a relational model, eachPi is a predicate on a single relation. These predicates may contain constants and variables. Predicates containing constantscan be partially evaluated with the relational select operator.If predicatessharea variable, it must be consistently bound between them for the predicates to be satisfied.This is accomplishedby a relational join operator (including both equijoin and non-equijoin). Execution of the program proceedsby evaluating the predicates,choosing one rule whose predicates are satisfied, executing its actions, and repeating the cycle until a fixed point is reached. 2.1 Query Support and Match Algorithms Each unique set of tuples that satisfies a rule's predicates, P+.JV,, is an instantiation. Thus, the set of instantiations at any one time is composed of the union of the temporaryrelations resulting from the queriesassociated with thosepredicates.A critical component of any rulebasedsystemis the match algorithm that computesinstantiations. A naive algorithm for finding the instantiations of the rules would execute the query associated with each rule's predicatesagainst the entire databaseon each cycle. That approach is combinatorially explosive and computationally intractable. To address this problem, rule-based systems maintain state information from cycle to cycle. This information provides query support for the match algorithm. The knowledge provided by the state information has been divided into four categories [McDermott, Newell, and Moore, 1978, and Miranker, 19901. 1. Condifion Membership: Associated with eachPi is a running count of the number of tuples that satisfy its selectoperation. This information is usedto identify rules that arc active, i.e., those with a non-zero count for eachPi. lnaclive rules may bc ignored by the match algorithm. 2. Memory Supporl: Provides explicit knowledge about precisely which tuples satisfy the selectsof which Pi. Thesehave beenreferred to as a-memory. If a separate set of tuples is maintained for eachPi, duplicate information will be stored for those predicatesthat have the sameselectcriteria on the samerelation. Those systems that eliminate this redundant data are said to use shared a-memory. 3. Condition Relationship: Provides knowledge about the interaction of predicates,i.e., the intermediate results of multiway joins are explicitly stored from cycle to cycle. Thesehave been referred to as P-memory. 4. Conjlicl Set Supporl: the conjlicf set containing all of the instantiations for all rules is retained from cycle to cycle. We have addeda fifth knowledge type basedon a lazy approachto query evaluation. We call the new type Reso-

2 Rule-Based Systems
A rule-based program is a set of rules of the form if Pp.../\p, then Al ,...,A,,, where PI~...hpn is a conjunction of n conditions or predicateson the current state of the baserelations of the

Proceedings of the 17th International Conference on Very Large Data Bases

288

Barcelona, September, 1991


Table 1 Relative Cost/Benefit Ranking of Rule Knowledge Knowledge Type Condition Relationship (CR) Memory Support (MS) Condition Membership (CM) Conflict Set Support (CS) Resolution Support (RS) Space-Time Cost very high high low ? ? Expected Benefit very low high high ? ?

Table 2 Match Algorithms and Their State Information Algorithm Rete porgy, 19821 TREAT [Miranker, 19901 Lazy [Miranker & Bran& 19901 Matchbox [Perlin, 19891 Gridmatch [Tan, Maheshwari, & Srivastava, 19901 unnamed machid, Sellis, & Lin, 19881 Rime [Hwang, 19891 unnamed [Oflazer, 19861 lution Support. An underlying theme in current match algorithms is the eager evaluation of all active rules to generateall possible instantiations. However, on any given cycle only one instantiation is chosen to fire one rule. The choice is based on a conflict set resolution strategy. This observation has led to the development of LEAPS, a lazy matching algorithm that computes a fireable instantiation by using the conflict set resolution strategyto direct a bestfirst search. In doing so, it eliminates the conflict set and replaces it with a stack containing the state information needed to control a demand driven stream-basedquery process.Thus, key issuesinclude the effectslazy matching has on performance and its spacerequirements compared to eagerschemes. Table 1 shows the knowledge techniques with a summary of their relative cost and expectedbenefit, As can be seen,the merits of using conflict set support versusresolution support are left as an open qucslion. We will attempt to resolve these question marks basedon the case studies in the following sections. The values that are given in Table 1 are basedon several observations. Since condition relationship storesthe results of all intermediatejoins it is given a very high spacetime cost. Furthermore, TREAT, which does not use this knowledge, consistently outperforms match algorithms which do. Therefore, its expected benefit is statedas very low. Memory support stores all of the intermediate results of select operations. Since there is a one to many mapping from the databaseto a-memory, the cost is generally high for this knowledge. However, when the selectivity of the constant tests is reasonably low (as it is for most systems measured),the expectedbenefit in increasing the speedof joins is also high. Condition membership is low in space cost since it only requires a few bytes to implement. If memory support CR 4 MS J J 4 5 J J J CS J 5 J CM RS

5 s J

is implemented, then its time cost is also extremely low, since all of the select operations will have to be performed anyhow. If memory support is not used then its time cost might be more accurately stated as moderate, as it will then have to do its own select operations. Table 2 lists someof the match algorithms in the literature and their associatedknowledge types. While not obvious, the selection of some knowledge types makes others of little use. For instance, if condition relationship is chosen, then condition membership may provide little or no benefit. Therefore, it is probably not reasonableto expect any one algorithm to useall of them. 2.2 Match Algorithms Most rule systemsin use today are basedon someform of the Rete match. The two exceptions that we are aware of are TREAT and LEAPS. The following sections dcscribe theseat a high Icvcl. 2.2.1 Rete The Rete match incorporatesmemory support and condition relationship, It compiles the queries of the rules into a discrimination network in the form of an augmented dataflow network. The input portion of the Rete network contains chains of tests that perform the relational select operations.Tokens passing through those chains partially match a particular predicate and are stored in a-memory nodes, thus forming the memory support part of the algorithm. Following the a-memories are two-input nodesthat test for consistent variable bindings between predicates. By analogy, thesenodesincrementally compute the join of the memories on the input arcs. When a token enters a two-input node it is comparedagainst tokens in the memory of the opposite arc. Paired tokens with consistent variable bindings are stored at the output of the two-input nodesas p-memories. Tokens that propagatefrom the last

Pfoceedmgs of the 17th International Conferencz on Very Large Data Bases

289

Barcelona, September, 1991


p-memories in the network reflect changesto the conflict set. The Rete algorithm is the basis for most rule-based systemsin use today.
2.2.2 TREAT

It has been shown that the condition relationship information contained in the P-memory is not justified from a performance standpoint in main-memory systems [Miranker, Lofaso, Farmer, Chandra, & Brant, 19901and this result has been recently extended to databases well as Wang & Hanson, 19901.The TREAT match algorithm does not use p-memories and has been shown to consistently outperform Rete-basedsystems. The rationale for this is due to the effects of the removal of tuples from the system.In general, Rete must do as much work to delete a tuple as it does to add one. TREAT does much less work on removals since it doesnot have to update the intermediate join results. TREAT does do slightly more work than Rete when tuples are added, but for most programs, the trade-off favors TREAT. From a databaseperspective, obviating the need for maintaining intermediatejoin results is crucial, since they constitute the largest consumer of memory in Rete-based systems.However, TREAT still makesuseof three categories of state information - condition membership,memory support, and conflict set support. It is intcrcsting to note that TREAT representsa case of reducing space requircmentswhile at the sametime improving performance.
2.2.3 LEAPS

es be automatically produced under controlled parameters. Second,we wanted a set of programs that varied considerably in size and complexity. The resulting suite contains programs that range from several rules to over seven hundred rules. And, third, we wanted to chose a set of programs that spanned the range of standard expert system problem solving techniques, e.g., constraint propagation, blackboards,and A* search.Table 3 provides a summary of the major characteristicsof the test suite.
Table 3 Program Summaries Name No. of Rules Comment

manners waltz waltzdb

8 33 35

Finds a seatingarrangementfor rslinint;guestsby depth-first Waltz iinelabeling for simple scenesby constraint propagation. A more database oriented version of Waltz line labeling for complex scenesby constraint propagation. Route planner for a robotic air vehicle using A*. A VLSI router using a blackboard technique.

ARP weaver

II8 637

3.1.1 Manners

An underlying theme in both Rete and TREAT is the eagerevaluation of all active rules to generateall possible instantiations, However, on any given cycle only one instantiation is usedto fire one rule. This observation has led to the development of the lazy matching algorithm known as LEAPS. LEAFS computesat most one instantiation on any given cycle. This is done by setting up demanddriven data streamsused to produce instantiations. In doing so, it eliminates the conflict set and replacesit with a stack. The stack contains information needed to restart streamsthat have been temporarily suspendeddue to the processingof a higher priority stream.One of the issuesto be examined in the study is how big of a spaceproblem is the conflict set and is the stack an improvement.

Manners was derived from a program appearing in Kieman, de-Maindrivillc. & Simon, 1990. It is based on a depth-first searchsolution to the problem of Anding an acceptable seating arrangement for guests at a dinner party. The particular seating protocol enforced by the version usedin this study ensuresthat each guest is seatednext to someoneof the opposite sex who sharesat least one hobby. The mannersprogram can be extended quite easily to handle many different criteria for constraining the seating arrangement.Further it is very easy to adjust the join selectivity of the program by controlling the distribution of the hobbies. The data used in this study gave a uniform distribution of hobbies from a minimum of 2 to a maximum of 5 per guest. Guestswere evenly divided into male and female. 3.1.2 Waltz and Waltzdb Waltz was developed at Columbia University. It is an expert system designed to aid in the 3-dimensional interpretation of a 2-dimensional line drawing. It does so by labeling all lines in the scene based on constraint propagation. Only scenescontaining junctions composed of two and three lines are permitted. The knowledge that Waltz usesis embeddedin the rules. The constraint propagation consistsof 17 rules that irrevocably assign labels to lines based on the labels that already exist. Additionally, there are 4 rules that establish the initial labels. The rules provide the explicit constraint information by means of constant tests - thcrc is no generalized form of constraint

3 The Test Suite and Metrics
3.1 The Programs and Data Generators Each of the selectedprogramswas chosenfor a specific reason.The goal was to achieve a diversity in terms of the number of rules, the AI problem solving techniqueemployed, and the problem domain, First, it was necessaryto be able to generaterandomly large databasesfor the programs to work on. It was also desirable that thesedatabas-

Proceedings of the 17th International Conference on Very Large Data Bases

290

Barcelona. September, 1991


propagation. The significance of this became apparent when we tried to expand the Waltz program to handle the more general case of line drawings involving junctions composedof 4, 5, and 6 lines. This resulting program is called waltzdb. Waltzdb was developed at the University of Texas at Austin. It is more general version of the Waltz program described in the previous section. Walkdb is designed so that it can be easily adapted to support junctions of 4, 5, and 6 lines. The method usedin solving the labeling problem is a version of the algorithm described by Winston [Winston, 19841.The key difference between the problem solving technique used in waltz and waltzdb is that waltzdb usesa database legal line labels that are applied of to the junctions in a constrained manner.In Waltz the constraints are enforced by constant testswithin the rules. The input data for waltz is a set of lines defined by Cartesiancoordinate pairs, The data generator usesa base drawing consisting of 72 lines which we refer to as a region. The user can specify any number of regions to be generated.We have run test caseswith as many as 100 rcgions. 3.1.3 ARP The Aeronautical Route Planner calculates the lowest :ost route between two points for an airplane or missile. It calculates the route basedon terrain, threat, and cost data. The route will avoid terrain and surface-to-air missile sites (the threats). ARP minimizes three costs for a roulc: the cost of travelling a distance, the cost of being at an altitude, and the cost of being at a height above the terrain. ARP usesthe A* searchalgorithm to restrict the portion of the searchspaceit needsto examine to calculate the route. The input data is a databaseof terrain information for a given corridor. 3.1.4 Weaver Weaveris an expert system designedto perform VLSI channel and box routing [Joobbani & Siewiorek, 19861.It is a large complex program that is made up of several independentexpert systems communicating via a common blackboard. Its input data is a list of pins, nets. and channel dimensions. 3.2 Metrics In order to minimize the affects of constants we have chosen to represent all of the memory usage in terms of the number of data elements. Data elements for both TREAT and LEAPS include the tuplcs of the database and a-memory. TREAT also contains entries in the conflict set, while LEAPS keepsa stack. There are two primary measuresof time used in this study. The first is the overall execution time of the programs in terms of the number of cpu seconds.Since match

time dominates the overall execution time for these systems, the second measure is the number of times an amemory is touched in the joins of the matching algorithm. This is further broken down into successfulversus unsuccessful tests.Successfultestsare important in that they indicate the minimum work necessary given a perfect indexing method on the join attributes.

4 Results
Each program was run with four equally spacedsetsof data points (shown in Table 4). Table 5 shows the counts
Table 4 Initial Program Set 1 Database Size

Set 2 18:; 576 106 791

manners
Waltz

16 iii :Yl

waltzdb ARP' weaver

Set643 2664 864 106 1311

"%i 3600 1152 106 1831

of the measuredelements at the point when they were at their maximum. There are several things to notice in this data.First, the stack is very small and remains that way as databasesize increases.In fact, for four of the five programsit .is a small constant. Contrast this to the size of the conflict set and its growth. Figure 1 graphs this data for all five programs,Also note that we have identified non-linear behavior for the conflict set in the mannersprogram, Another interesting aspect of these programs was the size of the (x-memory.Both of the systemsused to gather the data do not implement sharing of a-memory, but clearly the dominant space factor is a-memory. We compared the a-memory size to the maximum size of the database during the program execution and discovered that the replication factor of data in the a-memory was much higher than anticipated. Table 6 compares the maximum size of the databasewith the measuredmaximum a-memory size (columns 2 and 3). Based on this data we calculated the averagenumber of a-memory entries required for eachdatabaseentry, Since any change to the databaserequires changesto the a-memory, this can also be interpreted as an updateratio (column 4). In order to analyze the effects of using shareda-memory, we used traces of the executions to drive a sharedamemory simulator. The result is shown in column 5. By dividing the measureddata in column 3 with the simulated data in column 5 we derived the averagesharing ratio, i.e., the number of a-memory entries that are subjectedto the sameselect operator. The result was surprising and demonstratedthe clear benefits of using sharing. The last column shows the expected a-memory updatesper database changebasedon sharing. Table 7 shows the total execution time for the test programs.The data is graphedin Fig. 2. To get thesetimes the programs were run with uninstrumented versions of the

Proceedings of the 17th International Conference on Very Large Data Bases

291

Banxlona. September, 1991


Table 5 Maximum Counts Program Element manners TREAT: database a-memory conflict set LEAPS: database a-memory stack waltz TREAT: database a-memory conflict set database LEAPS: a-memory stack waltzdb TREAT: database a-memory conflict set database LEAPS: a-memory stack weaver TREAT: database a-memory conflict set LEAPS: database a-memory stack arp TREAT: database a-memory conflict set LEAPS: database a-memory stack Data Set 1 231 797 174 231 797 1 2,749 98,403 3,192 2,749 98,403 4 3,272 124,056 1,208 3.200 120143 1 2 575 150,073 35 575 143,128 2 494 6,439 353 494 6,447 15 Data Set 2 702 2,570 552 702 2,570 1 5,505 197,203 6,416 5,505 197,203 4 5,864 222,296 2,200 5,736 215,815 2 8i5 212,632 359 815 207,872 2 747 10,023 601 741 10,031 20 Data Set 3 2,425 9,227 2,248 2,425 9,227 1 8,049 288,403 9,392 8,049 288,403 4 8,456 320,536 3,192 8,272 311,199 2 1,335 34;s5-92 1'335 348:552 2 920 12,440 796 920 12,494 25 Data Set 4 8,952 34,856 9,490 8,952 34,856 1 10,805 387,203 12,616 10,805 387,203 4 11,048 418,776 4,184 10,808 406,583 2 1,855 489,232 1,799 1,855 489,232 2 1,116 15,214 960 1,116 15,278 30

Table 6 Effects of Using Shared a-Memory Program manners waltz weaver waltzdb w Database a-Memory 1,855 489,232 8,952 10,805 11,048 1,116 34,856 406,583 387,203 15,278 Update Ratio 263.7 335.; 36:8 13.7 Shared a-Memory 12,620 9,146 53,205 19,687 13,196 Sharing Ratio 3E 2i.z 1:2 Update Ratio (S) 6.8 i*; 1:8 11.8

Table 7 Total Execution Time (in seconds1 Program manners waltz waltzdb weaver Match Alg. TREAT LEAPS TREAT LEAPS TREAT LEAPS TREAT LEAPS TREAT LEAPS Set 1 ;:!i 343.3 147.9 ciQ1.5 102.9 170.3 138.5 224.3 96.9 Set 2 `3 988.0 502.1 2,109.6 340.0 255.8 232.4 529.8 224;6 Set 3 425.8 10.4 2,963.0 1,353.5 4,341.6 796.8 552.6 416.6 822.1 325.0 Set 4 15,838.5 153.1 3,831.8 2,009.6 8,033.3 1,298.3 1,053.7 680.4 1,220.2 464.2

arp

F3wxediigs of the 17th International Caferace on Very Large Data Bases

292

Barcelona. September, 1991


I

+zG--'
Figure 1: Conflict Set vs Stack

'
Figure 2: Execution Time

Pmeeedhga of the 17th International Conference on Very Large Data Bases

293


Table 8 &Memory Tests Program/Match Alg. manners TREAT: Success &iJ Total LEAPS: Success @iJ Total
Waltz

Set 1 194,806 43.899 238,705 9,996 4,001 13,997 5,772,411 m, I 842,120 23.222522 24,064,642 64,310,313 83.128,182 147,438,495 10,984,240 ms I

Set 2 4,246,944 4%%% ,, 142,551 26.226 168,777 23,048,007 167.994.682 191X)42,689 3,312,354 93.603,118 96,915,472 204,316,661 263.469.974 467,786,635 35885,166 $i!-%E, I 19,333,608 18.384.427 37,718,035 16,192,185 13.317.283 29,509,468 32,094,819 m 11:688:013 2M 7 7

Set 3 126,836,779 13%%% 2:180'071 192:883 2,372,954 49,2(X31 1 359s50.007 409,054,381 7,037,370 200.4259822 207,463,192 423,272,961 $45.241,334 968,514,295 75,111,356 l%%% 7 9 37,564,688 41.594.167 79,158,855 32,105,485 $??%-% 1I 51,553,953 izs%% 17,777,300 4$%#33 1

Set 4 e139,524,711 20.165.431 4,189,690,142 3;";;9;;: 35,652,006 88600,707 649.172.207 737,772,914 12,638,004 36 1.490.547 374,128,551 721,179,213 928.442.262 1,649,621,475 128,662,810 122.415.631 25 1,078,441 62,782,968 89.466.307 152,249,275 53,921,185

TREAT LEAPS: waltzdb TREAT: LEAPS: weaver TREAT: LEAPS: arp

Success &iJ Total Success Fail Total Success Fail Total Success Fail Total Success &liJ Total Success m Total Success FJg Total Success m Total

12,837,799 12.333.757 25,171,556 10,295,048 l!%% I 9 12,471,770 *w 5:166:769 +B% 1 1

TREAT: LEAPS:

25,781,141 m, ,

compilers that do not gather statistics. Therefore, the times provide a good indication of the state of the art in main memory rule-basedsystemexecution, The largestdatabase size in the applications was 11,048 for waltzdb. Its program took 1,298.3 secondsto complete. All of the timing data was derived from runs on Sun SPARCstation I+ workstations with 64MBytes of main memory and local disks. Most of the time spent in these systemsgoes into the join work. Joins are performed on the a-memory. Table 8 shows how many times an o-memory element is touched in the life of the program, The systemswe useddo not provide any attribute value basedindexing on a-memory. For large scale databaseproblems this should prove useful. To gain some insight into how useful it might bc, we counted both successfuland unsuccessfultests(test of an a-memory element is a test of its value on the join attribute). The successful tests provide a measure of the minimum amount of work that must be done by the joins. By dividing the successfultestsby the total testswe can also arrive

at an overall join selectivity for the programs. Table 7 shows the join selectivity data. We were somewhat surprised by the high selectivity. However, these databases are set up to provide specific knowledge to the expert systcm. This may not be the caseat all when an expert system is used on a general purpose database.In that situation, join selectivity may be considerably lower. 5 Conclusion

Several clear observations are possible based on the data gathered.First, a-memory that is not sharedmay lead to excessivespacerequirementsand updatecosts. Second, the conflict set for some applications does become problematic as the databasegrows, in that it exhibits a non-linear space behavior, Third, the stack size in LEAPS remainsextremely low and constant for most applications. And, fourth, the lazy matching strategy of LEAPS provides significant spcedup that improves as the problem size increases.

Proceedings of the 17th International Conference on Very Large Data Bases

294

Barcelona. September, 1991


Table 9 AverageCompositeJoin Selectivity Program manners
Waltz

Match Alg.
TREAT LEAPS TREAT LEAPS TREAT iEAPS `-EAT LEAPS TREAT LEAPS

Set 1
0.82 0.71 0.12 0.03 0.44 0.51 0.51 0.54 ::iZ

Set 2
0.92 0.84 0.12 0.03 0.44 0.51 0.51 0.55 0.59 0.57

Set 3
0.97 0.92 0.12 0.03 0.44 0.51 0.47 0.56 0.57 0.53

Set 4
0.99 0.96 0.12 0.03 0.44 0.51 0.41 0.55 0.55 0.50

waledb
weaver arp

Table 10 Relative Cost/BenefitRanking of Rule Knowledge -Knl:,-?edgeType Space-TimeCost ExpectedBenefit
Concltion Relationship (CR) Memory Support (MS) Condition Membership (CM) Conflict Set Support (CS) Resolution Support (RS) very high high low very low high high

high low

moderate high

Basedon theseresults we can complete the cost/benefit matrix from Table 1. Recall that the values for connict set support and resolution support were left as question marks. Table 10 shows the completed comparison. The high cost of conflict set support is derived from a comparison of its space and effects on execution time to the LEAPS-based approach. Since conflict set support does enableTREAT to execute faster than Rete-basedsystems, its benefit is rated as moderate. From the table we conclude that two of the knowledge types, condition membcrship and resolution support, are clearly desired for an integrated database/rule system. These two obviate the need for condition relationship and conflict set support. Therefore, we are now turning our attentions to the last one - memory support. We feel certain that a more effective replacement for memory support can be developed. The key will be to find ways of providing the information contained in the a-memory without relying on the accompanying space.Both TREAT and LEAPS are examplesof this reasoning successfully applied to the problem of integrating rule systemswith databases.

6 References
Bein, J., R. King, and N. Kamel, "MOBY: An Architecture for Distributed Expert DatabaseSystems," Proceedings of the 13th VLDB Conference,Brighton, 1987. Blakeley, J.A. and N.L. Martin, "Join Index, Materialized View, and H brid-Hash Join: A PerformanceAnalysis", Procee(Y of the Sixrh Imernutional Conference ing on Dara Engineering, pp. 256-263, February, 1990. Cohen, D., "Compiling Complex Database Transition Tri gers," Proc. ACM SKMOD Conference,pp. 225 23$ Portland, Oregon, May, 1989.

Delcambre, L.M.L. and Etheredge, J.N., "The Relational Production Language: A Production Lan age for Relational Databases",Proceedingsof the Pecond Internan'onal Conferenceon Expert Database Systems, 1988. Forgy, C., "OPS5 User's Manual", Technical Reporr CMU-CS-81-135,Carnegie-Mellon University, 1981. Forgy, C., "RETE: A Fast Match Algorithm for the Many Pattern/Many Object Pattern Match Problem," Artificial Intelligence, no. 19, pp. 17-37, 1982. Joobbani, R., and D.P. Siewiorek, "WEAVER: A Knowledge-BasedRouting Expert," IEEE Design and Tesf of Compurers,pp, 12-23, Feb,, 1986. Kieman, G. , C. de-Maindriville and E. Simon, "Making Deductive Databasea Practical Technology: A Step Forward," Report No. 1153, Institute National de Re&e;he en Informatique et en Automatique, January, : McDermott J., A. Newell, and J. Moore, "The Efficiency of Certain Production System Implementations," In Parlern-directed Inference Systems,D. Watermanand E Hayes-Roth (eds.), Academic Press, 1978. Miranker, D., TREAT: A New and Efficient Match Algorithm for Al Production Systems, Pittman/Morgan Kaufman, 1990. Miranker; D., B.J. Lofaso, G. Farmer, A. Char&a, and D. Bran& "On a TREAT BasedProduction System Compiler," Proceedingsof the 10th Inrernalional Conferenceon Expert Systems, Avignon, France, 1990. Miranker, D.P.,and D.A. Brant, "An Algorithmic Basis for Integrating Production Systems and Database Systems," ProceedingsSixth international Conferenceon Dala Engineering, February, 1990. Oflazer, K., "Partitioning and Parallel Processing of Production Systems," Dept. of Computer Scicncc, Carnegie-Mellon University, 1986.

Proceedings of the 17th International Cmference on Very Large Data Bases

295

Barcelona, September, 1991


Perk, M.W., "The Match Box Algorithm for Parallel Production SystemMatch," Departmentof Computer Science,Carnegie Mellon University, CMU-CS-89-163, May, 1989. Raschid, L., T. Sellis, and C-C Lin, "Exploiting Concurrency in a DBMS Implementatian for Production Systems," Proceedings of the Internalional Symposium on Databases in Parallel and Distributed Systems, 1988. Srivastava,J., K-W Hwang and J. S. Tan, "Parallelism in DatabaseProduction Systems," Proceeding Sixth International Conference on Data Engineering, pp, 121-128, 1990. Tan, J.S.Eddy, M. Maheshwari, and J. Srivastava, "GridMatch: A Basis for Integrating Production Systems with Relational Databases,"CS TR 90-14, Computer ScienceDept. Univ. of Minnesota, 1990. Wang,C. K., "Rime: A Match Algorithm for MRL," Dept. of Computer Sciences, The University of Texas at Austin, unpublished manuscript, November, 1989. Wang, Y-W, and E. Hanson, "A PerformanceComparison of the Rete and TREAT Algorithms for Testing DatabaseRule Conditions," Wright StateUniversity, Technical Report WSU-CS-90-18, 1990. Winston, P., Arfificial Inlelligence, Adison-Wesley Publishing Co., Reading, Mass., 1984.

Proceedings of the 17th International Conference on Very Large Data Bases

296

Barcelona, September, 1991