LucidDbJoinOptimization

Introduction

This document provides an architectural overview of how LucidDB join queries are optimized. The goal of query optimization is to generate a query plan where operations are performed as early as possible in order to minimize the amount of data that needs to be processed during later phases of the plan. This document describes the techniques used to achieve that goal. Specifically, the following topics are discussed:

  • How filters are pushed to minimize rows processed
  • How projections are pushed in a query to minimize columns read
  • How star joins are optimized
  • How join ordering is determined
  • How outer joins are optimized
  • How redundant joins are removed

Determining whether star joins should be used and the optimal join ordering are cost-based decisions. All other aspects of join optimization are determined using heuristics. The cost-based components utilize data statistics. This document will also describe how those statistics are used.

Filters

Filters can be specified in a query in either the WHERE or the ON clause. The goal of the optimizer is to push these filters as close as possible to the nodes they apply to. In other words, filters that reference only a single table should be pushed down to the table the filter references. Likewise, filters that reference multiple tables, i.e., join filters, should be pushed down to the join node that encompasses only the tables referenced in the join filter. This way, input sizes are reduced as processing continues further up the query tree.

In order to push filters, the LucidDB optimizer first needs to decompose the filters into a list of sub-filters that can be AND'd together. That way, each sub-filter can be pushed independently of the other sub-filters in the list.

In the case where the sub-filter originates from an ON clause, the sub-filter does one of the following:

  • Gets pushed into the left join input
  • Gets pushed into the right join input
  • Remains in the join node where the filter originated

If the sub-filter originates from the WHERE clause, then with respect to the join node directly beneath the WHERE clause, the sub-filter does one of the following:

  • Gets pushed into the left join input
  • Gets pushed into the right join input
  • Gets pushed into the join node
  • Remains in the WHERE clause

Whether a filter can be pushed depends on where the filter originated and on whether a join input is null-generating. The right input in a LEFT OUTER JOIN is null-generating, whereas for a RIGHT OUTER JOIN, it's the left input. For a FULL OUTER JOIN, both inputs are null-generating.

The following table describes the push-down rules. These rules ensure correct query semantics.

Filter Pushing Rules

  For a sub-filter originating in the WHERE clause:
    • Push to the left join input -- the left input is not null-generating, and the filter references only tables in the left input
    • Push to the right join input -- the right input is not null-generating, and the filter references only tables in the right input
    • Push into the join -- the filter cannot be pushed to either the left or the right input, and the join is an INNER join
    • Filter stays where it is -- none of the other conditions apply

  For a sub-filter originating in the ON clause:
    • Push to the left join input -- the right input is not null-generating, and the filter references only tables in the left input
    • Push to the right join input -- the left input is not null-generating, and the filter references only tables in the right input
    • Push into the join -- same as "filter stays where it is"
    • Filter stays where it is -- none of the other conditions apply


As filters get pushed down in the query tree, these rules are recursively applied until no further pushing is possible.
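As a concrete illustration, consider the following sketch over two hypothetical tables T1 and T2 (the table and column names are made up for this example). Applying the WHERE clause rules above to a LEFT OUTER JOIN, where the right input is null-generating:

  SELECT *
     FROM T1 LEFT OUTER JOIN T2 ON T1.k = T2.k
     WHERE T1.a = 5            -- references only the non-null-generating left input: pushed down to T1
        AND T2.b = 7           -- references only the null-generating right input: stays in the WHERE clause
        AND T1.c + T2.d > 10;  -- references both inputs and the join is not INNER: stays in the WHERE clause

Had the join been an INNER join, all three sub-filters could have been pushed: the first two into their respective inputs, and the third into the join node.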

Projections

Push Down

There are actually two goals the optimizer attempts to achieve with respect to projections. The obvious one is to push projections down so that only the columns actually referenced in a query are read during table scans. This entails pushing projections past the following operations:

  • unions
  • intersects
  • except
  • joins
  • filters

Pushing projections entails examining all expressions referenced in either the projection list or filters in the query, and extracting all the individual columns referenced within the expressions. Each of those column references is then pushed down to the table that the column originates from. As a result, the only columns read from each table are either those that appear in the final projection list of the query, or those that are referenced in filters applied after the table scan. Likewise, once a join has been processed, then we can further project the result of the join to exclude columns that are no longer referenced in subsequent joins or the final projection list.
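For example, in the hypothetical query below (table and column names are made up), only the columns actually referenced need to be read from each table:

  -- T1 and T2 are hypothetical tables with many more columns than those shown.
  SELECT T1.a, T2.x + T2.y
     FROM T1, T2
     WHERE T1.b = T2.z;

T1 is scanned with the projection (a, b) and T2 with the projection (x, y, z). Once the join on b = z has been processed, the result can be further projected down to a, x, and y, since b and z are no longer referenced.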

Pull Up

The other goal of optimization, with respect to projections, is to pull up projections so queries referencing views and subqueries in the FROM clause can be optimized as a single N-way join, instead of multiple M-way joins, where M < N. This allows the optimizer to consider a wider range of join orderings. A view definition always contains a projection. Therefore, once the view reference in the query is unfolded, the query tree contains a projection in the middle of the tree. By pulling up that projection, the joins referenced in the view can be combined with the joins in the rest of the query.

The following diagram illustrates this. By pulling up the projection that results after unfolding the view, the optimizer can now determine the optimal join ordering for a single 3-way join, instead of two 2-way joins. In other words, T2 and T3 do not need to be joined before either is joined with T1.

   Project           Project                  Project
     |                 |                        |
    Join              Join                     Join
   /    \    ==>     /    \          ==>      /    \
  T1     V          T1   Project             T1   Join
                            |                    /    \
                          Join                  T2    T3
                         /    \
                        T2    T3

The same idea applies to subqueries in the FROM clause. The projection list in the subquery can also be pulled up.
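For example (a hypothetical view and tables, mirroring the diagram above), given the view below, a query that joins T1 with the view can be optimized as a single 3-way join once the view's projection is pulled up:

  CREATE VIEW V AS
     SELECT T2.a, T3.b FROM T2, T3 WHERE T2.k = T3.k;

  SELECT T1.x, V.a, V.b FROM T1, V WHERE T1.y = V.a;

After V is unfolded and its projection is pulled up, the optimizer is free to consider orderings in which T1 joins with T2 or T3 first, rather than always joining T2 and T3 before T1.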

Pulling up projections is also beneficial because, if a nested projection contains a complex expression, evaluation of that expression can be deferred until the final result row is constructed.

     Join                         Project(references from X, F(C1,C2))
    /    \                  ==>         |
   X   Project(F(C1,C2))               Join
          |                           /    \
          Y                          X   Project(C1,C2)
                                            |
                                            Y

There is one case, however, where projections aren't pulled up, which is when the projection originates from a null-generating input in an outer join. For example, none of the following projections can be pulled up:

   Left Outer Join        Right Outer Join        Full Outer Join
      /     \                 /      \               /      \
     X    Project          Project    Y          Project   Project
             |                |                     |         |
             Y                X                     X         Y

Otherwise, if these projections contain expressions involving columns from the null-generating input, deferring evaluation of such an expression past the outer join would evaluate it on the generated null values, and an incorrect query result could be returned.

Note that if a pulled-up projection expression is later referenced in a join filter, the expression ends up being evaluated twice. In the example below, pulling up the projection expression b*10 causes it to be evaluated twice, once in the join filter and then again in the final projection.

  select a, bb, c from A, (select b*10 as bb, c from B, C where b = c)
      where a = bb;
           Project(a,b*10,c)                   Project(a,bb,c)
                   |                                 |
              Join(a=b*10)                       Join(a=bb)
              /         \                       /          \
      Project(a)  Join(b=c)                 Project(a)  Project(b*10,c)
          |       /       \                     |               |  
          A   Project(b)  Project(c)            A           Join(b=c)
                 |          |                               /       \
                 B          C                         Project(b)   Project(c)
                                                          |            |
                                                          B            C
         Query Tree With Pull Up             Query Tree Without Pull Up

We need to do the pull up in these cases because otherwise, as noted earlier, the join combinations available to the optimizer are limited, which can result in cartesian product joins. Consider the following example:

  select aa, b, c from (select a*10 as aa from A), B, C
      where aa = c and b = c;
           Project(a*10,b,c)                   Project(aa,b,c)
                   |                                 |
           Join(a*10=c,b=c)                    Join(aa=c,b=c)
           /         \                          /          \
        Join          C                  Project(a*10,b) Project(c)
       /    \                                  |            |
Project(a) Project(b)                        Join           C
   |          |                             /    \
   A          B                     Project(a)  Project(b)
                                       |           |
                                       A           B
                                                        
 Query Tree With Pull Up                 Query Tree Without Pull Up

By not pulling up the projection containing the expression a*10, the optimizer is forced to join tables A and B together before joining with C. Since there are no join filters between A and B, the join between those two tables ends up being a cartesian join.

Star Joins

Semijoins

LucidDB optimizes star joins by utilizing semijoins. The idea here is if you have a join between a fact table and a dimension table, and the dimension table is filtered, then you need not read the entire fact table. Instead, you only need to read the subset of the fact table that will join with the filtered subset from the dimension table. The reduced subset of the fact table, which is the result of the semijoin, can then be joined with the dimension table to retrieve the necessary columns from that table.

For example, if sales is the fact table and product is the dimension table,

  SELECT * from sales s, product p
     WHERE s.product_id = p.id and p.size = 'S';

then using a semijoin on the above query would result in reading only those rows from the sales table that join with product rows that are small in size.

LucidDB processes semijoins by doing a series of bitmap index lookups. Therefore, there must be a bitmap index defined on the columns in the fact table that join with the dimension table. The distinct values of the dimension columns that join with the fact table serve as lookup keys into the index scan. In other words, one bitmap index lookup is done for each distinct set of dimension column join values. The rids returned from these lookups are then merged together, forming the reduced list of rids from the fact table.
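For the sales/product example above, this means the fact table join column must be indexed. A minimal sketch of the DDL is shown below; the index names are chosen to match the ones that appear in the EXPLAIN PLAN output later in this document, and the definitions would need to be adapted to your schema:

  -- Index on the fact table join column (on LucidDB column-store tables this is a bitmap index);
  -- required for the semijoin.
  CREATE INDEX i_sales_pid ON sales(product_id);
  -- Index supporting the filter on the dimension table (optional, but used in the plans below).
  CREATE INDEX i_product_size ON product(size);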

The stream of lookup keys from the dimension table that feeds into the fact table's index scan is created as follows:

  1. First, the columns corresponding to the dimension join keys are projected from the result of the filtered dimension table scan. In doing so, if necessary, the dimension columns are cast so their types match the types of the fact table join columns.
  2. Furthermore, any null keys are filtered out. Otherwise, the index lookup will match on null keys in the fact table. If the columns are non-nullable, then filtering is not required.
  3. Then, duplicate key values are removed to avoid redundant lookups. This can be skipped if the keys correspond to either the primary key or a non-null unique constraint defined on the dimension table.
  4. Finally, the keys are sorted. That way, the index lookups are localized.

Note also that semijoins are only used when you have an INNER equality join between two inputs. In general, neither the fact table nor the dimension table in a semijoin can be a null-generating input of any outer join within the query. The one exception is if the dimension table is the result of a full outer join.

Multiple Dimension Tables

If a fact table joins with multiple dimension tables, then it's possible to use each of those dimensions to filter the fact table. In the example below, the semijoin limits the rows scanned from sales to the rows that join with products that are small in size and with salespersons that are at least 30 years old.

  SELECT * FROM sales s, product p, salesperson sp
     WHERE
        s.product_id = p.id AND
        s.salesperson_id = sp.id AND
        p.size = 'S' AND sp.age >=30;

Bitmap indexes must exist on sales.product_id and sales.salesperson_id. The intersection of the rids returned from each of those index lookups is the subset of qualifying rows from sales.

Semijoins and Table Filters

It's also possible to combine the lookup used to process a semijoin with other index lookups on the fact table, by taking the intersection of the rids returned from the semijoin lookups with the rids returned from the regular index lookups. In the example below, if there's also an index on sales.id, the index lookup used to filter that column can be intersected with the index lookup for the semijoin with product.

  SELECT * FROM sales s, product p
     WHERE s.product_id = p.id AND p.size = 'S'
     AND s.id < 100;

Chained Semijoins

Semijoins can also be used to process snowflake joins. Here's the scenario: dimension table dimA joins with a filtered dimension table, dimB, and dimA also joins with the fact table. A semijoin can be used to filter both the fact table and dimA. In the example below, customer is dimA, state is dimB, and sales is the fact table.

  SELECT * FROM sales s, state st, customer c
     WHERE s.customer_id = c.customer_id AND c.city = st.city
        AND st.state = 'New York';

The filtered scan from state feeds into an index scan on customer.city. That constitutes the first semijoin. The result of that semijoin is then chained into a second semijoin that does an index scan on sales.customer_id.

IN Filters

Semijoins can also be used to process IN filters. If your query contains an IN filter with more than 20 elements, LucidDB converts the query into an equivalent join query. By converting the large IN list into a join, LucidDB avoids any internal limits in the expression evaluator.

Provided the appropriate index is available, the join query can then be processed using a semijoin. The elements in the IN list are aggregated to remove duplicates, sorted for locality of reference, and then fed into an index scan on the filtering column.

Similarly, if a query contains a "<column> IN <subquery>" filter, then that filter can also be processed using a semijoin. The result of the subquery is aggregated, sorted, and fed into an index scan on the filtering column.

How the Optimizer Decides Which Semijoins To Use

As described in the section on chained semijoins, it is possible to apply a sequence of semijoins to a single table. The LucidDB optimizer uses a combination of cost-based and heuristic techniques to limit the semijoin chains it considers. The algorithm is as follows:

  1. Sort the tables by the cost of scanning each table
  2. Iterate through each table in descending cost order
  3. Look for the best semijoin available to filter that table
  4. If none exists, move to the next table
  5. If a semijoin is available, assign the semijoin to the table, go back to step 1, and recompute the costs of scanning each table. The cost should now take into consideration the semijoin that was just chosen, as well as previous ones chosen. Note that the addition of semijoins can significantly reduce the cost of scanning large tables, resulting in those tables appearing earlier in the new sorted list.

The outer loop iterates a maximum of 10 times, so the longest chain that would ever be produced by the optimizer is of length 10.

When deciding whether a semijoin should be used on a particular fact table, the optimizer considers whether the semijoin reduces the size of the fact table sufficiently to outweigh the additional cost of scanning the dimension table. It assigns a score to each candidate semijoin. The score is the ratio of the number of rows from the fact table that don't need to be read (due to use of the semijoin) over the cost of scanning the dimension table. If the score exceeds a fixed threshold value, then the semijoin is a viable candidate. Therefore, if the fact table isn't sufficiently filtered, or the cost of the dimension scan is too high, then the semijoin is not considered. The semijoin with the largest score is the one that's chosen for that table.
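For example, with made-up numbers: if a candidate semijoin is estimated to eliminate 900,000 of a fact table's 1,000,000 rows, and the cost of scanning the filtered dimension table is 50,000, then

  score = (fact table rows that don't need to be read) / (cost of scanning the dimension table)
        = 900,000 / 50,000
        = 18

That score is then compared against the fixed threshold, and against the scores of any other candidate semijoins on the same fact table.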

The consequence of these algorithms is that the following semijoins are considered:

  • Small table (with constraints) filters medium table (without constraints), which filters big table
  • Small table (with constraints) filters big table (without constraints), which filters medium table

But the following likely won't be:

  • Big table (with constraints) filters small table (without constraints), which filters medium table

because the cost of scanning the big table will be large and the savings for the small table will be low. This results in a low score that is likely below the fixed threshold. Given that the first semijoin in the chain (big filters small) isn't chosen, the second one (the reduced small table filtering medium) doesn't even come into the picture.

Identifying Semijoins in EXPLAIN PLAN

Basics

The goal of this section is not to explain every detail of the EXPLAIN PLAN output for a semijoin, but rather to describe enough so that you can use EXPLAIN PLAN to verify whether or not your query is using semijoins. The easiest way to describe this is through an example. If you're not familiar with some of the basic constructs in the EXPLAIN PLAN output, see Diagnosing Slow Queries Using Explain Plan#Run EXPLAIN PLAN and Interpret the Output.

Consider the following query:

   EXPLAIN PLAN FOR
   SELECT * from sales s, product p
    WHERE s.product_id = p.id and p.size = 'S';

This is what the explain output looks like. Note that the right-hand side is truncated for readability.

'FennelToIteratorConverter'
'  FennelReshapeRel(projection=[[0, 1, 2, 3, 4, 5, 6, 7, 8]], outputRowType=[RecordType(INTEGER SID, ...
'    LhxJoinRel(leftKeys=[[1]], rightKeys=[[4]], joinType=[INNER])'
'      LcsRowScanRel(table=[[LOCALDB, SJ, SALES]], projection=[*], clustered indexes ...
'        LcsIndexMergeRel(consumerSridParamId=[0], segmentLimitParamId=[0], ridLimitParamId=[1])'
'          LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_PID], projection=[*], ...
'            FennelSortRel(key=[[0]], discardDuplicates=[false])'
'              FennelReshapeRel(projection=[[0]], outputRowType=[RecordType(INTEGER ID) NOT NULL])'
'                LcsRowScanRel(table=[[LOCALDB, SJ, PRODUCT]], projection=[[0]], clustered indexes ...
'                  LcsIndexSearchRel(table=[[LOCALDB, SJ, PRODUCT]], index=[I_PRODUCT_SIZE], ... 
'                    FennelValuesRel(tuples=[[{ '[', 'S', ']', 'S' }]])'
'      FennelReshapeRel(projection=[[0, 1, 2, 3, 0]], outputRowType=[RecordType(INTEGER NOT NULL ID, ...
'        LcsRowScanRel(table=[[LOCALDB, SJ, PRODUCT]], projection=[*], clustered indexes ...
'          LcsIndexSearchRel(table=[[LOCALDB, SJ, PRODUCT]], index=[I_PRODUCT_SIZE], projection=[*], ...
'            FennelValuesRel(tuples=[[{ '[', 'S', ']', 'S' }]])'

The operations (or RelNodes) relevant to the semijoin are the ones walked through in the list below. Note that product.id is a primary key.

Working bottom-up, here's what each of the RelNodes corresponds to. The first three RelNodes execute the filtered scan on the dimension table, product, projecting out the id column.

  • FennelValuesRel - Passes in the value 'S' for the size filter that is used for the index search
  • LcsIndexSearchRel on PRODUCT - Index lookup to process the size filter
  • LcsRowScanRel on PRODUCT - Reads the rows filtered by the index lookup, returning only the PRODUCT_ID column. Note that the projection is done as part of this row scan rather than post-row scan in the FennelReshapeRel that follows.
  • FennelReshapeRel - Casts the product.id column from not nullable to nullable. This is needed because sales.product_id happens to be nullable, whereas product.id is not nullable. Because product.id is not nullable, there's no need to do additional filtering to ignore rows where product.id is null.
  • FennelSortRel - Sorts the product.id column. Since the column is a primary key, there's no need to remove duplicates. Otherwise, there would have been an LhxAggRel feeding into the sort.
  • LcsIndexSearchRel on SALES - Performs the index lookup on sales.product_id using the product.id values.
  • LcsIndexMergeRel - Combines the individual lookups from the index scans on the sales index into a single list of rid values. This corresponds to the subset of the sales table that needs to be read.
  • LcsRowScanRel on SALES - Reads the filtered subset of rows from the fact table

Other Examples

Now that you understand the basics, here are some more complex examples.

Multiple Dimension Tables Example

In this example, two separate semijoins are intersected together, as described above.

  EXPLAIN PLAN FOR
  SELECT * FROM sales s, product p, salesperson sp
    WHERE
       s.product_id = p.id AND
       s.salesperson_id = sp.id AND
       p.size = 'S' AND sp.age >=30;
'FennelToIteratorConverter'
'  FennelReshapeRel(projection=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]], ...
'    LhxJoinRel(leftKeys=[[2]], rightKeys=[[3]], joinType=[INNER])'
'      FennelReshapeRel(projection=[[0, 1, 2, 3, 4, 5, 6, 7, 8]], outputRowType=[RecordType(INTEGER ...
'        LhxJoinRel(leftKeys=[[1]], rightKeys=[[4]], joinType=[INNER])'
'          LcsRowScanRel(table=[[LOCALDB, SJ, SALES]], projection=[*], clustered indexes ...
'            LcsIndexIntersectRel(startRidParamId=[2], rowLimitParamId=[3])'
'              LcsIndexMergeRel(consumerSridParamId=[2], segmentLimitParamId=[3], ridLimitParamId=[1])'
'                LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_SP], projection=[*], ...
'                  FennelSortRel(key=[[0]], discardDuplicates=[false])'
'                    FennelReshapeRel(projection=[[0]], outputRowType=[RecordType(INTEGER ID) ...
'                      LcsRowScanRel(table=[[LOCALDB, SJ, SALESPERSON]], projection=[[0]], clustered indexes ...
'                        FennelValuesRel(tuples=[[{ '[', 30, '+', null }]])'
'              LcsIndexMergeRel(consumerSridParamId=[2], segmentLimitParamId=[3], ridLimitParamId=[4])'
'                LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_PID], projection=[*], ...
'                  FennelSortRel(key=[[0]], discardDuplicates=[false])'
'                    FennelReshapeRel(projection=[[0]], outputRowType=[RecordType(INTEGER ID) ...
'                      LcsRowScanRel(table=[[LOCALDB, SJ, PRODUCT]], projection=[[0]], clustered indexes ...
'                        LcsIndexSearchRel(table=[[LOCALDB, SJ, PRODUCT]], index=[I_PRODUCT_SIZE], ...
'                          FennelValuesRel(tuples=[[{ '[', 'S', ']', 'S' }]])'
'          FennelReshapeRel(projection=[[0, 1, 2, 3, 0]], outputRowType=[RecordType(INTEGER NOT NULL ID, ...
'            LcsRowScanRel(table=[[LOCALDB, SJ, PRODUCT]], projection=[*], clustered indexes ...
'              LcsIndexSearchRel(table=[[LOCALDB, SJ, PRODUCT]], index=[I_PRODUCT_SIZE], ...
'                FennelValuesRel(tuples=[[{ '[', 'S', ']', 'S' }]])'
'      FennelReshapeRel(projection=[[0, 1, 2, 0]], outputRowType=[RecordType(INTEGER NOT NULL ID, ...
'        LcsRowScanRel(table=[[LOCALDB, SJ, SALESPERSON]], projection=[*], clustered indexes ...
'          FennelValuesRel(tuples=[[{ '[', 30, '+', null }]])'

The first LcsIndexMergeRel subtree (the search on I_SALES_SP, fed by the filtered scan on SALESPERSON) corresponds to the semijoin on salesperson, whereas the second (the search on I_SALES_PID, fed by the filtered scan on PRODUCT) corresponds to the semijoin on product. The LcsIndexIntersectRel is the RelNode that intersects the index scans from the two semijoins.

Note that an index is not used to process the age filter on salesperson. The filter is applied as part of the table scan.

Semijoin and Table Filter Example

In this example, two index lookups are used -- one for a semijoin and the second for filtering. See Semijoins and Table Filters for details.

  EXPLAIN PLAN FOR
  SELECT * FROM sales s, product p
    WHERE s.product_id = p.id AND
       p.size = 'S' and s.id < 100;
'FennelToIteratorConverter'
'  FennelReshapeRel(projection=[[0, 1, 2, 3, 4, 5, 6, 7, 8]], outputRowType=[RecordType(INTEGER SID, ...
'    LhxJoinRel(leftKeys=[[1]], rightKeys=[[4]], joinType=[INNER])'
'      LcsRowScanRel(table=[[LOCALDB, SJ, SALES]], projection=[*], clustered indexes ...
'        LcsIndexIntersectRel(startRidParamId=[2], rowLimitParamId=[3])'
'          LcsIndexMergeRel(consumerSridParamId=[2], segmentLimitParamId=[3], ridLimitParamId=[1])'
'            LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_PID], projection=[*], ...
'              FennelSortRel(key=[[0]], discardDuplicates=[false])'
'                FennelReshapeRel(projection=[[0]], outputRowType=[RecordType(INTEGER ID) NOT NULL])'
'                  LcsRowScanRel(table=[[LOCALDB, SJ, PRODUCT]], projection=[[0]], clustered indexes ...
'                    LcsIndexSearchRel(table=[[LOCALDB, SJ, PRODUCT]], index=[I_PRODUCT_SIZE], ...
'                      FennelValuesRel(tuples=[[{ '[', 'S', ']', 'S' }]])'
'          LcsIndexMergeRel(consumerSridParamId=[2], segmentLimitParamId=[3], ridLimitParamId=[4])'
'            LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_ID], projection=[*], ...
'              FennelValuesRel(tuples=[[{ '(', null, ')', 100 }]])'
'      FennelReshapeRel(projection=[[0, 1, 2, 3, 0]], outputRowType=[RecordType(INTEGER NOT NULL ID, ...
'        LcsRowScanRel(table=[[LOCALDB, SJ, PRODUCT]], projection=[*], clustered indexes ...
'          LcsIndexSearchRel(table=[[LOCALDB, SJ, PRODUCT]], index=[I_PRODUCT_SIZE], projection=[*], ...
'            FennelValuesRel(tuples=[[{ '[', 'S', ']', 'S' }]])'

The first LcsIndexMergeRel subtree (the search on I_SALES_PID, fed by the filtered scan on PRODUCT) corresponds to the semijoin with product, whereas the second (the search on I_SALES_ID) corresponds to the index lookup on sales.id. The LcsIndexIntersectRel intersects the two index scans.

Chained Semijoin Example

See the section above for a discussion of how chained semijoins work. Using that example, here's what the explain output for a chained semijoin looks like.

  EXPLAIN PLAN FOR
  SELECT * FROM sales s, state st, customer c
    WHERE s.customer_id = c.customer_id AND c.city = st.city
       AND st.state = 'New York';
'FennelToIteratorConverter'
'  FennelReshapeRel(projection=[[0, 1, 2, 3, 4, 8, 9, 5, 6, 7]], outputRowType=[RecordType(INTEGER SID, ...
'    LhxJoinRel(leftKeys=[[3]], rightKeys=[[5]], joinType=[INNER])'
'      LcsRowScanRel(table=[[LOCALDB, SJ, SALES]], projection=[*], clustered indexes ...
'        LcsIndexMergeRel(consumerSridParamId=[0], segmentLimitParamId=[0], ridLimitParamId=[2])'
'          LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_CUST], projection=[*], ...
'            FennelSortRel(key=[[0]], discardDuplicates=[false])'
'              FennelReshapeRel(projection=[[0]], outputRowType=[RecordType(INTEGER CUSTOMER_ID) NOT NULL])'
'                LcsRowScanRel(table=[[LOCALDB, SJ, CUSTOMER]], projection=[[0]], clustered indexes ...
'                  LcsIndexMergeRel(consumerSridParamId=[0], segmentLimitParamId=[0], ...
'                    LcsIndexSearchRel(table=[[LOCALDB, SJ, CUSTOMER]], index=[I_CUSTOMER_CITY], ...
'                      FennelSortRel(key=[[0]], discardDuplicates=[false])'
'                        LcsRowScanRel(table=[[LOCALDB, SJ, STATE]], projection=[[0]], clustered indexes ...
'                          FennelValuesRel(tuples=[[{ '[', 'New York            ', ']', 'New York            ' }]])'
'      FennelReshapeRel(projection=[[0, 1, 2, 3, 4, 0]], outputRowType=[RecordType(INTEGER NOT NULL ID, ...
'        LhxJoinRel(leftKeys=[[2]], rightKeys=[[0]], joinType=[INNER])'
'          LcsRowScanRel(table=[[LOCALDB, SJ, CUSTOMER]], projection=[*], clustered indexes ...
'            LcsIndexMergeRel(consumerSridParamId=[0], segmentLimitParamId=[0], ridLimitParamId=[1])'
'              LcsIndexSearchRel(table=[[LOCALDB, SJ, CUSTOMER]], index=[I_CUSTOMER_CITY], ...
'                FennelSortRel(key=[[0]], discardDuplicates=[false])'
'                  LcsRowScanRel(table=[[LOCALDB, SJ, STATE]], projection=[[0]], clustered indexes ...
'                    FennelValuesRel(tuples=[[{ '[', 'New York            ', ']', 'New York            ' }]])'
'          LcsRowScanRel(table=[[LOCALDB, SJ, STATE]], projection=[*], clustered indexes ...
'            FennelValuesRel(tuples=[[{ '[', 'New York            ', ']', 'New York            ' }]])'

The subtrees in which the filtered scan on STATE feeds an index search on I_CUSTOMER_CITY represent the semijoin between customer and state. The subtree in which the resulting scan on CUSTOMER feeds the index search on I_SALES_CUST represents the chained semijoin on sales. Notice that the customer/state semijoin is done twice -- once to filter the customer table by itself, and then a second time to filter the customer table when it's used to filter the sales table. Although not shown in this example, LucidDB does support temporary materialization of query sub-results. So, if it makes sense from a cost perspective, it will avoid the second execution and simply read the result of the row scan on customer from a temporary buffer.

Notice also that the customer/state semijoin does not include a FennelReshapeRel. That's because CUSTOMER.CITY and STATE.CITY have the same type, and both happen to be non-nullable.

IN Filter Examples

Here's what an IN filter with a large number of elements in the IN list looks like, when it's processed using a semijoin.

 EXPLAIN PLAN FOR
 SELECT * FROM sales WHERE customer_id IN (1, 2, 3, ..., 100, 200, 300);
'FennelToIteratorConverter'
'  LcsRowScanRel(table=[[LOCALDB, SJ, SALES]], projection=[*], clustered indexes= ...
'    LcsIndexMergeRel(consumerSridParamId=[0], segmentLimitParamId=[0], ridLimitParamId=[1])'
'      LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_CUST], projection=[*], ...
'        FennelSortRel(key=[[0]], discardDuplicates=[false])'
'          FennelReshapeRel(projection=[[0]], outputRowType=[RecordType(INTEGER ROW_VALUE) NOT NULL])'
'            LhxAggRel(groupCount=[1])'
'              FennelValuesRel(tuples=[[{ 1 }, { 2 }, { 3 }, { 4 }, { 5 }, { 6 }, { 7 }, { 8 }, { 9 }, { 10 }, ...

In this case, the FennelValuesRel is used to pass along the elements in the IN list, and then the LhxAggRel is used to remove duplicates. The rest of the plan follows the same pattern seen in the earlier examples, where the distinct IN-list elements become the key values used in the index lookups on SALES.

Here's what the plan looks like when the filter is an IN <subquery> filter.

 EXPLAIN PLAN FOR
 SELECT * FROM sales WHERE customer_id IN (SELECT id FROM customer WHERE id < 10);
'FennelToIteratorConverter'
'  LcsRowScanRel(table=[[LOCALDB, SJ, SALES]], projection=[*], clustered indexes= ...
'    LcsIndexMergeRel(consumerSridParamId=[0], segmentLimitParamId=[0], ridLimitParamId=[1])'
'      LcsIndexSearchRel(table=[[LOCALDB, SJ, SALES]], index=[I_SALES_CUST], projection=[*], ...
'        FennelSortRel(key=[[0]], discardDuplicates=[false])'
'          FennelReshapeRel(projection=[[0]], outputRowType=[RecordType(INTEGER ID) NOT NULL])'
'            LhxAggRel(groupCount=[1])'
'              LcsRowScanRel(table=[[LOCALDB, SJ, CUSTOMER]], projection=[[0]], clustered indexes= ...
'                LcsIndexMergeRel(consumerSridParamId=[0], segmentLimitParamId=[0], ridLimitParamId=[2])'
'                  LcsIndexSearchRel(table=[[LOCALDB, SJ, CUSTOMER]], index= ...
'                    FennelValuesRel(tuples=[[{ '-', null, ')', 10 }]])'

The subtree rooted at the LcsRowScanRel on CUSTOMER represents processing of the subquery. Its result then becomes the set of key values used to filter SALES.

Cost-Based Join Ordering Selection

Having decided which semijoins to use, the next phase in join optimization is to determine the optimal join ordering. The most simplistic approach here would be to generate all possible join combinations. For an N-way join, this would be N-factorial possible combinations. This is expensive if N is large, and results in the optimizer wasting cycles generating join combinations that are clearly non-optimal, e.g., combinations involving cartesian joins.

The LucidDB optimizer uses a more pragmatic algorithm. The general idea is that N different join orderings are generated, one for each table in the join appearing as the first table in the ordering (subject to outer join and semijoin restrictions described in later sections). The order of the remaining tables is determined by repeatedly finding the best table to add from the tables that haven't yet been joined. Preference is given to tables that have an equality join with the tables already in the ordering. If there are multiple tables with equality join filters, then the table whose join keys have the highest cardinality is chosen first. This is repeated until all tables have been added to the join ordering. By first considering join combinations where a join condition exists between the tables, cartesian join combinations are only considered as a last resort.

When adding a new table to a join ordering, the optimizer has the option of either adding the table to the end of the ordering or in the middle, depending on which yields a lower cost plan. Adding the table to the end of the ordering means that the new table is being joined with the result of the current ordering. In the example below, C has been added to the end of the current join ordering, (A join B).

         Join                              Join
         /  \                             /    \
        A    B                          Join    C
                                        /  \
                                       A    B
Current Join Ordering    C Added to the End of the Current Join Ordering

When adding a table to the end of an ordering, if the new table has more rows than the number of rows in the current join ordering, then the new table becomes the left input in the new ordering, and the current join ordering becomes the right input. This way, if a hash join is used to process the join, the smaller input is the build side of the hash join. Or if a nested loop join is used, the temporary index is built on the smaller input.

So, in the example above, if C is larger than the result of the join between A and B, then the new join ordering is the one below, rather than the one shown above.

         Join
        /    \
       C    Join
            /  \
           A    B

It may be beneficial to push a table into the middle of a join if one of the inputs in the current join ordering contains all of the tables that the new table joins with. That way, that particular join will be done earlier in the ordering. If the new table doesn't join with any of the tables in the current join ordering, then the optimizer arbitrarily pushes the table into the left input. Pushing the new table into a join input, X, is done recursively. In other words, the optimizer now has a choice between adding the new table at the end of the join ordering corresponding to input X, or in the middle of that join. For example, if the current join ordering is the one shown below, and table D joins with table C, then the optimizer will consider pushing table D into the right input of JoinX.

       JoinX
       /   \
      A   JoinY
          /   \
         B     C

Table D can then either be added to the end of that join ordering (B join C), or it can be pushed into the right input of JoinY, since D joins with C.

The join on the left is the result of adding D to the end of JoinX's right input, while the one on the right is the result of pushing D to JoinY's right input. In both cases, assume D is the smaller table, so it appears on the right-hand side of the resulting joins.

      Join                 Join
     /    \               /    \
    A    Join            A    Join
        /    \               /    \
      Join    D             B    Join
     /    \                     /    \
    B      C                   C      D

The semijoins selected in the first phase will influence the cost of the plans generated in the second phase. From the N join orderings that are generated, the lowest cost one is selected.

Let's do a reality check on the algorithm just described, and see what type of join ordering would be generated if you have a star join with multiple dimension tables, all joining with a fact table. Let's consider the ordering for the case where the first table in the ordering is the fact table. When deciding which table to join next, the optimizer can choose from amongst each of the joining dimension tables. If all join with the fact table using equality join filters, then the decision on which to join first is based on which dimension table has the highest cardinality join keys. If the dimension join keys are unique, which generally is the case, then the largest dimension table will be chosen. Since it's generally smaller than the fact table, it appears on the right hand side of that first join. The next table chosen will be the next largest dimension table. Again, since it's smaller than the result of the join between the fact table and the largest dimension table, it appears on the right hand side of this new join. This repeats for the remaining dimension tables, and the resulting join is a left-deep join with the fact table as the first table, followed by the dimension tables ordered from largest to smallest.

Note that when adding D2 to (F join D1), the optimizer will consider pushing D2 into the middle of that join. But as you'll see in the next section, the cost of the plan ((F join D1) join D2) is the same as ((F join D2) join D1), so the first one takes precedence.

If you would like to see the join ordering that the optimizer has selected, execute EXPLAIN PLAN, and see Diagnosing Slow Queries Using Explain Plan#Run EXPLAIN PLAN and Interpret the Output for an overview on how to interpret the output.

Cost Functions

When optimizing join queries, the LucidDB optimizer utilizes cost functions as a basis for comparing one query plan against another. These cost functions compute estimates for the different relational expressions in a plan. A relational expression is a tree of relational operators (e.g., filter, group by, join, project, table scan). The cost functions use data statistics, if they're available; the accuracy of the estimates depends on the accuracy of the statistics. These functions all contribute towards estimating the number of rows returned by the different relational expressions, which is important because the cost of executing a particular query plan in LucidDB is primarily a measure of the number of rows processed by the plan.

For example, an estimate of the number of rows returned from a table scan without filters is in fact the total number of rows in the table; whereas an estimate of the number of rows returned from that same table scan but with filters is the number of rows in the table multiplied by a fractional value that estimates the percentage of rows that qualify the filter. If you have two joins that are identical except one has a full table scan as one of its inputs and the other has the table scan with filters instead, then the latter is less costly because it has the smaller join input. Therefore, this correlates with the optimizer's goal of filtering as much data as possible early on during processing of joins.

Data statistics are gathered by running ANALYZE TABLE. See LucidDbDataStorageAndAccess#Analyze Table for details on the types of statistics gathered.
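For example, a typical invocation to gather statistics on the tables used in the earlier examples might look like the following; see the linked page for the exact syntax and options:

  ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS;
  ANALYZE TABLE product COMPUTE STATISTICS FOR ALL COLUMNS;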

The following table summarizes some of the cost functions used by the optimizer, with an example of where each function is used, in relation to the join ordering selection algorithm described in the previous section.


LucidDB Cost Functions (Partial List)

  • Cumulative Cost -- Estimates the total cost of a relational expression. Example usage: comparing the cost of one join ordering against another.
  • Row Count -- Estimates the number of rows returned by a relational expression. Example usage: deciding which input is the left vs. the right in a join.
  • Selectivity -- Estimates the fraction of rows that qualify filters applied on a relational expression. Example usage: computing row counts for operations involving filtering.
  • Distinct Row Count -- Estimates the number of rows that would be produced by a GROUP BY on a set of columns, where the input into the GROUP BY is optionally filtered. Example usage: computing the row count estimate for a join.


For a given query, LucidDB caches the return values from calls into these cost functions. Therefore, a function does not need to be evaluated multiple times during a given optimization instance, provided the same inputs are being passed into the function.

In cases where statistics aren't available, the cost functions assign default guesstimates. How close these defaults are to the values that would have been computed from actual statistics influences the quality of the plans selected by the optimizer.

For example, if data distributions are unavailable for columns, the optimizer assigns a default selectivity of .15 for equality predicates and .5 for range (greater than, less than) predicates. Note that the IS NULL predicate is treated like an equality predicate, so its default selectivity is also .15.
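For example, given these defaults, a filter such as a = 5 AND b > 10 on a table with no column statistics is assigned a combined selectivity of

  .15 * .5 = .075

i.e., the optimizer assumes that 7.5% of the table's rows satisfy the filter. (As noted below, the selectivity of predicates AND'd together is the product of their individual selectivities.)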

If statistics are not available, the LucidDB optimizer will try to make the most of the information that it does have. One statistic that is always available is the row count for LucidDB tables. LucidDB can also infer the distinct row count for columns that correspond to either primary keys or unique constraints on non-null columns. As a result, provided the default selectivities described above are somewhat reasonable, LucidDB can come up with fairly reasonable estimates for the result size of star joins. This is because it assumes that the table containing the unique join columns is the dimension table, and therefore, the estimated row count for the join is the number of rows in the other join input, i.e., the fact table, after it has been filtered. The fact table is filtered by whatever filters are applied on the table itself, as well as any semijoins. The selectivity of the semijoin filters is the same as the selectivity of the table-level filters applied on the dimension tables. For example, if a dimension table is filtered down to 1/3 of its rows, then the selectivity of a semijoin between a fact table and that dimension table is also 1/3. Hence, the accuracy of the row count estimates in this case depends on the accuracy of the various table-level filter selectivities.

If column cardinalities are unavailable and the keys involved in the join are non-unique, then the estimated row count for a join with two equality join predicates is:

  (number of rows in the left join input) * (number of rows in the right join input) *
      .15 * .15

.15 is multiplied twice because .15 is the default selectivity for an equality predicate, and the selectivity of two predicates AND'd together is the product of their individual selectivities.
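For example, with made-up row counts of 1,000,000 for the left join input and 100,000 for the right:

  1,000,000 * 100,000 * .15 * .15 = 2,250,000,000 estimated rows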

If this join is the input into another join, then the estimate above is used to compute the row count returned from the subsequent join, as these functions recursively call one another on a relational expression's inputs. Therefore, any inaccuracies in the estimates are exacerbated as they're used in the computations for these subsequent joins. In other words, if you're executing a join between a few tables, inaccurate statistics may result in a non-optimal join ordering, but usually not one that results in poor performance. However, as the number of joins increases, the inaccuracies compound, increasing the likelihood of the optimizer making a decision that results in a poorly performing query.

Had statistics been available, the row count returned from that same join would have been the following instead:

  (number of rows in the left join input) * (number of rows in the right join input) /
     max(number of distinct values in the left input's join keys,
         number of distinct values in the right input's join keys)

The number of distinct values in a pair of keys is the product of the cardinalities of the individual keys. Note that this formula works out to be the same as the estimate that's described above for a star join, in the case where statistics are unavailable.
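Continuing the made-up example from above, if the right input's join keys are unique (100,000 distinct values) and the left input's join keys have 50,000 distinct values, the estimate becomes:

  1,000,000 * 100,000 / max(50,000, 100,000) = 1,000,000 estimated rows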

As you've hopefully concluded by now, the quality of the query plans chosen by the optimizer is dependent on the availability of stats. When stats aren't available, it should not come as a surprise if your query performs poorly.

Poor query plans can even result in queries that cannot execute to completion. A hash join operates by building a hash table on its right input (see LucidDbJoinImplementation#Partitoning). Unavailable or incorrect stats can cause the optimizer to choose, as the right input to a hash join, an input that contains a large number of duplicate key values, which can potentially lead to a hash recursion error.

The following table summarizes the scenarios discussed in this section.

Impact of Not Having Stats

  • Accurate stats, any type of query -- Join result sizes are estimated as (number of rows in the left join input) * (number of rows in the right join input) / max(number of distinct values in the left input's join keys, number of distinct values in the right input's join keys). Resulting plan: optimal.
  • No stats, star join -- Estimated as the number of rows in the fact table after applying table-level and semijoin filters; the fact table is assumed to be the table without the unique keys. Resulting plan: good, provided the default table selectivities are reasonably close to the actual selectivities.
  • No stats, small number of joins, some of which are not star joins -- For the non-star joins: (number of rows in the left join input) * (number of rows in the right join input) * (default selectivity of the join predicates). Resulting plan: possibly non-optimal, with some risk of poor performance.
  • No stats, large number of joins, some of which are not star joins -- Same formula as above. Resulting plan: likely non-optimal, with a high risk of poor performance and possibly even hash recursion errors.


One other key point is that it's not only important to have stats, but also that they be relatively up-to-date. Out-of-date stats can potentially be worse than no stats at all. Consider the worst-case scenario: suppose your stats indicate that your join columns contain only a single distinct value. The computed size of the join is then the cross product of the two join input sizes. However, suppose the join key in one of the two join inputs is actually unique. In that case, the actual size of the join result is at most the size of the other join input. Had statistics been unavailable, the estimated size of the join would still have been wrong, but its computed value would have been smaller than the estimate produced from the out-of-date stats.
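To put made-up numbers on this worst case: with the same 1,000,000-row and 100,000-row inputs as above, out-of-date stats claiming a single distinct join key value yield an estimate of

  1,000,000 * 100,000 / max(1, 1) = 100,000,000,000 rows

even though the actual result, with the right input's key truly unique, is at most 1,000,000 rows. With no stats at all, the default .15 selectivity yields 1,000,000 * 100,000 * .15 = 15,000,000,000 rows -- still far off, but smaller than the estimate produced from the out-of-date stats.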

Outer Joins

Outer joins impose certain restrictions in the overall join ordering of a query. This section describes those restrictions.

Null Generating Inputs

The first restriction relates to how the tables referenced within a null-generating input are optimized. The rule is that the tables within a null-generating input must be joined with one another before they can be joined with other tables in the query. Otherwise, the query will return incorrect results. Therefore, in the queries below, A and B, as well as C and D, must be joined with one another before they can be joined with any of the other tables in the query.

  select * from
     X left outer join (select * from A, B where A.a = B.b) Y
     on X.x = Y.y;
  select * from
     (select * from A, B where A.a = B.b) X right outer join Y
     on X.x = Y.y;
  select * from
     (select * from A, B where A.a = B.b) X full outer join
     (select * from C, D where C.c = D.d) Y
     on X.a = Y.d and X.b = Y.c;

In other words, for the last query, the join ordering on the left is valid, but the one on the right is invalid.

       Join                   Join
      /    \                 /    \
    Join  Join             Join  Join
    /  \  /  \             /  \  /  \
    B  A  D  C             A  D  B  C
      Valid                 Invalid

Outer Join Dependencies

The second restriction relates to outer join dependencies in the case of LEFT and RIGHT outer joins. For those joins, the null-generating input of the join cannot be joined with other tables if it hasn't yet been joined with the tables that are part of its outer join. Again, this is required to ensure correct query results.

For example, if you have a query like the following:

  select * from
     (select * from A right outer join B on A.a = B.b) X,
     (select * from C left outer join D on C.c = D.d) Y
     where X.a = Y.c;

then the join ordering on the left is valid, but the one on the right isn't, since A is dependent on B, and D is dependent on C. In the plan on the right, A is joined with C before it's joined with B.

        Left Join               Left Join
         /     \                 /     \
       Join     D           Right Join  D
      /    \                  /    \
Right Join  C               Join    B
   /   \                   /    \
  A     B                 A      C
        Valid                 Invalid

Swapping Inputs

It may be advantageous for a null-generating input in a LEFT OUTER JOIN to be the left input in the join, e.g., if it's the larger of the two join inputs. If so, then that join needs to be converted to a RIGHT OUTER JOIN. The reverse holds true for a RIGHT OUTER JOIN.

For example, the following two joins are equivalent:

      Left Outer Join           Right Outer Join
         /      \                  /      \
        A        B                B        A

Redundant Join Removal

Joins are sometimes redundant and therefore can be removed from the query plan. This section describes the three instances where LucidDB currently removes redundant joins.

Redundant Outer Joins

Outer joins can be removed from a query plan if the following conditions are true:

  • The outer join is a LEFT or RIGHT outer join.
  • The columns from the null-generating input in the outer join are only referenced in the outer join condition.
  • The join condition is an equi-join where each filter is of the form <left input column> = <right input column>.
  • The join keys corresponding to the null-generating input are unique.

Since the join keys for the null-generating input are unique and the outer join condition is an equi-join between simple column references, each row from the non-null-generating input appears exactly once in the join result. And since the columns from the null-generating input are only used in the join condition, there's no need to execute the outer join at all. Hence, the query plan reduces to the non-null-generating input.

It's unlikely that anyone would ever explicitly write a query like this. The more likely scenario where this occurs is when you have a view definition that contains an outer join between two tables. If your select from the view doesn't need to project any of the view columns that map to columns from the null-generating join input, and the conditions specified above are true, then the outer join can be avoided.

For example, if you have the following view definition, and column t2a in T2 is unique

  create view V as
     select T1.t1b, T2.t2b from
        T1 left outer join T2
        on T1.t1a = T2.t2a;

then the following select results in a simple row scan on T1.

  select t1b from v;

Joins That Become Redundant Because of Semijoins

Normally, after applying a semijoin on a fact table, LucidDB will then join the result of that fact table with the dimension table that was used to filter the fact table. However, there are cases where that join can be avoided. The following conditions must be true for that to occur:

  • The join keys from the dimension table must be unique.
  • The only columns referenced from the dimension table are its join keys.

Since the semijoin has already filtered from the fact table the rows that join with the dimension table, the dimension join keys are unique, and semijoins are only used for INNER equi-joins, the only other reason to join back with the dimension table would be to project additional columns from the dimension table in the join result. If those columns are the dimension join keys, then because of the INNER equi-join, those columns can be derived from the corresponding join keys from the fact table.

In other words, the following query:

  SELECT factTab.colA, dimTab.colB FROM
     factTab, dimTab
     WHERE factTab.colB = dimTab.colB;

is equivalent to:

  SELECT factTab.colA, factTab.colB FROM
     factTab, dimTab
     WHERE factTab.colB = dimTab.colB;

In the case where a join can be removed because of a semijoin, that also reduces the number of join orderings the optimizer needs to consider. For example, if the join between dimTab and factTab is redundant, then the optimizer doesn't need to consider any join orderings where dimTab precedes factTab. Once the optimizer comes across the join ordering where dimTab is joined with factTab, it can immediately remove dimTab from the ordering.

Self-Joins

If a query contains a join of a table with itself, i.e., a self-join, the self-join can be removed if the following conditions are met:

  • The self-join is an inner join.
  • The input into the self-join is a simple table, i.e., it cannot be the result of a full outer join or a group by.
  • The join keys on both sides of the self-join are the same columns, and they make up a unique key.

In removing the self-join, LucidDB will combine any filtering and projections that are applied on the inputs into the self-join into a single row scan that's applied on the common table.

Although a user may not explicitly write a query containing a self-join, they may issue a join between a table and a view that references that table. This could occur as a result of an update-only MERGE statement, which implicitly executes an inner join between the target and source of the MERGE. An example of such a MERGE statement is shown below.

CREATE TABLE target(ta INT PRIMARY KEY, tb INT, tc INT);
CREATE TABLE dim(da INT PRIMARY KEY, db INT);
CREATE VIEW source(sta, stb, stc, sda, sdb) AS
   SELECT *  FROM target, dim WHERE target.tb = dim.da;

MERGE INTO target
   USING (SELECT * FROM source WHERE sta < 10) AS source
   ON target.ta = source.sta
   WHEN MATCHED THEN UPDATE SET tc = source.sdb;

After unfolding the source view, the original implicit join that's executed as part of this MERGE statement is:

SELECT t1.ta, t1.tc, t2.*, dim.* FROM target t1, target t2, dim
   WHERE t2.tb = dim.da AND t1.ta = t2.ta AND t2.ta < 10;

After removing the self-join, the query becomes:

SELECT * FROM target, dim WHERE target.tb = dim.da AND target.ta < 10;

Limitations

The following are limitations in LucidDB optimization due to unimplemented functionality.

Statistics For UDXs and Foreign Tables

Earlier it was noted that table row counts are always available. This only applies to LucidDB column-store tables. In general, LucidDB does not have access to statistics for foreign (aka external) tables. However, it does estimate row counts for the tables that result from UDX calls, as well as for flat file foreign tables. Flat file foreign table row counts are estimated by sampling some number of rows from the flat file to compute the width of a typical row of data, and then dividing that width into the number of bytes in the flat file. The accuracy of this computation can be influenced by tweaking the NUM_ROWS_SCAN parameter when defining your flat file foreign server. See LucidDbFlatFileWrapper#Flat_File_Foreign_Server for details.

Because of these limited statistics, non-optimal join orderings can result when foreign tables are referenced in complex join queries. If row counts are unavailable, LucidDB assumes that a foreign table is small. Therefore, in an equi-join with a larger table, the foreign table becomes the build side of a hash join. If the foreign table actually returns a large number of rows with a large number of duplicates, this can lead to partitioning during the hash join, even though the other table being joined fits in memory. That, by itself, may not result in a significant slowdown because it's just a single join. But if there are additional joins in the query, the optimizer will underestimate the size of the join containing the foreign table, and will likely assign the result of that join as the build side in the next hash join in the query plan. If this continues, the resulting query plan is a right-deep join plan like the one below, which can result in poor query performance due to excessive hash partitioning. Even worse, if any of the intermediate hash join results contains data skew, i.e., a large number of duplicates in the join keys for subsequent joins, then the hash join will abort because it cannot partition the data.

      Hash Join
       /     \
      T1   Hash Join
            /     \
           T2   Hash Join
                 /     \
                T3     ...
                         \
                     Hash Join
                      /     \
                     Tn  Foreign Table
                    

The workaround is to materialize the foreign table into a temporary table and then to reference the temporary table in the complex join query. This workaround is also necessary if your query plan results in a cartesian product join involving a foreign table. See Diagnosing Slow Queries Using Explain Plan#Special Inputs for more details.
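A minimal sketch of this workaround is shown below; the foreign server, schema, table, and column names are hypothetical, and the column list would need to match your foreign table:

  -- Local staging table mirroring the foreign table's columns (hypothetical definitions).
  CREATE TABLE staging_orders(order_id INT, customer_id INT, amount DOUBLE);
  -- Materialize the foreign table locally.
  INSERT INTO staging_orders SELECT * FROM ff_server.bcp.orders;
  -- Gather statistics so the optimizer has accurate row counts and cardinalities.
  ANALYZE TABLE staging_orders COMPUTE STATISTICS FOR ALL COLUMNS;
  -- Then reference staging_orders in place of the foreign table in the complex join query.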

The principles just described also apply to UDX calls. The exception is that the row count for a UDX is estimated to be the sum of the row counts of its inputs. So, you may be able to avoid materializing the UDX call result for simple joins. However, because column cardinality and data distributions are not available, that may not be the case for more complex joins.

Pull Up of Aggregates

LucidDB currently does not pull up aggregates when an aggregate is referenced in a sub-select that appears in the FROM clause of a query. Pulling up the aggregate would give the optimizer a wider range of join ordering choices. Because the aggregate isn't pulled up, the optimizer must determine the join ordering for the tables referenced within the sub-select before it can join that sub-select with other tables in the query.

For example, if your query is as follows:

  select * from
     X,
     (select Y.y1, Z.z1, sum(Y.y2), min(Z.z2) from Y, Z group by Y.y1, Z.z1)
     where X.x1 = Y.y1 and X.x2 = Z.z1;

then Y and Z must be joined together (a cartesian join, since there is no join condition between them) before that result can be joined with X.

Converting Outer Joins To Inner Joins

In some cases it is possible to convert outer joins into inner joins. For example, if the result of the outer join is filtered by IS NOT NULL predicates on all of the columns referenced from the null-generating table in the outer join, then the outer join reduces to an inner join, since the rows containing nulls are going to be filtered out. In fact, the predicate can be any predicate that filters out NULL values, e.g., an equality join predicate between the null-generating column and a column from some other table. This, however, assumes that expressions are not applied in the projection of the outer join query, converting the null value to a non-null value.
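For example (our illustration, using hypothetical tables), the outer join in the following query could in principle be reduced to an inner join, because the WHERE predicate rejects every row in which B's columns are null-extended:

  SELECT *
     FROM A LEFT OUTER JOIN B ON A.a = B.b
     WHERE B.c IS NOT NULL;

Because LucidDB does not currently perform this conversion, the outer join is executed as written and the filter is applied afterwards.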

As with redundant outer joins, you probably wouldn't ever explicitly write a query like this. This likely would only occur as a result of unfolding views or converting subqueries into equivalent outer joins.

Removing Redundant Joins When Primary/Foreign Key Relationships Exist

Similar to what was described earlier in the case of outer joins and semijoins, it is possible to remove an inner join to a table with a primary key if the join is between that table and a table that it has a primary-foreign key relationship with. (Note that additional conditions not described here must also be true.) However, since LucidDB does not currently support declarative foreign key constraints, this optimization is not possible.
