RelationalExpressionMetadata

From LucidDB Wiki

Update 29-Mar-2006: The proposal on this page has been implemented (eigenchange 6044) as follows:

  • new package org.eigenbase.rel.metadata containing facade, metadata provider chaining, reflective metadata infrastructure, metadata caching, and default metadata providers
  • unit test org.eigenbase.test.RelMetadataTest
  • Farrago plugin mechanisms for namespaces and session personalities so that they can supply custom metadata providers during query processing
  • Mock plugin net.sf.farrago.test.FarragoTestPersonalityFactory, which can be used to test the personality plugin mechanism
  • Changes to unitsql/ddl/explain.sql to alter session personality using FarragoTestPersonalityFactory and verify that costing results change via EXPLAIN PLAN INCLUDING ALL ATTRIBUTES


During query processing, the following precedence (from highest to lowest) is used when searching the provider chain for metadata queries:

  1. planner
  2. session personality plugin
  3. SQL/MED namespace plugins (arbitrary order)
  4. defaults
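The precedence search can be sketched as a simple chain of providers consulted in order. The class and method names below are illustrative stand-ins for the actual eigenbase interfaces, not the implementation itself:

```java
import java.util.List;

// Illustrative sketch (not the actual eigenbase classes): providers are
// consulted in precedence order, and the first non-null answer wins.
interface MetadataProvider {
    Object getRelMetadata(Object rel, String metadataQueryName, Object[] args);
}

class ChainedMetadataProvider implements MetadataProvider {
    private final List<MetadataProvider> chain; // highest precedence first

    ChainedMetadataProvider(List<MetadataProvider> chain) {
        this.chain = chain;
    }

    public Object getRelMetadata(Object rel, String metadataQueryName, Object[] args) {
        for (MetadataProvider provider : chain) {
            Object result = provider.getRelMetadata(rel, metadataQueryName, args);
            if (result != null) {
                // an earlier provider (e.g. the planner) shadows later ones
                return result;
            }
        }
        return null; // no provider in the chain could answer
    }
}
```

Under this scheme the planner would sit at the head of the chain, followed by the session personality, the namespace plugins, and the defaults.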


I left out the concept of "merging" metadata because it wasn't obvious where or how to incorporate it; if we need it later, we can revisit the design. Also, the actual planner interaction sequencing described at the end came out slightly differently.


Prior to this change, the RelNode interface exposed only a very limited amount of metadata and cost information for use by the optimizer in making decisions about "black-box" expressions. The rest of this page describes the state of the system as it was before the enhancement.



Existing Metadata

The following methods return information which (in theory) should always be the same for a given database state regardless of the physical implementation of the relational expression. In practice, these may vary due to unknowns (e.g. no statistics available on a table).

  • getRows: an estimate of the number of rows produced by the expression
  • getRowType: the type descriptor for the result (there are never any unknowns here, so the planner implementation can assert that this is preserved by all transformations)
  • isDistinct: whether the expression truly produces a relation instead of a multiset (there should be no unknowns here, but we don't currently assert)

New Logical Metadata

The additions in this section are logical properties, meaning (again in theory) that it should be possible to infer them unambiguously from the underlying schema alone, similar to getRowType. In practice, however, information may be lost (e.g. when a materialized view is recognized, or when rel authors are lazy), so we may not want to be as strict about consistency as we are for getRowType. Methods are specified below as if they will be added to the RelNode interface, but later on in this page we consider an alternative which leaves the RelNode interface untouched.

Column Origins

Besides being useful to the optimizer for looking up catalog information such as index availability, column origin information is useful to consumers who need to trace column-level dataflow through a query. For example, Farrago's JDBC driver can use it to implement the ResultSetMetaData.getTableName(int column) method, which is currently unsupported. (But it might be better to let the validator derive this information.)

/**
 * For the given output column of this expression, determines all columns of underlying tables which contribute to
 * result values. An output column may have more than one origin due to expressions such as UnionRel and
 * ProjectRel. The optimizer may use this information for catalog access (e.g. index availability).
 *
 * @param iOutputColumn 0-based ordinal for output column of interest
 *
 * @return set of origin columns, or null if this information cannot be determined
 * (whereas empty set indicates definitely no origin columns at all)
 */
public Set<RelColumnOrigin> getColumnOrigins(int iOutputColumn);

 
class RelColumnOrigin
{
    /**
     * Underlying table.
     */
    RelOptTable originTable;
 
    /**
     * 0-based offset of origin column in originTable. Whether this ordinal is flattened or unflattened depends on whether
     * UDT flattening has already been performed on the relational expression.
     */
    int iOriginColumn;
 
    /**
     * false if value taken directly from table; true otherwise.
     */
    boolean isDerived;
}


For example, consider the following plan explanation:

'ProjectRel(UNAME=[UPPER($1)], DEPTNO=[$0]])'
' TableAccessRel(table=[[LOCALDB, SALES, DEPTS]])'


Calling getColumnOrigins(0) on the top-level ProjectRel would return a single RelColumnOrigin with originTable=DEPTS, iOriginColumn=1, and isDerived=true (since UNAME is computed from NAME). Calling getColumnOrigins(1) would return originTable=DEPTS, iOriginColumn=0, and isDerived=false (since DEPTNO is passed through directly).
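To illustrate how a consumer such as a JDBC driver might use this metadata, here is a hedged sketch. RelColumnOrigin is simplified to carry a String table name in place of RelOptTable, and ColumnOriginUtil.getTableName is a hypothetical helper, not part of the proposal:

```java
import java.util.Set;

// Simplified stand-in for the RelColumnOrigin class above, with a String
// table name in place of RelOptTable.
class RelColumnOrigin {
    final String originTable;
    final int iOriginColumn;
    final boolean isDerived;

    RelColumnOrigin(String originTable, int iOriginColumn, boolean isDerived) {
        this.originTable = originTable;
        this.iOriginColumn = iOriginColumn;
        this.isDerived = isDerived;
    }
}

class ColumnOriginUtil {
    // Sketch of ResultSetMetaData.getTableName support: return the unique
    // underlying table name, or "" when the column has no single origin
    // (JDBC specifies "" when the name is not applicable).
    static String getTableName(Set<RelColumnOrigin> origins) {
        if (origins == null || origins.size() != 1) {
            return ""; // unknown, no origin, or multiple origins (e.g. UNION)
        }
        return origins.iterator().next().originTable;
    }
}
```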

Unique Keys

/**
 * Determines the set of unique minimal keys for this expression. A key is represented as a BitSet, where
 * each bit position represents a 0-based output column ordinal. (Note that RelNode.isDistinct should return true
 * if and only if at least one key is known.)
 *
 * @return set of keys, or null if this information cannot be determined
 * (whereas empty set indicates definitely no keys at all)
 */
public Set<BitSet> getUniqueKeys();


Example:

'ProjectRel(NAME=[$0], DEPTNO=[$1])'
' AggregateRel(groupCount=[1], agg#0=[MAX(1)])'
'  ProjectRel($f0=[$1], $f1=[$0])'
'   TableAccessRel(table=[[LOCALDB, SALES, DEPTS]])'


In theory, calling getUniqueKeys on the AggregateRel would return two BitSets. One would have bit 0 set, and the other would have bit 1 set, because NAME (as a group key) is unique by itself after aggregation, and so is MAX(DEPTNO) since DEPTNO is the primary key of the DEPTS table. In practice, probably only the BitSet for NAME would be returned. (Note that SUM(DEPTNO) is not unique, even though it is syntactically similar to MAX(DEPTNO)).
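A caller can test whether an arbitrary column set is unique by checking whether it covers at least one known key. Here is a minimal sketch using only java.util types; UniqueKeyUtil is a hypothetical helper, not part of the proposal:

```java
import java.util.BitSet;
import java.util.Set;

class UniqueKeyUtil {
    // Returns true if the given column set is provably unique, i.e. it
    // contains at least one of the known minimal keys as a subset.
    static boolean areColumnsUnique(Set<BitSet> keys, BitSet columns) {
        if (keys == null) {
            return false; // unknown: be conservative
        }
        for (BitSet key : keys) {
            BitSet uncovered = (BitSet) key.clone();
            uncovered.andNot(columns); // bits of key not covered by columns
            if (uncovered.isEmpty()) {
                return true; // key is a subset of columns, so columns are unique
            }
        }
        return false;
    }
}
```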

Functional Dependencies

Eventually we will want functional-dependency metadata as well: it would make Chris Date happy, keep views updatable, and let the optimizer drop unnecessary joins. SQL:2003 spells out the rules for some of these in Part 2, Section 4.18, which also treats unique keys as one aspect of functional dependency.


New Costing Metadata

Unlike logical metadata, costing information depends on the full current database state, not just the schema, so it is in the same category as the existing RelNode.getRows() method. Different optimizers may have different ideas of which statistics are important and how they should be computed, so it is desirable for this part of the interface to be extensible.


Here are some of the costing queries under consideration. They are taken from the Broadbase optimizer being used as a basis for LucidDB; they are defined as generic, although not all optimizers would have a use for them.


  • getStatistics() returns RelStatSource: Characterizes the distribution of values produced by this relational expression; see TableStatistics for more information.
  • getSelectivity(RexNode predicate) returns Double: Estimates the percentage of the output rows which satisfy the given predicate. Returns null to indicate that no reliable estimate can be produced; the same is true for the other Double-returning methods here.
  • getDistinctRowCount(BitSet groupKey, RexNode predicate) returns Double: Estimates the number of rows which would be produced by a GROUP BY on the set of columns indicated by groupKey, where the input to the GROUP BY has been pre-filtered by predicate. This quantity (leaving out predicate) is often referred to as cardinality (as in gender being a "low-cardinality column").
  • getPopulationSize(BitSet groupKey) returns Double: Estimates the distinct row count in the original source for the given groupKey, ignoring any filtering being applied by the expression. Typically, "original source" means base table, but for derived columns, the estimate may come from a non-leaf rel such as a ProjectRel.
  • getPercentageOriginalRows() returns Double: Estimates the ratio of the number of rows this expression produces to the number of rows it would produce if all single-table filter conditions were removed. This definition needs to be tightened up; it is the most special-cased in Broadbase.
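Since every Double-returning query uses null to mean "no reliable estimate", callers must compose these results defensively. A minimal sketch; the helper names and the 0.25 default selectivity are illustrative assumptions, not part of the proposal:

```java
class CostingUtil {
    // Hypothetical fallback: when no reliable selectivity estimate is
    // available (null), substitute an arbitrary default guess.
    static double selectivityOrDefault(Double selectivity) {
        return (selectivity == null) ? 0.25 : selectivity;
    }

    // Estimated output rows of a filter = input rows * selectivity.
    static double estimateFilteredRows(double inputRows, Double selectivity) {
        return inputRows * selectivityOrDefault(selectivity);
    }
}
```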


Planner Issues

Adding extra metadata requires planner implementations to participate in metadata propagation:

  • The planner must decide what to do about inconsistency. For example, after firing a rule which transforms a query expression into access to an equivalent materialized view, the new expression may return a non-null value for getStatistics whereas the old expression returned null. A planner like Volcano which maintains the old and new expressions in an equivalence class must then decide how to combine the two results so that when the metadata is requested from the RelNode representing the equivalence class, a definite answer can be given. In this example, it's easy: having statistics is always better than having no statistics. But what if the expressions disagree about rowcount estimation? Do we need to introduce a notion of "confidence"? That's very complicated, so by default we'll omit it, but we'll leave the door open for optimizers that want to define such notions.
  • Even a planner which only maintains a single choice for a particular expression (e.g. FarragoHeuristicPlanner) still has to deal with similar issues. For example, perhaps the original estimate was better before a transformation rule was applied, or a physical expression cannot easily implement the interface even though the original logical expression could. Should the planner attempt to combine before and after results across transformations to preserve information (and possibly make life easier for the authors of physical expressions)? If so, it should probably use the same merge rules as Volcano.

Implementation Framework

This section proposes a common framework for adding all of the metadata above (both logical and cost-related) in an extensible fashion. As a bonus, the framework makes it easy for planner implementations to cache metadata results as a means of speeding up optimization. At the framework's heart is a new very-much-weakly-typed interface:

package org.eigenbase.rel.metadata;
 
/**
 * RelMetadataProvider defines an interface for obtaining and combining metadata
 * about relational expressions.
 */
public interface RelMetadataProvider
{
    /**
     * Retrieves metadata about a relational expression.
     *
     * @param rel relational expression of interest
     *
     * @param metadataQueryName name of metadata query to invoke
     *
     * @param args arguments to metadata query (expected number and type depend on query name;
     * must have well-defined hashCode/equals for use by caching)
     *
     * @return metadata result (actual type depends on query name)
     */
    public Object getRelMetadata(RelNode rel, String metadataQueryName, Object [] args);
 
    /**
     * Combines two results from the same metadata query on different expressions.
     *
     * @param metadataQueryName name of query which produced md1 and md2
     *
     * @param md1 metadata result obtained via getRelMetadata
     *
     * @param md2 another metadata result compatible with md1
     *
     * @return result of combining md1 with md2
     */
    public Object mergeRelMetadata(String metadataQueryName, Object md1, Object md2);
}


On top of this, we will layer strongly-typed facades for the standard metadata queries defined within Eigenbase. As an example, consider the getSelectivity method specified earlier. It takes a RexNode as input and returns a Double as output. We can add a corresponding strongly-typed method to a new utility class:

package org.eigenbase.rel.metadata;
 
public abstract class RelMetadataQuery
{
...
    public static Double getSelectivity(RelNode rel, RexNode predicate)
    {
        return (Double) rel.getCluster().getMetadataProvider().getRelMetadata(
            rel, "getSelectivity", new Object[] { predicate });
    }
}


Now, how does that getRelMetadata method get implemented? Readers familiar with Java reflection have probably already guessed from the definition of RelMetadataProvider's methods:

package org.eigenbase.rel.metadata;
public class ReflectiveRelMetadataProvider implements RelMetadataProvider
{
    protected Object result;
 
    public Object getRelMetadata(RelNode rel, String metadataQueryName, Object [] args)
    {
        // reflection magic not shown here:  something like ReflectUtil.invokeVisitor,
        // but need to deal with argument class mappings and return values;
        // anyway, dispatch to a strongly typed method in subclass below
    }
}
 
package my.custom.plugin;
 
public class SomeCustomRelMetadataProvider extends ReflectiveRelMetadataProvider
{
    public Double getSelectivity(JoinRel rel, RexNode predicate)
    {
        // custom formula for join selectivity
    }
 
    public Double getSelectivity(AggregateRel rel, RexNode predicate)
    {
        // custom formula for aggregate selectivity
    }
 
    public Double getPopulationSize(JoinRel rel, BitSet groupKey)
    {
        // custom formula for join population size
    }
 

    public Double getPopulationSize(AggregateRel rel, BitSet groupKey)
    {
        // custom formula for aggregate population size
    }
 
    ... lots of other queries and rel types ...
}
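The "reflection magic not shown" above might look roughly like the following sketch, which scans the provider's public methods for one whose name matches the query and whose first parameter accepts the rel's class. ReflectiveDispatch and DemoProvider are hypothetical illustrations; a real implementation would also need ReflectUtil-style caching of resolved methods and hierarchy-aware argument matching:

```java
import java.lang.reflect.Method;

class ReflectiveDispatch {
    // Find and invoke a method named queryName on the provider whose first
    // parameter accepts the given rel; return null when no method matches
    // (so the caller can fall through to the planner).
    static Object dispatch(
        Object provider, Object rel, String queryName, Object[] args)
    {
        int nArgs = (args == null) ? 0 : args.length;
        for (Method method : provider.getClass().getMethods()) {
            if (!method.getName().equals(queryName)) {
                continue;
            }
            Class<?>[] params = method.getParameterTypes();
            if (params.length != 1 + nArgs || !params[0].isInstance(rel)) {
                continue;
            }
            Object[] actualArgs = new Object[params.length];
            actualArgs[0] = rel;
            for (int i = 0; i < nArgs; ++i) {
                actualArgs[i + 1] = args[i];
            }
            try {
                return method.invoke(provider, actualArgs);
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        return null; // no matching method
    }
}

// Toy provider used to demonstrate the dispatch; a String stands in for a
// real RelNode subclass.
class DemoProvider {
    public Double getSelectivity(String rel, Object predicate) {
        return 0.5;
    }
}
```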


Finally, interaction with the planner implementation works as follows:

  • RelOptPlanner will be changed to extend RelMetadataProvider.
  • When a reflective provider fails to find a matching method, it will call to the planner instead. This handles the case where a planner provides its own internal rel wrappers (e.g. VolcanoPlanner uses RelSubset to represent an equivalence class).
  • If the planner doesn't recognize the rel type as its own, it will return null.
  • Otherwise, the planner will forward the call on to rels underlying the wrapper rel. Before doing so, it may check its metadata cache and return immediately if it finds a match. If not, it may cache the result returned by the underlying rels (using mergeRelMetadata to handle the case of multiple underlying rels).
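The caching step in the last bullet can be sketched as a wrapper keyed on (rel, query, args), which is exactly why the interface requires the args to have well-defined hashCode/equals. The class names below are illustrative, not the actual implementation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CachingMetadataProvider {
    interface Provider {
        Object getRelMetadata(Object rel, String metadataQueryName, Object[] args);
    }

    private final Provider underlying;
    private final Map<List<Object>, Object> cache = new HashMap<>();

    CachingMetadataProvider(Provider underlying) {
        this.underlying = underlying;
    }

    Object getRelMetadata(Object rel, String metadataQueryName, Object[] args) {
        // Key on (rel, query, args); a null result ("unknown") is
        // deliberately not cached by computeIfAbsent.
        List<Object> argList =
            (args == null) ? Arrays.asList() : Arrays.asList(args);
        List<Object> key = Arrays.asList(rel, metadataQueryName, argList);
        return cache.computeIfAbsent(
            key, k -> underlying.getRelMetadata(rel, metadataQueryName, args));
    }
}
```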