FarragoUDAFDesign

From LucidDB Wiki

This page is an ongoing reference for the design, implementation, and usage of user-defined aggregate functions (UDAFs).

Note: Nothing here is currently solidified, including the structure of the page.

Farrago User Defined Aggregate Functions

In progress...

What is an aggregate function?

An aggregate function (hereafter referred to as an Aggregator) is not so much a single function as a holder for one or more ordinary functions, or a Java object that implements those functions. During a SELECT query, the values of the specified columns are passed to the Aggregator row by row, and the Aggregator usually accumulates them into a single final value.

Parts of an Aggregator

  • Unique identifier based on NAME and INPUT TYPES
  • Internal State of type T
  • Transition Function, aka Receiver - Acts on given rows of data
  • (Optional) Final Function, aka Finalizer - Takes the final state and returns a value of any type; when present, its return type is the return type of the Aggregator (otherwise the return type is T)
  • (Optional) Initial Value - Defaults to null
  • (Optional, Java only; in SQL the Receiver plays the equivalent role) Merging Function
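
As a rough illustration of how the parts above fit together, here is a minimal Java sketch. Everything in it is hypothetical: the class names AggregatorSketch and AggregatorSketchDemo, the aggregate name my_avg, and the restriction to a single input column are assumptions made for the example, not part of any existing Farrago API. It models the state-passing style, where the Receiver takes the current state plus one value and returns a new state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical holder for the parts of an Aggregator: S = state type, I = input type, R = result type. */
final class AggregatorSketch<S, I, R> {
    final String name;                   // together with inputTypes, forms the unique identifier
    final List<Class<?>> inputTypes;
    final S initialValue;                // may be null, which is the stated default
    final BiFunction<S, I, S> receiver;  // transition function: (state, value) -> new state
    final Function<S, R> finalizer;      // maps the final state to the result

    AggregatorSketch(String name, List<Class<?>> inputTypes, S initialValue,
                     BiFunction<S, I, S> receiver, Function<S, R> finalizer) {
        this.name = name;
        this.inputTypes = inputTypes;
        this.initialValue = initialValue;
        this.receiver = receiver;
        this.finalizer = finalizer;
    }

    /** Builds an identifier from NAME and INPUT TYPES, e.g. "my_avg(Integer)". */
    String signature() {
        StringBuilder sb = new StringBuilder(name).append('(');
        for (int i = 0; i < inputTypes.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(inputTypes.get(i).getSimpleName());
        }
        return sb.append(')').toString();
    }

    /** Feeds the values to the receiver one at a time, then applies the finalizer. */
    R aggregate(Iterable<I> values) {
        S state = initialValue;
        for (I value : values) {
            state = receiver.apply(state, value);
        }
        return finalizer.apply(state);
    }
}

final class AggregatorSketchDemo {
    /** An average whose state is {sum, count}; the finalizer changes the return type to Double. */
    static AggregatorSketch<long[], Integer, Double> average() {
        return new AggregatorSketch<long[], Integer, Double>(
            "my_avg",
            Arrays.<Class<?>>asList(Integer.class),
            new long[] {0L, 0L},
            (state, value) -> new long[] {state[0] + value, state[1] + 1},
            state -> state[1] == 0 ? null : (double) state[0] / state[1]);
    }

    public static void main(String[] args) {
        System.out.println(average().signature());
        System.out.println(average().aggregate(Arrays.asList(1, 2, 3, 6)));
    }
}
```

Here the state type is long[] while the result type is Double, which shows the Finalizer determining the return type when it differs from T.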

Implementation Requirements

  • Code can be simplified by bucketing aggregators into a hash map whose keys are the values of the GROUP BY arguments. (This may cost some memory efficiency; consider it optional.)
  • SQL-based aggregate functions can respect the call-on-null, is-deterministic, etc. options of the underlying SQL functions, while a Java-based aggregate should let these be set at registration. (Though is-deterministic could perhaps be determined through reflection and detection of final types?)
  • Merging Functions (Java side)
  • When using Java Aggregators, registration should allow naming either an Aggregator class or a static method that returns an Aggregator object.
  • Windowed aggregation
  • Client-side SQL registrations for CREATE, DROP, ALTER.
  • GROUP BY behavior
  • HAVING behavior (if the standard WHERE is incapable)
  • Should be able to bail out early (possibly by throwing an exception that is caught, which stops the aggregation and then calls the final method).
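
The hash map bucketing mentioned in the first requirement could look roughly like the sketch below. It is only an illustration under stated assumptions: rows are modeled as Object[] arrays, the aggregate is a hard-coded SUM, and the names GroupBySketch, SumState, and sumByGroup are invented for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupBySketch {
    /** Hypothetical per-group state for a SUM aggregate, updated in place. */
    static final class SumState {
        long sum;
        void receive(long value) { sum += value; }
        long result() { return sum; }
    }

    /**
     * Groups rows by the columns at keyCols and sums the column at valueCol.
     * Each distinct combination of GROUP BY values gets its own SumState bucket.
     */
    static Map<List<Object>, Long> sumByGroup(List<Object[]> rows, int[] keyCols, int valueCol) {
        Map<List<Object>, SumState> buckets = new LinkedHashMap<>();
        for (Object[] row : rows) {
            Object[] key = new Object[keyCols.length];
            for (int i = 0; i < keyCols.length; i++) {
                key[i] = row[keyCols[i]];
            }
            // A List has value-based equals/hashCode, so equal GROUP BY values share a bucket.
            buckets.computeIfAbsent(Arrays.asList(key), k -> new SumState())
                   .receive(((Number) row[valueCol]).longValue());
        }
        // Finalize every bucket once all rows have been received.
        Map<List<Object>, Long> results = new LinkedHashMap<>();
        for (Map.Entry<List<Object>, SumState> entry : buckets.entrySet()) {
            results.put(entry.getKey(), entry.getValue().result());
        }
        return results;
    }
}
```

The memory concern is visible here: every group's state stays resident until the input is exhausted, whereas a sort-based approach would only need one live state at a time.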

Up-in-the-air considerations

Should we enforce consistent behavior between SQL-based aggregates and Java-based aggregates? For example, in SQL, following PostgreSQL's example, the receiver's first argument should always be a reference to the internal state, with the following arguments matching those defined for the aggregate function, and it should always return a new state. In addition, the Finalizer is optional in SQL-Land, since we can always return the state itself when no Finalizer is specified.

On the Java side, however, we can gain considerable performance by making destructive updates to class-wide variables. Hive's implementation uses a boolean receive function (called iterate) that always returns true and is passed only the row information; I think we should simply make it void if we do this. It could also be said that a Merge function is the true Receiver, since it follows the SQL version more closely. Finally, the Finalizer in Java-Land is really just the getResult() function we will need in order to read the internal state, so it should be required.
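
To make the Java-side option concrete, here is a minimal sketch of a destructive-update aggregator with a void iterate, a merge, and a required getResult(). The interface name MutableAggregator and the class AverageAggregator are hypothetical, and skipping null inputs is an assumption of the example rather than a decided behavior.

```java
/** Hypothetical Java-side contract: the state lives in fields and is mutated in place. */
interface MutableAggregator<I, R> {
    void iterate(I value);                      // receiver; void instead of a boolean that is always true
    void merge(MutableAggregator<I, R> other);  // folds another partial aggregate into this one
    R getResult();                              // required: the only way to read the internal state
}

final class AverageAggregator implements MutableAggregator<Integer, Double> {
    private long sum;
    private long count;

    @Override
    public void iterate(Integer value) {
        if (value == null) {
            return; // assumption: null inputs are ignored
        }
        sum += value;
        count++;
    }

    @Override
    public void merge(MutableAggregator<Integer, Double> other) {
        // Assumes partial aggregates are only merged with instances of the same class.
        AverageAggregator o = (AverageAggregator) other;
        sum += o.sum;
        count += o.count;
    }

    @Override
    public Double getResult() {
        return count == 0 ? null : (double) sum / count;
    }
}
```

Note that merge takes a whole state and combines it with the current one, which is why it resembles the SQL-style receiver more closely than iterate does.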

Alternatively, should we even make the effort to support both SQL-based and Java-based Aggregators?

Java Interfaces

Should there be any for the client side? Having flags such as "supportsFinalizer" or "supportsMerging" seems unnecessary, given that we can use reflection on classes to figure that out.
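
As a sketch of the reflection idea, the check below looks for a public instance method with a given name and parameter count, using only the standard java.lang.reflect API. The helper CapabilityProbe and the two user classes are hypothetical, and the method names merge and getResult are assumed conventions, not a settled contract.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

final class CapabilityProbe {
    /** Returns true if cls has a public, non-static method with this name and parameter count. */
    static boolean hasPublicMethod(Class<?> cls, String name, int paramCount) {
        // getMethods() includes public methods inherited from superclasses.
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(name)
                    && m.getParameterCount() == paramCount
                    && !Modifier.isStatic(m.getModifiers())) {
                return true;
            }
        }
        return false;
    }
}

/** Hypothetical user aggregator without merging support. */
class SumOnly {
    protected long sum;
    public void iterate(Integer value) { sum += value; }
    public Long getResult() { return sum; }
}

/** Hypothetical user aggregator that adds merging support. */
class SumWithMerge extends SumOnly {
    public void merge(SumWithMerge other) { sum += other.sum; }
}
```

With this, registration could call CapabilityProbe.hasPublicMethod(SumWithMerge.class, "merge", 1) instead of asking the class to declare a "supportsMerging" flag.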


UDAF Registration Syntax

SQL:

CREATE AGGREGATE FUNCTION qualified-function-name ( [ function-param-def, ... ] )
[ LANGUAGE SQL ]
[ FINALIZER qualified-function-name ]
[ INITVAL initial_value ]
RECEIVER qualified-function-name
STYPE data-type

function-param-def ::= param-name data-type
initial_value ::= character-value-expression

The return type is STYPE if the FINALIZER is not specified; otherwise it is the return type of the FINALIZER. INITVAL defaults to null.

Java:


UDAF Creation Examples

SQL:


Java:


UDAF Execution Examples

External Resources

Product Documentation