Gsoc2009IdeasList
This page lists ideas for students applying to work on Eigenbase for the Google Summer of Code 2009. If you are a student interested in taking on one of these or coming up with a new one, please contact John Sichi (the Eigenbase GSoC administrator) for help with starting a discussion on the right mailing list.
--Jvs 17:19, 14 April 2009 (EDT): Our 2009 application was rejected, but the ideas on this page are still good starting points for anyone who wants to hack on Eigenbase, or for a future GSoC.
Freebase MQL Plugin Enhancement
Problem: FarragoMedMqlPlugin demonstrates how Eigenbase can be used to layer SQL support onto the Freebase web service, but it is currently just a proof of concept and not suitable for real use cases.
Objectives:
- Map relational operations such as joins and aggregations into the corresponding MQL structures, then develop optimizer pushdown rules for realizing these mappings.
- Rewrite the plugin to be industrial-strength, e.g. using a real JSON library instead of crufty/brittle string parsing.
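As a rough illustration of both objectives, the sketch below maps a simple projection with equality filters onto an MQL read query and parses the response with a real JSON library instead of string manipulation. It is written in Python for brevity (the plugin itself is Java) and is not the plugin's actual code. The function names are hypothetical, and the response envelope with a top-level "result" field is an assumption about the mqlread service that should be checked against the Freebase documentation.

```python
import json


def build_mql_query(freebase_type, columns, filters=None, limit=None):
    """Build an MQL read query for: SELECT columns FROM type WHERE col = value.

    Hypothetical helper for illustration only.
    """
    clause = {"type": freebase_type}
    for col in columns:
        # A null value asks the service to fill in the property.
        clause[col] = None
    for col, value in (filters or {}).items():
        # A concrete value acts as an equality constraint.
        clause[col] = value
    if limit is not None:
        clause["limit"] = limit
    # Wrapping the clause in a list asks for all matches, not just one.
    return [clause]


def parse_mql_result(response_text, columns):
    """Turn a JSON response into a list of row tuples.

    Assumes the response is an object whose "result" field holds a list
    of matches; properties missing from a match come back as None.
    """
    envelope = json.loads(response_text)
    return [tuple(item.get(col) for col in columns)
            for item in envelope["result"]]


if __name__ == "__main__":
    query = build_mql_query("/music/album", ["name", "id"],
                            filters={"artist": "The Police"}, limit=5)
    print(json.dumps(query))
```

A pushdown rule for a filter would then amount to moving a predicate from the relational plan into the `filters` argument. Joins and aggregations would need nested clauses and are not covered by this sketch.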
Requirements:
- Familiarity with Java
- Good grasp of relational algebra
- Familiarity with ontologies is a plus
Related Ideas:
- Develop similar plugins for other web-service-based query languages, such as those for Yahoo (YQL) and Facebook (FQL).
Non-Java User Defined Routines
Problem: Currently, SQL-invocable external routines can only be written in Java. This includes user-defined procedures, functions, and table transformations; see LucidDbUdxJavaHowto for an example of how these routines are currently developed in Java.
Objective: Make it possible to use other popular languages such as Python to easily develop and deploy these routines into any Eigenbase-derived server such as LucidDB.
Implementation Notes: Some of the pointers in LucidDbNonJavaClients may be helpful as a starting point.
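To make the objective concrete, here is a sketch of what a routine author might want to be able to write in Python. No such integration exists today, so everything here is hypothetical: the function name, the representation of rows as dictionaries, and the idea that the server would pass an iterator of input rows and consume the yielded output rows.

```python
def split_words(input_rows, column):
    """Hypothetical table transformation: emit one output row per
    whitespace-separated word found in the given column.

    input_rows is assumed to be an iterable of dict-like rows, standing
    in for whatever cursor abstraction a real integration would provide.
    """
    for row in input_rows:
        for word in str(row[column]).split():
            # Copy the input row, replacing the column with a single word.
            out = dict(row)
            out[column] = word
            yield out


if __name__ == "__main__":
    sample = [{"id": 1, "txt": "hello world"}, {"id": 2, "txt": "eigenbase"}]
    for out_row in split_words(sample, "txt"):
        print(out_row)
```

A generator fits the streaming nature of a table transformation, because rows can be produced as they are consumed rather than materialized up front. How such a function would be registered and invoked from SQL is exactly the open design question for this project.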
Requirements:
- Familiarity with Java
- Familiarity with other languages to be integrated
LucidDB Optimizer/Executor Reality Check
Problem: Currently, the LucidDB optimizer is open loop; it bases a lot of decisions on estimates for selectivity, row counts, and distinct value counts, but these may be far off from reality in some cases. The only time cross-checking with reality happens is when someone is debugging the optimizer!
Objective: Develop a framework for correlating optimizer and executor stats. At a minimum, this could be exposed as a set of utilities for use by anyone developing or debugging the optimizer. More advanced would be an automatic historical feedback mechanism.
Plan sketch:
- Instrument the optimizer to produce row count estimates (this already exists, but needs some refinement to get it out in a usable form which can be correlated with the executor's view of the world)
- Instrument the executor to produce actual number of rows processed per execution stream, corresponding to optimizer plan nodes (see also FennelExecStreamGraphProfiling)
- Apply this to TPC-H queries, and find the topmost differences (errors in estimation)
- For the largest discrepancies found in the previous step, trace back through the stats and cost functions available to the optimizer, and use that to drive optimizer improvements (via better stats and/or cost functions and/or feedback into the optimizer from historical run information using the instrumentation)
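The correlation step in the plan above could be sketched as follows, assuming the instrumentation can export per-node estimated and actual row counts keyed by a shared plan node identifier. That shared key, the names below, and the choice of a symmetric ratio as the error measure are all assumptions made for illustration, not part of the existing optimizer or executor.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    node_id: str
    estimated_rows: float
    actual_rows: float


def ratio_error(estimated, actual):
    """Symmetric ratio between estimate and actual (1.0 means exact).

    Both values are clamped to at least one row so that empty results
    do not cause a division by zero.
    """
    est = max(float(estimated), 1.0)
    act = max(float(actual), 1.0)
    return max(est / act, act / est)


def worst_estimates(estimates, actuals, top_n=5):
    """Join optimizer estimates with executor counts by node id and
    return the top_n nodes with the largest ratio error."""
    joined = [NodeStats(node_id, estimates[node_id], actuals[node_id])
              for node_id in estimates if node_id in actuals]
    joined.sort(key=lambda s: ratio_error(s.estimated_rows, s.actual_rows),
                reverse=True)
    return joined[:top_n]
```

A ratio treats overestimates and underestimates alike and is independent of table size, which makes nodes from different TPC-H queries comparable. An absolute difference would instead favor the largest scans.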
Requirements:
- Familiarity with C++ and Java
- Good grasp of relational algebra
- Familiarity with DBMS implementation concepts
Packaged Builds for Specific Distros
Problem: Currently, Eigenbase and LucidDB binary releases are just tarballs; as a result, it's a matter of chance whether the binary releases work on a particular version of a given Linux distribution due to dependencies on C++ and Java runtime libraries.
Objective: Develop the process for producing .debs and .rpms (plus possibly others) for binary releases, then automate it so that release managers can execute it with minimal fuss. This would also involve studying the relevant distro guidelines for where to install scripts, libraries, man pages, etc.; optionally, it would also involve developing init.d daemon scripts for installing services (something which is currently missing entirely).
Implementation notes: We are currently in the process of moving the C++ build from autotools to CMake. CMake has a companion utility called CPack, which is designed to handle many of the distro-specific packaging details. This work would also serve as a basis for future ports to platforms such as Mac OS X and Solaris.
Requirements:
- Shell scripting skills
- Familiarity with Linux distributions