FennelResourceAllocation

From LucidDB Wiki
Jump to: navigation, search

This page describes some proposals for extending the Fennel resource allocation infrastructure. TBD: a design doc for what's there already; for now, the closest thing is the "Advanced Resource Allocation" section in ExecStreamHowTo.


Contents

Scratch Page Sub-allocation

Motivation

There may be lots of smallish buffers allocated by various ExecStreams within a plan. By itself, no one of these buffers is anywhere near big enough to make it worth dedicating an entire cache page to it, but aggregated over the whole graph they add up to a number of pages. Examples are tuple buffers, bitsets, lookup tables, etc. In a system with a large number of concurrent queries, getting this accounting right could be critical to avoid accidentally using virtual memory paging to manage the unaccounted overhead. (That's what would happen currently as a result of using the C++ heap.) Note that I'm not proposing arbitrary malloc/free heap allocation here; the ExecStream would allocate the allotted amount once after open, hold on to it for the duration of execution, and then stop referencing it at or before close (at which point it would be implicitly deallocated). Non-array data-structures such as maps, trees, TupleAccessors, etc. would continue to use normal dynamic memory, as would arrays which need to be dynamically resized over the course of execution. (The boost pool library could be used if we wanted something more dynamic, but unfortunately STL containers don't play well with non-static custom allocators.)

Proposed Changes

  1. Add a new data member nBytesOverhead to struct ExecStreamResourceQuantity. This would represent the extra amount of byte-level memory required by a particular consumer. From the point of view of a consumer, this would be accounted for separately from the existing nCachePages field. So, for example, a hypothetical FtpExecStream might need a full cache page for buffering bulk reads from a socket and a smaller 256-byte buffer for constructing commands to be written to the socket. It would set nCachePages to 1 and nBytesOverhead to 256 for both minQuantity and optQuantity.
  2. Define a new template class ExecStreamScopedArray<T>. This would work something like boost::scoped_array<T>, except after construction there would be a setMemoryProvider call to associate it with an ExecStreamGraph, and instead of a reset(T *) method, it would expose an allocate(n) method which would allocate n contiguous instances of T. The actual byte allocation would be delegated to the graph, which would suballocate the memory from cache pages supplied by its underlying ScratchSegment. In addition, the graph would keep a backpointer to the scoped array, and when the graph was closed, it would automatically reset the scoped array to NULL (this prevents accidental stray writes by converting them into segfaults). FtpExecStream would have one instance of ExecStreamScopedArray<char> as a data member, e.g. cmdBuffer. After allocation, it would call cmdBuffer.get() to access the array.
  3. Enhance ExecStreamGraph to implement the suballocation, keep track of the consumers, and deliver notifications on close. This should be straightforward.
  4. Eventually: use all of this in real scheduler resource allocation. Current resource allocation logic is minimal: streams always get the minimum page quantity they asked for, and these page quotas are enforced per stream, but no one checks the current global cache state or sets a quota in the scratch segment at the graph level. Suppose instead we wanted a scheduler which still stingily provides only the minimum, but prevents graph execution if even that amount is unavailable in the cache. As a contrived example, consider a big ExecStreamGraph involving 20 instances of FtpExecStream and nothing else. Assume a 4K page size. The scheduler would sum 20 cache pages from nCachePages and 20*256=5120 nBytesOverhead total. So the total number of cache pages required would be 22 (rounding up the overhead bytes to two pages for 8192 bytes). If only 21 cache pages were free, the scheduler would prevent the query from executing (either failing immediately or waiting until enough resources became available).

Issues

  • As defined above, the allocation interfaces are only usable from the exec layer of Fennel and other components which depend on exec. This might make it difficult to write low-level components which could be reused for purposes other than stream implementations. For example, btree does not depend on exec, and has to do some memory allocation for tuple buffers. To address this, we could define abstractions at the common layer instead, e.g. PoolMemoryProvider and PoolScopedArray. ExecStreamGraph would implement the PoolMemoryProvider interface, but other implementations would be possible.
  • 1156188121: Zelaine pointed out that the scheme above can't really work unless we also do the fine-grained budgeting described below. The reason is fragmentation: unless we know the exact breakdown of individual array allocations, we can't reliably compute the fragmentation overhead. But with that information, we can do an optimal fit to minimize fragmentation.

Fine-grained Budgeting

Motivation

Our current resource allocation model is analogous to a company in which each ExecStream is a department, and the CFO (ExecStreamScheduler) only sees budgets at the department level; nothing below. It's up to each stream to aggregate the requirements of any of its component objects and report the total up to the scheduler. Then, when it receives its actual budget back, it has to figure out how to parcel that out to its components. This is fine for simple streams, but painful for complex ones built on a large number of sub-components. Broadbase supported a more sophisticated approach (designed by Julian, natch) in which there was an IResourceConsumer interface which the equivalent of streams and their sub-components all implemented. The equivalent of the scheduler would query the graph for a full tree of consumers, and then do fine-grained allocation, so when execution started each component already knew exactly what had been allocated to it. There was also support for a notion of subsets of a graph being "exclusive", meaning they would never be executed simultaneously, so (assuming eager release of resources), the scheduler could take the max of their sizes rather than the sum. With the recent introduction of BarrierExecStream, we now have a similar notion (and other potential ones such as MergeExecStream running in a non-parallel scheduler).

Proposed Changes

TBD.

Issues

  • Similar to the one for sub-allocation: at which layer should all of this be defined?
Product Documentation