BeanCounting17Feb2006

From LucidDB Wiki
Jump to: navigation, search

Contents

Attendees

jhyde, jpham, jvs, mberkowitz, qtran, rchen, zfong

Summary

In response to some mailing list traffic, attendees discussed two related topics:

  • Job control at the system level (we avoid the term "global scheduling" since that has another meaning in distributed systems)
  • Fine-grained resource allocation at the ExecStreamGraph level (including some history on how the equivalent system worked in Broadbase)

The main purpose of the meeting was to familiarize everyone with the major challenges and to propose approaches so that we can start thinking about designing common interfaces at the Eigenbase level. Job control is forward-looking for everyone involved, so there was no specific action item on that topic other than to keep ruminating, but at the ExecStream-level, JVS will propose some interface enhancements in the near-term.

Job Control

This is the part of the system that is supposed to take an incoming workload of operations to be executed, possibly together with quality-of-service (QoS) constraints and affinity information, and make decisions such as:

  • Whether to allow execution at all. Some job controllers may reject operations based on resource limitations. (Question: is it reasonable to equate one operation with one SQL statement? Or is finer granularity such as individual cursor fetches required?).
  • When to begin execution. Job controllers may queue operations.
  • How much to allocate in terms of coarse-grained resources. Examples include cache memory and parallelism (degree in an SMP environment, plus node ID's in a distributed system). Distributed systems may account for network bandwidth.
  • Whether and when to kill jobs. Controllers may choose to do this for good reason, e.g. governance (the system detects that a query has gone wild), administration (a user requests that an operation be killed), QoS (a higher-priority operation has been requested and resources in use by a low-priority operation must be stolen to satisfy a service-level agreement), or replanning (some part of the system has realized that a better plan exists for a long-running operation).
  • Whether and when to attempt to dynamically adjust resource allocations for existing jobs.

No such critter exists yet in Eigenbase (statements are always executed immediately with minimum resources regardless of what else is already executing). And it's unlikely that there will be a one-size-fits-all implementation to satisfy everyone. So we will take our usual plugin approach:

  • Define the JobController interface in such a way that it is possible for multiple implementations to be written without higher-level components knowing which one is in use. Here "higher-level" includes important infrastructure such as tracing and workload history capture.
  • Provide a (possibly simplistic) reference implementation as a green-zone component.
  • Refine lower-level interfaces such as ExecStreamScheduler so that they can interact with a JobController, without either component being required to know anything about the other's implementation details. (Of course, covariant implementations which know about each other are allowed too.)

Reasoning About Resources and Workloads over Time

Julian drew a representation of resource allocation over a slice of time for a given workload; for some reason it came out looking like artwork by Piet Mondrian: mond.jpg One dimension (say width) represents time, and the other some resource quantity (e.g. memory) of which there is a fixed global supply (height). Individual rectangles represent allotments of resources to specific operations. In the image above, the system is running at full utilization (one can always dream).

Depending on the operation, it may be possible to resize the rectangle by adjusting the resource allocation. For example, if a sorter gets enough memory to fit the entire input in memory, it can run very quickly; if it gets only a tiny amount of memory, it can still finish the job, but it will take a long time. There are typically limits along either dimension (giving the sorter more memory than the input size doesn't provide any additional benefit, and below a certain minimum number of buffers, it can't run at all). Julian pointed out that the area of a rectangle always has a lower bound analogous to Planck's constant (physical processes can take take place over a long time with very little energy expenditure, or very quickly with a lot of energy, but the product of time and energy has a lower bound).

In reality, the job controller has to face a bit of adversity in its quest to achieve high utilization while meeting QoS constraints:

  • Ignorance of the future. The job controller has to base its decisions on current queue state and workload/utilization history (if available).
  • Imperfect knowledge. In a distributed utility-computing system, the node running the controller may receive only coarse state updates from worker nodes, and even that comes with lag. Even in a single-node system, measures such as completion progress of a statement's execution are approximate at best.
  • Algorithmic complexity. Even imagining a system with perfect lookahead and knowledge of global state, there are lots of known NP-hard problems in this area.

We discussed some approaches for dealing with these:

  • Add infrastructure for capturing, analyzing, and simulating histories as a means of predicting the future.
  • For jobs without low-latency constraints, always use queueing to give the controller a better chance of finding an optimal allocation.
  • Add interfaces for querying time estimates for a given job: answers might come from the optimizer, the executable plan, historical pattern-matching (e.g. the same ETL job runs every night), or external systems (e.g. a workflow engine). Actual progress updates from the executing statement are important too.
  • Leverage fault-tolerance. If the system already has to deal with node failures gracefully, then the controller can be somewhat reckless. For example, it might attempt to run a statement even when it can't guarantee its declared minimum of resources (it's easier to ask forgiveness than permission). And as mentioned earlier, the ability to explicitly kill a job (which is equivalent to a fault from the point of view of the job itself) gives the job controller the option to take action if it starts to "regret" its existing allocations.
  • Dynamic renegotiation. For some ExecStreams, it is possible to adjust their resource consumption during execution (either immediately on request, or after some delay). This allows for trapezoids and other interesting shapes instead of just rectangles in the diagram.

ExecStreamGraph

The embodiment of a portion of an executable job on a particular node is an ExecStreamGraph, and much of the information needed by the job controller has to come from this component:

  • Allocation requirements
  • Time estimates
  • Progress estimates

Limitations of the current interface are discussed in the second half of FennelResourceAllocation. In addition to the enhancements mentioned there and above, the following proposals were raised at the meeting:

  • Instead of characterizing the allocation-to-efficiency curve via just two breakpoints ("minimal" and "optimal"), come up with something having more expressive power (e.g. for a sorter, there is a discontinuity at the transition between in-memory sort and external sort, and between different depths of external sort).
  • Allow for expression of overlapping resource curves across mutiple ExecStreams (e.g. a hash-join releasing its memory as a consuming sorter starts increasing in tandem).
  • Turn overdraft into a recoverable exception (replacing the current assertion-failure behavior). Recoverable here means the consumer would call its scheduler to ask what to do (e.g. suspend, retry, fail), and in the failure case, would throw a well-defined exception so that the "reckless" job controller would know it was worth retrying (as opposed to a user error).
  • Infer resource consumption optimizations from specific graph topologies and vertices (e.g. resources consumed upstream of a BarrierExecStream acting as a bottleneck are reusable downstream of the barrier once it has returned EOS).

It is worth noting that resource allocation among the vertices of an ExecStreamGraph is similar to the job controller's problem of allocation across jobs, except with the additional constraint of dataflow dependencies.

Product Documentation