61.3. Genetic Query Optimization (GEQO) in PostgreSQL

   61.3.1. Generating Possible Plans with GEQO
   61.3.2. Future Implementation Tasks for PostgreSQL GEQO

   The GEQO module approaches the query optimization problem as though it
   were the well-known traveling salesman problem (TSP). Possible query
   plans are encoded as integer strings. Each string represents the join
   order from one relation of the query to the next. For example, the join
   tree
   /\
  /\ 2
 /\ 3
4  1

   is encoded by the integer string '4-1-3-2', which means: first join
   relations '4' and '1', then '3', and then '2', where 1, 2, 3, 4 are
   relation IDs within the PostgreSQL optimizer.
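   The encoding can be illustrated with a minimal Python sketch, assuming
   a strictly left-deep tree as in the example above. The function name
   decode_join_order and the nested-tuple representation are assumptions
   made for illustration; they are not part of PostgreSQL.

```python
def decode_join_order(encoded):
    """Turn a string such as '4-1-3-2' into a nested left-deep join tree."""
    relation_ids = [int(part) for part in encoded.split("-")]
    if len(relation_ids) != len(set(relation_ids)):
        raise ValueError("each relation ID may appear only once")
    tree = relation_ids[0]
    for rel in relation_ids[1:]:
        # Join the result built so far (left input) with the next relation.
        tree = (tree, rel)
    return tree


print(decode_join_order("4-1-3-2"))  # (((4, 1), 3), 2)
```

   The innermost tuple is the first join performed, which mirrors the
   bottom of the join tree drawn above.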

   Specific characteristics of the GEQO implementation in PostgreSQL are:
     * Use of a steady-state genetic algorithm (GA), which replaces the
       least fit individuals in a population rather than a whole
       generation at once, allows fast convergence towards improved query
       plans. This is essential for handling queries in reasonable time;
     * Use of edge recombination crossover, which is especially suited to
       keeping edge losses low when solving the TSP by means of a GA;
     * Mutation is not used as a genetic operator, so no repair
       mechanisms are needed to generate legal TSP tours.

   Parts of the GEQO module are adapted from D. Whitley's Genitor
   algorithm.

   The GEQO module allows the PostgreSQL query optimizer to support large
   join queries effectively through non-exhaustive search.
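   The edge recombination crossover named in the list above can be
   sketched roughly as follows. This is a Python illustration of the
   general technique, not the PostgreSQL C code. The function name
   edge_recombination, the treatment of each sequence as a cyclic tour,
   and the tie-breaking rule (prefer the neighbour with the fewest
   remaining edges) are assumptions made for the example.

```python
import random


def edge_recombination(parent_a, parent_b, rng):
    """Build a child tour that reuses adjacencies (edges) from the parents."""
    # Edge table: for each relation, its neighbours in either parent tour.
    neighbours = {gene: set() for gene in parent_a}
    for parent in (parent_a, parent_b):
        n = len(parent)
        for i, gene in enumerate(parent):
            neighbours[gene].add(parent[(i - 1) % n])
            neighbours[gene].add(parent[(i + 1) % n])

    current = rng.choice([parent_a[0], parent_b[0]])
    child = [current]
    while len(child) < len(parent_a):
        # The chosen relation may not be visited again.
        for edges in neighbours.values():
            edges.discard(current)
        candidates = neighbours.pop(current)
        if candidates:
            # Prefer the neighbour that has the fewest edges left.
            fewest = min(len(neighbours[c]) for c in candidates)
            current = rng.choice(
                sorted(c for c in candidates if len(neighbours[c]) == fewest)
            )
        else:
            # Dead end: fall back to any relation not yet used.
            current = rng.choice(sorted(neighbours))
        child.append(current)
    return child


rng = random.Random(0)
print(edge_recombination([4, 1, 3, 2], [1, 2, 3, 4], rng))
```

   Because each step picks only from relations not yet used, the child is
   always a legal permutation, which is why no repair step is needed.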

61.3.1. Generating Possible Plans with GEQO

   The GEQO planning process uses the standard planner code to generate
   plans for scans of individual relations. Then join plans are developed
   using the genetic approach. As shown above, each candidate join plan is
   represented by a sequence in which to join the base relations. In the
   initial stage, the GEQO code simply generates some possible join
   sequences at random. For each join sequence considered, the standard
   planner code is invoked to estimate the cost of performing the query
   using that join sequence. For each step of the join sequence, all
   three possible join strategies (nested loop, merge join, and hash
   join) are considered, and all the initially-determined relation scan
   plans are available. The estimated cost is the cheapest of these
   possibilities. Join sequences with lower estimated cost are considered
   “more fit” than those with higher cost. The genetic algorithm discards
   the least fit candidates. Then new candidates are generated by
   combining genes of more-fit candidates, that is, by using randomly
   chosen portions of known low-cost join sequences to create new
   sequences for consideration. This process is repeated until a preset
   number of join sequences have been considered; then the best one found
   at any time during the search is used to generate the finished plan.
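   The loop described above can be sketched in Python as follows. This
   is a simplified illustration, not the PostgreSQL implementation: the
   names steady_state_search and toy_cost, the row counts, and the fixed
   selectivity are all hypothetical, and a simple order-preserving
   crossover stands in for edge recombination to keep the example short.

```python
import random

ROWS = {1: 1000, 2: 10, 3: 500, 4: 50}  # hypothetical row counts


def toy_cost(sequence):
    """Made-up cost: sum of intermediate result sizes, selectivity 0.01."""
    size = ROWS[sequence[0]]
    total = 0.0
    for rel in sequence[1:]:
        size = size * ROWS[rel] * 0.01
        total += size
    return total


def steady_state_search(relations, cost_of, pool_size, tries, seed):
    """Return the cheapest join sequence found by a small steady-state GA."""
    rng = random.Random(seed)
    # Initial population: random join sequences, sorted by estimated cost.
    pool = [rng.sample(relations, len(relations)) for _ in range(pool_size)]
    pool.sort(key=cost_of)
    for _ in range(tries):
        # Choose two distinct parents from the cheaper half of the pool.
        mom, dad = rng.sample(pool[: max(2, pool_size // 2)], 2)
        cut = rng.randrange(1, len(relations))
        head = mom[:cut]
        child = head + [rel for rel in dad if rel not in head]
        # Steady state: only the least fit candidate is ever replaced.
        if cost_of(child) < cost_of(pool[-1]):
            pool[-1] = child
            pool.sort(key=cost_of)
    return pool[0]


best = steady_state_search([1, 2, 3, 4], toy_cost, pool_size=8, tries=40, seed=0)
print(best, toy_cost(best))
```

   Since only the worst candidate is replaced, the best sequence seen so
   far always stays at the front of the pool and is what gets returned.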

   This process is inherently nondeterministic, because of the randomized
   choices made during both the initial population selection and the
   subsequent recombination (“mutation”) of the best candidates. To avoid
   surprising changes of the selected plan, each run of the GEQO
   algorithm restarts its random number generator with the current
   geqo_seed parameter setting. As long as geqo_seed and the other GEQO
   parameters are kept fixed, the same plan will be generated for a given
   query and given other planner inputs, such as statistics. To
   experiment with different search paths, try changing geqo_seed.
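   The effect of reseeding can be illustrated with a short Python sketch.
   The function random_join_sequences is hypothetical, and Python's
   random module merely stands in for the planner's random number
   generator; the seed argument plays the role of geqo_seed.

```python
import random


def random_join_sequences(relations, count, seed):
    """Generate `count` random join sequences from a freshly seeded generator."""
    rng = random.Random(seed)  # restarted for every planning run
    return [rng.sample(relations, len(relations)) for _ in range(count)]


first = random_join_sequences([1, 2, 3, 4], 3, seed=0.0)
second = random_join_sequences([1, 2, 3, 4], 3, seed=0.0)
other = random_join_sequences([1, 2, 3, 4], 3, seed=0.5)

print(first == second)  # True: same seed, same sequences
print(first, other)     # a different seed usually gives a different search path
```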

61.3.2. Future Implementation Tasks for PostgreSQL GEQO

   Work is still needed to improve the genetic algorithm parameter
   settings. In the routines gimme_pool_size and gimme_number_generations
   in src/backend/optimizer/geqo/geqo_main.c, the parameter settings must
   strike a compromise between two competing demands:
     * Optimality of the query plan
     * Computing time

   In the current implementation, the fitness of each candidate join
   sequence is estimated by running the standard planner's join selection
   and cost estimation code from scratch. To the extent that different
   candidates use similar sub-sequences of joins, a great deal of work
   will be repeated. This could be made significantly faster by retaining
   cost estimates for sub-joins. The problem is to avoid expending
   unreasonable amounts of memory on retaining that state.
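   One way to picture the idea of retaining sub-join estimates under a
   memory limit is a bounded cache keyed by join-sequence prefix. The
   Python sketch below is only an illustration of that idea, not a
   proposal for the C code; prefix_estimate, sequence_cost, the row
   counts, and the fixed selectivity are all hypothetical.

```python
from functools import lru_cache

ROWS = {1: 1000, 2: 10, 3: 500, 4: 50}  # hypothetical row counts
SELECTIVITY = 0.01                      # made-up join selectivity


@lru_cache(maxsize=1024)  # bounded, so retained state cannot grow without limit
def prefix_estimate(prefix):
    """Return (rows, cost) for joining the relations in `prefix` in order."""
    if len(prefix) == 1:
        return float(ROWS[prefix[0]]), 0.0
    rows, cost = prefix_estimate(prefix[:-1])
    rows = rows * ROWS[prefix[-1]] * SELECTIVITY
    return rows, cost + rows


def sequence_cost(sequence):
    return prefix_estimate(tuple(sequence))[1]


sequence_cost([4, 1, 3, 2])
sequence_cost([4, 1, 2, 3])  # reuses the cached estimate for the prefix (4, 1)
print(prefix_estimate.cache_info())
```

   The two sequences share the prefix 4-1, so the second call finds that
   estimate in the cache instead of recomputing it. The maxsize argument
   caps memory use at the price of evicting older entries.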

   At a more basic level, it is not clear that solving query optimization
   with a GA designed for TSP is appropriate. In the TSP case, the cost
   associated with any substring (partial tour) is independent of the
   rest of the tour, but this is certainly not true for query
   optimization. Thus it is questionable whether edge recombination
   crossover is the most effective recombination procedure.
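   The difference can be made concrete with a small Python sketch. All
   numbers and function names are hypothetical: a single made-up
   distance for the TSP side, and made-up row counts with a fixed
   selectivity of 1/100 for the join side.

```python
DISTANCE = {frozenset((3, 2)): 7}       # hypothetical distance between 3 and 2
ROWS = {1: 1000, 2: 10, 3: 500, 4: 50}  # hypothetical row counts


def tour_segment_cost(segment):
    """TSP: the cost of a partial tour depends only on its own edges."""
    return sum(DISTANCE[frozenset(pair)] for pair in zip(segment, segment[1:]))


def join_segment_cost(rows_so_far, segment):
    """Joins: the cost of each step depends on the size of the input so far."""
    total = 0
    for rel in segment:
        rows_so_far = rows_so_far * ROWS[rel] // 100  # made-up selectivity 1/100
        total += rows_so_far
    return total


print(tour_segment_cost((3, 2)))           # 7, whatever precedes it in the tour
print(join_segment_cost(ROWS[4], (3, 2)))  # 275 when the sequence starts with 4
print(join_segment_cost(ROWS[1], (3, 2)))  # 5500 when the sequence starts with 1
```

   The sub-sequence 3-2 has one fixed cost as part of a tour, but its
   cost as part of a join sequence changes with what was joined before.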