begriffs open source - ai-pg/blob - full-docs/txt/populate.txt

   1
   2 14.4. Populating a Database #
   3
   4    14.4.1. Disable Autocommit
   5    14.4.2. Use COPY
   6    14.4.3. Remove Indexes
   7    14.4.4. Remove Foreign Key Constraints
   8    14.4.5. Increase maintenance_work_mem
   9    14.4.6. Increase max_wal_size
  10    14.4.7. Disable WAL Archival and Streaming Replication
  11    14.4.8. Run ANALYZE Afterwards
  12    14.4.9. Some Notes about pg_dump
  13
  14    One might need to insert a large amount of data when first populating a
  15    database. This section contains some suggestions on how to make this
  16    process as efficient as possible.
  17
  18 14.4.1. Disable Autocommit #
  19
  20    When using multiple INSERTs, turn off autocommit and just do one commit
  21    at the end. (In plain SQL, this means issuing BEGIN at the start and
  22    COMMIT at the end. Some client libraries might do this behind your
  23    back, in which case you need to make sure the library does it when you
  24    want it done.) If you allow each insertion to be committed separately,
  25    PostgreSQL is doing a lot of work for each row that is added. An
  26    additional benefit of doing all insertions in one transaction is that
  27    if the insertion of one row were to fail then the insertion of all rows
  28    inserted up to that point would be rolled back, so you won't be stuck
  29    with partially loaded data.
  30
  31 14.4.2. Use COPY #
  32
  33    Use COPY to load all the rows in one command, instead of using a series
  34    of INSERT commands. The COPY command is optimized for loading large
  35    numbers of rows; it is less flexible than INSERT, but incurs
  36    significantly less overhead for large data loads. Since COPY is a
  37    single command, there is no need to disable autocommit if you use this
  38    method to populate a table.
  39
  40    If you cannot use COPY, it might help to use PREPARE to create a
  41    prepared INSERT statement, and then use EXECUTE as many times as
  42    required. This avoids some of the overhead of repeatedly parsing and
  43    planning INSERT. Different interfaces provide this facility in
  44    different ways; look for “prepared statements” in the interface
  45    documentation.
  46
  47    Note that loading a large number of rows using COPY is almost always
  48    faster than using INSERT, even if PREPARE is used and multiple
  49    insertions are batched into a single transaction.
  50
  51    COPY is fastest when used within the same transaction as an earlier
  52    CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be
  53    written, because in case of an error, the files containing the newly
  54    loaded data will be removed anyway. However, this consideration only
  55    applies when wal_level is minimal as all commands must write WAL
  56    otherwise.
  57
  58 14.4.3. Remove Indexes #
  59
  60    If you are loading a freshly created table, the fastest method is to
  61    create the table, bulk load the table's data using COPY, then create
  62    any indexes needed for the table. Creating an index on pre-existing
  63    data is quicker than updating it incrementally as each row is loaded.
  64
  65    If you are adding large amounts of data to an existing table, it might
  66    be a win to drop the indexes, load the table, and then recreate the
  67    indexes. Of course, the database performance for other users might
  68    suffer during the time the indexes are missing. One should also think
  69    twice before dropping a unique index, since the error checking afforded
  70    by the unique constraint will be lost while the index is missing.
  71
  72 14.4.4. Remove Foreign Key Constraints #
  73
  74    Just as with indexes, a foreign key constraint can be checked “in bulk”
  75    more efficiently than row-by-row. So it might be useful to drop foreign
  76    key constraints, load data, and re-create the constraints. Again, there
  77    is a trade-off between data load speed and loss of error checking while
  78    the constraint is missing.
  79
  80    What's more, when you load data into a table with existing foreign key
  81    constraints, each new row requires an entry in the server's list of
  82    pending trigger events (since it is the firing of a trigger that checks
  83    the row's foreign key constraint). Loading many millions of rows can
  84    cause the trigger event queue to overflow available memory, leading to
  85    intolerable swapping or even outright failure of the command. Therefore
  86    it may be necessary, not just desirable, to drop and re-apply foreign
  87    keys when loading large amounts of data. If temporarily removing the
  88    constraint isn't acceptable, the only other recourse may be to split up
  89    the load operation into smaller transactions.
  90
  91 14.4.5. Increase maintenance_work_mem #
  92
  93    Temporarily increasing the maintenance_work_mem configuration variable
  94    when loading large amounts of data can lead to improved performance.
  95    This will help to speed up CREATE INDEX commands and ALTER TABLE ADD
  96    FOREIGN KEY commands. It won't do much for COPY itself, so this advice
  97    is only useful when you are using one or both of the above techniques.
  98
  99 14.4.6. Increase max_wal_size #
 100
 101    Temporarily increasing the max_wal_size configuration variable can also
 102    make large data loads faster. This is because loading a large amount of
 103    data into PostgreSQL will cause checkpoints to occur more often than
 104    the normal checkpoint frequency (specified by the checkpoint_timeout
 105    configuration variable). Whenever a checkpoint occurs, all dirty pages
 106    must be flushed to disk. By increasing max_wal_size temporarily during
 107    bulk data loads, the number of checkpoints that are required can be
 108    reduced.
 109
 110 14.4.7. Disable WAL Archival and Streaming Replication #
 111
 112    When loading large amounts of data into an installation that uses WAL
 113    archiving or streaming replication, it might be faster to take a new
 114    base backup after the load has completed than to process a large amount
 115    of incremental WAL data. To prevent incremental WAL logging while
 116    loading, disable archiving and streaming replication, by setting
 117    wal_level to minimal, archive_mode to off, and max_wal_senders to zero.
 118    But note that changing these settings requires a server restart, and
 119    makes any base backups taken before unavailable for archive recovery
 120    and standby server, which may lead to data loss.
 121
 122    Aside from avoiding the time for the archiver or WAL sender to process
 123    the WAL data, doing this will actually make certain commands faster,
 124    because they do not to write WAL at all if wal_level is minimal and the
 125    current subtransaction (or top-level transaction) created or truncated
 126    the table or index they change. (They can guarantee crash safety more
 127    cheaply by doing an fsync at the end than by writing WAL.)
 128
 129 14.4.8. Run ANALYZE Afterwards #
 130
 131    Whenever you have significantly altered the distribution of data within
 132    a table, running ANALYZE is strongly recommended. This includes bulk
 133    loading large amounts of data into the table. Running ANALYZE (or
 134    VACUUM ANALYZE) ensures that the planner has up-to-date statistics
 135    about the table. With no statistics or obsolete statistics, the planner
 136    might make poor decisions during query planning, leading to poor
 137    performance on any tables with inaccurate or nonexistent statistics.
 138    Note that if the autovacuum daemon is enabled, it might run ANALYZE
 139    automatically; see Section 24.1.3 and Section 24.1.6 for more
 140    information.
 141
 142 14.4.9. Some Notes about pg_dump #
 143
 144    Dump scripts generated by pg_dump automatically apply several, but not
 145    all, of the above guidelines. To restore a pg_dump dump as quickly as
 146    possible, you need to do a few extra things manually. (Note that these
 147    points apply while restoring a dump, not while creating it. The same
 148    points apply whether loading a text dump with psql or using pg_restore
 149    to load from a pg_dump archive file.)
 150
 151    By default, pg_dump uses COPY, and when it is generating a complete
 152    schema-and-data dump, it is careful to load data before creating
 153    indexes and foreign keys. So in this case several guidelines are
 154    handled automatically. What is left for you to do is to:
 155      * Set appropriate (i.e., larger than normal) values for
 156        maintenance_work_mem and max_wal_size.
 157      * If using WAL archiving or streaming replication, consider disabling
 158        them during the restore. To do that, set archive_mode to off,
 159        wal_level to minimal, and max_wal_senders to zero before loading
 160        the dump. Afterwards, set them back to the right values and take a
 161        fresh base backup.
 162      * Experiment with the parallel dump and restore modes of both pg_dump
 163        and pg_restore and find the optimal number of concurrent jobs to
 164        use. Dumping and restoring in parallel by means of the -j option
 165        should give you a significantly higher performance over the serial
 166        mode.
 167      * Consider whether the whole dump should be restored as a single
 168        transaction. To do that, pass the -1 or --single-transaction
 169        command-line option to psql or pg_restore. When using this mode,
 170        even the smallest of errors will rollback the entire restore,
 171        possibly discarding many hours of processing. Depending on how
 172        interrelated the data is, that might seem preferable to manual
 173        cleanup, or not. COPY commands will run fastest if you use a single
 174        transaction and have WAL archiving turned off.
 175      * If multiple CPUs are available in the database server, consider
 176        using pg_restore's --jobs option. This allows concurrent data
 177        loading and index creation.
 178      * Run ANALYZE afterwards.
 179
 180    A data-only dump will still use COPY, but it does not drop or recreate
 181    indexes, and it does not normally touch foreign keys. ^[14] So when
 182    loading a data-only dump, it is up to you to drop and recreate indexes
 183    and foreign keys if you wish to use those techniques. It's still useful
 184    to increase max_wal_size while loading the data, but don't bother
 185    increasing maintenance_work_mem; rather, you'd do that while manually
 186    recreating indexes and foreign keys afterwards. And don't forget to
 187    ANALYZE when you're done; see Section 24.1.3 and Section 24.1.6 for
 188    more information.
 189
 190    ^[14] You can get the effect of disabling foreign keys by using the
 191    --disable-triggers option — but realize that that eliminates, rather
 192    than just postpones, foreign key validation, and so it is possible to
 193    insert bad data if you use it.