28.6. WAL Internals

   WAL is automatically enabled; no action is required from the
   administrator except ensuring that the disk-space requirements for the
   WAL files are met, and that any necessary tuning is done (see
   Section 28.5).

   WAL records are appended to the WAL files as each new record is
   written. The insert position is described by a Log Sequence Number
   (LSN) that is a byte offset into the WAL, increasing monotonically with
   each new record. LSN values are returned as the datatype pg_lsn. Values
   can be compared to calculate the volume of WAL data that separates
   them, so they are used to measure the progress of replication and
   recovery.
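
   As an illustration of comparing LSNs, the following Python sketch
   converts the textual pg_lsn form (two hexadecimal numbers separated by
   a slash, holding the high and low 32 bits of the byte position) into
   an integer and subtracts two positions. The function names and the
   sample LSN values are hypothetical and chosen only for this example;
   within SQL, subtracting two pg_lsn values or calling pg_wal_lsn_diff
   should yield the same byte count.

```python
def parse_lsn(text: str) -> int:
    """Convert the pg_lsn text form 'HIGH/LOW' (hex) to a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_bytes_between(later: str, earlier: str) -> int:
    """WAL bytes separating two LSNs; positive when `later` is ahead."""
    return parse_lsn(later) - parse_lsn(earlier)


# Hypothetical positions: a primary's current LSN and a standby's replay LSN.
primary_lsn = "16/B374D848"
standby_lsn = "16/B3740000"
lag = wal_bytes_between(primary_lsn, standby_lsn)
print(f"standby is {lag} bytes of WAL behind the primary")
```

   Because an LSN is a plain byte offset, the difference is directly a
   volume of WAL data, which is what makes it usable as a progress
   measure.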

   WAL files are stored in the directory pg_wal under the data directory,
   as a set of segment files, normally each 16 MB in size (but the size
   can be changed by altering the --wal-segsize initdb option). Each
   segment is divided into pages, normally 8 kB each (this size can be
   changed via the --with-wal-blocksize configure option). The WAL record
   headers are described in access/xlogrecord.h; the record content is
   dependent on the type of event that is being logged. Segment files are
   given ever-increasing numbers as names, starting at
   000000010000000000000001. The numbers do not wrap, but it will take a
   very, very long time to exhaust the available stock of numbers.
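
   The relationship between a byte position, a segment file name, and a
   page can be sketched in Python as follows. This is a simplified model
   under the default sizes quoted above, not the server's own code: the
   helper names are hypothetical, and the layout assumed for the 24-digit
   name (8 hex digits of timeline ID, then the segment number split into
   two 8-digit halves) follows the naming scheme as commonly described.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default; chosen with initdb --wal-segsize
WAL_PAGE_SIZE = 8 * 1024             # default; chosen with --with-wal-blocksize


def wal_file_name(timeline: int, lsn: int,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the segment file that contains byte position `lsn`."""
    segments_per_4gb = 0x1_0000_0000 // segment_size
    segment_no = lsn // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_no // segments_per_4gb,
        segment_no % segments_per_4gb,
    )


def position_in_segment(lsn: int,
                        segment_size: int = WAL_SEGMENT_SIZE,
                        page_size: int = WAL_PAGE_SIZE) -> tuple:
    """Return (page number within the segment, byte offset within that page)."""
    offset = lsn % segment_size
    return offset // page_size, offset % page_size


pages_per_segment = WAL_SEGMENT_SIZE // WAL_PAGE_SIZE
first_name = wal_file_name(1, 0x1000000)  # first byte of segment number 1
```

   With these defaults a segment holds 2048 pages, and byte position
   0x1000000 on timeline 1 maps to the first file name mentioned above.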

   It is advantageous if the WAL is located on a different disk from the
   main database files. This can be achieved by moving the pg_wal
   directory to another location (while the server is shut down, of
   course) and creating a symbolic link from the original location in the
   main data directory to the new location.
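
   The move-and-link procedure might be scripted along the following
   lines. This is a minimal sketch, assuming the server is already shut
   down (the code does not check this) and that the process has the
   necessary ownership and permissions on both locations. The function
   name relocate_wal_dir and all paths are hypothetical; the demonstration
   runs against a scratch directory, not a real data directory.

```python
import os
import shutil
import tempfile


def relocate_wal_dir(data_dir: str, new_wal_dir: str) -> None:
    """Move data_dir/pg_wal to new_wal_dir and leave a symlink behind."""
    old_wal_dir = os.path.join(data_dir, "pg_wal")
    if os.path.islink(old_wal_dir):
        raise RuntimeError("pg_wal is already a symbolic link")
    if os.path.exists(new_wal_dir):
        raise RuntimeError("target location already exists")
    # Renames on the same filesystem, copies and deletes across filesystems.
    shutil.move(old_wal_dir, new_wal_dir)
    # Link from the original location to the new one.
    os.symlink(os.path.abspath(new_wal_dir), old_wal_dir,
               target_is_directory=True)


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as scratch:
    data_dir = os.path.join(scratch, "data")
    os.makedirs(os.path.join(data_dir, "pg_wal"))
    segment = "000000010000000000000001"
    open(os.path.join(data_dir, "pg_wal", segment), "wb").close()

    relocate_wal_dir(data_dir, os.path.join(scratch, "wal_disk"))

    is_link = os.path.islink(os.path.join(data_dir, "pg_wal"))
    still_visible = os.path.exists(os.path.join(data_dir, "pg_wal", segment))
```

   After the call, the segment file is still reachable through the
   original pg_wal path, which is all the server needs.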

   The aim of WAL is to ensure that the log is written before database
   records are altered, but this can be subverted by disk drives that
   falsely report a successful write to the kernel, when in fact they have
   only cached the data and not yet stored it on the disk. A power failure
   in such a situation might lead to irrecoverable data corruption.
   Administrators should try to ensure that disks holding PostgreSQL's WAL
   files do not make such false reports. (See Section 28.1.)
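
   The ordering rule itself can be shown with a toy Python example: the
   log record is appended and forced to storage before the data file is
   touched. This is only an illustration of the write-ahead principle,
   not of how the server performs its I/O, and the file names and helper
   functions are hypothetical. Note that os.fsync can only ask the kernel
   to push the data to the device; if the drive acknowledges the write
   while the data is still in a volatile cache, nothing in code like this
   can detect it.

```python
import os
import tempfile


def durable_append(path: str, payload: bytes) -> None:
    """Append payload and ask the OS to force it to stable storage."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # request a flush from kernel cache to the device


def logged_update(log_path: str, data_path: str,
                  record: bytes, new_contents: bytes) -> None:
    durable_append(log_path, record)   # 1. the log record goes first
    with open(data_path, "wb") as f:   # 2. only then is the data altered
        f.write(new_contents)


# Demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    log_path = os.path.join(scratch, "toy.log")
    data_path = os.path.join(scratch, "toy.data")
    logged_update(log_path, data_path, b"set x=1\n", b"x=1\n")
    with open(log_path, "rb") as f:
        log_contents = f.read()
    with open(data_path, "rb") as f:
        data_contents = f.read()
```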

   After a checkpoint has been made and the WAL flushed, the checkpoint's
   position is saved in the file pg_control. Therefore, at the start of
   recovery, the server first reads pg_control and then the checkpoint
   record; then it performs the REDO operation by scanning forward from
   the WAL location indicated in the checkpoint record. Because the entire
   content of data pages is saved in the WAL on the first page
   modification after a checkpoint (assuming full_page_writes is not
   disabled), all pages changed since the checkpoint will be restored to a
   consistent state.
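
   The recovery sequence described above can be modeled with a small
   Python toy. It is a deliberately simplified sketch, not the server's
   record format or algorithm: pages are dictionaries, the control file
   and checkpoint records are plain lookups, and every name here
   (WalRecord, recover, the field names) is hypothetical. It assumes the
   WAL list is ordered by LSN and that a record carries either a full
   page image or a single change.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Page = Dict[str, int]


@dataclass
class WalRecord:
    lsn: int
    page_id: str
    full_image: Optional[Page] = None         # whole page, first touch after checkpoint
    change: Optional[Tuple[str, int]] = None  # (key, value) for an ordinary record


def recover(control: dict, checkpoints: Dict[int, int],
            wal: List[WalRecord], pages: Dict[str, Page]) -> Dict[str, Page]:
    checkpoint_lsn = control["checkpoint_lsn"]  # 1. read the control file
    redo_lsn = checkpoints[checkpoint_lsn]      # 2. read the checkpoint record
    for rec in wal:                             # 3. scan forward from the redo point
        if rec.lsn < redo_lsn:
            continue
        if rec.full_image is not None:
            # A full image replaces whatever is on disk, even a torn page.
            pages[rec.page_id] = dict(rec.full_image)
        elif rec.change is not None:
            key, value = rec.change
            pages[rec.page_id][key] = value
    return pages


# Page "A" on disk is garbage, as if a write had been interrupted.
pages = {"A": {"x": -1}}
wal = [
    WalRecord(100, "A", change=("x", 1)),               # before the redo point
    WalRecord(200, "A", full_image={"x": 1, "y": 2}),   # first touch after it
    WalRecord(300, "A", change=("y", 3)),
]
control = {"checkpoint_lsn": 250}
checkpoints = {250: 200}  # checkpoint record at 250 points back to redo LSN 200
result = recover(control, checkpoints, wal, pages)
```

   The record at LSN 100 is skipped, the full image at LSN 200 overwrites
   the damaged page, and the later change is then applied on top of a
   known-good base.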

   To deal with the case where pg_control is corrupt, the server should
   be able to scan the existing WAL segments in reverse order (newest to
   oldest) to find the latest checkpoint. This has not been implemented
   yet. pg_control is small enough (less than one disk page) that it is
   not subject to partial-write problems, and as of this writing there
   have been no reports of database failures due solely to the inability
   to read pg_control itself. So while it is theoretically a weak spot,
   pg_control does not seem to be a problem in practice.