2 66.6. Database Page Layout #
4 66.6.1. Table Row Layout
6 This section provides an overview of the page format used within
7 PostgreSQL tables and indexes.^[19] Sequences and TOAST tables are
8 formatted just like a regular table.
10 In the following explanation, a byte is assumed to contain 8 bits. In
11 addition, the term item refers to an individual data value that is
12 stored on a page. In a table, an item is a row; in an index, an item is
15 Every table and index is stored as an array of pages of a fixed size
16 (usually 8 kB, although a different page size can be selected when
17 compiling the server). In a table, all the pages are logically
18 equivalent, so a particular item (row) can be stored in any page. In
19 indexes, the first page is generally reserved as a metapage holding
20 control information, and there can be different types of pages within
21 the index, depending on the index access method.
23 Table 66.2 shows the overall layout of a page. There are five parts to
26 Table 66.2. Overall Page Layout
28 PageHeaderData 24 bytes long. Contains general information about the
29 page, including free space pointers.
30 ItemIdData Array of item identifiers pointing to the actual items. Each
31 entry is an (offset,length) pair. 4 bytes per item.
32 Free space The unallocated space. New item identifiers are allocated
33 from the start of this area, new items from the end.
34 Items The actual items themselves.
35 Special space Index access method specific data. Different methods
36 store different data. Empty in ordinary tables.
38 The first 24 bytes of each page consists of a page header
39 (PageHeaderData). Its format is detailed in Table 66.3. The first field
40 tracks the most recent WAL entry related to this page. The second field
41 contains the page checksum if -k are enabled. Next is a 2-byte field
42 containing flag bits. This is followed by three 2-byte integer fields
43 (pd_lower, pd_upper, and pd_special). These contain byte offsets from
44 the page start to the start of unallocated space, to the end of
45 unallocated space, and to the start of the special space. The next 2
46 bytes of the page header, pd_pagesize_version, store both the page size
47 and a version indicator. Beginning with PostgreSQL 8.3 the version
48 number is 4; PostgreSQL 8.1 and 8.2 used version number 3; PostgreSQL
49 8.0 used version number 2; PostgreSQL 7.3 and 7.4 used version number
50 1; prior releases used version number 0. (The basic page layout and
51 header format has not changed in most of these versions, but the layout
52 of heap row headers has.) The page size is basically only present as a
53 cross-check; there is no support for having more than one page size in
54 an installation. The last field is a hint that shows whether pruning
55 the page is likely to be profitable: it tracks the oldest un-pruned
58 Table 66.3. PageHeaderData Layout
59 Field Type Length Description
60 pd_lsn PageXLogRecPtr 8 bytes LSN: next byte after last byte of WAL
61 record for last change to this page
62 pd_checksum uint16 2 bytes Page checksum
63 pd_flags uint16 2 bytes Flag bits
64 pd_lower LocationIndex 2 bytes Offset to start of free space
65 pd_upper LocationIndex 2 bytes Offset to end of free space
66 pd_special LocationIndex 2 bytes Offset to start of special space
67 pd_pagesize_version uint16 2 bytes Page size and layout version number
69 pd_prune_xid TransactionId 4 bytes Oldest unpruned XMAX on page, or
72 All the details can be found in src/include/storage/bufpage.h.
74 Following the page header are item identifiers (ItemIdData), each
75 requiring four bytes. An item identifier contains a byte-offset to the
76 start of an item, its length in bytes, and a few attribute bits which
77 affect its interpretation. New item identifiers are allocated as needed
78 from the beginning of the unallocated space. The number of item
79 identifiers present can be determined by looking at pd_lower, which is
80 increased to allocate a new identifier. Because an item identifier is
81 never moved until it is freed, its index can be used on a long-term
82 basis to reference an item, even when the item itself is moved around
83 on the page to compact free space. In fact, every pointer to an item
84 (ItemPointer, also known as CTID) created by PostgreSQL consists of a
85 page number and the index of an item identifier.
87 The items themselves are stored in space allocated backwards from the
88 end of unallocated space. The exact structure varies depending on what
89 the table is to contain. Tables and sequences both use a structure
90 named HeapTupleHeaderData, described below.
92 The final section is the “special section” which can contain anything
93 the access method wishes to store. For example, b-tree indexes store
94 links to the page's left and right siblings, as well as some other data
95 relevant to the index structure. Ordinary tables do not use a special
96 section at all (indicated by setting pd_special to equal the page
99 Figure 66.1 illustrates how these parts are laid out in a page.
101 Figure 66.1. Page Layout
103 66.6.1. Table Row Layout #
105 All table rows are structured in the same way. There is a fixed-size
106 header (occupying 23 bytes on most machines), followed by an optional
107 null bitmap, an optional object ID field, and the user data. The header
108 is detailed in Table 66.4. The actual user data (columns of the row)
109 begins at the offset indicated by t_hoff, which must always be a
110 multiple of the MAXALIGN distance for the platform. The null bitmap is
111 only present if the HEAP_HASNULL bit is set in t_infomask. If it is
112 present it begins just after the fixed header and occupies enough bytes
113 to have one bit per data column (that is, the number of bits that
114 equals the attribute count in t_infomask2). In this list of bits, a 1
115 bit indicates not-null, a 0 bit is a null. When the bitmap is not
116 present, all columns are assumed not-null. The object ID is only
117 present if the HEAP_HASOID_OLD bit is set in t_infomask. If present, it
118 appears just before the t_hoff boundary. Any padding needed to make
119 t_hoff a MAXALIGN multiple will appear between the null bitmap and the
120 object ID. (This in turn ensures that the object ID is suitably
123 Table 66.4. HeapTupleHeaderData Layout
124 Field Type Length Description
125 t_xmin TransactionId 4 bytes insert XID stamp
126 t_xmax TransactionId 4 bytes delete XID stamp
127 t_cid CommandId 4 bytes insert and/or delete CID stamp (overlays with
129 t_xvac TransactionId 4 bytes XID for VACUUM operation moving a row
131 t_ctid ItemPointerData 6 bytes current TID of this or newer row version
132 t_infomask2 uint16 2 bytes number of attributes, plus various flag bits
133 t_infomask uint16 2 bytes various flag bits
134 t_hoff uint8 1 byte offset to user data
136 All the details can be found in src/include/access/htup_details.h.
138 Interpreting the actual data can only be done with information obtained
139 from other tables, mostly pg_attribute. The key values needed to
140 identify field locations are attlen and attalign. There is no way to
141 directly get a particular attribute, except when there are only fixed
142 width fields and no null values. All this trickery is wrapped up in the
143 functions heap_getattr, fastgetattr and heap_getsysattr.
145 To read the data you need to examine each attribute in turn. First
146 check whether the field is NULL according to the null bitmap. If it is,
147 go to the next. Then make sure you have the right alignment. If the
148 field is a fixed width field, then all the bytes are simply placed. If
149 it's a variable length field (attlen = -1) then it's a bit more
150 complicated. All variable-length data types share the common header
151 structure struct varlena, which includes the total length of the stored
152 value and some flag bits. Depending on the flags, the data can be
153 either inline or in a TOAST table; it might be compressed, too (see
156 ^[19] Actually, use of this page format is not required for either
157 table or index access methods. The heap table access method always uses
158 this format. All the existing index methods also use the basic format,
159 but the data kept on index metapages usually doesn't follow the item