Streaming Replication, the basics
PGDAY BELGIUM 2019
Stefan Fercot
17 May 2019
Who Am I?
Stefan Fercot
aka. pgstef
PostgreSQL user since 2010
involved in the community since 2016
@dalibo since 2017
Dalibo
Based in France
Contributing to PostgreSQL community
Introduction
Streaming Replication can be used to set up a replicated PostgreSQL cluster, with one or more standby servers ready to take over operations if the primary server fails.
This feature relies on the Write Ahead Logs, aka WALs, and offers a lot of possibilities.
This talk summarizes for beginners what WALs are and how Streaming Replication works, and gives some advice and best practices.
Write-Ahead Log (WAL)
transactions written sequentially
COMMIT when data are flushed to disk
WAL replay after a crash
make the database consistent
WAL is the mechanism that PostgreSQL uses to ensure that no committed changes are lost. Transactions are written sequentially to the WAL and a transaction is considered to be committed when those writes are flushed to disk. Afterwards, a background process writes the changes into the main database cluster files (also known as the heap). In the event of a crash, the WAL is replayed to make the database consistent.
https://www.postgresql.org/docs/current/wal-intro.html
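For example, on a primary (v10 and later) the current WAL write position and the segment file containing it can be checked with:
SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());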
PostgreSQL WAL
REDO log only
no UNDO log (yet)
instant rollback
Structure
WAL is divided into WAL segments
each segment is a file in pg_wal directory
$ ls pg_wal/
00000001000000010000002E
00000001000000010000002F
000000010000000100000030
...
Filenames
00000001000000010000002E
00000001 : TLI
000000010000002E : LSN
00000001 : log id
0000002E : segment number
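To illustrate the mapping, pg_walfile_name() returns the segment file holding a given LSN; on a server on timeline 1 (the LSN below is just an example):
SELECT pg_walfile_name('1/2E012345');
-- 00000001000000010000002E (assuming timeline 1)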
Checkpoints
flush all data pages to disk
write a checkpoint record
recycle / remove old WAL
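Checkpoints happen automatically, driven mainly by two postgresql.conf settings (the values below are the defaults):
checkpoint_timeout = 5min
max_wal_size = 1GB
A superuser can also force one with the SQL command CHECKPOINT; (used again later, before checking pg_controldata).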
Archiving
old WAL segments are deleted / recycled after a checkpoint
can also be archived with archive_command
Allows online backups and Point-in-Time Recovery
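As an illustration (the destination path is a placeholder), a minimal archiving setup in postgresql.conf, following the documentation's classic example:
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'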
Replication
apply WAL on a standby server, as it is generated on the primary
using WAL archives (files)
or by streaming over a TCP connection
Streaming Replication
architecture/compile flag dependent
whole cluster only
read-only standby
no built-in cluster management
no (easy) fail-back
wal_level
wal_level = 'replica'
max_wal_senders
max_wal_senders=10
(default from v10)
Authentication
On primary
CREATE ROLE replicator WITH LOGIN REPLICATION;
… and setup a password
in pg_hba.conf
host replication replicator standby_ip/32 md5
On the standby, specify the user/password e.g. in ~/.pgpass
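The .pgpass format is hostname:port:database:username:password; for a replication connection, the database field is replication (or *). A sketch with placeholder host and password, readable only by the postgres user:
primary:5432:replication:replicator:mysecretpassword
$ chmod 0600 ~/.pgpass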
If not done previously on primary:
change listen_addresses = '*'
allow firewall
firewall-cmd --permanent --add-service=postgresql
firewall-cmd --reload
Data initialization
$ pg_basebackup -D /var/lib/pgsql/11/data \
-h primary -U replicator -R -P
before v10, add -X stream
-D, --pgdata=DIRECTORY receive base backup into directory
-h, --host=HOSTNAME database server host or socket directory
-U, --username=NAME connect as specified database user
-R, --write-recovery-conf write recovery.conf for replication
-S, --slot=SLOTNAME replication slot to use
-P, --progress show progress information
recovery.conf
standby_mode
primary_conninfo
recovery_target_timeline
https://www.postgresql.org/docs/current/standby-settings.html
https://www.postgresql.org/docs/current/recovery-target-settings.html
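Put together, a minimal recovery.conf for a streaming standby (v11 and earlier), built from the three settings detailed on the next slides:
standby_mode = 'on'
primary_conninfo = 'user=replicator host=primary'
recovery_target_timeline = 'latest'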
standby_mode
standby_mode=on
continuous recovery by fetching new WAL segments
using restore_command
by connecting to the primary server
Specifies whether to start the PostgreSQL server as a standby. If this parameter is on, the server will not stop recovery when the end of archived WAL is reached, but will keep trying to continue recovery by fetching new WAL segments using restore_command and/or by connecting to the primary server as specified by the primary_conninfo setting.
primary_conninfo
primary_conninfo = 'user=replicator host=primary'
connection string to the primary server
recovery_target_timeline
particular timeline for recovery
latest
is useful in a standby server
new timeline created after a recovery
to identify the series of WAL records generated afterwards
Specifies recovering into a given timeline. The default is to recover along the same timeline that was current when the base backup was taken. Setting this to latest recovers to the latest timeline found in the archive, which is useful in a standby server.
PostgreSQL 12 changes
“Integrate recovery.conf into postgresql.conf” (2018-11-25)
recovery.signal / standby.signal
pg_basebackup -R appends to postgresql.auto.conf
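As an illustration of the v12 behaviour, the recovery.conf above becomes an empty standby.signal file plus the connection string in postgresql.auto.conf (the data directory path is a placeholder):
$ touch /var/lib/pgsql/12/data/standby.signal
primary_conninfo = 'user=replicator host=primary'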
Start
# systemctl start postgresql-11
Processes
On primary:
walsender replicator ... streaming 0/3BD48728
On standby:
walreceiver streaming 0/3BD48728
Monitoring
lag
amount of WAL records generated in the primary
not yet received / applied on the standby
pg_current_wal_lsn()
on the primary
pg_last_wal_receive_lsn()
, pg_last_wal_replay_lsn()
on the standby
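These positions can be compared with pg_wal_lsn_diff(); for example, on the standby, the gap in bytes between what has been received and what has been replayed:
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn());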
pg_stat_replication
On primary:
usename | replicator
application_name | walreceiver
state | streaming
sent_lsn | 0/3BD48728
write_lsn | 0/3BD48728
flush_lsn | 0/3BD48728
replay_lsn | 0/3BD48728
sync_state | async
...
You can retrieve a list of WAL sender processes via the pg_stat_replication view. Large differences between pg_current_wal_lsn and the view’s sent_lsn field might indicate that the primary server is under heavy load, while differences between sent_lsn and pg_last_wal_receive_lsn on the standby might indicate network delay, or that the standby is under heavy load.
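For example, on the primary (v10 and later), the replay lag of each standby in bytes can be computed from the view:
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;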
pg_stat_wal_receiver
On standby:
status | streaming
received_lsn | 0/3BD48728
received_tli | 1
...
On a hot standby, the status of the WAL receiver process can be retrieved via the pg_stat_wal_receiver view. A large difference between pg_last_wal_replay_lsn and the view’s received_lsn indicates that WAL is being received faster than it can be replayed.
Split-brain
if standby server becomes new primary
make sure the old primary is no longer the primary
avoid situations where both systems think they are the primary
which can lead to confusion and ultimately data loss
Some scary stories:
https://github.blog/2018-10-30-oct21-post-incident-analysis
https://aphyr.com/posts/288-the-network-is-reliable
On primary:
# systemctl stop postgresql-11
$ pg_controldata -D /var/lib/pgsql/11/data/ \
| grep -E '(Database cluster state)|(REDO location)'
Database cluster state: shut down
Latest checkpoint's REDO location: 0/3BD487D0
On standby:
$ psql -c 'CHECKPOINT;'
$ pg_controldata -D /var/lib/pgsql/11/data/ \
| grep -E '(Database cluster state)|(REDO location)'
Database cluster state: in archive recovery
Latest checkpoint's REDO location: 0/3BD487D0
pg_ctl promote [-D datadir] [-W] [-t seconds] [-s]
trigger_file
in recovery.conf
https://www.postgresql.org/docs/current/app-pg-ctl.html
trigger_file specifies a trigger file whose presence ends recovery in the standby. More frequently nowadays, you will promote the standby using pg_ctl promote.
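If the trigger file approach is used instead, a recovery.conf entry like the following (the path is arbitrary) makes the standby exit recovery as soon as the file appears:
trigger_file = '/tmp/postgresql.trigger'
$ touch /tmp/postgresql.trigger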
LOG: received promote request
LOG: redo done at 0/3BD487D0
LOG: last completed transaction was at log time ...
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
Fail-back
old primary as a standby
full copy of the new primary
pg_rewind
--source-pgdata
--source-server
pg_rewind
rewinding a cluster until its divergence with another
needs wal_log_hints
or data checksums
--dry-run
https://www.postgresql.org/docs/current/app-pgrewind.html
To enable checksums at initdb: PGSETUP_INITDB_OPTIONS="--data-checksums"
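If checksums were not enabled at initdb time, the alternative is wal_log_hints in postgresql.conf; it requires a restart, so it is best enabled up front, before any failover happens:
wal_log_hints = on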
pg_rewind (2)
$ pg_rewind -D /var/lib/pgsql/11/data/ \
--source-server="user=postgres host=primary" -P
connected to server
servers diverged at WAL location 0/3BD48840 on timeline 1
rewinding from last common checkpoint at 0/3BD487D0 on timeline 1
reading source file list
reading target file list
reading WAL in target
need to copy 196 MB (total source directory size is 561 MB)
200806/200806 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
Troubles
What if the connection between primary and standby fails?
Replication slots
primary does not remove WAL segments
until received by all standbys
pg_create_physical_replication_slot('slot_name');
primary_slot_name
max_replication_slots = 10
(default from v10)
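A minimal sketch, with 'standby1' as a placeholder slot name: create the slot on the primary, point the standby at it in recovery.conf, and keep an eye on retained WAL via pg_replication_slots:
SELECT pg_create_physical_replication_slot('standby1');
primary_slot_name = 'standby1'
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;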
Log-shipping
Don’t prevent the removal of old WAL segments, use the archives!
restore_command
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
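With the same placeholder archive path, the matching restore_command in recovery.conf would be:
restore_command = 'cp /path/to/archive/%f %p'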
PITR
Combine with PITR backups for easier fail-backs!
online backups
the standby uses archives from the PITR repository
faster standby creation through backup restore
Synchronous replication
synchronous_commit
off
local
remote_write
on
remote_apply
can be applied by transaction
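Since synchronous_commit is a regular user-settable parameter, it can be relaxed for a single session or transaction, for instance:
SET synchronous_commit = off;          -- whole session
BEGIN;
SET LOCAL synchronous_commit = off;    -- this transaction only
COMMIT;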
synchronous_standby_names
Single (9.1)
synchronous_standby_names = 's1,s2,s3'
First (9.6)
synchronous_standby_names = '2(s1,s2,s3)'
Quorum (10)
synchronous_standby_names = 'ANY 2(s1,s2,s3)'
https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-SYNCHRONOUS-STANDBY-NAMES
Hot standby and conflicts
DROP TABLE
on primary…
cannot wait for the end of queries on standby
on standby (max_standby_archive_delay
and max_standby_streaming_delay
)
delay application of WAL record
or cancel the conflicting query
An example of the problem situation is an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the standby query cannot continue if the DROP TABLE is applied on the standby. If this situation occurred on the primary, the DROP TABLE would wait until the other query had finished. But when DROP TABLE is run on the primary, the primary doesn’t have information about what queries are running on the standby, so it will not wait for any such standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The standby server must either delay application of the WAL records (and everything after them, too) or else cancel the conflicting query so that the DROP TABLE can be applied.
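Both delays default to 30 seconds; increasing them favours standby queries, decreasing them favours replication, e.g. in the standby's postgresql.conf:
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s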
Early cleanup
cleanup on the primary
according to MVCC rules
remove row versions still visible to a transaction on the standby
hot_standby_feedback
Similarly, hot_standby_feedback and vacuum_defer_cleanup_age provide protection against relevant rows being removed by vacuum, but the former provides no protection during any time period when the standby is not connected, and the latter often needs to be set to a high value to provide adequate protection. Replication slots overcome these disadvantages.
https://www.postgresql.org/docs/devel/runtime-config-replication.html#GUC-HOT-STANDBY-FEEDBACK
https://www.postgresql.org/docs/devel/runtime-config-replication.html#GUC-VACUUM-DEFER-CLEANUP-AGE
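Enabling the feedback is a one-line change on the standby (a configuration reload is enough), at the cost of possible extra bloat on the primary:
hot_standby_feedback = on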
Updates
different minor release on primary and standby usually works
update the standby servers first
In general, log shipping between servers running different major PostgreSQL release levels is not possible. It is the policy of the PostgreSQL Global Development Group not to make changes to disk formats during minor release upgrades, so it is likely that running different minor release levels on primary and standby servers will work successfully. However, no formal support for that is offered and you are advised to keep primary and standby servers at the same release level as much as possible. When updating to a new minor release, the safest policy is to update the standby servers first — a new minor release is more likely to be able to read WAL files from a previous minor release than vice versa.
Patroni
Python
“template” for high-availability
with ZooKeeper, etcd, Consul or Kubernetes
integrates with HAProxy
repmgr
fewer prerequisites
easier for manual processing
repmgrd for automatic fail-over
witness to avoid split-brain
no connection management
PAF
agent for Pacemaker/Corosync
Linux HA
possible management of other services
connection routing with virtual IP
STONITH
PITR
pgBackRest, …
… but that’s for another talk!
pgBackRest Main Features
custom protocol
local or remote operation (via SSH)
multi-process
full/differential/incremental backup
backup rotation and archive expiration
parallel, asynchronous WAL push and get
Amazon S3 support
encryption
…
Logical Replication
reconstructs changes by row
replicates row content, not SQL statements
table-level partial / bi-directional replication
data replication only
suitable for data distribution
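A minimal sketch (publication, subscription, table and connection string are placeholders; wal_level must be set to logical on the publisher):
-- on the publisher
CREATE PUBLICATION my_pub FOR TABLE my_table;
-- on the subscriber
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary dbname=mydb user=replicator'
    PUBLICATION my_pub;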
Conclusion
consolidated during 9.x versions
out of the box in 10
wal_level
max_wal_senders
…
Thank you for your attention!