# Testing the Patroni PostgreSQL Cluster
This document covers the following scenarios to test the PostgreSQL cluster:
- replication,
- connectivity,
- failover, and
- manual switchover.
## Testing replication
-   Connect to the cluster by establishing a `psql` session from a client machine that can reach the HAProxy node. Use the HAProxy-demo node's public IP address:

    ```
    $ psql -U postgres -h 134.209.111.138 -p 5000
    ```
-   Run the following commands to create a table and insert a few rows:

    ```
    CREATE TABLE customer(name text, age integer);
    INSERT INTO CUSTOMER VALUES('john',30);
    INSERT INTO CUSTOMER VALUES('dawson',35);
    ```
-   To verify that replication is working, log in to each PostgreSQL node and run a simple SQL statement against the locally running instance:

    ```
    $ sudo psql -U postgres -c "SELECT * FROM CUSTOMER;"
    ```

    The result on each node should be the following:

    ```
     name   | age
    --------+-----
     john   |  30
     dawson |  35
    (2 rows)
    ```
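The per-node check above can also be scripted. The following is a minimal sketch, assuming the rows have already been fetched from each node with any PostgreSQL client; the node names and rows mirror the example data in this document, and `replication_consistent` is a hypothetical helper, not part of Patroni:

```python
# Sketch: verify that every node returned the same rows for
# SELECT * FROM CUSTOMER. How the rows are fetched is up to you
# (psql, psycopg2, ...); here they are hard-coded for illustration.

def replication_consistent(results_by_node):
    """Return True if all nodes returned identical row sets."""
    row_sets = [frozenset(rows) for rows in results_by_node.values()]
    return all(s == row_sets[0] for s in row_sets)

results = {
    "node1": [("john", 30), ("dawson", 35)],
    "node2": [("john", 30), ("dawson", 35)],
    "node3": [("john", 30), ("dawson", 35)],
}
print(replication_consistent(results))  # True when all nodes agree
```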
## Testing failover
In a proper setup, client applications won't have issues connecting to the cluster, even if one or two of the nodes go down. We will test the cluster for failover in the following scenarios:
### Scenario 1. Intentionally stop PostgreSQL on the primary node and verify access to PostgreSQL
-   Run the following command on any node to check the current cluster status:

    ```
    $ sudo patronictl -c /etc/patroni/patroni.yml list
    ```

    Output:

    ```
    + Cluster: stampede1 (7011110722654005156) -----------+
    | Member | Host  | Role    | State   | TL | Lag in MB |
    +--------+-------+---------+---------+----+-----------+
    | node1  | node1 | Leader  | running |  1 |           |
    | node2  | node2 | Replica | running |  1 |         0 |
    | node3  | node3 | Replica | running |  1 |         0 |
    +--------+-------+---------+---------+----+-----------+
    ```
-   `node1` is the current leader. Stop Patroni on `node1` to see how the cluster changes:

    ```
    $ sudo systemctl stop patroni
    ```
-   Once the service stops on `node1`, check the logs on `node2` and `node3` using the following command:

    ```
    $ sudo journalctl -u patroni.service -n 100 -f
    ```

    Output:

    ```
    Sep 23 14:18:13 node03 patroni[10042]: 2021-09-23 14:18:13,905 INFO: no action. I am a secondary (node3) and following a leader (node1)
    Sep 23 14:18:20 node03 patroni[10042]: 2021-09-23 14:18:20,011 INFO: Got response from node2 http://node2:8008/patroni: {"state": "running", "postprimary_start_time": "2021-09-23 12:50:29.460027+00:00", "role": "replica", "server_version": 130003, "cluster_unlocked": true, "xlog": {"received_location": 67219152, "replayed_location": 67219152, "replayed_timestamp": "2021-09-23 13:19:50.329387+00:00", "paused": false}, "timeline": 1, "database_system_identifier": "7011110722654005156", "patroni": {"version": "2.1.0", "scope": "stampede1"}}
    Sep 23 14:18:20 node03 patroni[10042]: 2021-09-23 14:18:20,031 WARNING: Request failed to node1: GET http://node1:8008/patroni (HTTPConnectionPool(host='node1', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
    Sep 23 14:18:20 node03 patroni[10042]: 2021-09-23 14:18:20,038 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
    Sep 23 14:18:20 node03 patroni[10042]: 2021-09-23 14:18:20,043 INFO: promoted self to leader by acquiring session lock
    Sep 23 14:18:20 node03 patroni[13641]: server promoting
    Sep 23 14:18:20 node03 patroni[10042]: 2021-09-23 14:18:20,049 INFO: cleared rewind state after becoming the leader
    Sep 23 14:18:21 node03 patroni[10042]: 2021-09-23 14:18:21,101 INFO: no action. I am (node3) the leader with the lock
    Sep 23 14:18:21 node03 patroni[10042]: 2021-09-23 14:18:21,117 INFO: no action. I am (node3) the leader with the lock
    Sep 23 14:18:31 node03 patroni[10042]: 2021-09-23 14:18:31,114 INFO: no action. I am (node3) the leader with the lock
    ...
    ```
    The logs on `node3` show that the requests to `node1` are failing, the watchdog comes into action, and `node3` promotes itself to leader.

-   Verify that you can still access the cluster through the HAProxy instance and read data:

    ```
    $ psql -U postgres -h 10.104.0.6 -p 5000 -c "SELECT * FROM CUSTOMER;"
     name   | age
    --------+-----
     john   |  30
     dawson |  35
    (2 rows)
    ```
-   Restart the Patroni service on `node1`:

    ```
    $ sudo systemctl start patroni
    ```
-   Check the current cluster status:

    ```
    $ sudo patronictl -c /etc/patroni/patroni.yml list
    ```

    Output:

    ```
    + Cluster: stampede1 (7011110722654005156) -----------+
    | Member | Host  | Role    | State   | TL | Lag in MB |
    +--------+-------+---------+---------+----+-----------+
    | node1  | node1 | Replica | running |  2 |         0 |
    | node2  | node2 | Replica | running |  2 |         0 |
    | node3  | node3 | Leader  | running |  2 |           |
    +--------+-------+---------+---------+----+-----------+
    ```

    As we can see, `node3` remains the leader and the other nodes are replicas.
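The JSON payloads visible in the journalctl output come from Patroni's REST API (`GET http://<node>:8008/patroni`). A monitoring script could parse them to track each node's role, timeline, and replay position. Below is a minimal sketch using a trimmed copy of the payload shown in the logs above; the `summarize` helper is hypothetical:

```python
# Sketch: extract role, timeline, and the replay delta (bytes received
# from the leader but not yet replayed locally) from a /patroni payload.
import json

# Trimmed version of the payload seen in the journalctl output above
sample = '''{"state": "running", "role": "replica",
 "xlog": {"received_location": 67219152, "replayed_location": 67219152,
          "replayed_timestamp": "2021-09-23 13:19:50.329387+00:00", "paused": false},
 "timeline": 1}'''

def summarize(payload):
    node = json.loads(payload)
    xlog = node.get("xlog", {})
    replay_delta = xlog.get("received_location", 0) - xlog.get("replayed_location", 0)
    return node["role"], node["timeline"], replay_delta

print(summarize(sample))  # ('replica', 1, 0)
```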
### Scenario 2. Abrupt machine shutdown or power outage
To emulate a power outage, let's kill the Patroni service on `node3` and see what happens on `node1` and `node2`.
-   Identify the process ID of Patroni and then kill it with the `-9` signal:

    ```
    $ ps aux | grep -i patroni
    postgres 10042 0.1 2.1 647132 43948 ? Ssl 12:50 0:09 /usr/bin/python3 /usr/bin/patroni /etc/patroni/patroni.yml

    $ sudo kill -9 10042
    ```
-   Check the logs on `node2`:

    ```
    $ sudo journalctl -u patroni.service -n 100 -f
    ```

    Output:

    ```
    Sep 23 14:40:41 node02 patroni[10577]: 2021-09-23 14:40:41,656 INFO: no action. I am a secondary (node2) and following a leader (node3)
    ...
    Sep 23 14:41:01 node02 patroni[10577]: 2021-09-23 14:41:01,373 INFO: Got response from node1 http://node1:8008/patroni: {"state": "running", "postprimary_start_time": "2021-09-23 14:25:30.076762+00:00", "role": "replica", "server_version": 130003, "cluster_unlocked": true, "xlog": {"received_location": 67221352, "replayed_location": 67221352, "replayed_timestamp": null, "paused": false}, "timeline": 2, "database_system_identifier": "7011110722654005156", "patroni": {"version": "2.1.0", "scope": "stampede1"}}
    Sep 23 14:41:03 node02 patroni[10577]: 2021-09-23 14:41:03,364 WARNING: Request failed to node3: GET http://node3:8008/patroni (HTTPConnectionPool(host='node3', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f57e06dffa0>, 'Connection to node3 timed out. (connect timeout=2)')))
    Sep 23 14:41:03 node02 patroni[10577]: 2021-09-23 14:41:03,373 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
    Sep 23 14:41:03 node02 patroni[10577]: 2021-09-23 14:41:03,385 INFO: promoted self to leader by acquiring session lock
    Sep 23 14:41:03 node02 patroni[15478]: server promoting
    Sep 23 14:41:03 node02 patroni[10577]: 2021-09-23 14:41:03,397 INFO: cleared rewind state after becoming the leader
    Sep 23 14:41:04 node02 patroni[10577]: 2021-09-23 14:41:04,450 INFO: no action. I am (node2) the leader with the lock
    Sep 23 14:41:04 node02 patroni[10577]: 2021-09-23 14:41:04,475 INFO: no action. I am (node2) the leader with the lock
    ...
    ```
    `node2` realizes that the leader is gone and promotes itself to leader.

-   Try accessing the cluster through the HAProxy endpoint at any point between these operations. The cluster still accepts connections.
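The HAProxy check in the last step can be automated with a simple TCP probe. Here is a minimal sketch; the address and port follow the examples in this document, and a probe like this only confirms that the port accepts connections, not that queries succeed:

```python
# Sketch: poll the HAProxy endpoint while stopping or killing nodes.
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example against the HAProxy endpoint used in this document:
# print(can_connect("10.104.0.6", 5000))
```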
## Manual switchover
Typically, a manual switchover is needed for planned downtime, to perform maintenance on the leader node. Patroni provides the `switchover` command to manually switch over from the leader node.
Run the following command on `node2` (the current leader node):

```
$ sudo patronictl -c /etc/patroni/patroni.yml switchover
```
Patroni asks for the name of the current primary node and then for the node that should take over as the new primary. You can also specify the time at which the switchover should happen. To trigger the process immediately, specify the value `now`:
```
primary [node2]: node2
Candidate ['node1', 'node3'] []: node1
When should the switchover take place (e.g. 2021-09-23T15:56) [now]: now
Current cluster topology
+ Cluster: stampede1 (7011110722654005156) -----------+
| Member | Host  | Role    | State   | TL | Lag in MB |
+--------+-------+---------+---------+----+-----------+
| node1  | node1 | Replica | running |  3 |         0 |
| node2  | node2 | Leader  | running |  3 |           |
| node3  | node3 | Replica | stopped |    |   unknown |
+--------+-------+---------+---------+----+-----------+
Are you sure you want to switchover cluster stampede1, demoting current primary node2? [y/N]: y
2021-09-23 14:56:40.54009 Successfully switched over to "node1"
+ Cluster: stampede1 (7011110722654005156) -----------+
| Member | Host  | Role    | State   | TL | Lag in MB |
+--------+-------+---------+---------+----+-----------+
| node1  | node1 | Leader  | running |  3 |           |
| node2  | node2 | Replica | stopped |    |   unknown |
| node3  | node3 | Replica | stopped |    |   unknown |
+--------+-------+---------+---------+----+-----------+
```
Restart the Patroni service on `node2` after the "planned maintenance" is complete. The node rejoins the cluster as a secondary.
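To script a post-switchover check, the ASCII table that `patronictl list` prints can be parsed (recent Patroni versions may also offer machine-readable output formats). A minimal sketch, with the member rows copied from the post-switchover table above; `roles` is a hypothetical helper:

```python
# Sketch: map member name -> role from patronictl list output so a
# script can assert the expected topology after a switchover.

TABLE = """\
| node1  | node1 | Leader  | running |  3 |           |
| node2  | node2 | Replica | stopped |    |   unknown |
| node3  | node3 | Replica | stopped |    |   unknown |
"""

def roles(table_text):
    result = {}
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Keep only data rows whose Role column is a known value
        if len(cells) >= 3 and cells[2] in ("Leader", "Replica"):
            result[cells[0]] = cells[2]
    return result

print(roles(TABLE))  # {'node1': 'Leader', 'node2': 'Replica', 'node3': 'Replica'}
```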