- The story comes from OneSignal, a messaging platform that depends on PostgreSQL's LISTEN/NOTIFY features to push notifications to end users. Lots of words of wisdom in this article.
- The problems presented seem to arise once the usual DB performance enhancements are already saturated (e.g. too many indices added) and a bigger change is needed.
- “subscribers” makes up the majority of the data bloat: over 1 billion monthly active subscriptions, with heavy INSERT and UPDATE transactions.
- “notifications” makes up the next-largest data set. Records can be big; heavy UPDATE transactions.
Table Bloat
- When table records are deleted, the space does not get recycled automatically; the deleted rows become dead tuples that still occupy the space.
- `VACUUM` cleans up dead tuples and allows the space to be reused, but only within that same table. `VACUUM FULL` rewrites the table and frees the reclaimed space for reuse elsewhere, but it introduces table locking.
- In most DBs an `UPDATE` is effectively an `INSERT + DELETE`. Dead tuples are created from the `DELETE`, hence in a write-heavy environment dead tuples pile up easily. This is super interesting.
- The unfortunate part is that most of the time we want to use the `VACUUM FULL` command, but it locks the table, so we cannot run it without planning downtime. Running plain `VACUUM` leaves the wasted space allocated to the table.
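A quick way to see where dead tuples are piling up (a sketch; it only assumes the built-in `pg_stat_user_tables` statistics view, nothing OneSignal-specific):

```sql
-- Tables with the most dead tuples and when they were last vacuumed.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```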
Index Bloat
- PostgreSQL indices are direct indices: each index entry points to a specific tuple (its heap location).
- Due to PostgreSQL's MVCC approach, dead index entries get created, similar to table bloat.
- Every time a column is updated, the indices need to be updated as well. So a change to a column generates:
  - dead tuples in the table (heap)
  - dead entries in the table's indices
- The Heap Only Tuple (HOT) optimisation allows a new tuple version to be stored adjacent to its previous version on the same page, skipping the index updates; however this comes with a trade-off in read performance for index scans, which must follow the chain to the live version. (Makes sense: it is like searching through a bloated array once the tuple record is found.)
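To gauge how often HOT actually kicks in, the same statistics view tracks HOT vs. regular updates per table (a sketch, nothing OneSignal-specific):

```sql
-- Share of updates that were HOT, i.e. required no index entry updates.
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / greatest(n_tup_upd, 1), 2) AS hot_update_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC
LIMIT 10;
```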
How to Prevent Bloat
- The easiest way is to avoid creating bloat in the first place; otherwise you might end up with fancy solutions that are difficult to maintain.
- `autovacuum` spawns a process that vacuums in the background. There are details on tuning it in the article.
- Partition by date and run a data retention policy. Say nothing is kept over 30 days: partition into 30 buckets and then run `DROP TABLE` to remove a single partitioned table, rather than deleting rows from the table as a whole. `pg_partman` is a useful tool for this (see the sketch after this list).
- Separate frequently updated columns with a small data footprint from rarely updated columns with a large data footprint. This helps reduce bloat, since otherwise the large columns get copied on every update and create dead tuples even though those columns did not change. Ex: an int column that is updated frequently vs. a bigint column that is rarely updated.
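A minimal sketch of the date-partitioning plus retention idea using native declarative partitioning (table and column names are made up; `pg_partman` automates the partition create/drop churn):

```sql
-- Hypothetical notifications table, partitioned by day.
CREATE TABLE notifications (
    id         bigint      NOT NULL,
    payload    jsonb,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per day, normally pre-created by a cron job or pg_partman.
CREATE TABLE notifications_2024_01_01
    PARTITION OF notifications
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');

-- Retention: dropping an expired partition is near-instant and creates no dead
-- tuples, unlike a DELETE over the whole table filtered by created_at.
DROP TABLE notifications_2024_01_01;
```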
How to Reduce Bloat
- `VACUUM FULL` – the obvious option, but it should be the last resort as it locks the whole table and may take services offline.
- `pg_repack` – similar to a (hypothetical) `VACUUM FULL CONCURRENTLY` combined with `REINDEX CONCURRENTLY`
  - Upsides:
    - Does not lock the table until the very end, when the original table is swapped for the bloat-free copy.
    - Able to copy most things over, including usage stats.
  - Drawbacks:
    - If there are long-running queries and frequent access, it can still cause unavailability.
    - It is a manual process, not a background process.
    - Updates on the original table are stored in a log table. If writes to the original table arrive faster than the log table can be processed, pg_repack can end up in a situation where it never finishes.
    - A separate daemon needs to be created if we want to automate pg_repack.
- `pgcompacttable` – nudges live tuples from the end of the table into the free space (dead tuples) near the front, so the tail of the table empties out and can be reclaimed. Sounds like two steps:
  - Run pgcompacttable first so the dead-tuple space gets reused
  - Run `VACUUM` as the second step to reclaim the now-empty tail
No one wants to use pg_repack or pgcompacttable when there is no bloat problem in the first place. To keep it that way:
- avoid long transactions on the primary,
- avoid long transactions on replicas with `hot_standby_feedback = on`,
- use aggressive enough autovacuum settings (sketch below).
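As an example of "aggressive enough", autovacuum can be tuned per table via storage parameters; the table name and thresholds below are illustrative, not OneSignal's values:

```sql
-- By default autovacuum waits until ~20% of a table's rows are dead
-- (autovacuum_vacuum_scale_factor = 0.2). On a huge, write-heavy table
-- that is far too late, so trigger much earlier for this table only.
ALTER TABLE subscribers SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000
);

-- Inspect the per-table overrides.
SELECT relname, reloptions FROM pg_class WHERE relname = 'subscribers';
```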
Another tool, not mentioned in the article, is `pg_squeeze`, which is focused on tackling similar bloat issues.
DB Major Upgrades (not related to scaling but also covered in the article)
- Major upgrades via `pg_upgrade` require the DB to be shut down while the data is rewritten into the new format.
- To achieve HA, use logical replication to copy the data over to a new DB running the new version, and then cut over.
- Recommended: use `pglogical` to tweak the logical replication behaviour, e.g. during conflict resolution.
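A minimal sketch of the logical-replication cutover using the native publication/subscription commands (names and connection details are made up; `pglogical` offers the same flow with more control over conflicts):

```sql
-- On the old (source) cluster: publish every table.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new cluster (already running the new major version, schema pre-created):
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-db dbname=app user=replicator password=secret'
    PUBLICATION upgrade_pub;

-- Once the new cluster has caught up: stop writes on the old one, wait for
-- lag to hit zero, point the application at the new cluster, then clean up.
DROP SUBSCRIPTION upgrade_sub;
```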
DB Minor Upgrades
- Just restart the DB if HA is not required. It seems to be really fast, on the order of 10 seconds, even on massive data sizes.
Transaction ID (XID)
- 32-bit transaction ID used to track new row versions.
- Similar to a circular buffer implementation.
- With more than 10k transactions per second, the “max” value is easily reached: 2^31 ≈ 2.1 billion XIDs lasts only about 2.5 days at that rate. Exhausting it will cause data corruption.
- The vacuum process can be used to freeze very old transaction IDs, but even then it may still fail to freeze extremely old ones, which would eventually lead to data corruption.
- Should implement some observability tooling to monitor the remaining XIDs. The remaining XIDs must not drop below 1 million, otherwise the DB will stop accepting commands; the only way out at that point is to stop the DB and vacuum it in single-user mode.
The query below fetches the remaining XIDs:
SELECT power(2, 31) - age(datfrozenxid) AS remaining
FROM pg_database
WHERE datname = current_database();
`autovacuum_freeze_max_age` (which governs when autovacuum starts forcing freezes) comes into play at OneSignal whenever the remaining value above ever drops below 250 million.
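To see which tables are driving the number down, per-table `relfrozenxid` age can be compared against `autovacuum_freeze_max_age` (a sketch using standard catalogs, not from the article):

```sql
-- Tables whose oldest unfrozen XID approaches autovacuum_freeze_max_age
-- (200 million by default) will get an anti-wraparound autovacuum soon.
SELECT c.relname,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
```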
Replica Promotion
- A common DB architecture is to have a single read/write DB and multiple read-only replicas.
- When a replica promotion is required, you want to make the transition quickly, because writes are blocked while no DB is serving them.
- OneSignal implemented an HA proxy in front of multiple DBs that have read/write access, with one of them designated as the primary. When the primary is detected as down through health checks, the proxy moves the traffic to the “replica” (see the `pg_promote()` sketch below).
- Instead of implementing custom solutions, use Patroni
- Partitioning is a potential solution for autovacuum and index-scan bloat issues.
- Allowing multiple write DBs instead of just one.
- File-system performance can be a concern for a single-server setup, as partitioning relies on the built-in OS file system. Using EXT4 is fine.
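On the replica-promotion point above: since PostgreSQL 12 the promotion itself can be triggered from plain SQL on the standby; tools like Patroni orchestrate this step together with the traffic switch.

```sql
-- Run on the standby to promote it to a read/write primary.
-- By default this waits up to 60 seconds for the promotion to complete.
SELECT pg_promote();
```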
Sharding
- While partitioning separates data into different buckets, it can still be done on a single DB server. Sharding, on the other hand, means physically storing those buckets on different servers.
- Only big data sets need to be sharded.
- The application side should not need to know how the data is sharded (i.e. know the IP addresses of the DBs). This was a design regret at OneSignal.
- OneSignal's implementation is sharded by tenant IDs, which are v4 UUIDs. The application side can deterministically work out, from a given tenant's UUID, which shard it lives on, then go to the correct DB and find the record.
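The article does not spell out OneSignal's exact mapping, so the following is purely illustrative: hash the tenant UUID deterministically to a shard index, which the application then resolves to a physical DB (the UUID literal and the 16-shard count are made up):

```sql
-- Illustrative only: deterministically map a v4 tenant UUID onto one of 16 shards.
-- md5() gives a stable hash of the UUID text; the first 8 hex digits are read
-- as a 32-bit integer and reduced modulo the shard count.
SELECT abs(('x' || left(md5('6f1c2d3e-4b5a-4c7d-9e8f-0a1b2c3d4e5f'), 8))::bit(32)::int) % 16
       AS shard_index;
```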
Ref:
- Annotated Article
- How to see dead tuples
- Comparison of pg_repack, VACUUM FULL, and pgcompacttable
- Another comparison
- XID exhaustion postmortem by Sentry.io
