Time to Add Databases to your Spring Cleaning - VACUUMs in Postgres

Deleted records from your database could still be taking up space on your disk. Learn how to use VACUUM to reduce size and improve performance of your databases.

5 months ago   •   6 min read

By Rajendra gupta

Every time you delete or update a row in your database, the old records are secretly still hiding in the background and taking up space on your hard drive.

A VACUUM process is like emptying the recycling bin on your laptop. It clears up space, reduces indexing time, and keeps your database squeaky clean.

But it's important to understand how VACUUM works so you can avoid the equivalent of vacuuming your house in the middle of a dinner party.

By understanding how and when Postgres and other databases clean themselves, you will be able to tune your database for low response times and your database server for the right amount of storage.

In this post we will cover:

  1. What a VACUUM process is and how it works
  2. Full vs Auto VACUUM
  3. How to modify and inspect this process

How Postgres Executes SQL Statements - DEAD vs Removed Tuples

Updating a Postgres Table. SQL Editor: Arctype

During the life of a database, we might make thousands of changes to a database like above. But deleting a record does not actually free up the disk space.

Postgres uses multi-version concurrency control (also known as MVCC) to ensure concurrent data access and consistency.

Whenever we execute a SQL statement, it uses a snapshot of data instead of every row. This prevents users from viewing inconsistent data generated by concurrent transactions. It also minimizes lock contentions for different sessions trying to read or write data.

Each transaction gets a transaction ID – XID for a block of BEGIN and COMMIT statement. When a user inserts a row, but the transaction is not committed, other users cannot see the newly inserted row.

For example, in the below image, User A inserts two records into an empty table. If User B were to run a SELECT statement, they would get zero rows before the record  committed. Similarly, another user still gets table data for a delete statement if the delete transaction statement is not committed.

MVCC data access example
MVCC data access example

When we execute a DELETE or UPDATE statement, Postgres does the following:

  • Delete command: Postgres does not remove the tuples physically; it marks the existing tuple as DEAD.
  • Update command: The update statement is similar to a DELETE plus an INSERT statement.  Therefore, it marks the existing tuple as DEAD and inserts a new tuple.

If you have frequent DML( INSERT, UPDATE, DELETE) statements, these DEAD tuples can take unnecessary space. To free up space, we have to run the following maintenance processes:

  • Remove the dead tuples
  • Remove index tuples pointing to the dead tuples
  • Update the statistics

With this knowledge of how a SQL statement is executed and the maintenance requirements, we can discuss the VACUUM process.

Cleaning up dead tuples with a VACUUM in Postgres

Postgres uses the VACUUM maintenance process for removing DEAD tuples. It reclaims space occupied by dead tuples for reuse in the further data insertion.

The VACUUM process can run concurrently with other DML transactions because it does not put an exclusive lock on the table. It carries out following operations for removing dead tuples:

  1. Postgres scans all pages of a target table and builds a list of dead tuples. It freezes the old tuples if required.
  2. It removes the index tuples pointing to the dead tuples by referencing to the dead tuple list.
  3. It updates the statistics as well as the system catalog for the target table after the VACCUM processing. It also updates the FSM (Free Space Map) and VM(Visibility Map).

Postgres VACUUM Example

I'll demonstrate the impact of VACUUM by creating an example table, deleting some values, and then running a VACUUM command. In this example I also use the pg_freespacecamp extension to monitor improvements in space utilization.

Create a table with a auto-generated series of data

create table SampleTable(id1 int, id2 int);

insert into
	SampleTable
values (
	generate_series(1,100000),
    generate_series(1,100000)
);

Measure free space usage using pg_freespacemap

create extension pg_freespacemap;
 
SELECT
  count(*) as npages,
  round(100 * avg(avail) / 8192, 2) as avg_fs_ratio
FROM
  pg_freespace(
    'SampleTable');
SQL Editor: Arctype

The table we created uses up 443 pages of space on the hard drive, and because all of the data was added sequentially, it has a perfect free space ratio of 0.

Now I'll delete every record with a value greater that 100. But if we re-run the command above, the number of pages and free space ratio don't change.

delete from SampleTable where id1 > 10;

To remove the unused pages, we have to run a VACUUM command.

VACUUM sampletable;
SQL Editor: Arctype

Now when we track the space usage we can see the number of used pages has gone down from 443 to 1! But the ratio of free to used space on this page has also increased from 0 to 49%. We can return this space to the operating system with a VACUUM FULL .

What does VACUUM FULL do?

Vacuum full diagram
Vacuum full diagram

The VACUUM process removes DEAD tuples for future usage, but it does not return the space to the operating system.

Therefore, if you perform a bulk data deletion or updates, you might be using too much storage due to space occupied by these DEAD tuples. The VACUUM FULL process returns the space to the operating system, as seen in the picture below.  It does the following tasks.

  1. VACUUM FULL process obtains an exclusive lock on the table.
  2. It creates a new empty storage table file.
  3. Copy the live tuples to the new table storage.
  4. Removes the old table file and frees the storage.
  5. It rebuilds all associated table indexes, updates the system catalogs and statistics.

Let's see how running the VACUUM FULL command impacts our previous sample table:

vacuum full SampleTable;

SELECT
  count(*) as npages,
  round(100 * avg(avail) / 8192, 2) as avg_fs_ratio
FROM
  pg_freespace(
    'SampleTable');
SQL Editor: Arctype

The free space ratio is now down from almost 50% to 0.

Postgres VACUUM Performance

VACUUM cleaning is costly because it needs to scan all pages of a target table. If you have a large table with million rows, it can be harmful to your database resources. To preserve resources, Postgres uses the Visibility Map.  Each table in Postgres has a VM that determines whether the page in the table has dead tuples.  If the page does not have a dead tuple, the vacuum processing skips the page.

For example, in the below image, we have a table with four pages. Two pages have of DEAD tuples.  The visibility map uses a bitmap that defines dead tuples on a specific page.

  • Bit 0: No dead tuples on the page, therefore skip VACCUM processing.
  • Bit 1: Page consists of dead tuples; therefore, VACCUM that specific page.
Postgres visibility map to track dead tuples
Postgres visibility map to track dead tuples

Postgres Autovacuum Daemon

Postgres automates the VACCUM processing using auto vacuum daemon. By default, it runs every 1 minute. When the VACCUM wakes up, it invokes three workers. These workers do the VACCUM processing on the target tables.

You can query pg_settings to check various configurations for the autovacuum process in Postgres:

select name,
	setting,
    category,
    short_desc
from pg_settings 
where name like '%autovacuum%'
Postgres autovacuum configuration. SQL Editor: Arctype

How to modify autovacuum for a specific table in Postgres

Automatically cleaning and reindexing the database every minute might not be optimal if you have millions of rows.  Therefore, we can configure table level. If you specify a table-level configuration, it bypasses the global setting.

For example, in the below query, we set autovacuum for SampleTable2 if it has more than 100 DEAD tuples.

ALTER TABLE SampleTable2 SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 100)

VACUUM vs VACUUM FULL

As we know, full vacuum process reclaims space to the operating system. However, FULL VACUUM requires an exclusive lock on the table for its processing, and it blocks all other transactions.  In the below table, we can summaries the difference between these processes.

Vacuum vs Vacuum Full
Vacuum vs Vacuum Full

Conclusion

In this article, we covered:

  • How Postgres implements delete and update statements
  • Using VACUUM to remove DEAD tuples
  • Using VACUUM FULL to return space back to the OS
  • Configuring auto vacuum for individual tables

These concepts are useful for reducing the server costs of databases while maintaining high availability. With the queries shown in the above article you can inspect the internals of your database and its memory consumption.

If you want a SQL editor with an intuitive interface and easy data visualizations, try Arctype today.

The Collaborative SQL Editor

Spread the word

Keep reading