Overview

pg_sparse is pgvector for sparse vectors. It enables the storage and retrieval of sparse vectors within Postgres.

Generated by sparse embedding models like SPLADE, sparse vectors are different from dense vectors because they contain significantly more entries, most of which are zero. Storing raw sparse vectors in Postgres and calculating distances between them incurs extremely high storage and performance costs. pg_sparse solves this by compressing the sparse vectors and constructing an HNSW graph over the compressed vector representations.

pg_sparse is a fork of pgvector and is supported on all versions supported by the PostgreSQL Global Development Group, which includes PostgreSQL 12+.

Installation

From ParadeDB

The easiest way to use the extension is to run the ParadeDB Dockerfile:

docker run \
  -e POSTGRES_USER=<user> \
  -e POSTGRES_PASSWORD=<password> \
  -e POSTGRES_DB=<dbname> \
  -p 5432:5432 \
  -d \
  paradedb/paradedb:latest

This will spin up a Postgres instance with pg_sparse preinstalled.

From Self-Hosted PostgreSQL

If you are self-hosting Postgres and would like to use the extension within your existing Postgres, follow these steps:

Debian/Ubuntu

We provide pre-built binaries for Debian-based Linux for PostgreSQL 16. You can download the latest version for your architecture from the releases page.

ParadeDB does not currently collect telemetry from self-hosted pg_sparse.

macOS and Windows

We don't suggest running production workloads on macOS or Windows. As a result, we don't provide prebuilt binaries for these platforms. If you are running Postgres on macOS or Windows and want to install pg_sparse, please follow the development instructions, followed by make install. This will build the extension from source and install it in your Postgres instance.

You can then create the extension in your database by running:

CREATE EXTENSION svector;

Note: If you are using a managed Postgres service like Amazon RDS, you will not be able to install pg_sparse until the Postgres service explicitly supports it.

Usage

Getting Started

Enable the extension (do this once in each database where you want to use it):

CREATE EXTENSION svector;

Create a sparse vector column with 4 dimensions:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));

Insert vectors:

INSERT INTO items (embedding) VALUES ('[1,2,0,3]'), ('[0,4,0,6]');

Get the nearest neighbors by L2 distance:

SELECT * FROM items ORDER BY embedding <-> '[0,1,0,2]' LIMIT 5;

Also supports inner product (<#>) and cosine distance (<=>).

Note: <#> returns the negative inner product since Postgres only supports ASC order index scans on operators

Storing

Create a new table with a sparse vector column:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(3));

Or add a sparse vector column to an existing table:

ALTER TABLE items ADD COLUMN embedding svector(3);

Insert sparse vectors:

INSERT INTO items (embedding) VALUES ('[0,1,2,3]'), ('[4,5,0,6]');

Upsert sparse vectors:

INSERT INTO items (id, embedding) VALUES (1, '[0,1,2,3]'), (2, '[4,5,0,6]')
    ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update sparse vectors:

UPDATE items SET embedding = '[0,1,2,3]' WHERE id = 1;

Delete sparse vectors:

DELETE FROM items WHERE id = 1;

Querying

Get the nearest neighbors to a sparse vector:

SELECT * FROM items ORDER BY embedding <-> '[3,0,1,2]' LIMIT 5;

Get the nearest neighbors to a row:

SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;

Get rows within a certain distance:

SELECT * FROM items WHERE embedding <-> '[3,0,1,2]' < 5;

Note: Combine with ORDER BY and LIMIT to use an index.

Distances

Get the distance:

SELECT embedding <-> '[3,0,1,2]' AS distance FROM items;

For inner product, multiply by -1 (since <#> returns the negative inner product):

SELECT (embedding <#> '[3,0.1,2]') * -1 AS inner_product FROM items;

For cosine similarity, use 1 - cosine distance:

SELECT 1 - (embedding <=> '[3,0,1,2]') AS cosine_similarity FROM items;

HNSW Index

By default, pg_sparse performs exact nearest neighbor search, which provides perfect recall.

You can add an HNSW index to use approximate nearest neighbor search, which trades some recall for speed. Unlike typical indexes, you will see different results for queries after adding an approximate index.

Add an index for each distance function you want to use.

L2 distance:

CREATE INDEX ON items USING shnsw (embedding svector_l2_ops);

Inner product:

CREATE INDEX ON items USING shnsw (embedding svector_ip_ops);

Cosine distance:

CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);

Sparse vectors with up to 2,000 non-zero dimensions can be indexed.

Index Options

Specify HNSW parameters:

m - the max number of connections per layer (16 by default)
ef_construction - the size of the dynamic candidate list for constructing the graph (64 by default)

CREATE INDEX ON items USING shnsw (embedding svector_l2_ops) WITH (m = 16, ef_construction = 64);

Query Options

Specify the size of the dynamic candidate list for search (40 by default):

SET shnsw.ef_search = 100;

A higher value provides better recall at the cost of speed.

Use SET LOCAL inside a transaction to set it for a single query:

BEGIN;
SET LOCAL shnsw.ef_search = 100;
SELECT ...
COMMIT;

Indexing Progress

Check indexing progress with Postgres 12+:

SELECT phase, round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS "%" FROM pg_stat_progress_create_index;

The phases for HNSW are:

initializing
loading tuples

Filtering

There are a few ways to index nearest neighbor queries with a WHERE clause:

SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Create an index on one or more of the WHERE columns for exact search:

CREATE INDEX ON items (category_id);

Or a partial index on the svector column for approximate search:

CREATE INDEX ON items USING shnsw (embedding svectorl2_ops) WITH (lists = 100)
    WHERE (category_id = 123);

Use partitioning for approximate search on many different values of the WHERE columns:

CREATE TABLE items (embedding svector(3), category_id int) PARTITION BY LIST(category_id);

Performance

Use EXPLAIN ANALYZE to debug performance.

EXPLAIN ANALYZE SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Please refer to the documentation for a more thorough overview of pg_sparse's query support.

Development

Prerequisites

Before develping the extension, ensure that you have make installed, via clang or gcc.

If you are on macOS and using Postgres.app, you'll first need to add the pg_config binary to your path:

export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"

Then, compile and install via:

make
make install

Running the Extension

First, start a local Postgres instance:

# macOS
brew services start postgresql

# Linux
sudo systemctl start postgresql

And connect to it via psql:

psql

You can create databases and tables, and drop them as needed. Inside Postgres, create the extension by running:

CREATE EXTENSION svector;

Now, you have access to all the extension functions.

Modifying the Extension

If you make changes to the extension code, follow these steps to update it:

Recompile the extension:

make
make install

Recreate the extension to load the latest changes:

DROP EXTENSION svector;
CREATE EXTENSION svector;

Testing

To run the integration and unit test suite, simply run:

./test/runtests.sh -p threaded

This will create a temporary database, and run the tests in /test/sql against it. To add a new test, simply add a new .sql file to /test/sql and a corresponding .out file to /test/expected for the expected output, and it will automatically get picked up by the test suite.

Note: the bash script takes arguments and allows you to run tests either sequentially or in parallel. For more info run ./test/runtests.sh -h

License

pg_sparse is licensed under the PostgreSQL License, in accordance with pgvector's license.

pg_sparse