pg_sparse
Sparse Vector Similarity Search Inside Postgres
Install
trunk install pg_sparse
Overview
pg_sparse
is pgvector
for sparse vectors. It enables the storage and retrieval of sparse vectors within Postgres.
Generated by sparse embedding models like SPLADE, sparse vectors are different from dense vectors because they contain significantly more entries, most of which are zero. Storing raw sparse vectors in Postgres and calculating distances between them incurs extremely high storage and performance costs. pg_sparse
solves this by compressing the sparse vectors and constructing an HNSW graph over the compressed vector representations.
pg_sparse
is a fork of pgvector
and is supported on all versions supported by the PostgreSQL Global Development Group, which includes PostgreSQL 12+.
Installation
From ParadeDB
The easiest way to use the extension is to run the ParadeDB Dockerfile:
docker run \
-e POSTGRES_USER=<user> \
-e POSTGRES_PASSWORD=<password> \
-e POSTGRES_DB=<dbname> \
-p 5432:5432 \
-d \
paradedb/paradedb:latest
This will spin up a Postgres instance with pg_sparse
preinstalled.
From Self-Hosted PostgreSQL
If you are self-hosting Postgres and would like to use the extension within your existing Postgres, follow these steps:
Debian/Ubuntu
We provide pre-built binaries for Debian-based Linux for PostgreSQL 16. You can download the latest version for your architecture from the releases page.
ParadeDB does not currently collect telemetry from self-hosted pg_sparse
.
macOS and Windows
We don't suggest running production workloads on macOS or Windows. As a result, we don't provide prebuilt binaries for these platforms. If you are running Postgres on macOS or Windows and want to install pg_sparse
, please follow the development instructions, followed by make install
. This will build the extension from source and install it in your Postgres instance.
You can then create the extension in your database by running:
CREATE EXTENSION svector;
Note: If you are using a managed Postgres service like Amazon RDS, you will not be able to install pg_sparse
until the Postgres service explicitly supports it.
Usage
Getting Started
Enable the extension (do this once in each database where you want to use it):
CREATE EXTENSION svector;
Create a sparse vector column with 4 dimensions:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));
Insert vectors:
INSERT INTO items (embedding) VALUES ('[1,2,0,3]'), ('[0,4,0,6]');
Get the nearest neighbors by L2 distance:
SELECT * FROM items ORDER BY embedding <-> '[0,1,0,2]' LIMIT 5;
Also supports inner product (<#>
) and cosine distance (<=>
).
Note: <#>
returns the negative inner product since Postgres only supports ASC
order index scans on operators
Storing
Create a new table with a sparse vector column:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(3));
Or add a sparse vector column to an existing table:
ALTER TABLE items ADD COLUMN embedding svector(3);
Insert sparse vectors:
INSERT INTO items (embedding) VALUES ('[0,1,2,3]'), ('[4,5,0,6]');
Upsert sparse vectors:
INSERT INTO items (id, embedding) VALUES (1, '[0,1,2,3]'), (2, '[4,5,0,6]')
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
Update sparse vectors:
UPDATE items SET embedding = '[0,1,2,3]' WHERE id = 1;
Delete sparse vectors:
DELETE FROM items WHERE id = 1;
Querying
Get the nearest neighbors to a sparse vector:
SELECT * FROM items ORDER BY embedding <-> '[3,0,1,2]' LIMIT 5;
Get the nearest neighbors to a row:
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
Get rows within a certain distance:
SELECT * FROM items WHERE embedding <-> '[3,0,1,2]' < 5;
Note: Combine with ORDER BY
and LIMIT
to use an index.
Distances
Get the distance:
SELECT embedding <-> '[3,0,1,2]' AS distance FROM items;
For inner product, multiply by -1 (since <#>
returns the negative inner product):
SELECT (embedding <#> '[3,0.1,2]') * -1 AS inner_product FROM items;
For cosine similarity, use 1 - cosine distance:
SELECT 1 - (embedding <=> '[3,0,1,2]') AS cosine_similarity FROM items;
HNSW Index
By default, pg_sparse
performs exact nearest neighbor search, which provides perfect recall.
You can add an HNSW index to use approximate nearest neighbor search, which trades some recall for speed. Unlike typical indexes, you will see different results for queries after adding an approximate index.
Add an index for each distance function you want to use.
L2 distance:
CREATE INDEX ON items USING shnsw (embedding svector_l2_ops);
Inner product:
CREATE INDEX ON items USING shnsw (embedding svector_ip_ops);
Cosine distance:
CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);
Sparse vectors with up to 2,000 non-zero dimensions can be indexed.
Index Options
Specify HNSW parameters:
m
- the max number of connections per layer (16 by default)ef_construction
- the size of the dynamic candidate list for constructing the graph (64 by default)
CREATE INDEX ON items USING shnsw (embedding svector_l2_ops) WITH (m = 16, ef_construction = 64);
Query Options
Specify the size of the dynamic candidate list for search (40 by default):
SET shnsw.ef_search = 100;
A higher value provides better recall at the cost of speed.
Use SET LOCAL
inside a transaction to set it for a single query:
BEGIN;
SET LOCAL shnsw.ef_search = 100;
SELECT ...
COMMIT;
Indexing Progress
Check indexing progress with Postgres 12+:
SELECT phase, round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS "%" FROM pg_stat_progress_create_index;
The phases for HNSW are:
initializing
loading tuples
Filtering
There are a few ways to index nearest neighbor queries with a WHERE
clause:
SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Create an index on one or more of the WHERE
columns for exact search:
CREATE INDEX ON items (category_id);
Or a partial index on the svector column for approximate search:
CREATE INDEX ON items USING shnsw (embedding svectorl2_ops) WITH (lists = 100)
WHERE (category_id = 123);
Use partitioning for approximate search on many different values of the WHERE
columns:
CREATE TABLE items (embedding svector(3), category_id int) PARTITION BY LIST(category_id);
Performance
Use EXPLAIN ANALYZE
to debug performance.
EXPLAIN ANALYZE SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Please refer to the documentation for a more thorough overview of pg_sparse
's query support.
Development
Prerequisites
Before develping the extension, ensure that you have make
installed, via clang
or gcc
.
If you are on macOS and using Postgres.app, you'll first need to add the pg_config
binary to your path:
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"
Then, compile and install via:
make
make install
Running the Extension
First, start a local Postgres instance:
# macOS
brew services start postgresql
# Linux
sudo systemctl start postgresql
And connect to it via psql
:
psql
You can create databases and tables, and drop them as needed. Inside Postgres, create the extension by running:
CREATE EXTENSION svector;
Now, you have access to all the extension functions.
Modifying the Extension
If you make changes to the extension code, follow these steps to update it:
- Recompile the extension:
make
make install
- Recreate the extension to load the latest changes:
DROP EXTENSION svector;
CREATE EXTENSION svector;
Testing
To run the integration and unit test suite, simply run:
./test/runtests.sh -p threaded
This will create a temporary database, and run the tests in /test/sql
against it. To add a new test, simply add a new .sql
file to /test/sql
and a corresponding .out
file to /test/expected
for the expected output, and it will automatically get picked up by the test suite.
Note: the bash script takes arguments and allows you to run tests either sequentially or in parallel. For more info run ./test/runtests.sh -h
License
pg_sparse
is licensed under the PostgreSQL License, in accordance with pgvector
's license.