Benchmark Results
Comprehensive performance comparison of brinicle with vector databases and in-process ANN libraries
Benchmark Setup
We cover two kinds of comparisons:
- Vector databases tested as services over HTTP: Qdrant, Weaviate, Milvus, Chroma
- In-process ANN libraries imported directly: FAISS, hnswlib
These are different deployment models: the database results include server and network overhead, while the in-process results do not. For each system, we first build the index, then run the full query set 10 times and report the average search latency and recall@10.
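The measurement loop can be sketched as follows. `index.search` here is a stand-in for whichever client is under test; the real harness API may differ:

```python
import time
import numpy as np

def run_benchmark(index, queries, gt_top, k=10, repeats=10):
    """Run the full query set `repeats` times; average latency and recall@k.

    `index` is assumed to expose a `search(vector, k) -> ids` method,
    which is an illustrative simplification of the real clients.
    """
    latencies_ms = []
    recalls = []
    for _ in range(repeats):
        preds = np.empty((len(queries), k), dtype=np.int64)
        t0 = time.perf_counter()
        for i, q in enumerate(queries):
            preds[i] = index.search(q, k)
        elapsed = time.perf_counter() - t0
        # Per-query latency in milliseconds, averaged over this repeat.
        latencies_ms.append(elapsed * 1000 / len(queries))
        hits = sum(
            len(set(preds[i, :k].tolist()) & set(gt_top[i, :k].tolist()))
            for i in range(len(queries))
        )
        recalls.append(hits / (len(queries) * k))
    return float(np.mean(latencies_ms)), float(np.mean(recalls))
```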
Environment
- Host OS: Ubuntu 25.10
- CPU: Intel Core i7-13650HX (20 cores)
- RAM: 32 GiB
- Storage: NVMe SSD
- Docker: 29.1.3
- Storage driver: overlay2
Datasets and Distance
- Datasets are downloaded directly from ann-benchmarks.com with no preprocessing.
- Distance metric is L2 across all systems.
- Parameters are fixed (M=16, ef_construction=200, ef_search varies where explicitly swept).
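In code, the fixed parameters look like this (hnswlib-style naming is assumed here; each client maps them onto its own configuration keys):

```python
# Fixed HNSW build parameters shared by every system in these runs.
HNSW_PARAMS = {
    "space": "l2",           # L2 distance for all datasets
    "M": 16,                 # graph connectivity
    "ef_construction": 200,  # build-time candidate list size
}

# ef_search is only varied in the explicit sweep (Result 3).
EF_SEARCH_SWEEP = [16, 32, 64, 128, 256]
```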
Recall@K
Recall@K is computed as the average fraction of the ground-truth top-K IDs that appear in the predicted top-K.
```python
import numpy as np

def compute_recalls(pred_ids: np.ndarray, gt_top: np.ndarray, K: int):
    """Average overlap between predicted and ground-truth top-K IDs."""
    nq = gt_top.shape[0]
    out = {}
    hits = 0
    for i in range(nq):
        a = pred_ids[i, :K]
        b = gt_top[i, :K]
        hits += len(set(a.tolist()) & set(b.tolist()))
    out[f"recall@{K}"] = hits / (nq * K)
    return out
```

Important detail about configuration: we did not tune database configs or run a parameter search. We kept parameters fixed to reduce degrees of freedom and to keep the comparison reproducible.
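The ground-truth top-K comes precomputed in the ann-benchmarks files; for reference, an exact L2 top-K can be reproduced with a few lines of numpy:

```python
import numpy as np

def exact_topk_l2(base, queries, k=10):
    """Brute-force L2 ground truth (illustrative; ann-benchmarks ships
    this precomputed, so the harness does not recompute it)."""
    # Squared L2 distances via ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2.
    d2 = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ base.T
        + (base ** 2).sum(axis=1)
    )
    return np.argsort(d2, axis=1)[:, :k]
```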
Result 1: Extreme RAM Caps (256MB) are a Hard Boundary for Many DBs
This is the core motivation for brinicle. In a constrained container (MNIST, 256MB RAM, 1 CPU), the outcomes were as follows; all failures were verified as OOMKilled by Docker.
MNIST (60K, 784 dim), 256MB RAM, 1 CPU
| System | Outcome |
|---|---|
| brinicle | PASS |
| chroma | PASS |
| qdrant | OOMKilled |
| weaviate | OOMKilled |
| milvus | OOMKilled |
This table answers a practical question: if you want vector search in a very small container, which systems can actually complete an index build and serve queries without being killed by the memory limit?
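A run of this shape can be reproduced with Docker resource flags. The image name, container name, and port below are illustrative, not the harness's exact invocation:

```shell
# Hard-cap memory (and swap, so the cap is real) and pin to 1 CPU.
docker run -d --name qdrant-bench \
  --memory=256m --memory-swap=256m --cpus=1 \
  -p 6333:6333 qdrant/qdrant

# After the build attempt, check whether the kernel OOM killer fired.
docker inspect --format '{{.State.OOMKilled}}' qdrant-bench
```

Setting `--memory-swap` equal to `--memory` disables swap for the container, which is what makes the RAM cap a hard boundary rather than a soft one.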
Result 2: Latency and Memory Profiles under Constrained DB Deployments
Below are snapshots from the constrained HTTP service benchmark runs.
Fashion-MNIST (60K, 784 dim), 512MB RAM, 2 CPU
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| qdrant | 14.827 | 0.9979 | 1.579 | 1.173 | 3.441 | 6.947 | 690.38 | 512 | 282.8 |
| chroma | 29.299 | 0.9978 | 3.085 | 3.08 | 4.646 | 5.205 | 328.2 | 512 | 512.01 |
| weaviate | 45.387 | 0.99786 | 3.559 | 3.314 | 5.104 | 10.33 | 281.49 | 512.1 | 512.03 |
| brinicle | 144.223 | 0.99782 | 0.927 | 0.797 | 1.705 | 2.266 | 1086.64 | 469.85 | 285.2 |
| milvus | 18.617 | 0.99886 | 2.672 | 2.665 | 3.636 | 4.513 | 376.09 | 1024 | 887.67 |
Note: Milvus required 1024MB in this setup because it was OOMKilled at 512MB.
MNIST (60K, 784 dim), 256MB RAM, 1 CPU
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| brinicle | 147.435 | 0.99818 | 1.018 | 0.865 | 1.943 | 2.452 | 991.01 | 256 | 224.95 |
| chroma | 49.928 | 0.99807 | 2.009 | 1.741 | 3.667 | 4.539 | 505.67 | 256.2 | 255.89 |
Note: Only brinicle and chroma survived.
SIFT (1M, 128 dim), 4096MB RAM, 2 CPU
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| weaviate | 937.592 | 0.96276 | 2.42 | 2.39 | 2.966 | 3.207 | 413.2 | 4096 | 3594.8 |
| qdrant | 14.115 | 0.9945 | 4.57 | 3.046 | 10.294 | 24.532 | 599.22 | 1986.83 | 1480.99 |
| milvus | 204.41 | 0.98432 | 2.463 | 2.449 | 3.142 | 5.681 | 406.54 | 2732.63 | 2445.63 |
| chroma | 228.988 | 0.96352 | 2.942 | 3 | 4.222 | 4.67 | 341.23 | 1705.38 | 1705.62 |
| brinicle | 387.065 | 0.96993 | 0.838 | 0.746 | 1.477 | 2.036 | 1204.12 | 1552.76 | 982.94 |
Result 3: Recall versus Latency Tradeoff (ef_search sweep)
Higher recall usually costs more latency. To make that tradeoff explicit, we ran a sweep:
- Dataset: SIFT (1M, 128 dim)
- Resources: 4GB RAM, 2 CPU
- Distance: L2
- Fixed: M=16, ef_construction=200
- Sweep: ef_search [16, 32, 64, 128, 256]
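The sweep loop is straightforward; the sketch below assumes an hnswlib-style `set_ef`/`search` interface, while the HTTP clients in the harness use each database's equivalent knob:

```python
import time
import numpy as np

EF_SEARCH_VALUES = [16, 32, 64, 128, 256]

def sweep_ef_search(index, queries, gt_top, k=10):
    """For each ef_search value, record recall@10 and tail latency."""
    results = []
    for ef in EF_SEARCH_VALUES:
        index.set_ef(ef)  # hnswlib-style; hypothetical for other clients
        lat_ms = []
        hits = 0
        for i, q in enumerate(queries):
            t0 = time.perf_counter()
            ids = index.search(q, k)
            lat_ms.append((time.perf_counter() - t0) * 1000)
            hits += len(set(ids) & set(gt_top[i, :k].tolist()))
        results.append({
            "ef_search": ef,
            "recall@10": hits / (len(queries) * k),
            "p95_ms": float(np.percentile(lat_ms, 95)),
            "p99_ms": float(np.percentile(lat_ms, 99)),
        })
    return results
```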
P95 Latency vs Recall

P99 Latency vs Recall

Memory Usage Comparison

Result 4: In-Process Libraries (FAISS, hnswlib, brinicle)
How does brinicle compare when used the same way you would use FAISS or hnswlib: inside one process, with no network overhead?
GIST (1M, 960 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| faiss | 872.273 | 0.7727 | 0.335 | 0.343 | 0.408 | 0.445 | 2981.32 |
| hnswlib | 1032.707 | 0.7562 | 0.408 | 0.397 | 0.47 | 0.499 | 2450.41 |
| brinicle | 1138.479 | 0.7702 | 0.494 | 0.45 | 0.848 | 1.549 | 2023.6 |
SIFT (1M, 128 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| faiss | 237.282 | 0.96999 | 0.092 | 0.095 | 0.115 | 0.127 | 10857.43 |
| hnswlib | 241.301 | 0.96364 | 0.093 | 0.092 | 0.11 | 0.12 | 10711.86 |
| brinicle | 243.75 | 0.96989 | 0.103 | 0.101 | 0.138 | 0.171 | 9730.65 |
MNIST (60K, 784 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| brinicle | 20.754 | 0.99818 | 0.161 | 0.159 | 0.221 | 0.255 | 6208.06 |
| faiss | 19.798 | 0.99806 | 0.142 | 0.139 | 0.192 | 0.221 | 7062.9 |
| hnswlib | 21.474 | 0.99808 | 0.177 | 0.176 | 0.239 | 0.273 | 5663.67 |
Fashion-MNIST (60K, 784 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| hnswlib | 17.8 | 0.99778 | 0.157 | 0.151 | 0.202 | 0.234 | 6362.37 |
| brinicle | 17.064 | 0.99782 | 0.147 | 0.144 | 0.201 | 0.237 | 6817.81 |
| faiss | 16.787 | 0.9977 | 0.125 | 0.125 | 0.167 | 0.194 | 7976.62 |
What to Take Away from These Results
- Survivability under hard RAM caps matters. In the 256MB MNIST run, multiple database containers were OOMKilled. brinicle completed build and search.
- Tail latency is a primary metric for search systems. Average latency is useful, but p95 and p99 are where disk-first and constrained environments tend to show problems. That is why the tables and plots emphasize percentiles.
- The recall-latency curve is the most informative comparison. The ef_search sweep shows how each system behaves as you push toward higher recall.
- brinicle is positioned as an engine. If you want a full vector database, you should use one. If you want the index layer with a small memory footprint, brinicle is designed for that.
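For reference, the percentile and QPS columns in the tables above can be derived from raw per-query latencies like this (a sketch, not the harness's exact code; QPS here assumes sequential query execution):

```python
import numpy as np

def summarize_latencies(latencies_ms):
    """Collapse raw per-query latencies (ms) into the table columns."""
    lat = np.asarray(latencies_ms, dtype=np.float64)
    total_s = lat.sum() / 1000.0  # total time spent answering queries
    return {
        "avg_ms": float(lat.mean()),
        "p50_ms": float(np.percentile(lat, 50)),
        "p95_ms": float(np.percentile(lat, 95)),
        "p99_ms": float(np.percentile(lat, 99)),
        # Sequential throughput; concurrent clients would measure higher.
        "qps": len(lat) / total_s,
    }
```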
Reproducing the Benchmarks
The benchmark harness is public:
brinicle is here: