Benchmark Results

Comprehensive performance comparison of brinicle with vector databases and in-process ANN libraries

Benchmark Setup

We cover two kinds of comparisons:

  1. Vector databases tested as services over HTTP: Qdrant, Weaviate, Milvus, Chroma
  2. In-process ANN libraries imported directly: FAISS, hnswlib

These are different deployment models: the database results include server and network overhead, while the in-process results do not. In every run we first build the index, then execute the full query set 10 times and report the average search latency and recall@10.
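The measurement loop can be sketched as follows. `build` and `search` are hypothetical callables standing in for whichever system is under test (an HTTP client or a direct library call); they are not part of any real API here:

```python
import statistics
import time

def benchmark(build, search, queries, runs=10):
    """Build once, then time `search` over every query `runs` times.

    `build` and `search` are placeholders for the system under test;
    swap in the real client or library calls.
    """
    build()
    per_run_avg_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        for q in queries:
            search(q)
        elapsed = time.perf_counter() - start
        # Average per-query latency for this run, in milliseconds.
        per_run_avg_ms.append(elapsed / len(queries) * 1000.0)
    # Reported latency is the mean across all runs.
    return statistics.mean(per_run_avg_ms)
```

Recall@10 is averaged over the same runs in the same way.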

Environment

  • Host OS: Ubuntu 25.10
  • CPU: Intel Core i7-13650HX (20 cores)
  • RAM: 32 GiB
  • Storage: NVMe SSD
  • Docker: 29.1.3
  • Storage driver: overlay2

Datasets and Distance

  • Datasets are downloaded directly from ann-benchmarks.com with no preprocessing.
  • Distance metric is L2 across all systems.
  • Parameters are fixed (M=16, ef_construction=200); ef_search varies only where explicitly swept.

Recall@K

Recall is computed as the average overlap between the predicted top-K and the ground-truth top-K.

```python
import numpy as np

def compute_recalls(pred_ids: np.ndarray, gt_top: np.ndarray, K: int):
    """Average overlap between predicted and ground-truth top-K ids."""
    nq = gt_top.shape[0]
    out = {}
    hits = 0
    for i in range(nq):
        a = pred_ids[i, :K]
        b = gt_top[i, :K]
        hits += len(set(a.tolist()) & set(b.tolist()))
    out[f"recall@{K}"] = hits / (nq * K)
    return out
```

Important detail about configuration: We did not tune database configs or do a parameter search. We kept parameters fixed to reduce degrees of freedom and to keep the comparison reproducible.
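As a sanity check, ground truth for a toy dataset can be derived by brute force and fed through the same overlap computation. Everything below (the data shapes, the `l2_top_k` helper) is illustrative, not part of the benchmark code:

```python
import numpy as np

def l2_top_k(base, queries, k):
    # Exact (brute-force) squared-L2 neighbors, used as ground truth.
    d = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d, axis=1)[:, :k]

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 32)).astype(np.float32)
queries = rng.standard_normal((10, 32)).astype(np.float32)
gt = l2_top_k(base, queries, 10)

# Feeding the ground truth back in as "predictions" must give recall 1.0.
hits = sum(len(set(p) & set(g)) for p, g in zip(gt.tolist(), gt.tolist()))
print(hits / gt.size)  # → 1.0
```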

Result 1: Extreme RAM Caps (256MB) Are a Hard Boundary for Many DBs

This is the core motivation for brinicle. In a constrained container (MNIST, 256MB RAM, 1 CPU), only two systems completed the run. All failures were verified as OOMKilled by Docker.

MNIST (60K, 784 dim), 256MB RAM, 1 CPU

| System | Outcome |
|---|---|
| brinicle | PASS |
| chroma | PASS |
| qdrant | OOMKilled |
| weaviate | OOMKilled |
| milvus | OOMKilled |

This table answers a practical question: if you want vector search in a very small container, which systems can actually complete an index build and serve queries without being killed by the memory limit?
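The caps and the OOMKilled verification use standard Docker flags; the image and container names below are placeholders:

```shell
# Run a service under a hard cap. --memory-swap equal to --memory
# disables swapping past the limit, so the cap is truly hard.
docker run -d --name bench --memory=256m --memory-swap=256m --cpus=1 \
    some/vector-db:latest

# After a failed run, confirm the container was killed by the kernel
# OOM killer rather than exiting on an application error:
docker inspect --format '{{.State.OOMKilled}}' bench
```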

Result 2: Latency and Memory Profiles under Constrained DB Deployments

Below are snapshots from the constrained HTTP service benchmark runs.

Fashion-MNIST (60K, 784 dim), 512MB RAM, 2 CPU

| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| qdrant | 14.827 | 0.9979 | 1.579 | 1.173 | 3.441 | 6.947 | 690.385 | 512 | 282.8 |
| chroma | 29.299 | 0.9978 | 3.085 | 3.08 | 4.646 | 5.205 | 328.2 | 512 | 512.01 |
| weaviate | 45.387 | 0.99786 | 3.559 | 3.314 | 5.104 | 10.33 | 281.49 | 512.1 | 512.03 |
| brinicle | 144.223 | 0.99782 | 0.927 | 0.797 | 1.705 | 2.266 | 1086.64 | 469.85 | 285.2 |
| milvus | 18.617 | 0.99886 | 2.672 | 2.665 | 3.636 | 4.513 | 376.09 | 1024 | 887.67 |

Note: Milvus required 1024MB in this setup because it was OOMKilled at 512MB.

MNIST (60K, 784 dim), 256MB RAM, 1 CPU

| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| brinicle | 147.435 | 0.99818 | 1.018 | 0.865 | 1.943 | 2.452 | 991.01 | 256 | 224.95 |
| chroma | 49.928 | 0.99807 | 2.009 | 1.741 | 3.667 | 4.539 | 505.67 | 256.2 | 255.89 |

Note: Only brinicle and chroma survived.

SIFT (1M, 128 dim), 4096MB RAM, 2 CPU

| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| weaviate | 937.592 | 0.96276 | 2.42 | 2.39 | 2.966 | 3.207 | 413.2 | 4096 | 3594.8 |
| qdrant | 14.115 | 0.9945 | 4.57 | 3.046 | 10.294 | 24.532 | 599.22 | 1986.83 | 1480.99 |
| milvus | 204.41 | 0.98432 | 2.463 | 2.449 | 3.142 | 5.681 | 406.54 | 2732.63 | 2445.63 |
| chroma | 228.988 | 0.96352 | 2.942 | 3 | 4.222 | 4.67 | 341.23 | 1705.38 | 1705.62 |
| brinicle | 387.065 | 0.96993 | 0.838 | 0.746 | 1.477 | 2.036 | 1204.12 | 1552.76 | 982.94 |

Result 3: Recall versus Latency Tradeoff (ef_search sweep)

Higher recall usually costs more latency. To make that tradeoff explicit, we ran a sweep:

  • Dataset: SIFT (1M, 128 dim)
  • Resources: 4GB RAM, 2 CPU
  • Distance: L2
  • Fixed: M=16, ef_construction=200
  • Sweep: ef_search [16, 32, 64, 128, 256]
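A sweep driver can be sketched as follows. `index` is a hypothetical handle exposing `set_ef` and `search`, standing in for whichever client or library is under test:

```python
import time

def sweep_ef(index, queries, gt, efs=(16, 32, 64, 128, 256), k=10):
    """For each ef_search value, measure recall@k and P95 latency.

    `index` is a placeholder object; `gt` holds the ground-truth id
    lists for each query.
    """
    rows = []
    for ef in efs:
        index.set_ef(ef)
        latencies, hits = [], 0
        for q, truth in zip(queries, gt):
            t0 = time.perf_counter()
            ids = index.search(q, k)
            latencies.append((time.perf_counter() - t0) * 1000.0)
            hits += len(set(ids) & set(truth))
        latencies.sort()
        rows.append({
            "ef_search": ef,
            f"recall@{k}": hits / (len(queries) * k),
            # Nearest-rank P95 over the sorted per-query latencies.
            "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
        })
    return rows
```

Each row gives one point on the recall-latency curve below.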

P95 Latency vs Recall

[Figure: latency vs recall curve, P95]

P99 Latency vs Recall

[Figure: latency vs recall curve, P99]

Memory Usage Comparison

[Figure: memory usage comparison]

Result 4: In-Process Libraries (FAISS, hnswlib, brinicle)

How does brinicle compare when used the same way you would use FAISS or hnswlib: inside one process, with no network overhead?

GIST (1M, 960 dim)

| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| faiss | 872.273 | 0.7727 | 0.335 | 0.343 | 0.408 | 0.445 | 2981.32 |
| hnswlib | 1032.707 | 0.7562 | 0.408 | 0.397 | 0.47 | 0.499 | 2450.41 |
| brinicle | 1138.479 | 0.7702 | 0.494 | 0.45 | 0.848 | 1.549 | 2023.6 |

SIFT (1M, 128 dim)

| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| faiss | 237.282 | 0.96999 | 0.092 | 0.095 | 0.115 | 0.127 | 10857.43 |
| hnswlib | 241.301 | 0.96364 | 0.093 | 0.092 | 0.11 | 0.12 | 10711.86 |
| brinicle | 243.75 | 0.96989 | 0.103 | 0.101 | 0.138 | 0.171 | 9730.65 |

MNIST (60K, 784 dim)

| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| brinicle | 20.754 | 0.99818 | 0.161 | 0.159 | 0.221 | 0.255 | 6208.06 |
| faiss | 19.798 | 0.99806 | 0.142 | 0.139 | 0.192 | 0.221 | 7062.9 |
| hnswlib | 21.474 | 0.99808 | 0.177 | 0.176 | 0.239 | 0.273 | 5663.67 |

Fashion-MNIST (60K, 784 dim)

| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| hnswlib | 17.8 | 0.99778 | 0.157 | 0.151 | 0.202 | 0.234 | 6362.37 |
| brinicle | 17.064 | 0.99782 | 0.147 | 0.144 | 0.201 | 0.237 | 6817.81 |
| faiss | 16.787 | 0.9977 | 0.125 | 0.125 | 0.167 | 0.194 | 7976.62 |

What to Take Away from These Results

  1. Survivability under hard RAM caps matters. In the 256MB MNIST run, multiple database containers were OOMKilled. brinicle completed build and search.
  2. Tail latency is a primary metric for search systems. Average latency is useful, but P95 and P99 are where disk-first and constrained environments tend to show problems. That is why the tables and plots emphasize percentiles.
  3. The recall-latency curve is the most informative comparison. The ef_search sweep shows how each system behaves as you push toward higher recall.
  4. brinicle is positioned as an engine. If you want a full vector database, you should use one. If you want the index layer with a small memory footprint, brinicle is designed for that.
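The percentile summaries reported in the tables can be reproduced from raw per-query latencies; a minimal sketch using numpy (note that `np.percentile`'s default linear interpolation may differ slightly from what the harness itself computes):

```python
import numpy as np

def latency_summary(latencies_ms):
    # Summarize raw per-query latencies (in ms) the way the tables do.
    arr = np.asarray(latencies_ms, dtype=np.float64)
    return {
        "avg_ms": float(arr.mean()),
        "p50_ms": float(np.percentile(arr, 50)),
        "p95_ms": float(np.percentile(arr, 95)),
        "p99_ms": float(np.percentile(arr, 99)),
        # Single-threaded throughput bound implied by the mean latency.
        "qps": 1000.0 / float(arr.mean()),
    }
```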

Reproducing the Benchmarks