Benchmark Results
Comprehensive performance comparison of brinicle with vector databases and in-process ANN libraries
Benchmark Setup
We cover two kinds of comparisons:
- Vector databases tested as services over HTTP: Qdrant, Weaviate, Milvus, Chroma
- In-process ANN libraries imported directly: FAISS, hnswlib
These are different deployment models: the database results include server and network overhead, while the in-process results do not. For each system, we first build the index, then run the full query set 10 times and report the average search latency and recall@10.
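The measurement loop can be sketched as follows. `index.search` here is a stand-in for whichever client is under test; the real harness API may differ:

```python
import time
import numpy as np

def run_benchmark(index, queries, gt_top, k=10, repeats=10):
    """Run the full query set `repeats` times; average latency and recall@k.

    `index` is assumed to expose a `search(vector, k) -> ids` method,
    which is an illustrative simplification of the real clients.
    """
    latencies_ms = []
    recalls = []
    for _ in range(repeats):
        preds = np.empty((len(queries), k), dtype=np.int64)
        t0 = time.perf_counter()
        for i, q in enumerate(queries):
            preds[i] = index.search(q, k)
        elapsed = time.perf_counter() - t0
        # Per-query latency in milliseconds, averaged over this repeat.
        latencies_ms.append(elapsed * 1000 / len(queries))
        hits = sum(
            len(set(preds[i, :k].tolist()) & set(gt_top[i, :k].tolist()))
            for i in range(len(queries))
        )
        recalls.append(hits / (len(queries) * k))
    return float(np.mean(latencies_ms)), float(np.mean(recalls))
```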
Environment
- Host OS: Ubuntu 25.10
- CPU: Intel Core i7-13650HX (20 cores)
- RAM: 32 GiB
- Storage: NVMe SSD
- Docker: 29.1.3
- Storage driver: overlay2
Datasets and Distance
- Datasets are downloaded directly from ann-benchmarks.com with no preprocessing.
- Distance metric is L2 across all systems.
- Parameters are fixed (M=16, ef_construction=200, ef_search varies where explicitly swept).
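In code, the fixed parameters look like this (hnswlib-style naming is assumed here; each client maps them onto its own configuration keys):

```python
# Fixed HNSW build parameters shared by every system in these runs.
HNSW_PARAMS = {
    "space": "l2",           # L2 distance for all datasets
    "M": 16,                 # graph connectivity
    "ef_construction": 200,  # build-time candidate list size
}

# ef_search is only varied in the explicit sweep (Result 3).
EF_SEARCH_SWEEP = [16, 32, 64, 128, 256]
```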
Recall@K
Recall@K is computed as the average fraction of the ground-truth top-K IDs that appear in the predicted top-K.
```python
import numpy as np

def compute_recalls(pred_ids: np.ndarray, gt_top: np.ndarray, K: int):
    """Average overlap between predicted and ground-truth top-K IDs."""
    nq = gt_top.shape[0]
    out = {}
    hits = 0
    for i in range(nq):
        a = pred_ids[i, :K]
        b = gt_top[i, :K]
        hits += len(set(a.tolist()) & set(b.tolist()))
    out[f"recall@{K}"] = hits / (nq * K)
    return out
```

Important detail about configuration: we did not tune database configs or run a parameter search. We kept parameters fixed to reduce degrees of freedom and to keep the comparison reproducible.
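The ground-truth top-K comes precomputed in the ann-benchmarks files; for reference, an exact L2 top-K can be reproduced with a few lines of numpy:

```python
import numpy as np

def exact_topk_l2(base, queries, k=10):
    """Brute-force L2 ground truth (illustrative; ann-benchmarks ships
    this precomputed, so the harness does not recompute it)."""
    # Squared L2 distances via ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2.
    d2 = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ base.T
        + (base ** 2).sum(axis=1)
    )
    return np.argsort(d2, axis=1)[:, :k]
```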
Result 1: Extreme RAM Caps (256MB) are a Hard Boundary for Many DBs
This is the core motivation for brinicle. In a constrained container (MNIST, 256MB RAM, 1 CPU), the outcomes were as follows; all failures were verified as OOMKilled by Docker.
MNIST (60K, 784 dim), 256MB RAM, 1 CPU
| System | Outcome |
|---|---|
| brinicle | PASS |
| chroma | PASS |
| qdrant | OOMKilled |
| weaviate | OOMKilled |
| milvus | OOMKilled |
This table answers a practical question: if you want vector search in a very small container, which systems can actually complete an index build and serve queries without being killed by the memory limit?
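A run of this shape can be reproduced with Docker resource flags. The image name, container name, and port below are illustrative, not the harness's exact invocation:

```shell
# Hard-cap memory (and swap, so the cap is real) and pin to 1 CPU.
docker run -d --name qdrant-bench \
  --memory=256m --memory-swap=256m --cpus=1 \
  -p 6333:6333 qdrant/qdrant

# After the build attempt, check whether the kernel OOM killer fired.
docker inspect --format '{{.State.OOMKilled}}' qdrant-bench
```

Setting `--memory-swap` equal to `--memory` disables swap for the container, which is what makes the RAM cap a hard boundary rather than a soft one.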
Result 2: Latency and Memory Profiles under Constrained DB Deployments
Below are snapshots from the constrained HTTP service benchmark runs.
Fashion-MNIST (60K, 784 dim), 512MB RAM, 2 CPU
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| qdrant | 14.827 | 0.9979 | 1.579 | 1.173 | 3.441 | 6.947 | 690.38 | 512 | 282.8 |
| chroma | 29.299 | 0.9978 | 3.085 | 3.08 | 4.646 | 5.205 | 328.2 | 512 | 512.01 |
| weaviate | 45.387 | 0.99786 | 3.559 | 3.314 | 5.104 | 10.33 | 281.49 | 512.1 | 512.03 |
| brinicle | 144.223 | 0.99782 | 0.927 | 0.797 | 1.705 | 2.266 | 1086.64 | 469.85 | 285.2 |
| milvus | 18.617 | 0.99886 | 2.672 | 2.665 | 3.636 | 4.513 | 376.09 | 1024 | 887.67 |
Note: Milvus required 1024MB in this setup because it was OOMKilled at 512MB.
MNIST (60K, 784 dim), 256MB RAM, 1 CPU
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| brinicle | 147.435 | 0.99818 | 1.018 | 0.865 | 1.943 | 2.452 | 991.01 | 256 | 224.95 |
| chroma | 49.928 | 0.99807 | 2.009 | 1.741 | 3.667 | 4.539 | 505.67 | 256.2 | 255.89 |
Note: Only brinicle and chroma survived.
SIFT (1M, 128 dim), 4096MB RAM, 2 CPU
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Build peak (MB) | Search peak (MB) |
|---|---|---|---|---|---|---|---|---|---|
| weaviate | 937.592 | 0.96276 | 2.42 | 2.39 | 2.966 | 3.207 | 413.2 | 4096 | 3594.8 |
| qdrant | 14.115 | 0.9945 | 4.57 | 3.046 | 10.294 | 24.532 | 599.22 | 1986.83 | 1480.99 |
| milvus | 204.41 | 0.98432 | 2.463 | 2.449 | 3.142 | 5.681 | 406.54 | 2732.63 | 2445.63 |
| chroma | 228.988 | 0.96352 | 2.942 | 3 | 4.222 | 4.67 | 341.23 | 1705.38 | 1705.62 |
| brinicle | 387.065 | 0.96993 | 0.838 | 0.746 | 1.477 | 2.036 | 1204.12 | 1552.76 | 982.94 |
Result 3: Recall versus Latency Tradeoff (ef_search sweep)
Higher recall usually costs more latency. To make that tradeoff explicit, we ran a sweep:
- Dataset: SIFT (1M, 128 dim)
- Resources: 4GB RAM, 2 CPU
- Distance: L2
- Fixed: M=16, ef_construction=200
- Sweep: ef_search [16, 32, 64, 128, 256]
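The sweep loop is straightforward; the sketch below assumes an hnswlib-style `set_ef`/`search` interface, while the HTTP clients in the harness use each database's equivalent knob:

```python
import time
import numpy as np

EF_SEARCH_VALUES = [16, 32, 64, 128, 256]

def sweep_ef_search(index, queries, gt_top, k=10):
    """For each ef_search value, record recall@10 and tail latency."""
    results = []
    for ef in EF_SEARCH_VALUES:
        index.set_ef(ef)  # hnswlib-style; hypothetical for other clients
        lat_ms = []
        hits = 0
        for i, q in enumerate(queries):
            t0 = time.perf_counter()
            ids = index.search(q, k)
            lat_ms.append((time.perf_counter() - t0) * 1000)
            hits += len(set(ids) & set(gt_top[i, :k].tolist()))
        results.append({
            "ef_search": ef,
            "recall@10": hits / (len(queries) * k),
            "p95_ms": float(np.percentile(lat_ms, 95)),
            "p99_ms": float(np.percentile(lat_ms, 99)),
        })
    return results
```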
P95 Latency vs Recall

P99 Latency vs Recall

Memory Usage Comparison

Result 4: In-Process Libraries (FAISS, hnswlib, brinicle)
How does brinicle compare when used the same way you would use FAISS or hnswlib: inside one process, with no network overhead?
GIST (1M, 960 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| faiss | 872.273 | 0.7727 | 0.335 | 0.343 | 0.408 | 0.445 | 2981.32 |
| hnswlib | 1032.707 | 0.7562 | 0.408 | 0.397 | 0.47 | 0.499 | 2450.41 |
| brinicle | 1138.479 | 0.7702 | 0.494 | 0.45 | 0.848 | 1.549 | 2023.6 |
SIFT (1M, 128 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| faiss | 237.282 | 0.96999 | 0.092 | 0.095 | 0.115 | 0.127 | 10857.43 |
| hnswlib | 241.301 | 0.96364 | 0.093 | 0.092 | 0.11 | 0.12 | 10711.86 |
| brinicle | 243.75 | 0.96989 | 0.103 | 0.101 | 0.138 | 0.171 | 9730.65 |
MNIST (60K, 784 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| brinicle | 20.754 | 0.99818 | 0.161 | 0.159 | 0.221 | 0.255 | 6208.06 |
| faiss | 19.798 | 0.99806 | 0.142 | 0.139 | 0.192 | 0.221 | 7062.9 |
| hnswlib | 21.474 | 0.99808 | 0.177 | 0.176 | 0.239 | 0.273 | 5663.67 |
Fashion-MNIST (60K, 784 dim)
| System | Build (s) | Recall@10 | Avg (ms) | P50 (ms) | P95 (ms) | P99 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| hnswlib | 17.8 | 0.99778 | 0.157 | 0.151 | 0.202 | 0.234 | 6362.37 |
| brinicle | 17.064 | 0.99782 | 0.147 | 0.144 | 0.201 | 0.237 | 6817.81 |
| faiss | 16.787 | 0.9977 | 0.125 | 0.125 | 0.167 | 0.194 | 7976.62 |
What to Take Away from These Results
- Survivability under hard RAM caps matters. In the 256MB MNIST run, multiple database containers were OOMKilled. brinicle completed build and search.
- Tail latency is a primary metric for search systems. Average latency is useful, but p95 and p99 are where disk-first and constrained environments tend to show problems. That is why the tables and plots emphasize percentiles.
- The recall-latency curve is the most informative comparison. The ef_search sweep shows how each system behaves as you push toward higher recall.
- brinicle is positioned as an engine. If you want a full vector database, you should use one. If you want the index layer with a small memory footprint, brinicle is designed for that.
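For reference, the percentile and QPS columns in the tables above can be derived from raw per-query latencies like this (a sketch, not the harness's exact code; QPS here assumes sequential query execution):

```python
import numpy as np

def summarize_latencies(latencies_ms):
    """Collapse raw per-query latencies (ms) into the table columns."""
    lat = np.asarray(latencies_ms, dtype=np.float64)
    total_s = lat.sum() / 1000.0  # total time spent answering queries
    return {
        "avg_ms": float(lat.mean()),
        "p50_ms": float(np.percentile(lat, 50)),
        "p95_ms": float(np.percentile(lat, 95)),
        "p99_ms": float(np.percentile(lat, 99)),
        # Sequential throughput; concurrent clients would measure higher.
        "qps": len(lat) / total_s,
    }
```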
Reproducing the Benchmarks
The benchmark harness is public:
brinicle is here: