Hybrid Search Benchmark

Single-graph hybrid retrieval for product search: evaluating Brinicle against Weaviate, Meilisearch, Typesense, and OpenSearch

Abstract

Hybrid search is commonly implemented by combining lexical retrieval over an inverted index with semantic retrieval over a vector index, followed by score fusion or reranking. This paper studies an alternative formulation: representing lexical and semantic product-search signals inside a single HNSW graph. Brinicle encodes product-title tokens and dense title embeddings into one searchable representation. A custom distance function combines symbolic title matching and vector similarity during graph traversal, allowing lexical, semantic, and hybrid retrieval behavior to be expressed through the same graph structure. We evaluate this approach on WANDS and US-filtered Amazon ESCI using title-based hybrid product retrieval. Brinicle is compared with Weaviate, Meilisearch, Typesense, and OpenSearch under shared resource limits and the same precomputed embedding model. Across both datasets, Brinicle achieves competitive retrieval quality while reducing search memory usage and P99 latency relative to the compared systems. These results indicate that, for title-based product retrieval, hybrid search can be modeled as a single-graph retrieval problem rather than as post-hoc fusion over separate lexical and vector retrieval structures.

1. Introduction

Hybrid retrieval is commonly implemented as a coordination problem between two search systems. A lexical index retrieves documents through exact or near-exact term matching, while a vector index retrieves documents through dense semantic similarity. The final ranking is then produced by score fusion, reranking, or another combination strategy.

This architecture has become a practical default for modern search applications. Lexical retrieval preserves exact terms, identifiers, numbers, and product-specific fragments. Vector retrieval improves tolerance to vocabulary mismatch and natural-language variation. Used together, they often provide better retrieval behavior than either method alone.

The architectural cost is that hybrid retrieval usually requires multiple retrieval structures. A system may need to store and operate an inverted index, a vector index, and a fusion layer with its own scoring assumptions. This increases memory usage, tuning surface area, and operational complexity.

This paper studies a different formulation of hybrid retrieval: representing lexical and semantic signals inside a single HNSW graph.

1.1 Hybrid Retrieval as a Graph-Distance Problem

Brinicle treats hybrid product retrieval as a distance-function problem over one encoded representation. Each item is encoded with symbolic title evidence and, in hybrid mode, a dense title embedding. Queries are encoded in the same representation family. Search is then performed through one HNSW graph using a custom distance function that combines title-token agreement and vector similarity during graph traversal.

The retrieval mode is controlled by the distance configuration. A lexical configuration emphasizes title-token matching. A vector configuration emphasizes embedding similarity. A hybrid configuration combines both signals inside the same graph search process. This differs from the common architecture in which lexical and vector retrieval produce separate candidate sets that are merged afterward. In Brinicle, candidate exploration itself is hybrid-aware because the graph traversal uses the combined distance.

1.2 Product Search as a Motivating Task

Product retrieval is a useful setting for evaluating this idea because it requires both exactness and tolerance. A product query may contain short fragments that carry precise meaning:

iphone 15 256gb

rtx 4060

sony wh-1000xm5

m2 macbook air

In these cases, numbers, model identifiers, and capacities are not incidental text. They are part of the user’s intent. A semantically related result with the wrong model or capacity may be commercially incorrect.

At the same time, product titles are often long and noisy. They may contain brands, colors, editions, years, bundle descriptions, packaging terms, seller formatting, and marketing phrases. Users rarely type the full title. A retrieval system therefore needs to tolerate partial queries and vocabulary mismatch without losing exact symbolic evidence. This makes product search a natural hybrid retrieval task. Lexical matching helps preserve exact constraints. Dense embeddings help recover semantically related titles when surface forms differ.

1.3 Brinicle’s Approach

Brinicle encodes product-title tokens and dense embeddings into a single HNSW-searchable representation. The graph is built over this representation, and a custom scorer defines how the symbolic and semantic components contribute to distance.

At a high level, the method consists of three parts:

encoded item representation

+ single HNSW graph

+ hybrid-aware distance function

The encoded representation stores title-token evidence and, when enabled, a dense vector. The HNSW graph provides approximate nearest-neighbor traversal. The distance function determines whether the search behaves lexically, semantically, or as a hybrid of both.

This paper focuses on title-based hybrid retrieval. The evaluated configuration uses product titles and precomputed title embeddings for both documents and queries. Brinicle’s broader item-search representation can include structured fields such as category, subcategory, and attributes, but the benchmark isolates the title + vector retrieval setting.

1.4 Evaluation Overview

We evaluate Brinicle on WANDS and US-filtered Amazon ESCI, comparing it with Weaviate, Meilisearch, Typesense, and OpenSearch. All systems are tested under shared CPU and memory limits using the same precomputed embedding model.

The evaluation reports ranking quality, search latency, throughput, memory usage, and build cost. The main result is a system-level trade-off: Brinicle achieves competitive retrieval quality while reducing search memory usage and P99 latency in the tested setup. The results support the architectural claim that hybrid product retrieval can be expressed through one graph and one distance function, rather than requiring post-hoc fusion over separate lexical and vector retrieval structures.

1.5 Contributions

This paper makes four contributions.

  1. It presents a single-graph formulation for hybrid product retrieval, where lexical and semantic evidence are represented inside one HNSW-searchable object.
  2. It describes Brinicle’s encoded item representation and hybrid-aware distance function, including symbolic title matching, dense-vector similarity, and the alpha mechanism used to control semantic bias.
  3. It evaluates the approach on two product-search benchmarks against four established hybrid search systems under shared resource limits.
  4. It reports the resulting trade-off between retrieval quality, memory usage, and search latency, showing that a single-graph design can provide competitive hybrid retrieval behavior with a smaller search-time resource footprint.

2. One-Graph Hybrid Retrieval

Hybrid retrieval is often described as a fusion problem: lexical retrieval produces one ranked list, semantic retrieval produces another, and a combination layer merges the two into a final ranking. Brinicle uses a different formulation. It treats hybrid retrieval as graph traversal over a representation that contains both symbolic and semantic evidence.

In this formulation, each item is encoded into one HNSW-searchable object. The object contains lexical title evidence and, in hybrid mode, a dense embedding. The HNSW graph is built over these encoded objects, and retrieval is controlled by a distance function that can read and combine the different regions of the representation.

At a high level, the retrieval pipeline is:

  1. document title + optional dense embedding
  2. encoded item representation
  3. single HNSW graph
  4. hybrid-aware distance function
  5. ranked results

The key design choice is that lexical and semantic evidence participate in the same graph traversal. Candidate exploration is therefore influenced by the combined distance, rather than by a post-processing step over independently retrieved lexical and vector candidates.

2.1 Retrieval as Distance over Structured Representations

Brinicle represents each item as a structured numeric object rather than as an ordinary dense vector alone. The representation contains enough information for the distance function to interpret different components separately.

For title-based hybrid retrieval, the relevant components are:

title-token evidence

+ dense title embedding

The query is encoded in the same representation family:

query-token evidence

+ dense query embedding

The distance function then compares the query and document through both symbolic and semantic components. Title-token overlap contributes lexical evidence. Vector similarity contributes semantic evidence. The final distance is a weighted combination of these signals.

This makes the HNSW graph a retrieval structure over hybrid-search objects. The graph organizes items according to the distance function used during construction, and the same family of distance functions is used during search.

2.2 Unified Candidate Exploration

In a two-index hybrid system, candidate generation is usually split across retrieval structures. A lexical index explores term-based candidates, while a vector index explores embedding-based candidates. Fusion happens after those candidate sets have already been produced.

Brinicle moves the hybrid decision earlier. Since graph traversal uses a distance function that includes both title-token matching and vector similarity, lexical and semantic evidence affect candidate exploration directly. This changes the role of the hybrid scorer. It is not only a final ranking function. It also helps define local neighborhoods in the graph and influences which candidates are reached during approximate search.

The result is a single candidate-exploration structure:

  1. encoded query
  2. HNSW traversal using hybrid distance
  3. candidate set
  4. ranked results

This is the central architectural distinction. Hybrid behavior is part of the graph-search process itself.
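To illustrate how a combined distance can steer candidate exploration, the following is a minimal sketch of best-first search over a single-layer neighbor graph. It is not Brinicle's implementation and omits HNSW's layered structure; the function name greedy_search, the toy graph, and the numeric distances are hypothetical and chosen only for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def greedy_search(
    graph: Dict[str, List[str]],
    distance: Callable[[str], float],
    entry: str,
    ef: int,
) -> List[Tuple[float, str]]:
    """Best-first search over a neighbor graph; returns up to ef closest nodes."""
    visited = {entry}
    d0 = distance(entry)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap via negation: worst kept result first
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # the nearest open candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = distance(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-neg, n) for neg, n in results)


# Toy graph; the values stand in for hybrid distances D(q, d) to one fixed query.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
hybrid = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.1}
result = greedy_search(graph, lambda n: hybrid[n], entry="a", ef=2)
print(result)
```

Because the distance callback is the only place where lexical and semantic evidence enter, swapping in a different distance changes which nodes are expanded, not only how the final list is ordered.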

2.3 Retrieval Modes as Distance Configurations

Brinicle’s retrieval modes are expressed through distance configuration. The same encoded representation can support lexical, vector, or hybrid retrieval by changing the active components and their weights:

Retrieval mode      Title distance   Vector distance
Lexical retrieval   active           inactive
Vector retrieval    inactive         active
Hybrid retrieval    active           active

This gives the system a single conceptual model:

same representation family

same graph structure

different distance configurations

In lexical mode, retrieval is driven by symbolic title evidence. In vector mode, retrieval is driven by embedding similarity. In hybrid mode, both signals contribute to the distance used during graph traversal. The benchmark in this paper focuses on the hybrid configuration, where product titles and title embeddings are both active.

2.4 Product-Title Hybrid Retrieval

Product titles provide a useful test case for one-graph hybrid retrieval because they combine short exact identifiers with longer noisy descriptions. A title may contain model numbers, capacities, color names, brand names, technical variants, and marketing text. Some tokens are highly specific and must be matched carefully. Other parts of the title provide broader semantic context.

Brinicle’s representation preserves title-token evidence explicitly while also attaching dense semantic vectors. This allows the distance function to reward exact symbolic matches and semantic proximity within the same graph search. For example, a query such as “iphone 15 256gb” benefits from exact matching on iphone, 15, and 256gb, while vector similarity can still help when relevant product titles use different surrounding language.

The same principle applies to product queries involving model identifiers, abbreviated names, or partial descriptions. The graph does not need to choose between symbolic and semantic retrieval as separate execution paths. Both signals are available to the distance function.

2.5 Summary

One-graph hybrid retrieval can be summarized as follows:

encoded item =

lexical title evidence

+ optional dense vector

retrieval structure =

one HNSW graph over encoded items

retrieval behavior =

distance configuration over lexical and vector components

This formulation makes hybrid product search a graph-distance problem. The next section describes how Brinicle encodes items and queries so that the distance function can compare symbolic and semantic evidence inside one representation.

3. Encoding Items and Queries

One-graph hybrid retrieval requires documents and queries to be represented in a form that can be compared by a single distance function. Brinicle uses a structured numeric representation for this purpose. The representation is compact enough to be indexed by HNSW, while preserving separate regions for lexical, structured, and semantic evidence.

In the benchmarked configuration, each document is represented by its product title and a dense embedding of that title. Each query is represented by query text and a dense query embedding. Both are encoded into the same representation family, allowing the distance function to compare symbolic and semantic evidence during graph traversal.

3.1 Encoded Object Layout

Each encoded object begins with a fixed-size header followed by a variable-length payload. The header stores metadata needed by the distance function:

[version, title_count, attr_pair_count, category_id, subcategory_id, vector_dim, payload...]

The payload stores the searchable content:

title token ids

+ optional attribute key/value ids

+ optional dense vector

The header allows the scorer to parse the representation without external metadata. It can determine how many title tokens are present, whether structured fields exist, whether a dense vector is attached, and where each region begins.

For title-based hybrid retrieval, the active regions are:

title token ids

+ dense title embedding

This layout keeps the representation numeric while preserving internal structure. The scorer can interpret title evidence and vector evidence separately instead of treating the object as an opaque dense vector.
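The header-plus-payload layout can be sketched as follows. This is an illustrative sketch under stated assumptions, not Brinicle's actual binary format: storing every field as a float in one flat list, the default value 0 for absent category fields, and the names encode, parse, and Parsed are all hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple

# Header fields, in order: version, title_count, attr_pair_count,
# category_id, subcategory_id, vector_dim
HEADER_LEN = 6


class Parsed(NamedTuple):
    title_ids: List[int]
    attr_pairs: List[Tuple[int, int]]
    category_id: int
    subcategory_id: int
    vector: List[float]


def encode(
    title_ids: Sequence[int],
    attr_pairs: Sequence[Tuple[int, int]] = (),
    category_id: int = 0,      # 0 is a placeholder for "not set" (assumption)
    subcategory_id: int = 0,
    vector: Sequence[float] = (),
    version: int = 1,
) -> List[float]:
    """Flatten one item into header + title ids + attribute pairs + vector."""
    flat_attrs = [x for pair in sorted(attr_pairs) for x in pair]
    header = [version, len(title_ids), len(attr_pairs),
              category_id, subcategory_id, len(vector)]
    ints = header + sorted(title_ids) + flat_attrs
    return [float(x) for x in ints] + [float(x) for x in vector]


def parse(obj: Sequence[float]) -> Parsed:
    """Recover each region using only the counts stored in the header."""
    _version, n_title, n_attr, cat, subcat, dim = (int(x) for x in obj[:HEADER_LEN])
    t0 = HEADER_LEN          # start of title-token ids
    a0 = t0 + n_title        # start of attribute key/value ids
    v0 = a0 + 2 * n_attr     # start of the dense vector
    flat = [int(x) for x in obj[a0:v0]]
    return Parsed(
        title_ids=[int(x) for x in obj[t0:a0]],
        attr_pairs=list(zip(flat[0::2], flat[1::2])),
        category_id=cat,
        subcategory_id=subcat,
        vector=list(obj[v0:v0 + dim]),
    )


obj = encode([42, 7], vector=[0.1, 0.2, 0.3])
print(parse(obj))
```

The point of the sketch is that the region offsets are derived from the header alone, so a scorer needs no external metadata to locate the title tokens or the vector.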

3.2 Title-Token Encoding

Product titles are converted into sorted token identifiers. The title encoding pipeline is:

  1. title text
  2. normalization
  3. isolated tokenization
  4. token-id extraction
  5. special-token filtering
  6. term-frequency packing
  7. sorted title-token representation

The tokenizer preserves short product-specific fragments such as numbers, model names, and compact identifiers. These fragments are important in product retrieval because small textual differences can change the target item. Examples include:

4060    256gb    13 inch    a54    m2    wh-1000xm5

Dense embeddings can place related products near each other, but exact fragments still need to remain available to the scorer. Brinicle therefore stores symbolic title evidence explicitly as part of the indexed representation.
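The first steps of the pipeline can be sketched with a simple regular-expression tokenizer. This is a stand-in, not Brinicle's tokenizer: the pattern, the lowercase normalization, and the on-the-fly vocabulary are assumptions chosen to show how compact fragments such as model identifiers can survive tokenization as single symbols.

```python
import re
from typing import Dict, List

# Runs of letters/digits, optionally joined by hyphens ("wh-1000xm5" stays whole).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(title: str) -> List[str]:
    """Lowercase the title and split it into product-friendly tokens."""
    return TOKEN_RE.findall(title.lower())


def encode_title(title: str, vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids (assigning new ids as needed) and sort them."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokenize(title)]
    return sorted(ids)


vocab: Dict[str, int] = {}
print(tokenize("Sony WH-1000XM5 Wireless Headphones"))
print(encode_title("NVIDIA RTX 4060 8GB", vocab))
```

Repeated tokens are kept as repeated ids at this stage so that a later step can turn the counts into a frequency signal.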

3.3 Term-Frequency Packing

Title tokens include a small saturated term-frequency signal. Conceptually, each stored title token combines a token id with a compact frequency component:

packed_title_token = token_id + small_tf_component

The frequency component allows repeated title terms to contribute additional evidence without making repetition dominate the score. This is useful for product titles, where repeated words may reflect emphasis, formatting, or seller-side noise rather than true relevance. The term-frequency signal is intentionally bounded. A repeated token can matter slightly more than a single occurrence, but excessive repetition is saturated by the scorer.
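One way to combine a token id with a bounded frequency component is to reserve a few low bits for the capped count. The bit layout below is purely hypothetical (the paper does not specify how Brinicle packs the two values); TF_BITS, pack, unpack, and pack_title are illustrative names.

```python
from collections import Counter
from typing import Iterable, List, Tuple

TF_BITS = 2             # hypothetical: two low bits hold the capped count
TF_MAX = 1 << TF_BITS   # counts above 4 are stored as 4


def pack(token_id: int, tf: int) -> int:
    """Store the token id in the high bits and (capped tf - 1) in the low bits."""
    capped = min(tf, TF_MAX)
    return (token_id << TF_BITS) | (capped - 1)


def unpack(packed: int) -> Tuple[int, int]:
    """Return (token_id, capped term frequency)."""
    return packed >> TF_BITS, (packed & (TF_MAX - 1)) + 1


def pack_title(token_ids: Iterable[int]) -> List[int]:
    """Collapse repeated ids into one packed value each, sorted by token id."""
    counts = Counter(token_ids)
    return sorted(pack(t, c) for t, c in counts.items())


print(pack_title([7, 7, 3]))
print(unpack(pack(5, 9)))
```

Keeping the id in the high bits means that sorting packed values also sorts by token id, which preserves the sorted title-token representation.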

3.4 Dense Vector Attachment

In hybrid mode, Brinicle appends a dense embedding to the lexical representation. The benchmark uses title embeddings for documents and query embeddings for queries.

A document is encoded as:

header + title-token representation + dense title embedding

A query is encoded as:

header + query-token representation + dense query embedding

The vector region is parsed using the vector_dim value stored in the header. This allows the same distance function to combine token-based title matching with vector similarity. The resulting object is still a single HNSW-searchable representation, but the scorer can evaluate its regions separately.

3.5 Optional Structured Fields

Brinicle’s general item representation can also encode structured fields:

category

subcategory

attributes

Category and subcategory are stored as stable identifiers. Attributes are stored as sorted key/value id pairs:

[key_id_1, value_id_1, key_id_2, value_id_2, ...]

This allows structured evidence to participate in the same distance function as title tokens and dense vectors. For example, a product item may include title evidence, category identity, and attribute matches inside one encoded object. The experiments in this paper use the title + vector configuration, but the same representation layout supports richer item-search configurations.

3.6 Shared Representation Family

Documents and queries are encoded into the same representation family. This is what allows HNSW traversal to operate over hybrid-search objects directly.

A document may contain:

product-title tokens + product-title embedding

A query may contain:

query tokens + query embedding

The distance function compares the two encoded objects by reading their corresponding regions:

title-token agreement

+ optional structured-field agreement

+ vector similarity

This shared representation is central to the one-graph design. The graph stores encoded items, and the query enters the graph as a comparable encoded object.

3.7 Encoding Summary

Brinicle’s item/query representation can be summarized as:

encoded object =

  header

  + lexical title evidence

  + optional structured evidence

  + optional dense vector

The representation is numeric, but not unstructured. Its internal layout allows the distance function to combine symbolic and semantic evidence during graph traversal. The next section defines the distance function used to compare these encoded objects.

4. Distance Function

Brinicle’s encoded representation becomes searchable through a custom distance function. The distance function reads the structured regions of the encoded query and document, computes component-wise distances, and combines them into a single value used by HNSW during graph construction and search.

For the general item-search representation, the distance has the form:

D(q, d) =

  w_title · D_title(q, d)

  + w_attr · D_attr(q, d)

  + w_category · D_category(q, d)

  + w_subcat · D_subcat(q, d)

  + w_vector · D_vector(q, d)

where q is the encoded query, d is the encoded document, and each component measures one region of the representation. The benchmarked hybrid configuration uses the title and vector components:

D(q, d) = w_title · D_title(q, d) + w_vector · D_vector(q, d)

Structured-field components are part of the broader scorer, but the main experiments isolate title-based hybrid retrieval.

4.1 Title Distance

The title component measures symbolic agreement between query tokens and document-title tokens. Product queries are usually shorter than product titles, so the title scorer uses an asymmetric overlap measure. Brinicle uses a Tversky-style similarity:

S_title(q, d) = matched / (matched + α_title · only_query + β_title · extra_document)

D_title(q, d) = 1 - S_title(q, d)

Here:

matched = weighted title-token matches

only_query = query tokens missing from the document title

extra_document = document-title tokens not present in the query

The parameters α_title and β_title control the relative cost of missing query tokens and extra document tokens. This is useful for product retrieval because a relevant product title may contain all query terms plus additional descriptive text. For example, the query “iphone 15 256gb” is fully covered by the document title “Apple iPhone 15 256GB Blue Unlocked Smartphone 2023”, which also contains five additional terms. The extra document terms provide context, but missing query terms usually represent a stronger mismatch. The asymmetric title distance reflects this behavior.
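The title distance can be illustrated on the example above. The sketch below is a simplification: it counts unweighted set overlap rather than the weighted matches used by the scorer, and the function name title_distance is hypothetical. The parameter values are the build-time and search-time settings reported in Section 4.3.

```python
from typing import Iterable


def title_distance(
    query_tokens: Iterable[str],
    doc_tokens: Iterable[str],
    alpha: float,
    beta: float,
) -> float:
    """Tversky-style distance: 1 - matched / (matched + a*only_query + b*extra_doc)."""
    q, d = set(query_tokens), set(doc_tokens)
    matched = len(q & d)
    if matched == 0:
        return 1.0  # no shared tokens: maximal title distance
    only_query = len(q - d)        # query tokens missing from the title
    extra_document = len(d - q)    # title tokens not present in the query
    similarity = matched / (matched + alpha * only_query + beta * extra_document)
    return 1.0 - similarity


query = "iphone 15 256gb".split()
right = "apple iphone 15 256gb blue unlocked smartphone 2023".split()
wrong = "apple iphone 15 128gb blue".split()

print(title_distance(query, right, alpha=1.0, beta=1.0))    # symmetric setting
print(title_distance(query, right, alpha=1.0, beta=0.06))   # search-time setting
print(title_distance(query, wrong, alpha=1.0, beta=0.06))   # missing "256gb"
```

Under the symmetric setting, the five extra title terms push the distance of the correct product to 0.625. Under the search-time setting the same title scores about 0.09, while the wrong-capacity title remains clearly farther away because it misses a query token.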

4.2 Term-Frequency Saturation

Title-token matches use the packed term-frequency signal described in Section 3. Repeated terms are passed through a saturation function before contributing to the title score:

tf_sat(tf) = (tf · (k1 + 1)) / (tf + k1)

The saturation limits the effect of repeated title terms. A repeated token can increase the contribution of a match, but repeated words do not scale linearly without bound. This gives the title component a controlled lexical signal: a token match is positive evidence, a repeated token is slightly stronger evidence, and further repetition adds only a diminishing, bounded contribution.
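The shape of the saturation curve is easy to check numerically. In the sketch below, k1 = 1.2 is an assumed value borrowed from common BM25 defaults; the paper does not state the value Brinicle uses.

```python
def tf_sat(tf: float, k1: float = 1.2) -> float:
    """Saturated term frequency: tf * (k1 + 1) / (tf + k1)."""
    return tf * (k1 + 1.0) / (tf + k1)


for tf in (1, 2, 3, 5, 10, 100):
    print(tf, round(tf_sat(tf), 4))
```

A single occurrence maps to exactly 1 for any k1, each further repetition adds less than the previous one, and the value never exceeds k1 + 1.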

4.3 Build-Time and Search-Time Title Configuration

Brinicle can use different title-distance settings during graph construction and query-time search.

Title Distance Configuration by Phase

Phase    α_title   β_title   Behavior
Build    1         1         Symmetric overlap
Search   1         0.06      Stronger penalty for missing query tokens

The build-time configuration shapes graph neighborhoods using balanced title overlap. The search-time configuration gives more weight to query coverage, which is appropriate for short product queries matched against longer product titles.

4.4 Vector Distance

The vector component measures semantic similarity between the query embedding and the document embedding. Brinicle uses scaled cosine distance:

D_vector(q, d) = 0.5 · (1 - cos(q, d))

The scaling maps cosine distance into a range compatible with the lexical distance components. Since cosine similarity lies in [-1, 1], the unscaled expression 1 - cos(q, d) lies in [0, 2]; multiplying by 0.5 maps it to [0, 1]. When vectors are normalized, cosine similarity can be computed through a dot product. The distance function can also use the general cosine path when normalization is not assumed.
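A minimal sketch of the scaled cosine distance, covering both the normalized dot-product path and the general path, is shown below. The function name and the handling of zero-length vectors are assumptions made for illustration.

```python
import math
from typing import Sequence


def scaled_cosine_distance(
    q: Sequence[float], d: Sequence[float], normalized: bool = False
) -> float:
    """Return 0.5 * (1 - cos(q, d)), which lies in [0, 1]."""
    dot = sum(a * b for a, b in zip(q, d))
    if not normalized:
        nq = math.sqrt(sum(a * a for a in q))
        nd = math.sqrt(sum(b * b for b in d))
        if nq == 0.0 or nd == 0.0:
            return 0.5  # assumption: treat a zero vector as orthogonal
        dot /= nq * nd
    cos = max(-1.0, min(1.0, dot))  # guard against floating-point drift
    return 0.5 * (1.0 - cos)


print(scaled_cosine_distance([3.0, 4.0], [3.0, 4.0]))                    # same direction
print(scaled_cosine_distance([1.0, 0.0], [0.0, 2.0]))                    # orthogonal
print(scaled_cosine_distance([1.0, 0.0], [-1.0, 0.0], normalized=True))  # opposite
```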

4.5 Structured-Field Distances

The general Brinicle scorer can also compare structured fields. Category and subcategory are treated as identifier matches. Attribute fields are treated as sorted key/value pairs.

For category-like identifiers, the distance is direct:

D_id(a, b) =

  0 if a or b is unknown

  0 if a = b

  field_penalty otherwise

For attributes, the scorer compares matching keys and evaluates whether their values agree:

same key + same value → no penalty

same key + different value → mismatch penalty

missing field information → neutral or soft contribution

These structured components allow category, subcategory, and attribute evidence to participate in the same distance function as title tokens and dense vectors. In the experiments reported in this paper, the active retrieval configuration uses title and vector evidence.
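The structured-field rules can be sketched as two small functions. Several details here are assumptions rather than Brinicle's behavior: the sentinel 0 for an unknown identifier, the default penalty values, treating missing attribute keys as neutral, and normalizing attribute mismatches by the number of shared keys.

```python
from typing import Sequence, Tuple

UNKNOWN = 0  # hypothetical sentinel for "identifier not provided"


def id_distance(a: int, b: int, field_penalty: float = 1.0) -> float:
    """0 if either id is unknown or the ids match; field_penalty otherwise."""
    if a == UNKNOWN or b == UNKNOWN or a == b:
        return 0.0
    return field_penalty


def attr_distance(
    q_pairs: Sequence[Tuple[int, int]],
    d_pairs: Sequence[Tuple[int, int]],
    mismatch_penalty: float = 1.0,
) -> float:
    """Penalize keys present on both sides whose values disagree."""
    d_map = dict(d_pairs)
    shared = [(k, v) for k, v in q_pairs if k in d_map]
    if not shared:
        return 0.0  # no comparable keys: neutral contribution (assumption)
    mismatches = sum(1 for k, v in shared if d_map[k] != v)
    return mismatch_penalty * mismatches / len(shared)


print(id_distance(3, 4))
print(attr_distance([(1, 10), (2, 20)], [(1, 10), (2, 21), (3, 30)]))
```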

4.6 Hybrid Weighting

The general distance function is controlled through component weights. In title-based hybrid retrieval, the active weights are w_title and w_vector.

A lexical configuration sets the vector contribution to zero:

w_title > 0, w_vector = 0

A vector configuration sets the title contribution to zero:

w_title = 0, w_vector > 0

A hybrid configuration activates both:

w_title > 0, w_vector > 0

This makes retrieval behavior a property of the distance configuration. The same encoded representation can be searched with different component weights depending on the desired retrieval mode.
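The effect of the three weight settings can be seen on two hypothetical documents. The component distances and the weight values below are invented for illustration; only the form D = w_title · D_title + w_vector · D_vector comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistanceConfig:
    w_title: float
    w_vector: float

    def combine(self, d_title: float, d_vector: float) -> float:
        """Weighted sum of the title and vector component distances."""
        return self.w_title * d_title + self.w_vector * d_vector


LEXICAL = DistanceConfig(w_title=1.0, w_vector=0.0)
VECTOR = DistanceConfig(w_title=0.0, w_vector=1.0)
HYBRID = DistanceConfig(w_title=0.5, w_vector=1.0)  # illustrative weights

# (D_title, D_vector) for two hypothetical documents
exact_model = (0.1, 0.4)      # matches the query tokens, embedding farther away
similar_product = (0.6, 0.2)  # embedding close, but tokens mismatch

for name, cfg in (("lexical", LEXICAL), ("vector", VECTOR), ("hybrid", HYBRID)):
    print(name, cfg.combine(*exact_model), cfg.combine(*similar_product))
```

With these numbers the lexical and hybrid configurations rank the exact-model document first, while the vector configuration prefers the semantically similar one, all over the same encoded representation.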

4.7 Brinicle Alpha

Brinicle uses an alpha parameter to control the balance between semantic distance and lexical correction. For 0 < p < 1, alpha p is converted into:

w_vector = 1

w_lexical = (1 - p) / p

The vector component keeps full weight, while the lexical components are scaled by w_lexical. At the boundaries:

p = 1 → vector retrieval

p = 0 → lexical retrieval

For example, when p = 0.90:

w_lexical = (1 - 0.90) / 0.90 = 0.1111

If the base title weight is 0.45, the effective title weight becomes:

0.45 · 0.1111 = 0.0500

while the vector weight remains:

w_vector = 1.0

This parameterization treats dense-vector distance as the primary semantic geometry and uses lexical evidence as a correction term. Lower alpha values increase the strength of lexical correction. Higher alpha values make retrieval more vector-oriented.
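The alpha conversion and the worked example above can be reproduced with a few lines. The function name alpha_to_weights is hypothetical, and the handling of p = 0 (dropping the vector term, since the ratio is undefined there) is an assumption about how the lexical boundary could be realized.

```python
from typing import Tuple


def alpha_to_weights(p: float, base_title_weight: float) -> Tuple[float, float]:
    """Return (effective title weight, vector weight) for alpha value p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p == 0.0:
        # Lexical boundary: (1 - p) / p is undefined, so keep only the
        # title term (assumption).
        return base_title_weight, 0.0
    w_lexical = (1.0 - p) / p
    return base_title_weight * w_lexical, 1.0


for p in (0.5, 0.90, 0.95, 1.0):
    w_title, w_vector = alpha_to_weights(p, base_title_weight=0.45)
    print(p, round(w_title, 4), w_vector)
```

At p = 0.5 the lexical scale is exactly 1, so the base title weight is used unchanged; moving p toward 1 shrinks the title weight toward zero while the vector weight stays fixed.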

4.8 Alpha and Graph Construction

In Brinicle, the distance function is used during graph construction as well as query-time search. Therefore, the selected hybrid configuration affects both neighborhood formation and query traversal.

The build process uses the configured distance function to decide how items connect inside the HNSW graph. A more lexical configuration creates neighborhoods influenced more strongly by title-token overlap. A more semantic configuration creates neighborhoods influenced more strongly by vector similarity.

The same principle applies during search: the query traverses the graph using the configured distance function, and candidates are ranked according to the resulting distances. This makes alpha part of the index configuration. In the benchmark, Brinicle indexes are built with the selected alpha value for each dataset.

4.9 Distance-Function Summary

Brinicle’s distance function combines interpretable regions of the encoded representation:

title tokens → Tversky-style symbolic distance

dense vector → scaled cosine distance

structured fields → identifier and key/value penalties

For title-based hybrid retrieval, the main distance is:

D(q, d) = w_title · D_title(q, d) + w_vector · D_vector(q, d)

This distance is used by HNSW for graph construction and search, making hybrid behavior part of candidate exploration rather than a separate fusion stage.

encoded item:

  title tokens + optional structured fields + optional dense vector

distance function:

  title Tversky distance + optional structured penalties + scaled cosine vector distance

retrieval behavior:

  lexical, vector, or hybrid depending on weights

5. Experimental Setup

The experiments evaluate title-based hybrid product retrieval. Each engine receives a product query and returns a ranked list of product identifiers from a fixed corpus. Documents are indexed using product titles and precomputed dense title embeddings. Queries are represented using query text and precomputed dense query embeddings. The benchmark compares Brinicle with four existing search systems under the same host environment, container resource limits, embedding model, indexed field, and top-k retrieval setting.

5.1 Retrieval Task

For each query, the engine receives:

query text + query embedding

The corpus contains documents represented as:

product title + product title embedding

Each engine returns the top K product identifiers. The returned identifiers are compared against the relevance judgments provided by the dataset. All experiments use top_k = 100. Metrics are reported at K = 1, 5, 10, 20, 50, 100.

5.2 Datasets

Dataset Configuration

Dataset                  Documents   Queries   Tuning Queries   Evaluation Queries   Indexed Field
WANDS                    42,994      450       30               420                  Title
Amazon ESCI, US locale   1,215,854   20,458    2,000            18,458               Title

Both datasets are evaluated using exact-match relevance only. For Amazon ESCI, only products labeled E are treated as relevant. S, C, and I labels are treated as non-relevant.

5.3 Compared Systems

Search Systems

System        Retrieval Configuration
Brinicle      Single-graph hybrid retrieval
Weaviate      Hybrid BM25/vector retrieval
Meilisearch   Hybrid keyword/vector retrieval
Typesense     Hybrid keyword/vector retrieval
OpenSearch    Hybrid BM25/vector retrieval

All systems index the same product title field and use the same precomputed dense embeddings. Brinicle is evaluated through its server adapter.

5.4 Embedding Model

Dense embeddings are generated using nomic-ai/nomic-embed-text-v1.5:

  • Document prefix: search_document: {title}
  • Query prefix: search_query: {query}

Embeddings are computed before the benchmark runs. Search latency measurements therefore cover retrieval-engine behavior and do not include embedding generation.

5.5 Indexed Fields

All engines index the product title as the lexical search field. For hybrid retrieval, each document also contains a dense vector field holding the precomputed title embedding.

The Brinicle configuration used in the benchmark activates title-token evidence and dense-vector evidence. Structured fields such as category, subcategory, and attributes are part of Brinicle’s general item representation, but they are not active in this benchmark configuration.

5.6 Runtime Environment

Host Configuration

Component               Value
Host OS                 Ubuntu 25.10
CPU                     Intel Core i7-13650HX
Host RAM                32 GiB
Storage                 NVMe SSD
Docker version          29.2.1
Docker storage driver   overlay2

Container Resource Limits

Resource    Limit
CPU cores   16
RAM         16 GiB

Only one engine container is active during each benchmark run.

5.7 Retrieval Parameters

HNSW and Brinicle Configuration

Parameter                      Value
M                              8
ef_construction                512
ef_search                      1024
top_k                          100
Lexical dimension (Brinicle)   70

Lexical dimension specifies how many slots each encoded object reserves for title-token storage. A larger value truncates fewer long titles but increases memory usage per item.

5.8 Hybrid Parameter Tuning

Each system exposes its own parameter for controlling the lexical-semantic balance. The parameters are tuned separately for each engine and dataset using the held-out tuning queries.

Tuned Hybrid Parameters

Dataset   Brinicle   Meilisearch   OpenSearch   Typesense   Weaviate
WANDS     0.95       0.55          0.6          0.8         0.7
ESCI      0.9        0.4           0.4          0.2         0.5

For Brinicle, the selected alpha is part of the index configuration because the distance function is used during graph construction.

5.9 Benchmark Procedure

Each benchmark run has two phases. First, the engine builds or ingests the index. During this phase, the benchmark records build time and build memory. Second, the benchmark runs the evaluation queries. During this phase, the benchmark records returned product identifiers, per-query latency, throughput, and search memory.

The measured search outputs include:

ranked product ids

per-query latency

total query time

container memory profile
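The latency and throughput summaries can be derived from the recorded per-query latencies and total query time. The sketch below uses the nearest-rank percentile definition; the paper does not state which percentile method the benchmark uses, so this choice and the function names are assumptions.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of per-query latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def qps(num_queries: int, total_seconds: float) -> float:
    """Queries processed per second over the whole search phase."""
    return num_queries / total_seconds


latencies = [float(i) for i in range(1, 101)]  # synthetic latencies: 1..100 ms
print(p99(latencies))
print(qps(420, 2.0))
```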

5.10 Memory Measurement

Memory is measured separately for build and search. The benchmark records multiple memory counters, including:

raw peak memory

working-set peak memory

anonymous memory

file-backed memory

kernel memory

slab memory

The main results report peak search memory. Additional memory counters are included in the appendix.

5.11 Evaluation Metrics

The benchmark reports ranking metrics at K = 1, 5, 10, 20, 50, 100.

Relevance Metrics

Metric     Description
Hit@K      Whether at least one relevant product appears in the top K
Recall@K   Fraction of relevant products retrieved in the top K
nDCG@K     Graded ranking quality in the top K
MRR@K      Reciprocal rank of the first relevant product

System Metrics

Metric           Description
Build time       Time required to build or ingest the index
Search latency   Per-query retrieval latency
QPS              Queries processed per second
Build memory     Peak memory during index construction
Search memory    Peak memory during query execution

The main results focus on ranking quality, P99 latency, and peak search memory. Full metric tables are reported in the appendix.
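For reference, the four relevance metrics can be computed per query as sketched below and then averaged over the evaluation queries. The sketch assumes binary gains, consistent with exact-match relevance, and the standard log2 position discount for nDCG; the benchmark's exact formulation is not given in the text, so these are assumptions.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if any(pid in relevant for pid in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant id in the top k, else 0.0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG with a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, pid in enumerate(ranked[:k], start=1) if pid in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["p3", "p1", "p9", "p2"]   # hypothetical engine output
relevant = {"p1", "p2"}             # hypothetical exact-match judgments
print(hit_at_k(ranked, relevant, 1), recall_at_k(ranked, relevant, 2))
print(mrr_at_k(ranked, relevant, 10), ndcg_at_k(ranked, relevant, 10))
```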

6. Results

This section reports the main retrieval and system results on WANDS and US-filtered Amazon ESCI. The main text focuses on exact-relevance retrieval quality, P99 search latency, and peak search memory. Full metric tables, throughput measurements, build-time measurements, and additional memory counters are reported in the appendix.

6.1 WANDS Results

Table 1. WANDS Main Results

Engine        Hit@1    nDCG@10   Hit@100   P99 Latency   Peak Search Memory
Brinicle      0.4844   0.5851    0.7444    0.516 ms      129 MB
Meilisearch   0.4844   0.5724    0.7311    7.433 ms      239 MB
OpenSearch    0.4956   0.5855    0.7467    1.480 ms      9,552 MB
Typesense     0.4844   0.5779    0.7311    7.574 ms      1,016 MB
Weaviate      0.4622   0.5631    0.7333    10.758 ms     597 MB

Relevance is evaluated using exact-match labels. Latency is reported as per-query P99 latency. Memory is reported as peak search memory.

On WANDS, OpenSearch has the highest Hit@1, nDCG@10, and Hit@100. Brinicle is close on all three relevance metrics, with the lowest P99 latency and the lowest peak search memory among the compared systems.

The WANDS results show a narrow relevance spread among the strongest systems. OpenSearch reaches 0.4956 Hit@1, while Brinicle, Meilisearch, and Typesense each reach 0.4844. At Hit@100, OpenSearch reaches 0.7467, while Brinicle reaches 0.7444.

The system measurements show a larger separation. Brinicle records 0.516 ms P99 latency and 129 MB peak search memory. The next-lowest P99 latency is OpenSearch at 1.480 ms, and the next-lowest peak search memory is Meilisearch at 239 MB.

6.2 ESCI Results

Table 2. ESCI Main Results

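| Engine | Hit@1 | nDCG@10 | Hit@100 | P99 Latency | Peak Search Memory |
| --- | --- | --- | --- | --- | --- |
| Brinicle | 0.428 | 0.3661 | 0.8932 | 0.773 ms | 1,731 MB |
| Meilisearch | 0.4175 | 0.3566 | 0.8862 | 19.768 ms | 5,671 MB |
| OpenSearch | 0.4226 | 0.3601 | 0.9009 | 3.407 ms | 11,716 MB |
| Typesense | 0.4191 | 0.3525 | 0.8793 | 12.160 ms | 8,041 MB |
| Weaviate | 0.4203 | 0.3588 | 0.9054 | 9.483 ms | 4,794 MB |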

Relevance is evaluated using exact labels only. Latency is reported as per-query P99 latency. Memory is reported as peak search memory.

On ESCI, Brinicle has the highest Hit@1 and nDCG@10. Weaviate has the highest Hit@100, followed by OpenSearch. This indicates a difference between early exact-match ranking and deeper top-k retrieval.

Brinicle records 0.4280 Hit@1 and 0.3661 nDCG@10. The strongest non-Brinicle Hit@1 is OpenSearch at 0.4226, and the strongest non-Brinicle nDCG@10 is also OpenSearch at 0.3601. At Hit@100, Weaviate reaches 0.9054, OpenSearch reaches 0.9009, and Brinicle reaches 0.8932.

The system measurements again show the largest differences in latency and memory. Brinicle records 0.773 ms P99 latency and 1,731 MB peak search memory. The next-lowest P99 latency is OpenSearch at 3.407 ms, and the next-lowest peak search memory is Weaviate at 4,794 MB.

6.3 P99 Search Latency

Figure 1. P99 search latency on WANDS

Figure 2. P99 search latency on ESCI

Table 3. P99 Search Latency

| Dataset | Brinicle | Meilisearch | OpenSearch | Typesense | Weaviate |
| --- | --- | --- | --- | --- | --- |
| WANDS | 0.516 ms | 7.433 ms | 1.480 ms | 7.574 ms | 10.758 ms |
| ESCI | 0.773 ms | 19.768 ms | 3.407 ms | 12.160 ms | 9.483 ms |

Brinicle has the lowest measured P99 latency on both datasets. On WANDS, its P99 latency is 0.516 ms, compared with 1.480 ms for OpenSearch, the closest non-Brinicle system. On ESCI, its P99 latency is 0.773 ms, compared with 3.407 ms for OpenSearch.

6.4 Search Memory

Figure 3. Peak search memory on WANDS

Figure 4. Peak search memory on ESCI

Table 4. Peak Search Memory

| Dataset | Brinicle | Meilisearch | OpenSearch | Typesense | Weaviate |
| --- | --- | --- | --- | --- | --- |
| WANDS | 129 MB | 239 MB | 9,552 MB | 1,016 MB | 597 MB |
| ESCI | 1,731 MB | 5,671 MB | 11,716 MB | 8,041 MB | 4,794 MB |

Brinicle has the lowest measured search memory on both datasets. On WANDS, Brinicle uses 129 MB, followed by Meilisearch at 239 MB. On ESCI, Brinicle uses 1,731 MB, followed by Weaviate at 4,794 MB. The memory difference is larger on ESCI, where the corpus is substantially larger. In that setting, Brinicle’s peak search memory is less than half of the closest non-Brinicle measurement.

6.5 Hit@K Curves

Figure 5. Hit@K curve on WANDS using exact relevance

Figure 6. Hit@K curve on ESCI using exact relevance

Figures 5 and 6 report Hit@K curves across K = 1, 5, 10, 20, 50, and 100. On WANDS, OpenSearch has the highest Hit@K at every reported K except K = 20, where Brinicle is higher (0.6911 versus 0.6867), and Brinicle remains close at the other cutoffs. On ESCI, Brinicle leads at K = 1, while Weaviate has the highest Hit@K from K = 5 through K = 100. The full Hit@K, Recall@K, nDCG@K, and MRR@K tables are provided in the appendix.

6.6 Result Summary

Across both datasets, the results show three main patterns.

First, relevance is competitive across systems. On WANDS, OpenSearch has the strongest exact-relevance metrics among the reported values. On ESCI, Brinicle has the strongest Hit@1 and nDCG@10, while Weaviate has the strongest Hit@100.

Second, Brinicle has the lowest measured P99 search latency on both datasets.

Third, Brinicle has the lowest measured peak search memory on both datasets.

These results support the single-graph formulation as a practical retrieval design for title-based hybrid product search: lexical and semantic evidence can be combined during graph traversal while maintaining competitive exact-relevance quality and a smaller search-time resource footprint.
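The second and third patterns can be quantified directly from Tables 3 and 4. The short calculation below, with the hypothetical helper `runner_up_ratio`, finds the next-lowest system for each dataset and reports how many times larger its measurement is than Brinicle's.

```python
from typing import Dict, Tuple

# Values copied from Table 3 (P99 latency, ms) and Table 4 (peak search memory, MB).
P99_MS = {
    "WANDS": {"Brinicle": 0.516, "Meilisearch": 7.433, "OpenSearch": 1.480,
              "Typesense": 7.574, "Weaviate": 10.758},
    "ESCI": {"Brinicle": 0.773, "Meilisearch": 19.768, "OpenSearch": 3.407,
             "Typesense": 12.160, "Weaviate": 9.483},
}
PEAK_MB = {
    "WANDS": {"Brinicle": 129, "Meilisearch": 239, "OpenSearch": 9552,
              "Typesense": 1016, "Weaviate": 597},
    "ESCI": {"Brinicle": 1731, "Meilisearch": 5671, "OpenSearch": 11716,
             "Typesense": 8041, "Weaviate": 4794},
}


def runner_up_ratio(row: Dict[str, float], system: str = "Brinicle") -> Tuple[str, float]:
    """Return the lowest other system and its value divided by `system`'s value."""
    others = {name: value for name, value in row.items() if name != system}
    runner_up = min(others, key=others.get)
    return runner_up, others[runner_up] / row[system]


latency_ratios = {dataset: runner_up_ratio(row) for dataset, row in P99_MS.items()}
memory_ratios = {dataset: runner_up_ratio(row) for dataset, row in PEAK_MB.items()}

for dataset in ("WANDS", "ESCI"):
    lat_name, lat_ratio = latency_ratios[dataset]
    mem_name, mem_ratio = memory_ratios[dataset]
    print(f"{dataset}: P99 {lat_ratio:.2f}x ({lat_name}), memory {mem_ratio:.2f}x ({mem_name})")
```

The next-lowest P99 latency is roughly 2.9 times Brinicle's on WANDS and 4.4 times on ESCI, and the next-lowest peak search memory is roughly 1.9 times Brinicle's on WANDS and 2.8 times on ESCI.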

7. Discussion

The results show that hybrid product retrieval can be implemented through a single HNSW graph while preserving competitive exact-relevance quality. Brinicle’s main distinction is not a single isolated relevance score, but the combination of retrieval quality, low search memory, and low search latency under the tested configuration. This section discusses the implications of the benchmark results for hybrid retrieval design, product-title search, and deployment trade-offs.

7.1 Interpreting the Retrieval Trade-Off

The relevance results differ across datasets and ranking depths. On WANDS, OpenSearch has the strongest reported exact-relevance metrics. Brinicle remains close across the main relevance points, with a small difference in Hit@1, nDCG@10, and Hit@100.

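On ESCI, Brinicle has the strongest Hit@1 and nDCG@10, while Weaviate has the strongest Hit@100. This indicates that Brinicle performs strongly in early ranking, while Weaviate and OpenSearch more often place at least one exact match within the top 100. Hit@K does not measure how many exact matches are retrieved; on Recall@100, which does, Brinicle is highest on ESCI (0.5789, Appendix D.6).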

This pattern is useful because it separates two retrieval behaviors:

early ranking quality

deep candidate coverage

For product search, both behaviors can matter. Early ranking is important when results are shown directly to users. Deeper candidate coverage is important when the retrieval stage feeds reranking, recommendation, or downstream selection. The benchmark results therefore describe an operating profile rather than a single leaderboard. Brinicle’s profile is strongest in search-time efficiency and early exact-match ranking on the larger ESCI benchmark, while other systems show advantages in specific relevance metrics and deeper retrieval settings.

7.2 Hybrid Retrieval Inside Graph Traversal

The central architectural result is that lexical and semantic evidence can participate in the same graph traversal. In a conventional hybrid system, lexical and vector retrieval are usually performed through separate structures, and hybrid behavior is introduced through score fusion or reranking. Brinicle moves this combination into the distance function used by HNSW.

This has two consequences. First, hybrid scoring affects candidate exploration, not only final ranking. The graph traversal is guided by a distance function that includes both symbolic title evidence and dense-vector similarity. Second, the retrieval system has a smaller structural surface. The benchmarked Brinicle configuration uses one encoded representation, one HNSW graph, and one hybrid-aware distance function for title-based hybrid retrieval.

The results suggest that this design is sufficient to produce competitive retrieval behavior on the evaluated product-search tasks.
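To make the idea of a hybrid-aware distance concrete, the sketch below blends a cosine distance over embeddings with a Jaccard distance over title tokens. It is an illustration only and is not Brinicle's actual distance function: the convex combination, the choice of Jaccard distance, the direction of `alpha` (here, the weight on the semantic term), and all names are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class Encoded:
    """Hypothetical encoded item: title tokens plus a dense embedding."""
    tokens: FrozenSet[str]
    embedding: Tuple[float, ...]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def token_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard distance between token sets (assumed lexical signal)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def hybrid_distance(x: Encoded, y: Encoded, alpha: float) -> float:
    """Assumed blend: alpha weights the semantic term, 1 - alpha the lexical term."""
    semantic = cosine_distance(x.embedding, y.embedding)
    lexical = token_distance(x.tokens, y.tokens)
    return alpha * semantic + (1.0 - alpha) * lexical


def nearest(query: Encoded, docs: Dict[str, Encoded], alpha: float) -> str:
    """Brute-force stand-in for the comparison a graph traversal would make."""
    return min(docs, key=lambda name: hybrid_distance(query, docs[name], alpha))


query = Encoded(frozenset({"oak", "desk"}), (1.0, 0.0))
docs = {
    "lexical_match": Encoded(frozenset({"oak", "desk", "lamp"}), (0.0, 1.0)),
    "semantic_match": Encoded(frozenset({"walnut", "table"}), (1.0, 0.1)),
}
semantic_pick = nearest(query, docs, alpha=0.9)
lexical_pick = nearest(query, docs, alpha=0.1)
```

With a high `alpha` the semantically close item is nearest, and with a low `alpha` the token-overlapping item is nearest. Because a graph traversal compares candidates with this same function, the balance influences which neighbors are explored, not only how the final list is ordered.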

7.3 Early Ranking and Deeper Top-K Behavior

The ESCI results show a clear distinction between early ranking and deeper top-k retrieval. Brinicle leads the reported early-ranking metrics, Hit@1 and nDCG@10, while Weaviate leads the reported deeper metric, Hit@100.

This distinction is important for interpreting hybrid retrieval systems. One method can be strong at placing an exact result near the top, while another can be stronger at placing at least one exact result somewhere inside a larger candidate set. The appropriate retrieval profile depends on the application. A direct product-search interface benefits from strong early ranking. A multi-stage ranking system may prefer broader top-k coverage before reranking. In this benchmark, Brinicle’s strongest relevance behavior appears in early exact-match ranking on ESCI, while its strongest system behavior appears consistently in latency and memory across both datasets.

7.4 Search-Time Resource Profile

The memory and latency measurements show the clearest separation between Brinicle and the compared systems. Brinicle has the lowest measured peak search memory on both WANDS and ESCI. The difference is especially visible on ESCI, where Brinicle uses less than half the search memory of the closest non-Brinicle system. Brinicle also has the lowest measured P99 latency on both datasets. This result is consistent across the smaller WANDS corpus and the larger ESCI corpus. Together, these measurements show that the single-graph design changes the search-time resource profile of hybrid retrieval. The system does not maintain separate lexical and vector retrieval structures for the benchmarked hybrid task, and the measured search memory reflects that architectural choice.

7.5 Alpha as Index Configuration

Brinicle’s alpha affects graph construction as well as query-time search. This makes the hybrid parameter part of the index configuration rather than only a runtime fusion parameter. When the graph is built, the configured distance function influences neighborhood formation. A more semantic configuration creates graph neighborhoods shaped more strongly by vector similarity. A stronger lexical correction changes how symbolic title evidence contributes to those neighborhoods.

This is different from hybrid systems where the lexical and vector indexes are built independently and the hybrid parameter only affects query-time score combination. In the benchmark, each Brinicle index is built using the tuned alpha selected for that dataset. This means the reported Brinicle results reflect both the encoded representation and the graph topology produced by the selected hybrid distance.

7.6 Deployment Implications

The measured trade-off is relevant for deployments where search memory and latency are important constraints. A lower search-time memory footprint can reduce infrastructure cost, allow more indexes to run on the same machine, or leave more memory available for application logic. Lower latency can improve interactive search behavior and increase the headroom available for additional downstream processing.

Brinicle’s design is therefore most directly relevant to search systems where hybrid retrieval is needed but maintaining multiple retrieval structures is expensive. The benchmarked setting is title-based product retrieval, but the architectural pattern is broader: encode multiple retrieval signals into one comparable object, then use a distance function that combines those signals during graph traversal.

7.7 Discussion Summary

The results support three main observations. First, title-based hybrid product retrieval can be expressed through one HNSW graph and one hybrid-aware distance function. Second, Brinicle achieves competitive exact-relevance quality on both evaluated datasets, with stronger early-ranking results on ESCI and close relevance results on WANDS. Third, Brinicle shows a consistent search-time resource advantage in the reported measurements, with the lowest P99 latency and lowest peak search memory on both datasets. These observations support the paper’s main claim: hybrid product retrieval can be modeled as a single-graph retrieval problem, with lexical and semantic evidence combined during graph traversal rather than through post-hoc fusion over separate retrieval structures.

8. Limitations

This benchmark evaluates title-based hybrid product retrieval using precomputed embeddings and exact-match relevance labels. It does not measure multi-field ranking, faceted filtering, personalized retrieval, distributed deployment, or reranking pipelines. The results should therefore be interpreted as evidence for the tested title + vector retrieval setting, not as a complete evaluation of every product-search workload.

The compared systems were tuned through held-out queries under a shared benchmark configuration, but each engine has additional parameters and deployment modes that may change its behavior. Brinicle’s alpha also affects graph construction, so changing the hybrid balance requires rebuilding the index. Future experiments should evaluate richer metadata, structured filters, additional datasets, and multi-stage retrieval pipelines.

9. Conclusion

This paper studied a single-graph formulation for hybrid product retrieval. Instead of combining separate lexical and vector retrieval results through post-hoc fusion, Brinicle encodes title-token evidence and dense embeddings into one HNSW-searchable representation. A custom distance function then combines symbolic and semantic evidence during graph construction and search.

The experiments on WANDS and US-filtered Amazon ESCI show that this approach achieves competitive exact-relevance quality under the tested title + vector configuration. Brinicle has the lowest measured P99 latency and peak search memory on both datasets, while relevance leadership varies by dataset and metric.

The main result is architectural: hybrid product retrieval can be modeled as graph traversal over a structured representation, rather than as coordination between separate retrieval structures. For workloads where exact product identifiers and semantic tolerance both matter, this opens a practical design space for lower-memory and lower-latency hybrid search.

Future work should evaluate the same approach with richer product metadata, structured filters, additional datasets, different embedding models, and multi-stage reranking pipelines.

Appendix A. Code, Data, and Citations

Repositories and Datasets

| Resource | Repository |
| --- | --- |
| Brinicle | github.com/bicardinal/brinicle |
| Benchmark Harness | github.com/bicardinal/item_search_bench |
| Amazon ESCI Dataset | github.com/amazon-science/esci-data |
| WANDS Dataset | github.com/wayfair/WANDS |

WANDS Citation

@InProceedings{wands,
  title = {WANDS: Dataset for Product Search Relevance Assessment},
  author = {Chen, Yan and Liu, Shujian and Liu, Zheng and Sun, Weiyi and Baltrunas, Linas and Schroeder, Benjamin},
  booktitle = {Proceedings of the 44th European Conference on Information Retrieval},
  year = {2022},
  numpages = {12}
}

Amazon ESCI Citation

@article{reddy2022shopping,
  title = {Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search},
  author = {Chandan K. Reddy and Llu\'{i}s M\'{a}rquez and Fran Valero and Nikhil Rao and Hugo Zaragoza and Sambaran Bandyopadhyay and Arnab Biswas and Anlu Xing and Karthik Subbian},
  year = {2022},
  eprint = {2206.06588},
  archivePrefix = {arXiv}
}

Appendix B. Benchmark Configuration

B.1 Dataset Configuration

| Dataset | Documents | Queries | Tuning Queries | Evaluation Queries | Indexed Field |
| --- | --- | --- | --- | --- | --- |
| WANDS | 42,994 | 450 | 30 | 420 | Title |
| Amazon ESCI, US locale | 1,215,854 | 20,458 | 2,000 | 18,458 | Title |

B.2 Shared Retrieval Configuration

| Parameter | Value |
| --- | --- |
| Indexed lexical field | Title |
| Retrieval mode | Hybrid title + vector |
| top_k | 100 |
| Reported K values | 1, 5, 10, 20, 50, 100 |

B.3 Embedding Configuration

| Parameter | Value |
| --- | --- |
| Embedding model | nomic-ai/nomic-embed-text-v1.5 |
| Document prefix | search_document: {title} |
| Query prefix | search_query: {query} |
| Embedding timing | Precomputed before benchmark |

Search latency does not include embedding generation.
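The prefixes in B.3 are applied to the raw text before embedding. A minimal sketch of that formatting step is shown below; the helper names are hypothetical, and the call into the embedding model itself is omitted.

```python
DOCUMENT_TEMPLATE = "search_document: {title}"
QUERY_TEMPLATE = "search_query: {query}"


def document_text(title: str) -> str:
    """Text passed to the embedding model for an indexed product title."""
    return DOCUMENT_TEMPLATE.format(title=title)


def query_text(query: str) -> str:
    """Text passed to the embedding model for a search query."""
    return QUERY_TEMPLATE.format(query=query)


doc_input = document_text("Solid Oak Writing Desk")
query_input = query_text("oak desk")
```

Using different prefixes for documents and queries matters because the same string would otherwise be embedded identically on both sides.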

B.4 Runtime Environment

| Component | Value |
| --- | --- |
| Host OS | Ubuntu 25.10 |
| CPU | Intel Core i7-13650HX |
| Host RAM | 32 GiB |
| Storage | NVMe SSD |
| Docker version | 29.2.1 |
| Docker storage driver | overlay2 |

B.5 HNSW and Brinicle Configuration

| Parameter | Value |
| --- | --- |
| M | 8 |
| ef_construction | 512 |
| ef_search | 1024 |
| top_k | 100 |
| Lexical dimension (Brinicle) | 70 |

Appendix C. Hybrid Parameter Tuning

Each engine’s hybrid parameter is selected using held-out tuning queries and then applied to the evaluation split.

| Dataset | Brinicle | Meilisearch | OpenSearch | Typesense | Weaviate |
| --- | --- | --- | --- | --- | --- |
| WANDS | 0.95 | 0.55 | 0.6 | 0.8 | 0.7 |
| ESCI | 0.9 | 0.4 | 0.4 | 0.2 | 0.5 |

For Brinicle, the selected alpha is part of the index configuration because the distance function is used during graph construction.
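The tuning procedure can be sketched as a grid search in which every candidate value triggers an index rebuild, reflecting that alpha shapes the graph itself. The candidate grid, the selection metric, and the `build_index` and `evaluate` callables are assumptions for illustration; the paper does not specify them.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Index = TypeVar("Index")


def tune_alpha(
    build_index: Callable[[float], Index],
    evaluate: Callable[[Index], float],
    candidates: Iterable[float],
) -> Tuple[Optional[float], float]:
    """Return the candidate alpha with the best tuning-set score."""
    best_alpha: Optional[float] = None
    best_score = float("-inf")
    for alpha in candidates:
        index = build_index(alpha)  # rebuild: alpha is part of the index configuration
        score = evaluate(index)     # e.g. a ranking metric on the held-out tuning queries
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Toy stand-ins: the "index" only remembers its alpha, and the score peaks at 0.9.
def toy_build(alpha: float) -> dict:
    return {"alpha": alpha}


def toy_evaluate(index: dict) -> float:
    return -((index["alpha"] - 0.9) ** 2)


grid = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0
best_alpha, best_score = tune_alpha(toy_build, toy_evaluate, grid)
```

For engines whose hybrid parameter only affects query-time fusion, `build_index` could return the same index for every candidate, which makes tuning much cheaper than in the rebuild-per-candidate case.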

Appendix D. Full Relevance Metrics

All relevance metrics are computed using exact relevance only. Metrics are reported at K = 1, 5, 10, 20, 50, and 100.

D.1 WANDS Hit@K

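| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.4844 | 0.5911 | 0.6356 | 0.6911 | 0.7222 | 0.7444 |
| Meilisearch | 0.4844 | 0.5844 | 0.6267 | 0.6778 | 0.7133 | 0.7311 |
| OpenSearch | 0.4956 | 0.6022 | 0.6467 | 0.6867 | 0.7333 | 0.7467 |
| Typesense | 0.4844 | 0.5956 | 0.6333 | 0.6822 | 0.7156 | 0.7311 |
| Weaviate | 0.4622 | 0.5778 | 0.6356 | 0.6733 | 0.7089 | 0.7333 |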

D.2 WANDS Recall@K

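| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.1238 | 0.2074 | 0.2637 | 0.3545 | 0.4949 | 0.6122 |
| Meilisearch | 0.125 | 0.2054 | 0.2576 | 0.3395 | 0.463 | 0.573 |
| OpenSearch | 0.1271 | 0.2112 | 0.2701 | 0.3549 | 0.487 | 0.6008 |
| Typesense | 0.1223 | 0.2107 | 0.2653 | 0.3488 | 0.472 | 0.5803 |
| Weaviate | 0.1198 | 0.199 | 0.2537 | 0.3355 | 0.4557 | 0.5518 |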

D.3 WANDS nDCG@K

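| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.6124 | 0.591 | 0.5851 | 0.5827 | 0.5804 | 0.5937 |
| Meilisearch | 0.6124 | 0.5812 | 0.5724 | 0.5659 | 0.556 | 0.5642 |
| OpenSearch | 0.6264 | 0.5925 | 0.5855 | 0.5799 | 0.5731 | 0.5838 |
| Typesense | 0.6124 | 0.5869 | 0.5779 | 0.5724 | 0.562 | 0.5707 |
| Weaviate | 0.5843 | 0.5685 | 0.5631 | 0.5569 | 0.5468 | 0.5472 |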

D.4 WANDS MRR@K

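| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.4844 | 0.5249 | 0.5308 | 0.5345 | 0.5354 | 0.5357 |
| Meilisearch | 0.4844 | 0.5244 | 0.5304 | 0.534 | 0.5352 | 0.5355 |
| OpenSearch | 0.4956 | 0.5349 | 0.5411 | 0.544 | 0.5456 | 0.5457 |
| Typesense | 0.4844 | 0.5262 | 0.5313 | 0.5348 | 0.5359 | 0.5362 |
| Weaviate | 0.4622 | 0.5094 | 0.5172 | 0.5199 | 0.5211 | 0.5215 |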

D.5 ESCI Hit@K

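| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.428 | 0.6631 | 0.7444 | 0.8068 | 0.8634 | 0.8932 |
| Meilisearch | 0.4175 | 0.6438 | 0.7243 | 0.7876 | 0.8506 | 0.8862 |
| OpenSearch | 0.4226 | 0.66 | 0.7416 | 0.8046 | 0.8653 | 0.9009 |
| Typesense | 0.4191 | 0.6493 | 0.7244 | 0.7875 | 0.8475 | 0.8793 |
| Weaviate | 0.4203 | 0.6652 | 0.7475 | 0.809 | 0.8727 | 0.9054 |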

D.6 ESCI Recall@K

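| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.0631 | 0.1952 | 0.2859 | 0.3826 | 0.4991 | 0.5789 |
| Meilisearch | 0.0625 | 0.1898 | 0.2774 | 0.3689 | 0.4769 | 0.5518 |
| OpenSearch | 0.063 | 0.193 | 0.2815 | 0.3775 | 0.4917 | 0.5701 |
| Typesense | 0.061 | 0.1869 | 0.2718 | 0.3618 | 0.4674 | 0.5398 |
| Weaviate | 0.0628 | 0.1919 | 0.2809 | 0.3775 | 0.4944 | 0.576 |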

D.7 ESCI nDCG@K

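| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.428 | 0.3847 | 0.3661 | 0.3773 | 0.4268 | 0.4585 |
| Meilisearch | 0.4175 | 0.3747 | 0.3566 | 0.3656 | 0.411 | 0.4406 |
| OpenSearch | 0.4226 | 0.3784 | 0.3601 | 0.371 | 0.419 | 0.4498 |
| Typesense | 0.4191 | 0.3733 | 0.3525 | 0.3604 | 0.4046 | 0.4332 |
| Weaviate | 0.4203 | 0.3772 | 0.3588 | 0.3702 | 0.4196 | 0.4516 |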

D.8 ESCI MRR@K

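| Engine | @1 | @5 | @10 | @20 | @50 | @100 |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.428 | 0.5176 | 0.5285 | 0.5329 | 0.5348 | 0.5352 |
| Meilisearch | 0.4175 | 0.5037 | 0.5146 | 0.519 | 0.5211 | 0.5216 |
| OpenSearch | 0.4226 | 0.5124 | 0.5234 | 0.5279 | 0.5299 | 0.5304 |
| Typesense | 0.4191 | 0.5067 | 0.5169 | 0.5213 | 0.5233 | 0.5238 |
| Weaviate | 0.4203 | 0.5136 | 0.5247 | 0.5291 | 0.5312 | 0.5316 |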

Appendix E. Latency and Throughput

Latency values are reported in milliseconds. Total query time is reported in seconds.

E.1 WANDS Latency and Throughput

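| Engine | Avg ms | P50 ms | P95 ms | P99 ms | QPS | Total Query Time |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.427 | 0.428 | 0.516 | 0.516 | 2357.6 | 0.192 s |
| Meilisearch | 7.09 | 7.093 | 7.433 | 7.433 | 141.1 | 3.191 s |
| OpenSearch | 1.083 | 1.029 | 1.48 | 1.48 | 926.8 | 0.487 s |
| Typesense | 6.774 | 6.696 | 7.574 | 7.574 | 147.7 | 3.048 s |
| Weaviate | 9.427 | 9.565 | 10.758 | 10.758 | 106.2 | 4.242 s |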

E.2 ESCI Latency and Throughput

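| Engine | Avg ms | P50 ms | P95 ms | P99 ms | QPS | Total Query Time |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 0.556 | 0.549 | 0.692 | 0.773 | 1800.1 | 11.366 s |
| Meilisearch | 15.057 | 15.122 | 17.408 | 19.768 | 66.4 | 308.043 s |
| OpenSearch | 2.704 | 2.671 | 3.053 | 3.407 | 371 | 55.321 s |
| Typesense | 9.334 | 9.202 | 10.379 | 12.16 | 107.1 | 190.950 s |
| Weaviate | 8.597 | 8.749 | 9.283 | 9.483 | 116.3 | 175.882 s |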

Appendix F. Search Memory

Memory values are reported in MB.

F.1 WANDS Search Memory

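| Engine | Raw Peak | Working Set | Anonymous | File-Backed | Kernel | Slab |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 129.3 | 128.4 | 79.8 | 41.3 | 4.8 | 3.5 |
| Meilisearch | 238.8 | 238.8 | 124.9 | 107.4 | 4.2 | 1.6 |
| OpenSearch | 9551.5 | 9544.2 | 9390 | 128.9 | 29.6 | 7.3 |
| Typesense | 1016.4 | 1016.4 | 637.6 | 349.7 | 26.3 | 9.2 |
| Weaviate | 596.8 | 596.8 | 471.4 | 119.1 | 2.6 | 0.9 |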

F.2 ESCI Search Memory

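| Engine | Raw Peak | Working Set | Anonymous | File-Backed | Kernel | Slab |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 1731.1 | 1714.7 | 480.8 | 1203.7 | 43.7 | 39 |
| Meilisearch | 5671 | 5671 | 955 | 4687 | 26.3 | 12.1 |
| OpenSearch | 11716.2 | 11704.5 | 10171 | 1505.4 | 35.7 | 10.9 |
| Typesense | 8040.5 | 8040.3 | 1601.3 | 6391.7 | 44.4 | 24.8 |
| Weaviate | 4794 | 4794 | 2326.7 | 2446.1 | 17.3 | 6.6 |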

Appendix G. Build Time and Build Memory

G.1 Build Time

| Dataset | Brinicle | Meilisearch | OpenSearch | Typesense | Weaviate |
| --- | --- | --- | --- | --- | --- |
| WANDS | 11.4 s | 15.4 s | 27.7 s | 8.9 s | 4.8 s |
| ESCI | 405.8 s | 3136.3 s | 697.9 s | 339.2 s | 227.8 s |

G.2 WANDS Build Memory (MB)

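| Engine | Raw Peak | Working Set | Anonymous | File-Backed | Kernel | Slab |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 160.2 | 126.9 | 82.1 | 76.7 | 5.5 | 4.4 |
| Meilisearch | 849.4 | 849.4 | 733.5 | 111.5 | 5.5 | 1.7 |
| OpenSearch | 9522.8 | 9515.4 | 9317.3 | 178.5 | 28.2 | 7 |
| Typesense | 897 | 897 | 312 | 574.1 | 25.6 | 9.9 |
| Weaviate | 535.2 | 535.2 | 410.4 | 117.6 | 2.5 | 0.9 |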

G.3 ESCI Build Memory (MB)

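| Engine | Raw Peak | Working Set | Anonymous | File-Backed | Kernel | Slab |
| --- | --- | --- | --- | --- | --- | --- |
| Brinicle | 2600.7 | 1665.9 | 320.4 | 2204.2 | 74 | 68.3 |
| Meilisearch | 7059.7 | 6989.1 | 2276.1 | 4795.6 | 31.2 | 15.3 |
| OpenSearch | 12963.4 | 12953.9 | 9645.5 | 3616.9 | 36.8 | 15 |
| Typesense | 8432.8 | 8432.5 | 1708.3 | 7106.8 | 44 | 26.4 |
| Weaviate | 6228.4 | 6228.4 | 3770.3 | 2720.5 | 16.5 | 7.2 |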

Appendix H. Figure Data

The main figures can be generated from the appendix tables as follows:

| Figure | Source Table |
| --- | --- |
| P99 latency comparison | Appendix E |
| Peak search memory comparison | Appendix F |
| Hit@K curves | Appendix D.1 and Appendix D.5 |

Appendix I. Raw Result Fields

The benchmark output uses the following fields:

| Field | Meaning |
| --- | --- |
| nDCG@K | Normalized discounted cumulative gain at K |
| Recall@K | Fraction of exact-relevant products retrieved in the top K |
| Hit@K | Whether at least one exact-relevant product appears in the top K |
| MRR@K | Reciprocal rank of the first exact-relevant product in the top K |
| search_avg_latency | Mean per-query search latency, in seconds |
| search_p50_latency | 50th percentile per-query search latency, in seconds |
| search_p95_latency | 95th percentile per-query search latency, in seconds |
| search_p99_latency | 99th percentile per-query search latency, in seconds |
| qps | Queries processed per second |
| search_total_query_time | Total measured search time, in seconds |
| raw_peak_mb | Peak raw memory usage, in MB |
| working_set_peak_mb | Peak working-set memory usage, in MB |
| anon_peak_mb | Peak anonymous memory usage, in MB |
| file_peak_mb | Peak file-backed memory usage, in MB |
| kernel_peak_mb | Peak kernel memory usage, in MB |
| slab_peak_mb | Peak slab memory usage, in MB |
| build_latency | Index build or ingestion time, in seconds |
| build_memory_profile | Memory profile recorded during index build or ingestion |
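The raw latency fields are in seconds, while Appendix E reports milliseconds. A small conversion sketch follows; the record layout is assumed to be a flat mapping from field name to value, and the sample values are back-converted from the Brinicle row of Appendix E.2 rather than copied from a raw output file.

```python
from typing import Dict

LATENCY_FIELDS = (
    "search_avg_latency",
    "search_p50_latency",
    "search_p95_latency",
    "search_p99_latency",
)


def latency_fields_to_ms(record: Dict[str, float]) -> Dict[str, float]:
    """Return the latency fields of a raw record converted from seconds to milliseconds."""
    return {
        name + "_ms": round(record[name] * 1000.0, 3)
        for name in LATENCY_FIELDS
        if name in record
    }


# Illustrative record; values back-converted from Appendix E.2 (Brinicle on ESCI).
raw_record = {
    "search_avg_latency": 0.000556,
    "search_p50_latency": 0.000549,
    "search_p95_latency": 0.000692,
    "search_p99_latency": 0.000773,
    "qps": 1800.1,
    "search_total_query_time": 11.366,
}
latency_ms = latency_fields_to_ms(raw_record)
```

Only the four per-query latency fields are rescaled; `search_total_query_time` stays in seconds, matching the units used in Appendix E.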