Build a Vector Search API with SQLite FTS5 + Python FastAPI in Minutes

Semantic search – finding answers by meaning instead of exact keywords – is all the rage now. And I think this trend is here to stay: after ChatGPT’s release, none of you are searching the ‘old way’ anymore. That’s one of the reasons my team built relunit.com, because no one seems to remember product names, and you expect search to just tell you what to buy for the girlfriend of 2 months who’s angry because you forgot her birthday, since your attention span is less than that of a goldfish 😭

Vector Search Using Python FastAPI and SQLite

Anyway, back to the topic at hand, setting up a full-blown vector database (Pinecone, FAISS, etc.) can feel like bringing a howitzer to a knife fight if your app is small or personal. Having spent a decade building everything from mobile apps to VOIP backends, I’ve learned that sometimes the simplest tools work wonders. In fact, SQLite – that tiny embedded database sitting in your Python standard library – can pack quite a punch. By combining SQLite’s FTS5 (full-text search) module with a bit of Python wizardry, you can roll your own vector-search API for documents or chat logs in a few hundred lines of code (You’re welcome 😏).

FTS5 is SQLite’s built-in full-text search extension. Think of it like a mini Google for your local data: it indexes words so you can quickly find all documents containing a query term. It’s lightning-fast for keyword searches, but it can only match exact terms or simple variants. To go beyond keywords, we use text embeddings: convert each document into a high-dimensional vector so that semantically similar texts have similar vectors (close in cosine distance). A widely-used approach is to use a model like HuggingFace’s SentenceTransformers or OpenAI’s API to produce these embeddings. At query time, we embed the user’s query text the same way, then find stored documents whose vectors are nearest (in cosine similarity) to that query vector.
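
To make “close in cosine distance” concrete, here’s a minimal sketch of the cosine-similarity calculation with NumPy; it’s the same formula we’ll use later to rank candidates:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors divided by the product of their lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy 3-dim "embeddings" pointing in nearly the same direction => similarity close to 1.
print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))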

This “hybrid” search (combining lexical FTS and vector similarity) gives the best of both worlds: you get the old-school keyword match boost and the neural-semantic boost. In fact, projects like Llama-Stack and others actively propose storing both an FTS5 index and a vector embedding per document, then scoring results by mixing BM25 (FTS relevance) with cosine similarity. We’ll follow a similar strategy in Python. Here’s the high-level plan:

  • Data Model: A SQLite table for documents (id, text), with an FTS5 virtual table for full-text search on text. We’ll also store each document’s vector (as a BLOB or JSON) in either the same table or a parallel table.
  • Embedding Model: Pick a text embedding model (e.g. all-MiniLM-L6-v2 from SentenceTransformers with 384 dimensions, or an OpenAI embedding with 1536 dims). In Python we’ll call it to convert text → vector.
  • Indexing: When adding a document, insert its text into the FTS5 table (for keyword queries) and compute+store its vector (for semantic queries).
  • Search Endpoint: Given a query string, embed it, then find candidate documents. We can either use FTS5 first to filter or do pure vector search. For simplicity, we’ll do a two-step hybrid search: run an FTS5 match to get top N hits, then re-rank those hits by cosine similarity of vectors. That limits our Python work to (say) a few dozen documents per query.
  • Performance: For small to medium datasets (up to ~100K docs), this approach is very reasonable. SQLite + brute-force Python search can handle “hundreds of thousands” of vectors with no exotic infrastructure. It’s not FAISS-fast for millions of vectors, but for many apps it’s more than enough.

Below I’ll walk through each piece in more detail, with tips and code snippets, so go ahead and buy me a coffee here: https://buymeacoffee.com/thecodecity.

If you do so, I’ll also share the trade-offs I’ve learned (after many late-night debugging sessions, all so you have it easier 😇) about choosing embedding size, enabling concurrency, and how this lightweight setup compares to heavyweight alternatives like FAISS or pgvector.

Why Not Just FAISS or Milvus? – The Case for “No Heavy Infra”

Before coding, let’s justify our lean choice. Specialized vector search engines like FAISS (a C++ library with Python bindings) or Milvus provide super-fast similarity search with approximate nearest-neighbor (ANN) indices, multi-threading, and GPU support. They scale to millions of vectors easily. But they come with their own downsides: heavy dependencies, cluster ops, and often hosting fees or complex setup. For many projects (small RAG apps, personal bots, prototypes) that’s overkill.

Using Vector DBs for a weekend bot

SQLite + Python means no separate server and no serverless fees. All your data lives in one file on disk (or in memory), and you use plain SQL and Python code. SQLite is a great way to get started: you can spin up a new vector-backed database in minutes, with no third-party tools or complicated database setup. YAYYY! Even without loading a plugin, vanilla SQLite has FTS5 (for text), and we can brute-force the vectors in Python.

In practice, this means:

  • Simplicity: the sqlite3 module ships with Python, so there’s nothing to pip install. No Docker, no cloud, no warm-up. Perfect for local RAG/chatbot apps.
  • Local / Privacy: Your data never leaves your machine. All embedding storage and search is local.
  • Control: You see exactly what SQL is doing (FTS5 query or a vector similarity loop). You can profile, tweak, and understand it.
  • Portability: SQLite runs everywhere (Windows, Linux, Mac, even mobile/embedded). Your app can run totally offline, even inside an Electron or Flutter app.

Of course, the trade-off is performance at scale. Pure SQLite+Python is not going to beat FAISS for 10 million vectors; it’s O(n) per query. But if you have only 10k–100k documents, it’ll feel quite snappy. Only when you’re pushing toward millions do specialized ANN indexes really shine. And even then, you could offload to something like sqlite-vss or sqlite-vec (I’ll mention these later).

So yes, I could say “go FAISS” once we hit that scale. But as long as we know the limits and keep things optimized (like choosing a sensible embedding dimension), SQLite+FastAPI is a delightfully minimal stack.

Choosing and Using Text Embeddings

The core of semantic search is the embedding model. We feed each document and query into a model that outputs a fixed-size vector capturing its meaning. Popular choices include HuggingFace/Sentence-Transformer models (which you can run locally) or APIs like OpenAI’s embeddings. A key question is: how big of a vector?

There’s a trade-off: larger vectors (e.g. 1024–1536 dims) can capture more nuance, but are slower to compute and slower to search. Smaller vectors (384–768 dims) are faster and use less memory, but might miss some subtlety. Concretely, models like all-MiniLM-L6-v2 produce 384-dimensional embeddings, while OpenAI’s text-embedding-3-small model gives 1536 dims. Often I start with a 384-dim model (MiniLM or a similar ‘small’ model) and see if that’s enough. If I really need fine-grained answers, I might switch to a bigger model later.

For example, all-MiniLM-L6-v2 is tiny and fast to run. In Python you’d do something like:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dims
query_vec = model.encode(["Your query here"])[0]  # encode returns one vector per input; [0] takes the first

The HuggingFace model card even says “It maps sentences & paragraphs to a 384 dimensional dense vector space”. For OpenAI’s text-embedding-ada-002 (legacy) or the newer text-embedding-3-small, the output is 1536 dims. More dims = heavier DB rows and more compute per cosine calculation. So keep the dimension as low as feasible for your problem.

Best practice: Start with a 384-dim model and test. If your search results feel weak or your queries need more nuance (longer queries, specialist domains, etc.), try 1024 or 1536 dims. In our schema the vector is just a BLOB, so switching models only changes the bytes you store (and, if you use a typed extension like sqlite-vec, the declared column size). There’s no universal ‘best’ dimension, only the right one for your use case.
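
If you’re not sure what a given model produces, you can just ask it. A quick check with SentenceTransformers (assuming the package is installed) looks like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
print(model.get_sentence_embedding_dimension())   # 384 for this model
print(len(model.encode(["hello world"])[0]))      # same answer, the long way round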

Defining the SQLite Schema and FTS5 Table

Now let’s wire it up. We’ll use SQLite (via the sqlite3 module or any ORM) to define our tables. A simple schema might be:

CREATE TABLE documents (
    id   INTEGER PRIMARY KEY,
    text TEXT
);
CREATE VIRTUAL TABLE documents_fts USING fts5(text, content='documents', content_rowid='id');
CREATE TABLE embeddings (
    doc_id INTEGER PRIMARY KEY,
    vector BLOB
);

Here, documents holds our raw text. The documents_fts virtual table is an FTS5 index over the same text: the content='documents' clause makes it an external-content index, so FTS5 reads the original text from documents instead of storing its own copy (we still have to add rows to the index ourselves, or via triggers; more on that below). Then embeddings stores each document’s vector as a blob (you could also use JSON, but blobs of floats are more compact).

When inserting, you’d do something like:

  1. Insert the text: INSERT INTO documents(text) VALUES(?).
  2. SQLite will auto-assign an id (grab it from lastrowid).
  3. Mirror the text into the FTS index: INSERT INTO documents_fts(rowid, text) VALUES(?, ?) using that same id.
  4. Compute the embedding in Python (using your chosen model) and serialize it (e.g. with NumPy or array to bytes).
  5. Insert the vector: INSERT INTO embeddings(doc_id, vector) VALUES(?, ?) where doc_id is the same id.

Alternatively, you can let triggers keep the FTS index in sync automatically (a sketch follows), but the explicit inserts above are straightforward. One nice thing: we use two tables so that FTS5 handles text and a normal table handles blobs; this keeps things simple.
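
If you’d rather not remember the extra FTS insert on every write, here’s a rough sketch of the trigger route, the standard FTS5 external-content pattern adapted to our column names (conn is any sqlite3 connection to this database; we create one in the FastAPI app below):

conn.executescript("""
CREATE TRIGGER IF NOT EXISTS documents_ai AFTER INSERT ON documents BEGIN
    INSERT INTO documents_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER IF NOT EXISTS documents_ad AFTER DELETE ON documents BEGIN
    INSERT INTO documents_fts(documents_fts, rowid, text) VALUES ('delete', old.id, old.text);
END;
CREATE TRIGGER IF NOT EXISTS documents_au AFTER UPDATE ON documents BEGIN
    INSERT INTO documents_fts(documents_fts, rowid, text) VALUES ('delete', old.id, old.text);
    INSERT INTO documents_fts(rowid, text) VALUES (new.id, new.text);
END;
""")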

Now, if you prefer to push more into SQL, there are SQLite extensions like sqlite-vec and sqlite-vss which let you do the vector math inside SQL. For example, sqlite-vec lets you declare a table with a float[384] column and then query nearest neighbors with a WHERE embedding MATCH ? ... ORDER BY distance clause.
(For the curious: sqlite-vss, the FAISS-based plugin, brings vector search to SQLite for building semantic search engines, but it essentially bundles FAISS under the hood. sqlite-vec is a newer C extension that does brute-force KNN inside SQLite. Both are cool, but here we assume we haven’t compiled either into our Python app.)
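
For reference, here’s a rough sketch of what the sqlite-vec flavor looks like in Python, assuming you’ve installed the sqlite-vec package; treat the helper names and query syntax as illustrative and check the extension’s docs before relying on them:

import sqlite3
import sqlite_vec  # pip install sqlite-vec (ships the compiled extension)

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)            # load the extension into this connection
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS vec_docs USING vec0(embedding float[384])")
db.execute("INSERT INTO vec_docs(rowid, embedding) VALUES (?, ?)",
           (1, sqlite_vec.serialize_float32([0.1] * 384)))
rows = db.execute(
    "SELECT rowid, distance FROM vec_docs WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    (sqlite_vec.serialize_float32([0.1] * 384),),
).fetchall()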

FastAPI Endpoints

With our DB schema in place, let’s sketch out a FastAPI application. We’ll need two main endpoints: one to add documents, and one to search. In code it might look like this:

import sqlite3
import numpy as np
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')
conn = sqlite3.connect('docs.db', check_same_thread=False)  # FastAPI may call this from different worker threads
conn.execute("PRAGMA journal_mode=WAL;")  # Enable concurrent readers

We enable WAL mode right after connecting. Remember that SQLite only allows one writer at a time, but WAL lets reads run while writing, so it’s a good default for a web API. For safety, you’d also ensure you don’t do long transactions.
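
One thing the snippet above glosses over: the tables have to exist before the endpoints run. A minimal way to create them at startup, reusing the schema from the previous section:

conn.executescript("""
CREATE TABLE IF NOT EXISTS documents (
    id   INTEGER PRIMARY KEY,
    text TEXT
);
CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts USING fts5(text, content='documents', content_rowid='id');
CREATE TABLE IF NOT EXISTS embeddings (
    doc_id INTEGER PRIMARY KEY,
    vector BLOB
);
""")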

For Adding Documents

@app.post("/add_doc/")
def add_doc(text: str):
    # Insert into documents table
    cur = conn.cursor()
    cur.execute("INSERT INTO documents(text) VALUES(?)", (text,))
    doc_id = cur.lastrowid
    
    # Compute embedding
    vec = model.encode([text])[0]  # shape (dim,)
    vec_blob = vec.astype('float32').tobytes()
    cur.execute("INSERT INTO embeddings(doc_id, vector) VALUES(?, ?)",
                (doc_id, vec_blob))
    conn.commit()
    return {"id": doc_id}

A few notes here: we commit once, after all three inserts. If your app is async, you’d use a threadpool or an async DB library, but the logic is the same. We store the vector as raw bytes (float32); you could use float64 for more precision, but float32 is usually fine and halves the size. The embeddings table now has one row per doc.
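
On the async point: FastAPI already runs plain def endpoints in a threadpool for you, so the handler above is fine as-is. If you switch to async def, though, a blocking model.encode would stall the event loop, and you’d push the work to a worker thread explicitly. A minimal sketch (the /add_doc_async/ route name is just for illustration):

from fastapi.concurrency import run_in_threadpool

@app.post("/add_doc_async/")
async def add_doc_async(text: str):
    # Reuse the synchronous handler, but run it in a worker thread so the
    # blocking model.encode + SQLite calls don't stall the event loop.
    return await run_in_threadpool(add_doc, text)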

For Searching Documents 

from numpy.linalg import norm

@app.get("/search/")
def search(q: str, k: int = 5):
    # 1) Keyword search with FTS
    cur = conn.cursor()
    # Use MATCH to find the rowids of documents whose text matches the query
    cur.execute("SELECT rowid FROM documents_fts WHERE documents_fts MATCH ? LIMIT 50", (q,))
    candidates = [row[0] for row in cur.fetchall()]

    # If FTS gave nothing, fall back to scanning all documents (or return empty)
    if not candidates:
        cur.execute("SELECT id FROM documents")
        candidates = [row[0] for row in cur.fetchall()]

    # 2) Vector search: embed the query
    q_vec = model.encode([q])[0]
    q_norm = norm(q_vec)
    
    # 3) Fetch candidate vectors and rank by cosine
    results = []
    for doc_id in candidates:
        cur.execute("SELECT vector FROM embeddings WHERE doc_id=?", (doc_id,))
        blob = cur.fetchone()[0]
        v = np.frombuffer(blob, dtype=np.float32)
        # compute cosine similarity
        sim = float(np.dot(q_vec, v) / (q_norm * norm(v)))
        results.append((doc_id, sim))
    # Sort by similarity descending
    results.sort(key=lambda x: x[1], reverse=True)
    # Return top-k results
    return {"results": results[:k]}

This is a straightforward approach. First we do an FTS5 query (MATCH ?) to quickly narrow down to candidate documents containing matching keywords (the LIMIT 50 means we take at most 50 docs). Then we re-rank by embeddings: compute the cosine similarity between the query vector and each candidate’s vector (loaded from SQLite as a BLOB). Finally we sort and return the top-k.

This hybrid can often yield better relevance than vector-only: FTS5 ranks by keyword and BM25, catching obvious hits, then we refine by true semantic similarity. (Some advanced schemes use reciprocal rank fusion to combine the scores, but here we simply re-rank.)
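
If you ever do want rank fusion, it’s only a few lines. Here’s a minimal sketch of reciprocal rank fusion, assuming you already have two ranked lists of doc ids (one from FTS5/BM25, one from cosine similarity); the constant k=60 is the conventional default:

def rrf_merge(keyword_ids, vector_ids, k=60, top_n=5):
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document.
    scores = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: doc 7 ranks high in both lists, so it comes out on top.
print(rrf_merge([7, 3, 9], [2, 7, 3]))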

You could also skip FTS5 entirely and compare against all documents, but that’s O(n) per query. For a few thousand docs that’s no problem; for tens of thousands, it might get slow. Our compromise – limit to N FTS hits – usually keeps the candidate set small. This approach should hold up to medium scales.
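
To try it end to end, start the app with uvicorn (e.g. uvicorn main:app, assuming the code lives in main.py) and hit the two endpoints. Because both handlers take simple parameters, FastAPI reads them from the query string; a quick client sketch using requests:

import requests

BASE = "http://127.0.0.1:8000"

requests.post(f"{BASE}/add_doc/", params={"text": "SQLite is a tiny embedded database."})
requests.post(f"{BASE}/add_doc/", params={"text": "FastAPI makes building web APIs painless."})

resp = requests.get(f"{BASE}/search/", params={"q": "small local database", "k": 2})
print(resp.json())   # {"results": [[doc_id, similarity], ...]}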

Performance Tips

To keep this API snappy, here are some pointers:

  • Use WAL mode: As shown above, we set PRAGMA journal_mode=WAL. This allows SQLite to handle concurrent readers with a writer, and is generally recommended for better throughput. Just remember SQLite still only allows one writer at a time. So try to keep each transaction short.
  • Batch inserts: If you’re ingesting many docs at once (say on startup), wrap them in a single transaction (see the sketch after this list). That avoids paying commit overhead per row. The Llama-Stack project even adds “chunked writes” helpers to speed this up.
  • Monitor query cost: For each search, we fetch vectors and compute cosine. If you get slowdowns, check if your embedding dimension is too large or your candidate set too big. You can always reduce LIMIT 50 to a smaller number, or prune candidates whose BM25 score (FTS rank) is low.
  • Embedding dimension: As discussed, smaller dims = faster. For example, a 384-dim vector takes 384 multiplies per dot product; a 1536-dim vector takes 4x that work. In SQLite-vss benchmarks, higher dims and fancy indexes (IVF, HNSW) make a huge difference. So if speed is key and your data is simple (e.g. short Q&A pairs), don’t over-dimension. Quick tip: quantizing or storing float16 can cut storage in half (though accuracy may drop slightly).
  • Number of vectors: The sqlite-vec FAQ suggests you can handle hundreds of thousands of vectors with brute force. In practice, I’ve found SQLite + NumPy can cruise through ~50k docs under 1 second per query on a midrange laptop, if dims are ~384. Past ~200k, queries start to lag. If you approach that, consider adding an ANN index or switching to FAISS. But often, you can trim your dataset (e.g. split very long docs into chunks, use a smaller model, or cache results).
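
Here’s the batch-insert sketch referenced above (bulk_add is just an illustrative helper name): encode all the texts up front, then write every row inside a single transaction so SQLite commits once instead of once per document.

def bulk_add(conn, model, texts):
    # Encode everything up front, then write all rows inside one transaction.
    vectors = model.encode(texts)
    with conn:  # a single implicit BEGIN ... COMMIT around all the inserts
        for text, vec in zip(texts, vectors):
            cur = conn.execute("INSERT INTO documents(text) VALUES(?)", (text,))
            doc_id = cur.lastrowid
            conn.execute("INSERT INTO documents_fts(rowid, text) VALUES(?, ?)", (doc_id, text))
            conn.execute("INSERT INTO embeddings(doc_id, vector) VALUES(?, ?)",
                         (doc_id, vec.astype('float32').tobytes()))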

FTS5 vs. Heavyweight Vector Stores

Let’s compare our SQLite+Python approach with heavier alternatives:

  • SQLite FTS5 (keyword search): Blazingly fast for term queries. However, it doesn’t understand synonyms or concepts – it only sees tokens. It’s great for initial filtering (as we’re doing) or for strictly keyword-based search.
  • SQLite (brute-force vector): Easy to implement and fully local. Performance is O(n) per query. For a few thousand vectors you probably won’t notice. With decent code it handles roughly 1e5–1e6 vectors in reasonable time, though you may need to lower the dimension or add an ANN extension beyond that. It uses very little RAM since vectors stay on disk and we stream them (similar to DuckDB’s approach). No indexing overhead (apart from FTS5). Multi-platform (SQLite even runs in a browser via WASM).
  • FAISS / Annoy / HNSW (in-memory): These specialized libs are optimized for speed at large scale. They typically load all vectors into RAM and build an index (IVF, HNSW, etc.). Queries are sub-linear (often millisecond-scale for 1M items). However, they require more memory and extra setup. For a small app, this is often unnecessary. sqlite-vec uses SIMD instructions to speed up its brute-force loop, but it still can’t outrun FAISS’s ANN indexes and multi-threading once you’re into the millions of items.
  • Hybrid (FTS5 + FAISS): Some systems combine both: e.g. use FTS5 to get 1000 candidates and then run those through FAISS in Python. Or use Reciprocal Rank Fusion to merge results. This can give great accuracy, but now you have two systems. Our all-in-one approach has fewer moving parts.

In summary, if you need sub-50ms queries on a million documents, go FAISS or an external vector DB. But if you’re fine with 100ms–1s on a few thousand docs (common for small apps or prototyping), sticking with SQLite is often “good enough” and so much simpler. It has the added bonus that you can also do keyword MATCH queries for free (FTS5 was designed for that).

Putting It All Together: Personal Thoughts

Having built RAG apps, chatbots, and search engines both with heavyweight stacks and with humble tools, I find the minimal SQLite approach deeply satisfying. It reminds me of my early days of programming: one language, one file, and it just works. With SQLite+FastAPI, I can throw a vector search endpoint onto any Python service without adding Docker or new infra.

Of course, this simplicity comes with caveats: you must watch out for write concurrency (SQLite allows only one writer at a time) and take some care with schema design. But for read-heavy search services, that’s usually fine – use WAL mode and you’re golden. I also keep an eye on embedding dimension. In one project, I initially used 1536-dim embeddings and found my queries a bit sluggish; dropping to a 384-dim model (MiniLM) cut my CPU usage by ~4x and still gave excellent results.

Ultimately, building a semantic search API this way has let me focus on the application, not on wrangling servers. I’ve shipped it as part of knowledge bases, personal chatbots, and even embedded tools on phones. And the warm fuzzy feeling of knowing your data never left your machine (no cloud API calls required at runtime) can’t be overstated.

Give it a try: you may discover that SQLite – the “little engine that could” – is quite capable of powering your first vector-based feature. Happy coding!
