Static site full text search with Hugo, S3, Python and JS

2021-04-26 (Monday) | 1900 words (~9 minutes reading)

Notes to myself.

Static sites are great for their speed and ease of deployment at low cost. One thing they might seem to miss out on is full-text search, but you can actually achieve reasonable full-text search functionality on a static site with a fairly simple solution.

This approach can be expanded in various ways to offer more functionality, but the basic core of it is:

Use Python to iterate all of the searchable items, and create an index.json file for each term found.
Deploy those index.json files with the rest of the static site to S3.
Create a static file at /search/index.html with a JS script to handle the search behaviour.
That JS script tries to fetch the index.json file for each search term, combines all the items found, scores them by how many index files they appear in and sorts them by that score.
Finally the JS script inserts a DOM element for each item using the data it got from the index.json files.

You can see a demo of this static site search here: https://momentwallart.co.uk/search/?q=japanese+mountains

Let’s look at each of those stages in a bit more detail.

Building search index files with Python

It’s handy to have a re-usable content file iterator in Python for working with markdown files in Hugo or other static site generators:

import os
import glob

from itertools import islice
from typing import List

def content_items_glob(limit: int = 10000) -> List[str]:
    return list(
        islice(
            (
                os.path.abspath(p)
                for p in glob.glob(
                    os.path.join(
                        os.path.dirname(__file__),
                        "../content/**/*.md"
                    ),
                    recursive=True
                )
            ),
            limit,
        )
    )

With that it’s easy to iterate all content item paths wherever you need to.

The other function that’s useful is one for loading a content item from a markdown file so that you can interact with it in Python. This uses the frontmatter package to load the frontmatter YAML as a Python dict, but you could handle that yourself if you prefer:

import frontmatter
from typing import Dict, Union

def load_content_item(content_item_path: str) -> Union[frontmatter.Post, Dict]:
    return frontmatter.load(content_item_path)

You can then do something like this to create the search indexes as a Python dictionary. As with the rest, this is a little bit simplified for the example:

def build_search_indexes():
    indexes = {}
    for path in content_items_glob():
        index_content_item(load_content_item(path), indexes)

The job of index_content_item() is to extract the individual search terms for that item, and add the item to each relevant index key:

def index_content_item(item: Dict, indexes: Dict) -> None:
    terms = item_index_terms(item)
    for term in terms:
        if term not in indexes:
            indexes[term] = {}
        indexes[term][item["item_id"]] = index_item(item)

The index_item() function produces a minimal dict of the item data needed to render the item on the search results page:

def index_item(product: Dict[str, str]) -> Dict[str, str]:
    # Keep only the fields the search page needs: title, permalink, image.
    return {
        "t": product["title"],
        "p": product["permalink"],
        "i": product["grid_image_url"],
    }

This uses single letter keys to try and shave off some more bytes in the final index JSON file, which is maybe pointless, but why not.
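
To make the shape of the resulting dictionary concrete, here is a small self-contained sketch of the same idea. The add_to_indexes() helper and the two sample items are made up for illustration, and the terms are passed in by hand rather than extracted:

```python
from typing import Dict, List


def add_to_indexes(item: Dict, terms: List[str], indexes: Dict[str, Dict[str, Dict]]) -> None:
    """Register a compact record for `item` under every term (hypothetical helper)."""
    record = {"t": item["title"], "p": item["permalink"], "i": item["grid_image_url"]}
    for term in terms:
        # One nested dict per term, keyed by item ID so re-indexing overwrites.
        indexes.setdefault(term, {})[item["item_id"]] = record


indexes: Dict[str, Dict[str, Dict]] = {}
add_to_indexes(
    {"item_id": "a1", "title": "Japanese Mountains", "permalink": "/a1/", "grid_image_url": "/img/a1.jpg"},
    ["japanese", "mountains"],
    indexes,
)
add_to_indexes(
    {"item_id": "b2", "title": "Japanese Flowers", "permalink": "/b2/", "grid_image_url": "/img/b2.jpg"},
    ["japanese", "flowers"],
    indexes,
)

print(sorted(indexes))              # ['flowers', 'japanese', 'mountains']
print(sorted(indexes["japanese"]))  # ['a1', 'b2']
```

Each top-level key later becomes one index.json file, and the item records under it become the entries in that file.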

Next we need an item_index_terms() function that can pull out relevant search terms for an individual content item:

import re

def item_index_terms(item: Dict) -> List[str]:
    return list(set(
        finalise_terms(re.split(r"\b", item["title"]))
        + finalise_terms(item["tags"])
    ))

This is simplified, but you can do things like the regex word boundary split and combine various taxonomy lists on the content item to piece together a list of possible search terms.
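
As a quick illustration of why the split results need tidying afterwards, this is roughly what a word boundary split gives you on Python 3.7 or later. The sample title is made up, and the isalnum() filter is only a stand-in for the fuller clean-up:

```python
import re

parts = re.split(r"\b", "Mount Fuji, Japan")
# `parts` holds the words, but also the gaps between them (" " and ", ")
# and empty strings at the edges of the string, so it needs filtering.
words = [p.lower() for p in parts if p.isalnum()]
print(words)  # ['mount', 'fuji', 'japan']
```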

The finalise_terms() function expands, filters and sanitises the terms. We need this to remove empty terms, punctuation, short terms, and so on:

import re
import string
import unicodedata

STOP_WORDS = ("for", "the", "with", "and") # etc

def finalise_terms(terms: List[str]) -> List[str]:
    final_terms = []
    for item in terms:
        for individual_term in re.split(r"\b", item):
            tidied_term = tidy_term(individual_term)
            if len(tidied_term) > 2 and tidied_term not in STOP_WORDS:
                for expanded in expand_term(tidied_term):
                    final_terms.append(tidy_term(expanded))
    return final_terms

def tidy_term(term: str) -> str:
    # Strip accents (NFD + ASCII), lower-case, then drop punctuation and whitespace.
    # Python's re module has no \p{P} class, so punctuation is removed via translate().
    return (
        unicodedata.normalize("NFD", term)
        .encode("ascii", "ignore")
        .decode("ascii")
        .lower()
        .translate(str.maketrans("", "", string.punctuation + string.whitespace))
    )

While we’re finalising the terms, we can also expand them. This means e.g. adding "apples" for "apple" and vice versa. In my case I also used some country data to add demonyms ("france" -> "french") and vice versa, and also regional names ("japan" -> "asia"), as these are relevant for searches in my use case.

The basic expansion function looks like this, using the pluralizer package:

from pluralizer import Pluralizer

PLURAL = Pluralizer()

def expand_term(term: str) -> List[str]:
    expanded = {
        term,
        PLURAL.pluralize(term),
        PLURAL.singular(term),
    }
    return [t for t in list(expanded) if t]

Note that we’re doing a list(set()) cast on the final list of terms, so duplicates don’t matter here.
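
The demonym and region expansion mentioned above could be layered on in a similar way. This is only a sketch: the RELATED_TERMS table and the expand_with_related() name are hypothetical, and real data would come from a country dataset rather than a hand-written dict:

```python
from typing import Dict, List, Set

# Hypothetical lookup table mapping a term to its demonym, country or region.
RELATED_TERMS: Dict[str, Set[str]] = {
    "japan": {"japanese", "asia"},
    "japanese": {"japan", "asia"},
    "france": {"french", "europe"},
    "french": {"france", "europe"},
}


def expand_with_related(term: str) -> List[str]:
    """Return the term itself plus any related names from the lookup table."""
    return sorted({term} | RELATED_TERMS.get(term, set()))


print(expand_with_related("japan"))    # ['asia', 'japan', 'japanese']
print(expand_with_related("flowers"))  # ['flowers']
```

Terms with no entry in the table pass through unchanged, so this can be chained after the plural expansion without special cases.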

As always, YMMV. These transformations and expansions work OK for my use case, but the nice thing about this DIY search is that you can make it do whatever you want.

This builds the whole search index in memory at once, but that’s probably fine as long as your data fits in memory.

With that, you just need to write out the index.json files from the Python dictionary in memory. I have this write to the static/ directory in the Hugo project, and then ignore that in .gitignore. This has the advantage of being included by hugo serve when you’re working on the site locally, and being built into the public/ directory when you run hugo before deployment to S3.

The index file writing can be as simple as this:

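import json
import os
import pathlib
import shutil
from typing import Dict

# Assumed: DIRNAME is the directory of this script, as in content_items_glob().
DIRNAME = os.path.dirname(__file__)
SEARCH_DIR = os.path.join(DIRNAME, "../static/search")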

def build_search_indexes():
    # ... index building logic above ...

    if os.path.isdir(SEARCH_DIR):
        shutil.rmtree(SEARCH_DIR)
    
    for term, index in indexes.items():
        write_index_file(term, index)

def write_index_file(term: str, index: Dict) -> None:
    index_path = os.path.join(SEARCH_DIR, term, "index.json")
    pathlib.Path(os.path.dirname(index_path)).mkdir(parents=True, exist_ok=True)
    with open(index_path, "w") as out:
        json.dump(list(index.values()), out, indent=None, separators=(",", ":"))

It uses the indent= and separators= kwargs to get the output JSON as minified as possible, which might save a few bytes when the client downloads them later.
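
To see what those kwargs actually change, here is a tiny comparison using a made-up record. The saving is just the spaces that json normally puts after commas and colons:

```python
import json

records = [{"t": "Lovely Item Title 1", "p": "https://website.tld/lovely-item-1/"}]

default = json.dumps(records)
compact = json.dumps(records, indent=None, separators=(",", ":"))

# Default separators are ", " and ": ", so each one costs an extra byte.
print(len(default) - len(compact))  # 3
```

The difference per record is small, but it adds up a little across index files that hold hundreds of records.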

Building and writing index files in this way for a few thousand content items with a few tens of thousands of search terms takes less than 2 seconds on my average spec laptop.

You end up with a directory structure like this:

tree static/

static/
└── search
    ├── alphonse
    │   └── index.json
    ├── flowers
    │   └── index.json
    ├── japanese
    │   └── index.json
    ├── mountains
    │   └── index.json
    ├── spacecraft
    │   └── index.json
    └── zoology
        └── index.json

The contents of each of those index.json files might look like this (but as minified JSON):

[
  {
    "t": "Lovely Item Title 1",
    "p": "https://website.tld/lovely-item-1/",
    "i": "https://website.tld/img/lovely-item-1.jpg"
  },
  {
    "t": "Lovely Item Title 2",
    "p": "https://website.tld/lovely-item-2/",
    "i": "https://website.tld/img/lovely-item-2.jpg"
  }
]

Deploying search index files with the static site to S3

As mentioned above, the index.json files are written to the static/ directory in the Hugo project, at static/search. This directory is ignored in .gitignore as it will only contain these generated search index files.

This means that when you run either hugo serve or hugo, the search index files get included in the built site, so it works for local development and for deployment to S3.

In my case, after building the search index files (described above), they have a total size of about 17 MB on disk:

du -h --max-depth=0 static/search/
# 17 MB

This might seem like quite a lot, but that data is spread across thousands of separate index files, each of which is usually less than 10 KB. The client will only be downloading a couple of those small files at a time. The largest single index file is 87 KB, which is smaller than many image and asset files on a lot of websites. If you have CloudFront in front of S3, you can also have these plain JSON files gzipped for transfer to the client, reducing the size even more.
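
If you want a rough feel for how much gzip might help before setting anything up, you can compress a sample payload locally. This sketch uses generated records shaped like the index entries, not real data, so the ratio you see on your own files will differ:

```python
import gzip
import json

# Made-up index payload with repetitive URLs, similar in shape to a real index file.
records = [
    {
        "t": f"Lovely Item Title {n}",
        "p": f"https://website.tld/lovely-item-{n}/",
        "i": f"https://website.tld/img/lovely-item-{n}.jpg",
    }
    for n in range(200)
]

raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
compressed = gzip.compress(raw)

# Sizes in bytes, before and after compression.
print(len(raw), len(compressed))
```

The repeated keys and URL prefixes are what make these files compress well.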

With the index files generated, the deployment can then be as simple as this:

hugo
aws s3 sync public/ s3://yourbucket.tld/ --acl=public-read

Search page Hugo template

You can create a dedicated search page at e.g. content/search/index.md in your Hugo project. The content is just:

---
title: Search
layout: search
slug: search
---

Then you need an HTML template file for it at e.g. themes/foobar/layouts/_default/search.html:

{{ define "main" }}

    <h1 class="item-grid-page-title" id="search-title">
        <span class="search-query"></span>
        Foobar Search
    </h1>

    <div class="item-grid" id="searchGrid">
    </div>

{{ end }}

{{ define "extraScript" }}
    {{ $searchScript := resources.Get "js/search.js" | babel | resources.Minify | resources.Fingerprint }}
    <script async defer type="application/javascript" src="{{ $searchScript.RelPermalink }}"></script>
{{ end }}

Note the "extraScript" content block used there. I find this a useful way to have arbitrary script blocks on individual templates that don’t need to appear across the site.

This template also uses Hugo’s JS compilation tools to build the search.js source file into the site.

JS script on search page

The search is handled by JS on the front-end, from a small script that can be placed at e.g. themes/foobar/assets/js/search.js.

This JS script takes the search terms from the URL, tries to fetch an index.json file for each one, merges and scores the items, sorts them by score, and finally inserts an element into the DOM for each one. In practice this is perfectly fast and straightforward with vanilla JS.

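First we get the search terms and process them a little bit. These snippets all sit inside one async function, which is why a bare return and await can be used: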

const urlSearchParams = new URLSearchParams(window.location.search);
if (!urlSearchParams.has("q")) {
    return;
}
const searchQuery = urlSearchParams.get("q")
    .replace(/[\p{P}$+<=>^`|~]/gu, '')
    .replace(/\s+/g, ' ')
    .trim();
let searchTerms = searchQuery
    .split(/\b/)
    .map(t => t.trim().toLowerCase())
    .filter(t => t.length > 2);
searchTerms = [...new Set(searchTerms)];
searchTerms = searchTerms.slice(0, 5);

This tidies up the search terms a little bit, and limits them to a maximum of 5. That limit is not essential and would vary by use case, but it helps to avoid a horribly slow experience and wasted bandwidth if a user pastes in a wall of text or something like that.

The next thing the JS script does is update the page a little bit to show the user what they’re searching for:

document.title = `“${searchQuery}” Search`;
document.querySelectorAll("input[name=q]").forEach(inp => inp.value = searchQuery);
document.querySelectorAll(".search-query").forEach(el => el.innerHTML = `&ldquo;${searchQuery}&rdquo;`);

The most important part of the JS search script is fetching, merging, scoring and sorting the items we find for the user’s search:

const allIndexes = await Promise.all(searchTerms.map(t => {
    return fetch(`/search/${t}/index.json`)
        .then(res => {
            if (res && res.status === 200) {
                return res.json();
            }
            return [];
        })
        .then(json => json || [])
}));
const items = {};
for (const index of allIndexes) {
    for (const item of index) {
        if (!(item["t"] in items)) {
            items[item["t"]] = {
                score: 0,
                ...item,
            }
        }
        items[item["t"]]["score"]++;
    }
}
const itemsToDisplay = Object.values(items)
    .sort((a, b) => b["score"] - a["score"])
    .slice(0, 15);

Again there are some fairly arbitrary decisions in here, such as limiting the search results to 15 items. This works OK for my use case, but you might want to implement pagination instead.
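
It can be handy to run the same merge and scoring logic offline against the generated files, for example to sanity-check the indexes before deploying. This is a sketch of a Python mirror of the JS above, not part of the site itself; the search_indexes() name and its parameters are hypothetical:

```python
import json
from pathlib import Path
from typing import Dict, List


def search_indexes(search_dir: str, terms: List[str], limit: int = 15) -> List[Dict]:
    """Merge the per-term index files and rank items by how many terms matched."""
    items: Dict[str, Dict] = {}
    for term in terms:
        index_path = Path(search_dir) / term / "index.json"
        if not index_path.is_file():
            # Mirrors the JS treating a non-200 response as an empty index.
            continue
        for item in json.loads(index_path.read_text(encoding="utf-8")):
            # Items are keyed by title, as in the JS version.
            entry = items.setdefault(item["t"], {"score": 0, **item})
            entry["score"] += 1
    # sorted() is stable, so items with equal scores keep their first-seen order.
    ranked = sorted(items.values(), key=lambda entry: entry["score"], reverse=True)
    return ranked[:limit]
```

Calling something like search_indexes("static/search", ["japanese", "mountains"]) should then return the same items, in the same order, as the search page would show for that query.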

Rendering search results with JS

Finally, the JS script creates an element for each search result item and inserts it into the DOM:

// Look up the grid container defined in the search template.
const searchGrid = document.getElementById("searchGrid");
for (const item of itemsToDisplay) {
    const itemCell = document.createElement("a");
    itemCell.href = item["p"];
    itemCell.classList.add("item-grid-item");
    itemCell.innerHTML = `
    <img class="item-grid-image" src="${item["i"]}" alt="${item["t"]}">
    <h2 class="item-grid-item-title">${item["t"]}</h2>
`;
    searchGrid.append(itemCell);
}

This is how the search feature on Moment Wall Art works.

For example, this search for “japanese mountains” fetches the index files at /search/japanese/index.json and /search/mountains/index.json, creates a combined score for each item in those indexes together, and sorts the merged items by that score. Finally, it inserts elements into the item grid using document.createElement(). The whole thing feels fast and responsive.

NotesToSelf.Dev