Static sites are great for their speed and ease of deployment at low cost. One
thing they might seem to miss out on is full-text search, but you can actually
achieve reasonable full-text search functionality on a static site with a fairly
simple solution.
This approach can be expanded in various ways to offer more functionality, but
the basic core of it is:
- Use Python to iterate all of the searchable items, and create an index.json file for each term found.
- Deploy those index.json files with the rest of the static site to S3.
- Create a static file at /search/index.html with a JS script to handle the search behaviour.
- That JS script tries to fetch the index.json file for each search term, combines all the items found, scores them by how many index files they appear in, and sorts them by that score.
- Finally the JS script inserts a DOM element for each item using the data it got from the index.json files.
You can see a demo of this static site search here: https://momentwallart.co.uk/search/?q=japanese+mountains
Let’s look at each of those stages in a bit more detail.
Building search index files with Python
It’s handy to have a re-usable content file iterator in Python for working with
markdown files in Hugo or other static site generators:
import os
import glob
from itertools import islice
from typing import List


def content_items_glob(limit: int = 10000) -> List[str]:
    return list(
        islice(
            (
                os.path.abspath(p)
                for p in glob.glob(
                    os.path.join(
                        os.path.dirname(__file__),
                        "../content/**/*.md"
                    ),
                    recursive=True
                )
            ),
            limit,
        )
    )
With that it’s easy to iterate all content item paths wherever you need to.
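For example, you can re-use it anywhere in your build tooling (just an illustrative snippet, not part of the indexing code itself):

# Quick sanity check of how many content items will be picked up
print(len(content_items_glob()))

# Or grab a handful of paths while testing
for path in content_items_glob(limit=5):
    print(path)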
The other function that’s useful is one for loading a content item from a
markdown file so that you can interact with it in Python. This uses the
frontmatter package to load the frontmatter YAML as a Python dict, but you
could handle that yourself if you prefer:
import frontmatter
from typing import Dict, Union


def load_content_item(content_item_path: str) -> Union[frontmatter.Post, Dict]:
    return frontmatter.load(content_item_path)
You can then do something like this to create the search indexes as a Python
dictionary. As with the rest, this is a little bit simplified for the example:
def build_search_indexes():
    indexes = {}
    for path in content_items_glob():
        index_content_item(load_content_item(path), indexes)
The job of index_content_item() is to extract the individual search terms for
that item, and add the item to each relevant index key:
def index_content_item(item: frontmatter.Post, indexes: Dict) -> None:
    terms = item_index_terms(item)
    for term in terms:
        if term not in indexes:
            indexes[term] = {}
        indexes[term][item["item_id"]] = index_item(item)
The index_item() function produces a minimal dict of the item data that is
relevant for showing it on the search page:
def index_item(item: Dict[str, str]) -> Dict[str, str]:
    return {
        "t": item["title"],
        "p": item["permalink"],
        "i": item["grid_image_url"],
    }
This uses single letter keys to try and shave off some more bytes in the final
index JSON file, which is maybe pointless, but why not.
Next we need an item_index_terms() function that can pull out relevant
search terms for an individual content item:
import re


def item_index_terms(item: Dict) -> List[str]:
    return list(set(
        finalise_terms(re.split(r"\b", item["title"]))
        + finalise_terms(item["tags"])
    ))
This is simplified, but you can do things like the regex word boundary split
and combine various taxonomy lists on the content item to piece together a list
of possible search terms.
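For a rough illustration, with a hypothetical content item (the exact terms depend on the finalise_terms() and expand_term() logic described below):

# Hypothetical item; plain dict access works the same way as a frontmatter Post
item = {"title": "Japanese Mountains at Dusk", "tags": ["japan", "landscape"]}
item_index_terms(item)
# => something like ["japanese", "mountains", "dusk", "japan", "landscape", ...]
#    (order varies because of the set() dedup; expansions add plurals etc.)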
The finalise_terms() function expands, filters and sanitises the terms. We
need this to remove empty terms, punctuation, short terms, and so on:
import re
import string
import unicodedata

STOP_WORDS = ("for", "the", "with", "and")  # etc


def finalise_terms(terms: List[str]) -> List[str]:
    final_terms = []
    for item in terms:
        for individual_term in re.split(r"\b", item):
            tidied_term = tidy_term(individual_term)
            if len(tidied_term) > 2 and tidied_term not in STOP_WORDS:
                for expanded in expand_term(tidied_term):
                    final_terms.append(tidy_term(expanded))
    return final_terms


def tidy_term(term: str) -> str:
    term = (
        unicodedata.normalize("NFD", term)
        .encode("ascii", "ignore")
        .decode("utf-8")
        .lower()
        .translate(str.maketrans("", "", string.punctuation + string.whitespace + "‘’"))
    )
    return term.strip()
While we’re finalising the terms, we can also expand them. This means e.g.
adding "apples" for "apple" and vice versa. In my case I also used some
country data to add demonyms ("france" -> "french") and vice versa, and also
regional names ("japan" -> "asia"), as these are relevant for searches in my
use case.
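A sketch of that country-based expansion might look something like this, where COUNTRY_DEMONYMS and COUNTRY_REGIONS are hypothetical lookup tables standing in for whatever country data you use:

# Hypothetical lookup tables; in practice these might come from a country data package or CSV
COUNTRY_DEMONYMS = {"france": "french", "japan": "japanese"}
COUNTRY_REGIONS = {"france": "europe", "japan": "asia"}


def expand_country_terms(term: str) -> List[str]:
    expanded = [term]
    if term in COUNTRY_DEMONYMS:
        expanded.append(COUNTRY_DEMONYMS[term])
    if term in COUNTRY_REGIONS:
        expanded.append(COUNTRY_REGIONS[term])
    # Reverse lookup, e.g. "french" -> "france"
    for country, demonym in COUNTRY_DEMONYMS.items():
        if term == demonym:
            expanded.append(country)
    return expanded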
The basic expansion function looks like this, using the pluralizer package:
from pluralizer import Pluralizer

PLURAL = Pluralizer()


def expand_term(term: str) -> List[str]:
    expanded = {
        term,
        PLURAL.pluralize(term),
        PLURAL.singular(term),
    }
    return [t for t in list(expanded) if t]
Note that we’re doing a list(set()) cast on the final list of terms, so
duplicates don’t matter here.
As always, YMMV. These transformations and expansions work OK for my use case,
but the nice thing about this DIY search is that you can make it do whatever you
want.
This builds the whole search index in memory at once, but that's fine as long
as your data fits comfortably in memory.
With that, you just need to write out the index.json files from the Python
dictionary in memory. I have this write to the static/ directory in the Hugo
project, and then ignore that in .gitignore. This has the advantage of being
included by hugo serve when you’re working on the site locally, and being
built into the public/ directory when you run hugo before deployment to S3.
The index file writing can be as simple as this:
import json
import os
import pathlib
import shutil

DIRNAME = os.path.dirname(__file__)
SEARCH_DIR = os.path.join(DIRNAME, "../static/search")


def build_search_indexes():
    # ... index building logic above ...
    if os.path.isdir(SEARCH_DIR):
        shutil.rmtree(SEARCH_DIR)
    for term, index in indexes.items():
        write_index_file(term, index)


def write_index_file(term: str, index: Dict) -> None:
    index_path = os.path.join(SEARCH_DIR, term, "index.json")
    pathlib.Path(os.path.dirname(index_path)).mkdir(parents=True, exist_ok=True)
    with open(index_path, "w") as out:
        json.dump(list(index.values()), out, indent=None, separators=(",", ":"))
It uses the indent= and separators= kwargs to get the output JSON as
minified as possible, which might save a few bytes when the client downloads
them later.
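To illustrate the difference the compact separators make:

import json

data = [{"t": "Lovely Item Title 1", "p": "https://website.tld/lovely-item-1/"}]
json.dumps(data)                         # '[{"t": "Lovely Item Title 1", "p": ...}]' (spaces after : and ,)
json.dumps(data, separators=(",", ":"))  # '[{"t":"Lovely Item Title 1","p":...}]' (no extra whitespace)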
Building and writing index files in this way for a few thousand content items
with a few tens of thousands of search terms takes less than 2 seconds on my
average spec laptop.
You end up with a directory structure like this:
tree static/
static/
└── search
    ├── alphonse
    │   └── index.json
    ├── flowers
    │   └── index.json
    ├── japanese
    │   └── index.json
    ├── mountains
    │   └── index.json
    ├── spacecraft
    │   └── index.json
    └── zoology
        └── index.json
The contents of each of those index.json files might look like this (but as
minified JSON):
[
  {
    "t": "Lovely Item Title 1",
    "p": "https://website.tld/lovely-item-1/",
    "i": "https://website.tld/img/lovely-item-1.jpg"
  },
  {
    "t": "Lovely Item Title 2",
    "p": "https://website.tld/lovely-item-2/",
    "i": "https://website.tld/img/lovely-item-2.jpg"
  }
]
Deploying search index files with the static site to S3
As mentioned above, the index.json files are written to the static/
directory in the Hugo project, at static/search. This directory is ignored in
.gitignore as it will only contain these generated search index files.
This means that when you run either hugo serve or hugo , the search index
files get included in the built site, so it works for local development and for
deployment to S3.
In my case, after building the search index files (described above), they have
a total size of about 17 MB on disk:
du -h --max-depth=0 static/search/
# 17M    static/search/
This might seem like quite a lot, but that data is separated across thousands of
separate index files, each of which is usually less than 10 KB. The client will
only be downloading a couple of those small files at a time. The largest single
index file is 87 KB, which is smaller than many image and asset files on a lot
of websites. If you have CloudFront in front of S3, you can also have these
plain JSON files gzipped for transfer to the client, reducing the size even
more.
With the index files generated, the deployment can then be as simple as this:
hugo
aws s3 sync public/ s3://yourbucket.tld/ --acl=public-read
Search page Hugo template
You can create a dedicated search page at e.g. content/search/index.md in
your Hugo project. The content is just:
---
title: Search
layout: search
slug: search
---
Then you need an HTML template file for it at e.g.
themes/foobar/layouts/_default/search.html:
{{ define "main" }}
<h1 class="item-grid-page-title" id="search-title">
<span class="search-query"></span>
Foobar Search
</h1>
<div class="item-grid" id="searchGrid">
</div>
{{ end }}
{{ define "extraScript" }}
{{ $searchScript := resources.Get "js/search.js" | babel | resources.Minify | resources.Fingerprint }}
<script async defer type="application/javascript" src="{{ $searchScript.RelPermalink }}"></script>
{{ end }}
Note the "extraScript" content block used there. I find this a useful way to
have arbitrary script blocks on individual templates that don’t need to appear
across the site.
This template also uses Hugo’s JS compilation tools to build the search.js
source file into the site.
JS script on search page
The search is handled by JS on the front-end, from a small script that can be
placed at e.g. themes/foobar/assets/js/search.js.
This JS script takes the search terms from the URL, tries to fetch an index.json
file for each one, merges and scores the items, sorts them by score, and finally
inserts an element into the DOM for each one. In practice this is perfectly fast
and straightforward with vanilla JS.
First we get the search terms and process them a little bit:
const urlSearchParams = new URLSearchParams(window.location.search);
if (!urlSearchParams.has("q")) {
    return;
}

const searchQuery = urlSearchParams.get("q")
    .replace(/[\p{P}$+<=>^`|~]/gu, '')
    .replace(/\s+/g, ' ')
    .trim();

let searchTerms = searchQuery
    .split(/\b/)
    .map(t => t.trim().toLowerCase())
    .filter(t => t.length > 2);
searchTerms = [...new Set(searchTerms)];
searchTerms = searchTerms.slice(0, 5);
This tidies up the search terms a little bit, and limits them to a maximum of 5.
That limit is not essential and would vary by use case, but it helps to avoid a
horribly slow experience and wasted bandwidth if a user pastes in a wall of text
or something like that.
The next thing the JS script does is update the page a little bit to show the
user what they’re searching for:
document.title = `“${searchQuery}” Search`;
document.querySelectorAll("input[name=q]").forEach(inp => inp.value = searchQuery);
document.querySelectorAll(".search-query").forEach(el => el.innerHTML = `“${searchQuery}”`);
The most important part of the JS search script is fetching, merging, scoring
and sorting the items we find for the user’s search:
const allIndexes = await Promise.all(searchTerms.map(t => {
    return fetch(`/search/${t}/index.json`)
        .then(res => {
            if (res && res.status === 200) {
                return res.json();
            }
            return [];
        })
        .then(json => json || []);
}));

const items = {};
for (const index of allIndexes) {
    for (const item of index) {
        if (!(item["t"] in items)) {
            items[item["t"]] = {
                score: 0,
                ...item,
            };
        }
        items[item["t"]]["score"]++;
    }
}

const itemsToDisplay = Object.values(items)
    .sort((a, b) => b["score"] - a["score"])
    .slice(0, 15);
Again there are some somewhat arbitrary decisions in here, such as limiting the
search results to 15 items. This works OK for my use case, but you might want to
do something like implement pagination etc.
Rendering search results with JS
Finally, the JS script creates an element for each search result item and
inserts it into the DOM:
const searchGrid = document.getElementById("searchGrid");

for (const item of itemsToDisplay) {
    const itemCell = document.createElement("a");
    itemCell.href = item["p"];
    itemCell.classList.add("item-grid-item");
    itemCell.innerHTML = `
        <img class="item-grid-image" src="${item["i"]}" alt="${item["t"]}">
        <h2 class="item-grid-item-title">${item["t"]}</h2>
    `;
    searchGrid.append(itemCell);
}
This is how the search feature on Moment Wall Art works.
For example, this search for “japanese mountains” fetches the index files at
/search/japanese/index.json and /search/mountains/index.json, builds a
combined score for each item found across those indexes, and sorts the merged
items by that score. Finally, it inserts elements into the item grid using
document.createElement(). The whole thing feels fast and responsive.