
Elasticsearch Blog Search

One of the main drawbacks I can think of to a static blog is the lack of an easy search function. Because all the files are pre-generated HTML, CSS, and JS, there’s no server-side interpreted language that can perform actions and no database of posts that can be filtered. I decided to change this and put together a little proof of concept on my local machine for how it would work.

If you’re running a Hugo blog, you can repeat my little experiment yourself! If not, you can follow along and modify parts of it to work with your own system, as this should work with platforms like Jekyll too.

Elasticsearch And Loading Posts

Firstly, set up an Ubuntu virtual machine with its own private IP and install Oracle’s Java 8 JDK (I recommend using the webupd8team PPA), then install a recent version of Elasticsearch. Following the linked docs should help you out. Lastly, make sure Python 3, pip, and Varnish are installed too. You’ll need two Python packages, which you can install with the following command:

pip3 install python-frontmatter elasticsearch
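
For reference, the earlier installs on Ubuntu look roughly like this (the Java package name assumes the webupd8team PPA; install Elasticsearch itself per its linked docs):

sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer varnish python3-pip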

Now check out your Hugo blog repository. Change into the directory and create a new file, blog.py (don’t name it elasticsearch.py, or it will shadow the elasticsearch package when imported), and fill it with the following:

import glob

import frontmatter
from elasticsearch import Elasticsearch

# Connect to the local Elasticsearch node (localhost:9200 by default).
es = Elasticsearch()

# Drop any existing index so the script is re-runnable from scratch;
# ignore the 404 that's raised when the index doesn't exist yet.
es.indices.delete(index='blog', ignore=[404])

# Read every Markdown post under content/ and index its frontmatter
# title along with the post body.
for file in glob.iglob('content/**/*.md', recursive=True):
    post = frontmatter.load(file)
    doc = {
        'title': post['title'],
        'text': post.content,
    }
    es.index(index="blog", doc_type='post', body=doc)

You can now run this file with python3 blog.py. What’s it doing? It walks your blog’s content directory; every Markdown file gets read and its “frontmatter” (metadata in YAML, TOML, or JSON format) is parsed into a dictionary. Each post is then turned into an Elasticsearch document and indexed for future searching.
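
To sanity-check the index, you can query the node directly (the search term here is just an example):

curl 'http://localhost:9200/blog/_search?q=*:hugo'

You should get back JSON with a hits object listing any matching posts.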

Configuring Varnish

Next up, you’ll want to edit your /etc/varnish/default.vcl file: change the backend port number to 9200 and insert the following inside the vcl_recv subroutine:

if (req.method != "GET" && req.method != "OPTIONS" && req.method != "HEAD") {
    /* We only deal with GET, OPTIONS, and HEAD by default */
    return (synth(405));
}
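
For reference, the backend definition near the top of the same file is what needs the port change; assuming Elasticsearch runs on the same machine as Varnish, it should look something like this:

backend default {
    .host = "127.0.0.1";
    .port = "9200";
}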

This limits us to only performing simple GET queries (and preflight checks for CORS) to prevent anyone in the world from indexing or deleting documents from our Elasticsearch node.

You’ll also want to add some Access-Control headers, so insert the following into the vcl_deliver portion:

set resp.http.Access-Control-Allow-Origin = "*";
set resp.http.Access-Control-Allow-Methods = "GET, OPTIONS, HEAD";
set resp.http.Access-Control-Allow-Headers = "Origin, Accept, Content-Type, X-Requested-With, X-CSRF-Token";

This will allow us to make search queries in JavaScript from our statically hosted blog to our Elasticsearch node without the need for any server in the middle (except our reverse proxy, Varnish). Restart Varnish so it picks up the new configuration; on a systemd-based Ubuntu install that’s typically:
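
sudo systemctl restart varnish

Then continue on to the next section.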

Creating The Search Page

Next up, let’s create our search page. If you’re running a Hugo blog, you’ll probably want to create a file called search.html in the top-level static directory and fill it with something like this:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Search</title>
  <script src="../js/jquery-3.3.1.min.js"></script>
</head>
<body style="font-family:'Special Elite', serif; text-align: center;">
  <div style="width: 100%; text-align: center;">
    <h1>Search</h1>
  </div>
  <div class="search">
    <form id="search-form">
      <input id="search-bar" type="text" placeholder="Query">
      <input type="submit" value="Search">
    </form>
  </div>
  <content>

  </content>
  <script>
    $(function() {
      $('#search-form').submit(function(e) {
        // Stop the form from reloading the page; we query via AJAX instead.
        e.preventDefault();
        var search = $('#search-bar').val();
        $.ajax({
          // Varnish in front of Elasticsearch; use your own VM's IP here.
          url: "http://192.168.33.10:6081/blog/_search",
          type: "GET",
          data: {
            // Lucene query string syntax: match the term in any field.
            q: '*:' + search
          },
          success: function(response) {
            // Clear any previous results, then render each hit's title.
            $('content').text('');
            response.hits.hits.forEach(function(value) {
              var text = "<h2>" + value._source.title + "</h2>";
              $('content').append(text);
            });
          }
        });
      });
    });
  </script>
</body>
</html>

You’ll notice I’m using jQuery for this example, so make sure you have a similar version if you want to follow along. You’ll also want to replace the IP address with your own virtual machine’s IP address.

You should now be able to generate your site and upload it to your usual host. When you browse to the search page, you can enter searches and it will return up to 10 results showing the titles of blog posts containing those terms, ranked by Elasticsearch’s full-text relevance scoring.
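
For reference, the JavaScript above depends on the shape of Elasticsearch’s search response, which (heavily trimmed, and varying slightly between versions) looks something like:

{
  "hits": {
    "hits": [
      {
        "_source": {
          "title": "My First Post",
          "text": "..."
        }
      }
    ]
  }
}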

Further Work

Of course, there are bits missing here. We only have post titles, no links! That’s an exercise left up to the reader; it’ll depend on the structure of your blog, but you may want to look at how URLs are generated and then match that in your Python script, as sketched below.
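
Here’s a minimal sketch, assuming Hugo’s default “pretty” URLs where content/posts/my-post.md maps to /posts/my-post/ (your permalink configuration may differ). The url_for helper is hypothetical; you’d add its output to the doc dictionary in blog.py:

import os

def url_for(path):
    # content/posts/my-post.md -> /posts/my-post/
    relative = os.path.relpath(path, 'content')
    slug = os.path.splitext(relative)[0]
    return '/' + slug.replace(os.sep, '/') + '/'

doc = {
    'title': post['title'],
    'text': post.content,
    'url': url_for(file),  # assumes Hugo's default URL scheme
}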

You’ll also need to work out your own method of handling pagination of results; see the sketch below.
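
Elasticsearch’s _search endpoint already supports this via from and size parameters, so fetching the second page of ten results looks something like (using the earlier example IP and query):

curl 'http://192.168.33.10:6081/blog/_search?q=*:hugo&size=10&from=10'

You’d pass the same parameters through the data object in the AJAX call above.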

Lastly, consider how you’d turn this into a production setup. You’d probably want it hooked into CI/CD so that when you commit a new post, it rebuilds your site, SSHes into the Elasticsearch server, updates the local repo there, and runs the Python script again.
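
A minimal sketch of that deploy step, assuming a hypothetical es-host SSH alias and the blog checked out at ~/blog on the Elasticsearch server:

ssh es-host 'cd ~/blog && git pull && python3 blog.py'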

Conclusion

I’m not turning this into a production system for my own site, largely because Elasticsearch is expensive to run compared to my blog (it would take my hosting costs from US$0.75 to US$10.75 per month), and it was mostly an exercise to see what was possible.

There are other alternatives, though. If you have a relatively small blog, you could use a JavaScript-based, Lucene-inspired search library like Lunr. As part of your build process, a script could pull all the metadata, content, and links into a JSON document served alongside your search page, which is then fetched and searched client-side; a sketch of the build step follows. Keep in mind this scales worse the more posts you have, since they’ll all be loaded at once.
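
A minimal sketch of that build step, reusing python-frontmatter from earlier and writing to an assumed static/search-index.json (Hugo copies everything under static/ into the built site verbatim):

import glob
import json

import frontmatter

posts = []
for file in glob.iglob('content/**/*.md', recursive=True):
    post = frontmatter.load(file)
    posts.append({
        'title': post['title'],
        'text': post.content,
    })

with open('static/search-index.json', 'w') as f:
    json.dump(posts, f)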

Otherwise, you can do what I remembered after all of this: go to Google and type in “site:adamogrady.id.au keywords i wanted to look for”. Bam. Assuming your site is indexed and crawled by Google you’ll be able to find what you seek.