Migrating this website from Hugo to org-publish

Written by Sebastian Dümcke on September 9, 2024
Tags: emacs

This website was created as a static website using Hugo as the site generator. However, I migrated to Hugo at a time when its development pace was still very fast. Shortly after I finished designing the site, the newest version of Hugo broke the version of the themes I had selected. As a stopgap until I could make time to address the issue, I downloaded an older release from their website that was still compatible with my themes (luckily Hugo distributes binaries for each version).

Last month, as I had plans to rework my website’s content, I decided to address this situation. At the same time, having become very involved in Emacs org-mode (more on this in future posts), I really wanted a way to write website content with org-mode markup. One way to achieve this is to use ox-hugo. While researching the best way to go about this, I came across org-publish. I had read its documentation before without really understanding its purpose. This time around, however, it struck me: org-publish is a static site generator for org-mode! So I thought: since I have to put in effort to adopt a new Hugo version and update my themes anyway, moving to org-publish should not be much more effort. How wrong I was!

The rest of this post details my journey migrating from Hugo to org-publish, with code examples, and concludes with a summary of what was gained and lost and a report on the time spent on the migration.

Migrating old posts

As with all migrations, the first step is to convert from one markup format to another. For this part, pandoc was a lifesaver. Pandoc is a universal document converter that can convert between many different markup formats, in particular between markdown with front matter and org. However, to parse the metadata properly we need to provide pandoc with an additional template file, and to output the date in the format expected by org-mode we need a separate Lua filter. The command is then as follows:

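cd posts
for f in ../old_posts/*.md
do
    # Use the file name without its extension as the post slug.
    base=$(basename "$f" ".md")
    # One folder per post; -p keeps the loop re-runnable.
    mkdir -p "${base}"
    # Convert markdown with YAML front matter to index.org inside that folder.
    pandoc -f markdown+yaml_metadata_block -t org --template ../custom_org_pandoc_template -o "${base}"/index.org --lua-filter=../date-format.lua "$f"
done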

Notice that I create the org files as index.org inside a new folder named after the post title/slug. This is done so that I can have clean URLs without the .html extension. The template file and Lua filter are as follows:

$if(title)$
#+title: $title$
$endif$
$if(author)$
#+author: $for(author)$$author$$sep$; $endfor$
$endif$
$if(date)$
#+date: $date$
$endif$
$if(tags)$
#+filetags: :$for(tags)$$tags$$sep$:$endfor$:
$endif$
$if(draft)$
#+filetags: :draft:
$endif$

$body$

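-- source: https://stackoverflow.com/questions/72658259/how-do-i-format-date-in-pandoc-html-output
function Meta(meta)
  if meta.date then
    -- Expect an ISO 8601 timestamp such as 2024-09-09T10:30:00+02:00.
    local format = "(%d+)-(%d+)-(%d+)T(%d+):(%d+):(%d+).*"
    local y, m, d, h, min, s = pandoc.utils.stringify(meta.date):match(format)
    -- Leave the date untouched if it does not match the expected format.
    if not y then
      return meta
    end
    -- Pass the time of day as well, otherwise os.time defaults to 12:00.
    local date = os.time({
      year = y,
      month = m,
      day = d,
      hour = h,
      min = min,
      sec = s,
    })
    -- Same layout as the inside of an org timestamp, e.g. 2024-09-09 Mon 10:30.
    local date_string = os.date("%Y-%m-%d %a %H:%M", date)

    meta.date = pandoc.Str(date_string)
    return meta
  end
end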

This got me most of the way there, with some issues to fix manually such as the use of Hugo shortcodes and some links without the proper protocol.
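
A quick way to locate such leftovers is to search the converted files. The following is a minimal sketch rather than the exact procedure used for this site. It assumes GNU or BSD grep (for -r and --include), that shortcodes survived the conversion as literal {{< ... >}} or {{% ... %}} text, and that the protocol-less links point at www. hosts. The function name find_leftovers is made up for illustration.

```shell
#!/bin/sh
# List lines in converted .org files that still need manual attention.
# Usage: find_leftovers [directory]   (defaults to "posts")
find_leftovers() {
    dir="${1:-posts}"
    # Hugo shortcodes look like {{< name ... >}} or {{% name ... %}}.
    grep -rnE --include='*.org' '\{\{[<%]' "$dir" || true
    # Org links whose target starts with "www." are missing a protocol.
    grep -rnE --include='*.org' '\[\[www\.' "$dir" || true
}
```

Calling find_leftovers posts from the site root prints one path:line:text entry per hit, and nothing at all once every file is clean.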

New website design

The step I spent the longest on was replicating the parts of the Hugo themes I liked and understanding the way org aggregates posts. In this new iteration of my website I wanted a very light presentation with a focus on readability. For this I researched minimal CSS files and settled on sp.css from susam.net. I took the time to change the default colours and create a colour palette to use as “corporate ID” across web and email. Using such a minimal CSS also means removing any custom fonts. Instead, the stylesheet asks the browser to use Georgia, Garamond or any serif font of its choosing. This means the website might look a bit different on different devices. I can live with that. It avoids serving fonts, which reduces page load times, and helps readability.

It was also important to me that the HTML used behind the scenes is semantic, i.e. uses specific tags that carry meaning. Here I found that org-publish did a good job by default and I had to change very little. I also believe this part improved in comparison to my previous set-up.

Unlike Hugo, org-publish does not use a template engine. This has made things more challenging in many ways; the new site still lacks pagination, for example.

As a result I am quite happy with the new presentation: good readability on mobile and desktop, a dark theme for reading at night (or for people who prefer it) and no fluff. The CSS file is under 2.5 kB in total, compared to 688 kB previously (a reduction of more than 99.6%).

org-publish set-up

It took a while, many readings of the documentation and inspiration from several blog posts to fully grasp the org-publish configuration that results in the expected outcome. Amongst other things I had to create 3 custom filter functions: one to enable clean URLs (copied from the source credited in the code below), and another 2 so that, when I link to other posts or to images from the asset folder, I can use my Emacs completion framework to find the file and, after export, the link will be relative to the correct path in the web root. With this in place, though, I can now display images in-line while composing posts, which is great.

Then I had to wrangle org-publish to find a way to include the author and time stamp under the title of each post. This was only possible by parsing the final HTML result and inserting the required markup into it. So this fourth filter function had to be added to org-export-filter-final-output-functions.

Understanding the org-publish sitemap feature took a while (and I still do not fully understand it). It is org-publish’s mechanism for aggregating posts into an index. For this it creates another org file that contains a list of all posts (or files in the input directory). However, my experience was that this file also gets exported, even if it matches the exclusion criteria. Here I wrote a function that formats the list of posts by adding the time string, because this was the way it was previously and I came to like it. I then included the sitemap org file in my main index.org file to show the list of all posts. Note that the sitemap cannot do pagination. Once the list of posts grows too long, my best recourse will be to cut it down to the N most recent posts.

Each page gets a pre-amble and a post-amble. The pre-amble corresponds to the header and contains the site logo/title and navigation. The post-amble is the footer and in my case contains the copyright and licensing notice as well as the imprint and data confidentiality notices. To leave out the pre-amble, i.e. the title and navigation, on the main page (index.org), I excluded this page from the project that publishes all pages and created a separate project that publishes just the main index.org and overrides the pre-amble with the empty string. The complete publish.el file with all configuration is below, edited to remove personalisations. Take what you need.

(package-initialize)
(require 'ox-publish)
;own variables
(setq samd/posts-publish-directory "blog/"
      samd/asset-directory-name "static/") ;trailing slash is important
;settings source: https://orgmode.org/manual/Publishing-options.html
;;Generic Properties
(setq user-full-name "Your Name"
      org-export-default-language "en"
      org-export-headline-levels 5 ;Instead of 3, set the maximum headline level to 5. This matches the HTML standard of having six headline levels, when counting the document title as the first, leaving five. source: https://jeffkreeftmeijer.com/emacs-configuration/#org
      org-export-preserve-breaks nil ;default. Do not preserve line breaks in text 
      org-export-with-section-numbers nil
      org-export-timestamp-file nil ;Non-nil inserts an HTML comment with the timestamp of when the file was created
      org-export-with-author nil 
      org-export-with-creator nil
      org-export-with-broken-links nil 
      org-export-with-drawers nil
      org-export-with-email nil
      org-export-with-smart-quotes t
      org-export-with-special-strings t
      org-export-with-statistics-cookies nil
      org-export-with-sub-superscripts nil
      org-export-with-toc nil
      org-export-with-tags t 
      org-export-with-title t ;set to nil to keep the title out of a header element, so that it can be added in the preamble instead (grouping title and additional data in the same <header> tag)
      )
;;HTML specific properties
(setq 
      org-html-doctype "html5"
      org-html-container-element "section" ;used for wrapping top-level blocks (in this case only applies to pages). ;gets applied around all (level 1?) headings in org file
      org-html-html5-fancy t ;use new html5 elements
      org-html-divs '((preamble  "header" "preamble")
                      (content   "main" "content")
                      (postamble "footer" "postamble"))
      org-html-head "<link rel=\"stylesheet\" type=\"text/css\" href=\"/css/sp.css\" /><link rel=\"shortcut icon\" href=\"http://sam-d.com/favicon.ico\">"
      org-html-head-include-default-style nil
      org-html-head-include-scripts nil
      org-html-inline-images t
      org-html-with-latex nil ;set to t for MathJax  
      org-html-validation-link nil
      org-html-postamble "<span class=\"hr\"></span><footer style=\"text-align:center\">
        All content licensed <a href=\"https://creativecommons.org/licenses/by/4.0/\">CC-BY</a> unless otherwise indicated except for any pictures which are © of the author (%a)<br /><a href=/impressum.html>Impressum</a> <a href=/datenschutz.html>Datenschutz</a></footer>" ;global footer
      org-html-preamble "<h1 style=\"text-align:center\">Your title or logo</h1><span class=\"hr\"></span><nav><ul><li><a href=\"/\">Home</a></li><li><a href=\"/#about\">About</a></li><li><a href=\"/#projects\">Projects</a></li><li><a href=\"/#blog\">Blog</a></li></ul></nav>" 
      )

;the code below ensures clean URLs
;source: https://diego.codes/post/blogging-with-org/
(defun filter-local-links (link backend info)
  "Filter that converts all the /index.html links to /"
  (if (org-export-derived-backend-p backend 'html)
      ;; Escape the dot so that only a literal "/index.html" is rewritten.
      (replace-regexp-in-string "/index\\.html" "/" link)))

(defun samd/remove-assets-from-links (link backend info)
  "Filter that removes the asset directory from the link. This allows linking to assets by file system path and thus allows to e.g. inline images"
  (if (org-export-derived-backend-p backend 'html)
      (replace-regexp-in-string samd/asset-directory-name "" link)))

(defun samd/remove-posts/pages-from-links (link backend info)
  "Filter that removes the post or pages directory from the link.
This allows linking between posts and pages and vice-versa"
  (if (org-export-derived-backend-p backend 'html)
      (replace-regexp-in-string "pages/" "" (replace-regexp-in-string "posts/" "blog/" link))))

(defun samd/add-author-timestamp (content backend info)
  "Filter to add Author and timestamp information into the header tag containing the post title"
  (if (and (org-export-derived-backend-p backend 'html) (org-export-get-date info ))
      (let ((timestamp (org-export-get-date info "%Y-%m-%dT%T"))
            (timestring (org-export-get-date info "%B %e, %Y")))
        (replace-regexp-in-string "\\(<main.*\\(\n.*\\)*\\)</header>" (concat "\\1 <p class=\"info\">Written by "
                                            (org-export-data (plist-get info :author) info)
                                            " on <time datetime=\"" (org-export-data timestamp info) "\">" (org-export-data timestring info) "</time></p></header>")
                                  content))))
;; Do not forget to add the functions to the list!
(add-to-list 'org-export-filter-link-functions 'filter-local-links)
(add-to-list 'org-export-filter-link-functions 'samd/remove-assets-from-links)
(add-to-list 'org-export-filter-link-functions 'samd/remove-posts/pages-from-links)
(add-to-list 'org-export-filter-final-output-functions 'samd/add-author-timestamp)

(defun samd/org-publish-sitemap-as-list (entry style project)
  "copy of default function, changed to add path to blog in the link and date at the end of the entry"
  (cond ((not (directory-name-p entry))
         (format "[[file:%s/%s][%s]] - %s"
                 (concat "../" samd/posts-publish-directory)
                 entry
                 (org-publish-find-title entry project)
                 (format-time-string "%b %e, %Y" (org-publish-find-date entry project))))
        ((eq style 'tree)
         ;; Return only last subdir.
         (file-name-nondirectory (directory-file-name entry)))
        (t entry)))

(setq org-publish-project-alist
      `(("posts"
         :base-directory "./posts"
         :base-extension "org" ;default value
         :publishing-directory ,(concat "public/" samd/posts-publish-directory)
         :recursive t ;required because posts in subfolders
         :auto-sitemap t
         :sitemap-filename "index.inc"
         :sitemap-title ""
         :sitemap-style list
         :sitemap-format-entry samd/org-publish-sitemap-as-list
         :sitemap-sort-files anti-chronologically
         :publishing-function org-html-publish-to-html)
        ("static"
         :base-directory ,(concat "./" samd/asset-directory-name)
         :base-extension any
         :publishing-directory "public/"
         :publishing-function org-publish-attachment
         :recursive t)
        ("pages"
         :base-directory "./pages"
         :base-extension "org"
         :recursive t
         :exclude ,(rx (seq line-start "index.org")); exclude the homepage, e.g. top-level index.org file
         :publishing-directory "./public")
        ;we want to overwrite the html-preamble for the homepage only, so we exclude all other pages by using a file extension that does not exist 
        ("homepage"
         :base-directory "./pages"
         :base-extension "doesnotexist"
         :publishing-directory "./public"
         :include ("index.org")
         :html-preamble ""))) ;override preamble here!

Things that did not work in org-publish:

One cannot exclude a file from export by setting the org-export-exclude-tags variable. This went against my expectation based on the variable name and docstring. org-publish works on a file-by-file basis and inclusion/exclusion can only be handled based on file name or extension.
I could not find how to show a “teaser” or excerpt of each post in the sitemap.
I could not find a way to export an XML sitemap (yet).
I could not find a way to create pages with all posts associated with a certain tag (yet).
I was not yet able to generate coloured code highlighting, though it is clearly possible using the htmlize package.

Validating the migration

To ensure the website migration went smoothly and did not leave any broken links, I downloaded the sitemap.xml from the previous site and validated all links using curl:

grep -e loc sitemap.xml | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {} | grep -v '^200 '

I also checked the page load times before and after using Pingdom. You can see the results below: my website’s score, total page size and page load time all improved.

Figure 1: performance of the old website

Figure 2: performance of the new website

It now takes only 4 requests instead of 14 (probably due to removing custom fonts), page size has decreased by 89% and page load time by 43%. To further improve the score I should probably look into adding Expires headers to my static files, though that is more of a web server issue than an HTML/CSS one.

Conclusion

I believe that setting up the blog would have been faster if I had not migrated from a previous set-up and could have started with a clean slate. That way I could have created a different folder structure and leaned into the idiosyncrasies of org-publish. Having tracked my time working on the migration, I can say that I worked at least 23 hours and 25 minutes on this project. In hindsight this time would have been better spent writing posts.

What I gained

I can now write my posts using org markup and show images in-line in the editor. This would have been possible without migrating to org-publish. Conversely I am now tied to Emacs as an editor.
The website generation should still work across operating systems, as Emacs is cross-platform and also available for Windows and macOS. Same as Hugo.
The new website is much lighter. Again I could have achieved this without migrating to org-publish.
A better understanding of Emacs, org-publish and elisp (for what it is worth).

What I lost

Tag pages.
Pagination.
Code highlighting.
Speed of website generation (org-publish is much slower than Hugo).
Writing draft posts in the same folder.
Previewing draft posts with the built-in local HTTP server.

Verdict

In the end, I probably should have spent the few hours fixing my Hugo set-up and written more posts instead. 10/10 hindsight!