Overview

I have a dump in XML of my old Wordpress blog posts (exported in December 2019) and would like to convert them to markdown in order to add them to this site.

Parsing the Wordpress Export

My first step was to break the file down into one file per blog post. On Linux this can be done with:

awk -v FS='\n' -v RS="<title>" '/CDATA\[post\]/ {fname=$1".xml" ; gsub(/<\/title>/, "", fname) ; gsub(/: /,"-",fname) ; gsub(/&amp;/, "and", fname) ; gsub(/\//, "_or_", fname) ; gsub(/\(/,"",fname); gsub(/\)/,"",fname) ; gsub(/ /, "_", fname) ; print $0 > tolower(fname)  }' wordpress.xml

What a mouthful! Let's break down those awk arguments:

  • -v FS='\n': Separate fields by newline, overriding the default of whitespace, so that $1, $2, etc. correspond to lines within each record
  • -v RS="<title>": Split the input into records at each <title> tag, so every record starts with a title
  • /CDATA\[post\]/: Only process records that contain the string CDATA[post], i.e. items with a post_type of post; this excludes the pages, attachments, etc. that also exist in the Wordpress export
  • {fname=$1".xml": Set the output file name to the first line of the record (in this case, the post title from Wordpress) and append the extension .xml
    • gsub(/<\/title>/, "", fname): Remove </title> from the end of the post title
    • gsub(/: /,"-",fname): Replace colon+space with a dash
    • gsub(/&amp;/, "and", fname): Replace the ampersand entity with the word "and"
    • gsub(/\//, "_or_", fname): Replace slashes with "_or_"
    • gsub(/\(/,"",fname) and gsub(/\)/,"",fname): Remove opening and closing parentheses
    • gsub(/ /, "_", fname): Replace spaces with underscores
    • Note: Yes, I should have named my posts more simply back in the day
  • print $0 > tolower(fname): Write the matching record to the output file, with the file name in lower case
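
To see what those substitutions do to a single title, the same gsub chain can be run on its own. This is only a minimal sketch: title_to_filename is a hypothetical helper name and the sample title is made up; the substitutions themselves are copied from the command above.

```shell
# Hypothetical helper: apply the one-liner's file name clean-up to a single title.
title_to_filename() {
    printf '%s\n' "$1" | awk '{
        fname = $0 ".xml"
        gsub(/<\/title>/, "", fname)   # drop the closing tag
        gsub(/: /, "-", fname)         # colon+space -> dash
        gsub(/&amp;/, "and", fname)    # ampersand entity -> "and"
        gsub(/\//, "_or_", fname)      # slash -> "_or_"
        gsub(/\(/, "", fname)          # remove (
        gsub(/\)/, "", fname)          # remove )
        gsub(/ /, "_", fname)          # space -> underscore
        print tolower(fname)
    }'
}

# Made-up example title
title_to_filename 'Backups: Tar &amp; Rsync (Part 1)</title>'
# prints: backups-tar_and_rsync_part_1.xml
```

Running a few of your own titles through something like this first is a cheap way to spot characters that still need handling before any files are written.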

Note: To do this on a Mac, replace awk with gawk (which may need to be installed first); a multi-character RS is not handled the same way by every awk implementation.

The only issue I had with this was that my "Front Page" was technically a post, so I removed it from the resulting files.

Converting HTML to Markdown Link to heading

Now that the blog posts have been separated, they're still in XML format with a lot of Wordpress-specific gunk. The following script will:

  • Set the Title, Categories and Date in the page metadata for Hugo
  • Convert HTML formatting elements to markdown
  • Remove any unnecessary XML tags
  • Remove the Windows carriage-return control characters (^M) from the export

#!/bin/bash

file=$1

# Title = Line 1
# Date = <wp:post_date_gmt>
# Body = <content:encoded> to </content:encoded>

title_orig=$(head -n 1 "$file" |sed 's,</title>,,g')
title_parsed=$(echo "$title_orig" |sed -e "s,&gt;,>,g;s,&lt;,<,g;s,&#91;,[,g;s,&#8211;,-,g;s,&#8217;,',g;s,&#8220;,\",g;s,&#8221;,\",g;s,&#8230;,...,g;s,&amp;,\&,g;s,:,,g")

date=$(grep -m 1 post_date_gmt "$file" |sed 's/.*CDATA\[//g;s/\].*//g')

# Collect and sanitise category names from Wordpress metadata
categories=$(grep 'domain="category"' "$file" |sed 's/.*CDATA\[//g;s/\].*//g')

content_orig=$(awk '/<content:encoded/,/\/content:encoded>/' "$file" |sed 's/.*CDATA\[//g;s/\]\].*//g')
content_parsed=$(echo "$content_orig" |sed -e "s,&gt;,>,g;s,&lt;,<,g;s,&#91;,[,g;s,&#8211;,-,g;s,&#8217;,',g;s,&#8220;,\",g;s,&#8221;,\",g;s,&#8230;,...,g;s,&amp;,\&,g")

# Replace HTML headers with hashes
content_parsed=$(echo "$content_parsed" |sed 's,<h1>,# ,g;s,</h1>,,g;s,<h2>,## ,g;s,</h2>,,g;s,<h3>,### ,g;s,</h3>,,g;s,<h4>,#### ,g;s,</h4>,,g;')

# Replace line breaks with newlines
content_parsed=$(echo "$content_parsed" |sed  "s,<br />,\n,g")

# Replace em with italics
content_parsed=$(echo "$content_parsed" |sed  's,<em>,_,g;s,</em>,_,g')

# Replace strong with bold
content_parsed=$(echo "$content_parsed" |sed  's,<strong>,**,g;s,</strong>,**,g')

# Convert bullet lists (YMMV - This was written for my archives which had no nested bullet lists)
content_parsed=$(echo "$content_parsed" |sed 's,<ul>,,g;s,</ul>,,g;s,.*<li>,- ,g;s,</li>,,g')

# Replace code tags with backticks
## In all but one case code blocks use `pre` tags and in-line code uses `code` (one case of `code` tags used for multi-line)
## ("<code[^>]*>" matches both <code> and <code foo="bar"> by going until the next occurrence of >)
content_parsed=$(echo "$content_parsed" |sed  's,<code[^>]*>,`,g;s,<pre[^>]*>,```\n,g;s,</code>,`,g;s,</pre>,\n```,g')

# Remove paragraph tags ("<p[^>]*>" matches both <p> and <p foo="bar"> by going until the next occurrence of >)
## Need to do this after any other tags that start "<p" otherwise this mangles them (like <pre>)
content_parsed=$(echo "$content_parsed" |sed  's,<p[^>]*>,,g;s,</p>,,g')

# Remove divs
content_parsed=$(echo "$content_parsed" |sed  '/<div /d')
content_parsed=$(echo "$content_parsed" |sed  '/<\/div>/d')

# Horizontal rules with markdown equiv
content_parsed=$(echo "$content_parsed" |sed  's,<hr[^>]*>,---,g')

# Replace html codes for symbols with actual symbols
## (a bare & in a sed replacement means "the matched text", so the literal ampersand is written as \&)
content_parsed=$(echo "$content_parsed" |sed  "s,&gt;,>,g;s,&lt;,<,g;s,&#91;,[,g;s,&#8211;,-,g;s,&#8217;,',g;s,&#8220;,\",g;s,&#8221;,\",g;s,&#8230;,...,g;s,&amp;,\&,g")

# Replace featured image HTML with featured shortcode
## This uses a shortcode I created for my Hugo site at layouts/shortcodes/imglink.html
content_parsed=$(echo "$content_parsed" |sed  's;<figure.*<figcaption>\(.*\)</figcaption.*;
;g')

# Hyperlinks to markdown
## Handling multiple links on one line by using match groups, try it out on regex101.com to see how it works
content_parsed=$(echo "$content_parsed" |sed  -re 's,<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>,[\2](\1),g')

# Replace curly ("smart") quotes with straight quotes
content_parsed=$(echo "$content_parsed" |sed  's,“,",g;s,”,",g')
content_parsed=$(echo "$content_parsed" |sed  "s,’,',g")

# Replace Blockquotes
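## Skips (`!b`) every line outside the blockquote range, so the second `-e` action only runs on lines between the first and second pattern match
## (plain `<` is used in the patterns because GNU sed treats `\<` as a word boundary, which never matches before `/`)
content_parsed=$(echo "$content_parsed" |sed -e '/<blockquote/,/<\/blockquote/!b' -e 's/^/> /')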

#
# Return Hugo blog post
#
echo "---"
echo "title: $title_parsed"
echo "date: $date"
echo "categories:"
while IFS= read -r line; do
    echo "  - $line"
done <<< "$categories"
echo "  - wordpress archive"
echo "draft: true"
echo "---"
# Strip carriage returns (displayed as ^M in some editors)
echo "$content_parsed" |tr -d '\r'
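
The hyperlink step is the easiest one to get wrong, so here is a minimal sketch for trying it in isolation. The substitution is the one from the script; links_to_markdown is a hypothetical wrapper name, the sample HTML is made up, and -E is used as the more portable spelling of GNU sed's -r.

```shell
# Hypothetical wrapper around the script's hyperlink substitution.
# Group 1 captures the href value, group 2 the link text.
links_to_markdown() {
    sed -E 's,<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>,[\2](\1),g'
}

# Made-up sample line with two links, one carrying extra attributes
printf '%s\n' 'See <a href="https://example.com/a">first</a> and <a class="x" href="https://example.com/b" target="_blank">second</a>.' | links_to_markdown
# prints: See [first](https://example.com/a) and [second](https://example.com/b).
```

Because the link text is matched with [^<]*, a link whose text still contains other HTML tags is left untouched, which is one more thing to look for when checking the output.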

Obviously, check the files manually to ensure the formatting is as expected before publishing! Some things the above script doesn't do:

  • Upload external files and attachments (sort out your own image paths and uploads!)
  • Manage relative linking (so check for any mentions of your other blog posts and correct them!)
  • Preserve post tags (I didn't care about them in my use case)
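
If tags do matter in your case, they could probably be collected the same way the script collects categories. This is a rough sketch, assuming tags appear in the export as category elements with domain="post_tag" (worth confirming against your own file); extract_tags and the sample data are made up for illustration.

```shell
# Hypothetical helper: print the tag names found in one split post file.
# Assumes tags look like <category domain="post_tag" ...><![CDATA[name]]></category>
extract_tags() {
    grep 'domain="post_tag"' "$1" | sed 's/.*CDATA\[//;s/\]\].*//'
}

# Made-up sample data to try it on
sample=$(mktemp)
cat > "$sample" <<'EOF'
<category domain="category" nicename="linux"><![CDATA[Linux]]></category>
<category domain="post_tag" nicename="awk"><![CDATA[awk]]></category>
<category domain="post_tag" nicename="sed"><![CDATA[sed]]></category>
EOF

# Emit the tags as a front matter list, mirroring the categories loop
echo "tags:"
extract_tags "$sample" | while IFS= read -r tag; do
    echo "  - $tag"
done
rm -f "$sample"
```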

Pulling it all together

I like to store my posts in a directory for the year they were written. To do that, I looped through a temporary directory holding the split files, ran the conversion script above on each one, and wrote the output to the appropriate location:

for i in wordpress_parse/*.xml ; do
    year=$(grep -m 1 post_date_gmt "$i" |sed 's/.*CDATA\[//g;s/\].*//g' |awk -v FS='-' '{print $1}')
    mkdir -p "content/posts/$year"
    bash wordpress_parse/html_to_markdown.sh "$i" > "content/posts/$year/$(basename "$i" .xml).md"
done

Note: If the year is 0000 then it's because it was a draft in Wordpress that was never published
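
If you would rather not import those unpublished drafts at all, the year check can be pulled out and used to skip them. This is a minimal sketch under that assumption: post_year is a hypothetical helper built from the same grep and sed as the loop above, and the two sample files are made up.

```shell
# Hypothetical helper: print the year from the first post_date_gmt line of a split post file.
post_year() {
    grep -m 1 post_date_gmt "$1" | sed 's/.*CDATA\[//;s/\].*//' | cut -d- -f1
}

# Made-up sample files: one published post, one never-published draft
published=$(mktemp)
draft=$(mktemp)
echo '<wp:post_date_gmt><![CDATA[2014-03-02 10:15:00]]></wp:post_date_gmt>' > "$published"
echo '<wp:post_date_gmt><![CDATA[0000-00-00 00:00:00]]></wp:post_date_gmt>' > "$draft"

for f in "$published" "$draft"; do
    year=$(post_year "$f")
    if [ "$year" = "0000" ]; then
        echo "skipping draft"
        continue
    fi
    echo "would write to content/posts/$year/"
done
rm -f "$published" "$draft"
```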

There we go! Check out my imported posts in my Wordpress Archive.