Overview
I have a dump in XML of my old Wordpress blog posts (exported in December 2019) and would like to convert them to markdown in order to add them to this site.
Parsing the Wordpress Export
My first step was to split the export into one file per blog post. This can be done with awk:
awk -v FS='\n' -v RS="<title>" '/CDATA\[post\]/ {fname=$1".xml" ; gsub(/<\/title>/, "", fname) ; gsub(/: /,"-",fname) ; gsub(/&amp;/, "and", fname) ; gsub(/\//, "_or_", fname) ; gsub(/\(/,"",fname); gsub(/\)/,"",fname) ; gsub(/ /, "_", fname) ; print $0 > tolower(fname) }' wordpress.xml
What a mouthful! Let's break down those awk arguments:
- `-v FS='\n'`: Separate fields by newline (overrides the default of whitespace; this means that the awk arguments `$1`, `$2`, etc. correspond to lines in the record)
- `-v RS="<title>"`: Separate records on the XML tag `<title>`
- `/CDATA\[post\]/`: Only continue with records that contain the string `CDATA[post]` (something that has a `post_type` of `post`; this excludes pages, attachments, etc. that exist in the Wordpress export)
- `{fname=$1".xml"`: Set the output file name to the first line of the record (in this case, the post title from Wordpress) and add the extension `.xml`
- `gsub(/<\/title>/, "", fname)`: Remove `</title>` from the end of the post title
- `gsub(/: /,"-",fname)`: Replace colon+space with a dash
- `gsub(/&amp;/, "and", fname)`: Replace the ampersand code with the word "and"
- `gsub(/\//, "_or_", fname)`: Replace slashes with "or"
- `gsub(/\(/,"",fname)` and `gsub(/\)/,"",fname)`: Remove opening and closing brackets
- `gsub(/ /, "_", fname)`: Replace spaces with underscores
  - Note: Yes, I should have named my posts in a simpler way back in the day
- `print $0 > tolower(fname)`: Write the matching record to the output file name in lower case
Note: To do this on Mac, simply replace `awk` with `gawk` (`gawk` may need to be installed on your system first).
The only issue I had with this is that my "Front Page" was technically a post so I removed this from my dump of files.
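To see how the record splitting and the filename clean-up behave before touching the real export, both can be tried on a made-up sample. This is only a sketch: the sample posts and the helper names `split_demo` and `slugify` are invented for illustration, it prefers `gawk` when one is installed, and it assumes ampersands appear in titles as `&amp;`.

```shell
# Prefer gawk when installed: a multi-character RS is an extension, not POSIX awk.
AWK=$(command -v gawk || command -v awk)

# Made-up sample: one post and one page, shaped like a WordPress export.
sample='<item>
<title>Linux: Tips &amp; Tricks (Part 1)</title>
<wp:post_type><![CDATA[post]]></wp:post_type>
</item>
<item>
<title>About Me</title>
<wp:post_type><![CDATA[page]]></wp:post_type>
</item>'

# Print the first line ($1) of every record whose post_type is "post".
split_demo() {
  printf '%s\n' "$sample" | "$AWK" -v FS='\n' -v RS='<title>' '/CDATA\[post\]/ { print $1 }'
}

# Apply the same gsub chain as the one-liner to a single title line.
slugify() {
  printf '%s\n' "$1" | "$AWK" '{
    fname = $0 ".xml"
    gsub(/<\/title>/, "", fname)
    gsub(/: /, "-", fname)
    gsub(/&amp;/, "and", fname)
    gsub(/\//, "_or_", fname)
    gsub(/\(/, "", fname)
    gsub(/\)/, "", fname)
    gsub(/ /, "_", fname)
    print tolower(fname)
  }'
}

slugify "$(split_demo)"
```

Only the record of type `post` gets through the filter, and its title comes out as `linux-tips_and_tricks_part_1.xml`; the "About Me" page is skipped.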
Converting HTML to Markdown
The blog posts are now separated, but they're still in XML format with a lot of Wordpress-specific gunk. The following script will:
- Set the Title, Categories and Date in the page metadata (front matter) for Hugo
- Convert any HTML formatting elements to markdown
- Remove any unnecessary XML tags
- Remove the Windows control character (carriage return, shown as ^M) from the export
#!/bin/bash
file=$1
# Title = Line 1
# Date = <wp:post_date_gmt>
# Body = <content:encoded> to </content:encoded>
title_orig=$(head -1 "$file" |sed 's,</title>,,g')
# Decode HTML entities in the title (numeric entity forms are assumed here) and strip colons
title_parsed=$(echo "$title_orig" |sed -e "s,&gt;,>,g;s,&lt;,<,g;s,&#91;,[,g;s,&#8211;,-,g;s,&#8217;,',g;s,&#8220;,\",g;s,&#8221;,\",g;s,&#8230;,...,g;s,&amp;,\&,g;s,:,,g")
date=$(grep post_date_gmt -m 1 "$file" |sed 's/.*CDATA\[//g;s/\].*//g')
# Collect and sanitise category names from Wordpress metadata
categories=$(grep 'domain="category"' "$file" |sed 's/.*CDATA\[//g;s/\].*//g')
content_orig=$(awk '/<content:encoded/,/\/content:encoded>/' "$file" |sed 's/.*CDATA\[//g;s/\]\].*//g')
# Decode the same HTML entities in the body
content_parsed=$(echo "$content_orig" |sed -e "s,&gt;,>,g;s,&lt;,<,g;s,&#91;,[,g;s,&#8211;,-,g;s,&#8217;,',g;s,&#8220;,\",g;s,&#8221;,\",g;s,&#8230;,...,g;s,&amp;,\&,g")
# Replace HTML headers with hashes
content_parsed=$(echo "$content_parsed" |sed 's,<h1>,# ,g;s,</h1>,,g;s,<h2>,## ,g;s,</h2>,,g;s,<h3>,### ,g;s,</h3>,,g;s,<h4>,#### ,g;s,</h4>,,g;')
# Replace line breaks with newlines
content_parsed=$(echo "$content_parsed" |sed "s,<br />,\n,g")
# Replace em with italics
content_parsed=$(echo "$content_parsed" |sed 's,<em>,_,g;s,</em>,_,g')
# Replace strong with bold
content_parsed=$(echo "$content_parsed" |sed 's,<strong>,**,g;s,</strong>,**,g')
# Convert bullet lists (YMMV - This was written for my archives which had no nested bullet lists)
content_parsed=$(echo "$content_parsed" |sed 's,<ul>,,g;s,</ul>,,g;s,.*<li>,- ,g;s,</li>,,g')
# Replace code tags with backticks
## In all but one case code blocks use `pre` tags and in-line code uses `code` (one case of `code` tags used for multi-line)
## ("<code[^>]*>" matches both <code> and <code foo="bar"> by going until the next occurrence of >)
content_parsed=$(echo "$content_parsed" |sed 's,<code[^>]*>,`,g;s,<pre[^>]*>,```\n,g;s,</code>,`,g;s,</pre>,\n```,g')
# Remove paragraph tags ("<p[^>]*>" matches both <p> and <p foo="bar"> by going until the next occurrence of >)
## Need to do this after any other tags that start "<p" otherwise this mangles them (like <pre>)
content_parsed=$(echo "$content_parsed" |sed 's,<p[^>]*>,,g;s,</p>,,g')
# Remove divs
content_parsed=$(echo "$content_parsed" |sed '/<div /d')
content_parsed=$(echo "$content_parsed" |sed '/<\/div>/d')
# Horizontal rules with markdown equiv
content_parsed=$(echo "$content_parsed" |sed 's,<hr[^>]*>,---,g')
# Replace html codes for symbols with actual symbols (numeric entity forms are assumed here)
content_parsed=$(echo "$content_parsed" |sed "s,&gt;,>,g;s,&lt;,<,g;s,&#91;,[,g;s,&#8211;,-,g;s,&#8217;,',g;s,&#8220;,\",g;s,&#8221;,\",g;s,&#8230;,...,g;s,&amp;,\&,g")
# Replace featured image HTML with featured shortcode
## This uses a shortcode I created for my Hugo site at layouts/shortcodes/imglink.html
content_parsed=$(echo "$content_parsed" |sed 's;<figure.*<figcaption>\(.*\)</figcaption.*;
;g')
# Hyperlinks to markdown
## Handles multiple links on one line by using match groups, try it out on regex101.com to see how it works
content_parsed=$(echo "$content_parsed" |sed -re 's,<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>,[\2](\1),g')
# Get rid of shitty quotes
content_parsed=$(echo "$content_parsed" |sed 's,“,",g;s,”,",g')
content_parsed=$(echo "$content_parsed" |sed "s,’,',g")
# Replace Blockquotes
## Performs second `-e` action on lines between first and second pattern match
content_parsed=$(echo "$content_parsed" |sed -e '/<blockquote/,/<\/blockquote/!b' -e 's/^/> /')
#
# Return Hugo blog post
#
echo "---"
echo "title: $title_parsed"
echo "date: $date"
echo "categories:"
while IFS= read -r line; do
echo " - $line"
done <<< "$categories"
echo " - wordpress archive"
echo "draft: true"
echo "---"
echo "$content_parsed" |tr -d '\r'
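Before running the script over every post, a few of the substitutions can be checked on a made-up line. A minimal sketch, assuming a sed that accepts `-E` (the more portable spelling of the `-r` used above); the helper name `convert_inline` and the sample HTML are invented for illustration:

```shell
# Hypothetical helper: applies the same em/strong/link substitutions as the script above.
convert_inline() {
  sed -E \
    -e 's,<em>,_,g;s,</em>,_,g' \
    -e 's,<strong>,**,g;s,</strong>,**,g' \
    -e 's,<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>,[\2](\1),g'
}

# The sample line ends in \r\n to mimic the Windows line endings in the export;
# tr -d '\r' strips the carriage return afterwards.
printf '%s\r\n' 'Read <em>this</em> and <strong>that</strong> at <a href="https://example.com" class="ext">my site</a>' \
  | convert_inline | tr -d '\r'
```

This should print `Read _this_ and **that** at [my site](https://example.com)` with no trailing carriage return.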
Obviously, check the files manually to ensure the formatting is as expected before publishing! Some things the above script doesn't do:
- Upload external files and attachments (sort out your own image paths and uploads!)
- Manage relative linking (so check for any mentions of your other blog posts and correct them!)
- Preserve post tags (I didn't care about them in my use case)
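If tags do matter for your archive, they could probably be collected the same way as the categories. A minimal sketch, assuming the export marks tags with `domain="post_tag"` on `<category>` elements (check your own export first); the helper name `extract_tags` and the sample file are hypothetical:

```shell
# Hypothetical helper: list the tag names found in one split post file.
# Assumes tags look like: <category domain="post_tag" ...><![CDATA[name]]></category>
extract_tags() {
  grep 'domain="post_tag"' "$1" | sed 's/.*CDATA\[//;s/\].*//'
}

# Made-up sample resembling three metadata lines of a WordPress export.
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
<category domain="category" nicename="linux"><![CDATA[Linux]]></category>
<category domain="post_tag" nicename="awk"><![CDATA[awk]]></category>
<category domain="post_tag" nicename="sed"><![CDATA[sed]]></category>
EOF

# Emit the tags as a YAML list, in the same style as the categories block.
echo "tags:"
extract_tags "$sample_file" | while IFS= read -r tag; do
  echo " - $tag"
done
rm -f "$sample_file"
```

The category line is ignored and only `awk` and `sed` are listed under `tags:`.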
Pulling it all together
I like to store my posts in directories named for the year they were written. To do that, I looped through a temporary directory holding my parsed files, ran the conversion script above on each one, and wrote the output to the appropriate location:
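for i in wordpress_parse/*.xml ; do
  # The year is taken from the first post_date_gmt entry in the file
  year=$(grep post_date_gmt -m 1 "$i" |sed 's/.*CDATA\[//g;s/\].*//g' |awk -v FS='-' '{print $1}')
  mkdir -p "content/posts/$year"
  # Keep the file name but swap the .xml extension for .md
  name=${i##*/}
  bash wordpress_parse/html_to_markdown.sh "$i" > "content/posts/$year/${name%.xml}.md"
done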
Note: If the year is 0000 then it's because it was a draft in Wordpress that was never published
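The year extraction can be checked on its own, including the unpublished-draft case. A small sketch, assuming the date line is CDATA-wrapped as the sed expression in the loop implies; the helper name `post_year` and both sample lines are made up:

```shell
# Hypothetical helper: print the year from the first post_date_gmt line on stdin.
post_year() {
  grep -m 1 post_date_gmt | sed 's/.*CDATA\[//;s/\].*//' | cut -d- -f1
}

# A published post and a never-published draft (made-up dates).
echo '<wp:post_date_gmt><![CDATA[2015-03-02 10:15:00]]></wp:post_date_gmt>' | post_year
echo '<wp:post_date_gmt><![CDATA[0000-00-00 00:00:00]]></wp:post_date_gmt>' | post_year
```

The first line yields `2015` and the second `0000`, so a test such as `[ "$year" = "0000" ]` inside the loop would be one way to skip or separate old drafts.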
There we go, check out my imported posts in my Wordpress Archive.