Steps for the html-to-md script
Step 1 Download the website
To avoid getting blocked or sending excessive requests to the server, you'll want to download the website to your local computer. This is also useful in case you need to retry the html-to-md script. The easiest way is wget:
wget -r -m -p -x --user-agent="Mozilla/5.0" https://www.websiteurlhere.com
Step 2 Copy script and git init
Next, copy the html-to-md script into the downloaded folder and run git init there. This lets you see changes with git status and roll back failed scrapes with git add . followed by git stash. This repository is not connected to anything online; it's only used for local tracking.
So open a terminal and run:
git init
git add .
git commit -m 'init commit'
Step 3 Configure the html-to-md script
From here you'll want to configure the html-to-md script. By default it's set up to populate both the _pages collection and the _blogposts collection. The script looks for a unique CSS selector, one that only appears on blog posts, to determine what is a blog post and what is not.
So you'll need to work through the script and fill in the blanks for any selectors it needs.
Step 4 Run and verify
The script can then be run with ruby html-to-md.rb, but it's important to check that pages were actually created. You can look through the generated _pages and _blogposts folders to verify the content, or use git status to see what changed.
If you can verify everything was brought over and created appropriately, move on to step 5; otherwise reset with git add . and git stash, then try step 3 again.
Step 5 Verify and compare sitemaps
From here you'll want to make sure all the pages came over and everything matches the live site. Find the sitemap on the live site, then find-and-replace the XML markup away so you're left with a plain list of the website's URLs. Sort these a-z.
Do the same for the html-to-md converted content: add the pages into Jekyll, get all the links from the generated sitemap, and find-and-replace until only the links remain. Add the URL prefix back in (https://www, or however it appears) to match the live website's sitemap, then sort the links a-z.
You should now have two identical files, each sorted a-z. Name one new-pages.txt and the other live-pages.txt.
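The manual find-and-replace above can also be sketched in a few lines of Ruby. This assumes a sitemap in the standard format with `<loc>` entries; the XML here is a placeholder, so point it at your saved sitemap file instead:

```ruby
# Pull every <loc> URL out of a sitemap and sort the list a-z
def sitemap_urls(xml)
  xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten.sort
end

# Placeholder sitemap; in practice: xml = File.read('sitemap.xml')
xml = <<~XML
  <urlset>
    <url><loc>https://www.example.com/blog/first-post/</loc></url>
    <url><loc>https://www.example.com/about/</loc></url>
  </urlset>
XML

urls = sitemap_urls(xml)
# File.write('live-pages.txt', urls.join("\n") + "\n")
```

Run the same extraction against both sitemaps and you get the two sorted files directly.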
Then we'll use Vim to generate an HTML diff comparison. There are other ways to diff two files, but Vim has been the easiest for me; running vimdiff interactively also gives great results, and the HTML output below works well too.
vimdiff new-pages.txt live-pages.txt -c TOhtml -c 'w! diff.html' -c 'qa!'
Any pages that were missed will show up here. Convert them manually with this online converter (https://codebeautify.org/html-to-markdown), or go back to Step 3 and reconfigure the html-to-md script.
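If you'd rather have a scripted check than an eyeball diff, array subtraction in Ruby lists the URLs that exist on the live site but are missing from the converted one. The file names match the ones suggested above; the arrays here stand in for their contents:

```ruby
# URLs in the live list that have no counterpart in the converted list
def missing_pages(live, converted)
  live - converted
end

# Placeholder data; in practice:
#   live      = File.readlines('live-pages.txt', chomp: true)
#   converted = File.readlines('new-pages.txt',  chomp: true)
live      = ['https://www.example.com/a/', 'https://www.example.com/b/']
converted = ['https://www.example.com/a/']

missing = missing_pages(live, converted)
```

Each entry in `missing` is a page to convert manually or to chase back through step 3.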
HTML-TO-MD.rb conversion script
##############################
# READ ME
###
# This loops through every HTML page in the directory (except the files
# and folders listed in CONFIG), pulling the title tag and description
# from the head and the page content from the body, then wrapping them
# in the front-matter template.
# **CAUTION**
# this does require a few edits, which will be listed below
# Read the outputs to follow along with the script
# Let's see how it's done!
# Requiring dependencies
puts "Requiring dependencies"
require 'nokogiri'
require 'reverse_markdown'
require 'find'
require 'date'
require 'fileutils'
# Define CSS selectors and other configuration settings at the top for easy modification
CONFIG = {
ignore_files: ['./index.html', './indexNEW.html'],
ignore_folders: ['/assets/', '.git', '.htaccess'],
blog_selector: "body.single-post",
title: 'head title',
description: 'head meta[name="description"]',
category: 'a[rel="category tag"]',
h1: 'h1',
json_ld: "head script[type='application/ld+json']",
content: '.et_pb_post_content',
remove_selectors: [
'.extra-selector-1', # Add your extra selectors here
'.extra-selector-2' # Add more selectors as needed
]
}
# Method to process each file
def process_file(f, config)
puts "---"
puts "Processing file: #{f}"
file = File.read(f)
puts "File read successfully"
doc = Nokogiri::HTML(file)
puts "File parsed successfully"
puts "Beginning DOM interaction"
title = doc.at_css(config[:title])&.inner_text || ""
puts "Found title tag: #{title}"
description_tag = doc.at_css(config[:description])
description = description_tag ? description_tag["content"] : ""
puts "Found header metadata: title=#{title}, description=#{description}"
is_blog = doc.at_css(config[:blog_selector])
if is_blog
puts 'Processing as BLOG PAGE'
category = doc.at_css(config[:category])&.inner_text || ""
puts "Found category: #{category}"
h1 = doc.at_css(config[:h1])&.text || ""
doc.at_css(config[:h1])&.remove
puts "Found H1: #{h1}"
date = if doc.at_css(config[:json_ld])
json_ld = doc.at_css(config[:json_ld]).text
Date.parse(json_ld.scan(/[0-9]{4}-[0-9]{2}-[0-9]{2}/).first) rescue Date.today
else
Date.today
end
puts "Found date: #{date}"
# Remove additional unwanted selectors from the content
config[:remove_selectors].each do |selector|
doc.css(selector).each(&:remove)
end
puts "Removed additional unwanted content"
content = doc.at_css(config[:content])&.parent
if content.nil?
puts "Content not found using selector: #{config[:content]}"
return
else
puts "Found content"
end
# Extract the filename from the URL path for blog posts
filename = f.split('/')[-2]
mdF = "./_blogposts/#{filename}.md"
puts "Markdown file path: #{mdF}"
frontMatter = <<~FRONTMATTER
---
title: >
#{h1}
layout: post
date: >
#{date}
titletag: >
#{title}
description: >
#{description}
permalink: >
#{f.gsub("./","/").gsub("/index.html","/")}
sitemap: true
categories:
- #{category}
---
FRONTMATTER
puts "Saved front matter"
markdown = ReverseMarkdown.convert(content.to_s)
puts "Converted content to Markdown"
newPage = frontMatter + markdown
puts "Combined front matter and content"
directory = mdF.gsub(mdF.split('/').last, "")
puts "Making directory: #{directory}"
FileUtils.mkdir_p(directory)
puts "Directory created"
File.write(mdF, newPage)
puts "Written to file: #{mdF}"
FileUtils.rm_rf(f)
puts "Removed old file: #{f}"
else
puts 'Processing as regular PAGE'
# Remove additional unwanted selectors from the content
config[:remove_selectors].each do |selector|
doc.css(selector).each(&:remove)
end
puts "Removed additional unwanted content"
content = doc.at_css(config[:content])&.parent
if content.nil?
puts "Content not found using selector: #{config[:content]}"
return
else
puts "Found content"
end
# Generate filename for pages by replacing slashes with dashes and removing index.html
filename = f.gsub('./', '').gsub('/index.html', '').gsub('.html', '').gsub('/', '-')
mdF = "./_pages/#{filename}.md"
puts "Markdown file path: #{mdF}"
frontMatter = <<~FRONTMATTER
---
layout: page
title: >
#{f.gsub('./', '').gsub('.html', '').gsub('-', ' ').split('/').map(&:capitalize).join(' ')}
titletag: >
#{title}
description: >
#{description}
titlebar: >
sitemap: true
---
FRONTMATTER
puts "Saved front matter"
markdown = ReverseMarkdown.convert(content.to_s)
puts "Converted content to Markdown"
newPage = frontMatter + markdown
puts "Combined front matter and content"
directory = mdF.gsub(mdF.split('/').last, "")
puts "Making directory: #{directory}"
FileUtils.mkdir_p(directory)
puts "Directory created"
File.write(mdF, newPage)
puts "Written to file: #{mdF}"
FileUtils.rm_rf(f)
puts "Removed old file: #{f}"
end
rescue => e
puts "Error processing file #{f}: #{e.message}"
puts e.backtrace.join("\n")
end
# This loops through every file and ignores framework files
puts "Beginning loop of every file"
Find.find("./") do |f|
if CONFIG[:ignore_files].include?(f)
puts "Ignored file: #{f}"
elsif CONFIG[:ignore_folders].any? { |folder| f.include?(folder) }
puts "Ignored folder content: #{f}"
elsif f.end_with?('.html')
process_file(f, CONFIG)
else
puts "Not an HTML file: #{f}"
end
end
# Use system (not exec) so the script keeps running after the cleanup
system("find ./ -empty -type d -delete")
puts "Removed all empty directories"