Html To Md

Steps for html-to-md script

Step 1 Download the website

To avoid getting blocked or sending excessive requests to the server, download the website to your local computer first. This is also useful in case you need to retry the html-to-md script several times. The easiest tool for this is wget.

wget -r -m -p -x --user-agent="Mozilla/5.0" https://www.websiteurlhere.com

Step 2 Copy script and git init

Next, add the html-to-md script to the downloaded folder and run git init in that folder. This lets you see changes with git status and roll back failed scrapes with git add . followed by git stash (the git add . comes first so that newly created files are included in the stash). The repository is not connected to a repo online; it is only used to track changes locally.

So open a terminal and run:

git init
git add .
git commit -m 'init commit'

Step 3 Configure the html-to-md script

Next, configure the html-to-md script. By default it is set up to produce both the _pages collection and the _blogposts collection. It decides whether a file is a blog post by looking for a selector (blog_selector in CONFIG) that should match only blog posts.

Work through the CONFIG block at the top of the script and fill in the selectors for the site you are converting.

Step 4 Run and verify

The script can then be run with ruby html-to-md.rb, but it’s important to check that pages have actually been created. You can look through the generated _pages and _blogposts folders, or use git status to see what changed.

If everything has been brought over and created correctly, move on to Step 5. Otherwise, reset with git add . followed by git stash, then try Step 3 again.
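
A quick count of the generated files can supplement git status. This is a rough sketch, assuming the script was run from the downloaded folder so that _pages and _blogposts sit in the current directory; count_markdown is a hypothetical helper name.

```ruby
# Count the Markdown files written into each collection folder.
# `count_markdown` is a hypothetical helper, not part of html-to-md.rb.
def count_markdown(root = '.')
  %w[_pages _blogposts].to_h do |collection|
    [collection, Dir.glob(File.join(root, collection, '**', '*.md')).size]
  end
end

if __FILE__ == $PROGRAM_NAME
  count_markdown.each { |collection, count| puts "#{collection}: #{count} files" }
end
```

If either count is zero, the blog_selector or content selector probably needs another look.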

Step 5 Verify and compare sitemaps

Next, make sure every page came over and matches the live site. Find the sitemap on the live site and use find and replace to strip out the XML markup, so you are left with a list of all the website’s URLs. Then sort the list a-z.

Do the same for the html-to-md converted content: add the converted pages to Jekyll, take all the links from the generated sitemap, and use find and replace so you’re left with only the links. Add the URL prefix back in (https://www, or however the live website’s sitemap writes it) so the links match. Then sort the links a-z.

You should now have two files, sorted a-z, that are identical if nothing was missed. Name one new-pages.txt and the other live-pages.txt.
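
The find-and-replace and sorting can also be scripted. The sketch below is one possible approach, not part of the original workflow: sitemap_urls is a hypothetical helper that pulls the <loc> values out with a regular expression rather than an XML parser, which should be enough for the flat entries of a typical sitemap.

```ruby
# Pull every <loc> value out of sitemap XML and return the URLs sorted a-z.
# `sitemap_urls` is a hypothetical helper, not part of html-to-md.rb.
def sitemap_urls(xml)
  xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten.uniq.sort
end

if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  # Usage (file names are examples): ruby sitemap_urls.rb sitemap.xml live-pages.txt
  source, target = ARGV
  File.write(target, sitemap_urls(File.read(source)).join("\n") + "\n")
end
```

Run it once on the live sitemap and once on the Jekyll-generated sitemap to produce the two text files.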

Then we’re going to use Vim to generate an HTML diff comparison. There are other ways to diff two files, but Vim has been the easiest for me. Running vimdiff interactively has given great results, but the HTML output works well too.

vimdiff new-pages.txt live-pages.txt -c TOhtml -c 'w! diff.html' -c 'qa!'

Any pages that are missed will show up here. Find these and convert them manually with this online converter (https://codebeautify.org/html-to-markdown) or go back to Step 3 and try to reconfigure the html-to-md script.
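
As a plain-text alternative to the HTML diff, the two lists can be compared directly. This is a small sketch that assumes the file names new-pages.txt and live-pages.txt from above; missing_pages is a hypothetical helper name.

```ruby
# Return the URLs present in the live list but absent from the converted one.
# `missing_pages` is a hypothetical helper; each argument is an array of lines.
def missing_pages(live_lines, new_lines)
  normalize = ->(lines) { lines.map(&:strip).reject(&:empty?) }
  (normalize.call(live_lines) - normalize.call(new_lines)).sort
end

if __FILE__ == $PROGRAM_NAME && File.exist?('live-pages.txt') && File.exist?('new-pages.txt')
  puts missing_pages(File.readlines('live-pages.txt'), File.readlines('new-pages.txt'))
end
```

Each URL it prints is a page that exists on the live site but was not produced by the conversion.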

html-to-md.rb conversion script

##############################
#         READ ME
###

# This loops through every HTML page in the directory (except the files and
# folders listed in CONFIG), pulling the title tag and description from the
# head and the page content from the body, then wrapping them in front matter
# and writing the result out as a Markdown file.

# **CAUTION**
# this does require a few edits, which will be listed below

# Read the outputs to follow along with the script

# Let's see how it's done!

# Requiring dependencies
puts "Requiring dependencies"
require 'nokogiri'
require 'reverse_markdown'
require 'find'
require 'date'
require 'fileutils'

# Define CSS selectors and other configuration settings at the top for easy modification
CONFIG = {
  ignore_files: ['./index.html', './indexNEW.html'],
  ignore_folders: ['/assets/', '.git', '.htaccess'],
  blog_selector: "body.single-post",
  title: 'head title',
  description: 'head meta[name="description"]',
  category: 'a[rel="category tag"]',
  h1: 'h1',
  json_ld: "head script[type='application/ld+json']",
  content: '.et_pb_post_content',
  remove_selectors: [
    '.extra-selector-1',  # Add your extra selectors here
    '.extra-selector-2'   # Add more selectors as needed
  ]
}

# Method to process each file
def process_file(f, config)
  puts "---"
  puts "Processing file: #{f}"

  file = File.read(f)
  puts "File read successfully"

  doc = Nokogiri::HTML(file)
  puts "File parsed successfully"

  puts "Beginning DOM interaction"

  title = doc.at_css(config[:title])&.inner_text || ""
  puts "Found title tag: #{title}"

  description_tag = doc.at_css(config[:description])
  description = description_tag ? description_tag["content"] : ""
  puts "Found header metadata: title=#{title}, description=#{description}"

  is_blog = doc.at_css(config[:blog_selector])
  if is_blog
    puts 'Processing as BLOG PAGE'

    category = doc.at_css(config[:category])&.inner_text || ""
    puts "Found category: #{category}"

    h1 = doc.at_css(config[:h1])&.text || ""
    doc.at_css(config[:h1])&.remove
    puts "Found H1: #{h1}"

    date = if doc.at_css(config[:json_ld])
             json_ld = doc.at_css(config[:json_ld]).text
             Date.parse(json_ld.scan(/[0-9]{4}-[0-9]{2}-[0-9]{2}/).first) rescue Date.today
           else
             Date.today
           end
    puts "Found date: #{date}"

    # Remove additional unwanted selectors from the content
    config[:remove_selectors].each do |selector|
      doc.css(selector).each(&:remove)
    end
    puts "Removed additional unwanted content"

    content = doc.at_css(config[:content])&.parent
    if content.nil?
      puts "Content not found using selector: #{config[:content]}"
      return
    else
      puts "Found content"
    end

    # Extract the filename from the URL path for blog posts
    filename = f.split('/')[-2]
    mdF = "./_blogposts/#{filename}.md"
    puts "Markdown file path: #{mdF}"

    frontMatter = <<~FRONTMATTER
      ---
      title: >
        #{h1}
      layout: post
      date: >
        #{date}
      titletag: >
        #{title}
      description: >
        #{description}
      permalink: >
        #{f.gsub("./","/").gsub("/index.html","/")}
      sitemap: true
      categories:
        - #{category}
      ---
    FRONTMATTER

    puts "Saved front matter"

    markdown = ReverseMarkdown.convert(content.to_s)
    puts "Converted content to Markdown"

    newPage = frontMatter + markdown
    puts "Combined front matter and content"

    directory = File.dirname(mdF)
    puts "Making directory: #{directory}"

    FileUtils.mkdir_p(directory)
    puts "Directory created"

    File.write(mdF, newPage)
    puts "Written to file: #{mdF}"

    FileUtils.rm_rf(f)
    puts "Removed old file: #{f}"
  else
    puts 'Processing as regular PAGE'

    # Remove additional unwanted selectors from the content
    config[:remove_selectors].each do |selector|
      doc.css(selector).each(&:remove)
    end
    puts "Removed additional unwanted content"

    content = doc.at_css(config[:content])&.parent
    if content.nil?
      puts "Content not found using selector: #{config[:content]}"
      return
    else
      puts "Found content"
    end

    # Generate filename for pages by replacing slashes with dashes and removing index.html
    filename = f.gsub('./', '').gsub('/index.html', '').gsub('.html', '').gsub('/', '-')
    mdF = "./_pages/#{filename}.md"
    puts "Markdown file path: #{mdF}"

    frontMatter = <<~FRONTMATTER
      ---
      layout: page
      title: >
        #{f.gsub('./', '').gsub('.html', '').gsub('-', ' ').split('/').map(&:capitalize).join(' ')}
      titletag: >
        #{title}
      description: >
        #{description}
      titlebar: >
      sitemap: true
      ---
    FRONTMATTER

    puts "Saved front matter"

    markdown = ReverseMarkdown.convert(content.to_s)
    puts "Converted content to Markdown"

    newPage = frontMatter + markdown
    puts "Combined front matter and content"

    directory = File.dirname(mdF)
    puts "Making directory: #{directory}"

    FileUtils.mkdir_p(directory)
    puts "Directory created"

    File.write(mdF, newPage)
    puts "Written to file: #{mdF}"

    FileUtils.rm_rf(f)
    puts "Removed old file: #{f}"
  end
rescue => e
  puts "Error processing file #{f}: #{e.message}"
  puts e.backtrace.join("\n")
end

# This loops through every file and ignores framework files
puts "Beginning loop of every file"
Find.find("./") do |f|
  if CONFIG[:ignore_files].include?(f)
    puts "Ignored file: #{f}"
  elsif CONFIG[:ignore_folders].any? { |folder| f.include?(folder) }
    puts "Ignored folder content: #{f}"
  elsif File.file?(f) && f.include?('.html')
    process_file(f, CONFIG)
  else
    puts "Not an HTML file: #{f}"
  end
end

# system (unlike exec) returns control to this script, so the message below still prints
system("find ./ -empty -type d -delete")
puts "Removed all empty directories"
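
The filename and permalink rules used in process_file can be tried in isolation before running the whole script. The sketch below copies the two gsub chains into small helpers; the names page_filename and permalink_for are hypothetical, and the sample paths are made up.

```ruby
# Mirrors the page filename rule: drop "./", "/index.html" and ".html",
# then turn the remaining slashes into dashes.
def page_filename(path)
  path.gsub('./', '').gsub('/index.html', '').gsub('.html', '').gsub('/', '-')
end

# Mirrors the blog permalink rule: root-relative path with index.html removed.
def permalink_for(path)
  path.gsub('./', '/').gsub('/index.html', '/')
end

['./about/team/index.html', './contact.html'].each do |path|
  puts "#{path} -> _pages/#{page_filename(path)}.md, permalink #{permalink_for(path)}"
end
```

Trying a few real paths from the downloaded folder this way shows which Markdown filenames and permalinks to expect.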