Why I recreate PDFs into Plain Old HTML

Why would I bother taking the time to painstakingly re-create PDF documents into Plain Old HTML?

Case in point: I get a document like this as a PDF.

<image deleted>

It’s generic, it’s basic, it’s not pretty, but it gets the job done.

What happens, say, if I try to copy-and-paste the text from it? I get this garglemesh.

<image deleted>

Depending on the tools used, automatically converting text into PDF is not always kind to what would otherwise be a simple, hierarchically organised document. The conversion can scramble the underlying text so that, outside of a PDF reader, it’s unintelligible. In fact, non-Adobe PDF readers may not be able to make sense of it either.

Rather than try to explain this to people, I end up re-creating the occasional PDF into POSH — Plain Old Semantic HTML.

Why is HTML better than PDF?

  • HTML is much, much smaller in file-size, thus much, much faster to load.
  • HTML is easier to display on multiple devices, including screen readers, mobile devices, and Google.
  • The visitor can read the document immediately, rather than having to load a PDF reader—or worse, have the PDF try to render inside the web browser. An 8.5 x 11″ or A4 sheet of paper is difficult to read on a mobile phone, but well marked-up text reads just fine.
  • HTML is easier for search engines to index than PDF, because the text isn’t scrambled into nonsense, the file-size is smaller, and semantic elements such as headings are preserved.
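
To make the point about preserved headings concrete, here is a minimal sketch of how any program (not just a search engine) can pull a document outline out of well marked-up HTML using only Python's standard library. The class name, function name and sample markup are mine, made up for illustration; real indexers are certainly more involved than this.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for every h1-h6 element."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._parts = []    # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        # Tags arrive lower-cased; match h1 through h6 only.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            text = "".join(self._parts).strip()
            self.outline.append((self._level, text))
            self._level = None


def heading_outline(html_text):
    """Return the heading outline of an HTML string."""
    parser = HeadingOutline()
    parser.feed(html_text)
    parser.close()
    return parser.outline


# Hypothetical sample document, invented for this example.
SAMPLE = (
    "<h1>Meeting minutes</h1><p>Intro.</p>"
    "<h2>Action <em>items</em></h2><ul><li>Book room</li></ul>"
)

for level, text in heading_outline(SAMPLE):
    print("  " * (level - 1) + text)
```

Nothing comparable is possible with copy-and-pasted PDF text unless the PDF was deliberately tagged, because the heading structure simply isn't in the text you get out.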

PDFs are for one purpose only: distributing information that must be physically printed on a specified size of paper. If you want your information available on-screen or on a different size of paper, do not use PDF.

Why is PDF better than Word?

  • PDF is a bit more universally available than a Microsoft Word document. Most people have a PDF reader on their computer. Not everyone can afford Word, and not everybody knows that there are alternative word processors that can both read and create .doc files readable by Word.
  • Someone could take your Word document, change the text, and redistribute the document as though it came from the original source. It’s slightly trickier to do that with PDFs.

If your print-only document is absolutely complete and final, and you know exactly the size of paper it must be printed on, then you can publish it as a PDF.

Wouldn’t it be easier to convert from the original Word document to HTML?

Yes. Yes, it would.

Word and HTML do not play nicely together, but at least the text wouldn’t be nearly as scrambled. In this scenario, I often copy all the text from the Word document, paste it into a plain-text editor to strip all formatting, and manually re-format the document. At least it saves on all that re-typing.
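
The dullest part of that manual re-formatting can be scripted. This is only a rough sketch, assuming the pasted text separates paragraphs with blank lines; the function name is my own, and headings and lists would still need marking up by hand afterwards.

```python
import html
import re


def paragraphs_to_html(plain_text):
    """Wrap each blank-line-separated block of text in a <p> element."""
    blocks = re.split(r"\n\s*\n", plain_text.strip())
    out = []
    for block in blocks:
        # Collapse hard line breaks and runs of spaces left over from the paste.
        text = " ".join(block.split())
        if text:
            # Escape &, < and > so the text cannot break the markup.
            out.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(out)


# Hypothetical pasted text, invented for this example.
pasted = "Fees & charges\napply.\n\n\nSecond   paragraph."
print(paragraphs_to_html(pasted))
```

Everything comes out as a paragraph, which is exactly the problem with stripped text: the structure has to be put back by someone who has read the document.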

In a bureaucracy, however, it’s faster and less agonising to just re-type the document than to go through the process of trying to get the original text.

Also, if the original Word author didn’t use styles, but instead manually made things large, bold, or green, then there’s no semantic information in the document to use anyway—just a load of non-semantic presentational markup. If they did use Word styles and marked up elements like headings and lists, then that Word document becomes much more useful.

So what’s the ideal situation?

  1. People mark up their content in POSH to begin with. From there, we can create semantic Word, PDF and other documents as needed. Bwah ha ha ha ha! Just kidding.
  2. Use an HTML editor that forces POSH. XStandard is the best I’ve ever seen, though I’ve not used it for some time.
  3. Have people mark up their content in a semantic word processor using semantic styles, carefully identifying things like headings, captions, list items (the bullet character does not count), tables, &c. Then a skilled HTML author can do the conversion that preserves the semantics and eliminates the presentational markup.
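
Option 3 works because a named style maps cleanly onto an HTML element. A minimal sketch of that mapping follows, assuming the word processor's content has already been exported as (style name, text) pairs. The style names, the export format and the function name are all assumptions of mine for illustration, not any particular tool's API.

```python
import html

# Assumed style names; a real document's styles may well differ.
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Normal": "p",
}
LIST_STYLE = "List Bullet"


def styled_to_html(paragraphs):
    """Convert (style, text) pairs to semantic HTML, grouping list items."""
    out = []
    in_list = False
    for style, text in paragraphs:
        is_item = style == LIST_STYLE
        # Open or close the <ul> wrapper when a run of list items starts or ends.
        if is_item and not in_list:
            out.append("<ul>")
        elif not is_item and in_list:
            out.append("</ul>")
        in_list = is_item
        # Unknown styles fall back to a plain paragraph.
        tag = "li" if is_item else STYLE_TO_TAG.get(style, "p")
        out.append(f"<{tag}>{html.escape(text)}</{tag}>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


# Hypothetical exported document, invented for this example.
doc = [
    ("Heading 1", "Agenda"),
    ("List Bullet", "Tea & biscuits"),
    ("List Bullet", "Budget"),
    ("Normal", "Done."),
]
print(styled_to_html(doc))
```

If the author had instead typed bullet characters and made the heading large and bold by hand, every entry would arrive as "Normal" and the mapping would have nothing to work with.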
