Workflow for semanticClimate

23 Oct, 2022 by Peter Murray-Rust

Steps in processing a Chapter

We assume

PDF2HTML

The raw chapter is a PDF, which requires messy heuristic processing to convert into HTML. We have done this for a small number of chapters and will automate it as soon as we can. Here's a typical chunk.

PDF has no concept of words, lines, or paragraphs. The tools usually guess right, but here there's a problem with the top and left margins, so we have to set clipping boxes. Sometimes these haven't worked. But generally we can get HTML (although things like font styles, subscripts, etc. are often trashed).
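To make the clipping-box idea concrete, here is a minimal sketch in plain Python. It is not the actual semanticClimate code: the `Span`, `ClipBox`, `clip_spans`, and `spans_to_lines` names, the coordinates, and the sample text are all invented for illustration. It assumes a PDF extractor has already given us positioned text fragments, and shows how a box can drop material in the top and left margins before the remaining fragments are guessed into lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A positioned text fragment, as a PDF extractor might report it (hypothetical)."""
    text: str
    x: float  # left edge, in points from the left of the page
    y: float  # top edge, in points from the top of the page


@dataclass(frozen=True)
class ClipBox:
    """Rectangle of the page we want to keep (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, span: Span) -> bool:
        return (self.left <= span.x <= self.right
                and self.top <= span.y <= self.bottom)


def clip_spans(spans, box):
    """Keep only the spans whose origin falls inside the clipping box."""
    return [s for s in spans if box.contains(s)]


def spans_to_lines(spans, y_tolerance=2.0):
    """Heuristic: spans with nearly the same y belong to the same line."""
    lines = []
    for span in sorted(spans, key=lambda s: (s.y, s.x)):
        if lines and abs(lines[-1][0].y - span.y) <= y_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, read left to right.
    return [" ".join(s.text for s in sorted(line, key=lambda s: s.x))
            for line in lines]


# Invented sample data: a running header in the top margin, a line
# number in the left margin, and two lines of body text.
spans = [
    Span("Chapter 3", 72, 30),
    Span("1234", 20, 100),
    Span("Climate", 72, 100),
    Span("change", 120, 100.5),
    Span("is", 72, 114),
    Span("widespread", 90, 114),
]
box = ClipBox(left=60, top=60, right=540, bottom=780)

for line in spans_to_lines(clip_spans(spans, box)):
    print(line)
```

With these made-up numbers the header and the margin line number fall outside the box, so only the two body lines are printed. In practice the box has to be tuned per document, which is why the clipping sometimes fails.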

