Creating a PDF from HTML – Formatting documents with CSS Paged Media
Corresponding services: DITA-XML for technical documentation, Pipelines for automated publishing
There are several ways and tools to create a PDF from technical documentation in DITA-XML or from a website in HTML. In the first part of this knowledge article, I look at the advantages and disadvantages of the two languages commonly used in this process: XSL-FO and CSS Paged Media. In the second part, I compare the features of programs that render HTML files to PDF files via CSS.
This knowledge article was written as part of my bachelor thesis for my degree in Technical Information Design and Technical Editing at the University of Applied Sciences and Arts Hannover.
Starting point: PDF formatting with XSL-FO or CSS Paged Media
Documents created in DITA-XML are often output in various formats, such as HTML and PDF, based on the principle of single sourcing. The XSL-FO language (Extensible Stylesheet Language – Formatting Objects) is often used to format a PDF file (see Fig. 1). However, XSL-FO is complex, and custom formatting adjustments require in-depth knowledge of both XSLT and XSL-FO. In addition, the W3C has stopped developing XSL-FO further.
As an alternative, the PDF output can be formatted with CSS. The relevant CSS module, CSS Paged Media, is being actively developed by the W3C and is regarded as the successor to XSL-FO (see Fig. 2).
The old formatting solution: XSL-FO
Formatting via XSL-FO is based on an XML file and an XSLT stylesheet. This XSLT stylesheet defines the transformation of the XML file into an XSL-FO document. To perform this operation, the XML file and the XSLT stylesheet are passed to an XSLT processor, which generates the XSL-FO result document from the two documents. This XSL-FO document is then transformed into a PDF file by an XSL formatter (see Fig. 3).
An XSL-FO document consists of two parts. In the first part, <fo:layout-master-set> defines the layout of the page(s). The second part is defined with <fo:page-sequence>. This consists of “blocks” where other blocks or elements and specific text content can be nested. Layout and content are therefore defined in one and the same file when using XSL-FO.
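To make the two-part structure concrete, here is a minimal sketch that builds a tiny XSL-FO document with Python's standard library. The page size, margins, master name "A4", and sample text are assumptions chosen for illustration. In a real pipeline this document would be produced by the XSLT stylesheet rather than written by hand.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag: str) -> str:
    """Return the namespace-qualified name of an fo: element."""
    return f"{{{FO_NS}}}{tag}"


def build_fo_document(text: str) -> ET.Element:
    """Build a minimal XSL-FO tree: page layout first, then content."""
    root = ET.Element(fo("root"))

    # Part 1: the page layout (assumed here: A4 with 20 mm margins).
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "A4",
        "page-width": "210mm",
        "page-height": "297mm",
        "margin-top": "20mm",
        "margin-bottom": "20mm",
        "margin-left": "20mm",
        "margin-right": "20mm",
    })
    ET.SubElement(page, fo("region-body"))

    # Part 2: the content, nested as blocks inside a flow.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    block = ET.SubElement(flow, fo("block"), {"font-size": "11pt"})
    block.text = text
    return root


fo_root = build_fo_document("Hello, XSL-FO")
fo_xml = ET.tostring(fo_root, encoding="unicode")
print(fo_xml)
```

The printed result shows layout (fo:layout-master-set) and content (fo:page-sequence) side by side in one file, which is the point made above.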
The new formatting solution: CSS Paged Media
Formatting via CSS, by contrast, is based on an XML or HTML file and a CSS file. These two files are passed to a renderer, which generates the PDF file from them (see Fig. 4).
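As a rough sketch of this step, the following Python snippet builds a call to the WeasyPrint command line tool, one of the renderers tested later in this article. The file names manual.html, print.css, and manual.pdf are hypothetical, and the helper functions are my own illustration, not part of WeasyPrint. Other renderers take the same two inputs but use different command line options.

```python
import shutil
import subprocess


def weasyprint_command(html_path: str, css_path: str, pdf_path: str) -> list[str]:
    """Build a WeasyPrint CLI call: content (HTML) and layout (CSS) in, PDF out."""
    return ["weasyprint", "--stylesheet", css_path, html_path, pdf_path]


def render_pdf(html_path: str, css_path: str, pdf_path: str) -> bool:
    """Run the renderer if it is installed; return False when it is not."""
    command = weasyprint_command(html_path, css_path, pdf_path)
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, for illustration only.
command = weasyprint_command("manual.html", "print.css", "manual.pdf")
print(" ".join(command))
```

Compared with the XSL-FO route, there is no intermediate document: the renderer reads the source file and the stylesheet directly.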
The advantage of generating a document via CSS Paged Media over XSL-FO is that it is possible to work with a more widely used (formatting) language, which makes the technology easier to use. In addition, generating a document via CSS separates content and layout, which increases reusability. Moreover, the print layout can be controlled separately from the screen layout in the CSS via “media queries,” so that only one CSS file is needed for the various output media.
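The following sketch shows what such a shared stylesheet might look like. It is wrapped in Python only so that the sample files can be written to disk in one step. The page size, margins, selectors, and file names are assumptions for illustration, and support for individual rules (for example, page margin boxes such as @bottom-center) varies between renderers.

```python
from pathlib import Path
import tempfile

STYLESHEET = """\
/* Shared rules: apply on screen and in print. */
body { font-family: sans-serif; line-height: 1.4; }

/* Print-only rules: page geometry, page numbers, page breaks. */
@media print {
  @page {
    size: A4;
    margin: 20mm;
    @bottom-center { content: counter(page); }
  }
  nav { display: none; }      /* hide screen navigation in the PDF */
  h1 { break-before: page; }  /* start each chapter on a new page */
}
"""

HTML = """\
<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>Manual</title>
<link rel="stylesheet" href="manual.css"></head>
<body><nav>Home</nav><h1>Chapter 1</h1><p>Sample text.</p></body>
</html>
"""


def write_sample(target_dir: Path) -> tuple[Path, Path]:
    """Write the HTML file and its single stylesheet into target_dir."""
    html_path = target_dir / "manual.html"
    css_path = target_dir / "manual.css"
    html_path.write_text(HTML, encoding="utf-8")
    css_path.write_text(STYLESHEET, encoding="utf-8")
    return html_path, css_path


html_file, css_file = write_sample(Path(tempfile.mkdtemp()))
print(html_file, css_file)
```

A browser showing manual.html ignores the rules inside @media print, while a PDF renderer applies them, so the same CSS file serves both outputs.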
More design freedom with XSL-FO
The advantage of XSL-FO over CSS, however, is that the structure of the PDF file can differ from that of the XML source file. With the aid of the XSLT transformation, the structure of the PDF file can be adapted to individual needs. For example, a table of contents that is not described in the XML source file can be created automatically. This is not possible when using CSS Paged Media alone, as the structure of the HTML or XML source file is transferred to the PDF file unchanged. The CSS can only hide specific elements or insert manually defined text content. To adapt the structure more extensively, an additional XSLT transformation must be performed.
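To illustrate the kind of structural change that CSS cannot make, here is a minimal sketch of a pre-processing step that inserts a table of contents before rendering. It uses Python's standard library as a stand-in for the XSLT transformation described above. The sample markup, the "toc" class name, and the assumption that every heading carries an id attribute are mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML source without a table of contents.
SOURCE = """<html><body>
<h1 id="intro">Introduction</h1><p>First chapter.</p>
<h1 id="setup">Setup</h1><p>Second chapter.</p>
</body></html>"""


def add_table_of_contents(xhtml: str) -> str:
    """Insert a linked list of all h1 headings at the top of the body."""
    root = ET.fromstring(xhtml)
    body = root.find("body")
    toc = ET.Element("ol", {"class": "toc"})
    for heading in body.iter("h1"):
        item = ET.SubElement(toc, "li")
        link = ET.SubElement(item, "a", {"href": "#" + heading.get("id", "")})
        link.text = "".join(heading.itertext())
    body.insert(0, toc)  # structural change: new content before the first chapter
    return ET.tostring(root, encoding="unicode")


toc_html = add_table_of_contents(SOURCE)
print(toc_html)
```

Once the list exists in the source, CSS can style it. Renderers that support the target-counter() function can also add page numbers to the entries, but the list itself has to be created in a step like this one.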
Tools under test: Rendering HTML files to PDF files using CSS
But how does the technology differ from XSL-FO in practice? In my bachelor thesis I researched, tested, and compared several programs for rendering HTML files to PDF files via CSS. To do so, I derived criteria from the W3C specifications for CSS Paged Media and created a criteria list to provide a basis for comparing the tools.
I then created test files to have consistent sample content for testing. These test files implement the criteria I created beforehand. I then used the various programs to transform the test files in HTML to PDF files and compared the results.
The tested programs at a glance
I looked at both paid and open source tools during this study and tested the following programs:
Paid-for programs
- Antenna House Formatter
- Oxygen Chemistry
- pdfChip
- PDFreactor
- Prince XML
Open source programs
- paged.js
- Vivliostyle
- WeasyPrint
- wkhtmltopdf
Antenna House Formatter
Antenna House Formatter is a paid-for program from Antenna House Inc. which can transform XML and HTML files to PDF and other file formats using both XSL-FO and CSS Paged Media.
Oxygen Chemistry
Oxygen Chemistry is a paid-for program of the company Syncro Soft SRL, which also distributes the Oxygen XML Editor. Chemistry is a CSS Paged Media processor based on the open-source Apache FOP XSL-FO engine.
paged.js
paged.js is a freely available open source product. The project was set up by Adam Hyde and is currently maintained by the Coko Foundation. paged.js is a JavaScript library that renders HTML files in the browser and converts them to PDF. The project aims to implement the W3C's Paged Media standards.
pdfChip
pdfChip is a paid-for command line tool from callas software GmbH. The program can convert HTML files to PDF using CSS. pdfChip is said to support all HTML features and to offer more advanced functions, such as CMYK support, SVG, MathML, and more.
PDFreactor
PDFreactor is a paid-for PDF converter from RealObjects GmbH. PDFreactor converts HTML files to PDF and claims to offer broad support for, e.g., HTML5, CSS3, JavaScript, PHP, Python, and many more.
Prince XML
Prince XML is a paid-for product of YesLogic Pty Ltd. Using CSS Paged Media, Prince can convert XML and HTML files to PDF.
Vivliostyle
This open source program from the Vivliostyle Foundation offers Vivliostyle CLI, a command line tool that exports HTML files to PDF using CSS.
WeasyPrint
WeasyPrint is an open source product developed by Kozea and maintained by CourtBouillon. WeasyPrint is a visual rendering engine and can render HTML files to PDF using CSS.
wkhtmltopdf
wkhtmltopdf is an open source program. The command line tool renders HTML files to PDF using the Qt WebKit rendering engine. The project was started by Jakob Truelsen and is currently maintained by Ashish Kulkarni.
Results: How well did the programs in the test support CSS Paged Media?
Overall, the tested programs all offer good support for CSS Paged Media. Almost all programs can meet basic document formatting requirements. wkhtmltopdf is the exception: in this test it supports only bookmark creation. If more specific layout functions are needed, it is necessary to check more closely which program offers the necessary support. The table with the detailed test results of my bachelor thesis is a good guide (see footnote 1):
Please note that the results in this table refer to the corner cases I deliberately triggered in my bachelor thesis. From a practical point of view, a negative entry in the table therefore does not necessarily mean that the program cannot implement a basic layout requirement. There are often several ways of implementing such a requirement, but most programs do not support all available options. For example, based on my evaluation, WeasyPrint cannot convert colors defined in CMYK. But as WeasyPrint supports colors defined in RGB, the layout can still be adapted to individual needs (corporate design). In practice, this limitation should be negligible for most documents, as color values can be converted between the two color spaces quite effectively, with only small differences. So WeasyPrint supports the basic requirement of defining your own colors, but not the more detailed requirement of defining them in the CMYK color space.
Overall, the paid-for program Antenna House Formatter offers the best support, but also has a price to match, especially in the server version. The best open source program, WeasyPrint, comes very close to Antenna House Formatter in functionality and is even on par with the paid-for programs Prince XML and PDFreactor. In general, most of the tested open source products are a good alternative to paid-for programs. Only pdfChip and wkhtmltopdf receive a partial recommendation due to their limited functionality. The pricing structure of pdfChip and the page limit per document in some versions (e.g., pdfChip S: 25 pages/document) also count against using pdfChip.
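As a worked example of transferring a CMYK color into RGB, the following sketch uses the common naive conversion formula. This is an assumption on my part, not the method of any tested program: it ignores ICC color profiles, so a color-managed print workflow will produce somewhat different values. The sample corporate color is hypothetical.

```python
def cmyk_to_rgb(c: float, m: float, y: float, k: float) -> tuple[int, int, int]:
    """Naive CMYK to RGB conversion; inputs in 0..1, outputs in 0..255."""
    for value in (c, m, y, k):
        if not 0.0 <= value <= 1.0:
            raise ValueError("CMYK components must be between 0 and 1")
    # Each RGB channel is reduced by its complementary ink and by black.
    r = round(255 * (1 - c) * (1 - k))
    g = round(255 * (1 - m) * (1 - k))
    b = round(255 * (1 - y) * (1 - k))
    return r, g, b


def to_css_hex(rgb: tuple[int, int, int]) -> str:
    """Format an RGB triple as a CSS hex color."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical corporate color: 100% cyan, 50% magenta, 0% yellow, 20% black.
corporate_rgb = cmyk_to_rgb(1.0, 0.5, 0.0, 0.2)
print(to_css_hex(corporate_rgb))
```

The resulting hex value can be used in a stylesheet for a renderer that accepts only RGB colors.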
Conclusion
Formatting via XSL-FO and formatting via CSS Paged Media both have advantages and disadvantages, so ultimately it comes down to your individual circumstances and formatting needs.
For average document formatting needs, CSS Paged Media is a good alternative to XSL-FO. Knowledge of CSS is more widespread than knowledge of XSLT and XSL-FO, and the formatting effort is lower than with XSL-FO, especially for smaller documents. As the results of my bachelor thesis show, there are also enough programs on the market that support CSS Paged Media to an acceptable degree.
But if there is already an output path via XSL-FO or if there are significantly more specialized formatting requirements, CSS Paged Media is usually not an alternative.
Tony Graham’s (Antenna House) presentation at XML Prague 2022 gives a more up-to-date and advanced perspective on how the two technologies compare, along with the support that Antenna House in particular offers.
Footnotes
1 The programs were tested in September / October 2022. More recent versions of the tested programs may therefore behave differently from the test results.
Sources and further links
https://www.antennahouse.com/formatter-v7
https://www.oxygenxml.com/doc/versions/21.1/ug-chemistry/topics/ch_getting_started.html
https://www.callassoftware.com/en/products/pdfchip
https://doc.courtbouillon.org/weasyprint/stable
https://wkhtmltopdf.org/index.html
https://archive.xmlprague.cz/2022/files/xmlprague-2022-proceedings.pdf
https://archive.xmlprague.cz/2022/files/presentations/xsl-fo-css-comparison.pdf
https://www.antennahouse.com/hubfs/PDFS/XSL%20CSS%20Comparison/xsl-fo-css-comparison.css.pdf