Character Encoding

9 January 2013

What are those NUL bytes doing in the text file?

Some people have reported that when they process the PBS XML using Microsoft's MSXSL.EXE processor, the resulting text file has many extra NUL bytes.

This is due to the character encoding of the output file. Every XSLT processor writes its output in some character encoding, and that includes plain text output.

When an XSLT processor creates an output file, it uses the default character encoding for the platform. This may be overridden by the XSL stylesheet (using the encoding attribute of the xsl:output element). A processor may also provide its own means of changing the default (e.g. a command-line parameter).

In the case of MSXSL.EXE running on Microsoft Windows, the default character encoding is UTF-16. UTF-16 uses at least 2 bytes (16 bits) for every character. For a character in the 7-bit ASCII repertoire, one of those two bytes holds the ASCII value and the other is a NUL byte (0x00). A file consisting mostly of ASCII text therefore appears to have a NUL byte next to every character.
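The NUL bytes can be seen outside of any XSLT processor. The following is a minimal Python sketch, not part of the PBS tooling, that encodes the same short string in UTF-8 and in UTF-16 and prints the resulting bytes. The sample string and the choice of little-endian byte order without a byte order mark are assumptions made only for illustration.

```python
# Sample text chosen only for illustration; any ASCII string behaves the same way.
text = "PBS"

utf8_bytes = text.encode("utf-8")
# Little-endian UTF-16 without a byte order mark (an assumption for this sketch).
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes)                   # b'PBS'
print(utf16_bytes)                  # b'P\x00B\x00S\x00'
print(utf16_bytes.count(b"\x00"))   # 3, one NUL byte per ASCII character
```

Each ASCII character takes one byte in UTF-8 and two bytes in UTF-16, and the second byte of each pair is the NUL that shows up in the text file.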

Solution

The XSL stylesheets provided on this site do not explicitly set the character encoding of the result document. This is because, in most cases, the XSLT processor will automatically select the appropriate character encoding for the platform.

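To force the processor to use a particular character encoding, a simple solution is to write a small XSL stylesheet that imports the original XSL stylesheet, using the xsl:import element, and sets the character encoding using the encoding attribute of the xsl:output element. Because the importing stylesheet has higher import precedence than the stylesheet it imports, its xsl:output settings take effect.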

For example:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="drug.xsl"/>
  <xsl:output method="text" encoding="utf-8"/>
</xsl:stylesheet>
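
One way to confirm that the wrapper stylesheet has had the intended effect is to count the NUL bytes in the result file. The sketch below is a hedged illustration in Python rather than anything supplied with the stylesheets; the function name count_nul_bytes and the file name drug.txt are hypothetical.

```python
def count_nul_bytes(path):
    """Return the number of NUL (0x00) bytes in the file at `path`."""
    # Open in binary mode so the bytes are counted exactly as stored on disk.
    with open(path, "rb") as f:
        return f.read().count(b"\x00")


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python count_nul.py drug.txt
    for name in sys.argv[1:]:
        print(name, count_nul_bytes(name))
```

An ASCII-only text file written as UTF-8 should report zero, while the same text written as UTF-16 reports roughly one NUL byte per character.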