Contents | Start | End | Previous: KB0001: How to adapt Jutoh to your language | Next: KB0003: How to improve document formatting


KB0002: How to work with encodings

Importing files

When importing text files, Jutoh needs to know which encoding your files use, since otherwise the files are just streams of bytes that could represent anything. A 'standard' ASCII file can only represent the basic symbols, whereas files encoded in Unicode can represent most symbols in use on the planet. Jutoh's favoured Unicode encoding is UTF-8, in which plain ASCII text is encoded with one byte per character (and so is readable in any text editor) and more complex symbols are represented by two to four bytes.
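To make the variable-width nature of UTF-8 concrete, here is a minimal Python sketch (Python is used purely for illustration and has nothing to do with Jutoh itself) that counts how many bytes a few sample characters occupy when encoded as UTF-8. The sample characters are arbitrary choices.

```python
# Sample characters, written as escapes so this script is itself plain ASCII.
samples = [
    "A",           # U+0041, basic ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Map each character to the number of bytes in its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```

The ASCII letter takes one byte, while the other three take two, three and four bytes respectively, which is why a text editor that assumes the wrong encoding shows accented letters and symbols as several odd characters.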

When you save text from a word processor, you need to make sure it's going to write using the encoding that you've specified in the New Project Wizard or Project Properties. For example, when saving a document as plain text from Microsoft Word, Word will show you a further dialog. Click on "Other encoding" and select "Unicode (UTF-8)". Don't check "Insert line breaks", since you want each paragraph to be one line.

If you forget to save in the right encoding, you may be able to fix it, as follows. When you get a Jutoh error indicating an encoding problem (or the file doesn't show properly in the finished book or the editor), open the file in an encoding-savvy application such as Programmer's Notepad. It should auto-detect the encoding, which you can check by pressing Alt+Enter to see the document's properties. Now select the whole document and copy it to the clipboard. Create a new file, change the encoding to UTF-8 in the document's properties, and paste the text into it. Save this over the original file, which is then in the correct encoding.
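If you are comfortable with scripting, the same repair can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name reencode_to_utf8 is hypothetical, and the sketch assumes you already know (or can guess) the file's current encoding, here defaulting to Windows-1252. Unlike an editor, it does not auto-detect anything beyond checking whether the file is already valid UTF-8.

```python
from pathlib import Path

def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite the file at `path` as UTF-8.

    Assumes the file is currently in `source_encoding` (a guess you supply).
    Returns True if the file was rewritten, False if it was already valid UTF-8.
    """
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already decodes as UTF-8, so leave it untouched
    except UnicodeDecodeError:
        pass
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Write bytes directly so that line endings are preserved exactly.
    Path(path).write_bytes(text.encode("utf-8"))
    return True
```

Work on a copy of the file, since the original is overwritten. A wrong guess for source_encoding can either raise an error or silently produce the wrong characters, so check the result in an editor before importing it.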

If you import an HTML file that doesn't specify an encoding, then Jutoh will warn you, and use the fallback encoding specified in the New Project Wizard or in the Options page of the Project Properties dialog. If the file doesn't import as it should, you can try reimporting with a different encoding. For example, an HTML file created on Windows might be in the encoding "Windows Western European (CP 1252)". Another way to deal with the problem is to add the meta encoding declaration to the HTML file, for example:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=Windows-1252" />
<title>My Title</title>
</head>
...
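
Before importing a batch of HTML files, it can be handy to check which of them already declare an encoding. The following Python sketch shows one way to do that. The name declared_charset is hypothetical, the regular expression is a simplification rather than a full HTML parser, and it is not a description of how Jutoh itself reads the declaration.

```python
import re

# Matches charset=... inside a <meta> tag, covering both the http-equiv
# form shown above and the shorter <meta charset="..."> form.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)

def declared_charset(html_bytes):
    """Return the charset named in a <meta> tag, or None if none is found."""
    # Only the start of the file is searched; declarations belong in <head>.
    match = META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii") if match else None

with_meta = b'<html><head><meta http-equiv="content-type" content="text/html; charset=Windows-1252" /></head>'
without_meta = b"<html><head><title>My Title</title></head>"

print(declared_charset(with_meta))
print(declared_charset(without_meta))
```

Files for which this returns None are the ones that would fall back to the project's fallback encoding on import.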

Compiling files

Jutoh outputs text and HTML in UTF-8 by default, which is pretty much a universal solution for all the characters you're likely to need in your book.

If you need to change this, you can edit the configuration option Content encoding. When the output encoding is not a UTF encoding, HTML output writes non-ASCII characters, such as curly quotation marks and the euro symbol, as symbolic entities where possible. This avoids problems when converting to an encoding that doesn't support characters outside the ASCII range.
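
To illustrate what replacing characters with symbolic entities means, here is a minimal Python sketch. It is not Jutoh's actual implementation: the function name to_ascii_with_entities is hypothetical, and the fallback to numeric references for characters that have no named entity is an assumption made for this example.

```python
from html.entities import codepoint2name

def to_ascii_with_entities(text):
    """Return `text` with every non-ASCII character replaced by an HTML entity.

    Named (symbolic) entities are used where one exists in the HTML 4 set;
    otherwise a numeric character reference is used (an assumption of this sketch).
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &euro;
        else:
            out.append(f"&#{cp};")                 # numeric fallback
    return "".join(out)

# Curly quotation marks and the euro symbol, written as escapes.
sample = "\u201cHi\u201d \u20ac5"
print(to_ascii_with_entities(sample))  # prints: &ldquo;Hi&rdquo; &euro;5
```

Because the result contains only ASCII characters, it survives conversion to any encoding that includes the ASCII range.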

