Contents | Start | End | Previous: Appendix D: The Bibliographic Formatting Language | Next: Appendix F: Speech Profile Reference

Appendix E: Speech Markup Reference

This topic describes the properties you can use to enhance text-to-speech through markup in content using speech, span, div, pause or other objects, or via advanced paragraph or style properties.

Property reference

To understand which properties can be used with a given speech engine, check the description of the selected speech engine in Preferences/Speech. This will indicate whether SSML is supported; SAPI XML is supported when you select Microsoft SAPI, and Apple Speech Manager text is supported when you select Apple Speech Manager. Note that not all SSML properties are supported by all voices and SSML speech engines.

Most of the time you will be marking up text within a paragraph, and not across paragraphs. If you wish a single object to span several paragraphs, ensure that the start and end objects are on lines of their own. This will ensure that Jutoh can emit correct, and not overlapping, markup.

Alias

Supported for: SSML, SAPI, Apple, text

Provides an alternative natural language pronunciation, equivalent to the SSML ‘sub’ element. Non-SSML implementations are provided by simply substituting the provided text. The specified property value is natural language text that is used instead of the text enclosed by the speech object. To specify a phonetic pronunciation, use the Phoneme property.
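As a rough illustration of the SSML ‘sub’ element that this property corresponds to, the following Python sketch builds one with the standard library. The function name ssml_sub is hypothetical, and this is not necessarily the exact markup Jutoh emits.

```python
import xml.etree.ElementTree as ET

def ssml_sub(text: str, alias: str) -> str:
    """Return an SSML <sub> element that speaks `alias` in place of `text`.

    Hypothetical helper for illustration only; not part of Jutoh.
    """
    element = ET.Element("sub", {"alias": alias})
    element.text = text
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(element, encoding="unicode")

print(ssml_sub("W3C", "World Wide Web Consortium"))
# prints: <sub alias="World Wide Web Consortium">W3C</sub>
```

For engines without SSML support, the equivalent is simply to speak the alias string instead of the enclosed text.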

Note that this property cannot be supported for Epub 3 because there is no corresponding property or attribute in the Epub 3 specification, and replacing the text in advance of rendering it would remove content that is valid for a visual document.

Alphabet

Supported for: Epub 3, SSML, SAPI, Apple

This specifies the alphabet used in the Phoneme property; examples are x-microsoft-sapi, x-sampa, ipa. Note that when generating for SAPI, the only legitimate value is x-microsoft-sapi. For the Apple Speech Manager, the alphabet must be apple. You can also use a comma-separated list of alphabets as explained in the Phoneme description.

Content after

Supported for: All formats

Inserts text after the content at this point. Note that this affects all formats, including Epub, Kindle and ODT; to make it format-specific when used in styles, create multiple style sheets with different style properties and specify the style sheet to use in the configuration option Style sheet.

Content before

Supported for: All formats

Inserts text in front of the content at this point. Note that this affects all formats, including Epub, Kindle and ODT; to make it format-specific when used in styles, create multiple style sheets with different style properties and specify the style sheet to use in the configuration option Style sheet.

Cue

Supported for: Epub 3, SSML

Provides the ability to uniquely identify elements with an aural sound. For example, “url('audio/bong.mp3')”. The value consists of either one item or two space-separated items; supply an extra value ‘null’ to avoid repetition of the value after the element.

Language

Supported for: Epub 3, SSML, SAPI

An optional language abbreviation, such as en or en-GB, to give a hint to the speech system about what language is in use for the content enclosed in the object.

Pause

Supported for: Epub 3, SSML, SAPI, Apple

The amount of pause in milliseconds that occurs before and after the element that it is applied to. For example, “50ms” (50ms before and after) or “50ms 0ms” (50ms before and no pause after). Use null to specify that one of the values should be absent. You can use a pause object for this, or a speech or span object, or you can use it in a style.
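The one-or-two-value form described above can be read as follows. This Python sketch is only an interpretation of the documented syntax; the function parse_pause is hypothetical and Jutoh's own parsing may differ in details such as accepted units.

```python
from typing import Optional, Tuple

def parse_pause(value: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a Pause value into (before_ms, after_ms).

    A single item applies both before and after; 'null' means absent.
    Assumption: only the 'ms' unit is handled in this sketch.
    """
    items = value.split()
    if not 1 <= len(items) <= 2:
        raise ValueError("expected one or two space-separated items")

    def to_ms(item: str) -> Optional[int]:
        if item == "null":
            return None
        if not item.endswith("ms"):
            raise ValueError(f"unsupported unit in {item!r}")
        return int(item[:-2])

    before = to_ms(items[0])
    after = to_ms(items[1]) if len(items) == 2 else before
    return before, after

print(parse_pause("50ms"))       # (50, 50)
print(parse_pause("50ms 0ms"))   # (50, 0)
print(parse_pause("null 20ms"))  # (None, 20)
```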

Phoneme

Supported for: Epub 3, SSML, SAPI, Apple

Specifies a phonetic pronunciation that should be used instead of the text contained within the object. You must also provide a value for the Alphabet property. Note that for SAPI, the Alphabet property must be x-microsoft-sapi, or the Phoneme property will be ignored (except when using CereVoice: see below). To use a natural language equivalent of the content, such as the expansion of an abbreviation, use the Alias property.

In order to cater for different alphabets in the same object, you can use a comma-separated list of phonemes in the Phoneme property, and a corresponding list of alphabets in the Alphabet property. Then, specify the valid alphabets in the Lexicon alphabets speech profile property as a comma-separated list – these can contain wildcards (‘*’, ‘?’) to match multiple alphabets. Jutoh will filter out the invalid alphabets for this configuration and speech profile. For text-to-speech output, Jutoh will also take into account the current speech engine to filter out further invalid alphabets. The first of the remaining alphabets (if any) will be used.
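The selection rule described above (filter by the valid alphabets, then take the first survivor) might be sketched in Python like this. The function pick_phoneme and the placeholder phoneme strings are hypothetical, the additional per-engine filtering is omitted, and the sketch assumes the phoneme strings themselves contain no commas.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

def pick_phoneme(phonemes: str, alphabets: str,
                 valid_patterns: str) -> Optional[Tuple[str, str]]:
    """Return the first (alphabet, phoneme) pair whose alphabet matches
    one of the comma-separated wildcard patterns, or None if none match.
    """
    phoneme_list = [p.strip() for p in phonemes.split(",")]
    alphabet_list = [a.strip() for a in alphabets.split(",")]
    patterns = [p.strip() for p in valid_patterns.split(",") if p.strip()]

    # Pair each phoneme with the alphabet at the same list position.
    for phoneme, alphabet in zip(phoneme_list, alphabet_list):
        if any(fnmatchcase(alphabet, pattern) for pattern in patterns):
            return alphabet, phoneme
    return None

# "PHON-A" and "PHON-B" stand in for real phoneme strings.
print(pick_phoneme("PHON-A, PHON-B", "x-microsoft-sapi, x-sampa", "x-sam*, ipa"))
# ('x-sampa', 'PHON-B')
```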

CereVoice: when using CereVoice with SAPI, you can specify either x-microsoft-sapi as the alphabet, or a name that includes the word ‘cerevoice’. In this case, instead of using the ‘pron’ tag, the ‘lex’ tag will be generated with CereVoice phonemes.

Rest

Supported for: Epub 3, SSML, SAPI, Apple

Controls the pause that occurs between any aural cues and the rendering of the associated element. For example, “25ms 0ms”. Use null to specify that one of the values should be absent. You can use a pause object for this, or a speech or span object, or you can use it in a style.

Say as

Supported for: SSML, SAPI, Apple (partial)

This property uses SSML conventions to specify how the content of the element is spoken. It is more powerful than the Speak as property, which can be used for Epub 3.

Unfortunately, this feature is supported only patchily by speech synthesisers. Most seem to ignore the markup and attempt to parse the text, which may or may not yield a correct result. So you cannot rely on the correct pronunciation of dates and other values. If in doubt, use the Alias property to substitute text that will be spoken correctly, for example “20 15” instead of “2015”.

You can use Say as format and Say as detail with this property. Values include number, cardinal, ordinal, characters, spell-out, digits, fraction, unit, date, time, telephone, address, vxml:boolean, vxml:date, vxml:currency, and vxml:phone. See also www.w3.org/TR/speech-synthesis11 for details.
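In SSML, these three values correspond to the interpret-as, format and detail attributes of the say-as element. The Python sketch below builds such an element; the helper name ssml_say_as is hypothetical, and the mapping from Jutoh's properties to these attributes is an assumption based on the SSML specification rather than on Jutoh's actual output.

```python
import xml.etree.ElementTree as ET
from typing import Optional

def ssml_say_as(text: str, interpret_as: str,
                fmt: Optional[str] = None,
                detail: Optional[str] = None) -> str:
    """Build an SSML <say-as> element (illustrative helper, not Jutoh code)."""
    attributes = {"interpret-as": interpret_as}
    # The optional attributes are only written when a value is supplied.
    if fmt is not None:
        attributes["format"] = fmt
    if detail is not None:
        attributes["detail"] = detail
    element = ET.Element("say-as", attributes)
    element.text = text
    return ET.tostring(element, encoding="unicode")

print(ssml_say_as("2015", "digits"))
# prints: <say-as interpret-as="digits">2015</say-as>
```

Whether a given synthesiser honours the element is a separate matter, as noted above.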

When outputting to SAPI, most SAPI context ids are handled by converting SSML values, except for the context ids web_url, e-mail_address, and number_decimal.

When outputting to the Apple Speech Manager, only the values digits and characters can be used.

Say as detail

Supported for: SSML, SAPI

This property is used with Say as.

If Say as is number, this property’s value can be one of ordinal, cardinal, and telephone. If Say as is date, this property’s value can be one of mdy, dmy, ymd, md, dm, ym, my, d, m, and y. If Say as is time, this property’s value can be one of hms24 and hms12. If Say as is characters, this property’s value can be one of glyphs and characters.
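The combinations listed above can be captured as a small lookup table, for example to check markup before generating speech. This Python sketch simply restates the list; the names SAY_AS_DETAILS and detail_is_listed are hypothetical.

```python
# Say as value -> Say as detail values listed in this reference.
SAY_AS_DETAILS = {
    "number": {"ordinal", "cardinal", "telephone"},
    "date": {"mdy", "dmy", "ymd", "md", "dm", "ym", "my", "d", "m", "y"},
    "time": {"hms24", "hms12"},
    "characters": {"glyphs", "characters"},
}

def detail_is_listed(say_as: str, detail: str) -> bool:
    """Return True if `detail` is listed for the given Say as value."""
    return detail in SAY_AS_DETAILS.get(say_as, set())

print(detail_is_listed("date", "dmy"))    # True
print(detail_is_listed("time", "dmy"))    # False
```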

Say as format

Supported for: SSML

This property is used with Say as. The value is specific to the speech synthesizer.

Sentence

Supported for: SSML

This marks the content as containing a sentence, with a suitable pause at the end of the sentence. This is not generally needed since punctuation is interpreted as signifying a pause, but occasionally you may need to disambiguate content with unusual or missing punctuation. To support more platforms, you can use a pause instead of marking content as a sentence.

Speak as

Supported for: Epub 3, SSML (partial), SAPI (partial)

Possible values are normal, spell-out, digits, literal-punctuation, and no-punctuation. For SSML, only spell-out and digits are supported. For SAPI, only spell-out and digits are supported, and they both translate to the spell element.

Vocal gesture

Supported for: SSML, SAPI

This provides gestures such as laughs, coughs and ‘hmm’ sounds. The available values are shown in the drop-down list when you edit the value. There must be content contained within this element, which is replaced by the gesture. This property is currently only supported by CereVoice.

Voice emphasis

Supported for: SSML, SAPI, Apple

Manipulates the strength of emphasis, using a combination of pitch and timing changes, loudness and other acoustic features. The value can be one of normal, none, reduced, moderate, and strong. Using none prevents the synthesizer from emphasising words that it might normally emphasise. The value normal leaves the output unaffected. For SAPI, there is just one strength of emphasis.

Voice emotion

Supported for: SSML, SAPI

This provides a small variation in the content prosody and emphasis, currently for CereVoice voices only. Values are happy, sad, calm and cross.

Voice family

Supported for: Epub 3, SSML, SAPI

This attribute uses Epub 3 syntax to specify a change of voice. You can provide three space-separated values in the form “age gender number” or just age or gender. The age may be child, young, or old, and the gender can be male, female or neutral. The number represents an index into available matching voices. Alternatively, you can use a specific voice name in single quotes, such as 'Heather'.

When supplying a specific voice name, you can use wildcards, for example '*Heather*'. This will be matched against the available voices for the current speech engine, so that there is a better chance of matching the voice you want even if you switch speech engines. Even if you don’t supply wildcards, Jutoh will still try to find a voice using substring matching if none of the available voices matches the exact specified name.
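The matching behaviour described above could be approximated as follows. This is a sketch under stated assumptions: the function find_voice and the voice names are hypothetical, matching is treated as case-sensitive, and Jutoh's actual algorithm may differ.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

def find_voice(spec: str, available: List[str]) -> Optional[str]:
    """Resolve a quoted voice name against the available voices.

    First try an exact or wildcard match, then fall back to substring
    matching if nothing matched.
    """
    name = spec.strip().strip("'")

    # Without wildcards, fnmatchcase behaves as an exact comparison.
    for voice in available:
        if fnmatchcase(voice, name):
            return voice

    # Fallback: accept any voice whose name contains the requested text.
    for voice in available:
        if name in voice:
            return voice
    return None

voices = ["Example Voice Anna", "Example Voice Heather (Scottish)"]  # made-up names
print(find_voice("'*Heather*'", voices))  # Example Voice Heather (Scottish)
print(find_voice("'Heather'", voices))    # Example Voice Heather (Scottish)
```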

Not all text-to-speech implementations support specification of age and gender.

You can also change the overall voice via the speech profile.

Voice pitch

Supported for: SSML, SAPI

The value for this property can be one of x-low, low, medium, high, and x-high.

For SSML, this can also be an absolute or relative value (preceded with - or +) with a Hz or % suffix. SAPI will use a rough equivalent of a relative or absolute percentage, but the Hz form is ignored.

Voice pitch contour

Supported for: SSML

A set of targets at specified time positions in the speech output, conforming to the SSML definition of pitch contour in the prosody element. Each target is a pair (time position,pitch). Example: (0%,+20Hz) (10%,+30%) (40%,+10Hz). In SSML, the pitch values can use absolute or relative Hz, but this is not supported by the Jutoh pitch contour editor.
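To make the target syntax concrete, here is a minimal Python sketch that splits a contour string into (time position, pitch) pairs. The function parse_pitch_contour is hypothetical and does no validation of the individual values.

```python
import re
from typing import List, Tuple

# One target: "(" time position "," pitch ")", with optional whitespace.
_TARGET = re.compile(r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")

def parse_pitch_contour(value: str) -> List[Tuple[str, str]]:
    """Return the (time position, pitch) pairs found in a contour string."""
    return [(position, pitch) for position, pitch in _TARGET.findall(value)]

print(parse_pitch_contour("(0%,+20Hz) (10%,+30%) (40%,+10Hz)"))
# [('0%', '+20Hz'), ('10%', '+30%'), ('40%', '+10Hz')]
```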

Voice rate

Supported for: SSML, SAPI

Manipulates the rate of generated synthetic speech in terms of words per minute. Specify a keyword from normal, x-slow, slow, medium, fast, x-fast, or a percentage, where 100% is the normal rate.

Voice variant

Supported for: SSML

This is supported by CereVoice only. It selects a different version of the synthesis for the contained content, generating a usel tag. The value is an integer; a value of 0 represents the original version. If the wrong pronunciation is being used and you are generating SSML with the CereVoice speech engine, you can often correct it by using a suitable value such as 1 or 2.

Voice volume

Supported for: SSML, SAPI

The volume value can be one of normal, default, silent, x-soft, soft, medium, loud, x-loud. For SSML (only), you can also supply a relative dB value.

Unsupported SSML features

The following SSML 1.1 elements are not currently supported by Jutoh: audio, desc, lookup, token, w. In addition, not all of the finer details of SSML elements are supported. Please ask if you need further SSML support.
