KompoZer User Guide - Appendix 6
Encoding refers to the details of how the characters in the source file for a web page are coded for transmission over the web. For the most part the author can leave all the details to be handled by KompoZer or the browser. This leaves a few options available to the author; these generally provide a means of optimising a file, and only in rare cases do they affect functionality.
KompoZer defaults to using ISO-8859-1 encoding with the following settings. You may check or reset this as follows.
These settings are completely suitable for pages using English and adequate, though not necessarily optimised, for most other languages used in Western and Northern Europe.
For another European language, and for some other languages, a different selection from ISO-8859 may be preferable. KompoZer offers the full available range. Wikipedia [Ref 14] has a useful article detailing the coverage.
|Printable ASCII and Latin 1 characters|
|Hex code for character - msd in row lsd in column|
The characters sp (space), nbsp (no-break space) and shy (soft hyphen) are printable but (normally) invisible.
Early computers used the ASCII (American Standard Code for Information Interchange) which provides a set of 95 printable characters dating from the teleprinter era. An eight bit byte however allows a doubling of this number (while reserving a number of codes for control purposes) and gives rise to the Latin-1 set illustrated in table A6.3-1. The row and column headings indicate the more and less significant parts of the code (in hexadecimal) corresponding to each character. For instance, the code for character 'A' is 41.
Latin-1 corresponds to the ISO-8859-1 set which is sufficient for web pages in English and many other western European languages. Include the appropriate code in a file and the corresponding character will appear.
The needs of many languages, European and other, can be satisfied by similar sets of characters; all share the ASCII characters and substitute others in the remaining positions. This gives rise to 15 standards in the ISO-8859 series. The languages supported by each encoding, along with the list of characters, can be found in the Wikipedia article [Ref 9] referred to above.
To implement this it is clear that more than 256 characters are needed although only 256 locations (less control positions) are available to address them. The characters required to satisfy all in the series are drawn from a much larger set.
The Unicode Consortium [Ref 17] have standardised a universal character set (UCS), i.e. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.
Unicode (as the UCS is commonly referred to) can access over a million characters of which about 100,000 have already been defined. These include characters for all the world's main languages along with a selection of symbols for various purposes.
HTML specifies a Document Character Set which is a list of the character repertoire available along with the corresponding code points (sometimes referred to as code positions). For HTML (and XHTML) the Document Character Set is identical to the UCS which means that, in principle, any character in the UCS may be used in any HTML document. In practice support for the complete character range is uneconomic and systems provide support for subsets only.
ASCII and Greek characters
Using ISO-8859-7 encoding
|Hex code for character - msd in row lsd in column|
Character Encoding, at its simplest, refers to the process whereby the codes for the characters are mapped to the code points for the Unicode characters appropriate to the language in use. In the case of ISO-8859-1 the character codes are mapped to identical Unicode code points. (The first 256 Unicode characters being the same as the Latin-1 set.) As another example, ISO-8859-7 encodes Greek characters displacing many from the Latin-1 set to make room. (Compare table A6.3-2 to table A6.3-1.) In this case the code EA instead of being mapped to Unicode code point EA (giving e circumflex ê) is mapped to code point 03BA which returns a small kappa κ. In fact ISO-8859-7 does not include the ê character.
All ISO-8859 encodings retain the ASCII characters at the original positions.
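The mapping described above can be checked with a short script. The following is a minimal sketch in Python (not part of KompoZer), assuming that Python's standard codecs follow the published ISO-8859 tables. It decodes the same byte value, EA, under the two encodings and confirms that the ASCII positions are shared.

```python
# Decode the single byte 0xEA under two ISO-8859 encodings.
raw = bytes([0xEA])

latin1_char = raw.decode("iso-8859-1")  # identical code point, U+00EA
greek_char = raw.decode("iso-8859-7")   # Greek small kappa, U+03BA

print(f"ISO-8859-1: U+{ord(latin1_char):04X}")
print(f"ISO-8859-7: U+{ord(greek_char):04X}")

# ASCII positions are the same in both encodings (0x41 is 'A' in each).
ascii_same = bytes([0x41]).decode("iso-8859-1") == bytes([0x41]).decode("iso-8859-7")

# ISO-8859-7 has no position for e circumflex, so encoding it fails.
try:
    "\u00ea".encode("iso-8859-7")
    e_circumflex_in_greek = True
except UnicodeEncodeError:
    e_circumflex_in_greek = False
```

The variable names are illustrative only; the point is that one byte value yields different characters depending on the encoding declared for the page.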
This document uses a different encoding but, in spite of this, has no difficulty in representing the full repertoire of the Greek characters covered by ISO-8859-7, as can be seen in the table. How this is achieved is explained in the next section.
Authors should note that every page uses one character encoding, and one only, irrespective of the number or range of languages encountered on a page.
In HTML pages character encoding is specified using the 'charset' parameter in the head area for each page. Several options are permissible but KompoZer always uses
http-equiv= "content-type" content= "text/html;
Note 'charset', in spite of its name, does not specify a character set. The character set for HTML documents is always the UCS. 'charset' specifies the encoding.
|Character||Entity||Numeric character reference|
|no-break space|| || || |
An ISO-8859 encoding uses a single byte per character to represent all the characters commonly expected in a language but clearly there may be a need to represent uncommon characters. HTML provides two mechanisms: character entity references and numeric character references. Using these methods any character in the UCS may be reached by using a sequence of ASCII characters to point to the required character. Entities take the form &euro; and numeric references the form &#8364; or &#x20AC;, all representing the euro symbol (€). The numbers represent the Unicode code point for the symbol in decimal and hexadecimal notation.
These methods free the author to employ Unicode characters, irrespective of the encoding in use, at the expense of increasing file size. Where such use is limited this is inconsequential.
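As an illustration of the two mechanisms, the sketch below uses Python's standard html module (an assumption of convenience; any HTML parser would serve) to resolve an entity and the decimal and hexadecimal numeric references for the euro sign. All three are plain ASCII in the source yet resolve to the same Unicode character.

```python
import html

# Three ways of writing the euro sign using only ASCII characters.
references = ["&euro;", "&#8364;", "&#x20AC;"]

# Resolve each reference to the character it stands for.
decoded = [html.unescape(ref) for ref in references]

# The decimal and hexadecimal numbers name the same code point, U+20AC.
code_points = [ord(ch) for ch in decoded]
print([f"U+{cp:04X}" for cp in code_points])
```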
The list of entities is included at section 24 of the HTML specification [Ref 16]. About 250 are defined; numeric character references must be used for characters outside this range. Characters do not have to be out of range of the encoding for entity references to be provided, as is clear from Table A6.3-3, which lists some of the most frequently used including some in the ASCII set.
Entities are case sensitive; thus &Eacute; represents upper case E with an acute accent (É) while &eacute; represents the corresponding lower case letter (é). &EacutE; does not represent anything (&EacutE;). (The error just gets displayed as typed.)
Irrespective of the ISO-8859 encoding employed, the entity or numeric reference to be input remains the same. So, although in ISO-8859-7 the euro symbol is represented as byte A4, entering a numeric character reference for code A4 produces a ¤ symbol, not a euro symbol. The code to be input is the entity or numeric character reference for the character required.
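The difference between a byte value and a character reference can be sketched as follows. This assumes that Python's iso-8859-7 codec follows the revision of the standard that places the euro sign at A4; the variable names are illustrative.

```python
import html

# In ISO-8859-7 the byte A4 stands for the euro sign...
byte_a4_as_greek = bytes([0xA4]).decode("iso-8859-7")

# ...but a numeric character reference always names a Unicode code point,
# so a reference to A4 gives U+00A4, the currency sign.
reference_a4 = html.unescape("&#xA4;")

# The euro must be requested by its own code point (or by its entity).
euro = html.unescape("&#x20AC;")

print(f"byte A4 in ISO-8859-7: U+{ord(byte_a4_as_greek):04X}")
print(f"reference to A4:       U+{ord(reference_a4):04X}")
```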
ISO-8859 is fine when using one language at a time but becomes clumsy and slow when languages are mixed. UTF coding releases us from this restriction and provides a mechanism for addressing the full range of Unicode characters quickly. KompoZer allows coding in either UTF-8, UTF-16 or UTF-32 formats which are based on units of 8, 16 or 32 bits respectively. UTF-32 is not usually used for coding web pages.
UTF-8 uses 1 to 4 bytes to represent a character. It uses 1 byte to represent characters in the ASCII set, two bytes for the next 1920 characters (including the Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters) and three bytes for the rest of the roughly 65,000 characters in the Basic Multilingual Plane (BMP). Supplementary characters use 4 bytes.
UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
UTF-32 uses 4 bytes for all characters.
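The byte counts quoted for the three UTF formats can be verified with a few lines of Python. This is only a sketch; the sample characters and the helper name byte_lengths are chosen for illustration.

```python
# Sample characters: ASCII, a Latin letter with a diacritic, a BMP symbol,
# and a supplementary character (outside the BMP).
samples = {
    "A": "A",                # U+0041
    "e-acute": "\u00e9",     # U+00E9
    "euro": "\u20ac",        # U+20AC
    "g-clef": "\U0001d11e",  # U+1D11E, supplementary
}

def byte_lengths(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one character."""
    # The explicit little-endian variants are used so that no byte order
    # mark is added to the count.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for name, ch in samples.items():
    print(name, byte_lengths(ch))
```

For a page that is mostly ASCII markup, the first column shows why UTF-8 usually gives the smallest file.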
KompoZer offers additional practical advantages when UTF-8 encoding is used for pages with multiple languages. Irrespective of the encoding in use, in Normal or Preview mode KompoZer depicts all characters correctly using the corresponding 'glyph'. In source view, encodings which require characters to be represented as character references show the character reference, which reduces readability significantly. With UTF encoding all characters can be represented as glyphs so the problem is avoided.
Traditionally computers have relied on special fonts like 'Symbol' or 'Wingdings' to produce symbols. This is not necessary on web pages. Since such fonts do not support Unicode any attempt to use them will yield unreliable results which may vary from browser to browser.
Fortunately Unicode supports a large range of symbols which fulfills many needs.
Inputting special symbols
|Left single quote||‘||alt+0145||U+2018|
|Right single quote||’||alt+0146||U+2019|
|Left double quote||“||alt+0147||U+201C|
|Right double quote||”||alt+0148||U+201D|
There are several ways of inserting symbols into a page using KompoZer.
€ > <
The first four methods can be used in normal or preview mode, the last two in source view.
The keen-eyed may observe that the key codes are neither the character codes nor the Unicode code points for the character required. In fact they are the (decimal) character codes derived from Windows-1252 encoding. KompoZer will convert these to whatever code is appropriate depending on the encoding selected and the character involved. (So the key code never appears in the source.) Incidentally Windows-1252 [Ref 13] encoding is a possible alternative to ISO-8859-1 suitable for western languages. It increases the number of available character codes to 218 characters by re-allocating some of the codes in the range 80 to 9F which are normally unused.
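The relationship between the key codes and the Unicode code points in the table can be confirmed by treating each key code as a Windows-1252 byte value. The sketch below uses Python's cp1252 codec for this; it illustrates the mapping only and says nothing about how KompoZer performs the conversion internally.

```python
# Alt key codes from the table, interpreted as Windows-1252 byte values.
alt_codes = [145, 146, 147, 148]

# Decode each byte with the Windows-1252 codec and record its code point.
code_points = [ord(bytes([code]).decode("cp1252")) for code in alt_codes]

for code, cp in zip(alt_codes, code_points):
    print(f"alt+0{code} -> U+{cp:04X}")
```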
Alan Wood's website [Ref 2] is a useful resource listing entities (where defined) and numeric character references for a large number of characters from the Monotype Typography Symbol font (as on Windows XP) including Greek, Mathematical and Punctuation [Ref 6] and also the Microsoft Wingdings font [Ref 7]. (For Wingdings in several cases there is no Unicode equivalent.)
Although Unicode offers tremendous potential the usual caveats apply when choosing fonts. See for instance section 188.8.131.52 but, in this case, it is important to check that all fonts in the list include the characters required. No font covers the full range of Unicode, or even a small single digit percentage of it. To check the supported Unicode ranges of a font Microsoft supply an extension [Ref 12] for Windows Explorer. With it installed, right-click any TrueType (TTF) font file in Windows Explorer and select the Properties tab. Particular characters can be searched for using Character map.
Checking for support is more than usually difficult if unusual characters are required. Compatible fonts must be installed on any visitor's computer and, where in a style sheet the font-family is specified as a prioritised list of font family names (as it should be), ideally all fonts in the list should be checked.
Note A font list lists fonts the first of which will be used if available. It does not check that the character required is supported by the font. So even if support is provided by a font lower in the list that may not be accessed.
Check the rendering of a page on as many different browsers as possible. Mozilla browsers do authors only a partial service since, if a character is encountered which is not included in the font(s) listed they will make an attempt to find the character on other fonts installed on the machine. Authors should however check all pages using MSIE 6 which does not offer this capability. (MSIE7 will substitute for a few characters.) Visitors may finish up looking at square boxes instead of the character required.
Alan Wood offers several pages which are extremely useful in this respect. Using special characters from Windows Glyph List 4 (WGL4) in HTML [Ref 3] lists the characters in the WGL4 set, which are likely to be widely available. Unicode fonts for Windows computers [Ref 4] lists which fonts carry specific ranges of Unicode characters and, more interestingly, shows the distribution of the fonts so that authors may check likely availability to visitors. Those wishing to use a rarer character may check which fonts include it at Unicode character ranges and the Unicode fonts that support them [Ref 5].
Character U+21D1 is not included in any font from the list specified (Tahoma, Arial, Helvetica, sans-serif).
The arrow appears as a square when using MSIE ≤ 7
Same demonstration but set up spanning the arrow with
font-family: 'Lucida Sans Unicode' .
While preparing this page, for instance, Table A6.3-1 displayed correctly in Firefox and KompoZer but in MSIE the arrows originally appeared as squares. The issue is reproduced in the box on the right. The arrows use comparatively rare characters that do not appear in the Tahoma font used but, on the writer's machine at least, the Gecko engine was able to retrieve them, possibly from Lucida Sans Unicode.
The result is that visitors using MSIE see boxes instead of arrows but those using Firefox or Opera may see the arrows if Lucida Sans Unicode or some other font with the characters is installed on their machine.
A work-around for this issue is possible, as also shown in the box. The list specifying the font is modified so that the first in the list becomes 'Lucida Sans Unicode'. If this is available it will be used, otherwise the choice passes down the list. Alan Wood shows that this font is supplied with Windows XP and Windows 2000, which cover 90% of installations (mid 2007).
This is a moderately, but not very, robust solution. Had the availability of the arrows been critical to understanding the table it would have been necessary to change the design.
While the arrows may be considered rare and unusual characters even characters covered by some ISO-8859 options may not be reliable. In viewing Table A6.3-2, depending on the browser in use and fonts installed there are two characters, Drachma sign (Code A5) and Greek ypogegrammeni (Code AA), which may not display correctly. In cases like these checking the WGL4 list may provide a warning because neither of the characters is listed.
The way in which characters are coded in the source for the page may be altered in KompoZer using Tools > Preferences > Advanced. In the Special Characters area there are four options under 'Output the following characters as entities'
The options refer to characters typed onto the page which will be readable by a visitor to a site. Irrespective of the option set the visual appearance on screen will remain the same.
Only & < > and non-breakable whitespace
Note In normal practice the character referred to as 'non-breakable whitespace' is called 'no-break space' (entity &nbsp;).
The section on 'Preferences' recommends this as the preferred option.
This is the minimal setting. The characters listed must always be encoded whatever option is selected. With this selection the encoding will be as entities. Since the character < occurs in HTML code to mark the start of an element, if it is included in the page text the browser would expect it to start a new element and the page would become corrupted. It must always be encoded. The > character marks the end of an element and should be safe to use but W3C recommend that it also be encoded since it may confuse older browsers. See section 5.3.2 of the HTML Specification [Ref 16]. If you wish to override this check the box 'Don't encode > outside of attribute values'.
Since entities and numeric character references start with an ampersand (&) a similar problem occurs with this character.
Outputting the no-break space as an entity is convenient since it would otherwise look like a normal space in a listing.
With this option the output is an entity where one is specified; otherwise, for characters that the encoding in use can represent, it is the code for the character; failing that, the character is output as a decimal numeric character reference.
Before publishing a page always select this option since it will result in the smallest file size.
The option 'The above and Latin-1 letters' should strictly read 'The above and Latin-1 characters except ASCII characters'. It refers to characters in the Latin-1 set with codes in the range A0 to FF.
The output code for ASCII characters is the character code, for the remaining Latin-1 characters is the entity, else is a decimal numeric character reference.
The option 'HTML 4 special characters' refers to all characters for which the HTML 4 specification [Ref 16] (Section 24) provides an entity reference.
The output code for ASCII characters is the character code, for all characters for which an entity exists is the entity else it is a decimal numeric character reference.
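The rule for this option can be sketched in Python. The helper encode_html4 below is hypothetical and is not KompoZer's own code; it relies on the codepoint2name table in Python's standard html.entities module, which lists the HTML 4 entity names. ASCII characters pass through unchanged (apart from & < and >, which must always be encoded), characters with an entity are output as that entity, and anything else becomes a decimal numeric character reference.

```python
from html.entities import codepoint2name  # HTML 4 entity names by code point

def encode_html4(text):
    """Hypothetical helper: ASCII unchanged (except & < >), entity where one
    exists, decimal numeric character reference otherwise."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "&<>":
            out.append(f"&{codepoint2name[cp]};")   # always encoded
        elif cp < 128:
            out.append(ch)                          # plain ASCII
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # entity is defined
        else:
            out.append(f"&#{cp};")                  # no entity: decimal reference
    return "".join(out)

# e-acute and the euro have entities; o with double acute (U+0151) does not.
print(encode_html4("Resum\u00e9 & \u20ac \u0151"))
```

The result contains only ASCII characters, which is the property that makes this kind of output useful when transferring source to applications that accept plain text only.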
The final option should output decimal numeric character reference for all non-ASCII characters but has a bug so that it uses a mixture of character codes (for ASCII characters) and entities or numeric references for others.
Note The output will always produce valid and operating code for the corresponding character so files run correctly. With Nvu 1.0 and KompoZer 0.7.10, for all options, there are a few detailed non-compliances with the above description.
The options provided may be useful during development if you work in source code and may make it easier for you to read. They may also be required if transferring source code to some other applications. If, for instance, your file was encoded as UTF-8 and you wanted to transfer the content to an application which could accept only text the fourth option would be useful.
This ensures that only ASCII characters appear in the file. Unfortunately this is one option that does not work in Nvu 1.0 and KompoZer 0.7.10.
Note. In source view you may enter characters using any of these systems irrespective of the option selected. After leaving and re-entering source view, the display will conform to the option selected and not to the form in which it was entered.
In Normal and Preview mode KompoZer will attempt to render all characters correctly, irrespective of the option selected, subject to the limitations described under 'Unicode support' above.
If you need an ASCII-only file an alternative way to obtain one is via the File > Save and change Character Encoding menu item. Check the 'Export to text' box and save the file as text. The original file remains intact. Again with UTF encoding this appears to be faulty but if the encoding is temporarily switched to ISO-8859-1 it works.
This is explained under the first option above.
The section on 'Preferences' suggests that this be left in the default, cleared, state but this is optional.
One final option in Tools > Preferences > Advanced provides the option 'Don't encode special characters in Attribute values'. That is not a very precise definition of what happens because the whole page must be encoded in the same way. What it changes is the way in which character references are used in attribute values.
When you check the box special characters appearing in the value of any attribute will be left untouched and not encoded as entities or numerical character references.
An example will serve to illustrate.
The table element has an attribute 'summary' and might appear as:
<table summary="Resumé of results">
The attribute is 'summary' and the value is 'Resumé of results'. If the special characters were encoded as entities the source would read
<table summary="Resum&eacute; of results">
Even worse, if they were encoded as numeric character references the source would read
<table summary="Resum&#233; of results">
Any readout on the screen, whether through a properties inspector or at the bottom of the browser window, will be normal but those doing a lot of work in source view might prefer not to encode in this situation.
You are recommended to check the box 'Don't encode special characters in attribute values' unless you have good reason to do otherwise.
Where the attribute value is a URL a different encoding method is covered in the following section.
|Characters permissible in a URL|
Special considerations apply to characters in a URL. URLs can occur as the value of an attribute to many elements, the most common being 'href'.
Any Latin 1 character may occur in a URL but only those shown against a green background in Table A6.5-1 may be used freely. This set includes alphabetics, hyphen and underscore. A number of other characters which may have specific meanings are reserved. This includes the majority of the remaining ASCII characters. Such characters may be used to separate one part of the structure from another, e.g. the colon separates the protocol from the domain. These characters, from current specifications, are shown against an orange background. Whenever such a character is used other than for the specific reserved purpose it must be encoded to avoid confusion. Use of the remaining characters depends on specifics of the URL or part of the URL involved.
When encoding is required in a URL a new method referred to as 'percentage encoding' is used. Put simply, percent encoded characters consist of a percentage sign followed by two characters representing the hexadecimal position of the character in the Latin 1 set. Thus %20 represents a space.
Authors often note that the names of saved files appear with spaces replaced by %20. As explained this is quite safe and indeed some operating systems prohibit unencoded spaces in file names. It is always preferable to avoid spaces when naming files. Use the underscore as an alternative.
It is actually possible to use percentage encoding for any character in the Latin 1 set.
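Percent encoding can be tried out with Python's standard urllib.parse module, as in the sketch below. Note that quote() encodes non-ASCII characters as UTF-8 by default; Latin 1 is requested explicitly here to match the description above. The file names are made up for the example.

```python
from urllib.parse import quote, unquote

# A space and the pipe character, percent encoded.
space = quote(" ")
pipe = quote("|")

# Latin 1 is requested explicitly; the default would be UTF-8.
e_acute = quote("\u00e9", encoding="latin-1")

# Letters, digits, hyphen, underscore and full stop pass through unchanged.
plain = quote("my_file-1.html")

# Decoding reverses the process.
decoded = unquote("my%20file.html")

print(space, pipe, e_acute, plain, decoded)
```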
When KompoZer percent encodes it errs on the cautious side, that is, it may encode when this is not strictly necessary. This occurs for two reasons.
Since it is permissible to percent encode any character this should not matter. Unfortunately it sometimes does. Again systems may not comply with the current specifications.
Eric Meyer has provided a URL Decoder Encoder in his toolbox [Ref 20] which allows you to see the results of encoding.
The check box 'Don't encode special characters in attribute values' controls whether or not KompoZer percent encodes special characters in URLs.
As an example KompoZer encodes the pipe character '|' because an earlier specification required this. But some systems will not decode this correctly and malfunction. On the Nvu forum it was reported that
was being encoded as
and that this was not being correctly recognised.
To solve this problem check the box 'Don't encode special characters in attribute values'.
KompoZer is not capable of selectively encoding characters in one URL and not in another. Nevertheless this limitation does not appear to be a problem to authors.
 Characters and Encodings covers almost every aspect of the subject.
 Excellent general resource offers several pages including Unicode test pages, setting up of browsers, Fonts with Unicode support, Operating systems. Also specific pages on
Wikipedia has good articles on
 ISO 8859-1 also other parts
 URL encoding
 Extension for the properties tab of Windows Explorer to check Unicode support.
 Windows 1252 character listing
Standards and Specifications
 RFC 3986 Uniform Resource Identifier Generic Syntax
 Unicode site
 Penn State University also has an excellent page on 'Getting Started: Unicode' which covers OS, Browsers, fonts etc
 W3C offer a somewhat more technical tutorial dealing mainly with encoding and covering material that KompoZer hides from the author
 Eric Meyer has a toolbox which includes a URL Decoder Encoder.
 AllChars has a freeware utility that allows you to enter any Windows 1252 character using a few keystrokes