KompoZer 0.8 pre-release - Character treatment in available views
This edition is based on KompoZer 0.8b1
Editing HTML using an external editor
especially with reference to character issues
Up to version 0.7 KompoZer took full control of the HTML code output to a server or browser. From version 0.8 it offers the option to delegate some responsibility to an external editor. To make this process a success, the contract conditions applying to the two parties must be clearly defined.
Web pages can invoke characters from a very large set called the 'Document Character Set' (DCS). This set is identical to the 'Universal Character Set' (UCS) defined by the Unicode Consortium. This system – usually referred to as 'Unicode' – defines the code position (or code point) for each character in the set.
These positions may be accessed using a UTF (UCS Transformation Format) method of encoding. Because such encodings must address a very large range of code positions, they can need several bytes per character. UTF-8 seeks to minimise this by using only one byte for each character of the ASCII set, though other characters may require up to 4 bytes.
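As a rough illustration of the variable byte length of UTF-8, the following Python sketch counts the bytes needed for a few sample characters. The choice of sample characters is purely illustrative and is not taken from KompoZer.

```python
# Number of bytes UTF-8 needs for characters from different parts of Unicode.
# The sample characters are arbitrary illustrations.
samples = ["A", "©", "€", "\U0001F600"]  # ASCII, Latin-1, euro sign, an emoji

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```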
The ISO-8859 series is an alternative set of 15 different encodings which together cover the needs of the majority of written languages, each one optimised around a set of closely related languages. A feature is that every character requires only one byte to represent it, but a consequence is that each encoding can address only 191 characters. Of this number, 95 are common to all sets and 96 are specific to the language group. The common set includes the ASCII characters. The first standard in this series, ISO-8859-1 (Western European), uses codes which are identical to the Unicode code positions for the same characters.
From this it may be deduced that all ISO-8859 encodings share the same code positions as Unicode for the ASCII characters. For ISO-8859 encodings other than ISO-8859-1, however, the codes for other characters may differ from the Unicode positions; in other words, the same code represents different characters in each standard. For examples see notes[Note1] [Note2].
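The point can be checked with a small Python sketch, assuming Python's built-in `iso-8859-1` and `iso-8859-7` codecs. It decodes the same two byte values under both encodings and compares each code with the Unicode code position of the resulting character.

```python
# Two single-byte codes from the upper half of the ISO-8859 tables.
raw = bytes([0xA4, 0xD3])

as_greek = raw.decode("iso-8859-7")   # interpreted as ISO-8859-7 (Greek)
as_latin1 = raw.decode("iso-8859-1")  # interpreted as ISO-8859-1 (Western European)

# For ISO-8859-1 the byte value equals the Unicode code position;
# for ISO-8859-7 it does not (for these two characters).
latin1_matches = [ord(ch) == code for ch, code in zip(as_latin1, raw)]
greek_matches = [ord(ch) == code for ch, code in zip(as_greek, raw)]

print(as_greek, [f"U+{ord(c):04X}" for c in as_greek])
print(as_latin1, [f"U+{ord(c):04X}" for c in as_latin1])
```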
Impact on editors
A corollary of this is that any editor capable of working with Unicode will be capable of editing HTML pages coded as ISO-8859-1 without error and, of course, the same editors will be capable of working with pages coded as UTF-8. A number of popular editors have this capability but there are very few capable of reading other HTML encodings.
Extending the range of characters
Being able to access only 191 characters or so, when the Document Character Set includes many more, is often inadequate, so HTML provides two mechanisms to extend the range. These mechanisms use a combination of ASCII characters to identify any character in the DCS.
- Character entity reference. HTML defines these for all Latin-1 characters[note1] along with a large number of mathematical and other markup characters. Entities, as they are known for short, are convenient because they are fairly easy to remember. Examples are &copy; (©), &gt; (>), &ldquo; (“).
- Numeric character reference. This may be expressed in either decimal or hexadecimal form. It refers to the Unicode code position and has the advantage that it may be used for any Unicode character and the disadvantage that memorizing the code is much more difficult. Examples, each in decimal and hexadecimal form, are &#169; or &#xA9; (©), &#62; or &#x3E; (>), &#8220; or &#x201C; (“).
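The relationship between a character, its Unicode code position and the reference forms described above can be sketched in Python. The helper names `decimal_ref` and `hex_ref` are hypothetical, introduced only for this illustration; `html.unescape` is the standard-library function that resolves references back to characters.

```python
import html

def decimal_ref(ch: str) -> str:
    """Build a decimal numeric character reference from the code position."""
    return f"&#{ord(ch)};"

def hex_ref(ch: str) -> str:
    """Build a hexadecimal numeric character reference from the code position."""
    return f"&#x{ord(ch):X};"

for ch in "©>“":
    print(ch, decimal_ref(ch), hex_ref(ch))

# An entity and both numeric forms all resolve to the same character.
resolved = html.unescape("&copy; &#169; &#xA9;")
print(resolved)
```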
While these mechanisms may be used with confidence, having to resort to them has two main disadvantages:
- It is more difficult to read, and therefore maintain, the code
- It can increase the file size, and therefore load on the network, quite considerably.
KompoZer is capable of working with all these methods (though when it generates numeric character references they are always in decimal form). Which method is used to represent a character depends on the character encoding employed, the view being used and the setting of Tools > Options > Advanced > Special characters.
We need to look a little more closely at the interpretation of the various options for this setting. In the following, the phrase "if possible, be represented by the character" means: if the encoding method allows the character to represent itself. With UTF encoding this is true for all characters, but with ISO-8859 encoding it is true only for a subset of 191 characters.
- Only & < > and nobreak space (Option 1)
- Because HTML uses the ampersand and less than sign in very special ways (to introduce an entity or numeric character reference, and to start a tag) they can never appear as themselves in the page code. KompoZer always encodes them as entities (&amp; and &lt;) though &#38; and &#60; would work equally well. The nobreak space is similarly coded, just to make it visible, and the greater than sign for historic reasons.
These are all coded in the same way whichever option is selected.
In this option any character not covered by the above will, if possible, be represented simply by the character and otherwise by a decimal numeric character reference.
- The above and Latin1 letters (Option 2)
- This is somewhat badly titled. All Latin 1 characters other than ASCII characters, i.e. those in the second half of the table[Note1], are coded as entities.
Any character not covered by this or the proviso in option 1 will, if possible, be represented simply by the character and otherwise by a decimal numeric character reference.
- HTML4 special characters (Option 3)
- In this case all characters for which HTML defines an entity reference will be coded as entities. This includes those coded in this way in Options 1 and 2 plus any other characters for which an entity is defined.
Any character not covered by this or the proviso in option 1 will, if possible, be represented simply by the character and otherwise by a decimal numeric character reference.
- Use &#..; notation for all non-ASCII characters (Option 4)
- Any character not covered by the proviso in option 1 will,
if possible, be represented simply by the character and otherwise by a
decimal numeric character reference.
Unfortunately, though it is not important, KompoZer has inherited a bug from Nvu, as a result of which non-ASCII Latin 1 characters are encoded as entities.
This detail is included because, when editing the output code, the way a character may be correctly represented and employed differs considerably depending on the selections used.
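The rule described for Option 1 can be approximated with a short Python sketch. This is not KompoZer's own code: the function name `option1_output` is hypothetical, and Python's `xmlcharrefreplace` error handler is used here only because it happens to mimic the "character if possible, otherwise decimal numeric character reference" behaviour.

```python
import html

def option1_output(text: str, charset: str) -> bytes:
    """Approximate the Option 1 rule for a page saved in the given charset.

    Hypothetical helper for illustration only.
    """
    # & < > always become entities, whatever the encoding.
    escaped = html.escape(text, quote=False)
    # The no-break space is also written as an entity, to make it visible.
    escaped = escaped.replace("\u00a0", "&nbsp;")
    # Characters the charset cannot represent become decimal references.
    return escaped.encode(charset, errors="xmlcharrefreplace")

print(option1_output("a < Σ", "iso-8859-1"))  # Σ is outside ISO-8859-1
print(option1_output("a < Σ", "iso-8859-7"))  # Σ fits in one byte
print(option1_output("a < Σ", "utf-8"))       # every character can represent itself
```

The same text therefore produces different output code depending on the page's encoding, which is why the encoding has to be kept in mind when editing.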
Appearance in editors
But there is a further complication. KompoZer 0.8 offers four different ways of viewing the code, and in each of them characters may be represented differently. The descriptions above refer to the output code as delivered to a browser and as available to the external HTML editor. Here is a summary of the behaviour.
- In design view
- Characters are always represented by the glyph involved.
- In split view
- Characters are always represented by the glyph.
- In Source view
- Characters are usually represented by the glyph but when Option 3 is selected the character is represented as in the external editor.
- In the external editor
- Characters are represented by the glyph if possible, that is if they are within the range of the encoding (charset) of the page; otherwise they are represented in the way defined by the Option.
Note that the appearance will depend both on the character delivered to the editor and on the editor's ability to recognise it and respond with the glyph. This requires both an editor capable of working with Unicode and a font with the appropriate glyph available.
- Output code
- The code provided for upload is identical to that delivered to the external editor.
Characters may be edited in the usual ways e.g.
- By direct keypress from the keyboard
- By copying and pasting
- By entering an entity or numeric character reference
In design view the first two methods are straightforward and the third may be used via the Insert > HTML menu item. In split or source view all methods are available. In any of these cases it does not matter which approach is taken. On returning to Design view, however, the code will be formatted according to the Option selected, so on re-opening split or source view you may not see exactly what you input, though the ultimate effect will be correct.
External editor traps
The methods just described may also be used in the external editor. If your page is encoded as ISO-8859-1 and if your external editor works with Unicode things are straightforward but otherwise there are some traps for the unwary.
Let me suppose that your page is encoded as ISO-8859-7 and that you have used the characters € and Σ. Referring to Note 2 you will see that these are located at positions A4 and D3. When these are read by the external Unicode editor it will look up the characters at these locations and find ¤ and Ó, as you will see from Note 1 (remember that ISO-8859-1 corresponds to the first 256 Unicode characters). Suppose now that you have forgotten all you read above and decide to edit the file, inserting € and Σ. What happens?
First the page is encoded as ISO-8859-7 which allocates one byte for each character. You have inserted a Unicode character, with which the editor is completely happy but, which requires probably 2 bytes to code. When the files is saved and read by a browser two bytes will be found where only one is expected. The browser is likely to respond by not recognising the character and issue a 'Unicode replacement character'[note3]. The way this is displayed depends on the browser in use and the fonts installed on the computer but is likely to be as or .
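The byte-count mismatch can be reproduced in a small Python sketch. It assumes, for illustration, that the external editor saves its text as UTF-8 while the page is still declared as ISO-8859-7. What a browser actually displays may differ from the plain decoding shown here.

```python
text = "€Σ"

# What the page's declared charset expects: one byte per character.
good = text.encode("iso-8859-7")

# What a UTF-8 editor would write instead (an assumption about the editor).
bad = text.encode("utf-8")

# Reading the UTF-8 bytes back as if they were ISO-8859-7 gives one
# character per byte, so the two original characters are lost.
misread = bad.decode("iso-8859-7", errors="replace")

print(len(good), good.hex())
print(len(bad), bad.hex())
print(len(misread), [f"U+{ord(c):04X}" for c in misread])
```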
In summary, editing in an external editor is safe if the page is coded as ISO-8859-1. In other cases, if you wish to edit any non-ASCII character, you should take care that you do not introduce errors, and check all changes in a browser before finalising.
The characters sp (space), nbsp (no-break space) and shy (soft hyphen) are printable but (normally) invisible.
Note 1 Latin-1 characters are those covered by the ISO-8859-1 set, which includes 191 printable characters and 65 control characters. The printable characters include the ASCII set plus an extended set designed to cover the needs of Western European languages.
The control characters are not used in HTML but text editors respond to a few such as carriage return and line feed.
The set is effectively divided into two halves each with 128 positions. In each half the first 32 positions are reserved for control characters.
Note 2 ISO-8859-7 has the same ASCII printing characters in the first half but replaces many of the characters in the second half with Greek.
The control characters remain unchanged.
Note 3 The Unicode replacement character is used to replace an incoming character whose value is unknown or unrepresentable in Unicode. The character is located at code point FFFD.