W3C: XML 1.0 Second Edition Specification Errata

This document records all known errors in the Second Edition of the Extensible Markup Language (XML) 1.0 Specification; for updates see the latest version.

The errata are numbered, classified as Substantive, Editorial or Clarification and listed in reverse chronological order of their date of publication. Changes to the text of the spec are indicated thus: deleted text, new text, modified text.

Please email error reports to xml-editor@w3.org.

Known Errors

Errata as of 2002-09-18

E41 Substantive

Section 2.12

Modify the last sentence of the first paragraph so that it reads:

The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string is allowed.

Append the following to the paragraph immediately following the first example:

In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors.

Change the sample declaration of xml:lang to:


Change the last set of examples to read:

<!ATTLIST poem   xml:lang CDATA 'fr'>
<!ATTLIST gloss  xml:lang CDATA 'en'>
<!ATTLIST note   xml:lang CDATA 'en'>

When embedding an XML fragment within a document (such as wrapping a payload inside a SOAP envelope), it is necessary to be able to specify that language information specified higher up in the element tree doesn't apply in the fragment, i.e. to break the inheritance chain without specifying a new language.
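
The inheritance rule described above can be sketched in Python. This is only an illustration, assuming the standard library's xml.etree.ElementTree parser; the function name effective_languages and the envelope/payload element names are hypothetical and do not come from the specification.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(root):
    """Map each element's tag to its effective language, or None when unknown."""
    result = {}

    def walk(element, inherited):
        value = element.get(XML_LANG)
        if value is None:
            lang = inherited      # nothing specified here: inherit
        elif value == "":
            lang = None           # empty string: break the inheritance chain
        else:
            lang = value          # explicit language identifier
        result[element.tag] = lang
        for child in element:
            walk(child, lang)

    walk(root, None)
    return result


# Hypothetical document: the payload opts out of the envelope's language.
doc = ET.fromstring(
    '<envelope xml:lang="en">'
    '<header/>'
    '<payload xml:lang=""><item/></payload>'
    '</envelope>'
)
langs = effective_languages(doc)
```

Here header inherits "en" from envelope, while payload and its child item end up with no language information, as if xml:lang had never been specified on any ancestor.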

Errata as of 2002-08-21

E40 Editorial

Section 5.2
In each of the two list items of the bulleted list, change the instance of "may not" to "may fail to".
Despite the strictures of RFC 2119, the phrase "may not" is dangerous, for it can be read as "must not".

Errata as of 2002-07-10

E39 Clarification

Section 5.2

Amend the last sentence of the last paragraph to read:

Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities) that are or may be specified in external entities should use validating XML processors.
It was not clear whether the relative clause "which are declared in external entities" applied to both the attributes and the entities, or just to the entities.

Errata as of 2002-06-19

E38 Substantive

Section 2.8

Remove the whole paragraph after the second example. This paragraph reads:

The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" if it does not conform to this version of this specification. It is the intent of the XML working group to give later versions of this specification numbers other than "1.0", but this intent does not indicate a commitment to produce any future versions of XML, nor if any are produced, to use any particular numbering scheme. Since future versions are not ruled out, this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary. Processors may signal an error if they receive documents labeled with versions they do not support.

Change production [26] VersionNum to read:

   [26]    VersionNum    ::=    '1.0'
With the advent of XML 1.1, this clarifies that 1.0 documents may not refer to entities of versions other than 1.0.

Errata as of 2002-03-20

E37 Clarification

Section 6

Change the definition for "#xN" to read:

where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.
It was found that the phrase "canonical (UCS-4) code value" was misleading. The phrase "the number of leading zeros in the corresponding code value is governed by the character encoding in use" doesn't really make sense.
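
A small Python sketch of the "#xN" notation, assuming only that N is read as a hexadecimal code point; the helper name char_from_hash_x is hypothetical.

```python
def char_from_hash_x(notation):
    """Return the character denoted by '#xN', N being a hexadecimal code point."""
    if not notation.startswith("#x"):
        raise ValueError("expected the form #xN")
    # int() ignores leading zeros, mirroring their insignificance in #xN.
    return chr(int(notation[2:], 16))
```

With this reading, "#x41" and "#x0041" denote the same character.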

E36 Substantive

Section 2.9

Change the third item of the bullet list of conditions for the "Standalone Document Declaration" VC to:

  • attributes with tokenized types, where the attribute appears in the document with a value such that normalization will produce a different value from that which would be produced in the absence of the declaration, or
The original condition required standalone="no" whenever normalization affected some white space (e.g. a TAB turned into a SPACE) or expanded some entities, even if external declarations had no effect.

E35 Editorial

Section 2.8

Change the first sentence of the 4th paragraph to read:

The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute name-value pairs with its logical structures.
There was a concern that "attribute-value pairs" could be interpreted such that attribute values come in pairs, that values are paired with attributes, or that an attribute is just an attribute name rather than a combination of name and value. None of these interpretations are accurate.

E34 Clarification

Section 3.2.1

Change the next to last sentence of the paragraph immediately preceding the "Proper Group/PE Nesting" VC to read:

For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model.
The original text could be interpreted to mean that the check for a non-deterministic content model for an element had to be performed only if that element actually occurred in the instance being processed, with a child matched ambiguously. The modified text clarifies that having a non-deterministic content model is a property of a DTD, not of a particular instance document using that DTD.

E33 Editorial

Section 2.8

Restore linebreaks in the first and next-to-last examples that were lost between the 1st and 2nd edition:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>

Errata as of 2002-03-06

E32 Editorial

Section 4.3.3
In the last paragraph, change "octet sequences" to "byte sequences".
For consistency, the term "byte" is used everywhere.

E31 Editorial

Appendix H
Change the title to "W3C XML Core Working Group (Non-Normative)".

E30 Editorial

Appendix A.2
Change the URI for the WEBSGML entry to http://www.sgmlsource.com/8879/n0029.htm
The original one was stale.

Errata as of 2002-02-20

E29 Substantive

Section 2.12

Remove the last 5 words from the last sentence of the first paragraph, so that it reads:

The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor.

Remove the entire Note following the first paragraph (already amended by E11):

[IETF RFC 3066] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES].
RFC 3066 is not on the IETF Standards Track; it is a BCP (Best Current Practice). The deleted note was incomplete, potentially misleading and otiose; it was misinterpreted by some to forbid 3-letter codes.

Errata as of 2001-10-31

E28 Substantive

Section 2.2
Paragraph 2: for "ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000])" read simply "ISO/IEC 10646:2000 [ISO/IEC 10646]".
Section 4.3.3
Paragraph 2: remove the words "Annex F of [ISO/IEC 10646],".
Appendix A.1
Remove the entire reference to ISO 10646, leaving only the anchor. Change "10646-2000" in the next entry to "10646:2000".
It has become pointless to refer to ISO/IEC 10646:1993 as amended, which is now obsolete and unavailable.

E27 Substantive

Section 2.2
Second paragraph: for "...the UTF-8 and UTF-16 encodings of 10646" read "...the UTF-8 and UTF-16 encodings of Unicode 3.1", with a link to the Unicode3 entry.
Section 4.2.2
Numbered paragraph 1: change the reference "[IETF RFC 2279]" to "[Unicode3]".
Section 4.3.3

Last paragraph: add a new 3rd sentence:

"Specifically, it is a fatal error if an entity encoded in UTF-8 contains any irregular code unit sequences, as defined in Unicode 3.1."

with a reference to Unicode 3.1.

Appendix A.1

Change the [Unicode3] entry (leaving the anchor name unchanged) to read:

The Unicode Consortium. The Unicode Standard, Version 3.1, defined by: The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/).
Appendix A.2
Remove RFC 2279 as a non-normative reference, since it is now superseded. Also, for "IETF RFC2141" read "IETF RFC 2141".
There was no normative reference for UTF-8, unless the phrase "UTF-8 and UTF-16 encodings of 10646" in 2.2 is to be interpreted so, and if it is, it refers to an obsolete edition. The new sentence in 4.3.3 makes interpretation of UTF-8 well-defined in a case where Unicode allows a looser interpretation (that potentially creates security concerns).
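
The fatal-error rule added to 4.3.3 can be approximated with a strict decoder. The following Python sketch assumes that the built-in strict UTF-8 codec is an acceptable stand-in: it rejects non-shortest forms and encoded surrogates, which may not coincide exactly with the Unicode 3.1 definition of irregular sequences. The function name decode_utf8_entity is hypothetical.

```python
def decode_utf8_entity(data):
    """Decode entity bytes as UTF-8, treating any invalid sequence as fatal."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # Report the failure instead of silently repairing the input.
        raise ValueError(
            f"fatal error: ill-formed UTF-8 at byte {exc.start}"
        ) from exc
```

A surrogate pair encoded as two three-byte sequences (for example the bytes ED A0 80 ED B0 80) is rejected rather than decoded to a supplementary character.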

Errata as of 2001-10-17

E26 Clarification

Obsoletes E4
Section 4.2.2

Rewrite the paragraph beginning "[Definition: The SystemLiteral is called the entity's system identifier.", the following paragraph and the following numbered list, so that they read:

[ Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.
System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it must be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it should trigger escaping. When escaping does occur, it must be performed as follows:
  1. Each disallowed character to be escaped is represented in UTF-8 [IETF RFC 2279] as one or more bytes.

  2. The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).

  3. The original character is replaced by the resulting character sequence.

It was still unclear exactly when escaping was to be done and by whom.
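
The three escaping steps above can be sketched in Python as follows. This is a minimal illustration, assuming the caller has already decided that escaping is necessary at this point in the processing chain; the function name escape_system_identifier is hypothetical.

```python
# Characters the erratum lists besides the controls and the non-ASCII range.
_DISALLOWED_ASCII = set(' <>"{}|\\^`')


def escape_system_identifier(identifier):
    """Percent-escape only the characters the erratum names, via UTF-8 bytes."""
    out = []
    for ch in identifier:
        code = ord(ch)
        if code <= 0x1F or code >= 0x7F or ch in _DISALLOWED_ASCII:
            # Steps 1-3: encode as UTF-8, write each byte as %HH, substitute.
            out.append("".join(f"%{byte:02X}" for byte in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

Note that an existing "%" is left alone, so an identifier that is already escaped is not escaped a second time by this sketch.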

Errata as of 2001-10-03

E25 Clarification

Section 4.2.2

Amend the second sentence of the next-to-last paragraph to read:

An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference.
It was felt that a too literal reading of the original text would prohibit using the system identifier or other information in attempting to generate an alternate URI reference, which was never the intention.

Errata as of 2001-09-23

E24 Clarification

Section 2.4

Change the last sentence of the third paragraph to read:

The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
The original sentence was somewhat ambiguous; the change clarifies which interpretation is correct.
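
A serializer can satisfy this rule by escaping only the ">" that would complete "]]>". The following Python sketch assumes the text is destined for element content (not an attribute value or a CDATA section); the function name escape_character_data is hypothetical.

```python
def escape_character_data(text):
    """Escape text for use as element content."""
    # '&' must be handled first so later replacements are not re-escaped.
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    # Only the '>' that would complete "]]>" has to be escaped.
    return text.replace("]]>", "]]&gt;")
```

A lone ">" is left as-is, since the specification only says it may be represented as "&gt;" outside the "]]>" string.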

E23 Substantive

Section 4.3.3

Amend the last sentence of the next-to-last paragraph to read:

Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
It was always the intent of the XML 1.0 spec to allow the character encoding to be determined externally. The sentence corrected here was introduced in the second edition.

Errata as of 2001-07-25

E22 Substantive

Section 4.3.3

Amend the second paragraph to read:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
The BOM in UTF-8 is already mentioned in Appendix F. It's happening anyway: Windows 2000's Notepad puts a BOM when one saves as UTF-8, and it's not an option. Since it makes some sense for a general-purpose text editor to do that, it's likely to spread to other editors.
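
Using the Byte Order Mark as an encoding signature can be sketched in Python as below. This is an illustration only, assuming the entity is available as a byte string; the function name sniff_bom is hypothetical, and the sketch deliberately ignores other encodings (such as UTF-32) and the encoding declaration.

```python
import codecs


def sniff_bom(data):
    """Return (encoding, payload without the signature); encoding is None if no BOM."""
    signatures = (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    )
    for bom, encoding in signatures:
        if data.startswith(bom):
            # The signature is not part of the markup or character data.
            return encoding, data[len(bom):]
    return None, data
```

Stripping the signature before parsing reflects the statement that the BOM is not part of either the markup or the character data of the document.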

Errata as of 2001-06-13

E21 Substantive

Section 2.8

Add a new production [28b] and modify production [28] to refer to it:

[28]    doctypedecl    ::=    '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' [VC: Root Element Type]
[WFC: External Subset]
[28a]    DeclSep    ::=    PEReference | S [WFC: PE Between Declarations]
[28b]    intSubset    ::=    (markupdecl | DeclSep)*
[29]    markupdecl    ::=    elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment [VC: Proper Declaration/PE Nesting]
[WFC: PEs in Internal Subset]
Clarifies what "internal subset" means, in particular that it doesn't include the enclosing square brackets "[...]".

Errata as of 2001-05-24

E20 Substantive

Obsoletes erratum E108 to first edition
Section 2.3

Change productions [6] Names and [8] Nmtokens to use #x20 (a single space character) instead of S:

[6] Names ::= Name (#x20 Name)*
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*

Add a note after production 8:

Note: The Names and Nmtokens productions are used to define the validity of tokenized attribute values after normalization (see 3.3.1 Attribute Types).

This restores first edition erratum E62, which was rescinded by E108. It seems likely that when E108 was adopted the productions were incorrectly thought to apply to unnormalized attribute values, which would have prevented the use of non-#x20 whitespace (tabs and newlines) as separators in tokenized attribute values. In fact, it only prohibits the use of character references to these characters.

This change restores SGML compatibility (cf. the "name list" and "name token list" productions in SGML).
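
The effect of the revised Nmtokens production can be sketched in Python. Two assumptions are made loudly here: the character class is an ASCII-only stand-in for NameChar (the real production covers far more of Unicode), and literal white space is assumed to have already been turned into #x20 by the earlier normalization step. The names normalize_tokenized and matches_nmtokens are hypothetical.

```python
import re

# ASCII-only stand-in for NameChar; the real production is much larger.
_NMTOKEN = r"[A-Za-z0-9._:\-]+"
_NMTOKENS = re.compile(rf"{_NMTOKEN}(?: {_NMTOKEN})*")


def normalize_tokenized(value):
    """Trim leading/trailing #x20 and collapse runs of #x20 to a single one."""
    return re.sub(r" +", " ", value.strip(" "))


def matches_nmtokens(normalized):
    """Check a normalized value against Nmtoken (#x20 Nmtoken)*."""
    return _NMTOKENS.fullmatch(normalized) is not None
```

A tab that survives normalization (which can only come from a character reference) makes the value fail the check, matching the rationale above.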

E19 Clarification

Section 4.5

Modify the third sentence of the second paragraph, so that it reads:

The actual replacement text that is included (or included in literal) as described above must contain the replacement text of any parameter entities referred to, and must contain the character referred to, in place of any character references in the literal entity value; however, general-entity references must be left as-is, unexpanded.

Errata as of 2001-04-24

E18 Clarification

Section 4.2.2

To the sentence:

Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs.

(inside the paragraph following the Notation declared VC), append the following:

This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration.

This clarifies exactly where a declaration occurs, for purposes of determining the base for relative URIs. Given the example:


example.xml:
 <!DOCTYPE foo [
 <!ENTITY % pe SYSTEM "subdir1/pe">

subdir1/pe:
 <!ENTITY % extpe SYSTEM "../subdir2/extpe">
 <!ENTITY % intpe "%extpe;">

subdir2/extpe:
 <!ENTITY ent SYSTEM 'entfile'>

Though the characters making up the declaration of ent appear in subdir2/extpe, they are not parsed as a declaration there. They are just treated as characters making up the replacement text of intpe. They are not parsed as a declaration until intpe is parsed, at which point the containing external entity is the document entity, so the relevant base URI is that of example.xml.

The fact that it is the containing external entity that is used may be summed up by saying that internal entities do not carry any base URI with them; indeed, they consist only of their replacement text.

If example.xml contained %extpe; instead of %intpe; the situation would be different: the contents of subdir2/extpe would be parsed as a declaration, and the relevant base URI would be that of subdir2
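
The two outcomes discussed above can be reproduced with ordinary relative-URI resolution. A minimal Python sketch, assuming a hypothetical location http://example.org/base/example.xml for the document entity and using urllib.parse.urljoin as the resolver:

```python
from urllib.parse import urljoin

# Hypothetical location of the document entity.
DOCUMENT = "http://example.org/base/example.xml"

pe = urljoin(DOCUMENT, "subdir1/pe")       # declared in the document entity
extpe = urljoin(pe, "../subdir2/extpe")    # declared in subdir1/pe

# Reached through %intpe;: the declaration of ent is parsed in the
# document entity, so the document's URI is the base.
ent_via_intpe = urljoin(DOCUMENT, "entfile")

# Reached through %extpe; directly: the declaration is parsed inside
# subdir2/extpe, so that entity's URI is the base.
ent_via_extpe = urljoin(extpe, "entfile")
```

Under these assumptions ent_via_intpe resolves next to example.xml, while ent_via_extpe resolves inside subdir2.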

Errata as of 2001-04-11

E17 Editorial

Section 6

From the definition for "A | B", delete "but not both":

A | B
matches A or B.
"but not both" was found misleading by some and was in fact useless.

E16 Substantive

Appendix A

Move the entries for [IETF RFC 2396] and [IETF RFC 2732] from A.2 (informative) to A.1 (normative).

In 4.2.2, immediately after the Notation Declared VC, there is a definition of system identifier which clearly depends normatively on those RFCs.

Errata as of 2001-03-27

E15 Clarification

Section 3

Rewrite the Element valid VC as follows:

Validity constraint: Element Valid

An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:

  1. The declaration matches EMPTY and the element has no content (not even entity references, comments, PIs or white space).

  2. The declaration matches children and the sequence of child elements (after replacing any entity references with their replacement text) belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S), comments and PIs (i.e. markup matching production [27] Misc) between the start-tag and the first child element, between child elements, or between the last child element and the end-tag. Note that a CDATA section containing only white space or a reference to an entity whose replacement text is character references expanding to white space do not match the nonterminal S, and hence cannot appear in these positions; however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references.

  3. The declaration matches Mixed and the content (after replacing any entity references with their replacement text) consists of character data, comments, PIs and child elements whose types match names in the content model.

  4. The declaration matches ANY, and the types of any child elements (after replacing any entity references with their replacement text) have been declared.

Section 3.1

In the paragraph just after production [43] content, amend the definition of empty element so that the word "content" within the definition is a link to production [43].

Errata as of 2001-03-07

E14 Clarification

Section 4.3.2

Amend the last paragraph so that it reads:

A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.

"General" is added because:

  • since all parameter entities are (now) well-formed by definition, there can't be any interesting consequences of their well-formedness;
  • the list of properly-nested structures notably does not include declarations.

This clarifies that the following from the OASIS test suite:

<!DOCTYPE doc SYSTEM "001.ent">

with 001.ent:
<!ENTITY % e "<!--">
%e; -->

is well-formed but violates a validity constraint.

Errata as of 2001-03-05

E13 Editorial

Section 2.10

In the first paragraph after the example, replace "overriden" with "overridden" (two d's) in the sentence "This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute."

Errata as of 2001-02-22

E12 Substantive

Appendix F.2

Change the [IETF RFC 2376] reference to [IETF RFC 3023] (keeping the same #RFC2376 fragment identifier in order not to break existing links).

Appendix A.2

Change the IETF RFC 2376 entry to:

IETF (Internet Engineering Task Force). RFC 3023: XML Media Types. eds. M. Murata, S. St.Laurent, D. Kohn. 2001. (See http://www.ietf.org/rfc/rfc3023.txt.)
RFC 3023 updates and obsoletes RFC 2376.

E11 Substantive

Section 1.1

Amend the next to last paragraph so that it reads:

This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

[The only change is that "RFC 1766" becomes "RFC 3066".]


Change all [IETF RFC 1766] references to [IETF RFC 3066] (keeping the same #RFC1766 fragment identifier in order not to break existing links).

Section 2.12

Remove the last sentence of the Note: "It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639]."

Appendix A.1

Change the IETF RFC 1766 entry to:

IETF (Internet Engineering Task Force). RFC 3066: Tags for the Identification of Languages, ed. H. Alvestrand. 2001. (See http://www.ietf.org/rfc/rfc3066.txt.)
RFC 3066 updates and obsoletes RFC 1766.

E10 Substantive

Section 3.3.3

Just after the paragraph beginning "All attributes for which no declaration has been read..." (just before the examples), append the following paragraph:

It is an error if an attribute refers to an entity when there is a declaration for that entity which the processor has not read. This can happen only when a non-validating processor is being used.

Errata as of 2001-01-25

E9 Clarification

Section 3.3.2

Change the title and the text of Attribute Default Legal Validity Constraint to:

Validity Constraint: Attribute Default Value Syntactically Correct
The declared default value must meet the syntactic constraints of the declared attribute type.
Note that only the syntactic constraints of the type are required here; other constraints (e.g. that the value be the name of a declared unparsed entity, for an attribute of type ENTITY) may come into play if the declared default value is actually used (an element without a specification for this attribute occurs).
This clarification was prompted by the "sun/invalid/attr11.xml" test file in the OASIS test suite. The interpretation is that the default value of an attribute only needs to be syntactically correct unless it is actually used (i.e. an element occurs without a specification for that attribute), in which case the default value must also meet the constraints bearing on this use. This is believed to be required for SGML compatibility and to be what the XML 1.0 spec currently says.

E8 Clarification

Section 4.1

Change the first sentence of the second paragraph of the Entity Declared WFC (not the VC of the same name) to read:

Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset.
The note was inconsistent with the normative text, as it read "external parameter entities" whereas internal parameter entities are also not necessarily processed.

E7 Clarification

Section 4.5

Remove the word "internal" from the title of the section.

Change the first paragraph, in particular removing the word "internal", so that it reads:

In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding whitespace) if there is one but without any replacement of character references or parameter-entity references.]
The concept of an entity's replacement text is used throughout the spec, but was defined nowhere for external entities. Also, it was not clear whether the replacement text of an external entity is the content after replacement of character references and parameter-entity references, as for internal entities.

Errata as of 2000-12-06

E6 Editorial

Section 3.3.3

Modify the second example in the table at the end of the section to read as follows (add a &#x20; in the middle):

A #x20 B #x20 #x20 A #x20 #x20 #x20 B #x20 #x20
This illustrates how space characters (#x20) get normalized whether or not they come from a character reference.

Errata as of 2000-12-01

E5 Editorial

Section 4.2.2
In the numbered list explaining the escaping of disallowed characters in URI references, change "octets" to "bytes".
For consistency. We had "octets" and "bytes" meaning the same thing, but apparently suggesting that they were different. "bytes" won by majority rule.

Errata as of 2000-11-22

E4 Clarification

Obsoleted by E26
Section 4.2.2

Replace the last sentence of the paragraph beginning with "URI references require encoding and escaping of certain characters." with the following:

The XML processor must escape disallowed characters as follows:
The fact that the XML processor is responsible for escaping disallowed characters when resolving URI references was lost in the modifications of the 2nd edition.

E3 Clarification

Section 4.2.2

After the sentence reading "A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.", which follows the definition of SystemLiteral, add the following:

Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.

Errata as of 2000-11-16

E2 Substantive

Section 3.3.1

Add a validity constraint applying to productions [58] NotationType and [59] Enumeration as follows:

Validity constraint: No duplicate tokens
The notation names in a single NotationType attribute declaration, as well as the NmTokens in a single Enumeration attribute declaration, must all be distinct.
Necessary to maintain compatibility with SGML.
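
A validating processor could check this constraint with a simple pass over the declared tokens. A minimal Python sketch, assuming the tokens of one NotationType or Enumeration declaration have already been parsed into a list; the function name duplicate_tokens is hypothetical.

```python
def duplicate_tokens(tokens):
    """Return tokens that occur more than once, in order of first repetition."""
    seen, repeated = set(), []
    for token in tokens:
        if token in seen and token not in repeated:
            repeated.append(token)
        seen.add(token)
    return repeated
```

An empty result means the declaration satisfies the constraint; any returned token is a validity error to report.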

Errata as of 2000-11-02

E1 Editorial

Section 3.3.3
In the set of examples at the end of the section, change the last character of the 3rd column of the last example from "#xD" to "#xA". The change makes the third column identical to the second column (for that third example).
"#xD" was a typo.

Last updated $Date: 2003/10/10 17:22:17 $ by $Author: tournois $
