Document retrieved from a search engine cache. Original document address: http://crydee.sai.msu.ru/~vab/html.doc/html40/appendix/notes.html
Date modified: Fri Apr 24 19:03:46 1998
Date indexed: Tue Oct 2 16:58:27 2012

Performance, Implementation, and Design Notes

Appendix B: Performance, Implementation, and Design Notes

Contents

  1. Notes on invalid documents
  2. Special characters in URI attribute values
    1. Non-ASCII characters in URI attribute values
    2. Ampersands in URI attribute values
  3. SGML implementation notes
    1. Line breaks
    2. Specifying non-HTML data
    3. SGML features with limited support
    4. Boolean attributes
    5. Marked Sections
    6. Processing Instructions
    7. Shorthand markup
  4. Notes on helping search engines index your Web site
    1. Search robots
  5. Notes on tables
    1. Design rationale
    2. Recommended Layout Algorithms
  6. Notes on forms
    1. Incremental display
    2. Future projects
  7. Notes on scripting
    1. Reserved syntax for future script macros
  8. Notes on frames
  9. Notes on accessibility
  10. Notes on security
    1. Security issues for forms

The following notes are informative, not normative. Despite the appearance of words such as "must" and "should", all requirements in this section appear elsewhere in the specification.

B.1 Notes on invalid documents

This specification does not define how conforming user agents handle general error conditions, including how user agents behave when they encounter elements, attributes, attribute values, or entities not specified in this document.

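However, to facilitate experimentation and interoperability between implementations of various versions of HTML, we recommend the following behavior:

  * If a user agent encounters an element it does not recognize, it should try to render the element's content.
  * If a user agent encounters an attribute it does not recognize, it should ignore the entire attribute specification (i.e., the attribute and its value).
  * If a user agent encounters an attribute value it doesn't recognize, it should use the default attribute value.
  * If it encounters an undeclared entity, the entity should be treated as character data.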

We also recommend that user agents provide support for notifying the user of such errors.

Since user agents may vary in how they handle error conditions, authors and users must not rely on specific error recovery behavior.

The HTML 2.0 specification ([RFC1866]) observes that many HTML 2.0 user agents assume that a document that does not begin with a document type declaration refers to the HTML 2.0 specification. As experience shows that this is a poor assumption, the current specification does not recommend this behavior.

For reasons of interoperability, authors must not "extend" HTML through the available SGML mechanisms (e.g., extending the DTD, adding a new set of entity definitions, etc.).

B.2 Special characters in URI attribute values

B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1), authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2044]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of the character encoding to which the HTML document carrying the URI may have been transcoded.
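The two steps above can be sketched in Python. This is a minimal illustration, not part of the specification: the function name escape_non_ascii is hypothetical, and the sketch escapes only non-ASCII characters, leaving any other characters that are illegal in URIs untouched.

```python
def escape_non_ascii(uri: str) -> str:
    """Escape every non-ASCII character as the %HH form of its UTF-8 bytes."""
    pieces = []
    for ch in uri:
        if ord(ch) < 128:
            pieces.append(ch)  # ASCII characters pass through unchanged
        else:
            # Step 1: represent the character in UTF-8.
            # Step 2: escape each resulting byte as %HH.
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(pieces)


escaped = escape_non_ascii("http://foo.org/Håkon")
print(escaped)  # http://foo.org/H%C3%A5kon
```

Because the bytes come from UTF-8 rather than from the document's own encoding, the result is the same however the document has been transcoded.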

Note. Some older user agents trivially process URIs in HTML using the bytes of the character encoding in which the document was received. Some older HTML documents rely on this practice and break when transcoded. User agents that want to handle these older documents should, on receiving a URI containing characters outside the legal set, first use the conversion based on UTF-8. Only if the resulting URI does not resolve should they try constructing a URI based on the bytes of the character encoding in which the document was received.
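The fallback order described in this note might look as follows in Python. This is a sketch under stated assumptions: the helper names escape_with and candidate_uris are hypothetical, and the actual attempt to resolve each candidate (a network request) is left to the caller.

```python
def escape_with(uri: str, encoding: str) -> str:
    """Escape non-ASCII characters as %HH using the bytes of `encoding`."""
    escaped = []
    for ch in uri:
        if ord(ch) < 128:
            escaped.append(ch)
        else:
            escaped.extend(f"%{byte:02X}" for byte in ch.encode(encoding))
    return "".join(escaped)


def candidate_uris(uri: str, document_encoding: str) -> list:
    """Return the URIs to try, in order: UTF-8 based first, legacy second."""
    candidates = [escape_with(uri, "utf-8")]
    try:
        legacy = escape_with(uri, document_encoding)
    except UnicodeEncodeError:
        # The character cannot be expressed in the document encoding at all.
        return candidates
    if legacy not in candidates:
        candidates.append(legacy)
    return candidates


candidates = candidate_uris("http://foo.org/Håkon", "iso-8859-1")
print(candidates)
# ['http://foo.org/H%C3%A5kon', 'http://foo.org/H%E5kon']
```

A user agent following the note would try the second entry only if the first does not resolve.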

Note. The same conversion based on UTF-8 should be applied to values of the name attribute for the A element.

B.2.2 Ampersands in URI attribute values

The URI that is constructed when a form is submitted may be used as an anchor-style link (e.g., the href attribute for the A element). Unfortunately, the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references. For example, to use the URI "http://host/?x=1&y=2" as a linking URI, it must be written <A href="http://host/?x=1&#38;y=2"> or <A href="http://host/?x=1&amp;y=2">.
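Authoring tools that generate links can apply this escaping automatically. The following Python sketch uses the standard library function html.escape to produce the second form shown above; the variable names are illustrative only.

```python
from html import escape

uri = "http://host/?x=1&y=2"

# quote=True also escapes quotation marks, which keeps the value safe
# inside a double-quoted attribute.
anchor = f'<A href="{escape(uri, quote=True)}">'
print(anchor)  # <A href="http://host/?x=1&amp;y=2">
```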

We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
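On the server side, accepting both separators takes little code. The following Python sketch shows one possible approach; parse_query is a hypothetical helper, not an API of any particular server or CGI library.

```python
import re
from urllib.parse import unquote_plus


def parse_query(query: str) -> list:
    """Split a query string on either '&' or ';' into (name, value) pairs."""
    pairs = []
    for field in re.split(r"[&;]", query):
        if not field:
            continue  # skip empty fields such as a trailing separator
        name, _, value = field.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


print(parse_query("x=1;y=2"))  # [('x', '1'), ('y', '2')]
print(parse_query("x=1&y=2"))  # [('x', '1'), ('y', '2')]
```

With such a server, an author can write href="http://host/?x=1;y=2" without any escaping.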

B.3 SGML implementation notes

B.3.1 Line breaks

SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception.

The following two HTML examples must be rendered identically:

<P>Thomas is watching TV.</P>

<P>
Thomas is watching TV.
</P>

So must the following two examples:

<A>My favorite Website</A>

<A>
My favorite Website
</A>
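The effect of this rule can be approximated in Python with two regular expressions. This is a simplified sketch, not an SGML parser: it assumes that tags contain no ">" inside attribute values, and the function name drop_ignorable_line_breaks is hypothetical.

```python
import re

# A line break immediately after a start tag.
_AFTER_START_TAG = re.compile(r"(<[A-Za-z][^<>]*>)\r?\n")
# A line break immediately before an end tag.
_BEFORE_END_TAG = re.compile(r"\r?\n(</[A-Za-z][^<>]*>)")


def drop_ignorable_line_breaks(markup: str) -> str:
    """Remove the line breaks that SGML says must be ignored."""
    markup = _AFTER_START_TAG.sub(r"\1", markup)
    return _BEFORE_END_TAG.sub(r"\1", markup)


normalized = drop_ignorable_line_breaks("<P>\nThomas is watching TV.\n</P>")
print(normalized)  # <P>Thomas is watching TV.</P>
```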

B.3.2 Specifying non-HTML data

Script and style data may appear as element content or attribute values. The following sections describe the boundary between HTML markup and foreign data.

Note. The DTD defines script and style data to be CDATA for both element content and attribute values. SGML rules do not allow character references in CDATA element content but do allow them in CDATA attribute values. Authors should pay particular attention when cutting and pasting script and style data between element content and attribute values.

This asymmetry also means that when transcoding from a richer to a poorer character encoding, the transcoder cannot simply replace unconvertible characters in script or style data with the corresponding numeric character references; it must parse the HTML document and know about each script and style language's syntax in order to process the data correctly.

Element content 

When script or style data is the content of an element (SCRIPT and STYLE), the data begins immediately after the element start tag and ends at the first ETAGO ("</") delimiter followed by a name character ([a-zA-Z]); note that this may not be the element's end tag. Authors should therefore escape "</" within the content. Escape mechanisms are specific to each scripting or style sheet language.
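The termination rule can be sketched in Python as a search for the first "</" followed by a letter. The function name script_data_end is hypothetical, and the sketch looks only at the data after the start tag.

```python
import re

# ETAGO ("</") followed by a name character, as described above.
ETAGO_THEN_NAME = re.compile(r"</[A-Za-z]")


def script_data_end(data: str):
    """Return the offset at which the script data ends, or None if it never does."""
    match = ETAGO_THEN_NAME.search(data)
    return match.start() if match else None


unescaped = 'document.write ("<EM>This won\'t work</EM>")'
escaped = r'document.write ("<EM>This will work<\/EM>")'

print(script_data_end(unescaped) == unescaped.index("</EM>"))  # True
print(script_data_end(escaped))  # None
```

In the first string the data is cut off at "</EM>", well before the intended SCRIPT end tag; in the second the backslash separates "<" from "/", so no premature terminator is found.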

ILLEGAL EXAMPLE:
The following script data incorrectly contains a "</" sequence (as part of "</EM>") before the SCRIPT end tag:

    <SCRIPT type="text/javascript">
      document.write ("<EM>This won't work</EM>")
    </SCRIPT>

In JavaScript, this code can be expressed legally by hiding the ETAGO delimiter before an SGML name start character:

    <SCRIPT type="text/javascript">
      document.write ("<EM>This will work<\/EM>")
    </SCRIPT>

In Tcl, one may accomplish this as follows:

    <SCRIPT type="text/tcl">
      document write "<EM>This will work<\/EM>"
    </SCRIPT>

In VBScript, the problem may be avoided with the Chr() function:

    "<EM>This will work<" & Chr(47) & "EM>"

Attribute values 

When script or style data is the value of an attribute (either style or the intrinsic event attributes), authors should escape occurrences of the delimiting single or double quotation mark within the value according to the script or style language convention. Authors should also escape occurrences of "&" if the "&" is not meant to be the beginning of a character reference.

Thus, for example, one could write:

 <INPUT name="num" value="0"
 onchange="if (compare(this.value, &quot;help&quot;)) {gethelp()}">
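A tool that generates such attributes can apply the escaping mechanically. The Python sketch below uses html.escape from the standard library; note that it also escapes "<" and ">", which goes beyond what is described above but is harmless in an attribute value.

```python
from html import escape

script = 'if (compare(this.value, "help")) {gethelp()}'

# quote=True turns each double quotation mark into &quot;.
attribute = f'onchange="{escape(script, quote=True)}"'
print(attribute)
# onchange="if (compare(this.value, &quot;help&quot;)) {gethelp()}"
```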

B.3.3 SGML features with limited support

SGML systems conforming to [ISO8879] are expected to recognize a number of features that aren't widely supported by HTML user agents. We recommend that authors avoid using all of these features.

B.3.4 Boolean attributes

Authors should be aware that many user agents only recognize the minimized form of boolean attributes and not the full form.

For instance, authors may want to specify:

<OPTION selected>

instead of

<OPTION selected="selected">

B.3.5 Marked Sections

Marked sections play a role similar to the #ifdef construct recognized by C preprocessors.

<![INCLUDE[
 <!-- this will be included -->
]]>

<![IGNORE[
 <!-- this will be ignored -->
]]>

SGML also defines the use of marked sections for CDATA content, within which "<" is not treated as the start of a tag, e.g.,

<![CDATA[
 <an> example of <sgml> markup that is
 not <painful> to write with < and such.
]]>

The telltale sign that a user agent doesn't recognize a marked section is the appearance of "]]>", which is seen when the user agent mistakenly uses the first ">" character as the end of the tag starting with "<![".

B.3.6 Processing Instructions

Processing instructions are a mechanism to capture platform-specific idioms. A processing instruction begins with "<?" and ends with ">":

<?instruction >

For example:

<?>
<?style tt = font courier>
<?page break>
<?experiment> ... <?/experiment>

Authors should be aware that many user agents render processing instructions as part of the document's text.
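A tool that wants to avoid rendering processing instructions as text could recognize them with a pattern like the one in this Python sketch. The names PI, instructions and strip_processing_instructions are illustrative, and the pattern simply follows the "<?" ... ">" delimiters described above.

```python
import re

# "<?" followed by anything up to the first ">".
PI = re.compile(r"<\?([^>]*)>")

sample = "<?>\n<?style tt = font courier>\n<?page break>\n<?experiment> ... <?/experiment>"

instructions = PI.findall(sample)
print(instructions)
# ['', 'style tt = font courier', 'page break', 'experiment', '/experiment']


def strip_processing_instructions(text: str) -> str:
    """Remove processing instructions so they are not shown as text."""
    return PI.sub("", text)
```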

B.3.7 Shorthand markup

Some SGML SHORTTAG constructs save typing but add no expressive capability to the SGML application. Although these constructs technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. Thus, while SHORTTAG constructs of SGML related to attributes are widely used and implemented, those related to elements are not. Documents that use them are conforming SGML documents, but are unlikely to work with many existing HTML tools.

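The SHORTTAG constructs in question are the following:

  * NET tags: <name/.../
  * closed Start Tag: <name1<name2>
  * Empty Start Tag: <>
  * Empty End Tag: </>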

B.4 Notes on helping search engines index your Web site

This section provides some simple suggestions that will make your documents more accessible to search engines.

Define the document language
In the global context of the Web it is important to know which human language a page was written in. This is discussed in the section on language information.
Specify language variants of this document
If you have prepared translations of this document into other languages, you should use the LINK element to reference these. This allows an indexing engine to offer users search results in the user's preferred language, regardless of how the query was written. For instance, the following links offer French and German alternatives to a search engine:
<LINK rel="alternate" 
         type="text/html"
         href="mydoc-fr.html" hreflang="fr"
         lang="fr" title="La vie souterraine">
<LINK rel="alternate" 
         type="text/html"
         href="mydoc-de.html" hreflang="de"
         lang="de" title="Das Leben im Untergrund">
Provide keywords and descriptions
Some indexing engines look for META elements that define a comma-separated list of keywords/phrases, or that give a short description. Search engines may present these keywords as the result of a search. The value of the name attribute sought by a search engine is not defined by this specification. Consider these examples:
<META name="keywords" content="vacation,Greece,sunshine">
<META name="description" content="Idyllic European vacations">
Indicate the beginning of a collection
Collections of word processing documents or presentations are frequently translated into collections of HTML documents. It is helpful for search results to reference the beginning of the collection in addition to the page hit by the search. You may help search engines by using the LINK element with rel="start" along with the title attribute, as in:
 
<LINK rel="start"
         type="text/html"
         href="page1.html" 
         title="General Theory of Relativity">
Provide robots with indexing instructions
People may be surprised to find that an indexing robot has indexed their site, including sensitive parts that the robot should not have been permitted to visit. Many Web robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms: a "robots.txt" file and the META element in HTML documents, described below.

B.4.1 Search robots

The robots.txt file 

When a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files.

Here is a sample robots.txt file that prevents all robots from visiting the entire site:

        User-agent: *    # applies to all robots
        Disallow: /      # disallow indexing of all pages

The Robot will simply look for a "/robots.txt" URI on your site, where a site is defined as an HTTP server running on a particular host and port number. Here are some sample locations for robots.txt:

Site URI                   URI for robots.txt
http://www.w3.org/         http://www.w3.org/robots.txt
http://www.w3.org:80/      http://www.w3.org:80/robots.txt
http://www.w3.org:1234/    http://www.w3.org:1234/robots.txt
http://w3.org/             http://w3.org/robots.txt
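The mapping from a site URI to its robots.txt location keeps the scheme, host and port and replaces the path. A Python sketch, with the hypothetical helper name robots_txt_uri:

```python
from urllib.parse import urlsplit


def robots_txt_uri(site_uri: str) -> str:
    """Return the robots.txt URI for the server named by `site_uri`."""
    parts = urlsplit(site_uri)
    # netloc holds the host and, if present, the port number.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_txt_uri("http://www.w3.org:1234/"))
# http://www.w3.org:1234/robots.txt
```

Any path in the input is discarded, which is why "robots.txt" files placed in user directories are never consulted.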

There can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this your users might want to use the Robots META Tag instead.

Some tips: URIs are case-sensitive, and the "/robots.txt" string must be all lowercase. Blank lines are not permitted within a single record.

There must be exactly one "User-agent" field per record. The robot should be liberal in interpreting this field. A case-insensitive substring match of the name without version information is recommended.

If the value is "*", the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
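The record selection described in the last two paragraphs might be sketched in Python as follows. The data layout (a list of pairs of "User-agent" value and "Disallow" values) and the function name select_record are assumptions made for this illustration.

```python
def select_record(records, robot_name: str):
    """Pick the Disallow list that applies to `robot_name`, or None."""
    # Compare without version information, e.g. "ExampleBot/2.1" -> "examplebot".
    robot = robot_name.split("/")[0].lower()
    default = None
    for agent_field, disallow_values in records:
        agent = agent_field.strip().lower()
        if agent == "*":
            default = disallow_values  # used only if no other record matches
        elif agent and agent in robot:  # case-insensitive substring match
            return disallow_values
    return default


records = [("*", ["/"]), ("ExampleBot", ["/private/"])]
print(select_record(records, "examplebot/2.1"))  # ['/private/']
print(select_record(records, "OtherBot/1.0"))    # ['/']
```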

The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,

    Disallow: /help disallows both /help.html and /help/index.html, whereas
    Disallow: /help/ would disallow /help/index.html but allow /help.html. 

An empty value for "Disallow" indicates that all URIs can be retrieved. At least one "Disallow" field must be present in the robots.txt file.
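The prefix rule, including the empty value, can be expressed in a few lines of Python. The function name is_disallowed is hypothetical.

```python
def is_disallowed(path: str, disallow_values) -> bool:
    """Return True if `path` starts with any non-empty Disallow value."""
    # An empty value disallows nothing, so it is skipped.
    return any(value and path.startswith(value) for value in disallow_values)


print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True
print(is_disallowed("/help/index.html", ["/help/"]))  # True
print(is_disallowed("/help.html", ["/help/"]))        # False
print(is_disallowed("/anything", [""]))               # False
```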

Robots and the META element 

The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required.

In the following example a robot should neither index this document, nor analyze it for links.

<META name="ROBOTS" content="NOINDEX, NOFOLLOW">

The list of terms in the content is ALL, INDEX, NOFOLLOW, NOINDEX. The name and the content attribute values are case-insensitive.
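A robot could interpret the content value along the lines of this Python sketch. The function name robots_directives is hypothetical, and treating the absence of NOINDEX or NOFOLLOW as permission is an assumption of the sketch, not a statement of this specification.

```python
def robots_directives(content: str):
    """Return (may_index, may_follow) for a ROBOTS META content value."""
    # Terms are comma-separated and case-insensitive.
    terms = {term.strip().upper() for term in content.split(",") if term.strip()}
    may_index = "NOINDEX" not in terms
    may_follow = "NOFOLLOW" not in terms
    return may_index, may_follow


print(robots_directives("NOINDEX, NOFOLLOW"))  # (False, False)
print(robots_directives("noindex"))            # (False, True)
print(robots_directives("ALL"))                # (True, True)
```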

Note. In early 1997 only a few robots implement this, but this is expected to change as more public attention is given to controlling indexing robots.

B.5 Notes on tables

B.5.1 Design rationale

The HTML table model has evolved from studies of existing SGML table models, the treatment of tables in common word processing packages, and a wide range of tabular layout techniques in magazines, books and other paper-based documents. The model was chosen to allow simple tables to be expressed simply with extra complexity available when needed. This makes it practical to create the markup for HTML tables with everyday text editors and reduces the learning curve for getting started. This feature has been very important to the success of HTML to date.

Increasingly, people are creating tables by converting from other document formats or by creating them directly with WYSIWYG editors. It is important that the HTML table model fit well with these authoring tools. This affects how the cells that span multiple rows or columns are represented, and how alignment and other presentation properties are associated with groups of cells.

Dynamic reformatting 

A major consideration for the HTML table model is that the author does not control how a user will size a table, what fonts he or she will use, etc. This makes it risky to rely on column widths specified in terms of absolute pixel units. Instead, tables must be able to change sizes dynamically to match the current window size and fonts. Authors can provide guidance as to the relative widths of columns, but user agents should ensure that columns are wide enough to render the width of the largest element of the cell's content. If the author's specification must be overridden, relative widths of individual columns should not be changed drastically.

Incremental display 

For large tables or slow network connections, incremental table display is important to user satisfaction. User agents should be able to begin displaying a table before all of the data has been received. The default window width for most user agents shows about 80 characters, and the graphics for many HTML pages are designed with these defaults in mind. By specifying the number of columns, and including provision for control of table width and the widths of different columns, authors can give hints to user agents that allow the incremental display of table contents.

For incremental display, the browser needs the number of columns and their widths. The default width of the table is the current window size (width="100%"). This can be altered by setting the width attribute of the TABLE element. By default, all columns have the same width, but you can specify column widths with one or more COL elements before the table data starts.

The remaining issue is the number of columns. Some people have suggested waiting until the first row of the table has been received, but this could take a long time if the cells have a lot of content. On the whole it makes more sense, when incremental display is desired, to get authors to explicitly specify the number of columns in the TABLE element.

Authors still need a way of telling user agents whether to use incremental display or to size the table automatically to fit the cell contents. In the two-pass auto-sizing mode, the number of columns is determined by the first pass. In the incremental mode, the number of columns must be stated up front (with COL or COLGROUP elements).

Structure and presentation 

HTML distinguishes structural markup such as paragraphs and quotations from rendering idioms such as margins, fonts, colors, etc. How does this distinction affect tables? From the purist's point of view, the alignment of text within table cells and the borders between cells is a rendering issue, not one of structure. In practice, though, it is useful to group these with the structural information, as these features are highly portable from one application to the next. The HTML table model leaves most rendering information to associated style sheets. The model presented in this specification is designed to take advantage of such style sheets but not to require them.

Current desktop publishing packages provide very rich control over the rendering of tables, and it would be impractical to reproduce this in HTML, without making HTML into a bulky rich text format like RTF or MIF. This specification does, however, offer authors the ability to choose from a set of commonly used classes of border styles. The frame attribute controls the appearance of the border frame around the table while the rules attribute determines the choice of rulings within the table. A