Continuing the discussion from Standardizing innerText:
Okay, so looking directly at this table, I think rather than trying to invent a parallel logic that sorta-kinda-interacts with some of CSS sometimes, Plain-Text Conversion should start from the CSS style/layout/selector engine browsers are already using for rich content (because, essentially, how an element should be converted to text is as much a property of its "style" as its margins or font are). Here are some of my thoughts around the process for doing that (note that I'm not really a CSS lawyer so some of this might be a little muddled):
- Hidden elements MUST NOT be included in the plain text. This includes elements hidden with display: none; elements with visibility: hidden may instead be replaced by a space-equivalent empty element. I'm not familiar enough with HTML spec behavior to dictate the transformation yet.
- Anonymous boxes MUST be treated the same as non-anonymous boxes.
- Within an inline content block, whitespace should be collapsed/preserved according to the white-space CSS rule, after applying the text insertions defined below.
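To make the bullets above concrete, here is a minimal sketch of the filtering and whitespace steps. The `Node` class, `collect_text`, and `apply_white_space` are hypothetical names invented for illustration, and the single-space placeholder for visibility: hidden is an assumption, not something the proposal settles.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a styled element or text run."""
    text: str = ""
    style: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def collect_text(node: Node) -> str:
    # display: none drops the element together with its whole subtree.
    if node.style.get("display") == "none":
        return ""
    # Assumption: visibility: hidden leaves a single-space placeholder.
    # (Real CSS lets descendants opt back in; that is not modelled here.)
    if node.style.get("visibility") == "hidden":
        return " "
    return node.text + "".join(collect_text(child) for child in node.children)


def apply_white_space(text: str, white_space: str = "normal") -> str:
    # Only "pre" and "pre-wrap" are treated as preserving in this sketch;
    # every other value is handled like "normal" and collapses runs.
    if white_space in ("pre", "pre-wrap"):
        return text
    return re.sub(r"\s+", " ", text).strip()
```

Collapsing happens last, on the concatenated text, which matches the ordering in the final bullet: insertions first, white-space handling afterwards.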
The remaining differences are handled by plaintext CSS rules dictating the transformation of the HTML content into plain text.
Many of the plaintext rules specify what text to insert before, after, or between each type of display with known values (between values are only applied between two sibling elements of the same display type, after any before value and before any
plaintext-between-block: Defaults to
plaintext-between-inline: Defaults to
plaintext-between-table-cell: Defaults to
Other values may be defined but are all defaulted to
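A rough sketch of how the between rules might be applied to a run of siblings. The separator values in `BETWEEN` are placeholders chosen for illustration only (they are not the defaults from this post), and `join_siblings` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Assumed separators, for illustration only; not the proposal's defaults.
BETWEEN: Dict[str, str] = {"block": "\n", "inline": "", "table-cell": "\t"}


def join_siblings(siblings: List[Tuple[str, str]],
                  between: Dict[str, str] = BETWEEN) -> str:
    """Join (display, text) pairs given in document order.

    A separator is emitted only between two adjacent siblings that share
    the same display type; mismatched neighbours get nothing.
    """
    out = []
    prev_display = None
    for display, text in siblings:
        if prev_display is not None and display == prev_display:
            out.append(between.get(display, ""))
        out.append(text)
        prev_display = display
    return "".join(out)
```

For example, two table cells followed by two blocks get a separator inside each pair but none at the cell/block boundary.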
plaintext-fallback-content: Defaults to "remove", meaning that complex (media) elements with fallback content should be removed. A value of "replace" states that the fallback content should replace the value of the media element.
text-transform transformations should apply to copied text. A value of "preserve" uses the transformed text; a value of "discard" uses the non-transformed text. Defaults to
plaintext-input-content: What to use for the plaintext content of <input> elements. Defaults to "value", which uses the string value of the element as its content. A value of "remove" for this rule removes the <input> element from consideration.
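The three value-switching rules above might be sketched as follows. All function names are hypothetical, only the uppercase and lowercase transforms are modelled, and the preserve/discard mode is a required argument because its default is not given here.

```python
from typing import Optional


def fallback_plaintext(fallback: str, mode: str = "remove") -> str:
    """plaintext-fallback-content: drop the media element or use its fallback."""
    if mode == "remove":
        return ""
    if mode == "replace":
        return fallback
    raise ValueError(f"unknown mode: {mode!r}")


def transformed_plaintext(source: str, text_transform: str, mode: str) -> str:
    """'preserve' copies the text-transform result; 'discard' copies the source."""
    if mode == "discard":
        return source
    if mode != "preserve":
        raise ValueError(f"unknown mode: {mode!r}")
    if text_transform == "uppercase":
        return source.upper()
    if text_transform == "lowercase":
        return source.lower()
    return source  # other text-transform values are not modelled in this sketch


def input_plaintext(value: str, mode: str = "value") -> Optional[str]:
    """plaintext-input-content: None means the <input> is skipped entirely."""
    if mode == "value":
        return value
    if mode == "remove":
        return None
    raise ValueError(f"unknown mode: {mode!r}")
```

Returning `None` rather than an empty string for "remove" keeps "removed from consideration" distinct from "present but empty", which could matter when between values are applied to the surrounding siblings.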
Also, there should probably be some kind of plaintext-include: discard to otherwise hide elements when converting to plaintext, although I don't know if there's an existing rule (e.g. for accessibility) that may be better suited to this purpose and could be repurposed here.
Similarly, there should maybe be general rules for "text pseudo-elements" that may imitate other styling features in plaintext, like a plaintext-before that may be used to add "* " bullets for list items. (These may instead be called plaintext-suffix to differentiate them from the plaintext-before-* display rules.)
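As a small illustration of the bullet idea, here is a sketch in which each list item gets a before string prepended. `render_list` is a hypothetical helper, and joining items with a newline is an assumption standing in for whatever the block-level between value would be.

```python
from typing import List


def render_list(items: List[str], before: str = "* ") -> str:
    # Prefix every item with the plaintext-before string, then join the
    # items with a newline (an assumed block separator).
    return "\n".join(before + item for item in items)
```

Calling `render_list(["apples", "pears"])` yields two lines, each starting with `* `.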
Note that this spec does not handle word-wrap: in the CSS Plain-Text Transformation model, all lines may be of an arbitrarily long length. A future extension to this spec may attempt to specify word-wrapping constraints based on character limits or font metrics, which may then interact with the word-wrap rule specifying text wrapping behavior for display, but that is out of scope for this specification. It is recommended that implementations wishing to wrap text use a general algorithm for wrapping plain text of arbitrary line width.
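For implementations that do want wrapping, one possible "general algorithm" is a post-processing pass over the finished plain text. This sketch uses Python's standard textwrap module; `wrap_plaintext` and the 72-character default are assumptions, and it counts characters rather than using font metrics.

```python
import textwrap


def wrap_plaintext(text: str, width: int = 72) -> str:
    # Wrap each existing line on its own so that newlines produced by the
    # conversion step (for example between blocks) are kept as they are.
    return "\n".join(
        textwrap.fill(line, width=width) for line in text.split("\n")
    )
```

Because wrapping runs after conversion, it stays independent of the plaintext rules themselves, in line with keeping word-wrap out of scope.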