Continuing the discussion from Standardizing innerText:
Okay, so looking directly at this table, I think rather than trying to invent a parallel logic that sorta-kinda-interacts with some of CSS sometimes, Plain-Text Conversion should start from the CSS style/layout/selector engine browsers are already using for rich content (because, essentially, how an element should be converted to text is as much a property of its “style” as its margins or font are). Here are some of my thoughts around the process for doing that (note that I’m not really a CSS lawyer so some of this might be a little muddled):
- Hidden elements MUST NOT be included in the plain text. This includes elements hidden with `display: none` as well as `visibility: hidden`.
  - `visibility: hidden` may be replaced by a space-equivalent empty element - I’m not really familiar enough with HTML spec behavior to dictate the transformation yet.
- Anonymous boxes MUST be treated the same as non-anonymous boxes.
- Within an inline content block, whitespace should be collapsed/preserved according to the `white-space` CSS rule, after applying the text insertions defined below.
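To make the first and third rules a bit more concrete, here is a minimal sketch on a toy box tree. The `Box` class, its fields, and both function names are made up for illustration (a real implementation would read computed styles from the layout engine), and the fields are assumed to already hold computed values. It also collapses each box's own text directly and leaves out the insertion step, so it is only an approximation of the ordering described above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Box:
    """Hypothetical stand-in for a styled box; fields hold computed values."""
    text: str = ""
    display: str = "inline"
    visibility: str = "visible"
    white_space: str = "normal"
    children: list[Box] = field(default_factory=list)


def collapse(text: str, white_space: str) -> str:
    """Approximate white-space handling for plain-text output."""
    if white_space in ("pre", "pre-wrap"):
        return text  # spaces and newlines are both preserved
    if white_space == "pre-line":
        return re.sub(r"[ \t]+", " ", text)  # collapse spaces, keep newlines
    return re.sub(r"\s+", " ", text)  # normal / nowrap: collapse everything


def visible_text(box: Box) -> str:
    if box.display == "none":
        return ""  # display: none removes the whole subtree
    # visibility: hidden drops this box's own text; a descendant whose
    # computed visibility is "visible" still contributes.
    own = "" if box.visibility == "hidden" else collapse(box.text, box.white_space)
    return own + "".join(visible_text(child) for child in box.children)
```

For example, a tree containing `"a  \n b"`, a `display: none` box, a hidden box with one visible child `"!"`, and a `white-space: pre` box `"x  y"` comes out as `"a b!x  y"`.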
The remaining differences are handled by `plaintext` CSS rules dictating the transformation of the HTML content into plain text.
Many of the plaintext rules specify what text to insert before, after, or between each type of `display` with known values (`between` values are only applied between two sibling elements of the same display type, after any `before` value and before any `after` value):
- `plaintext-after-table-caption`, `plaintext-after-block`, `plaintext-between-block`: Defaults to `"\n"`.
- `plaintext-between-inline`: Defaults to `" "`.
- `plaintext-between-table-cell`: Defaults to `"\t"`.
Other values may be defined, but they all default to `""`.
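As a rough illustration of how these defaults could combine, here is a small sketch that joins a flat run of sibling boxes. The `rule` and `join_siblings` helpers and the `(display, text)` pair representation are hypothetical. The sketch emits the previous sibling's `after` text, then the `between` text, then the next sibling's `before` text; that ordering is my own assumption and only one possible reading of the parenthetical above.

```python
# Default values taken from the list above; anything not listed is "".
DEFAULTS = {
    "plaintext-after-table-caption": "\n",
    "plaintext-after-block": "\n",
    "plaintext-between-block": "\n",
    "plaintext-between-inline": " ",
    "plaintext-between-table-cell": "\t",
}


def rule(position, display, overrides=None):
    """Look up plaintext-<position>-<display>, falling back to the defaults."""
    name = f"plaintext-{position}-{display}"
    return {**DEFAULTS, **(overrides or {})}.get(name, "")


def join_siblings(siblings, overrides=None):
    """siblings: list of (display, text) pairs in document order."""
    out = []
    for i, (display, text) in enumerate(siblings):
        out.append(rule("before", display, overrides))
        out.append(text)
        out.append(rule("after", display, overrides))
        # "between" only applies when the next sibling has the same display type.
        next_display = siblings[i + 1][0] if i + 1 < len(siblings) else None
        if next_display == display:
            out.append(rule("between", display, overrides))
    return "".join(out)
```

With the defaults, two adjacent blocks `A` and `B` become `"A\n\nB\n"` (a blank line between paragraphs), and two table cells `x` and `y` become `"x\ty"`.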
Also:
- `plaintext-fallback-content`: Defaults to “remove”, meaning that complex (media) elements with fallback content should be removed. A value of “replace” states that the fallback content should replace the value of the media element.
- `plaintext-transform`: Whether `text-transform` transformations should apply to copied text. A value of “preserve” uses the transformed text; a value of “discard” uses the non-transformed text. Defaults to `preserve`.
- `plaintext-input-content`: What to use for the plaintext content of `<input>` elements. Defaults to “value”, which uses the string value of the element as its content. A value of “remove” for this rule removes the `<input>` element from consideration.
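Here is a minimal sketch of what these three rules would decide for a single element. The function names and signatures are invented for illustration, and `apply_transform` only handles `uppercase` and `lowercase` (other `text-transform` values pass through unchanged), so treat it as a simplification rather than a full model of `text-transform`.

```python
def apply_transform(raw, text_transform, plaintext_transform="preserve"):
    """Return the text that plaintext-transform would expose for copying."""
    if plaintext_transform == "discard":
        return raw  # use the non-transformed source text
    if text_transform == "uppercase":
        return raw.upper()
    if text_transform == "lowercase":
        return raw.lower()
    return raw  # other text-transform values are not modelled here


def input_text(value, plaintext_input_content="value"):
    """Text contributed by an <input> element."""
    if plaintext_input_content == "value":
        return value
    if plaintext_input_content == "remove":
        return ""
    raise ValueError(f"unknown plaintext-input-content: {plaintext_input_content}")


def media_text(fallback, plaintext_fallback_content="remove"):
    """Text contributed by a media element that has fallback content."""
    if plaintext_fallback_content == "replace":
        return fallback
    if plaintext_fallback_content == "remove":
        return ""
    raise ValueError(f"unknown plaintext-fallback-content: {plaintext_fallback_content}")
```

So a heading styled with `text-transform: uppercase` copies as `"HELLO"` by default and as `"Hello"` under `plaintext-transform: discard`.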
Also, there should probably be some kind of `plaintext-include: discard` to otherwise hide elements when converting to plaintext, although I don’t know if there’s an existing rule (e.g. for accessibility) that may be better suited to this purpose and repurposed here.
Similarly, there should maybe be general rules for “text pseudo-elements” that may imitate other styling features in plaintext, like a `plaintext-before` that may be used to add `"* "` bullets for list items. (These may instead be called `plaintext-prefix` and `plaintext-suffix` to differentiate them from the `plaintext-before-*` display rules.)
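A tiny sketch of the bullet idea, assuming the `plaintext-prefix` / `plaintext-suffix` naming. The `render_list` helper is hypothetical, and ending each item with `"\n"` is my own assumption, since the defaults above do not say anything about the `list-item` display type.

```python
def render_list(items, prefix="* ", suffix="", line_end="\n"):
    """Render list items as plain text with a per-item prefix and suffix."""
    return "".join(f"{prefix}{item}{suffix}{line_end}" for item in items)
```

Here `render_list(["one", "two"])` gives `"* one\n* two\n"`, and passing `prefix="- "` would switch the bullet style without touching the content.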
Note that this spec does not handle word-wrap - in the CSS Plain-Text Transformation model, lines may be arbitrarily long. A future extension to this spec may attempt to specify word-wrapping constraints based on character limits or font metrics, which may then interact with the `word-wrap` rule specifying text wrapping behavior for display; that is out of scope for this specification. It is recommended that implementations wishing to wrap text use a general algorithm for wrapping plain text of arbitrary line width.
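For what it's worth, "a general algorithm for wrapping plain text" can be as simple as an off-the-shelf wrapper applied after conversion. Below is a sketch using Python's standard `textwrap` module; the `wrap_plaintext` name and the 72-column default are arbitrary choices of mine, not part of the proposal.

```python
import textwrap


def wrap_plaintext(text, width=72):
    """Wrap each existing line on its own so inserted newlines survive."""
    return "\n".join(
        textwrap.fill(
            line,
            width=width,
            expand_tabs=False,         # keep "\t" cell separators as tabs
            replace_whitespace=False,  # do not rewrite whitespace characters
        )
        for line in text.split("\n")
    )
```

Wrapping per line matters here because the newlines come from the `plaintext-after-*` and `plaintext-between-*` insertions, and a wrapper that reflowed the whole string would erase them.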