CSS Plain-Text Conversion

stuartpb · 2015-07-27

Continuing the discussion from Standardizing innerText:

Okay, so looking directly at this table, I think that rather than inventing a parallel logic that sorta-kinda-interacts with some of CSS sometimes, Plain-Text Conversion should start from the CSS style/layout/selector engine browsers already use for rich content. Essentially, how an element should be converted to text is as much a property of its “style” as its margins or font are. Here are some of my thoughts on how to do that (note that I’m not really a CSS lawyer, so some of this might be a little muddled):

- Hidden elements MUST NOT be included in the plain text. This includes elements hidden with display: none as well as visibility: hidden.
  - An element with visibility: hidden may instead be replaced by a space-equivalent empty element. I’m not really familiar enough with HTML spec behavior to dictate the transformation yet.
- Anonymous boxes MUST be treated the same as non-anonymous boxes.
- Within an inline content block, whitespace should be collapsed/preserved according to the white-space CSS rule, after applying the text insertions defined below.
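
As a rough sketch of the hidden-element and whitespace rules above (Python is used only for illustration): the `Node` class and both function names are hypothetical stand-ins for a browser's styled tree, not a real API. It assumes a hidden element hides its whole subtree, which is a simplification for visibility: hidden, and it handles only the `normal` and `pre` values of white-space.

```python
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a styled element; not a real browser API."""
    children: List[Union["Node", str]] = field(default_factory=list)
    display: str = "inline"
    visibility: str = "visible"
    white_space: str = "normal"


def raw_text(node: Node) -> str:
    """Concatenate text, skipping anything hidden by display or visibility."""
    # Simplification: a hidden element drops its entire subtree.
    if node.display == "none" or node.visibility == "hidden":
        return ""
    parts = []
    for child in node.children:
        parts.append(child if isinstance(child, str) else raw_text(child))
    return "".join(parts)


def to_plain_text(node: Node) -> str:
    """Collapse or preserve whitespace per the root's white-space value."""
    text = raw_text(node)
    if node.white_space == "pre":
        return text  # preserved as written
    # 'normal': runs of whitespace collapse to a single space
    return re.sub(r"[ \t\r\n]+", " ", text).strip()
```

Collapsing runs only after the text has been concatenated mirrors the ordering in the last rule: any inserted text would be added first, then white-space is applied.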

The remaining differences are handled by plaintext CSS rules dictating the transformation of the HTML content into plain text.

Many of the plaintext rules specify what text to insert before, after, or between boxes of each known display value. A between value applies only between two sibling elements of the same display type, and is inserted after any before value and before any after value:

plaintext-after-table-caption, plaintext-after-block, plaintext-between-block: Defaults to "\n".
plaintext-between-inline: Defaults to " ".
plaintext-between-table-cell: Defaults to "\t".

Other values may be defined but all default to "".
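
A minimal sketch of how these insertion rules might combine, using the defaults listed above. The function names and the `(display, text)` input shape are assumptions made for illustration. The ordering relative to a neighbour's before/after text is one reading of the rule: the previous sibling's after text, then the between text, then the next sibling's before text. With the defaults, the choice of ordering does not change the output.

```python
# Defaults taken from the list above; any rule not listed falls back to "".
DEFAULTS = {
    "plaintext-after-table-caption": "\n",
    "plaintext-after-block": "\n",
    "plaintext-between-block": "\n",
    "plaintext-between-inline": " ",
    "plaintext-between-table-cell": "\t",
}


def rule(position: str, display: str) -> str:
    """Look up plaintext-<position>-<display>, defaulting to the empty string."""
    return DEFAULTS.get(f"plaintext-{position}-{display}", "")


def join_siblings(siblings):
    """Join already-converted sibling boxes given as (display, text) pairs."""
    out = []
    prev_display = None
    for display, text in siblings:
        # "between" applies only when the previous sibling has the same display type.
        if prev_display == display:
            out.append(rule("between", display))
        out.append(rule("before", display))
        out.append(text)
        out.append(rule("after", display))
        prev_display = display
    return "".join(out)
```

For example, three table cells come out tab-separated, while two blocks end up with a blank line between them because both the after and between values contribute a newline.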

Also:

plaintext-fallback-content: Defaults to “remove”, meaning that complex (media) elements with fallback content should be removed. A value of “replace” states that the fallback content should replace the value of the media element.
plaintext-transform: Whether text-transform transformations should apply to copied text. A value of “preserve” uses the transformed text; a value of “discard” uses the non-transformed text. Defaults to “preserve”.
plaintext-input-content: What to use for the plaintext content of <input> elements. Defaults to “value”, which uses the string value of the element as its content. A value of “remove” for this rule removes the <input> element from consideration.
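
To make two of these rules concrete, here is a small sketch. The function names and parameters are hypothetical, and only the `uppercase` and `lowercase` values of text-transform are covered.

```python
def apply_plaintext_transform(text: str,
                              text_transform: str = "none",
                              plaintext_transform: str = "preserve") -> str:
    """Return the text as it would be copied under plaintext-transform."""
    if plaintext_transform == "discard":
        return text  # use the non-transformed source text
    if text_transform == "uppercase":
        return text.upper()
    if text_transform == "lowercase":
        return text.lower()
    return text  # other text-transform values are left out of this sketch


def input_plain_text(value: str, plaintext_input_content: str = "value") -> str:
    """<input> handling: 'value' uses the string value, 'remove' drops the element."""
    return value if plaintext_input_content == "value" else ""
```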

Also, there should probably be some kind of plaintext-include: discard to otherwise hide elements when converting to plaintext, although I don’t know if there’s an existing rule (e.g. for accessibility) that would be better suited to this purpose and could be repurposed here.

Similarly, there should maybe be general rules for “text pseudo-elements” that may imitate other styling features in plaintext, like a plaintext-before that may be used to add "* " bullets for list items. (These may instead be called plaintext-prefix and plaintext-suffix to differentiate them from the plaintext-before-* display rules.)

Note that this spec does not handle word-wrap: in the CSS Plain-Text Transformation model, lines may be arbitrarily long. A future extension to this spec may attempt to specify word-wrapping constraints based on character limits or font metrics, which may then interact with the word-wrap rule that specifies text wrapping behavior for display, but that is out of scope for this specification. Implementations wishing to wrap text should use a general algorithm for wrapping plain text to an arbitrary line width.
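
As one example of such a general algorithm, Python's standard `textwrap` module can wrap the converted output after the fact. The `wrap_plain_text` helper and its default width of 72 are assumptions for illustration, not part of the proposal.

```python
import textwrap


def wrap_plain_text(text: str, width: int = 72) -> str:
    """Wrap each line independently so existing line breaks survive."""
    return "\n".join(
        # Blank lines are kept as-is so paragraph breaks are not lost.
        textwrap.fill(line, width=width) if line.strip() else line
        for line in text.split("\n")
    )
```

Wrapping is applied as a separate pass over the finished plain text, which keeps it independent of the conversion rules themselves.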

stuartpb · 2015-07-27

Relevant: http://www.w3.org/Tools/html2things.html

It’s probably worth reaching out to the plain-text browser vendors like w3m and Lynx about this (they must have some thoughts on this).

stuartpb · 2015-07-27

Actually, more versatile than plaintext-include: discard would be plaintext-content: "", which would also allow for alternative replacements to use when converting the element’s text content to plaintext.

stuartpb · 2015-07-27

Okay, so it looks like @kangax wrote a “naive spec” for this, which is sort of sensible but also sort of one of those “parallel logic that sorta-kinda-interacts with some of CSS sometimes” things I mentioned in my introduction.

stuartpb · 2015-07-27

What I need here is somebody who really understands the CSS layout model and can work with me to rewrite this in a way that hooks into that.

stuartpb · 2015-08-02

Actually, I don’t know that I necessarily agree with this (the " " default for plaintext-between-inline). If I have two immediately adjacent spans, I don’t want a space inserted between them when converted to text (unless there actually is a whitespace node between them). Considering that, I think it’s more sensible to default this to the empty string.

stuartpb · 2015-08-03

With the plaintext-content property, <br> would be handled with the rule br {plaintext-content: '\n'}.

stuartpb · 2015-08-03

The CSS properties introduced by this spec aim to replace the functionality of Blink’s TextIteratorBehaviorFlags.

stuartpb · 2015-08-03

Actually, what would be better than having several different plaintext-*-content properties, as well as one plaintext-content property that defines a literal, would be one plaintext-content property that takes either a string literal or a keyword, the same way as the content property for pseudo-elements.

Valid values for plaintext-content:

normal: Uses the text content of the element. If the element has no text content (or has text content that is not displayed, i.e. is fallback content), this is equivalent to ''.
<string>: Uses literal text content.
none: Synonym for ''.
fallback: Uses the fallback content of an element. If the element does not treat text content as a fallback, this is equivalent to ''.
attr(): Uses the content of an attribute on the element.
value: Uses the value of an input. Unlike attr(value), this uses the live value of the input. For non-input elements, this is equivalent to ''.
selected-options: Uses the value of the selected option (or options) of a <select> element. On non-<select> elements, this is equivalent to ''.

The logic behind these values being equivalent to '' in the non-applicable case is that they may then be concatenated so that one rule applies to multiple kinds of elements, e.g. normal fallback value.
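
A minimal sketch of how such a concatenated value could be resolved. The `Element` model and `resolve_plaintext_content` are hypothetical names, the value is assumed to be already split into components, and joining multiple selected options with a space is an assumption, since the separator is not specified above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """Hypothetical element model; field names are illustrative only."""
    text: str = ""        # displayed text content
    fallback: str = ""    # fallback content, e.g. inside a media element
    value: str = ""       # live value of an <input>
    selected: List[str] = field(default_factory=list)  # selected <option> values
    attrs: Dict[str, str] = field(default_factory=dict)


def resolve_plaintext_content(el: Element, components: List[str]) -> str:
    """Concatenate each component; non-applicable ones contribute ''."""
    out = []
    for comp in components:
        if comp == "normal":
            out.append(el.text)
        elif comp == "none":
            out.append("")
        elif comp == "fallback":
            out.append(el.fallback)
        elif comp == "value":
            out.append(el.value)
        elif comp == "selected-options":
            out.append(" ".join(el.selected))  # separator is an assumption
        elif comp.startswith("attr(") and comp.endswith(")"):
            out.append(el.attrs.get(comp[5:-1], ""))
        else:
            out.append(comp.strip("'\""))  # anything else is treated as a <string> literal
    return "".join(out)
```

With the single rule `normal fallback value`, a paragraph yields its text, a media element yields its fallback content, and an input yields its live value, because the other two components are empty in each case.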

stuartpb · 2015-10-05

@rocallahan just stated that Gecko could use a standard for this.