[Proposal] <speak> tag

Premise: Current HTML tags are attuned to text-based VISUAL consumption. Audio or a/v content requires embedding media files, which may not accurately match the typed content of the HTML page. The text may also not be enunciated as typed, or may otherwise be inaccessible in a precise manner to visually challenged individuals. Although other methods are available to invoke spoken text from HTML pages, there is low consistency between them. Furthermore, context or even content may be lost due to additional later markup, such as interstitial marketing ads or inline suggested-reading links, when using global “speak” or page-based TTS solutions. A specific tag would provide precise delineation of what is to be spoken.

Suggestion: It is my suggestion that a new tag be introduced, called “speak”. This tag in its use would instruct the browser to speak the enclosed text aloud, utilizing TTS (text to speech) or other functionality which shall be determined by the browser itself.

Additional suggestions:

  • text=“” to define alternate text; if present, it is the spoken text. Visible text between the tags remains the same.
  • type=onclick, onload, etc. to define browser interaction.
  • language=English, French, etc. to define the spoken language if not the browser default.
  • highlight=word, letter, wordunderline, letterunderline, etc. to define highlighting of the word currently being spoken.
  • pip=short, long to define the length of the audio pip cue when the tag is in the onhover state.

Example use case(s) and intended audience: Speaking page contents: Blind individuals, who otherwise cannot use browser gestures to locate, and more precisely select and play content (text)

Click to speak, for direct consumption: Children learning to read. Foreign language learning.

Click to speak, for communication to others: Autism cases of selective mutism. Other disabilities requiring selection of words to speak.

Additional uses: spoken notation, citation or pronunciation of selected visible text.

Thanks for your consideration! Respectfully, Brian A. Newbold

The Speech Synthesis API already covers these use cases. Since in most cases the text in the element would need to be dynamically updated, you’re already in JavaScript territory.
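For reference, a minimal click-to-speak sketch with the existing API might look something like this. It assumes a browser that implements `speechSynthesis` and `SpeechSynthesisUtterance`; the helper names `buildUtteranceOptions` and `attachClickToSpeak` are made up for illustration and are not part of any specification.

```javascript
// Hypothetical helper: turn raw element text into the values handed to the synthesizer.
function buildUtteranceOptions(text, lang) {
  // Collapse whitespace so markup indentation is not passed along as pauses or noise.
  const options = { text: String(text || "").replace(/\s+/g, " ").trim() };
  if (lang) {
    options.lang = lang; // a BCP 47 tag such as "fr-FR"
  }
  return options;
}

// Hypothetical helper: speak an element's text when it is clicked (browser only).
function attachClickToSpeak(element) {
  element.addEventListener("click", () => {
    const { text, lang } = buildUtteranceOptions(element.textContent, element.lang);
    const utterance = new SpeechSynthesisUtterance(text);
    if (lang) {
      utterance.lang = lang;
    }
    window.speechSynthesis.cancel(); // stop anything already being spoken
    window.speechSynthesis.speak(utterance);
  });
}
```

In a page, this would be wired up with something like `attachClickToSpeak(document.querySelector("p"))`; the text is read from the element at click time, so dynamically updated content is picked up.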

What is the justification for introducing an element to do this when we already have the capability in the web today?

Justification is to remove the necessity for JavaScript or any other separate API method.

As an HTML tag it applies to a much broader audience and doesn’t require additional separate coding ability or need to reference additional outside methods via an API.

To programmers, using the Speech Synthesis API seems trivial. I disagree that it replaces the need for a discrete <speak> tag, and feel the latter really should be part of the core HTML tags. For vision-challenged folks, a distinct tag provides simple access to spoken text that is far more native than having to use outside sources.

Think low level too… this comes from my child and other kids learning to program. They’re not yet ready to bounce between languages and methods that live outside of the core tag set.

Thanks!

So there is no disagreement that we do have the API to provide what you desire. That’s good to hear. So, the path forward for getting an element like this introduced is to make it yourself. Build a Web Component that does what you want. You can circulate this and get developers using it. If it picks up steam and does cover a very useful thing for web developers, then browser vendors can assess making it a core feature.

Since we have the full coverage needed to test and prove a new feature this way, it isn’t something browser vendors should worry about until it’s seen in the wild. This follows the Extensible Web Manifesto, which browser vendors are trying to uphold: they focus on providing entirely new API coverage for things we can’t currently do. If we can make something happen in user-land, we do that to prove the feature should go into browsers, rather than demanding that browsers do everything up front just because someone feels they should.

We have the current API coverage for this to be proven in user-land. That should be done to show the element is useful and solves a use-case that browsers currently don’t. Once it is proven, browsers can provide a native solution.
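As a starting point, a user-land element could be sketched roughly as below. This is only an illustration under assumptions: the tag name `speak-text` and the names `createSpeakTextClass` and `spokenText` are invented here (custom element names must contain a hyphen, so a bare `speak` is not available to user-land), and it speaks only on an explicit click or key press.

```javascript
const TAG_NAME = "speak-text"; // hypothetical name; custom elements need a hyphen

// Build the element class on top of a base class so the sketch also loads outside a browser.
function createSpeakTextClass(Base) {
  return class SpeakText extends Base {
    connectedCallback() {
      // Expose the element as a keyboard-reachable button.
      this.setAttribute("role", "button");
      this.setAttribute("tabindex", "0");
      this.addEventListener("click", () => this.speak());
      this.addEventListener("keydown", (event) => {
        if (event.key === "Enter" || event.key === " ") {
          event.preventDefault();
          this.speak();
        }
      });
    }

    // Text to speak: the visible content with whitespace collapsed.
    get spokenText() {
      return (this.textContent || "").replace(/\s+/g, " ").trim();
    }

    speak() {
      const utterance = new SpeechSynthesisUtterance(this.spokenText);
      window.speechSynthesis.cancel(); // do not talk over a previous utterance
      window.speechSynthesis.speak(utterance);
    }
  };
}

// Register only where the custom elements API exists (that is, in a browser).
if (typeof customElements !== "undefined" && typeof HTMLElement !== "undefined") {
  customElements.define(TAG_NAME, createSpeakTextClass(HTMLElement));
}
```

With that loaded, markup such as `<speak-text>Hello there</speak-text>` would render its text as usual and speak it when activated, which is enough to start testing against real pages and real assistive technology.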

“Example use case(s) and intended audience: Speaking page contents: Blind individuals, who otherwise cannot use browser gestures to locate, and more precisely select and play content (text)”

I mentioned in a comment on this proposal (when it was made elsewhere), that this use case does not hold up.

Blind people who need content to be spoken aloud will already have that capability in the form of a screen reader. The capability needs to exist at the OS level, because without it you can’t open a browser and navigate to a website in the first place.

This also means there is huge potential for a serious conflict between the screen reader and the speak element. For example:

  • When the page loads the screen reader begins reading the content (default behaviour); the speak element is triggered onLoad and the platform TTS also begins speaking the content.
  • The screen reader tabs onto a link inside the speak element and announces the link text; the speak element is triggered onFocus and the platform TTS begins speaking the content.

Maybe not clear in the use case per se, but using specific tags provides the ability to focus on what needs to be spoken or not. That is quite different than page reader or onLoad TTS or iOS accessibility page reading tools which are global. An easily apparent benefit is layout control, where page sections are spoken in a controllable order. Most page readers just dump the entire page. Another would be control of interstitial elements (such as advertisements) which are loaded separately and break the contextual flow of the spoken sections.

The use case for visually impaired users does hold up, if you are able to put aside thinking this is about reading an entire page. Discrete open/close tags for spoken text provides specific control over content and message, whereas full page TTS does not. There are many other use cases that provide substantial support as well, so don’t vacillate on the visually impaired U/C when considering.

Developers may need to workaround a new tag, but it’s doubtful it will cause problems especially if the default presentation is similar to a standard link that requires a button click to execute.

“The use case for visually impaired users does hold up, if you are able to put aside thinking this is about reading an entire page. Discrete open/close tags for spoken text provides specific control over content and message, whereas full page TTS does not. There are many other use cases that provide substantial support as well, so don’t vacillate on the visually impaired U/C when considering.”

A screen reader does not just read the content of a page from top to bottom. It can be directed to specific parts of a page, and to read the content therein. Your use case was to give blind people this capability; my point is that blind people already have that capability (and if they don’t, they are unlikely to be able to reach any page that provides it for them).

“Developers may need to workaround a new tag, but it’s doubtful it will cause problems especially if the default presentation is similar to a standard link that requires a button click to execute.”

Is the expectation that the text content of the element will only be spoken when the element is interacted with?

If so, is the expectation that the text content will be hidden visually and only communicated through speech, or presented both visually and in speech (when the element is interacted with)?

I understand and agree with you, but my point is that a screen reader is a separate technology that must be utilized, and that formatting the spoken elements becomes device- or manufacturer-specific.

There were more use cases than blind in the OP.

The expectation is to create an extensible tag with both default and additional behaviors. See Additional Suggestions in the OP:

  • type=onclick, onload: addresses the element interaction.
  • text=“your spoken text here”: addresses what is spoken as an alternative to what’s bracketed in the tags. Maybe a better attribute name is something like alt-text.
  • language=: what system language the text is in (to manipulate the character/word set and accent).
  • highlight=: in case the browser wants to be nice and highlight as it speaks.
  • pip=: in case there’s a need for a hover-over audible indication that a speakable link is present.

(attributes here are pseudo but if this ever got to the specification stage would be clearly defined and include additionals)

And the final expectation is that by default (no attributes) the text between the tags is both visible and spoken. If using the text= attribute, the attribute value overrides and is what is spoken.
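To make those defaults concrete, here is a rough sketch of how the attributes might resolve. Everything in it is hypothetical: the attribute names are the pseudo ones above, `resolveSpeakOptions` is an invented helper, and the click-only default trigger is an assumption rather than something settled.

```javascript
// Hypothetical resolver for the pseudo attributes text=, language= and type=.
function resolveSpeakOptions(attributes, visibleText) {
  const override = typeof attributes.text === "string" ? attributes.text.trim() : "";
  return {
    // A non-empty text= value overrides the visible content; otherwise speak what is shown.
    spokenText: override !== "" ? override : String(visibleText || "").trim(),
    // null means "use the browser default language".
    language: attributes.language || null,
    // Assumed default: speak only on an explicit click.
    trigger: attributes.type || "onclick",
  };
}

// No attributes: the visible text is what gets spoken.
const plain = resolveSpeakOptions({}, " colonel ");

// text= present: the attribute value is spoken, the visible text stays on screen.
const overridden = resolveSpeakOptions(
  { text: "kernel", language: "English", type: "onload" },
  "colonel"
);
```

Here `plain.spokenText` is the visible word, while `overridden.spokenText` is the `text=` value. A real implementation would also have to map a value like “English” onto whatever language tags the speech engine expects.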

Does that help? :smile:

Thanks for the further explanation Brian.

What I’m trying to understand is whether there is the potential for conflict?

If the idea is that the contents of the speak element will be announced automatically by the platform TTS (on focus or on hover for example), it will conflict with a screen reader if there is one running at the time.

If the idea is that it will only be triggered by a deliberate user interaction with a control of some sort, then that’s less of a concern - though I’m struggling to imagine how that interaction might work in practice?

This is why they should develop a web component to do what they desire. They can then use it live and show it off. Then screen readers can be thrown on top and that interaction can be dealt with.

However, there is still the question of “Does a user even expect any given thing to just talk?” The answer is no. If I click an “Add to cart” button, I don’t want the system saying “Added to cart” out loud or anything like that. It should not interrupt anything else I’m doing on my system. The one niche case where this could (possibly) be really useful is a ‘Read Article’ button that synthesizes a reading of a news article. However, even that would be dreadful for most people compared to playing back a pre-recorded reading by an actual human.

Personally, I’d rather not run down the rabbit hole of “What if?” scenarios. We have the tech to make this work now, so if this person really wants it in browsers, the way to go is making a user-land component and battle-testing it where real-world cases can be thrown against it live. That way, when we ask “what if?”, we can actually test it, not just speculate on how it may or may not function.

The proof is in the pudding and we have no pudding.

That is perhaps due to your fortune of having all of your senses.

What OP is essentially asking for is SSML parsing to be implemented in browsers, instead of being farmed out to external web services, see https://lists.w3.org/Archives/Public/www-voice/2017OctDec/0000.html. A start https://github.com/guest271314/SpeechSynthesisSSMLParser. The proposal is expressing similarity to

et al.

No, Web Speech API does not now provide SSML parsing.

Relevant specification issue https://github.com/w3c/speech-api/issues/10

Now full SSML parsing is far more applicable for browsers to take up compared to a general speech tag. That is worth pushing on IMO since it is far more than just “Take text and read it.” It provides far more nuanced capabilities to control the speed and flow of the speech.

“compared to a general speech tag”

The root of an SSML document is the general <speak> tag, which is technically already defined in section 3.1.1, “speak Root Element”, of the SSML 1.1 specification:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  ... the body ...
</speak>

“That is worth pushing on IMO”

Perhaps you can submit 3. and 4. at Share your biggest challenges with the broader community to “The Web We Want”.

@Garbee See also https://github.com/mhakkinen/SSML-issues.

@Brian_Newbold Technically the functionality requested already exists. The HTML document can be parsed for <speak> elements and a new SpeechSynthesisUtterance instance can be created for each tag. Web Speech API now requires a user action before it will output audio, unless the user-gesture requirement is disabled.

Have you tried to parse an HTML document and execute speak() for each <speak> tag?
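A rough sketch of that approach, assuming the page contains literal <speak> elements (which the HTML parser keeps as unknown elements) and a browser with the Web Speech API. The function names `collectSpeakTexts` and `speakAll` are invented for illustration.

```javascript
// Gather the text of every <speak> element under `root`, in document order.
function collectSpeakTexts(root) {
  return Array.from(root.querySelectorAll("speak"))
    .map((element) => (element.textContent || "").replace(/\s+/g, " ").trim())
    .filter((text) => text.length > 0); // skip empty elements
}

// Create one utterance per <speak> element (browser only).
function speakAll(root) {
  for (const text of collectSpeakTexts(root)) {
    // speak() appends to the utterance queue, so passages play one after another.
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
  }
}
```

Because of the user-gesture requirement, `speakAll(document)` would normally be called from a click or key handler rather than on page load.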