[Proposal] Allow scrolling to a specified text snippet in a navigation

bokan · 2019-03-20

Current specifications and implementations allow a URL to specify a target in its fragment. If the navigated-to resource has:

an element with an id matching the fragment
a link with a name attribute that matches the fragment

that element will be the CSS target. The UA will scroll the targeted element into view when the resource is loaded. This allows linking to specific parts of a resource. For example, the Wikipedia entry on “Cat” is very long and broad. However, each subsection is marked with an id so users and resources can link directly to a subsection, for example: Cat Behavior.
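As a rough illustration of the lookup described above, here is a minimal Python sketch. It assumes a simplified rule (first element with a matching id, otherwise first link with a matching name) and is not the full algorithm from the HTML specification. The names `TargetFinder` and `find_target` and the example.org URLs are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlsplit, unquote


class TargetFinder(HTMLParser):
    """Records the first element whose id, or first <a> whose name, equals the fragment."""

    def __init__(self, fragment: str) -> None:
        super().__init__()
        self.fragment = fragment
        self.id_match: Optional[str] = None
        self.name_match: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.id_match is None and attributes.get("id") == self.fragment:
            self.id_match = tag
        if (self.name_match is None and tag == "a"
                and attributes.get("name") == self.fragment):
            self.name_match = tag


def find_target(url: str, html: str) -> Optional[str]:
    """Return the tag name of the element the fragment would target, or None."""
    fragment = unquote(urlsplit(url).fragment)
    if not fragment:
        return None
    finder = TargetFinder(fragment)
    finder.feed(html)
    finder.close()
    # Simplifying assumption: an id match wins over a name match.
    return finder.id_match or finder.name_match


page = '<h2 id="Behavior">Behavior</h2><a name="Legacy">old anchor</a>'
print(find_target("https://example.org/cat#Behavior", page))
print(find_target("https://example.org/cat#Legacy", page))
```

If neither kind of element exists, `find_target` returns `None`, which corresponds to the page loading without any scroll target.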

However, this relies on the page author predicting every part of a resource that users may find interesting and marking each one up with an id. This limitation makes it difficult to use fragments to link to arbitrary content. For example, take the page: https://www.gutenberg.org/files/2147/2147-h/2147-h.htm. Suppose a user or resource wants to link to the paragraph (e.g. to cite a quote) that starts with:

“It became necessary, at last, that I should arouse both master and valet to the expediency of removing the treasure.”

The referring page would have to link to “THE GOLD-BUG” section and tell the user to scroll down or use their browser’s find-in-page function.

I’d like to propose extending the URL fragment for HTML documents to allow specifying a text snippet as the target. In this case, we could encode the above quote directly in the URL fragment and the user agent would scroll directly to the specified text, possibly highlighting it to the user.
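A minimal sketch of what encoding the quote in the fragment could look like. The post above does not fix a syntax, so the `:~:text=start,end` form used here is an assumption borrowed from the example URIs later in this thread, and `text_fragment_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import quote


def text_fragment_url(page_url: str, start: str, end: Optional[str] = None) -> str:
    """Build a URL whose fragment names a text snippet by its start (and optional end)."""

    def encode(piece: str) -> str:
        # quote() escapes ',' and '&' when safe="" but never escapes '-'.
        # '-' marks a prefix or suffix in the assumed syntax, so escape it by hand.
        return quote(piece, safe="").replace("-", "%2D")

    directive = encode(start)
    if end is not None:
        directive += "," + encode(end)
    # Drop any fragment already present on the page URL.
    base = page_url.split("#", 1)[0]
    return f"{base}#:~:text={directive}"


print(text_fragment_url(
    "https://www.gutenberg.org/files/2147/2147-h/2147-h.htm",
    "It became necessary, at last,",
    "removing the treasure.",
))
```

Giving only the first and last few words of the quote keeps the URL short while still identifying the whole sentence.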

I have an explainer in my personal GitHub repo that has more details and a proposed solution.

Would anyone else be interested in discussing and iterating on this proposal in the WICG?

yoavweiss · 2019-03-20

As a user and someone who finds themselves linking to other documents quite often, I really like this proposal! It would be very handy!

bokan · 2019-03-20

In case it is useful, here are some of the major open questions about the proposal:

How does this interact with single-page apps that use the fragment for routing? e.g. how could we specify a text fragment on a page like: http://example.org/#!splashPage (note, this isn’t unique to this proposal, the same issue exists for id-based fragments)
Selecting long, DOM-fragmented quotes: sometimes the text to select may be very long. Ideally we’d provide some convenient syntax to keep URLs manageable. Additionally, the desired quote may appear continuous on the page but be contained in separate elements, e.g. <p>paragraph 1</p><p>paragraph 2</p>. We should provide a way to select text like this irrespective of its underlying DOM structure.
Allow selecting multiple quotes. We may wish to highlight an entire row or column in a table, or two paragraphs separated by an image.

These and others are discussed in detail in our GitHub repo issues.
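To make the DOM-fragmented case concrete, here is a minimal Python sketch that searches for a quote in the flattened text of a page rather than in any single element. It is only an illustration under simplifying assumptions: text nodes are joined with a space (which would wrongly split a word that spans inline elements), script and style contents are not filtered out, and the names `TextCollector`, `flatten_text`, and `find_quote` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node, ignoring which element it came from."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flatten_text(html: str) -> str:
    """Return the page text as one string with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return re.sub(r"\s+", " ", " ".join(collector.chunks)).strip()


def find_quote(html: str, quote: str) -> int:
    """Return the offset of the quote in the flattened text, or -1 if absent."""
    needle = re.sub(r"\s+", " ", quote).strip()
    return flatten_text(html).find(needle)


html = "<p>paragraph 1</p><p>paragraph 2</p>"
print(find_quote(html, "1 paragraph 2"))  # the quote spans both <p> elements
```

A real implementation would also need to map the offset back to DOM ranges in order to scroll and highlight.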

mkay581 · 2019-03-22

I like the idea, but what if the text snippet on the target page changes? Not only would the link on the referring page be broken, but the referring page wouldn’t even know about it in order to update. And if it does update the link, the text snippet can change again. Then there is another broken link.

bokan · 2019-03-22

The fallback behavior would be that the page loads unscrolled, which doesn’t seem any worse than what happens today, when you can’t link to a text snippet at all.

Additionally, the UA could potentially indicate to the user that it couldn’t find the specified text to make it clear the page might be stale.

I’d also point out that the same issue exists with element-id based fragments, yet they work well in practice. Moreover, if a link points to a page whose content has changed so much that it is no longer relevant to the link, that is a problem regardless of whether a fragment is specified at all. An indication like the one mentioned above would actually improve this scenario, since the user could at least tell that the page has changed since the link was created.

jirkakosek · 2019-03-22

It seems that you want to reinvent XPointer – something that never got much interest from browser vendors.

AaronGustafson · 2019-03-22

This is something the IndieWeb community has tackled with fragmentions. I have an explainer I wrote a few years back as well. Let me dust it off and see if it needs updating and then I’ll post it. Given my schedule it’ll be the tail end of next week at the earliest. But yes!

mkay581 · 2019-03-23

Yeah, the same issues exist with URL fragments that are associated with anchor element IDs. And yes, it’s a problem regardless of this proposal, but at least when a developer uses an anchor ID to make their content linkable, they are deliberate in doing so and are aware they are using an identifier that can be linked to, which should probably never change.

This proposal, on the other hand, actually encourages anchor linking to text, which is even more likely to change and result in broken links. Of course, I’m not saying this proposal should fix broken anchor fragment links, but it shouldn’t create an opportunity to make a bad problem even worse, if that makes sense.

feliperias · 2019-03-23

It is a good idea. However, since this seems to overlap with ongoing standardisation efforts, maybe it would be better to engage directly with those?

https://www.w3.org/TR/selectors-states/#TextQuoteSelector_def
https://www.w3.org/TR/annotation-html/#web-annotation-based-citation-urls

ivan · 2019-03-23

This has been (and is still being) discussed in issue 10 of the aforementioned GitHub repo.

However, just to make the situation clear: there is no “ongoing standardization effort” on this. There is a set of standards on Web Annotation which does include a selector model related to annotations. The Web Annotation model does not define a fragment URL approach, and the corresponding Working Group was closed a while ago.

The document you quote, which does include a fragment URL proposal, was published by the Working Group as a Note and has not been touched since its publication in early 2017. I think one possible question is: should there be a standardization effort, possibly along the lines of that Note? (I do not have an answer to this question.)

bokan · 2019-07-23

I think there’s enough interest. I’d like to move the repo into the WICG.

yoavweiss · 2019-07-23

Sounds good. Let’s do this!

yoavweiss · 2019-07-23

And the repo is now live at https://github.com/WICG/ScrollToTextFragment

Happy incubating!!

raphaellouis · 2022-06-09

@yoavweiss @mkay581 @bokan @AaronGustafson @feliperias @ivan Hi all! How are you all? Would it be possible to have an HTTP API for finding a snippet of text specified in a navigation?

raphaellouis · 2022-06-09

1. Context of the question / Opening arguments

1. What is HTTP?

HTTP (HyperText Transfer Protocol) is the underlying protocol of the World Wide Web. Developed by Tim Berners-Lee and his team between 1989 and 1991, HTTP has gone through many changes that have helped maintain its simplicity while shaping its flexibility. It evolved from a protocol designed to exchange files in a semitrusted laboratory environment into a modern internet maze that carries images and videos in high resolution and 3D.

2. What is Web Scraping?

“Web scraping refers to the extraction of data from a website. This information is collected and then exported into a format that is more useful for the user. Be it a spreadsheet or an API.”

2.1. What are common web scraping issues?

Issue 1 - Bots: Websites are free to choose whether or not to allow web scraper bots. Some websites do not allow automated web scraping at all, mainly because these bots often scrape data to gain a competitive advantage and drain the server resources of the site they scrape, adversely affecting its performance.
Issue 2 - Captchas: The main purpose of captchas is to separate humans from bots by displaying logical problems that humans find easy to solve but bots find difficult. Their basic job is to keep spam away. In the presence of a captcha, basic scraping scripts will tend to fail, although with new advancements there are generally measures to get past these captchas in an ethical manner.
Issue 4 - Frequent structural changes: To keep up with advancements in UI/UX and to add more features, websites undergo regular structural changes. Web scrapers are written against the code elements of the webpage as they were at the point of setup, so frequent changes complicate the code and give scrapers a hard time. Not every structural change will affect the web scraper setup, but since any change may result in data loss, it is recommended to keep track of the changes.
Issue 5 - Getting banned: If a web scraper bot sends multiple parallel requests per second or an unnaturally high number of requests, there’s a good chance that it will cross the thin line between ethical and unethical scraping, get flagged, and ultimately be banned. If the web scraper is smart and has sufficient resources, it can carefully handle these kinds of countermeasures, stay on the right side of the law, and still achieve what it wants.
Issue 6 - Real-time data scraping: Real-time data scraping can be of paramount importance to businesses because it supports immediate decision making. From ever-fluctuating stock prices to ever-changing product prices in eCommerce, this can lead to huge capital gains for a business. But deciding what’s important and what’s not in real time is a challenge, and acquiring large data sets in real time is an overhead too. These real-time web scrapers use a REST API to monitor all dynamic data available in the public domain and scrape data in “nearly real time”, but attaining the “holy grail” still remains a challenge.

There is a thin line between data collection and causing damage to the web through careless data scraping. Because web scraping is such an insightful tool and has such an immense effect on businesses, it should be done responsibly. With a little respect we can keep a good thing going.

3. Concept/Example URIs: GET

HTTP GET https://example.com#:~:text=prefix-,startText,endText,-suffix
HTTP GET https://example.com#:~:text=prefix-,startText,endText,-suffix/users?size=20&page=5
HTTP GET https://example.com#:~:text=prefix-,startText,endText,-suffix/users/123
HTTP GET https://example.com#:~:text=prefix-,startText,endText,-suffix/users/123/address

3.1 Sample:

get: https://en.wikipedia.org/w/index.php?title=Cat&oldid=916388819#:~:text=Claws-,Like%20almost,the%20Felidae%2C,-cats
return: ‘Like almost all members of the Felidae’
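The sample above can be approximated with a short Python sketch that splits the `text=prefix-,startText,endText,-suffix` part of the fragment into its four pieces and looks the range up in plain text. Two caveats: the fragment of a URL is handled by the client and is not sent to the server in an HTTP request, so this sketch matches locally against text that has already been fetched; and the matching here is a simplified, case-sensitive substring search, not the matching rules of the text fragment proposal. The names `TextDirective`, `parse_text_directive`, and `extract` are hypothetical, and the sample sentence is an illustrative stand-in for the page text.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit, unquote


@dataclass
class TextDirective:
    start: str
    end: Optional[str] = None
    prefix: Optional[str] = None
    suffix: Optional[str] = None


def parse_text_directive(url: str) -> Optional[TextDirective]:
    """Parse the first ':~:text=' directive in the URL fragment, if any."""
    fragment = urlsplit(url).fragment
    marker = ":~:text="
    if marker not in fragment:
        return None
    raw = fragment.split(marker, 1)[1].split("&", 1)[0]
    parts = raw.split(",")
    prefix = suffix = None
    if parts[0].endswith("-"):            # "prefix-"
        prefix = unquote(parts.pop(0)[:-1])
    if parts and parts[-1].startswith("-"):  # "-suffix"
        suffix = unquote(parts.pop()[1:])
    if not parts or len(parts) > 2 or not parts[0]:
        return None
    end = unquote(parts[1]) if len(parts) == 2 else None
    return TextDirective(unquote(parts[0]), end, prefix, suffix)


def extract(text: str, d: TextDirective) -> Optional[str]:
    """Return the first range from start to end that fits prefix and suffix."""
    pos = 0
    while True:
        begin = text.find(d.start, pos)
        if begin == -1:
            return None
        pos = begin + 1
        if d.prefix is not None and not text[:begin].rstrip().endswith(d.prefix):
            continue
        stop = begin + len(d.start)
        if d.end is not None:
            end_at = text.find(d.end, stop)
            if end_at == -1:
                return None
            stop = end_at + len(d.end)
        if d.suffix is not None and not text[stop:].lstrip().startswith(d.suffix):
            continue
        return text[begin:stop]


url = ("https://en.wikipedia.org/w/index.php?title=Cat&oldid=916388819"
       "#:~:text=Claws-,Like%20almost,the%20Felidae%2C,-cats")
sample = "Claws Like almost all members of the Felidae, cats have retractable claws."
print(extract(sample, parse_text_directive(url)))
```

Note that the end term in this URL is `the%20Felidae%2C`, so the matched range in this sketch includes the trailing comma.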

4. Context

4.1 I would like to find parts of a text over HTTP. An open API for finding part of a text, implemented by many companies and sites across the web, could allow better-targeted data extraction for individuals, companies, analysts, and data scientists.
4.2 I believe this could make it easier to implement open data in a friendly way on most sites.
4.3 I argue this because we often want to scrape one or more pages, and most of these pages present many obstacles, such as captchas, limits on the number of page requests, and elements that change with the DOM (Document Object Model).
4.4 If specific text could be requested within HTTP requests, companies could better control the number of page requests made for data extraction. Currently this is a problem, and it is very difficult to see what is being viewed or extracted from the pages.
4.5 A common problem when scraping a page is that you need a CSS, JS, or HTML rule to find the part of the text you want to fetch.
4.6 The Web Platform Incubator Community Group (WICG) provides a lightweight venue for proposing and discussing new web platform features.
4.7 I would like to find parts of the text within the context of HTTP calls. I believe this would be a viable and interesting feature that would solve many of the problems of writing a web scraping algorithm. Many of the common problems with web scraping come from captchas, bot detection, bans, and so on. A good partial solution is to openly define the number of requests allowed for web scraping, that is, for extracting data from one or more pages within a site.

5. Notes

What I’m proposing is to use the existing proposal for scrolling to a specified text snippet in a navigation in the context of HTTP requests for data extraction.
What I’m proposing could be wrong or good, so I want to know everyone’s opinion.
The web, and HTTP in particular, keeps evolving. I would like to know whether this evolution is, or could be, related to data extraction. Data science and data analysis are very common concerns today, so I would like to know if there is any interest in this regard.
The links I put here are bibliographic references and do not promote any company, product, or service.