Use RobustLinks as the primary means of link rewriting for navigation in archival playback #859

Open
ibnesayeed opened this issue Mar 22, 2025 · 2 comments

Comments

@ibnesayeed
Member

When we play back a memento of https://example.com/ captured at 20200323134509 using a URL like the following:

http://localhost:2016/memento/20200323134509/https://example.com/

The original page contains the following hyperlink:

<a href="https://www.iana.org/domains/example">More information...</a>

This is rewritten as follows, so that users navigate to the archived version of the linked document from around the time the current page was archived:

<a href="http://localhost:2016/memento/20200323134509/https://www.iana.org/domains/example">More information...</a>

If we do not pay attention to the template of playback URLs, we lose the original URL and the desired datetime around which the linked page should have been archived (in some web archives, such as archive.today and perma.cc, these attributes are opaque).
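To illustrate what "paying attention to the template" involves, here is a minimal sketch that recovers the original URL and datetime from a playback URL. It assumes the `/memento/<14-digit datetime>/<original URL>` template seen in the examples above; the names `MEMENTO_RE`, `MementoRef`, and `parse_playback_url` are hypothetical and not part of any existing codebase. URLs that do not follow the template (opaque ones, for instance) simply yield `None`.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Assumed template: <scheme>://<host>/memento/<YYYYMMDDhhmmss>/<original URL>
MEMENTO_RE = re.compile(r"^https?://[^/]+/memento/(?P<ts>\d{14})/(?P<urir>.+)$")


class MementoRef(NamedTuple):
    original_url: str
    version_date: datetime


def parse_playback_url(playback_url: str) -> Optional[MementoRef]:
    """Return the original URL and capture datetime, or None if opaque."""
    match = MEMENTO_RE.match(playback_url)
    if match is None:
        return None
    try:
        # Treat the 14-digit datetime as UTC (an assumption of this sketch).
        when = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    except ValueError:
        # Fourteen digits that do not form a valid datetime.
        return None
    return MementoRef(match.group("urir"), when.replace(tzinfo=timezone.utc))
```

A tool that only knows this one template cannot parse playback URLs from archives that use a different or opaque scheme, which is the gap the explicit data-* attributes below are meant to close.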

We can address this problem by incorporating RobustLinks in the playback. We can rewrite these anchor elements in one of the following ways:

Variant 1 -- Leave href unchanged:

<a href="https://www.iana.org/domains/example"
  data-originalurl="https://www.iana.org/domains/example"
  data-versiondate="20200323134509"
  data-versionurl="http://localhost:2016/memento/20200323134509/https://www.iana.org/domains/example">More information...</a>

Pros:

  • Link extraction/analysis tools do not need to undo any rewriting in the href attribute.

Cons:

  • Requires the JavaScript client of the RobustLinks to navigate to the archived version.
  • Absolute paths in href may lead to undesired locations (although the Reconstructive service worker already handles this during navigation).

Variant 2 -- Rewrite href to a potential memento URL:

<a href="http://localhost:2016/memento/20200323134509/https://www.iana.org/domains/example"
  data-originalurl="https://www.iana.org/domains/example"
  data-versiondate="20200323134509"
  data-versionurl="http://localhost:2016/memento/20200323134509/https://www.iana.org/domains/example">More information...</a>

Pros:

  • JavaScript client of the RobustLinks is not required to navigate to the archived version, but including it would enrich the experience.

Cons:

  • Link extraction/analysis tools need to undo any rewriting in the href attribute.
  • Requires the JavaScript client of the RobustLinks to navigate to the archived version.

Variant 3 -- Include additional alternate playback URLs:

<a href="http://localhost:2016/memento/20200323134509/https://www.iana.org/domains/example"
  data-originalurl="https://www.iana.org/domains/example"
  data-versiondate="20200323134509"
  data-versionurl="http://localhost:2016/memento/20200323134509/https://www.iana.org/domains/example
                   https://web.archive.org/web/20200323134509/https://www.iana.org/domains/example
                   https://memgator.cs.odu.edu/memento/20200323134509/https://www.iana.org/domains/example">More information...</a>

Pros:

  • Provides several navigation options when the JavaScript client of the RobustLinks is included (useful because smaller archives may not have comprehensive captures of all the outlinks).

Cons:

  • Markup and UI get cluttered.
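The three variants differ only in whether href is rewritten and in how many playback URLs go into data-versionurl, so they can be sketched with one helper. This is an illustrative sketch, not the project's implementation: `robust_anchor` and its parameters are hypothetical names, the playback prefixes are the ones used in the examples above, and the 14-digit datetime is copied into data-versiondate verbatim, as in the examples (whether the RobustLinks client accepts that format is not verified here).

```python
from html import escape
from typing import Sequence


def robust_anchor(original_url: str, text: str, timestamp: str,
                  prefixes: Sequence[str], rewrite_href: bool = True) -> str:
    """Build an <a> element carrying RobustLinks-style data-* attributes.

    prefixes     -- playback URL prefixes, e.g. "http://localhost:2016/memento/";
                    the first one is used for href when rewrite_href is True.
    rewrite_href -- False gives Variant 1, True gives Variants 2 and 3.
    """
    if not prefixes:
        raise ValueError("at least one playback prefix is required")
    version_urls = [f"{prefix}{timestamp}/{original_url}" for prefix in prefixes]
    attrs = {
        "href": version_urls[0] if rewrite_href else original_url,
        "data-originalurl": original_url,
        "data-versiondate": timestamp,
        # Multiple playback URLs are space-separated, as in Variant 3.
        "data-versionurl": " ".join(version_urls),
    }
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<a {rendered}>{escape(text)}</a>"
```

Calling it with a single prefix and `rewrite_href=False` gives the Variant 1 markup, a single prefix with the default gives Variant 2, and a list of several prefixes gives Variant 3.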

With this RobustLinks adoption, we expect each link on the page to provide multiple navigation options:

RobustLinks from 2020-03-23 13:45:09
Original Link
InterPlanetary Wayback Memento
Wayback Machine Capture
MemGator Aggregator Lookup

Notes:

  • If the href attribute is a relative or absolute path, it needs to be converted to an absolute URL for the data-originalurl attribute.
  • Avoid overwriting RobustLinks' data-* attributes if they are already present in the original page.
  • It is unclear how the RobustLinks client would behave if it were included twice, once in the original page and again by the web archive (the client needs to be smart about these cases).
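
The first two notes can be sketched as follows. This is a rough illustration under stated assumptions: `annotate_attrs` is a hypothetical helper that works on a plain dict of anchor attributes (how the attributes are extracted from the HTML is left out), it leaves href untouched, and it only adds the data-* attributes when none of them is already present.

```python
from urllib.parse import urljoin, urlsplit

ROBUST_ATTRS = ("data-originalurl", "data-versiondate", "data-versionurl")


def annotate_attrs(attrs: dict, base_url: str, timestamp: str, prefix: str) -> dict:
    """Return a copy of attrs with RobustLinks data-* attributes added.

    base_url is the original URL of the page being played back; relative
    and absolute-path hrefs are resolved against it.
    """
    result = dict(attrs)
    href = attrs.get("href")
    # Respect RobustLinks attributes that the original page already carries.
    if href is None or any(name in attrs for name in ROBUST_ATTRS):
        return result
    original_url = urljoin(base_url, href)
    # Skip links that are not web resources (mailto:, javascript:, ...).
    if urlsplit(original_url).scheme not in ("http", "https"):
        return result
    result["data-originalurl"] = original_url
    result["data-versiondate"] = timestamp
    result["data-versionurl"] = f"{prefix}{timestamp}/{original_url}"
    return result
```

The third note, about the client being loaded twice, is a client-side concern and is not addressed by this sketch.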
@machawk1
Member

This is an interesting proposition that surfaces data that would typically be in a subsequent response into the base representation. That said, how would the information (e.g., data-versiondate) be obtained without the overhead of requesting all of the links or consulting some secondary index? This seems like it would incur significant HTTP overhead for every base memento (i.e., HTML page) a user visits.

Another consideration is that modifying the memento for usability seems antithetical to Reconstructive's "reroute, don't rewrite".
I am typically all for enriching the capture to make it more useful, but I wonder about the side effects of this proposal.

Looking to discuss further. Thanks for the writeup, @ibnesayeed!

@ishank-dev

I came across this issue while exploring ipywb, and I would love to contribute here.

Thanks, @ibnesayeed, for highlighting this issue!
And @machawk1 for bringing up the HTTP overhead.

Please keep me in the loop on which of the three variants makes the most sense and is easiest to build a proof of concept for. That is where I can help by writing code and sharing test reports and HTTP overhead benchmarks, which can help us assess whether to go ahead with a given variant.

Thanks!
