@biplavbarua

Description

The html2text parser currently ignores the HTML <base> tag. This PR adds logic to detect the <base> tag and update the parser's base URL accordingly, ensuring that relative links are resolved correctly.
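A minimal sketch of the idea, not the code in this PR: the class name `LinkCollector`, the helper `collect_links`, and the `baseurl` attribute are hypothetical stand-ins built on the standard library's `HTMLParser`. It assumes the base URL is updated whenever a <base> tag with an href is seen, and that later links are resolved with `urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for the real parser: records resolved <a href> targets."""

    def __init__(self, baseurl=""):
        super().__init__()
        self.baseurl = baseurl  # assumed attribute name, for illustration only
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base":
            # Resolve the <base> href against the current base URL so that
            # a relative <base href> still yields a usable base.
            self.baseurl = urljoin(self.baseurl, href)
        elif tag == "a":
            # Links seen after the <base> tag use the updated base URL.
            self.links.append(urljoin(self.baseurl, href))


def collect_links(html, baseurl=""):
    """Parse the markup in one pass and return the resolved link targets."""
    parser = LinkCollector(baseurl)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, calling `collect_links` on a page whose head contains `<base href="https://example.com/docs/">` resolves a later `guide.html` link against that URL rather than against the URL passed in by the caller.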

Related Issue

Fixes #1680

Verification

  • Added a local unit test tests/test_base_tag_local.py which passes.
  • Confirmed that links are resolved against the href specified in the <base> tag.

Author

@biplavbarua left a comment

LGTM! This fixes the critical issue of broken relative links when a <base> tag is present.
Minor implementation note: since html2text is a stream parser, the update applies only to tags that come after the <base> tag. The HTML spec says the first <base> tag in <head> controls the whole document (including elements before it), but honoring that strictly requires a two-pass approach or a full DOM tree.
Given the constraints of HTMLParser, this is the correct pragmatic solution.
Verified that the urljoin logic correctly handles the accumulation/replacement of the base path.
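The stream-order behavior described above can be illustrated with a small sketch. The function `resolve_in_stream_order`, the event tuples, and the example URLs are hypothetical and only mimic the order in which a single-pass parser would see tags; they are not part of html2text.

```python
from urllib.parse import urljoin

# Hypothetical document URL, used only for illustration.
DOCUMENT_URL = "https://example.com/site/index.html"

# An absolute <base href> replaces the base outright ...
absolute_base = urljoin(DOCUMENT_URL, "https://cdn.example.org/assets/")
# ... while a relative one is resolved against the current base.
relative_base = urljoin(DOCUMENT_URL, "docs/")


def resolve_in_stream_order(document_url, events):
    """Resolve links in a single pass.

    `events` is a list of (tag, href) pairs in document order, where tag is
    either "base" or "a". Links seen before a <base> tag keep the old base,
    which is the limitation of a one-pass parser noted in the review.
    """
    base = document_url
    resolved = []
    for tag, href in events:
        if tag == "base":
            base = urljoin(base, href)
        else:
            resolved.append(urljoin(base, href))
    return resolved
```

With the events `[("a", "early.html"), ("base", "docs/"), ("a", "late.html")]`, only the second link is resolved against the `docs/` base; a spec-strict implementation would need to know the base before emitting the first link.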

@biplavbarua
Author

Re-verified local build against latest master. Fix is stable.

Development

Successfully merging this pull request may close these issues.

[Bug]: html2text ignores <base> tag when resolving relative links

1 participant