Skip to content

Allow to, optionally, keep Unicode escape sequences in stringToPDFString (PR 17331 follow-up) #19884

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 1, 2025

Conversation

Snuffleupagus
Copy link
Collaborator

Currently some of the links[1] on page three of the issue19835.pdf test-case aren't clickable, since the destination (of the LinkAnnotation) becomes empty.
The reason is that these destinations include the character \x1b, which is interpreted as the start of a Unicode escape sequence specifying the language of the string; please refer to section 7.9.2.2 Text String Type in the PDF specification.

Hence it seems that we need a way to optionally disable that behaviour, to avoid a "badly" formatted string from becoming empty (or truncated), at least for cases where we are:

  • Parsing named destinations[2] and URLs.
  • Handling "strings" that are actually /Name-instances.
  • Building a lookup Object/Map based on some PDF data-structure.

NOTE: The issue that prompted this patch is obviously related to destinations, however I've gone through the src/core/ folder and updated various other stringToPDFString call-sites that (directly or indirectly) fit the categories listed above.


[1] Try clicking on anything on the line containing "Item 7A. Quantitative and Qualitative Disclosures About Market Risk 27".

[2] Unfortunately just skipping stringToPDFString in this case would cause other issues, such as the named destination becoming "unusable" in the viewer; see e.g. issues #14847 and #14864.

Whenever we cannot find a destination we'll fallback to checking all destinations, to account for e.g. out-of-order NameTrees, and in those cases any subsequent destination-lookups can be made a tiny bit more efficient by immediately checking the already cached destinations.
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/c3901ba98dd851b/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/835a3cbad009c87/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/c3901ba98dd851b/output.txt

Total script time: 29.68 mins

  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: Passed

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/835a3cbad009c87/output.txt

Total script time: 60.27 mins

  • Unit tests: FAILED
  • Integration Tests: FAILED
  • Regression tests: Passed

…ring` (PR 17331 follow-up)

Currently *some* of the links[1] on page three of the `issue19835.pdf` test-case aren't clickable, since the destination (of the LinkAnnotation) becomes empty.
The reason is that these destinations include the character `\x1b`, which is interpreted as the start of a Unicode escape sequence specifying the language of the string; please refer to section [7.9.2.2 Text String Type](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G6.1957385) in the PDF specification.

Hence it seems that we need a way to optionally disable that behaviour, to avoid a "badly" formatted string from becoming empty (or truncated), at least for cases where we are:
 - Parsing named destinations[2] and URLs.
 - Handling "strings" that are actually /Name-instances.
 - Building a lookup Object/Map based on some PDF data-structure.

*NOTE:* The issue that prompted this patch is obviously related to destinations, however I've gone through the `src/core/` folder and updated various other `stringToPDFString` call-sites that (directly or indirectly) fit the categories listed above.

---
[1] Try clicking on anything on the line containing "Item 7A. Quantitative and Qualitative Disclosures About Market Risk 27".

[2] Unfortunately just skipping `stringToPDFString` in this case would cause other issues, such as the named destination becoming "unusable" in the viewer; see e.g. issues 14847 and 14864.
@Snuffleupagus
Copy link
Collaborator Author

/botio unittest

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/468d84bcae5e873/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/17c380f3774666e/output.txt

@Snuffleupagus Snuffleupagus marked this pull request as ready for review April 30, 2025 18:53
@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/468d84bcae5e873/output.txt

Total script time: 2.41 mins

  • Unit Tests: Passed

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/17c380f3774666e/output.txt

Total script time: 8.19 mins

  • Unit Tests: Passed

Copy link
Contributor

@calixteman calixteman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you.

@Snuffleupagus Snuffleupagus merged commit fc92e40 into mozilla:master May 1, 2025
9 checks passed
@Snuffleupagus Snuffleupagus deleted the issue-19835-2 branch May 1, 2025 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants