Skip to content

Conversation

@Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Jan 22, 2026

This was the cleanest place to make the edit, but I know you don't like to edit these classes if you don't have to. I'm open to alternatives but the token data itself needs to be sanitized in order for this to work when updating usfm. Also, added tests.

Fixes sillsdev/machine.py#250. This will need to be ported.


This change is Reviewable

@Enkidu93 Enkidu93 requested a review from ddaspit January 22, 2026 18:06
@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.61%. Comparing base (58392e4) to head (30d6cb7).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #378   +/-   ##
=======================================
  Coverage   72.61%   72.61%           
=======================================
  Files         423      423           
  Lines       35992    36000    +8     
  Branches     4965     4966    +1     
=======================================
+ Hits        26134    26142    +8     
  Misses       8760     8760           
  Partials     1098     1098           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit reviewed 3 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93).


src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

                                    null,
                                    null,
                                    SanitizeVerseData(GetNextWord(usfm, ref index, preserveWhitespace))

I don't know if this is the right place to sanitize the verse data. It might be better to do this in UsfmParser when we create the verse ref.


src/SIL.Machine/Corpora/UsfmTokenizer.cs line 568 at r1 (raw file):

        private static string SanitizeVerseData(string verseData)
        {
            return verseData.Replace("", "");

I assume that there is an RTL mark character in the string. It would be helpful to specify the character as a Unicode character literal, i.e. \u200f.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit made 1 comment.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93).


src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't know if this is the right place to sanitize the verse data. It might be better to do this in UsfmParser when we create the verse ref.

Generally, the tokenizer should try to preserve the original USFM as much as possible.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 1 comment.
Reviewable status: 1 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit).


src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Generally, the tokenizer should try to preserve the original USFM as much as possible.

That makes sense. The problem is that I need to edit the token data itself. I considered, like you said, sanitizing it in the UsfmParser but the UsfmParser doesn't currently do anything like that (the state token etc are all read-only). I could create a new method or change an access modifier to enable updating the token data . This could also be used, for example, to not carry through non-Latin verse numbers when updating.

The other option is to leave it as-is on the parsing side and instead sanitize it when we do the updating. I've pushed a commit with this option for reference. I could see a similar case being made that even the parser should leave the token untouched. Whatever you think is best!

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 1 comment.
Reviewable status: 1 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit).


src/SIL.Machine/Corpora/UsfmTokenizer.cs line 568 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I assume that there is an RTL mark character in the string. It would be helpful to specify the character as a Unicode character literal, i.e. \u200f.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Right-to-left markers cause verse ranges to be reversed

4 participants