Reduce reliance on user supplied target metadata during transcript selection #65

bencap · 2025-12-10T19:29:07Z

This pull request refactors the transcript selection logic to improve per-gene transcript resolution and protein sequence matching. The main changes are: queries now return both the transcript accession and the HGNC symbol, the selection logic chooses the best transcript per gene, and candidate protein sequences are ranked by a similarity score. New helper functions and unit tests were added to verify transcript similarity scoring and selection.

Transcript selection and query improvements:

The get_transcripts function now returns both transcript accession and HGNC symbol, allowing downstream logic to group transcripts by gene and select the best candidate per gene. (src/dcd_mapping/lookup.py)
The _get_compatible_transcripts function was refactored to return a set of (transcript accession, HGNC symbol) tuples, enabling per-gene transcript grouping and intersection logic. (src/dcd_mapping/transcripts.py)
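As a rough illustration of why returning (transcript accession, HGNC symbol) tuples helps, the sketch below intersects two sets of tuples and groups the result by gene. The helper name `group_by_gene`, the accessions, and the gene symbols are hypothetical placeholders; this is not the code in src/dcd_mapping/transcripts.py.

```python
from collections import defaultdict


def group_by_gene(transcripts):
    """Group (accession, HGNC symbol) tuples into {symbol: {accessions}}."""
    grouped = defaultdict(set)
    for accession, symbol in transcripts:
        grouped[symbol].add(accession)
    return dict(grouped)


# Hypothetical query results for two aligned regions (placeholder values).
region_a = {
    ("NM_000001.1", "GENE1"),
    ("NM_000002.1", "GENE1"),
    ("NM_000003.1", "GENE2"),
}
region_b = {
    ("NM_000001.1", "GENE1"),
    ("NM_000003.1", "GENE2"),
}

# Because each element is a hashable tuple, plain set intersection keeps only
# transcripts compatible with every region while preserving the gene symbol.
compatible = region_a & region_b
by_gene = group_by_gene(compatible)
```

Keeping the symbol alongside the accession means the gene grouping needs no second lookup after the intersection.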

Protein sequence similarity and transcript selection:

Added _percent_similarity and _choose_most_similar_transcript helpers to robustly score and select the transcript whose protein sequence is most similar to the query, using substring matching and SequenceMatcher. (src/dcd_mapping/transcripts.py)
The _select_protein_reference function now performs per-gene best transcript selection (MANE priority, fallback to longest), then chooses the globally best match by protein sequence similarity, improving accuracy and stability. (src/dcd_mapping/transcripts.py)

Testing and validation:

Added unit tests for similarity scoring, transcript selection logic, and end-to-end per-gene best transcript selection followed by global similarity matching. (tests/test_transcript.py)

…script reduction

…ction

Prior to this change, we relied on the user supplying an appropriate HGNC symbol for their target as the target name. This is no longer required. Instead, transcript selection uses the following algorithm:

1. Align the target sequence with BLAT.
2. Fetch transcripts which overlap the aligned region (notably, without an HGNC symbol filter).
3. Perform transcript selection within each distinct gene. This leaves us with either (a) one transcript when the region contains no overlapping genes, or (b) one transcript per gene when multiple genes overlap the aligned region. These are our candidate transcripts.
4. If more than one candidate transcript remains, compare each candidate's similarity to the provided target sequence and select the most similar transcript.
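Steps 3 and 4 of this algorithm can be sketched as follows, assuming the alignment and transcript fetch have already produced a list of candidates. The `Candidate` dataclass, its field names, and the plain `SequenceMatcher` ratio used for the final comparison are illustrative assumptions, not the project's actual data model; the MANE-then-longest rule follows the PR description.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record; field names are not taken from the project.
    accession: str
    hgnc_symbol: str
    protein: str
    is_mane: bool = False


def best_per_gene(candidates):
    """Step 3: keep one transcript per gene (MANE first, then longest)."""
    by_gene = {}
    for cand in candidates:
        by_gene.setdefault(cand.hgnc_symbol, []).append(cand)
    return [
        max(group, key=lambda c: (c.is_mane, len(c.protein)))
        for group in by_gene.values()
    ]


def select_reference(target, candidates):
    """Step 4: among the per-gene winners, pick the one most similar to target."""
    finalists = best_per_gene(candidates)
    if not finalists:
        return None
    if len(finalists) == 1:
        return finalists[0]
    return max(
        finalists,
        key=lambda c: SequenceMatcher(None, target, c.protein, autojunk=False).ratio(),
    )


# Placeholder data: two overlapping genes, one with two transcripts.
candidates = [
    Candidate("NP_A", "GENE1", "MKTAYIAKQRQISFVK"),
    Candidate("NP_B", "GENE1", "MKTAYIAKQR", is_mane=True),
    Candidate("NP_C", "GENE2", "MSSSSSSSSS"),
]
chosen = select_reference("MKTAYIAK", candidates)
```

In this example `NP_B` wins within GENE1 because it is the MANE transcript even though `NP_A` is longer, and it then beats `NP_C` on sequence similarity to the target.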

refactor: change transcript parameter type to set for simplified tran…

e421e82

…script reduction

bencap requested a review from sallybg December 10, 2025 19:29

bencap linked an issue Dec 10, 2025 that may be closed by this pull request

Reduce reliance on user supplied target metadata when performing transcript selection #63

Open

bencap force-pushed the feature/bencap/63/reduce-reliance-on-user-supplied-target-metadata branch from bc59bbc to b6bbc59 Compare December 11, 2025 16:58
