@bencap bencap commented Dec 10, 2025

This pull request refactors the transcript selection logic to improve per-gene transcript resolution and protein sequence matching. The main changes are returning both the transcript accession and HGNC symbol from queries, updating the selection logic to choose the best transcript per gene, and introducing similarity scoring for protein sequences. New helper functions and unit tests were also added to verify transcript similarity scoring and selection.

Transcript selection and query improvements:

  • The get_transcripts function now returns both transcript accession and HGNC symbol, allowing downstream logic to group transcripts by gene and select the best candidate per gene. (src/dcd_mapping/lookup.py)
  • The _get_compatible_transcripts function was refactored to return a set of (transcript accession, HGNC symbol) tuples, enabling per-gene transcript grouping and intersection logic. (src/dcd_mapping/transcripts.py)
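
As a rough illustration of why returning (transcript accession, HGNC symbol) tuples helps, the sketch below intersects two result sets and groups what remains by gene. The helper name `group_by_gene`, the accessions, and the gene symbols are all hypothetical placeholders, and this is not the code in src/dcd_mapping/transcripts.py.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (transcript accession, HGNC symbol), mirroring the tuple shape described above
TranscriptPair = Tuple[str, str]


def group_by_gene(pairs: Iterable[TranscriptPair]) -> Dict[str, Set[str]]:
    """Group transcript accessions under their HGNC symbol (hypothetical helper)."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for accession, symbol in pairs:
        grouped[symbol].add(accession)
    return dict(grouped)


# Placeholder result sets, e.g. transcripts compatible with two aligned ranges.
range_a = {("TX1", "GENE_A"), ("TX2", "GENE_A"), ("TX3", "GENE_B")}
range_b = {("TX1", "GENE_A"), ("TX3", "GENE_B")}

# Because the elements are hashable tuples, plain set intersection works.
common = range_a & range_b
grouped = group_by_gene(common)
```

Keeping the symbol alongside the accession means the grouping needs no second lookup once the sets have been intersected.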

Protein sequence similarity and transcript selection:

  • Added _percent_similarity and _choose_most_similar_transcript helpers to robustly score and select the transcript whose protein sequence is most similar to the query, using substring matching and SequenceMatcher. (src/dcd_mapping/transcripts.py)
  • The _select_protein_reference function now performs per-gene best transcript selection (MANE priority, fallback to longest), then chooses the globally best match by protein sequence similarity, improving accuracy and stability. (src/dcd_mapping/transcripts.py)
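
A minimal sketch of how substring matching and `difflib.SequenceMatcher` could be combined into a similarity score is shown below. The function names, the 0 to 100 scale, and the choice to treat containment as a perfect score are assumptions made for illustration; the actual `_percent_similarity` and `_choose_most_similar_transcript` helpers may behave differently.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def percent_similarity(query: str, candidate: str) -> float:
    """Score two protein sequences on a 0-100 scale (illustrative only)."""
    if not query or not candidate:
        return 0.0
    # Assumption: if one sequence is fully contained in the other,
    # treat the pair as a perfect match.
    if query in candidate or candidate in query:
        return 100.0
    # autojunk=False turns off difflib's popularity heuristic, which can
    # skew ratios for long sequences over a small alphabet.
    matcher = SequenceMatcher(None, query, candidate, autojunk=False)
    return 100.0 * matcher.ratio()


def choose_most_similar(query: str, candidates: Dict[str, str]) -> Optional[str]:
    """Return the accession with the highest score, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda acc: percent_similarity(query, candidates[acc]))


# Placeholder accessions and sequences.
best = choose_most_similar("MKTAYIAK", {"TX1": "MKTAYIAK", "TX2": "QQQQQQQQ"})
```

The substring shortcut is cheap and covers the common case where the target is a fragment of the full protein; the `SequenceMatcher` ratio handles everything else.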

Testing and validation:

  • Added unit tests for similarity scoring, transcript selection logic, and end-to-end per-gene best transcript selection followed by global similarity matching. (tests/test_transcript.py)

…ction

Prior to this change, we relied on the user supplying an appropriate HGNC symbol as their target name. This is no longer required. Instead, transcript selection uses the following algorithm:

1. Align the target sequence with BLAT.
2. Fetch transcripts which overlap the aligned region (notably, without an HGNC symbol filter).
3. Perform transcript selection within each distinct gene. This leaves us with either (a) one transcript, when the aligned region contains no overlapping genes, or (b) one transcript per gene, when multiple genes overlap the aligned region. These are our candidate transcripts.
4. If we still have more than one candidate transcript, compare the similarity of each candidate to the provided target sequence and select the most similar transcript.
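
Steps 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `Candidate` type, both function names, and the sample data are hypothetical, "longest" is assumed to mean the longest protein sequence, and a plain `SequenceMatcher` ratio stands in for the similarity score.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one transcript overlapping the aligned region."""
    accession: str
    gene: str
    protein: str
    is_mane: bool = False


def best_per_gene(candidates: List[Candidate]) -> Dict[str, Candidate]:
    """Step 3: keep one transcript per gene, MANE first, then longest protein."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.gene)
        rank = (cand.is_mane, len(cand.protein))
        if current is None or rank > (current.is_mane, len(current.protein)):
            best[cand.gene] = cand
    return best


def select_reference(target: str, candidates: List[Candidate]) -> Optional[Candidate]:
    """Step 4: among the per-gene winners, pick the one most similar to the target."""
    per_gene = list(best_per_gene(candidates).values())
    if not per_gene:
        return None
    if len(per_gene) == 1:
        return per_gene[0]
    return max(
        per_gene,
        key=lambda c: SequenceMatcher(None, target, c.protein, autojunk=False).ratio(),
    )


# Placeholder data: two genes overlapping the same aligned region.
pool = [
    Candidate("TX1", "GENE_A", "MKTAYIAKQR"),
    Candidate("TX2", "GENE_A", "MKTAYI", is_mane=True),
    Candidate("TX3", "GENE_B", "GGGGSSSSPP"),
    Candidate("TX4", "GENE_B", "GGGG"),
]
winners = best_per_gene(pool)
chosen = select_reference("MKTAYI", pool)
```

In this example the MANE transcript wins within GENE_A even though it is shorter, GENE_B falls back to its longest transcript, and the similarity comparison then picks between the two per-gene winners.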
@bencap bencap force-pushed the feature/bencap/63/reduce-reliance-on-user-supplied-target-metadata branch from bc59bbc to b6bbc59 Compare December 11, 2025 16:58
Development

Successfully merging this pull request may close these issues.

Reduce reliance on user supplied target metadata when performing transcript selection
