Reduce reliance on user supplied target metadata during transcript selection #65
This pull request refactors the transcript selection logic to improve per-gene transcript resolution and protein sequence matching. The main changes are: queries now return both the transcript accession and the HGNC symbol, selection logic chooses the best transcript per gene, and protein sequences are compared using a similarity score. New helper functions and unit tests were also added to verify transcript similarity scoring and selection.
Transcript selection and query improvements:

* The `get_transcripts` function now returns both transcript accession and HGNC symbol, allowing downstream logic to group transcripts by gene and select the best candidate per gene. (src/dcd_mapping/lookup.py)
* The `_get_compatible_transcripts` function was refactored to return a set of `(transcript accession, HGNC symbol)` tuples, enabling per-gene transcript grouping and intersection logic. (src/dcd_mapping/transcripts.py)

Protein sequence similarity and transcript selection:

* Added `_percent_similarity` and `_choose_most_similar_transcript` helpers to score and select the transcript whose protein sequence is most similar to the query, using substring matching and `SequenceMatcher`. (src/dcd_mapping/transcripts.py)
* The `_select_protein_reference` function now performs per-gene best transcript selection (MANE priority, fallback to longest), then chooses the globally best match by protein sequence similarity, improving accuracy and stability. (src/dcd_mapping/transcripts.py)

Testing and validation:

* Added unit tests for transcript similarity and selection. (tests/test_transcript.py)
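
For reviewers who want a quick mental model of the selection flow (best transcript per gene, then the globally best match by protein similarity), here is a minimal sketch. It is not the code in this PR: the `Candidate` record, its field names, the function names, and the rule that a query contained in the reference scores 100 are all assumptions made for illustration. Only the use of `difflib.SequenceMatcher`, MANE priority, and the longest-sequence fallback come from the description above.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical record for one transcript; not the PR's actual type."""
    accession: str
    hgnc_symbol: str
    protein_sequence: str
    is_mane: bool = False


def percent_similarity(query: str, reference: str) -> float:
    """Return a 0-100 score; substring containment counts as a full match (assumption)."""
    if not query or not reference:
        return 0.0
    if query in reference:
        return 100.0
    # autojunk=False disables the heuristic that ignores frequent characters,
    # which would otherwise distort scores for long, low-complexity sequences.
    matcher = SequenceMatcher(None, query, reference, autojunk=False)
    return 100.0 * matcher.ratio()


def best_per_gene(candidates: Iterable[Candidate]) -> Dict[str, Candidate]:
    """Keep one candidate per HGNC symbol: MANE first, then longest protein."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.hgnc_symbol)
        rank = (cand.is_mane, len(cand.protein_sequence))
        if current is None or rank > (current.is_mane, len(current.protein_sequence)):
            best[cand.hgnc_symbol] = cand
    return best


def choose_most_similar(query: str, candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Pick the per-gene winner whose protein sequence best matches the query."""
    per_gene = best_per_gene(candidates)
    if not per_gene:
        return None
    return max(
        per_gene.values(),
        key=lambda cand: percent_similarity(query, cand.protein_sequence),
    )


# Made-up accessions and sequences, for illustration only.
example_candidates = [
    Candidate("TX_A1", "GENE1", "MKTAYIAKQR"),
    Candidate("TX_A2", "GENE1", "MKTAYIAKQRQISFVK"),
    Candidate("TX_A3", "GENE1", "MKT", is_mane=True),
    Candidate("TX_B1", "GENE2", "MKTAYIAKQRQISFVKSHFSRQ"),
]
chosen = choose_most_similar("AYIAKQRQIS", example_candidates)
print(chosen.accession if chosen else "no candidate")
```

Ranking on the tuple `(is_mane, length)` keeps the MANE preference strictly ahead of length, so a short MANE transcript still wins within its gene; the similarity score is applied only across genes.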