Output gene name for mapped targets #67
Open
This pull request introduces a new mechanism for inferring gene symbol information and attaching it, with provenance, to target annotations in the mapping pipeline. It adds the `compute_target_gene_info` function, which determines a single gene symbol per target using a prioritized approach, and integrates this logic into the mapping API. It also restructures the `reference_sequences` data structure to use a new `TargetAnnotation` class and makes several supporting improvements and bug fixes.

Gene symbol inference and annotation improvements:

- `compute_target_gene_info` async function in `annotate.py`, which determines a single gene symbol per target using a prioritized strategy (selected transcript, alignment overlap, variant spans, or fallback to metadata) and returns a `GeneInfo` object with provenance. Supporting helper functions for overlap-based inference and interval merging were also added.
- `map_scoreset` API route: for each target, the computed gene info is attached to its `TargetAnnotation` in the response. [1] [2]

Data structure and schema changes:

- `reference_sequences` structure in `map_scoreset` now uses `TargetAnnotation` objects, which include a `layers` attribute and a new `gene_info` field. All code paths were adjusted to reference the new structure. [1] [2] [3]
- `TargetAnnotation` and `GeneInfo` imports added to relevant modules, ensuring the new schema types are properly used throughout. [1] [2]

Supporting infrastructure and environment:

- `ENSEMBL_API_URL` added to the `.env.dev` settings file to support Ensembl API queries for gene overlap.
- `request_with_backoff` and `ENSEMBL_API_URL` in `lookup.py`, enabling robust gene feature queries.
- `Any` import in `lookup.py` for type hinting.

Bug fixes and code improvements:

- Explicit `None` checks (`is not None` instead of truthiness) in multiple locations (`annotate.py`). [1] [2] [3]

These changes collectively improve the accuracy and transparency of gene symbol assignment in the API and lay the groundwork for robust, provenance-aware gene annotation in downstream analyses.
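For reviewers, the prioritized strategy and the interval merging described above can be pictured with a minimal sketch. This is not the code from this PR: `GeneInfoSketch`, `pick_gene`, `merge_intervals`, and the provenance labels are hypothetical names, the real `compute_target_gene_info` is async and queries Ensembl, and intervals are assumed here to be half-open `(start, end)` tuples.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GeneInfoSketch:
    """Hypothetical stand-in for GeneInfo: a symbol plus its provenance."""
    symbol: str
    source: str


def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping or touching (start, end) intervals (assumed half-open)."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def pick_gene(
    strategies: list[tuple[str, Callable[[], Optional[str]]]],
) -> Optional[GeneInfoSketch]:
    """Try each (source, lookup) pair in priority order; the first hit wins."""
    for source, lookup in strategies:
        symbol = lookup()
        # Explicit None check, so only a genuinely missing result falls through.
        if symbol is not None:
            return GeneInfoSketch(symbol=symbol, source=source)
    return None


# Illustrative priority order mirroring the description; lookups are stubs.
info = pick_gene([
    ("selected_transcript", lambda: None),
    ("alignment_overlap", lambda: "BRCA1"),
    ("variant_spans", lambda: "TP53"),
    ("metadata", lambda: "UNKNOWN"),
])
print(info)  # GeneInfoSketch(symbol='BRCA1', source='alignment_overlap')
print(merge_intervals([(5, 10), (1, 3), (8, 12), (3, 4)]))  # [(1, 4), (5, 12)]
```

Recording the source next to the symbol is what lets a consumer tell a transcript-derived symbol from a metadata fallback. Merging spans first would presumably keep the number of overlap queries small, though that motivation is an assumption.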