Skip to content

Conversation

@yruslan
Copy link
Collaborator

@yruslan yruslan commented Nov 26, 2025

Closes #807

Summary by CodeRabbit

  • New Features

    • Added public indicators to detect copybook structure (flat vs hierarchical).
  • Performance & Refactoring

    • Improved repeated field-access performance via internal caching.
    • Simplified and centralized field-lookup logic for clearer, more maintainable behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 26, 2025

Walkthrough

Adds two concurrent caches for primitives and statements to Copybook, exposes structure flags (isFlatCopybook, isHierarchical), centralizes primitive lookup via a new private helper, and updates field-get/set paths to use caching and simplified pattern matches. (47 words)

Changes

Cohort / File(s) Summary
Copybook concurrency & lookup caching
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala
Adds cachePrimitives: ConcurrentHashMap[String, Primitive] and cacheStatements: ConcurrentHashMap[String, Statement]. Introduces val isFlatCopybook and lazy val isHierarchical. Adds private getPrimitiveFieldByName(fieldName: String): Primitive. Updates getFieldValueByName and setFieldValueByName to use the helper and caches. Refactors several internal lookup helpers to use flatMap closures and caching. Minor reorganization of method placement and class end braces.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Copybook
    participant Cache as cachePrimitives
    participant AST as Copybook AST

    rect rgb(230,200,230)
    Note over Client,Cache: New flow: cache checked first
    Client->>Copybook: getFieldValueByName(name)
    Copybook->>Cache: get(name)
    alt Cache Hit
        Cache-->>Copybook: Primitive
    else Cache Miss
        Copybook->>AST: traverse/search for field
        AST-->>Copybook: Primitive
        Copybook->>Cache: put(name, Primitive)
    end
    Copybook-->>Client: field value (extracted from Primitive)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Attention areas:
    • Correctness of cache key selection and concurrent usage (ConcurrentHashMap patterns).
    • Ensure getPrimitiveFieldByName preserves previous semantics and error/exception behavior.
    • Validate isFlatCopybook / isHierarchical computation and laziness.
    • Confirm refactored flatMap/pattern-match changes don't alter edge-case lookups (unique names, nested groups).
    • Tests for lifecycle consistency of cached Statement/Primitive objects.

Poem

🐰 I hopped the AST, then paused and stared,
Caches planted where once I fared,
Now names resolve with fewer bounds,
Faster hops on cached playgrounds,
My little heart — a quicker drum, hooray! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main change: adding field name to AST object memoization to improve performance of raw record processing, which aligns with the changeset's introduction of caching mechanisms.
Linked Issues check ✅ Passed The pull request successfully implements the objective from issue #807 by introducing cachePrimitives and cacheStatements caches to memoize field lookups, eliminating repeated AST traversal when accessing fields by name.
Out of Scope Changes check ✅ Passed All changes are directly related to the caching/memoization objective: new caches, helper methods for cached retrieval, and updates to existing methods to use caching. No unrelated modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/807-field-access-memoize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Nov 26, 2025

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.55% 🍏
Files changed 98.74% 🍏

File Coverage
Copybook.scala 83.57% -0.28% 🍏

@github-actions
Copy link

JaCoCo code coverage report - 'spark-cobol'

There is no coverage information present for the Files changed

Total Project Coverage 81.37% 🍏

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd292b4 and d66f109.

📒 Files selected for processing (1)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/CopybookParser.scala (1)
  • getAllSegmentRedefines (332-348)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.5.7 on Scala 2.13.17
  • GitHub Check: Spark 3.5.7 on Scala 2.12.20
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
🔇 Additional comments (9)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (9)

24-24: LGTM!

The use of ConcurrentHashMap for caching field lookups is appropriate for thread-safe memoization and aligns with the PR objective to improve performance of field name access.

Also applies to: 32-33


35-35: LGTM!

The isFlatCopybook flag correctly identifies whether the copybook has a flat structure by checking for primitive children at the root level.


41-44: LGTM!

The isHierarchical property correctly identifies hierarchical copybook structures by checking for parent-child relationships in segment redefines. The lazy evaluation is appropriate for this potentially expensive check.


46-47: LGTM!

Simple accessor method for the AST. While not used within this file, it provides a consistent API for external callers.


90-94: LGTM!

The refactor to use getPrimitiveFieldByName enables caching of primitive field lookups, which directly addresses the performance objective in issue #807.


109-113: LGTM!

Consistent with the read path, this refactor enables caching for field writes, improving performance for both read and write operations.


129-132: LGTM!

The pattern matching refactor is a nice Scala idiom improvement, making the code more concise while maintaining the same functionality.


144-153: LGTM!

Consistent pattern matching improvement for better code readability and Scala idioms.


350-364: LGTM!

The caching pattern is correctly implemented here—the method checks the cache, computes if missing, stores the result, and returns it. This provides the performance improvement for primitive field access as intended.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (2)

180-203: Consider using computeIfAbsent for cleaner cache pattern.

The current "get-check-compute-put" pattern is correct but could be simplified and made slightly more efficient using computeIfAbsent, which handles the null check and atomic insertion in one call.

Apply this diff to use the more idiomatic pattern:

-    val cachedStatement = cacheStatements.get(fieldName)
-
-    if (cachedStatement == null) {
+    cacheStatements.computeIfAbsent(fieldName, _ => {
       val schema = getCobolSchema
 
       val foundFields = if (fieldName.contains('.')) {
         getFieldByPathName(schema, fieldName)
       } else {
         getFieldByUniqueName(schema, fieldName)
       }
 
       if (foundFields.isEmpty) {
         throw new IllegalStateException(s"Field '$fieldName' is not found in the copybook.")
       } else if (foundFields.lengthCompare(1) == 0) {
-        val result = foundFields.head
-        cacheStatements.put(fieldName, result)
-        result
+        foundFields.head
       } else {
         throw new IllegalStateException(s"Multiple fields with name '$fieldName' found in the copybook. Please specify the exact field using '.' " +
           s"notation.")
       }
-    } else {
-      cachedStatement
-    }
+    })

352-366: Consider similar computeIfAbsent pattern for primitive cache.

For consistency and the same benefits as mentioned in the previous comment, getPrimitiveFieldByName could also use computeIfAbsent.

Apply this diff:

-    val cachedPrimitive = cachePrimitives.get(fieldName)
-
-    if (cachedPrimitive == null) {
+    cachePrimitives.computeIfAbsent(fieldName, _ => {
       val ast = getFieldByName(fieldName)
       ast match {
         case s: Primitive =>
-          cachePrimitives.put(fieldName, s)
           s
         case _ => throw new IllegalStateException(s"$fieldName is not a primitive field, cannot extract its value.")
       }
-    } else {
-      cachedPrimitive
-    }
+    })
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d66f109 and 44b6d38.

📒 Files selected for processing (1)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/CopybookParser.scala (1)
  • getAllSegmentRedefines (332-348)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala (1)
  • getCobolSchema (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Spark 3.5.7 on Scala 2.13.17
  • GitHub Check: Spark 3.5.7 on Scala 2.12.20
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (6)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (6)

180-203: Cache population has been correctly implemented.

The previous review correctly identified that the cache was never populated. This has now been fixed with the addition of cacheStatements.put(fieldName, result) at line 195. The cache will now provide the intended performance benefits.


90-94: LGTM! Clean refactoring with caching.

The refactoring to use getPrimitiveFieldByName centralizes the primitive lookup logic and ensures caching is applied consistently.


109-113: LGTM! Consistent with the getter refactoring.

The setter now uses the same cached primitive lookup, maintaining symmetry with getFieldValueByName.


129-132: LGTM! More idiomatic pattern matching.

The refactoring to use flatMap is cleaner and more functional in style, making the code easier to read and maintain.

Also applies to: 144-153


32-35: LGTM! Cache declarations and structural flags are well-designed.

The use of ConcurrentHashMap ensures thread-safe caching, and the new isFlatCopybook flag provides a simple structural check. Both additions align well with the PR's performance objectives.


41-44: LGTM! Lazy evaluation is appropriate for hierarchical check.

The isHierarchical flag correctly uses lazy evaluation since it requires traversing the AST to find segment redefines, which is an operation that should only be performed when needed.

@yruslan yruslan merged commit 316c952 into master Nov 26, 2025
7 checks passed
@yruslan yruslan deleted the feature/807-field-access-memoize branch November 26, 2025 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of getting values by field name

2 participants