-
Notifications
You must be signed in to change notification settings - Fork 86
#807 Add field name to AST object memoization for better performance of raw record processing #808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…of raw record processing.
WalkthroughAdds two concurrent caches for primitives and statements to Copybook, exposes structure flags (isFlatCopybook, isHierarchical), centralizes primitive lookup via a new private helper, and updates field-get/set paths to use caching and simplified pattern matches. (47 words) Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Copybook
participant Cache as cachePrimitives
participant AST as Copybook AST
rect rgb(230,200,230)
Note over Client,Cache: New flow: cache checked first
Client->>Copybook: getFieldValueByName(name)
Copybook->>Cache: get(name)
alt Cache Hit
Cache-->>Copybook: Primitive
else Cache Miss
Copybook->>AST: traverse/search for field
AST-->>Copybook: Primitive
Copybook->>Cache: put(name, Primitive)
end
Copybook-->>Client: field value (extracted from Primitive)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
JaCoCo code coverage report - 'cobol-parser'
|
JaCoCo code coverage report - 'spark-cobol'
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/CopybookParser.scala (1)
getAllSegmentRedefines(332-348)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
🔇 Additional comments (9)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (9)
24-24: LGTM!The use of
ConcurrentHashMapfor caching field lookups is appropriate for thread-safe memoization and aligns with the PR objective to improve performance of field name access.Also applies to: 32-33
35-35: LGTM!The
isFlatCopybookflag correctly identifies whether the copybook has a flat structure by checking for primitive children at the root level.
41-44: LGTM!The
isHierarchicalproperty correctly identifies hierarchical copybook structures by checking for parent-child relationships in segment redefines. The lazy evaluation is appropriate for this potentially expensive check.
46-47: LGTM!Simple accessor method for the AST. While not used within this file, it provides a consistent API for external callers.
90-94: LGTM!The refactor to use
getPrimitiveFieldByNameenables caching of primitive field lookups, which directly addresses the performance objective in issue #807.
109-113: LGTM!Consistent with the read path, this refactor enables caching for field writes, improving performance for both read and write operations.
129-132: LGTM!The pattern matching refactor is a nice Scala idiom improvement, making the code more concise while maintaining the same functionality.
144-153: LGTM!Consistent pattern matching improvement for better code readability and Scala idioms.
350-364: LGTM!The caching pattern is correctly implemented here—the method checks the cache, computes if missing, stores the result, and returns it. This provides the performance improvement for primitive field access as intended.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (2)
180-203: Consider usingcomputeIfAbsentfor cleaner cache pattern.The current "get-check-compute-put" pattern is correct but could be simplified and made slightly more efficient using
computeIfAbsent, which handles the null check and atomic insertion in one call.Apply this diff to use the more idiomatic pattern:
- val cachedStatement = cacheStatements.get(fieldName) - - if (cachedStatement == null) { + cacheStatements.computeIfAbsent(fieldName, _ => { val schema = getCobolSchema val foundFields = if (fieldName.contains('.')) { getFieldByPathName(schema, fieldName) } else { getFieldByUniqueName(schema, fieldName) } if (foundFields.isEmpty) { throw new IllegalStateException(s"Field '$fieldName' is not found in the copybook.") } else if (foundFields.lengthCompare(1) == 0) { - val result = foundFields.head - cacheStatements.put(fieldName, result) - result + foundFields.head } else { throw new IllegalStateException(s"Multiple fields with name '$fieldName' found in the copybook. Please specify the exact field using '.' " + s"notation.") } - } else { - cachedStatement - } + })
352-366: Consider similarcomputeIfAbsentpattern for primitive cache.For consistency and the same benefits as mentioned in the previous comment,
getPrimitiveFieldByNamecould also usecomputeIfAbsent.Apply this diff:
- val cachedPrimitive = cachePrimitives.get(fieldName) - - if (cachedPrimitive == null) { + cachePrimitives.computeIfAbsent(fieldName, _ => { val ast = getFieldByName(fieldName) ast match { case s: Primitive => - cachePrimitives.put(fieldName, s) s case _ => throw new IllegalStateException(s"$fieldName is not a primitive field, cannot extract its value.") } - } else { - cachedPrimitive - } + })
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/CopybookParser.scala (1)
getAllSegmentRedefines(332-348)cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala (1)
getCobolSchema(61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.5.7 on Scala 2.13.17
- GitHub Check: Spark 3.5.7 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
🔇 Additional comments (6)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala (6)
180-203: Cache population has been correctly implemented.The previous review correctly identified that the cache was never populated. This has now been fixed with the addition of
cacheStatements.put(fieldName, result)at line 195. The cache will now provide the intended performance benefits.
90-94: LGTM! Clean refactoring with caching.The refactoring to use
getPrimitiveFieldByNamecentralizes the primitive lookup logic and ensures caching is applied consistently.
109-113: LGTM! Consistent with the getter refactoring.The setter now uses the same cached primitive lookup, maintaining symmetry with
getFieldValueByName.
129-132: LGTM! More idiomatic pattern matching.The refactoring to use
flatMapis cleaner and more functional in style, making the code easier to read and maintain.Also applies to: 144-153
32-35: LGTM! Cache declarations and structural flags are well-designed.The use of
ConcurrentHashMapensures thread-safe caching, and the newisFlatCopybookflag provides a simple structural check. Both additions align well with the PR's performance objectives.
41-44: LGTM! Lazy evaluation is appropriate for hierarchical check.The
isHierarchicalflag correctly uses lazy evaluation since it requires traversing the AST to find segment redefines, which is an operation that should only be performed when needed.
Closes #807
Summary by CodeRabbit
New Features
Performance & Refactoring
✏️ Tip: You can customize this high-level summary in your review settings.