pyspark-cost-linter
by Ikerg
Detect and fix the 9 most expensive PySpark anti-patterns to slash your Spark compute costs.
- Identify Python UDFs that are disabling Photon acceleration and increasing DBUs.
- Find collect() or toPandas() calls that risk driver OOMs and idle clusters.
- Detect withColumn loops that cause quadratic plan growth and slow down the Spark optimizer.
Free
Sample input
Review my Spark job in src/silver_layer.py for any performance issues or code that might be making our Databricks bill too high.
Sample output
Found 2 high-severity issues in silver_layer.py:
- [R041] Line 45: Python UDF detected. This disables Photon acceleration. Refactor using pyspark.sql.functions.
- [R045] Line 82: withColumn inside a loop. This causes quadratic plan growth. Use select() or withColumns() instead.
Included in download
- Downloadable skill package
- 1 permission declared
- Instant install
About This Skill
What it does
The PySpark Cost Linter scans PySpark scripts for expensive anti-patterns that inflate cloud compute bills. It identifies code-level inefficiencies, such as driver-side bottlenecks, unoptimized UDFs, and plan-bloating loops, and provides a severity rating and idiomatic refactoring advice for each finding.
Why use this skill
Standard linters catch syntax errors, but they don't catch the "silent killers" of Spark performance. This skill is built for data engineers who need to optimize Databricks or EMR jobs without manually auditing thousands of lines of code. It uses deterministic rules (R040–R050) to find patterns that disable Photon acceleration or cause quadratic plan analysis times, saving you from expensive trial-and-error debugging.
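To give a sense of what a deterministic rule of this kind could look like, here is a minimal sketch that flags withColumn calls nested inside loops using Python's standard ast module. It is not the skill's actual implementation: the class and function names are hypothetical, and only the rule ID R045 is borrowed from the sample output.

```python
import ast

# Hypothetical sketch: the rule ID is borrowed from the sample output,
# but this checker is NOT the skill's real implementation.
RULE_ID = "R045"


class WithColumnInLoop(ast.NodeVisitor):
    """Collect line numbers of .withColumn(...) calls nested in for/while loops."""

    def __init__(self):
        self.loop_depth = 0
        self.findings = []

    def _visit_loop(self, node):
        # Track nesting so calls outside any loop are ignored.
        self.loop_depth += 1
        self.generic_visit(node)
        self.loop_depth -= 1

    visit_For = _visit_loop
    visit_While = _visit_loop

    def visit_Call(self, node):
        func = node.func
        if (
            self.loop_depth > 0
            and isinstance(func, ast.Attribute)
            and func.attr == "withColumn"
        ):
            self.findings.append((RULE_ID, node.lineno))
        self.generic_visit(node)


def lint_source(source):
    """Parse the source text and return sorted (rule_id, line) findings."""
    checker = WithColumnInLoop()
    checker.visit(ast.parse(source))
    return sorted(checker.findings)


SAMPLE = '''
df = df.withColumn("ok", df.a + 1)
for name in ["x", "y"]:
    df = df.withColumn(name, df.a * 2)
'''

findings = lint_source(SAMPLE)
print(findings)
```

Because the check works on the syntax tree rather than on a running job, it gives the same answer on every run and needs no Spark cluster. The trade-off is that a purely syntactic rule like this one cannot tell whether the receiver is really a Spark DataFrame.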
Supported tools
- PySpark (Core & SQL)
- Databricks (Photon & AQE optimization checks)
- CI/CD Integration (via JSON output)
- Common data formats (Delta, Parquet, CSV, JSON)
Output
The skill produces a structured report mapping rule IDs to specific line numbers. Each finding includes a description of the cost impact (e.g., "OOM Risk" or "Photon Disabled") and a code-level recommendation to fix the leak.
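As an illustration of how such a report might gate a CI job, the sketch below parses a JSON report and returns a non-zero exit code when any high-severity finding is present. The JSON schema is not documented on this page, so every field name here ("file", "findings", "rule", "line", "severity", "impact") is an assumption, as is the gate function itself.

```python
import json

# Hypothetical report shape: the skill's real JSON schema is not documented
# here, so every field name below is an assumption for illustration.
REPORT_JSON = '''
{
  "file": "src/silver_layer.py",
  "findings": [
    {"rule": "R041", "line": 45, "severity": "high", "impact": "Photon Disabled"},
    {"rule": "R045", "line": 82, "severity": "high", "impact": "Quadratic plan growth"}
  ]
}
'''


def gate(report_text, fail_on=("high",)):
    """Return (exit_code, messages); exit_code is 1 if any finding matches fail_on."""
    report = json.loads(report_text)
    blocking = [f for f in report["findings"] if f["severity"] in fail_on]
    messages = [
        f'{report["file"]}:{f["line"]} [{f["rule"]}] {f["impact"]}' for f in blocking
    ]
    return (1 if blocking else 0), messages


exit_code, messages = gate(REPORT_JSON)
for message in messages:
    print(message)
```

In a pipeline, the returned exit code could be passed to sys.exit so that a pull request introducing a high-severity pattern fails the build.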
Use Cases
- Identify Python UDFs that are disabling Photon acceleration and increasing DBUs.
- Find collect() or toPandas() calls that risk driver OOMs and idle clusters.
- Detect withColumn loops that cause quadratic plan growth and slow down the Spark optimizer.
- Audit Spark configurations like AQE and shuffle partitions for cost efficiency.
- Analyze schema inference logic to prevent redundant data passes.
How to Install
mkdir -p ~/.claude/skills && curl -sL https://www.agensi.io/api/install/pyspark-cost-linter -o /tmp/pyspark-cost-linter.zip && unzip -o /tmp/pyspark-cost-linter.zip -d ~/.claude/skills && rm /tmp/pyspark-cost-linter.zip
Free skills install directly. Paid skills require purchase; use the download button above after buying.
Reviews
No reviews yet - be the first to share your experience.
Only users who have downloaded or purchased this skill can leave a review.
Security Scanned
Passed automated security review
Creator
Lead Data Engineer with 11 years of experience designing and delivering scalable data platforms across Databricks, AWS, and Azure ecosystems. Proven track record of building high-performance data solutions for large-scale, data-intensive organizations in industries including healthcare and robotics. Extensive experience working in highly regulated environments, managing complex data pipelines and large volumes of structured and unstructured data.