
    pyspark-cost-linter

    by Ikerg

    Detect and fix the 9 most expensive PySpark anti-patterns to slash your Spark compute costs.

    Updated Jun 2026
    0 installs

    Free

    Included in download

    • Downloadable skill package
    • 1 permission declared
    • Instant install

    Sample input

    Review my Spark job in src/silver_layer.py for any performance issues or code that might be making our Databricks bill too high.

    Sample output

    Found 2 high-severity issues in silver_layer.py:

    • [R041] Line 45: Python UDF detected. This disables Photon acceleration. Refactor using pyspark.sql.functions.
    • [R045] Line 82: withColumn inside a loop. This causes quadratic plan growth. Use select() or withColumns() instead.

    About This Skill

    What it does

    The PySpark Cost Linter scans PySpark scripts for expensive anti-patterns that inflate cloud compute bills. It flags code-level inefficiencies such as driver-side bottlenecks, unoptimized UDFs, and plan-bloating loops, and gives each finding a severity rating and idiomatic refactoring advice.

    Why use this skill

    Standard linters catch syntax errors, but they don't catch the "silent killers" of Spark performance. This skill is built for data engineers who need to optimize Databricks or EMR jobs without manually auditing thousands of lines of code. It uses deterministic rules (R040–R050) to find patterns that disable Photon acceleration or cause quadratic plan analysis times, saving you from expensive trial-and-error debugging.
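    To illustrate what a deterministic rule of this kind could look like, here is a minimal sketch that uses Python's standard ast module to flag withColumn calls nested inside a for or while loop. It is not the skill's actual implementation: the function name find_withcolumn_in_loops and the SAMPLE source are hypothetical, and only the rule ID R045 is taken from the sample output above.

```python
import ast

# Hypothetical input to lint. It is parsed only, never executed,
# so pyspark does not need to be installed to run this sketch.
SAMPLE = '''
from pyspark.sql import functions as F

def add_flags(df, cols):
    for c in cols:
        df = df.withColumn(c + "_is_null", F.col(c).isNull())
    return df
'''


def find_withcolumn_in_loops(source):
    """Return sorted line numbers of .withColumn(...) calls inside for/while loops."""
    tree = ast.parse(source)
    hits = set()
    for loop in ast.walk(tree):
        if not isinstance(loop, (ast.For, ast.While)):
            continue
        # Walk everything nested under the loop, including inner loops.
        for node in ast.walk(loop):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "withColumn"
            ):
                hits.add(node.lineno)
    return sorted(hits)


findings = find_withcolumn_in_loops(SAMPLE)
for line in findings:
    print(f"[R045] Line {line}: withColumn inside a loop")
```

    A check like this is purely syntactic, so it gives the same result on every run. It would miss loops written as comprehensions or functools.reduce calls. The fix suggested in the sample output is to build all the column expressions first and pass them to a single select() or withColumns() call.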

    Supported tools

    • PySpark (Core & SQL)
    • Databricks (Photon & AQE optimization checks)
    • CI/CD Integration (via JSON output)
    • Common data formats (Delta, Parquet, CSV, JSON)

    Output

    The skill produces a structured report that maps rule IDs to specific line numbers. Each finding includes a description of the cost impact (e.g., "OOM Risk" or "Photon Disabled") and a code-level recommendation for fixing it.
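    As a rough picture of what such a report might look like when serialized as JSON for CI, the sketch below encodes the two findings from the sample output. The field names (rule_id, line, severity, cost_impact, recommendation) and the overall layout are assumptions for illustration; the listing does not document the real schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Finding:
    # Field names are illustrative assumptions, not the skill's documented schema.
    rule_id: str
    line: int
    severity: str
    cost_impact: str
    recommendation: str


# Values mirror the two findings shown in the sample output.
findings = [
    Finding("R041", 45, "high", "Photon Disabled",
            "Refactor using pyspark.sql.functions."),
    Finding("R045", 82, "high", "Quadratic plan growth",
            "Use select() or withColumns() instead."),
]

report = json.dumps(
    {"file": "silver_layer.py", "findings": [asdict(f) for f in findings]},
    indent=2,
)
print(report)

# A CI step could fail the build when any high-severity finding is present.
has_blocker = any(f.severity == "high" for f in findings)
```

    Keeping each finding as a flat record makes it straightforward for a pipeline to filter by rule ID or severity before deciding whether to block a merge.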

    Use Cases

    • Identify Python UDFs that are disabling Photon acceleration and increasing DBUs.
    • Find collect() or toPandas() calls that risk driver OOMs and leave the cluster idle while the driver works.
    • Detect withColumn loops that cause quadratic plan growth and slow down the Spark optimizer.
    • Audit Spark configurations like AQE and shuffle partitions for cost efficiency.
    • Analyze schema inference logic to prevent redundant data passes.


    Security Scanned

    Passed automated security review

    Permissions

    Terminal / Shell

    Allowed Hosts

    agensi.io

    File Scopes

    scripts/**
    examples/**

    Creator

    Lead Data Engineer with 11 years of experience designing and delivering scalable data platforms across Databricks, AWS, and Azure ecosystems. Proven track record of building high-performance data solutions for large-scale, data-intensive organizations in industries including healthcare and robotics. Extensive experience working in highly regulated environments, managing complex data pipelines and large volumes of structured and unstructured data.
