2024-09-21 · Min-jun Park

Local Spark before you pay for clusters

Close-up of Python notebook cell running Spark local session

Spark Batch Processing opens in local mode on purpose. Learners profile a 2GB clickstream sample, read physical plans, and only then request a small cluster shape.

Mentors share a worksheet comparing executor counts vs shuffle spill for our sample job. Most students cut runtime by repartitioning before touching hardware.

When we move to cloud labs, billing alerts are mandatory. We cap executor hours in the sandbox; overages are learner-managed with explicit acknowledgment.

The pedagogical point: optimization discipline saves more money than defaulting to bigger instances.

Tags: Spark, Cost

← All field notes