# Spark

## Troubleshooting
- Container killed by YARN for exceeding memory limits (max 10.4 GB).
    - Increase the number of Spark partitions
    - Decrease the number of executors, allocating more of each node's memory per executor
    - Instead of persisting RDDs in memory only, use memory+disk (or disk only)
    - Check for and remedy any data skew
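The remedies above can be sketched as follows. This is illustrative only: the input path, partition count, and memory sizes are assumptions to be tuned per workload.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("memory-tuning")
  // Fewer executors per node leaves more of the node's memory for each one.
  .config("spark.executor.memory", "8g")
  .config("spark.executor.memoryOverhead", "2g")
  .getOrCreate()

val events = spark.read.parquet("s3://bucket/events") // hypothetical input

// More partitions means smaller partitions, lowering per-task memory pressure.
val repartitioned = events.repartition(400)

// Spill partitions to disk instead of failing when memory runs out.
repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
```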
- AWS: Executors are removed even while jobs are running
    - The auto-scaling policy had its `ContainerPendingRatio` trigger set to `0`, which started scaling down whenever no applications were waiting for container allocation, even though jobs were still running.
## vs. Pandas
- Add a column with a static value
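A minimal sketch for the static-value case, using `lit`. The `teams` DataFrame and the column names are assumptions carried over from the derived-value example below.

```scala
import org.apache.spark.sql.functions.lit

// Spark: every row gets the same constant value.
val teamsWithSport = teams.withColumn("sport", lit("football"))

// pandas equivalent: teams["sport"] = "football"
```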
- Add a column with a derived value
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

val isPremUDF: UserDefinedFunction = udf[Boolean, String](_ == "Premier League")
val teamsWithLeague: DataFrame = teams.withColumn("premier_league", isPremUDF(col("league")))
```

Note: adding a column returns a `DataFrame`, which can be converted to a `Dataset` by `df.as[NewType]`.

- Drop a column
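A sketch for dropping a column, assuming the same `teams` DataFrame as above:

```scala
// Spark: returns a new DataFrame without the column (no error if it is absent).
val withoutLeague = teams.drop("league")

// pandas equivalent: teams.drop(columns=["league"])
```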
- Filter rows
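A sketch for row filtering, reusing the `teams` DataFrame and its `league` column from the example above:

```scala
import org.apache.spark.sql.functions.col

// Spark: filter (alias: where) with a column expression.
val premierTeams = teams.filter(col("league") === "Premier League")

// pandas equivalent: teams[teams["league"] == "Premier League"]
```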
- Aggregation
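A sketch of a grouped aggregation; the grouping column and output name are illustrative:

```scala
import org.apache.spark.sql.functions.count

// Spark: groupBy + agg.
val teamsPerLeague = teams.groupBy("league").agg(count("*").as("n_teams"))

// pandas equivalent: teams.groupby("league").size()
```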
- Union
```scala
teams.unionByName(anotherTeams)
```