Spark¶
Troubleshooting¶
- Container killed by YARN for exceeding memory limit (Max 10.4GB).
- Increase number of Spark partitions
- Decrease number of executors, thus allocating more of node memory / executor
- Instead of persisting RDDs in memory, use disk+memory (or just disk)
- check for and remedy any data skews
- AWS: Executors are removed even when jobs are running
- Auto-scaling policy had
ContainerPendingRatiotrigger set to0, which started scaling down when no applications were waiting for allocation, but jobs were still running.
v/s Pandas¶
- Add a column with a static value
- Add a column with a derived value
import org.apache.spark.sql.functions.col
val isPremUDF: UserDefinedFunction = udf[Boolean, String](_ == "Premier League")
val teamsWithLeague: DataFrame = teams.withColumn("premier_league", isPremUDF(col("league")))
Note: adding a column will return a DataFrame, which can be converted to Dataset by: df.as[NewType]
- Drop a column
- Filter rows
- Aggregation
teams.groupBy("league").count()
teams.agg(Map("matches_played" -> "avg", "goals_this_season" -> "count")) // multiple aggregates
- Union
scala
teams.unionByName(anotherTeams)