PySpark Series: (Zero to Hero QnA)
Day-1

Basics to Moderate Level:
1. What is PySpark?
-> PySpark is the Python API for Apache Spark, a fast and general-purpose cluster computing system for Big Data.
2. What are the key features of Apache Spark?
-> In-memory computation, fault-tolerance, scalability, support for multiple data sources (HDFS, S3, etc.), and support for various workloads (batch processing, iterative algorithms, interactive queries, and streaming).
3. How does PySpark relate to SparkContext?
-> `SparkContext` is the main entry point for Spark functionality in Python. It connects to the cluster manager and coordinates the execution of jobs.
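A minimal sketch of getting hold of the `SparkContext`, assuming a local PySpark setup (the app name and `local[*]` master are just illustrative):

```python
from pyspark.sql import SparkSession

# Assumes a local PySpark install; app name and master are illustrative.
spark = SparkSession.builder.appName("sc-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext          # the underlying SparkContext
print(sc.appName, sc.master)     # sc-demo local[*]
```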
4. Explain RDD (Resilient Distributed Dataset)?
-> RDD is the fundamental data structure in Spark, representing an immutable, distributed collection of objects that can be operated on in parallel. RDDs are fault-tolerant and are partitioned across nodes in a cluster.
5. How can you create RDDs in PySpark?
-> RDDs can be created by parallelizing an existing collection (e.g., a list) in your driver program, by referencing a dataset in an external storage system (like HDFS, S3), or by transforming an existing RDD.
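A rough illustration of all three creation routes, assuming a local session (the HDFS path is only a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# 1. Parallelize an in-memory collection from the driver
rdd_from_list = sc.parallelize([1, 2, 3, 4, 5])

# 2. Reference a dataset in external storage (placeholder path)
rdd_from_file = sc.textFile("hdfs:///data/input.txt")

# 3. Transform an existing RDD
rdd_doubled = rdd_from_list.map(lambda x: x * 2)
```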
6. What are transformations and actions in PySpark?
-> Transformations are operations that create a new RDD from an existing one (e.g., map, filter, reduceByKey).
-> Actions are operations that compute a result based on an RDD and either return it to the driver program or save it to an external storage system (e.g., count, collect, save).
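A small word-count style sketch showing the split between lazy transformations and a result-returning action, assuming a local session:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
words = sc.parallelize(["a", "b", "a", "c"])

# Transformations: lazy, they only describe a new RDD
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)

# Action: triggers execution and returns results to the driver
print(counts.collect())   # e.g. [('b', 1), ('a', 2), ('c', 1)]
```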
7. Explain the concept of lazy evaluation in PySpark?
-> PySpark uses lazy evaluation to optimize the execution of Spark jobs. Transformations on RDDs are not executed until an action is called, allowing Spark to optimize the entire data flow.
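For example (a sketch assuming a local session), no job runs until the final `take()`:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
numbers = sc.parallelize(range(1_000_000))

# Nothing runs here -- Spark only records the lineage
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# The action below is what actually launches a job
print(squares.take(5))    # [0, 4, 16, 36, 64]
```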
8. How does Spark handle data partitioning?
-> Spark divides data into partitions, which are basic units of parallelism. Each partition is processed by one task in a cluster. Partitioning ensures that operations are distributed and can be executed in parallel.
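A quick way to see partitioning in action (a sketch assuming a local session; the DataFrame's default partition count depends on your cores and config):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), 8)        # explicitly request 8 partitions
print(rdd.getNumPartitions())              # 8

df = spark.range(100)
print(df.rdd.getNumPartitions())           # depends on cores / defaults
```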
9. What are some common Spark transformations?
-> `map`, `filter`, `flatMap`, `reduceByKey`, `sortByKey`, `join`, `union`, `distinct`, `sample`, etc.
10. What are some common Spark actions?
-> `collect`, `count`, `take`, `reduce`, `saveAsTextFile`, `foreach`, `countByKey`, `first`, `foreachPartition`, etc.
11. How can you create a DataFrame in PySpark?
-> You can create a DataFrame by loading data from a structured data source (like CSV, JSON), converting an existing RDD, or creating a DataFrame from a Pandas DataFrame.
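A sketch of all three routes, assuming a local session (the CSV path is a placeholder):

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.master("local[*]").getOrCreate()

# From a structured file (placeholder path)
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)

# From in-memory rows (or an RDD), with column names
df_rows = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])

# From a Pandas DataFrame
df_pd = spark.createDataFrame(pd.DataFrame({"name": ["Carol"], "age": [28]}))
```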
12. What are the advantages of using DataFrame over RDD?
-> DataFrames offer a higher-level API that simplifies data manipulation, automatic query optimization via the Catalyst optimizer, integration with Spark SQL for SQL queries, and easy interoperability with Python libraries like Pandas.
13. Explain the role of SparkSession in PySpark?
-> `SparkSession` is the entry point to Spark SQL and provides a consolidated environment to work with structured data and execute SQL queries alongside PySpark functionality.
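A minimal sketch of building a session and using both the DataFrame API and SQL through it (app name and `local[*]` master are assumptions for a local run):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-app")            # illustrative name
         .master("local[*]")           # assumption: running locally
         .getOrCreate())

spark.range(3).show()                  # DataFrame API
spark.sql("SELECT 1 AS one").show()    # SQL through the same session
```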
14. How can you perform SQL queries on DataFrame in PySpark?
-> By registering the DataFrame as a temporary view with the `createOrReplaceTempView()` method and then running SQL queries through `spark.sql()`.
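For example (a sketch assuming a local session; the view name and data are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 26").show()
```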
15. What is caching in PySpark and why is it useful?
-> Caching refers to persisting an RDD or DataFrame in memory across operations. It speeds up iterative algorithms and interactive data analysis by avoiding recomputation of the same data.
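A small sketch of the typical cache-use-release cycle, assuming a local session:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("square", col("id") * col("id"))
df.cache()                                  # mark for persistence
df.count()                                  # first action materialises the cache
df.filter(col("square") % 2 == 0).count()   # reuses the cached data
df.unpersist()                              # release memory when done
```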
16. How does Spark handle fault-tolerance?
-> Spark achieves fault-tolerance through RDDs, which track lineage information to reconstruct lost data partitions. When a partition is lost, Spark can recompute it using lineage information.
17. What is broadcast variable in PySpark?
-> A broadcast variable is a read-only variable cached on each machine rather than serialized with tasks. It’s used for efficiently sharing large, read-only data across all nodes in a cluster.
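A sketch of broadcasting a small lookup table, assuming a local session (the dictionary is just example data):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

country_names = {"US": "United States", "IN": "India"}   # small lookup table
bc = sc.broadcast(country_names)                         # shipped to each executor once

codes = sc.parallelize(["US", "IN", "US", "UK"])
print(codes.map(lambda c: bc.value.get(c, "Unknown")).collect())
# ['United States', 'India', 'United States', 'Unknown']
```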
18. Explain the difference between persist() and cache() in PySpark?
-> `persist()` lets you choose a storage level (e.g., MEMORY_ONLY, MEMORY_AND_DISK), while `cache()` uses the default level: MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames. Both mark the data to be reused across actions instead of being recomputed.
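Side by side, assuming a local session:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

rdd = sc.parallelize(range(1000))
rdd.cache()                                   # default level (MEMORY_ONLY for RDDs)

rdd2 = sc.parallelize(range(1000))
rdd2.persist(StorageLevel.MEMORY_AND_DISK)    # explicitly chosen storage level
```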
19. How can you monitor Spark jobs and tasks?
-> Spark provides a web UI (usually at port 4040) that displays information about running Spark jobs, tasks, stages, and storage usage. You can also monitor Spark jobs programmatically using Spark listeners.
20. What are some common data sources that PySpark can read from and write to?
-> HDFS, S3, Hive, HBase, JDBC databases, local files (CSV, JSON, Parquet), Cassandra, Elasticsearch, etc.
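A sketch of a few common read/write calls, assuming a local session (all paths and connection details below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.read.json("data/events.json")                  # read JSON (placeholder path)
df.write.mode("overwrite").parquet("out/events_parquet")  # write Parquet

jdbc_df = (spark.read.format("jdbc")                      # JDBC source, placeholder details
           .option("url", "jdbc:postgresql://host:5432/db")
           .option("dbtable", "public.events")
           .option("user", "user").option("password", "secret")
           .load())
```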
21. How can you set the number of partitions to be used when creating an RDD or DataFrame?
-> By passing the number of partitions as an argument to `parallelize()` (for RDDs), or by using the `repartition()` and `coalesce()` methods (available on both RDDs and DataFrames).
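For example (a sketch assuming a local session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), 10)    # 10 partitions at creation time
print(rdd.getNumPartitions())            # 10

df = spark.range(1000)
df20 = df.repartition(20)                # full shuffle up to 20 partitions
df5 = df20.coalesce(5)                   # merge down to 5 without a full shuffle
print(df5.rdd.getNumPartitions())        # 5
```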
22. What are accumulator variables in PySpark?
-> Accumulators are variables that are only “added” to through an associative and commutative operation and can be efficiently supported in parallel. They are used to implement counters and sums.
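A minimal counter sketch, assuming a local session:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
print(accum.value)   # 10 -- the total is readable only on the driver
```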
23. Explain the difference between map() and flatMap() transformations in PySpark.
-> `map()` applies a function to each element of an RDD and returns a new RDD with exactly one output element per input element. `flatMap()` applies a function that returns an iterable for each element and flattens the results into a single RDD (zero or more output elements per input).
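The difference is easiest to see with `split()`, assuming a local session:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
lines = sc.parallelize(["hello world", "pyspark is fun"])

print(lines.map(lambda s: s.split(" ")).collect())
# [['hello', 'world'], ['pyspark', 'is', 'fun']]   <- one output element per input

print(lines.flatMap(lambda s: s.split(" ")).collect())
# ['hello', 'world', 'pyspark', 'is', 'fun']       <- results flattened
```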
24. How does PySpark handle broadcasting and partitioning in joins?
-> PySpark automatically broadcasts the smaller table in a join when its estimated size is below a configurable threshold (`spark.sql.autoBroadcastJoinThreshold`), which avoids shuffling the larger table. For non-broadcast joins, partitioning the data on the join keys helps minimize shuffling.
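You can also hint the broadcast explicitly; a sketch assuming a local session and made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").getOrCreate()

large = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.createDataFrame([(0, "zero"), (1, "one")], ["key", "label"])

# Explicit hint; Spark also broadcasts automatically when the smaller side
# is under spark.sql.autoBroadcastJoinThreshold (10 MB by default)
joined = large.join(broadcast(small), "key")
joined.explain()   # the plan should show a BroadcastHashJoin
```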
25. What is the role of the Spark Executor in Spark applications?
-> Executors are processes launched on worker nodes for a Spark application. They run the application's tasks and store data for cached RDDs/DataFrames in memory or on disk, and they normally live for the duration of that application.
These questions cover fundamental PySpark concepts that are crucial for interviews. Understanding them provides a strong foundation for discussing more advanced topics in Big Data processing with PySpark. Do comment if you'd like any concept explained in more detail. Happy Learning !!