Big Data: Apache Spark Fundamentals

Rahul Tiwari
May 29, 2024


Introduction:

Apache Spark is a powerful open-source unified analytics engine designed for large-scale data processing. It offers a robust framework for executing large-scale data analytics applications and machine learning workloads. If you’re new to Spark, this blog will walk you through the fundamentals and architecture to get you started.

What is Apache Spark?

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to be fast and general-purpose, making it ideal for a wide range of applications from batch processing to real-time streaming and machine learning.

Why Use Spark?

1. Speed: Spark processes data in-memory, which is much faster than traditional disk-based processing. This makes it significantly faster for both iterative algorithms and interactive data mining.
2. Ease of Use: Spark provides high-level APIs in Java, Scala, Python, and R, making it accessible to a wide range of developers.
3. General Computation Engine: It supports various types of data processing workloads, including batch processing, interactive querying, streaming, and machine learning.
4. Unified Engine: Spark provides a unified engine for processing both batch and streaming data, making it easier to build complex data pipelines.

Core Components of Spark:

1. Spark Core: The foundation of the Apache Spark platform, Spark Core provides essential functionalities such as task scheduling, memory management, fault recovery, and storage system interactions.
2. Spark SQL: This component allows querying data via SQL as well as through DataFrames and Datasets. It’s useful for structured and semi-structured data processing (see the short sketch after this list).
3. Spark Streaming: Enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
4. MLlib: Spark’s machine learning library, which provides scalable machine learning algorithms.
5. GraphX: A component for graph processing and graph-parallel computation.
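
To make these components concrete, here is a minimal PySpark sketch (assuming PySpark is installed and a local Spark session is acceptable) that touches Spark Core and Spark SQL:

```python
from pyspark.sql import SparkSession

# Spark Core: the SparkSession (and its underlying SparkContext) is the entry point
spark = SparkSession.builder.appName("ComponentsDemo").master("local[*]").getOrCreate()

# Spark SQL: create a small DataFrame and query it with SQL
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```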

Spark Architecture:

Understanding Spark’s architecture is key to leveraging its full potential. Here’s a simplified overview:

1. Driver: The driver is the orchestrator of the Spark application. It converts the user’s program into tasks and schedules them to run on executors. It also tracks the progress of the tasks and manages the overall application lifecycle.

2. Cluster Manager: The cluster manager allocates resources to the Spark application. Spark supports several cluster managers:
-> Standalone: A simple cluster manager included with Spark.
-> Apache Mesos: A general cluster manager that can also run Hadoop MapReduce and other applications.
-> Hadoop YARN: The resource manager introduced in Hadoop 2.
-> Kubernetes: An open-source system for automating the deployment, scaling, and management of containerized applications.

3. Executors: These are processes launched on the worker nodes of the cluster that run individual tasks. Executors execute the tasks assigned by the driver and return the results. They also provide in-memory storage for RDDs (Resilient Distributed Datasets) that are cached by user programs through the SparkContext.
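
The driver, cluster manager, and executors all come into play when the application starts. The sketch below is a minimal, illustrative PySpark setup; the master URL and resource values are assumptions for demonstration, not recommendations:

```python
from pyspark.sql import SparkSession

# The builder runs in the driver process; the master URL tells Spark which
# cluster manager to request executors from.
spark = (
    SparkSession.builder
    .appName("ArchitectureDemo")
    .master("local[*]")  # alternatives: "yarn", "spark://host:7077", "k8s://https://host:6443"
    .config("spark.executor.memory", "2g")  # memory per executor (illustrative value)
    .config("spark.executor.cores", "2")    # cores per executor (illustrative value)
    .getOrCreate()
)

sc = spark.sparkContext  # the SparkContext lives inside the driver
print(sc.master, sc.applicationId)

spark.stop()
```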

Resilient Distributed Datasets (RDDs):

RDDs are a fundamental concept in Spark. They are fault-tolerant collections of elements that can be operated on in parallel. RDDs provide:
-> Immutability: Once created, they cannot be changed. Transformations on RDDs result in the creation of new RDDs.
-> Lazy Evaluation: Transformations on RDDs are not executed immediately. Spark builds a DAG (Directed Acyclic Graph) of transformations and executes it only when an action is called.
-> Fault Tolerance: Spark can recompute lost data using the lineage information stored in RDDs.
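
Here is a short PySpark sketch (assuming a local session) that illustrates immutability and lazy evaluation in practice:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1, 6))             # create an RDD from a local collection
squares = nums.map(lambda x: x * x)            # transformation: returns a new RDD, nums is unchanged
evens = squares.filter(lambda x: x % 2 == 0)   # another lazy transformation; nothing has run yet

# Only this action triggers execution of the DAG built above
print(evens.collect())  # [4, 16]

spark.stop()
```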

DataFrames:

In Spark, a DataFrame is a distributed collection of data organized into rows and named columns, where each column has a name and an associated type. DataFrames are similar to tables in a traditional relational database, but they come with richer optimizations under the hood.

Spark DataFrames can be created from various sources, such as Hive tables, structured data files, external databases, or existing RDDs, and they make it practical to process very large amounts of data.
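
As a rough illustration (a sketch assuming a local session; the CSV path is hypothetical), a DataFrame can be built from an existing RDD or read from an external file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").master("local[*]").getOrCreate()

# From an existing RDD
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
df = rdd.toDF(["name", "age"])

df.printSchema()               # each column has a name and a type
df.filter(df.age > 30).show()

# From an external file (hypothetical path)
# csv_df = spark.read.option("header", "true").csv("/path/to/people.csv")

spark.stop()
```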

Transformations and Actions:

-> Transformations: Operations on RDDs that return a new RDD, such as `map()`, `filter()`, and `reduceByKey()`. These are lazy and are not executed until an action is called.
-> Actions: Operations that trigger the execution of the transformations to produce a result, such as `collect()`, `count()`, and `saveAsTextFile()`.
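
A classic word-count sketch (assuming a local session) strings a few transformations together and then triggers them with actions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is unified"])

# Transformations: build up the lineage lazily
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Actions: trigger execution and return or persist results
print(counts.count())    # 4 distinct words
print(counts.collect())  # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('unified', 1)]
# counts.saveAsTextFile("/tmp/word_counts")  # writes one part file per partition

spark.stop()
```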

Conclusion:

Apache Spark is a versatile and powerful framework for large-scale data processing. Its architecture is designed to be fast and flexible, making it suitable for a variety of use cases. Understanding the core components, the role of the driver and executors, and the concept of RDDs will help you get started with Spark. As you become more familiar with its APIs and components, you’ll be able to leverage Spark’s full capabilities for your data processing needs. Happy Learning!!
