Apache Spark – for faster processing

With Big Data, the time it takes to process very large datasets is critical, and a lot of work has gone into reducing it.

Apache Spark is one of the more recent processing frameworks in the Big Data world, and it has largely left the MapReduce processing framework behind.

Apache Spark can be 10 to 100 times faster than MapReduce, mainly because it keeps intermediate data in memory instead of writing it to disk between processing steps.

Apache Spark is written in Scala, but it supports Java and Python as well.

It also supports SQL [for query analytics] and R [for ML/AI].

Spark Internal Data Structures:

  1. RDD
    • Resilient Distributed Dataset
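    • Immutable: a transformation returns a new RDD instead of modifying the existing one
    • Transformations build up a DAG (directed acyclic graph) of operations
    • Uses lazy evaluation: nothing is computed until an action (such as collect or count) is called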
  2. DataFrames
    • Higher-level than RDDs: data is organized into named columns, which lets Spark optimize queries
    • Provides SQL support
    • In Scala, DataFrame is an alias for the untyped Dataset[Row]
  3. Datasets
    • Combine features of both RDDs and DataFrames
    • Not widely used
    • Not supported by PySpark as of Spark 2.0.0
      • Datasets are supported only in Scala and Java
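The RDD properties listed above (immutability, a recorded chain of transformations, and lazy evaluation) can be illustrated with a small pure-Python sketch. This is only a toy, single-machine model and not Spark's API: the class `ToyRDD` and its `lineage` method are hypothetical names invented for this illustration, and the method names `map`, `filter`, and `collect` are chosen only to mirror the RDD operations of the same names.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # input data, never modified
        self._steps = tuple(steps)    # recorded transformations (the lineage)

    def map(self, fn):
        # Transformation: records the step and returns a NEW ToyRDD.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: records the step and returns a NEW ToyRDD.
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def lineage(self):
        # Names of the recorded steps, in order (a linear chain here;
        # a real DAG can also branch and join).
        return [name for name, _ in self._steps]

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        items = iter(self._source)
        for name, fn in self._steps:
            items = map(fn, items) if name == "map" else filter(fn, items)
        return list(items)


calls = []  # records every time square() really runs


def square(x):
    calls.append(x)
    return x * x


base = ToyRDD(range(1, 6))
squares = base.map(square)
even_squares = squares.filter(lambda v: v % 2 == 0)

calls_before = list(calls)       # still empty: no work has been done yet
result = even_squares.collect()  # the action triggers the computation
print(even_squares.lineage(), result)
```

Until `collect()` is called, `square` never runs, and `base` stays unchanged no matter how many transformations are derived from it.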

  • PySpark, i.e. Python with Spark, is among the most widely used frameworks for big data processing across a wide variety of domains, such as:
    1. Spark Core for file-level data processing
    2. Spark SQL for database-style SQL analytics using DataFrames and the Spark DSL
    3. Spark Streaming for real-time data processing, integrating with messaging systems like Kafka
    4. Spark ML for machine learning at scale using parallel processing
    5. Spark GraphX for graph data processing. Graph data is network data made up of nodes and edges that represent entities and the relations between them. Common graph databases for storing such data are Neo4j, Amazon Neptune, Titan, etc. Note that GraphX itself does not have a Python API.
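To make "nodes and edges" concrete, here is a minimal pure-Python sketch of graph data. It does not use GraphX or any graph database; the sample people and the "follows" relation are made-up illustration data, and the variable names are assumptions.

```python
from collections import defaultdict

# Hypothetical sample data: people are nodes, "follows" relations are edges.
nodes = {"a": "Alice", "b": "Bob", "c": "Carol"}
edges = [("a", "b"), ("b", "c"), ("a", "c")]  # (source, destination)

out_neighbors = defaultdict(list)          # node id -> ids it points to
in_degree = {node_id: 0 for node_id in nodes}  # node id -> incoming edge count

for src, dst in edges:
    out_neighbors[src].append(dst)
    in_degree[dst] += 1

# The node with the most incoming edges (the most "followed" person here).
most_followed = max(in_degree, key=in_degree.get)
print(nodes[most_followed], in_degree[most_followed])
```

A graph engine performs this kind of neighbor and degree computation at scale, with the nodes and edges partitioned across a cluster.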

Please refer to my GitHub repo for Spark hands-on examples.

The latest version of Spark is 3.3.0 as of July 2022.

Apache Spark supports multiple data processing paradigms:
1. Spark SQL (batch processing)
2. Spark Streaming (real-time processing)
3. Spark ML (machine learning)
4. Spark GraphX (graph data processing)

Rahul Aggarwal
http://guardiancoder.in

Senior Data Scientist and Gen-AI Engineer #DataScience #AI #RNN #CNN #GenAI #ChatGPT #LLMs
