Apache Spark – for faster processing
When we talk about Big Data, the time it takes to process such large volumes of data is critical. Many advancements are being made to reduce processing time.
Apache Spark is one of the more recent processing frameworks in the Big Data world, and it has largely displaced the MapReduce processing framework.
Apache Spark can be 10 to 100 times faster than MapReduce, largely because it keeps intermediate data in memory instead of writing it to disk between steps.
Apache Spark is built in Scala, but it supports Java and Python as well.
It also supports SQL [for query analytics] and R [for ML/AI].
Spark Internal Data Structures:
- RDD (Resilient Distributed Dataset)
  - It is immutable
  - Transformations on it build up a DAG (directed acyclic graph) of operations
  - Uses lazy evaluation: nothing runs until an action asks for a result
- DataFrames
  - Higher-level and more robust than RDDs, with data organized into named columns
  - Provide SQL support
  - In Scala, DataFrame is an alias for the untyped Dataset[Row]
- Datasets
  - Combine features of both RDDs and DataFrames
  - Not widely used
  - Not supported by PySpark as of Spark 2.0.0; Datasets are supported only in Scala and Java
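The RDD properties listed above (immutability, a recorded chain of operations, and lazy evaluation) can be illustrated with a small sketch. This is plain Python, not the Spark API: the `LazyDataset` class and its method names are hypothetical and only mimic the behaviour on a single machine. The recorded chain here is linear, which is the simplest case of a DAG.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source: Iterable[Any],
                 steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = tuple(source)  # never modified after construction
        self._steps = steps           # the lineage: ("map" | "filter", function)

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: returns a NEW dataset and runs nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def lineage(self) -> List[str]:
        # The recorded chain of operations, in order.
        return [kind for kind, _ in self._steps]

    def collect(self) -> List[Any]:
        # Action: only now are the recorded steps executed, in order.
        rows = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

print(evens_squared.lineage())  # ['map', 'filter']
print(len(calls))               # 0, because nothing has been computed yet
print(evens_squared.collect())  # [4, 16]
print(numbers.collect())        # [1, 2, 3, 4, 5], the original is unchanged
```

In real Spark the same pattern appears as transformations (such as `map` and `filter`) that only extend the plan, and actions (such as `collect`) that trigger execution across the cluster.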
PySpark, i.e. Python with Spark, is among the most widely used frameworks for big data processing across a wide variety of domains, such as:
- Spark Core for file-level data processing
- Spark SQL for database-style SQL analytics using DataFrames and the Spark DSL
- Spark Streaming for real-time data processing by integrating with messaging systems like Kafka
- Spark ML for machine learning at scale using parallel processing
- Spark GraphX for graph data processing. Graph data is network data made of nodes and edges that represent entities and the relations between them. Common graph databases for storing such data are Neo4j, AWS Neptune, Titan, etc.
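To make the nodes-and-edges idea from the GraphX item concrete, here is a minimal sketch in plain Python. It does not use GraphX or any graph database; the sample names and the "follows" relation are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: nodes are people, edges are "follows" relations.
nodes = ["alice", "bob", "carol"]
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol")]

# Build an adjacency list: node -> list of nodes it points to.
adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)

# In-degree: how many edges point at each node.
in_degree = {node: 0 for node in nodes}
for _, dst in edges:
    in_degree[dst] += 1

print(dict(adjacency))  # {'alice': ['bob', 'carol'], 'bob': ['carol']}
print(in_degree)        # {'alice': 0, 'bob': 1, 'carol': 2}
```

A graph processing engine works on the same two ingredients, a set of nodes and a set of edges, but partitions them across many machines.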
Please refer to my GitHub repo for Spark hands-on examples.
The latest version of Spark is 3.3.0 as of July 2022.
Apache Spark supports multiple data processing paradigms:
1. Spark SQL (Batch Processing)
2. Spark Streaming (Real Time Processing)
3. Spark ML (Machine Learning)
4. Spark GraphX (Graph Data Processing)