SerDes – Serializer and Deserializer

Hadoop supports a wide variety of data formats. Each format is handled by a SerDe, short for Serializer and Deserializer. A SerDe defines how records are read from stored data (deserialized) on input and written back (serialized) on output.
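To make the two halves of the name concrete, here is a minimal Python sketch of a serialize/deserialize round trip for two text formats. It is only an illustration using the standard library, not Hive's actual SerDe implementation; the sample records and function names are assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical sample records; the field names are illustrative only.
SAMPLE_ROWS = [
    {"id": "1", "name": "alice"},
    {"id": "2", "name": "bob"},
]


def serialize_csv(rows, fields):
    """Turn in-memory records into CSV text (the 'Ser' half)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def deserialize_csv(text):
    """Turn CSV text back into records (the 'De' half)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def serialize_json(rows):
    """Write one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def deserialize_json(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line]


csv_text = serialize_csv(SAMPLE_ROWS, ["id", "name"])
json_text = serialize_json(SAMPLE_ROWS)
```

The same records survive a round trip through either format; only the on-disk representation differs, which is exactly the part a SerDe is responsible for.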

Here are the commonly used SerDes in Hadoop/Hive.

  • Types of SerDes:
    1. Text SerDes (human-readable):
      • CSV
      • JSON
      • XML
    2. Binary SerDes (compact, row-oriented):
      • SequenceFile
      • Avro
    3. Columnar SerDes (efficient for reads that touch only a few columns):
      • RCFile
      • ORC
      • Parquet
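The difference between row-oriented and columnar layouts can be sketched in a few lines of Python. This is a toy in-memory model, assuming hypothetical records and helper names; real ORC and Parquet files add encoding, statistics, and compression on top of the same idea.

```python
# Hypothetical records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob", "score": 75},
    {"id": 3, "name": "carol", "score": 82},
]


def to_columnar(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def column_from_rows(rows, name):
    """Row layout: every full record is visited to extract one field."""
    return [row[name] for row in rows]


def column_from_columnar(columns, name):
    """Columnar layout: the requested column is already stored together."""
    return columns[name]


COLUMNS = to_columnar(RECORDS)
```

A query that needs only `score` has to walk every record in the row layout, while the columnar layout can hand back that one column and skip the rest. That is why columnar formats suit analytical queries that select a few columns out of many.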

For best performance, as a rule of thumb:

  • Use ORC with Apache Hive
  • Use Parquet with Apache Spark
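In Hive, the storage format is chosen per table with the `STORED AS` clause. The helper below is a small sketch that only builds the DDL string; the function name, table name, and columns are hypothetical, and the statement would still need to be run through Hive itself.

```python
# Storage format keywords accepted by Hive's STORED AS clause.
HIVE_FORMATS = {"TEXTFILE", "SEQUENCEFILE", "AVRO", "RCFILE", "ORC", "PARQUET"}


def hive_create_table(table, columns, stored_as):
    """Build a Hive CREATE TABLE statement for the given storage format.

    `columns` is a list of (name, type) pairs. This helper is illustrative
    only and does no quoting or escaping of identifiers.
    """
    fmt = stored_as.upper()
    if fmt not in HIVE_FORMATS:
        raise ValueError(f"unsupported format: {stored_as}")
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {table} ({cols}) STORED AS {fmt}"


# Hypothetical table definition.
ddl = hive_create_table("events", [("id", "INT"), ("name", "STRING")], "orc")
```

Switching the last argument to `"parquet"` produces the equivalent Parquet-backed table definition, so the format choice stays a one-word change in the DDL.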

These formats can be combined with different compression codecs, e.g.:
1. gzip
2. lz4
3. snappy (the most widely used; it is the default Parquet codec in Spark)
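The effect of a codec on serialized data can be tried with Python's standard library. The sketch below uses gzip only, because lz4 and snappy bindings are not part of the standard library; the generated records are made up for the example.

```python
import gzip
import json

# Hypothetical, highly repetitive records (repetition compresses well).
records = [{"id": i, "status": "ok"} for i in range(1000)]

# Serialize first, then compress the resulting bytes.
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

# Decompression must return the original bytes unchanged.
restored = gzip.decompress(compressed)

# Fraction of the original size that remains after compression.
ratio = len(compressed) / len(raw)
```

In general gzip trades more CPU time for a smaller output, while lz4 and snappy favor speed over compression ratio, which is one reason snappy is a popular choice for data that is read often.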

Rahul Aggarwal
http://guardiancoder.in

Senior Data Scientist and Gen-AI Engineer #DataScience #AI #RNN #CNN #GenAI #ChatGPT #LLMs
