SerDes – Serializer and Deserializer
Hadoop supports a wide variety of data formats. Each format is read and written through a SerDe, short for Serializer and Deserializer. A SerDe defines how input data is parsed into rows and how rows are written back out as output data.
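As a rough illustration of the two halves of a SerDe, the Python sketch below serializes one row to a CSV line and to a JSON line, then deserializes each back. It is not Hive's actual SerDe interface (that is a Java API); the function names, column names, and sample row are made up for the example, and only the standard library is used.

```python
import csv
import io
import json

# Hypothetical column names, used only for this illustration.
COLUMNS = ["id", "name", "city"]


def serialize_csv(row):
    """Serializer half: turn a row dict into one CSV line."""
    buf = io.StringIO()
    csv.writer(buf).writerow([row[c] for c in COLUMNS])
    return buf.getvalue().rstrip("\r\n")


def deserialize_csv(line):
    """Deserializer half: turn one CSV line back into a row dict.

    CSV carries no type information, so every value comes back as a string.
    """
    values = next(csv.reader([line]))
    return dict(zip(COLUMNS, values))


def serialize_json(row):
    """Serializer half for JSON: one JSON object per line."""
    return json.dumps(row, sort_keys=True)


def deserialize_json(line):
    """Deserializer half for JSON."""
    return json.loads(line)


# Hypothetical sample row; the comma inside "city" forces CSV quoting.
row = {"id": "1", "name": "Asha", "city": "Pune, IN"}
csv_line = serialize_csv(row)
json_line = serialize_json(row)
```

Both formats round-trip the same row, but the bytes on disk differ: the CSV line relies on column order, while the JSON line repeats the field names in every record.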
Here are the commonly used SerDes in Hadoop/Hive.
- Types of SerDes:
  - Text SerDes:
    - CSV
    - JSON
    - XML
  - Binary SerDes (compact binary encoding):
    - SequenceFile
    - Avro
  - Columnar SerDes (efficient reads and writes):
    - RCFile
    - ORC
    - Parquet
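To see why columnar formats can be efficient to read, the Python sketch below holds the same records in a row layout and in a column layout. It is a toy in-memory model, not how ORC or Parquet lay out files on disk; the sample data and the `to_columns` helper are assumptions made for the example.

```python
# Hypothetical sample rows, used only for this illustration.
rows = [
    {"id": 1, "name": "Asha", "amount": 120.0},
    {"id": 2, "name": "Ben", "amount": 75.5},
    {"id": 3, "name": "Chen", "amount": 210.25},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full row is visited just to read one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the 'amount' column is touched;
# 'id' and 'name' are never read.
total_from_columns = sum(columns["amount"])
```

A query that aggregates a single column only needs that column's values. Storing the values of one column together also tends to compress better, because they share a type and often repeat.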
For best performance:
Use ORC with Apache Hive
Use Parquet with Apache Spark
These file formats can be combined with different compression codecs, e.g.:
1. gzip
2. lz4
3. snappy (the most commonly used)
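The effect of a codec can be sketched with Python's standard `gzip` module. The payload below is hypothetical, and lz4 and snappy are left out because they are not part of Python's standard library.

```python
import gzip

# Hypothetical CSV-like payload with repetitive values.
lines = ["{},user{},IN".format(i, i % 10) for i in range(1000)]
raw = "\n".join(lines).encode("utf-8")

# Compress and then restore the payload.
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

print("raw bytes:", len(raw))
print("gzip bytes:", len(compressed))
```

The restored bytes are identical to the input, and the compressed size is smaller because the payload is repetitive. How much a codec saves, and how fast it runs, depends on the data and on the codec chosen.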