
Understanding Apache Spark Streaming


Apache Spark Streaming is a stream-processing module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault-tolerant: it can rerun failed tasks and checkpoint the data stream that is being processed.
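To make this concrete, here is a minimal sketch of a Spark Streaming job using the DStream API. It counts words arriving on a local socket; the source (localhost:9999, fed for example by nc -lk 9999) and the 5-second batch interval are illustrative assumptions, not part of the original post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountStream").setMaster("local[2]")
    // Micro-batch interval of 5 seconds (an illustrative choice)
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: a text stream on localhost:9999 (e.g. fed by `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

If a worker dies mid-batch, Spark simply reruns the affected tasks on another executor, which is the fault tolerance described above.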

Four Major Aspects of Spark Streaming

  • Fast recovery from failures and stragglers
  • Better load balancing and resource usage
  • Combining of streaming data with static datasets and interactive queries (see the join sketch after this list)
  • Native integration with advanced processing libraries (SQL, machine learning, graph processing)
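One way to picture the third point is joining each micro-batch against a static dataset. The sketch below is an assumption-laden illustration: a hypothetical userId-to-country lookup table, and events arriving as "userId,action" lines on a local socket. transform() exposes each micro-batch as an RDD, so the live stream can be joined with the static RDD.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EnrichStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EnrichStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Static reference dataset: a hypothetical userId -> country lookup
    val users = ssc.sparkContext.parallelize(Seq(("u1", "IN"), ("u2", "US")))

    // Assumed input: "userId,action" lines on localhost:9999
    val events = ssc.socketTextStream("localhost", 9999).flatMap { line =>
      line.split(",") match {
        case Array(user, action) => Some((user, action))
        case _                   => None // ignore malformed lines
      }
    }

    // transform() gives access to each micro-batch as an RDD,
    // so the stream can be joined with the static dataset
    val enriched = events.transform(rdd => rdd.join(users))
    enriched.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```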

Need for Apache Spark Streaming

Traditional stream-processing systems are designed with a continuous operator model, which works as follows:

  • From data sources, we receive the streaming data (e.g. live logs, system telemetry data, IoT device data, etc.), and then it goes into a data ingestion system like Apache Kafka, Amazon Kinesis, etc. (a Kafka ingestion sketch follows this list).
  • The data is then processed in parallel on a cluster.
  • Results go to downstream systems like HBase, Cassandra, Kafka, etc.
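As an illustration of the ingestion step, here is a hedged sketch of consuming from Kafka with the spark-streaming-kafka-0-10 integration (that artifact must be on the classpath). The broker address (localhost:9092), topic name ("events"), and group id are assumptions made for the example.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaIngest").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed Kafka setup: broker on localhost:9092, topic "events"
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-streaming-demo",
      "auto.offset.reset"  -> "latest"
    )

    // Direct stream: executors consume from Kafka partitions in parallel
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```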

Now we will cover two important topics in Spark Streaming and get a brief idea of what they are and how they are used:

  • Error Recovery
  • Checkpointing

Error Recovery

Your application’s error management should be robust and self-sufficient. What we mean by this is that if an exception is non-critical, then manage the exception, perhaps log it, and continue processing. Critical errors, on the other hand, end the job: when a task reaches the maximum number of failures (specified by spark.task.maxFailures), Spark will terminate processing.
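A common pattern for the non-critical case is to wrap per-record work in scala.util.Try, log the failure, and drop the bad record rather than let the exception fail the task (and, after spark.task.maxFailures retries, the job). A minimal sketch, again assuming a socket source on localhost:9999 with one integer per line:

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ResilientStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ResilientStream")
      .setMaster("local[2]")
      .set("spark.task.maxFailures", "4") // task retries before the job fails

    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Treat a malformed record as non-critical: log it and keep processing,
    // instead of letting the exception kill the task
    val numbers = lines.flatMap { line =>
      Try(line.trim.toInt) match {
        case Success(n) => Some(n)
        case Failure(ex) =>
          println(s"Skipping malformed record '$line': ${ex.getMessage}")
          None
      }
    }
    numbers.reduce(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```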

Checkpointing in Spark Streaming

In batch processing, fault tolerance is built in. This means the job does not lose its state if a node crashes; instead, the lost tasks are rerun on other workers. Intermediate results go to persistent storage (which of course has to be fault-tolerant as well, as is the case for HDFS, GPFS, or Cloud Object Storage).

Now we want to achieve the same guarantees in streaming as well, since it can be crucial that we do not lose the data stream we are processing.

It is possible to set up an HDFS-based checkpoint directory to store Apache Spark streaming state. Checkpointing is the process of writing received records and metadata to HDFS at checkpoint intervals, so that a restarted driver can pick up where it left off.
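The standard recovery pattern is StreamingContext.getOrCreate: the context and its whole pipeline are built inside a creating function, so that on restart Spark can instead rebuild them from the checkpoint. A sketch, where the checkpoint path and the socket pipeline are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedStream {
  // Hypothetical checkpoint directory; must live on fault-tolerant storage
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir) // write records and metadata at checkpoint intervals

    // The DStream pipeline must be defined inside the creating function
    // so it can be reconstructed from the checkpoint on recovery
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists; otherwise build a fresh context
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```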

Conclusion

In this blog, we have learned about Apache Spark Streaming and some related topics: the need for streaming in Apache Spark, error recovery, and checkpointing.

Hope you enjoyed the blog. Thanks for reading!



