
Flink: Join two Data Streams

source link: https://blog.knoldus.com/flink-join-two-data-streams/

Reading Time: 3 minutes

Apache Flink offers a rich set of APIs and operators which makes Flink application developers productive when dealing with multiple data streams. Flink provides many multi-stream operations like Union, Join, and so on. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key over a common window.

Let's say we have one stream which contains the salary information of all the individuals who belong to an organization. The salary information has the id, name, and salary of an individual. This stream is available at port 9000 on localhost.
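The snippets below assume a streaming execution environment has already been created; the original post does not show this boilerplate, so here is a minimal sketch (the variable name executionEnvironment matches the snippets that follow):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Obtain the streaming execution environment on which the socket sources below are built.
final StreamExecutionEnvironment executionEnvironment =
        StreamExecutionEnvironment.getExecutionEnvironment();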

 final DataStream<Tuple3<Integer, String, Double>> salaryStream = executionEnvironment
                    .socketTextStream("localhost", 9000)
                    .map((MapFunction<String, Tuple3<Integer, String, Double>>) salaryTextStream -> {
                        String[] salaryFields = salaryTextStream.split(" ");
                        if (salaryFields.length == 3 &&
                                !(salaryFields[0].isEmpty()
                                        || salaryFields[1].isEmpty()
                                        || salaryFields[2].isEmpty())) {
                            return new Tuple3<>(Integer.parseInt(salaryFields[0]),
                                    salaryFields[1],
                                    Double.parseDouble(salaryFields[2]));
                        } else {
                            throw new Exception("Not valid input passed");
                        }
                    }, TypeInformation.of(new TypeHint<Tuple3<Integer, String, Double>>() {
                    }));
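For reference, each line entered on this socket is expected to be a space-separated salary record of id, name, and salary, for example (hypothetical values):

101 Alice 90000.0
102 Bob 75000.0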

We have another stream which contains the department information of all the individuals who belong to the same organization. The department information has the id and department of an individual. This stream is available at port 9001 on localhost.

 final DataStream<Tuple2<Integer, String>> departmentStream = executionEnvironment
                    .socketTextStream("localhost", 9001)
                    .map((MapFunction<String, Tuple2<Integer, String>>) departmentTextStream -> {
                        String[] departmentFields = departmentTextStream.split(" ");
                        if (departmentFields.length == 2 &&
                                !(departmentFields[0].isEmpty()
                                        || departmentFields[1].isEmpty())) {
                            return new Tuple2<>(Integer.parseInt(departmentFields[0]), departmentFields[1]);
                        } else {
                            throw new Exception("Not valid input passed");
                        }
                    }, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {
                    }));
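Likewise, each line on this socket is a space-separated department record of id and department, for example (hypothetical values):

101 Engineering
102 Marketing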

Now, join the salary data stream and the department data stream on the key id of an individual, which is common to both streams. After joining, the resultant data stream will have all the information in one record: the id, name, salary, and department of an individual.

final DataStream<Tuple4<Integer, String, String, Double>> joinedStream =
                 salaryStream.join(departmentStream)
                 .where(getSalaryJoinKey -> getSalaryJoinKey.f0, TypeInformation.of(new TypeHint<Integer>() {}))
                 .equalTo((KeySelector<Tuple2<Integer, String>, Integer>) getDepartmentKey -> getDepartmentKey.f0)
                 .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
                 .apply((JoinFunction<Tuple3<Integer, String, Double>,
                         Tuple2<Integer, String>, Tuple4<Integer, String, String, Double>>) (salaryDetail, departmentDetail) ->
                                    new Tuple4<>(salaryDetail.f0, salaryDetail.f1, departmentDetail.f1, salaryDetail.f2),
                                    TypeInformation.of(new TypeHint<Tuple4<Integer, String, String, Double>>() {}));

Here, we use a common window for both streams. We want a tumbling window based on processing time, which is why we use the TumblingProcessingTimeWindows class. The window size is 30 seconds, which means all elements from both streams that arrive within the same 30-second window will be joined. Then we apply a JoinFunction to join both streams and get the complete joined information of an individual.
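To actually see the joined records, the job still needs a sink and has to be executed. A minimal sketch (not part of the original snippets; the job name is arbitrary) could be:

// Print the joined records to the TaskManager's standard out and start the job.
joinedStream.print();
executionEnvironment.execute("Flink Window Join Example");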

You can find the complete code here.

Join in Action

To run the application, open two socket terminals, one on port 9000 and another on port 9001. The streaming application will read from these ports.

nc -l 9000
nc -l 9001

Start the Flink local cluster:

./bin/start-cluster.sh

Now run the Flink application and tail the TaskManager log to see the output.

tail -f log/flink-*-taskexecutor-*.out

Enter messages in both netcat windows within the same 30-second window to join the two streams. The resultant data stream has the complete information of an individual: the id, name, department, and salary.
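For example, with the hypothetical inputs shown earlier, typing 101 Alice 90000.0 on port 9000 and 101 Engineering on port 9001 within the same window would print a joined record along the lines of:

(101,Alice,Engineering,90000.0)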

Happy Blogging and keep Learning!!

