
mdns: don't perform IO on main task · Issue #2591 · libp2p/rust-libp2p · GitHub

source link: https://github.com/libp2p/rust-libp2p/issues/2591

mdns: don't perform IO on main task #2591

rkuhn opened this issue Mar 25, 2022 · 7 comments

· May be fixed by #4623

Comments

Contributor

rkuhn

commented

Mar 25, 2022

edited by thomaseizinger

Currently, the mdns::Behaviour polls the InterfaceState on the main task (see its NetworkBehaviour::poll implementation).

InterfaceState in turn performs IO as part of this poll.

We should not be doing any form of IO on the main behaviour task because it increases the latency of dispatching and handling the events distributed to the various behaviours.

To fix this, we should spawn a separate task per runtime and communicate with it via channels.
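For illustration, here is a minimal sketch of that design, assuming a Tokio runtime; InterfaceWorker, IfaceEvent and spawn_worker are made-up names for this example, not the actual libp2p-mdns types. The worker owns the sockets and timers and does all IO on its own task, while the behaviour only drains a channel from its poll method. (The "per runtime" part means there would be one such spawn path for each supported runtime, e.g. async-io and Tokio.)

use tokio::sync::mpsc;

/// Events the per-interface IO task reports back to the behaviour.
enum IfaceEvent {
    Discovered(String), // placeholder for a discovered peer
    Expired(String),
}

/// Owns the mDNS sockets and timers; all IO happens inside `run`.
struct InterfaceWorker;

impl InterfaceWorker {
    async fn run(self, tx: mpsc::Sender<IfaceEvent>) {
        loop {
            // A real implementation would await the UDP sockets and the
            // query/response timers here; this sketch just fabricates an
            // event every ten seconds.
            tokio::time::sleep(std::time::Duration::from_secs(10)).await;
            let ev = IfaceEvent::Discovered("peer".into());
            if tx.send(ev).await.is_err() {
                return; // behaviour dropped the receiver, shut the task down
            }
        }
    }
}

/// Spawn one worker per interface; the behaviour keeps only the receiver and
/// drains it from NetworkBehaviour::poll without doing any IO itself.
fn spawn_worker(worker: InterfaceWorker) -> mpsc::Receiver<IfaceEvent> {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(worker.run(tx));
    rx
}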

Contributor

Author

addendum: lsof -nP output showing the relevant FDs

actyx   29873 rkuhn   35u  netlink                           0t0 877398060 ROUTE
actyx   29873 rkuhn   36u  a_inode               0,13          0      9494 [eventpoll]
actyx   29873 rkuhn   37u  a_inode               0,13          0      9494 [eventfd]
actyx   29873 rkuhn   38u  a_inode               0,13          0      9494 [timerfd]
actyx   29873 rkuhn   39u     IPv4          877389402        0t0       UDP *:5353 
actyx   29873 rkuhn   40u     IPv4          877389403        0t0       UDP 95.217.194.97:44860 
actyx   29873 rkuhn   41u     IPv4          877389404        0t0       UDP *:5353 
actyx   29873 rkuhn   42u     IPv4          877389405        0t0       UDP 172.17.0.1:45727 
actyx   29873 rkuhn   43u     IPv6          877398061        0t0       TCP *:4001 (LISTEN)
actyx   29873 rkuhn   44u     IPv4          877398062        0t0       TCP *:4001 (LISTEN)

Contributor

Author

okay, here’s the really weird part: with this patch to async-io, the busy poll loop is eventually broken out of and settles down into its proper “wait for netlink messages” state

diff --git a/src/reactor.rs b/src/reactor.rs
index b6ae153..7e79478 100644
--- a/src/reactor.rs
+++ b/src/reactor.rs
@@ -382,9 +382,11 @@ impl Direction {
     /// Moves all wakers into a `Vec`.
     fn drain_into(&mut self, dst: &mut Vec<Waker>) {
         if let Some(w) = self.waker.take() {
+            log::trace!("waking poller");
             dst.push(w);
         }
         for (_, opt) in self.wakers.iter_mut() {
+            log::trace!("waking waiter");
             if let Some(w) = opt.take() {
                 dst.push(w);
             }

I then tried generating a netlink message by adding an IP address, which first did the right and expected thing and then spun a few dozen loops before going back to sleep again. So this looks like a race condition that, when triggered, leads to endless wakeups of the Swarm.

Contributor

Author

It looks like epoll is being interrupted by the registration of read interest, which immediately notifies; it then waits again, upon which this vicious cycle restarts.

2022-03-25T15:59:07.068412Z TRACE async_io::reactor: woke up after IO    
2022-03-25T15:59:07.068435Z TRACE async_io::reactor: waking poller    
2022-03-25T15:59:07.068460Z TRACE async_io::reactor: react: 1 ready wakers    
2022-03-25T15:59:07.068491Z TRACE async_io::driver: main_loop: waiting on I/O    
2022-03-25T15:59:07.068521Z TRACE async_io::reactor: process_timers: 0 ready wakers    
2022-03-25T15:59:07.068537Z DEBUG netlink_proto::connection: reading incoming messages    
2022-03-25T15:59:07.068545Z TRACE polling: Poller::wait(_, Some(297.113158634s))    
2022-03-25T15:59:07.068568Z TRACE polling::epoll: wait: epoll_fd=36, timeout=Some(297.113158634s)    
2022-03-25T15:59:07.068573Z DEBUG netlink_proto::codecs: NetlinkCodec: decoding next message    
2022-03-25T15:59:07.068591Z TRACE polling::epoll: modify: epoll_fd=36, fd=38, ev=Event { key: 18446744073709551615, readable: true, writable: false }    
2022-03-25T15:59:07.068611Z TRACE async_io::reactor: reusing existing waker 0    
2022-03-25T15:59:07.068639Z DEBUG netlink_proto::connection: forwarding unsolicited messages to the connection handle    
2022-03-25T15:59:07.068669Z DEBUG netlink_proto::connection: forwaring responses to previous requests to the connection handle    
2022-03-25T15:59:07.068699Z DEBUG netlink_proto::connection: handling requests    
2022-03-25T15:59:07.068726Z DEBUG netlink_proto::connection: sending messages    
2022-03-25T15:59:07.068769Z TRACE async_io::reactor: modify from poll    
2022-03-25T15:59:07.068794Z TRACE polling::epoll: modify: epoll_fd=36, fd=43, ev=Event { key: 5, readable: true, writable: false }    
2022-03-25T15:59:07.068823Z TRACE polling::epoll: new events: epoll_fd=36, res=1    
2022-03-25T15:59:07.068823Z TRACE async_io::reactor: reusing existing waker 0    
2022-03-25T15:59:07.068846Z TRACE polling::epoll: modify: epoll_fd=36, fd=37, ev=Event { key: 18446744073709551615, readable: true, writable: false }    
2022-03-25T15:59:07.068852Z TRACE async_io::reactor: reusing existing waker 0    
2022-03-25T15:59:07.068872Z TRACE async_io::reactor: woke up after IO    

Contributor

Author

This rabbit hole is deeper than I can follow right now:

  • epoll immediately returns readable
  • since no EPOLLET option is set, this indicates that the socket still has bytes, but those bytes are not consumed between epoll_wait calls (and we note the conspicuous absence of generated events in the logs above); see the standalone demonstration after this list
  • I have absolutely no clue where to look for the offending code
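To make the second point concrete, here is a small standalone demonstration of level-triggered epoll behaviour (illustrative only, written against the libc crate, not taken from async-io or libp2p): as long as the queued datagram is not read, every epoll_wait returns immediately with the socket reported readable; once the datagram is drained, the next wait blocks.

use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let recv = UdpSocket::bind("127.0.0.1:0")?;
    let send = UdpSocket::bind("127.0.0.1:0")?;
    send.send_to(b"hello", recv.local_addr()?)?;

    unsafe {
        let epfd = libc::epoll_create1(0);
        // Level-triggered registration: EPOLLET is deliberately not set.
        let mut ev = libc::epoll_event {
            events: libc::EPOLLIN as u32,
            u64: 0,
        };
        libc::epoll_ctl(epfd, libc::EPOLL_CTL_ADD, recv.as_raw_fd(), &mut ev);

        let mut out = [libc::epoll_event { events: 0, u64: 0 }; 1];
        // Both calls return immediately with one fd ready, because the
        // datagram is never consumed in between.
        for i in 0..2 {
            let n = libc::epoll_wait(epfd, out.as_mut_ptr(), 1, 1000);
            println!("epoll_wait #{i}: {n} ready");
        }
        // Draining the socket clears the readiness; the next wait times out.
        let mut buf = [0u8; 16];
        recv.recv_from(&mut buf)?;
        let n = libc::epoll_wait(epfd, out.as_mut_ptr(), 1, 100);
        println!("after drain: {n} ready");
        libc::close(epfd);
    }
    Ok(())
}

Note the contrast with the situation above, where readiness keeps being reported even though no unconsumed data seems to be available.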

Contributor

Author

Okay @mxinden I’m pretty sure I know how it happens, but I’m not sure how to fix it:

  • InterfaceState::poll registers read interest in the mDNS receiver socket (possibly after having drained received datagrams)
  • sometimes this works as intended, namely that epoll doesn’t return this socket as readable until something happens — I think this is the case when read interest is registered before the async-io run loop calls epoll_wait again
  • but most of the time, epoll_wait immediately returns, saying that the mDNS receiver socket is ready, but there are no messages there and the whole cycle begins again

If I remove self.recv_socket.poll_readable(cx) from the poll function in iface.rs, then everything works nicely, including mDNS discovery. I suppose the functionality will be rescued by one of the various timers that wake up the behaviour in any case, leading to poll calls. I’ll probably patch libp2p-mdns in this fashion so that Actyx works again, but what is the proper solution?
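For reference, the usual way to cope with level-triggered readiness is to attempt the non-blocking read first and only register the waker once the read returns WouldBlock, so the socket is always drained before interest is (re)registered. Below is a minimal sketch of that pattern against async-io's Async<UdpSocket>; it is illustrative, not the actual iface.rs code, and whether it would be sufficient here is exactly the open question, given that readiness was reported with nothing queued.

use std::net::UdpSocket;
use std::task::{Context, Poll};

use async_io::Async;

fn poll_recv(
    socket: &Async<UdpSocket>,
    buf: &mut [u8],
    cx: &mut Context<'_>,
) -> Poll<std::io::Result<usize>> {
    loop {
        // Drain first: try the non-blocking read before touching the reactor.
        match socket.get_ref().recv(buf) {
            Ok(n) => return Poll::Ready(Ok(n)),
            Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => {
                // Only now register read interest. If the socket became
                // readable in the meantime, loop and read again rather than
                // returning Pending with data still queued.
                match socket.poll_readable(cx) {
                    Poll::Ready(Ok(())) => continue,
                    Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
                    Poll::Pending => return Poll::Pending,
                }
            }
            Err(e) => return Poll::Ready(Err(e)),
        }
    }
}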

Contributor

Author

Sleeping on it, I have doubts about the current design of libp2p-mdns: I don’t think the UDP ports should be polled every time something unrelated happens in the rest of the Swarm, and I also don’t think that the rest of the Swarm should be polled every time something happens on those UDP ports.
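Under the separate-task design sketched earlier, both halves of this concern go away: the behaviour’s poll only drains a channel (no socket IO on the main task), and the worker only sends on that channel when it actually has an event to report. A behaviour-side sketch, again with illustrative names and a Tokio channel:

use std::task::{Context, Poll};

/// The main behaviour task never touches the UDP sockets; it only drains
/// events that the per-interface worker task has already produced.
fn poll_worker_events<T>(
    rx: &mut tokio::sync::mpsc::Receiver<T>,
    cx: &mut Context<'_>,
) -> Poll<T> {
    match rx.poll_recv(cx) {
        Poll::Ready(Some(ev)) => Poll::Ready(ev),
        // Ready(None) means the worker task ended; a real implementation
        // would handle that. Pending means the waker was registered and the
        // Swarm will only be woken when the worker sends an event.
        _ => Poll::Pending,
    }
}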

mxinden changed the title from "mDNS poll loop" to "protocols/mDNS: Poll UDP socket in separate task" on Oct 5, 2022

mxinden added the difficulty:moderate, help wanted, and getting-started (Issues that can be tackled if you don't know the internals of libp2p very well) labels on Oct 5, 2022

thomaseizinger added the priority:important (The changes needed are critical for libp2p, or are blocking another project) label and removed the getting-started label on Sep 12, 2023

Tagging this as important because it performs IO on the main task that all NetworkBehaviours run on.

thomaseizinger removed the getting-started (Issues that can be tackled if you don't know the internals of libp2p very well) label on Sep 12, 2023

thomaseizinger changed the title from "protocols/mDNS: Poll UDP socket in separate task" to "mdns: don't perform IO on main task" on Sep 12, 2023


Labels: bug, difficulty:moderate, help wanted, priority:important
