My rants about TP-Link Omada networking products

source link: http://rachelbythebay.com/w/2023/11/17/omada/

Perhaps you've been running Ubiquiti stuff for a while, and you've been disappointed by their stock issues, their goofy software issues, and the general lack of quality all the way around. Maybe you turned your eyes to the TP-Link Omada ecosystem. I'm here to warn you that the grass is not greener on that side of the fence. It may in fact be spray-painted.

First, some context. I'm the family sysadmin - not by choice, but because nobody else would do it. When I visit family, I have to fix their stuff. There are some gearhead types and I do my best to make them happy. Various ISPs are starting to sell services that are well above 1 Gbps. This is typically symmetric fiber stuff.

That's the situation with one of the sites I support, and their existing Ubiquiti stuff from years gone by became a bottleneck once they had that installed. Obviously, they want to get this greater-than-gig performance wherever possible. That means a derpy Windoze box or two, and that brought on a whole hellscape of dealing with resource conflicts the likes of which I hadn't seen in 20 years.

But no, this isn't about that. This is about TP-Link. I was pointed at this ecosystem as a possible escape from the clowntown that is Ubiquiti, so that's what I bought this time around: one of their gateway boxes (calling it a router would be too kind), a switch, and a hardware controller for local control - none of that cloud crap here, thanks.

It's been a new bit of stupid every week with this stuff. First of all, the switch is really best suited for a closet at a business, not anywhere in someone's home. It has dinky little fans that run pretty hard all the time, with all the noise that entails. People who swap those fans out for quieter ones invariably get fan errors, and then the thing eats itself within a year. (Maybe the switches fail by themselves either way - the jury is still out on that.)

The latest build of their controller software flat out does not work on Safari. I mean, sure, it loads up, and then the browser starts doing something indicative of a horrible big-O blowup factor somewhere in their JavaScript. It'll hang for a minute at a time any time you move the pointer around. Or, it'll prompt you to download a file CONSTANTLY. Like, WTF kind of content-type brain damage are you doing? It doesn't happen in Firefox or Chrome, apparently, but it still goes to show that they gave zero fucks about even TRYING TO LOG IN with Safari when they were developing it. You know, the browser that every Mac and iOS device ships with?

So, you have to roll back the controller to get out of this mess. Doing that wipes your config. Fortunately for me, I discovered this during my early shakedown testing at my own residence before hauling it out to the site, and there was no actual config to lose.

Next up, their NAT implementation is just plain obnoxious. Typically with this kind of stuff, if you fire a packet from the inside to the outside, the source address gets changed from RFC 1918 or whatever you're using internally to whatever you have on the outside. That much works. What also happens here on the TP-Link ecosystem is that they mangle your source port, too. This affects both UDP and TCP.
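For comparison, a plain Linux router doing masquerading keeps the original source port whenever it can, and randomizing the ports is something you have to ask for. Roughly like this (eth0 is just a stand-in for the WAN interface):

  # Ordinary masquerading: the source port is preserved when possible
  iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

  # ... versus explicitly opting in to randomized source ports
  iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE --random

The Omada gateway behaves as if everything goes through something like that second rule, all the time.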

Why does this matter? It makes NAT hole-chopping tricks much harder to pull off. Normally, you can do such fun things as configuring WireGuard to punch through from either side by lining up the ports exactly. This will let two sites connect to each other without going through a third fixed spot. This is very handy if that third spot goes down and you need an OMFG backdoor into your networks!
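Concretely, the trick relies on WireGuard using its ListenPort as the source port for outgoing packets. A rough sketch of one side (keys, hostnames, addresses and ports here are all made up) looks like this, with the other site mirroring it:

  # Site A: /etc/wireguard/wg0.conf (site B has the same thing pointing back at A)
  [Interface]
  PrivateKey = <site A private key>
  ListenPort = 51820                    # fixed; also used as the source port

  [Peer]
  PublicKey = <site B public key>
  Endpoint = site-b.example.net:51820   # must match B's ListenPort exactly
  AllowedIPs = 10.99.0.2/32
  PersistentKeepalive = 25              # keeps the NAT mapping warm from both ends

If the gateways pass the source port through untouched, the two mappings line up and the sites find each other directly. If a gateway rewrites the port, the Endpoint the other side is aiming at no longer matches reality, and you're stuck.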

This does not work if the source ports change. At that point, you have to resort to all kinds of nasty birthday paradox type stuff to figure it out, and that requires Actual Work to pull it off and keep it working. Me, I don't want to put Tailscale everywhere. But I digress.

Last week, something very bad happened that I haven't managed to troubleshoot since I'm remote and can only do limited things from here. HomeKit stuff stopped working. By that, I mean that viewing the home from off the local wifi said the usual "no hubs online" thing. But, stranger still, HomeKit *clients* on that wifi also couldn't connect *outward* to other spots! They, too, got the same thing about no hubs about other HomeKit locations... even when those locations were actually fine and worked for other people.

The only commonality was crossing that Omada-powered network. I had some luck in this case since there's a Mac out there which I can hop into and beat into submission, and beat I did. I figured maybe it was something goofy about the routing to Apple's cloud stuff, and started shunting all of the traffic through a tunnel. Nothing helped ... until I also switched DNS resolution on that Mac to something I controlled instead of using whatever resolver is inside the TP-Link gateway box.

Once I did that, it started working again. Even after I turned off the tunneling, it kept going. This was enough for me. I stood up unbound on a couple of Raspberry Pis out there and changed the DHCP config to make sure clients would resolve things through them instead of the ER8411 gateway. It took a while, but eventually, everything stopped being stupid.
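For the curious, the unbound side of that is tiny. Something like this is all it takes to do full recursion locally instead of forwarding through the gateway (the listen address and LAN subnet are placeholders, and the path is the usual Debian-ish layout on a Pi):

  # /etc/unbound/unbound.conf.d/lan.conf
  server:
      interface: 192.168.1.53
      access-control: 192.168.1.0/24 allow
      do-ip4: yes
      do-udp: yes
      do-tcp: yes
      # no forward-zone: resolve straight from the roots, not via the ER8411

The DHCP change is just pointing the LAN's DNS server option at the Pis instead of the gateway.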

Now, big caveat here: I don't know 100% that it was the resolver in the thing. I wasn't on site, and could only do so much without kicking myself out of the network, since my access came in through those very devices. Also, my troubleshooting abilities are limited with this crap for yet another reason I'll get to later.

Then there's what happened this morning. One of my Pis behind this setup decided it wasn't going to run one of its WireGuard links. The other link on the same interface (going to another external host) was fine. The other link on the other interface was fine. The other Pi's two links were also fine.

It was just this one particular association that wasn't working. So, into tcpdump I went yet again, looking at it from both sides of the link. The exchange I saw from inside looked like this over and over:

their_internal_ip.AAAAA -> my_external_box.BBBBB: udp data
(no reply)

But, from the outside world, it looked like this:

their_external_ip.CCCCC -> my_external_box.BBBBB: udp data
my_external_box.BBBBB -> their_external_ip.CCCCC: udp data
their_external_ip -> my_external_box: ICMP port CCCCC unreachable

So yeah, even though it had JUST sent traffic to me from that port, upon reply, the gateway box was rejecting it. This to me says "really terrible IP connection tracking setting and/or implementation that dropped the association and is somehow not picking it back up".

This WG link has a keepalive on both ends, so there's no excuse for this. The association should be established in the gateway's connection tracking as soon as a packet goes out, as one did above. But the ICMP error indicates otherwise.

Note that the port-unreachable error is not coming from the Pi itself. The Pi was only sending actual traffic and had no idea why it wasn't getting any responses. WG won't switch source ports by itself, so it just keeps smacking its head into the wall ... over and over and over.

And this brings me to the final point of frustration: I wanted to ssh into the damn gateway to see what they were doing to screw things up so badly. It took a while to find the knob to enable ssh, and once that was on, I found the ultimate insult: it's a completely neutered interface. You can't do anything useful. It's busybox, the "ip" utility, and something that apparently lets you point it at a controller for when the adoption process doesn't work.

su? sudo? Forget about it. You don't even have /proc in there - so no ps, no w. You can't run dmesg because it doesn't exist (and they probably lock down the kernel ring buffer anyway). You are a luser, and you will never be able to do anything useful from this setup.
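For contrast, on a Linux gateway with an actual shell, checking and clearing a stale UDP association is about two commands with conntrack-tools (51820 here is a stand-in for the real WG port):

  # Is there a conntrack entry for the tunnel's UDP traffic?
  conntrack -L -p udp --orig-port-dst 51820

  # If it's stale, drop it and let the next outbound packet rebuild the mapping
  conntrack -D -p udp --orig-port-dst 51820

That's the kind of thirty-second diagnosis this box makes impossible.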

When pressed, tech support tells people that such things are unsupported when using the controller - that is, the dumb pointy clicky web-based UI that takes the setup and pushes it out to the devices. You know, the one that broke on Safari in the latest version. They're locking you out _on purpose_.

Finally, I haven't run into this one yet since the ISP for this site is still in the dark ages in terms of providing access to the ENTIRE Internet, but it sounds like they don't do any sort of IPv6 firewalling. So, if your ISP switches that on and you suddenly get an allocation, look out world! It's the wild west on your network!
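For reference, the baseline you'd expect any gateway to apply on its own is a stateful default-deny on forwarded IPv6. In nftables terms, something like this sketch (the interface name is a placeholder):

  # Minimal stateful IPv6 forwarding policy: drop unsolicited inbound
  table ip6 filter {
      chain forward {
          type filter hook forward priority 0; policy drop;
          ct state established,related accept
          meta l4proto icmpv6 accept
          iifname != "wan0" accept    # let LAN-originated traffic out
      }
  }

If the reports are right and Omada skips anything like that, every v6-addressed device on the LAN is reachable from the entire internet the moment the ISP flips on an allocation.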

So, let's recap the suckiness here.

0. The switch is stupidly noisy.

1. Their latest version of the controller just does not work in Safari.

2. You can't easily roll back the controller when it does suck. You'd better save the config from the old version before you upgrade, just in case you ever have to go back. And, if you never ran that particular old version, you're doubly screwed.

3. Their NAT implementation mangles source ports needlessly. Sure, some scenarios call for it. They do it constantly.

4. *Something* broke HomeKit comms really badly, and switching recursive DNS services for clients away from whatever the gateway box provides fixed it. It's probably some terrible DNS forwarder implementation but I have no way to be sure at this point.

5. The NAT apparently dropped an assoc this morning and never put it back. I couldn't get my tunnel going until I restarted it to pick a new source port on the client. Completely ridiculous.

6. Forget about ssh to troubleshoot things. The hood is welded shut. You will never know what's really going on when one of the other items decides to rear up and bite you in a sensitive place.

7. They apparently have no IPv6 firewalling based on what other people have reported in various places. (This is the only one I haven't actually encountered myself... yet.)

So now what? I'm honestly looking at returning to what I was doing in the 90s: building my own Linux boxes with enough horsepower to handle the networks in question. It worked then and it'll work again now. Things will still break, but at least I'll be able to use my actual experience to do something useful about it. Right now, I can do nothing. My hands are tied.

Why did I think these clowns had any idea what to do? I've been both inside and outside of this world, and it's pretty clear that they do not. Just look at how awful these products really are.

