
Show HN: Curl modified to impersonate Firefox and mimic its TLS handshake

 2 years ago
source link: https://news.ycombinator.com/item?id=30378562


I run a MITM proxy for ad blocking and general filtering, and within the past little while I've noticed Cloudflare and other "bot protection" getting me blocked out of more and more of the sites I come across in search results, so this will be very useful for fixing that.

However, I should caution that in this era of companies being particularly user-hostile and authoritarian, especially Big Tech, I would be more careful with sharing stuff like this. Being forced to run JS is bad enough; profiling users based on other traits, and essentially determining if they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant warning story.

Cloudflare is likely one of the worst things that has happened to the internet in recent history.

Like, I get the need for some protective mechanisms for interactive content/posting/etc, but there should be zero cases where a simple HTTP 200 GET requires javascript/client side crap. If they serve me a slightly stale version of the remote resource (5 minutes/whatnot) that's fine.

They've effectively just turned into a google protection racket. Small/special purpose search/archive tools are just stonewalled.

You can't turn it off as a Cloudflare customer either.

The best you've got is "essentially off", and that wording is deliberate: even with everything disabled there are still edge cases where their security will enforce a JS challenge or CAPTCHA.

At least on their basic plan there is also little to no indication of how often this is triggering, leaving you with no idea what the various settings are actually doing.

Not to be too dismissive of this, but for companies just trying to run a service while being constantly bombarded by stuff like DDoS attacks, Cloudflare and its ilk let them serve a large portion of "legitimate" users, compared to none.

I don't really know how you resolve that absent just like... putting everything behind logins, though.

What you're describing is solved by caching, not requiring JS crap.

So give me an xxx-minute-old stale cached version. I'm fine with that.

Until you get misbehaving users crawling every page.

Rate limiting works, and isn't discriminatory against "non-standard" clients.

> If they serve me a slightly stale version of the remote resource (5 minutes/whatnot) that's fine.

Not all sites are configured to do this. Some pages are expensive to render and have no cache layer.

I get that; my point is that this is exactly the problem.

They solve the DDoS issue by requiring JS CAPTCHAs (which fundamentally breaks the way the internet should work), rather than serving a cache of the page to reduce load on the real host.

Requiring JS doesn't disambiguate between well-behaved automated (or headless; I used a custom proxy for a lot of my content browsing) user agents and malicious users, it breaks /all/ of them.

Some people shoot themselves in the foot, yes. There is no reason not to have some amount of microcaching; even a very short TTL puts an upper limit on the request rate per resource behind the caching layer.

I've noticed even GitHub has a login wall now for comments on open source projects. They truncate them if you aren't logged in, similar to Reddit on mobile, Instagram, Twitter, etc. Hopefully the mobile version doesn't start pushing you to install some crappy app where you can't use features like tabbed browsing, tab sync with another machine, etc.

The reasoning behind that might be the myriad of scrape-and-publish SEO spam pages with GitHub content.

Not sure I'm buying that excuse. I think they want to nudge people to make accounts and log in. Really shady in the case of GitHub and many other sites that are successful because of user content, in my opinion.

Ironically, that makes the scraped copies more useful because they aren't truncated (at least for older pages) and I can actually get all the content. I wonder if that might be at least part of why Google seems to be giving them more weight in search result rankings.

> profiling users based on other traits, and essentially determining if they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant warning story.

Right to Read indeed... fanfiction.net has become really annoying over the last few months. Especially at night, when you have the FFN UI set to dark and then, out of nothing, a bright white Cloudflare page appears. Or the way the Cloudflare "anti-bot" protection leads to an endless loop when the browser is the Android WebView inside a third-party Reddit client.

Maybe I'm just a techno-optimist, but I suspect big tech companies don't give a hoot about you running "unapproved" software; they care about their services being abused, and "unapproved" software is just a useful signal that only fails a tiny percentage of legit users.

You are a lot more charitable than I am. I believe the big tech companies use dark patterns to get us to sign up, improve their metrics and hoover up our data.

Just trying to keep services operational is a fine goal to pursue as an operator, but forcing users into narrow inbound funnels for the service is detrimental too. More research needs to be done on ways to let simpler modes of operation keep working.

A browser is becoming a universal agent by itself, but many people (maybe increasingly) use the terminal to access these resources, and stonewalling those paths is never OK in my book.

This is a distinction without a difference.

> impersonate Firefox 95

You should really be impersonating an ESR version (e.g. 91). Versions from the release channel are updated every month or so, and everyone has auto-update enabled, so unless you keep it up to date your fingerprint is going to stick out like a sore thumb in a few months. ESR, on the other hand, sticks to one version and shouldn't change significantly during its one-year lifetime. It's still going to stick out to some extent (most people don't use ESR), but at least you have the enterprises that run ESR to blend in with.

They should really be impersonating Chrome. If this takes off, Firefox has such a small user share that I could see sites just banning Firefox altogether, like they do with Tor.

I suspect Tor is being banned not because of its small user share.

You may well get broken sites with Firefox because no one cared. But banning? Seems like a stretch.

Tor is banned (or rather, tar-pitted in endless CAPTCHAs) because the number of legit users is massively dwarfed by the abuse.

If everyone running scrapers and attack probes starts showing up as Firefox, then they'll end up in the same situation.

If there is a lot of abuse masquerading as Firefox, outstripping legit users, they can totally throw up a CAPTCHA for Firefox but not for Chrome. An outright ban isn't the only annoying outcome.

Thanks for the suggestion, I had no idea ESR was a thing. I've just added support for Firefox ESR 91 (it was pretty similar and required adding one cipher to the cipher list and changing the user agent).
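For anyone curious what that kind of change looks like outside the patched build, here is a minimal sketch using stock pycurl/libcurl options. It is illustrative only: the cipher string is hypothetical, and the project itself works by patching curl's TLS (NSS) code rather than just setting options like these.

    import pycurl
    from io import BytesIO

    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "https://example.com")
    # Hypothetical cipher string; a faithful impersonation has to match the
    # browser's exact cipher list and ordering.
    c.setopt(pycurl.SSL_CIPHER_LIST,
             "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:"
             "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305")
    # Firefox 91 ESR user agent string.
    c.setopt(pycurl.USERAGENT,
             "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0")
    c.setopt(pycurl.WRITEDATA, buf)
    c.perform()
    print(c.getinfo(pycurl.RESPONSE_CODE))
    c.close()

The cipher list and user agent are only two of the fields bot protection looks at; TLS extensions, curves and their ordering also feed the fingerprint, which is why the patch goes into the TLS library itself.
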
I think ESR is the way to go too, but either way, I wonder if some tests could be written to confirm the coverage/similarity of the requests? It would entail automating both a Firefox session and the recording of network traffic, and feels like it might end up as bikeshedding.

Cool, can't wait for anti-bot protection to start rejecting me because I use Firefox.

Only a matter of time I'm afraid :( Firefox usage share is already low enough for many sites to make pages for Chrome and maybe Safari only.

Given the relative market shares it might make more sense to impersonate Chrome.

I will try to impersonate Chrome next. However, I suspect this is going to be more challenging: Chrome uses BoringSSL, which curl does not support, so it means either forcing curl to compile with BoringSSL or modifying NSS to look like BoringSSL.

And make it seem like Firefox has less market share? Sounds like a good way to kill Firefox even faster, my 2 cents.

The counter-argument is service providers just choosing to block anything that looks like Firefox, since the market share is so small and it's being used to circumvent their precious protections.

Whilst it's not as big as it once was, the idea of a service provider blocking all Firefox user agents is still ludicrous to the point that I can't believe you're not trolling here.

"Some web services therefore use the TLS handshake to fingerprint which HTTP client is accessing them. Notably, some bot protection platforms use this to identify curl and block it."

As a user of non-browser clients (not curl though) I have not run into this in the wild.^1

Anyone have an example of a site that blocks non-browser clients based on TLS fingerprint?

1. As far as I know. The only site I know of today that is blocking non-browser clients appears to be www.startpage.com. Perhaps this is the heuristic they are using. More likely it is something simpler I have not figured out yet.
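For a rough idea of how this kind of fingerprinting works, one widely used scheme is JA3: a hash over fields of the TLS ClientHello. A minimal sketch, with made-up numeric values for illustration; real ones come from parsing the handshake on the wire:

    import hashlib

    def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
        # JA3: join the ClientHello fields with commas (each list dash-separated)
        # and take the MD5 of the resulting string.
        fields = [
            str(version),
            "-".join(map(str, ciphers)),
            "-".join(map(str, extensions)),
            "-".join(map(str, curves)),
            "-".join(map(str, point_formats)),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # Hypothetical values: a client offering a different cipher list or
    # extension order gets a different hash, which is how curl can be told
    # apart from Firefox even when the HTTP headers are identical.
    print(ja3_fingerprint(771, [4865, 4866, 49195], [0, 23, 65281, 10, 11], [29, 23, 24], [0]))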

Do you plan on getting this merged back into curl with an option to enable it? I can see that being useful for some people.

I hope to do so in the future; for now the implementation is extremely hacky, so I doubt it would be accepted into curl.

There was a conversation on their mailing list contemplating dropping NSS support: https://curl.se/mail/lib-2022-01/0120.html If you have a use case for NSS in curl, you may want to speak up. Perhaps "I want curl to look exactly like a browser" is a significant use case?

Agreed, it is very important to bring this up on the mailing list. It might also be plausible to make curl look like Chrome if curl had BoringSSL support.

Currently, I cannot think of anything other than "noscript/basic (x)html"/IRC to get us out of this, at least for sites where such protocols are "good enough" to provide their services over the internet. But how? Enlighten the "JavaScript web" brainwashed devs so they realize how toxic what they do is? Regulations (at least for critical sites)? And how do we deal with the other sites, whose devs are scammers, perfectly aware of how toxic they are, and who keep doing it anyway?

In my own country, for critical sites, I will probably have to go to court, since "noscript/basic (x)html" interop was broken in the last few years.

Would be cool if there was something like this for Python. Last time I tried to scrape something interesting I found that one of Cloudflare's enterprise options was easily blocking all of the main HTTP libraries due to the identifiable TLS handshake.

Are you sure they blocked you because of the handshake?

I always thought it was the myriad of cookies, and the expiry times of said cookies, that tend to make non-browser clients more obvious to CF.

The site wasn't using it to block me, just to prompt a captcha, without doing so to 'real' browsers.

The HTTP requests were exact copies of browser requests (in terms of how the server would've seen them), so it was something below HTTP. I ended up finding a lot of info about Cloudflare and the TLS stuff on StackOverflow, with others having similar issues. Someone even made an API to do the TLS stuff as a service, but it was too expensive for me. https://pixeljets.com/blog/scrape-ninja-bypassing-cloudflare...
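For anyone who wants to experiment from Python without a paid service, here is a minimal sketch of the usual workaround, assuming requests/urllib3: mount an adapter that installs an SSLContext with a customized cipher list. The cipher string and the CipherAdapter class are illustrative, this only changes one part of the handshake, and TLS 1.3 suites and extension ordering need more invasive changes.

    import ssl
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.poolmanager import PoolManager

    # Illustrative cipher list; a convincing impersonation has to match a real
    # browser's list and ordering exactly.
    CIPHERS = ("ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:"
               "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305")

    class CipherAdapter(HTTPAdapter):
        """Transport adapter that installs an SSLContext with a custom cipher list."""
        def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
            ctx = ssl.create_default_context()
            ctx.set_ciphers(CIPHERS)
            self.poolmanager = PoolManager(num_pools=connections, maxsize=maxsize,
                                           block=block, ssl_context=ctx, **kwargs)

    s = requests.Session()
    s.mount("https://", CipherAdapter())
    print(s.get("https://example.com").status_code)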

Thanks for the response, I'd never come across that particular behaviour.

FWIW, I think when it comes to 'copy as cURL', the HTTP header ordering may be different, and it's worth loading up a page twice as some of the cookies are replaced.

I've used Puppeteer, as the article talks about. It manages the cookies better: I managed to make continuous requests without getting further CF blocks, as opposed to only a couple of hundred with cURL (due to the cookies differing from what CF expects over time).

IIRC CF does have a sliding scale of how protected you want a site to be, so perhaps the TLS stuff belongs further up the scale.

I think most of the scraping libraries have stagnated since it's hard to scrape without a headless browser these days... too many sites with client-side rendered content.

Very cool! Thanks for sharing - it's always nice to learn about fingerprinting tricks and workarounds, from both a privacy and a "don't unintentionally look like a bot" perspective.

What inspired the project?

Good blog post. Stuff like this makes me wonder if by 2030 (1) the internet will mostly consist of machine generated content; (2) machines written by normal people in Python won't be authorized to access the machine-generated content anymore due to Protectify; (3) most client traffic will originate from Protectify's network, so people like bloggers won't have any visibility into whether their readers are humans or machines; (4) video compression algorithms will become indistinguishable from deepfakes; and (5) airborne pathogens will make alternatives to the above impractical.

This is cool, but is it really needed that often?

There are some industries (virtually all of Wall Street, for example, and certain parts of government) where the company needs to surveil 100% of what their employees do on the web from inside the office. These companies have been running MITM proxies for decades.

Wouldn't any website that rejects a non-browsery TLS client be blocking out these people as well?

They don't block you completely, just present you with a JS challenge that delays your access to the site. A browser, even if behind a MITM proxy, would be able to solve this challenge.

Very cool. I would have used Puppeteer/Playwright in a similar scenario, but thanks for sharing the bot detection trick they employ.

Great work mate, one of my teammates showed me this library and we might use it in the near future.

Handy. Do the TCP handshake, or other details of socket behavior, ever get used for assessing the remote process, and are there in turn libraries written to mimic known patterns?