Web Scraping 🔍🔥

17d ago

24 replies

Scraping public data from the web, transforming it, and using it for a new product can become a very successful business.

What kind of web scraping projects have you worked on and which tools did you use?

Replies

Bertha Kgokong

Software Developer | Entrepreneur

(1) Scraping job listing websites and creating your own product, mailing list, etc. for job hunters. Tools: Python, Selenium, Beautiful Soup
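
For readers new to this, here is a minimal sketch of the parsing step in a job listing scraper. It is not Bertha's code: it swaps Beautiful Soup for Python's built-in `html.parser` so it runs without extra installs, and the markup, the `job` class name, and the helper names are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup; a real job board will look different.
SAMPLE_HTML = """
<ul>
  <li class="job"><a href="/jobs/1">Backend Developer</a></li>
  <li class="job"><a href="/jobs/2">Data Engineer</a></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
"""


class JobListingParser(HTMLParser):
    """Collects (title, href) pairs for links inside <li class="job">."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._in_job = False   # True while inside a job <li>
        self._href = None      # href of the link currently being read
        self._text = []        # text fragments of that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self._in_job = "job" in (attrs.get("class") or "").split()
        elif tag == "a" and self._in_job:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a link inside a job item.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.jobs.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "li":
            self._in_job = False


def extract_jobs(html):
    """Return a list of (title, href) tuples found in the given HTML."""
    parser = JobListingParser()
    parser.feed(html)
    return parser.jobs


print(extract_jobs(SAMPLE_HTML))
```

On a real site the HTML would come from an HTTP request or from Selenium's page source, and the extracted rows would feed the mailing list or product.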

Nik Hazell

Predictive tools in the AdTech space

I never finished it - but I started a Strava scraping project. I think there's a ton of suuuuper interesting data in there, although I did it for interest's sake, rather than to monetise it.

And yep, like @berthakgokong says - Python, Beautiful Soup, etc.

David Gregorian

Co-Founder of Aplano

@berthakgokong @nik_hazell Also pretty cool. I think collecting data for a while and then figuring out what to do with it later is also not a bad idea. The value of data in general will be rising in the future. Have you tried Puppeteer?

Nik Hazell

Predictive tools in the AdTech space

David Gregorian

Co-Founder of Aplano

@berthakgokong @nik_hazell You should check it out. The usability is pretty good, especially if you use it with TypeScript. It is based on Chromium. All in all, there are some quirks when controlling a headless browser engine, but I think that's not the fault of Puppeteer itself.

Fabian Maume

Founder of Tetriz.io

QApop is built using Node.js, Puppeteer and AWS Lambda. I also have some side income from consulting around Phantombuster.

David Gregorian

Co-Founder of Aplano

@fabian_maume QApop looks really good! Thanks for sharing :) Did you already think about applying the same to other (famous) forums?

Stefan Morris

I fight for the users

I had a website that scraped automotive listings and looked at the year, model, mileage, options and price to determine if it was a good deal (this was before everyone was doing it).

I found the whole process of scraping messy and a bit shady (listing sites really wanted to protect their data) so I eventually abandoned it. Data ownership is a very messy subject which I decided to avoid completely.

Decided to build a CMS instead - no reliance on external data :) It is currently in private release and I think it offers quite a few competitive features that separate it from the competition.
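
As a rough illustration of the "is it a good deal" check Stefan mentions above, here is a minimal sketch. It is not his method: it only compares each price with the median price of listings sharing the same model and year, and it ignores mileage and options entirely. The sample data, the 15% threshold, and the function name are assumptions.

```python
from collections import defaultdict
from statistics import median

# Made-up listings for illustration only.
LISTINGS = [
    {"id": "a", "model": "Civic", "year": 2015, "price": 11000},
    {"id": "b", "model": "Civic", "year": 2015, "price": 10500},
    {"id": "c", "model": "Civic", "year": 2015, "price": 8500},
    {"id": "d", "model": "Golf", "year": 2016, "price": 13000},
]


def find_good_deals(listings, discount=0.15, min_comparables=3):
    """Return ids of listings priced at least `discount` below the median
    price of listings with the same model and year."""
    groups = defaultdict(list)
    for item in listings:
        groups[(item["model"], item["year"])].append(item)

    deals = []
    for items in groups.values():
        if len(items) < min_comparables:
            continue  # too few comparable listings to judge
        typical = median(i["price"] for i in items)
        for i in items:
            if i["price"] <= typical * (1 - discount):
                deals.append(i["id"])
    return deals


print(find_good_deals(LISTINGS))
```

A more serious version would probably adjust for mileage and options before comparing prices, which is where most of the work would be.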

David Gregorian

Co-Founder of Aplano

@stefan_morris Yes it can be messy. Especially the data ownership. But it's not illegal in general. It really depends on the use-case.

With which tech stack are you building the CMS?

Stefan Morris

I fight for the users

@david_gregorian I agree, it's not necessarily illegal but depending on the site, it can break their Terms of Use agreement, which is where it can get messy.

My CMS is a SaaS platform built with Vue/Nuxt and MongoDB. I'm still ramping up but there's a bit of information on my website (check out the docs) at https://shustudios.com

I'm currently looking for a few beta testers.

David Gregorian

Co-Founder of Aplano

@stefan_morris Is your CMS completely headless? For example like Contentful?

Stefan Morris

I fight for the users

@david_gregorian Yes, it is! It uses a REST API, but you can define the endpoints yourself in the CMS, as well as what data it should return. This gives you the best of both worlds between a REST API and a GraphQL API in my opinion.

Amirali Nurmagomedov

Co-founder @ AnnounceKit

I remember my rookie days at coding. I was usually doing a lot of parsing, mostly bots fetching videos from various web sources. Everything done with preg_match function in PHP 🥲
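
For anyone curious what that style of parsing looks like, here is a minimal sketch of regex-based link extraction. It uses Python's `re` module as a stand-in for PHP's `preg_match`, and the page fragment, the file extensions, and the function name are assumptions, not Amirali's code.

```python
import re

# Hypothetical page fragment.
PAGE = '''
<a href="https://example.com/videos/intro.mp4">Intro</a>
<a href="https://example.com/about.html">About</a>
<a href='https://example.com/videos/demo.flv'>Demo</a>
'''

# Capture the value of any href attribute that ends in .mp4 or .flv.
VIDEO_LINK = re.compile(r'''href=["']([^"']+\.(?:mp4|flv))["']''', re.IGNORECASE)


def find_video_urls(html):
    """Return every video URL found in href attributes, in page order."""
    return VIDEO_LINK.findall(html)


print(find_video_urls(PAGE))
```

Patterns like this break as soon as the markup changes, which is a big part of why HTML parsers took over from regexes.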

David Gregorian

Co-Founder of Aplano

@amirali_nurmagomedov Damn that's old school :P How long ago was that?

Amirali Nurmagomedov

Co-founder @ AnnounceKit

@david_gregorian it was 2006-2007, damn 16 years ago :(

Victor G. Björklund

Building Remote Teams @ Jawdropping.io

Job websites, company databases, Google SERPs, booking sites, etc. Mostly using google scrapy.

David Gregorian

Co-Founder of Aplano

@victorbjorklund What do you mean by google scrapy?

Renat Gabitov

Funny thing, I scraped the "Top Most Upvoted Products" using Bardeen.ai (our tool). It worked really nicely.

BUT I wanted to figure out which month is the best to launch, and it turns out they haven't updated that page, so now I gotta scrape all the products.

https://www.producthunt.com/e/50...

Let's see where this takes me.

David Gregorian

Co-Founder of Aplano

@renat_gabitov Haha I also thought about it once. Can't you use the GraphQL API of Product Hunt? I think it is not public...

Jared Wright

Student and Developer

https://Metaheads.xyz - search engine for fb comments. nodejs + selenium :)

David Gregorian

Co-Founder of Aplano

@jawerty Looks awesome! Does it store all the scraped data on a custom db? Or is there something happening on the fly, when doing a search?

Brandon

Some Projects – LinkedIn, Salesforce (AppExchange), GitHub, Amazon, Food Inspection Scores (Texas), Google, Government Data Sets, Craigslist, Library, lots of sites...

Tools (that I like) – ScrapeStorm, Import.io, ParseHub, Octoparse, Scrapy, RPA Tools (UiPath, Automation Anywhere, etc.), Selenium, CLI (wget, curl, shell scripts)...

Tools vary depending upon the task - I haven't found one tool that I can consistently use for everything.

Scott K Wilder

I love products

I would like to scrape LinkedIn comments from a post. How can I do this?
