tiny & portable dom scraper using jQuery like syntax integrated with schedul...
source link: https://github.com/alash3al/scraply?_001=
Scraply
Scraply is a simple DOM scraper that fetches information from any HTML-based website using jQuery-like syntax and converts that information into JSON APIs.
How it works?
You simply define some macros (endpoints) in HCL format and let the magic begin. Here is an example:
    # /scraply
    macro scraply {
        // the url to scrape
        // we will scrape the scraply github page and get information from it
        url = "https://github.com/alash3al/scraply"

        // cache [time to live] in seconds
        // set it to any value < 1 to disable it.
        ttl = 120

        // code to be executed
        //
        // this is javascript code
        // you must set your return value in the exports variable
        exec = <<JS
            exports = {
                // fetching the title
                // similar to jQuery, right?
                title: $("title").Text(),
                description: $('meta[name=description]').AttrOr('content', '')
            }
        JS

        // schedule this macro to run at the specified cron-style spec
        // it extends the cron format with an additional leading field
        // to support seconds.
        schedule = "* * * * * *"

        // notify an endpoint with the result
        // the payload is a json object just like: {"error": "an error if any", "result": "the result will be here"}
        webhook = "http://some.endpoint.com"

        // set this to true if you don't want to expose this macro to the api
        private = true

        // our $(..).Method() is just like jQuery's $(..).method()
        // our $(..).Method() is an alias for document.Find(..).Method()
        //
        // here is a table that maps jQuery methods to scraply methods:
        //
        // jQuery           : Scraply
        // ---------------- : ----------------
        // $(..).first()    : $(..).First()
        // $(..).html()     : $(..).Html()
        // $(..).text()     : $(..).Text()
        // $(..).last()     : $(..).Last()
        // $(..).find()     : $(..).Find()
        // $(..).attr()     : $(..).Attr() | $(..).AttrOr(needle, defaultValue)
        // $(..).children() : $(..).Children()
        // $(..).prev()     : $(..).Prev()
        // $(..).next()     : $(..).Next()
        // $(..).has()      : $(..).Has()

        // you also have the following functions in the js context:
        // println()/console.log()
        // time()             the current timestamp
        // sleep(ms)          pause the execution for the given number of milliseconds
        // macro(macro_name)  executes the specified macro and returns its result
    }

    # /sqler
    macro sqler {
        url = "https://github.com/alash3al/sqler"
        ttl = 120
        exec = <<JS
            exports = {
                title: $('title').Text(),
                description: $('meta[name="description"]').AttrOr('content', '')
            }
        JS
    }

    # /redix
    macro redix {
        url = "https://github.com/alash3al/redix"
        ttl = 120
        exec = <<JS
            exports = {
                title: $('title').Text(),
                description: $('meta[name="description"]').AttrOr('content', '')
            }
        JS
    }

    # aggregate ?
    macro all {
        exec = <<JS
            exports = {
                redis: macro("redix"),
                sqler: macro("sqler")
            }
        JS
    }
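On the receiving side of a webhook, the payload described in the example's comments ({"error": ..., "result": ...}) could be handled roughly like this. This is only a sketch: the function name parseWebhookPayload is hypothetical, and treating an empty or missing "error" field as success is an assumption, not something the README states.

```javascript
// Minimal sketch of a consumer for the webhook payload shape shown in the
// example config: {"error": "an error if any", "result": "..."}.
// parseWebhookPayload is a hypothetical helper, not part of scraply.
function parseWebhookPayload(body) {
  const payload = JSON.parse(body);

  // Assumption: a missing, null, or empty "error" field means the macro succeeded.
  if (payload.error) {
    throw new Error("scraply macro failed: " + payload.error);
  }
  return payload.result;
}

// Usage with a hand-written sample body (not real scraply output).
const sampleBody = JSON.stringify({
  error: "",
  result: { title: "example title", description: "example description" },
});
console.log(parseWebhookPayload(sampleBody).title);
```

How the body reaches your service (HTTP method, headers) is not covered here, so wire this into whatever server you already use.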
Why?
I wanted a simple tool that fetches the required information from web pages in a simple way. I'm using it in the following cases:
- Scraping data from currency rates websites
- Scraping product pricing data from e-commerce sites
- Scraping news from news websites
- Scraping search data
- there are more use cases ...
Features
- Tiny & Portable Engine.
- You can scale & distribute it easily.
- Private/Public Macros.
- Cron like scheduler.
- Webhook Support.
- jQuery like API.
- Customize everything in JavaScript.
How?
- Download the binary that fits your OS from here
- Create a configuration file, e.g. scraply.hcl
- Run scraply:
    ./path/to/downloaded/scraply --config=./scraply.hcl --listen=:9080
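Once the server is running, a public macro can be queried over HTTP. The sketch below assumes that each public macro is served at /<macro name> on the --listen address (as the "# /sqler" style comments in the example config suggest) and that the response body is JSON; neither is confirmed here. The helper names macroUrl and fetchMacro are hypothetical, and fetchMacro relies on the global fetch available in Node 18+.

```javascript
// Build the URL for a macro endpoint.
// Assumption: public macros are exposed at <base>/<macro name>.
function macroUrl(base, macroName) {
  return base.replace(/\/+$/, "") + "/" + encodeURIComponent(macroName);
}

// Fetch a macro's output and parse it as JSON.
// Assumption: the API answers with a JSON body; its exact shape is not documented here.
async function fetchMacro(base, macroName) {
  const response = await fetch(macroUrl(base, macroName));
  if (!response.ok) {
    throw new Error("request failed with status " + response.status);
  }
  return response.json();
}

// "sqler" is used because the "scraply" macro in the example is marked private.
console.log(macroUrl("http://localhost:9080", "sqler"));
```

To actually call the endpoint, run fetchMacro("http://localhost:9080", "sqler").then(console.log) while the server is listening.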