Scraply

Scraply is a simple DOM scraper that fetches information from any HTML-based website using a jQuery-like syntax and turns that information into JSON APIs.

How does it work?

It works by defining some macros/endpoints in HCL format, then letting the magic begin. Here is an example:

# /scraply
macro scraply {
    // the url to scrape
    // we will scrape the scraply github page and get information from it
    url = "https://github.com/alash3al/scraply"

    // cache [time to live] in seconds
    // set it to any value < 1 to disable it.
    ttl = 120

    // code to be executed
    //
    // this is javascript code;
    // you must set your return value in the exports variable
    exec = <<JS
        exports = {
            // fetching the title
            // similar to jQuery, right?
            title: $("title").Text(),
            description: $('meta[name=description]').AttrOr('content', '')
        }
    JS

    // schedule this macro to run at the specified cron style spec
    // it extends the standard cron spec with an additional leading
    // field to support seconds.
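    // e.g. "*/30 * * * * *" would run this macro every 30 seconds
    // (an illustrative spec, assuming standard six-field cron semantics)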
    schedule = "* * * * * *"

    // notify an endpoint with the result
    // the payload is a json object just like: {"error": "an error if any", "result": "the result will be here"}
    webhook = "http://some.endpoint.com"

    // set this to true if you don't want to expose this macro via the api
    private = true

    // our $(..).Method() is just like jQuery's $(..).method()
    // our $(..).Method() is an alias for document.Find(..).Method()
    // 
    //  here is a table showing jQuery methods and their Scraply equivalents:
    //
    //  jQuery              :   Scraply
    //  -------------           ---------------
    //  $(..).first()       :   $(..).First()
    //  $(..).html()        :   $(..).Html()
    //  $(..).text()        :   $(..).Text()
    //  $(..).last()        :   $(..).Last()
    //  $(..).find()        :   $(..).Find()
    //  $(..).attr()        :   $(..).Attr() | $(..).AttrOr(needle, defaultValue)
    //  $(..).children()    :   $(..).Children()
    //  $(..).prev()        :   $(..).Prev()
    //  $(..).next()        :   $(..).Next()
    //  $(..).has()         :   $(..).Has()

    // you also have the following functions in the js context:
    // println()/console.log() print to stdout
    // time() returns the current timestamp
    // sleep(ms) pauses execution for the given number of milliseconds
    // macro(macro_name) executes the specified macro and returns its result
}
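
# (an illustrative extra macro, not part of the original example:
#  it sketches how the helper functions listed above can be combined)
macro helpers_demo {
    url = "https://github.com/alash3al/scraply"
    ttl = 60

    // keep this demo off the public api
    private = true

    exec = <<JS
        // log progress to stdout
        println("started at: " + time())

        // pause for half a second before reading the dom
        sleep(500)

        exports = {
            fetchedAt: time(),
            title: $("title").Text()
        }
    JS
}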

# /sqler
macro sqler {
    url = "https://github.com/alash3al/sqler"
    ttl = 120
    exec = <<JS
        exports = {
            title: $('title').Text(),
            description: $('meta[name="description"]').AttrOr('content', '')
        }
    JS
}

# /redix
macro redix {
    url = "https://github.com/alash3al/redix"
    ttl = 120
    exec = <<JS
        exports = {
            title: $('title').Text(),
            description: $('meta[name="description"]').AttrOr('content', '')
        }
    JS
}

# aggregate ?
macro all {
    exec = <<JS
        exports = {
            redis: macro("redix"),
            sqler: macro("sqler")
        }
    JS
}
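
When a macro sets the webhook option, the result is POSTed to that endpoint as JSON in the shape noted in the comments above. For the scraply macro the payload would look roughly like this (a sketch with placeholder values; whether "error" is empty or omitted on success is an assumption):

    {
        "error": "",
        "result": {
            "title": "<page title>",
            "description": "<meta description content>"
        }
    }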

Why?

I wanted a tool that fetches the required information from web pages in a simple way. I'm using it for the following cases:
  • Scraping data from currency rates websites
  • Scraping product pricing data from e-commerce sites
  • Scraping news from news websites
  • Scraping search data
  • and more use cases ...

Features

  • Tiny & Portable Engine.
  • You can scale & distribute it easily.
  • Private/Public Macros.
  • Cron-like scheduler (with seconds precision).
  • Webhook Support.
  • jQuery-like API.
  • Customize everything in JavaScript.

How?

  • Download the binary that fits your OS from the project's GitHub releases page
  • Create a configuration file, e.g. scraply.hcl
  • Run it: ./path/to/downloaded/scraply --config=./scraply.hcl --listen=:9080
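
Once it's running, every public macro is exposed as a JSON endpoint named after the macro, as the "# /scraply"-style comments in the config above suggest. A quick sanity check might look like this (a sketch; the exact response envelope may differ):

    curl http://localhost:9080/sqler
    # => {"title": "...", "description": "..."}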
