

Source: https://www.codesd.com/item/scrapy-brings-several-spiders-sharing-the-same-elements-pipeline-and-parameters-but-with-separate-outputs.html

Scrapy: Running several spiders sharing the same items, pipeline, and settings, but with separate outputs


I am trying to run multiple spiders using a Python script based on the code provided in the official documentation. My Scrapy project contains multiple spiders (Spider1, Spider2, etc.) that crawl different websites and save each website's content in a separate JSON file (output1.json, output2.json, etc.).

The items collected from the different websites share the same structure, so the spiders use the same item, pipeline, and settings classes. The output is generated by a custom JSON class in the pipeline.

When I run the spiders separately they work as expected, but when I use the script below to run them through the Scrapy API, the items get mixed in the pipeline: output1.json should only contain items crawled by Spider1, but it also contains items from Spider2. How can I crawl multiple spiders through the Scrapy API using the same items, pipeline, and settings, but generate separate outputs?

Here is the code I used to run multiple spiders:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks until both crawls finish

Example output1.json:

{
"Name": "Thomas",
"source": "Spider1"
}
{
"Name": "Paul",
"source": "Spider2"
}
{
"Name": "Nina",
"source": "Spider1"
}

Example output2.json:

{
"Name": "Sergio",
"source": "Spider1"
}
{
"Name": "David",
"source": "Spider1"
}
{
"Name": "James",
"source": "Spider2"
}

Normally, all the names crawled by Spider1 ("source": "Spider1") should be in output1.json, and all the names crawled by Spider2 ("source": "Spider2") should be in output2.json.
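One common way to keep the outputs separate with a single shared pipeline is to open a distinct file per spider, keyed on spider.name. A minimal sketch, assuming the class name and file-naming scheme are illustrative rather than the asker's actual pipeline:

```python
import json


class PerSpiderJsonPipeline:
    """Shared item pipeline that writes each spider's items to its own
    JSON Lines file, named after the spider (e.g. spider1.json)."""

    def open_spider(self, spider):
        # Scrapy instantiates one pipeline object per crawler, so each
        # spider gets its own file handle and items never mix.
        self.file = open(f"{spider.name}.json", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

Registering this class in ITEM_PIPELINES lets all spiders share the same pipeline code while each one writes only its own items.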

Thank you for your help!


According to the docs, to run spiders sequentially on the same process you must chain the deferreds.

Try this:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)  # Spider2 starts only after Spider1 finishes
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called
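In newer Scrapy versions (2.1+) there is also a way to avoid custom pipeline code entirely: each spider can declare its own feed export via custom_settings, and the built-in feed exporter routes items to the right file even when spiders run concurrently. A hedged configuration sketch, with class bodies reduced to the relevant setting:

```python
import scrapy


class Spider1(scrapy.Spider):
    name = "spider1"
    # Built-in feed exports write this spider's items to its own file.
    custom_settings = {
        "FEEDS": {"output1.json": {"format": "json"}},
    }


class Spider2(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        "FEEDS": {"output2.json": {"format": "json"}},
    }
```

With this, the original CrawlerProcess script from the question works unchanged, since output separation is handled per spider rather than in a shared pipeline.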

