

Source: https://www.codesd.com/item/scrapy-brings-several-spiders-sharing-the-same-elements-pipeline-and-parameters-but-with-separate-outputs.html

Scrapy: Running several spiders sharing the same items, pipeline, and settings, but with separate outputs


I am trying to run multiple spiders using a Python script based on the code provided in the official documentation. My Scrapy project contains multiple spiders (Spider1, Spider2, etc.) that crawl different websites and save each website's content in a separate JSON file (output1.json, output2.json, etc.).

The items collected from the different websites share the same structure, so the spiders use the same item, pipeline, and settings classes. The output is generated by a custom JSON class in the pipeline.

When I run the spiders separately they work as expected, but when I use the script below to run them through the Scrapy API, the items get mixed in the pipeline: output1.json should only contain items crawled by Spider1, but it also contains items from Spider2. How can I crawl multiple spiders through the Scrapy API using the same items, pipeline, and settings, but generate separate outputs?

Here is the code I used to run multiple spiders:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks until both crawls finish

Example output1.json:

{
"Name": "Thomas",
"source": "Spider1"
}
{
"Name": "Paul",
"source": "Spider2"
}
{
"Name": "Nina",
"source": "Spider1"
}

Example output2.json:

{
"Name": "Sergio",
"source": "Spider1"
}
{
"Name": "David",
"source": "Spider1"
}
{
"Name": "James",
"source": "Spider2"
}

Normally, all the names crawled by Spider1 ("source": "Spider1") should be in output1.json, and all the names crawled by Spider2 ("source": "Spider2") should be in output2.json.
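One common way to keep the outputs separate with a single shared pipeline is to open a distinct file per spider, keyed on spider.name. A minimal sketch, assuming the class name and file-naming scheme are illustrative rather than the asker's actual pipeline:

```python
import json


class PerSpiderJsonPipeline:
    """Shared item pipeline that writes each spider's items to its own
    JSON Lines file, named after the spider (e.g. spider1.json)."""

    def open_spider(self, spider):
        # Scrapy instantiates one pipeline object per crawler, so each
        # spider gets its own file handle and items never mix.
        self.file = open(f"{spider.name}.json", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

Registering this class in ITEM_PIPELINES lets all spiders share the same pipeline code while each one writes only its own items.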

Thank you for your help!


According to the docs, to run spiders sequentially on the same process you must chain the deferreds.

Try this:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)  # Spider2 starts only after Spider1 finishes
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called
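In newer Scrapy versions (2.1+) there is also a way to avoid custom pipeline code entirely: each spider can declare its own feed export via custom_settings, and the built-in feed exporter routes items to the right file even when spiders run concurrently. A hedged configuration sketch, with class bodies reduced to the relevant setting:

```python
import scrapy


class Spider1(scrapy.Spider):
    name = "spider1"
    # Built-in feed exports write this spider's items to its own file.
    custom_settings = {
        "FEEDS": {"output1.json": {"format": "json"}},
    }


class Spider2(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        "FEEDS": {"output2.json": {"format": "json"}},
    }
```

With this, the original CrawlerProcess script from the question works unchanged, since output separation is handled per spider rather than in a shared pipeline.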

