source link: https://www.codesd.com/item/scrapy-analyze-a-page-to-extract-items-then-track-and-save-the-contents-of-the-article.html

Scrapy - parse a page to extract items - then follow and save the item's URL contents


I have a question about how to do this in Scrapy. I have a spider that crawls listing pages of items. Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield the items. So far so good; everything works great.

But each item has, among other data, a URL with more details about that item. I want to follow that URL and store the fetched contents of the item's URL in another item field (url_contents).

I'm not sure how to organize the code to achieve that, since the two links (the listings link and the particular item link) are followed differently, with callbacks called at different times, yet I have to correlate them in the processing of the same item.

My code so far looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# ExampleItem and ExampleLoader are defined elsewhere in the project

class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        # Follow pagination links and parse each listing page for items
        Rule(SgmlLinkExtractor(allow=(r'example\.com', 'start='), deny=('sort=',),
                               restrict_xpaths='//div[@class="pagination"]'),
             callback='parse_item'),
        # Item detail links are matched but not followed by this rule
        Rule(SgmlLinkExtractor(allow=(r'item\/detail',)), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'

        sub_selectors = main_selector.select(xpath)

        # One item per matched title block on the listing page
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            # ... more add_xpath() calls for the remaining fields ...
            yield l.load_item()


After some testing and thinking, I found a solution that works for me. The idea is to use just the first rule, which gives you the listings of items, and, very importantly, to add follow=True to that rule.
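A minimal sketch of that change, reusing the extractor patterns from the spider above:

rules = (
    # Only the listing rule remains; follow=True keeps the crawler
    # walking the pagination while parse_item() handles each page
    Rule(SgmlLinkExtractor(allow=(r'example\.com', 'start='), deny=('sort=',),
                           restrict_xpaths='//div[@class="pagination"]'),
         callback='parse_item', follow=True),
)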

In parse_item() you then have to yield a request instead of an item, but only after you load the item. The request goes to the item's detail URL, and you have to pass the loaded item along to that request's callback. There you do your work with the response, and that is where you finally yield the item.

So the end of parse_item() will look like this:

itemloaded = l.load_item()

# fill url contents: fetch the detail page and carry the loaded item along
# (Request is scrapy.http.Request; item_url_xpath is the XPath that selects
# this item's detail URL)
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback=self.parse_url_contents)
request.meta['item'] = itemloaded

yield request

And then parse_url_contents() will look like this:

def parse_url_contents(self, response):
    # recover the item that was attached to the request in parse_item()
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item

If anyone has another (better) approach, let us know.

Stefan
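
A possible modernization of the same pattern, not from the original answer: Scrapy 1.7+ provides cb_kwargs, which hands data to the callback as plain keyword arguments instead of going through request.meta. A minimal sketch under that assumption, reusing the names from the question (item_url_xpath stays a placeholder):

# Inside the spider class; requires Scrapy >= 1.7
def parse_item(self, response):
    for sel in response.xpath('//h2[@class="title"]'):
        l = ExampleLoader(item=ExampleItem(), selector=sel)
        l.add_xpath('title', 'a[@title]/@title')
        url = sel.xpath(item_url_xpath).get()
        # cb_kwargs delivers the loaded item to the callback as an argument
        yield response.follow(url, callback=self.parse_url_contents,
                              cb_kwargs={'item': l.load_item()})

def parse_url_contents(self, response, item):
    # response.text is the decoded body; response.body is raw bytes
    item['url_contents'] = response.text
    yield item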

