regex: Scrapy LinkExtractor - which RegEx to follow?

Sunday, June 28, 2015


I'm trying to scrape a category from Amazon, but the links I get in Scrapy differ from the ones in the browser. I am now trying to follow the next-page trail, and in Scrapy (I wrote response.body to a text file) I see these links:

<span class="pagnMore">...</span>
<span class="pagnLink"><a href="/s?ie=UTF8&page=4&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011" >4</a></span>
<span class="pagnCur">5</span>
<span class="pagnLink"><a href="/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011" >6</a></span>
<span class="pagnMore">...</span>
<span class="pagnDisabled">20</span>
<span class="pagnRA"> <a title="Next Page"
                   id="pagnNextLink"
                   class="pagnNext"
                   href="/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011">
<span id="pagnNextString">Next Page</span>

I'd like to follow the "Next Page" link (the anchor that contains pagnNextString), but my spider doesn't even start crawling:

Rule(SgmlLinkExtractor(allow=("n\%3A2619533011\%", ),restrict_xpaths=('//*[@id="pagnNextLink"]',)) , callback="parse_items", follow= True),

If I get rid of the rule or use something like '^http.*', it works, but then it follows everything. What am I doing wrong here?
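
One way to narrow this down is to test the allow pattern against the href outside Scrapy. The following is only a minimal sketch, not a fix: it assumes the allow pattern is applied as a regex search against the absolute URL, and it uses a hypothetical base URL (the post does not say which Amazon domain was crawled). The helper name pattern_matches is also made up for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical base URL; the original post does not state the crawled domain.
BASE_URL = "http://www.amazon.com/"

# href copied from the "Next Page" anchor in the response body above.
NEXT_HREF = (
    "/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies"
    "%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011"
)

# Same pattern as in the rule, written as a raw string.
ALLOW_PATTERN = r"n\%3A2619533011\%"


def pattern_matches(pattern, href, base=BASE_URL):
    """Return True if the pattern is found anywhere in the absolute URL."""
    absolute_url = urljoin(base, href)
    return re.search(pattern, absolute_url) is not None


if __name__ == "__main__":
    print(pattern_matches(ALLOW_PATTERN, NEXT_HREF))
```

If the pattern matches here, the regex is probably not what keeps the spider from crawling, and the restrict_xpaths expression or the response the spider actually receives would be the next things to check. Scrapy may also normalise URLs before matching, so this is a first check only.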

regex
