GooSeeker (集搜客) Web Crawler

Title: How do I scrape pages like these?

Author: thursdayrain Time: 2022-4-28 10:12
This post was last edited by thursdayrain on 2022-4-28 10:16

https://www.scb.se/hitta-statist ... _Tabellerochdiagram

https://www.scb.se/hitta-statist ... pa-arbetsmarknaden/

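On these two pages, entries whose type is Excel are files to download, while entries whose type is diagram or tabell lead to lower-level pages.
How should I set up the rule so that it both downloads the files and collects the lower-level links for hierarchical crawling?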

Author: thursdayrain Time: 2022-4-28 15:05
Task name:
瑞典统计局_教育研究_第2级列表 (Statistics Sweden_Education Research_Level-2 List)

Author: Fuller Time: 2022-4-28 15:33
[attach]15472[/attach]

If you want to use the information in the red box as a locating marker, you can use a custom XPath. Take the attachment_url capture field as an example:

The original XPath was:
td[position()=1]/a/@href

Add one more condition that uses the content of the adjacent td node as the marker:
td[position()=1 and contains(following-sibling::td[1]/text(), 'Excel')]/a/@href

The title capture field is similar: it also needs one extra condition.
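
To make it easier to see what the extra condition selects, here is a minimal sketch in Python that applies the same test to a made-up table. The markup in SAMPLE, the split_links function name, and the hrefs are all assumptions for illustration; the real page structure may differ. The standard library's xml.etree.ElementTree only supports a limited XPath subset without the following-sibling axis, so the sibling check is done in plain Python rather than by running the XPath itself. Rows that are not marked Excel are treated here as lower-level links, which is only one possible way to split them.

```python
import xml.etree.ElementTree as ET

# Hypothetical table markup; the real page structure may differ.
SAMPLE = """
<table>
  <tr><td><a href="/files/a.xlsx">Report A</a></td><td>Excel</td></tr>
  <tr><td><a href="/pages/b">Chart B</a></td><td>Diagram</td></tr>
  <tr><td><a href="/pages/c">Table C</a></td><td>Tabell</td></tr>
</table>
"""


def split_links(table_xml):
    """Return (excel_hrefs, subpage_hrefs) for the rows of the table."""
    root = ET.fromstring(table_xml.strip())
    excel, subpages = [], []
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) < 2:
            continue  # no adjacent td to use as a marker
        link = cells[0].find("a")
        if link is None or link.get("href") is None:
            continue  # first td has no link
        # Same idea as contains(following-sibling::td[1]/text(), 'Excel'):
        # look at the leading text of the td right after the first one.
        marker = cells[1].text or ""
        if "Excel" in marker:
            excel.append(link.get("href"))
        else:
            subpages.append(link.get("href"))
    return excel, subpages


excel_links, subpage_links = split_links(SAMPLE)
print(excel_links)
print(subpage_links)
```

In the sample, only the first row ends up in excel_links, which corresponds to what attachment_url would capture with the extra condition; the other two rows would go to the field that collects lower-level links.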

Welcome to GooSeeker (集搜客) Web Crawler (https://www.gooseeker.com/doc/)