Journal of Jianghan University (Natural Science Edition), 2018, Vol. 46, Issue (6): 522-527. doi: 10.16389/j.cnki.cn42-1737/n.2018.06.006

• Computer Science •

Collection Method of Network Information for Traditional Chinese Medicinal Materials Based on Scrapy

ZHANG Xihong, WANG Yuxiang

  1. Bozhou Vocational and Technical College, Bozhou 236800, Anhui, China
  • Online: 2018-12-28  Published: 2018-11-29
  • About the author: ZHANG Xihong (born 1983), male, lecturer, master's degree; research interests: data mining and analysis.
  • Funding:
    Anhui Province Young Talents Support Program Project (gxyq2018215); Major Natural Science Research Project of Anhui Higher Education Institutions (KJ2016SD41)

Abstract: Data on Chinese herbal medicines published on the internet is currently growing by the tens of thousands. Mining the latent relationships behind these data, and establishing commodity-specification and price early-warning mechanisms, is of great significance for keeping the market running smoothly and in good order. Taking information collection from the Chinese herbal medicine website Zhongyaocai Tiandi (中药材天地) as an example, a spider based on the Scrapy framework was designed to extract the name, specification, origin, price, and related information of Chinese herbal medicines. First, the structure of the target page was analyzed with the browser's element-inspection tool, and the XPath of each target element was extracted. Next, a web spider project was built with the Scrapy framework, with the parsing rules for the target elements and the storage methods for the extracted elements defined in their corresponding files. Finally, the spider was tested by collecting information from the target website; taking American ginseng (Panax quinquefolium) and Panax notoginseng as examples, the data collected online were compared with data gathered through offline field surveys. The results show that the designed spider obtains the target website's information quickly, efficiently, and accurately, that the collected data are consistent with the offline field-survey data, and that the spider can provide data support for subsequent research.

Key words: Scrapy, traditional Chinese medicinal materials, spider

CLC number:
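
The extract-then-store workflow summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' spider: the page markup, class names, field names, and placeholder values are all hypothetical, since the abstract does not give the target site's structure. To stay runnable without installing Scrapy, the sketch evaluates the XPath expressions with Python's standard-library ElementTree (which supports only a subset of XPath); in an actual Scrapy spider the same kind of expression would be passed to `response.xpath(...)` inside the parse callback, and storage would normally be handled by an item pipeline.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one price-listing page.
# All class names and cell values are made-up placeholders, not site data.
SAMPLE_PAGE = """
<html><body>
<table id="prices">
  <tr class="row">
    <td class="name">Panax quinquefolium</td>
    <td class="spec">spec-A</td>
    <td class="origin">origin-A</td>
    <td class="price">1.0</td>
  </tr>
  <tr class="row">
    <td class="name">Panax notoginseng</td>
    <td class="spec">spec-B</td>
    <td class="origin">origin-B</td>
    <td class="price">2.0</td>
  </tr>
</table>
</body></html>
"""

# Fields named in the abstract: name, specification, origin, price.
FIELDS = ("name", "spec", "origin", "price")


def parse_rows(page):
    """Apply one XPath per field to every listing row and return dict items."""
    root = ET.fromstring(page.strip())
    items = []
    for row in root.findall(".//tr[@class='row']"):
        item = {}
        for field in FIELDS:
            cell = row.find("./td[@class='%s']" % field)
            has_text = cell is not None and cell.text
            item[field] = cell.text.strip() if has_text else ""
        # Convert the price to a number only when a value was found.
        if item["price"]:
            item["price"] = float(item["price"])
        items.append(item)
    return items


def to_csv(items):
    """Serialize the items to CSV text (stand-in for a storage step)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()


ITEMS = parse_rows(SAMPLE_PAGE)
CSV_TEXT = to_csv(ITEMS)
print(CSV_TEXT)
```

Keeping the parsing rules (`parse_rows`) separate from the storage step (`to_csv`) mirrors the split the abstract describes, where the two are defined in separate files of the Scrapy project.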