Design and Implementation of Web Information Extraction System Based on Crawler
Conference: EMIE 2022 - The 2nd International Conference on Electronic Materials and Information Engineering
04/15/2022 - 04/17/2022 at Hangzhou, China
Proceedings: EMIE 2022
Pages: 6
Language: English
Type: PDF
Authors:
Xie, Wenjia; Zheng, Wenbin; Ting, Yan (School of Information and Communication, National University of Defense Technology, Wuhan, China)
Tang, Ping (Cloud network Operation Center, China United Telecommunications Corporation, Xi’an, Shaanxi, China)
Abstract:
Web crawler technology is an important means of obtaining information from the network, but extracting the relevant information from the pages a crawler fetches remains a challenge. After surveying domestic and international crawler technology and studying the Scrapy framework and its working principle, this paper designs and implements a topic-specific (theme) web crawler system for web information acquisition. Based on the functional and non-functional requirements of the system, the system builds dedicated collection rules for each digital periodical resource, converts the web resources of the digital periodical into a node tree, and obtains the download URLs of the required resources directly from the node tree of the current layout, so that the scattered resources can be processed and reorganized in a centralized and unified manner.
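The extraction step described in the abstract (per-source collection rules, a node tree, and download URLs read from that tree) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the paper's Scrapy-based implementation. The names `Node`, `TreeBuilder`, `RULES`, and `extract_download_urls`, the `journal-a.example` and `journal-b.example` hosts, and the CSS class names are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


@dataclass
class Node:
    tag: str
    attrs: dict
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a simple node tree from an HTML page."""

    def __init__(self):
        super().__init__()
        self.root = Node("document", {})
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def walk(node):
    """Yields the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


# Hypothetical collection rules: one entry per periodical source (assumed layout).
RULES = {
    "journal-a.example": {"tag": "a", "class": "pdf-link"},
    "journal-b.example": {"tag": "a", "class": "download"},
}


def extract_download_urls(html, base_url, rule):
    """Returns absolute download URLs for nodes that match the given rule."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    urls = []
    for node in walk(builder.root):
        classes = (node.attrs.get("class") or "").split()
        href = node.attrs.get("href")
        if node.tag == rule["tag"] and rule["class"] in classes and href:
            urls.append(urljoin(base_url, href))
    return urls
```

Keeping the rules in a table separate from the tree traversal means a new periodical source would only need a new rule entry, which is one plausible reading of "corresponding collection rules for different digital periodical resources". In a Scrapy project the same role would typically be played by XPath or CSS selectors applied to the response.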