An extensible parsing pipeline for unstructured data processing

Jain, Shubham; de Buitléir, Amy; Fallon, Enda

Date

2021-03-10

Abstract

Network monitoring and diagnostics systems depict the state of a running system and generate enormous amounts of unstructured data through log files, print statements, and other reports. Analyzing all of these files manually is not feasible, because resources are limited and custom parsers must be developed to convert the unstructured data into the desired file formats. Prior research focuses on rule-based and relationship-based methods for parsing unstructured data into structured file formats; these methods are labor-intensive and need large annotated datasets. This paper presents an unsupervised text processing pipeline that analyzes such text files, removes extraneous information, identifies tabular components, and parses them into a structured file format. The proposed approach is resilient to changes in the data structure, does not require training data, and is domain-independent. We experiment with and compare topic modeling and clustering approaches to verify the accuracy of the proposed technique. Our findings indicate that combining similarity and clustering algorithms to identify data components yields better accuracy than topic modeling.
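The abstract does not name the specific similarity measure or clustering algorithm used, so the following is only an illustrative sketch of the general idea, not the authors' pipeline. It assumes that rows of a table share a similar token "shape", compares shapes with Python's `difflib.SequenceMatcher`, and uses a simple greedy grouping of consecutive lines in place of a real clustering algorithm. The function names, the threshold of 0.8, the minimum of three rows, and the sample log are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: a token counts as numeric if it looks like an int, float, or percentage.
NUMERIC = re.compile(r"[-+]?\d+(\.\d+)?%?")


def signature(line):
    """Coarse shape of a line: 'd' per numeric token, 'a' per other token."""
    return "".join("d" if NUMERIC.fullmatch(tok) else "a" for tok in line.split())


def similar(sig_a, sig_b, threshold):
    """True when two signatures are at least `threshold` similar."""
    return SequenceMatcher(None, sig_a, sig_b).ratio() >= threshold


def find_tabular_blocks(lines, threshold=0.8, min_rows=3):
    """Greedily group consecutive lines whose signatures are similar.

    Runs of at least `min_rows` lines are treated as tabular components;
    everything else is discarded as extraneous text.
    """
    blocks, current = [], []
    for line in lines:
        sig = signature(line)
        if current and sig and similar(signature(current[-1]), sig, threshold):
            current.append(line)
            continue
        if len(current) >= min_rows:
            blocks.append(current)
        current = [line] if sig else []  # blank lines never start a block
    if len(current) >= min_rows:
        blocks.append(current)
    return blocks


def to_rows(block):
    """Split each line of a block into whitespace-separated cells."""
    return [line.split() for line in block]


# Hypothetical log excerpt: free text surrounding a small table.
sample = """\
Starting diagnostics run
eth0 up 1500 0.2
eth1 down 1500 0.0
lo up 65536 0.1
Run finished without errors
""".splitlines()

blocks = find_tabular_blocks(sample)
rows = to_rows(blocks[0])
```

In this sketch the three interface lines share one signature and form a single block, while the surrounding sentences are dropped. A header row made only of words would have a different signature and would need separate handling, which is one reason a real pipeline is likely to need a more robust similarity measure than this one.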

URI

http://research.thea.ie/handle/20.500.12065/3555

Collections

Conferences - PRISM: Polymer, Recycling, Industrial, Sustainability and Manufacturing Institute [3]

The following license files are associated with this item:

Creative Commons

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International