Unsupervised noise detection in unstructured data for automatic parsing

Jain, Shubham; de Buitléir, Amy; Fallon, Enda

Date

2020-11-30

Abstract

The telecommunications industry makes extensive use of data extracted from logs, alarms, traces, diagnostics, and other monitoring devices. Analyzing the generated data requires that the data be parsed, re-structured, and re-formatted. Developing custom parsers for each input format is labor-intensive and requires domain knowledge. In this paper, we describe a novel unsupervised text processing pipeline to automatically detect and label relevant data and eliminate noise using Levenshtein similarity and Agglomerative clustering. We experiment with different similarity and clustering algorithms on a selection of common data formats to verify the accuracy of the proposed technique. The results suggest that the proposed methodology has higher accuracy.
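The abstract names two building blocks, Levenshtein similarity and agglomerative clustering, without giving implementation details on this page. The following is a minimal sketch of how those two pieces could be combined to separate recurring record lines from noise; it is not the authors' pipeline. The normalization of the edit distance, the average-linkage rule, the `threshold` value, the `min_cluster_size` rule for calling a line noise, and all function names are assumptions made for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def agglomerative_clusters(lines, threshold=0.6):
    """Bottom-up average-linkage clustering; returns lists of line indices.

    Merging stops once the most similar pair of clusters falls below
    `threshold` (a hypothetical cut-off, not a value from the paper).
    """
    sim = {(i, j): similarity(lines[i], lines[j])
           for i, j in combinations(range(len(lines)), 2)}

    def linkage(c1, c2):
        total = sum(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(lines))]
    while len(clusters) > 1:
        best_score, (x, y) = max(
            (linkage(clusters[x], clusters[y]), (x, y))
            for x, y in combinations(range(len(clusters)), 2))
        if best_score < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is still valid
    return clusters


def label_lines(lines, threshold=0.6, min_cluster_size=2):
    """Assumed rule: lines in clusters smaller than `min_cluster_size` are noise."""
    labels = [None] * len(lines)
    for cluster in agglomerative_clusters(lines, threshold):
        tag = "data" if len(cluster) >= min_cluster_size else "noise"
        for index in cluster:
            labels[index] = tag
    return labels


if __name__ == "__main__":
    # Made-up sample input: three similar alarm lines and one separator line.
    sample = [
        "2020-11-30 10:00:01 ALARM link down port=1",
        "2020-11-30 10:00:05 ALARM link down port=2",
        "2020-11-30 10:00:09 ALARM link down port=3",
        "=====",
    ]
    for line, label in zip(sample, label_lines(sample)):
        print(label, "|", line)
```

The pairwise similarity table makes this sketch quadratic in the number of lines, so it suits small samples only. For larger inputs, a library implementation of hierarchical clustering over a precomputed distance matrix would be the more practical choice.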

URI

http://research.thea.ie/handle/20.500.12065/3722

Collections

Conferences - Software Research Institute

The following license files are associated with this item:

Creative Commons

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International