Please use this identifier to cite or link to this item:
http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
Title: | Optimizing ahocorasick for word counting. |
Other Titles: | Otimizando ahocorasick para contagem de palavras. |
Author: | LUCENA, Emerson Leonardo. |
Advisor: | GHEYI, Rohit. |
Referee 1: | MONTEIRO, João Arthur Brunet. |
Referee 2: | MASSONI, Tiago Lima. |
Keywords: | Aho-Corasick algorithm;Pattern matching;Correspondência de padrões;Filtrage;Coincidencia de patrones;Word counting;Recuento de palabras;Comptage de mots;Contagem de palavras;Algoritmo offline;Algorithme hors ligne;Algoritmo sin conexión;Offline algorithm;Processamento de textos;Processing of texts;Procesamiento de textos;Traitement des textes |
Issue Date: | 2020 |
Publisher: | Universidade Federal de Campina Grande |
Citation: | LUCENA, E. L. Optimizing ahocorasick for word counting. 12 f. Trabalho de Conclusão de Curso - Artigo (Curso de Bacharelado em Ciência da Computação) Graduação em Ciência da Computação, Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande - Paraíba - Brasil, 2020. Disponível em: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128 |
Abstract: | The Aho-Corasick algorithm finds all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length, and 𝑜 is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the algorithm's performance degrades. In many domains, such as information retrieval, natural language processing, and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of a set of words in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100 MB and dictionaries of 1 KB, 1 MB, and 10 MB. The new algorithm performed better in every experiment, running from 50% to 300% faster than Aho-Corasick. |
CNPq Subject: | Computer Science |
URI: | http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128 |
Appears in Collections: | Trabalho de Conclusão de Curso - Artigo - Ciência da Computação |
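The abstract describes counting keyword occurrences offline, without enumerating each match. The Python sketch below illustrates one well-known way to do this and is not taken from the paper: it records how often each automaton state is visited during the scan, then propagates those counts along failure links from the deepest states upward. The function name `count_words` and its internal structure are assumptions made for illustration, and keywords are assumed to be non-empty. This sketch propagates over every automaton state, so it does not claim to reproduce the paper's 𝑂(𝑚 + 𝑛 + 𝑢) bound.

```python
from collections import deque


def count_words(words, text):
    """Count occurrences (overlaps included) of each distinct word in text.

    Illustrative sketch only; assumes every word is non-empty.
    """
    # Build the trie. goto[s] maps a character to a child state; 0 is the root.
    goto = [{}]
    fail = [0]
    word_state = {}  # word -> state at which the word ends
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        word_state[w] = s

    # Breadth-first pass to compute failure links.
    # `order` lists the non-root states by non-decreasing depth.
    order = []
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)

    # Scan the text once, recording only how often each state is reached.
    visits = [0] * len(goto)
    s = 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        visits[s] += 1

    # A failure link always points to a shallower state, so walking the BFS
    # order backwards pushes each count to its suffix state exactly once.
    for s in reversed(order):
        visits[fail[s]] += visits[s]

    return {w: visits[s] for w, s in word_state.items()}
```

For example, `count_words(["a", "aa"], "aaaa")` counts overlapping matches and returns `{"a": 4, "aa": 3}`. The scan does constant extra work per text character regardless of how many keywords end at that position, which is the property that makes heavy repetition cheap.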
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
EMERSON LEONARDO LUCENA - TCC CIÊNCIA DA COMPUTAÇÃO 2020.pdf | | 1.49 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.