Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data

Laxmi Lydia E.; Sharmili N.; Nguyen P.T.; Hashim W.; Maseleno A.

dc.citedby	3
dc.contributor.author	Laxmi Lydia E.	en_US
dc.contributor.author	Sharmili N.	en_US
dc.contributor.author	Nguyen P.T.	en_US
dc.contributor.author	Hashim W.	en_US
dc.contributor.author	Maseleno A.	en_US
dc.contributor.authorid	57196059278	en_US
dc.contributor.authorid	57191575400	en_US
dc.contributor.authorid	57216386109	en_US
dc.contributor.authorid	11440260100	en_US
dc.contributor.authorid	55354910900	en_US
dc.date.accessioned	2023-05-29T07:27:55Z
dc.date.available	2023-05-29T07:27:55Z
dc.date.issued	2019
dc.description	Automatic indexing; Big data; Cluster analysis; Extraction; Factorization; Indexing (of information); Information retrieval; K-means clustering; Natural language processing systems; Open source software; Open systems; Pattern matching; Software quality; Software testing; Text mining; Hadoop; Key phrase extractions; Map-reduce; Pattern-matching technique; Porters; Pre-processing algorithms; Software environments; Unlabeled; Matrix algebra	en_US
dc.description.abstract	The volume of unlabeled text data in documents has grown, and mining such datasets is a challenging task. The objective of Big Data is to store, retrieve and analyse multiple text documents. Problem Statement: The retrieval of identical data over large databases is of major concern. Existing Solution: The existing problem is addressed by Full-Text Search (FTS), a pattern-matching technique that allows multiple keywords to be searched at a given time. Proposed Solution: In this paper, multiple text documents are taken as input and processed with text mining pre-processing algorithms such as key phrase extraction, Porter stemming for tokenizing, and TF-IDF to obtain all non-negative values. These values are further processed into matrix data through non-negative matrix factorization (NMF). After NMF is performed, the K-means algorithm is enhanced with NMF to obtain quality clusters of the data sets. The performance of the algorithms is tested on the Newsgroup20 data in the open-source Hadoop software environment, which also analyses the performance of the MapReduce framework. The final outcome is to generate clusters and index them for the Newsgroup20 dataset. Apache Lucene is then presented for automatic document clustering, with a GUI developed for indexing. The proposed algorithm thus improves the performance of document clustering through the MapReduce framework in Hadoop. © 2019 Mattingley Publishing. All rights reserved.	en_US
dc.description.nature	Final	en_US
dc.identifier.epage	1130
dc.identifier.issue	11-12
dc.identifier.scopus	2-s2.0-85079574447
dc.identifier.spage	1107
dc.identifier.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85079574447&partnerID=40&md5=1ed7ff4baa70eeccef9e5755fa21fcec
dc.identifier.uri	https://irepository.uniten.edu.my/handle/123456789/24853
dc.identifier.volume	81
dc.publisher	Mattingley Publishing	en_US
dc.source	Scopus
dc.sourcetitle	Test Engineering and Management
dc.title	Automatic document clustering and indexing of multiple documents using KNMF for feature extraction through Hadoop and lucene on big data	en_US
dc.type	Article	en_US
dspace.entity.type	Publication
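
The pipeline summarised in the abstract (TF-IDF weighting, NMF, then K-means on the factorised representation) can be sketched roughly as follows. This is a minimal single-machine illustration in plain Python, not the authors' Hadoop/MapReduce implementation. The function names `tfidf`, `nmf` and `kmeans`, the toy documents, the smoothed IDF formula and the multiplicative-update NMF solver are all assumptions made for the example; key phrase extraction, Porter stemming and Lucene indexing are left out.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, matrix): one non-negative TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = max(len(toks), 1)
        # Smoothed IDF (an assumption) keeps every weight >= 0, as NMF requires.
        rows.append([counts[t] / total * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t in vocab])
    return vocab, rows


def transpose(a):
    return [list(r) for r in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factorise V (n x m) into W (n x k) and H (k x m) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus standing in for the Newsgroup20 data.
docs = [
    "hadoop mapreduce cluster job",
    "mapreduce job scheduling on hadoop",
    "lucene index search query",
    "search query ranking with lucene index",
]
vocab, v = tfidf(docs)
w, h = nmf(v, k=2)        # rows of w: documents in the reduced NMF space
labels = kmeans(w, k=2)   # cluster the documents on their NMF features
print(labels)
```

Clustering the rows of `w` rather than the raw TF-IDF rows is one plausible reading of "K-means upgraded with NMF"; the record itself does not specify how the two steps are combined.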
