
    Distributed Document Clustering Analysis Based on a Hybrid Method

    China Communications, 2017, Issue 2

    J.E. Judith, J. Jayakumari

    Noorul Islam Centre for Higher Education, Kumaracoil, India.

    I. INTRODUCTION

    An era powered by constant progress in Information Technology opens up a multitude of opportunities in data analysis, information retrieval and transaction processing, as the size of information repositories keeps expanding each day. This is due to the enormous growth in the volume of information available on the internet, from sources such as digital libraries, Reuters newswires and social media. The dramatic increase in the volume and scope of data produces huge corpora and databases.

    Clustering, or unsupervised learning, is one of the most important fields of machine learning; it splits data into groups of similar objects, helping in the extraction or summarization of new information. It is used in a variety of fields such as statistics, pattern recognition and data mining. This research focuses on the application of clustering in data mining, where clustering plays a vital role in addressing the challenges faced by the IT industry.

    Data mining is the process of discovering patterns in large datasets. Text mining is a sub-field of data mining that analyzes large collections of document datasets. The prime challenge in this field is the number of electronic text documents available, which grows exponentially and therefore demands effective methods for handling these documents. It is also infeasible to centralize all the documents from multiple sites at a single location for processing. Nowadays these document datasets are growing tremendously and are often referred to as Big Data. Because such datasets are usually high-dimensional, the problems of analyzing them are referred to as the curse of dimensionality.

    Distributed computing plays a major role in data mining due to these reasons. Distributed Data Mining (DDM) has evolved as a hot research area to solve these issues [29]. The use of machine learning in distributed environments for processing massive volumes of data is a recent research area. It is one of the important and emerging areas of research due to the challenges associated with the problem of extracting unknown information from very large centralized real-world databases.

    Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. It is regarded as a major technology for intelligent unsupervised categorization of content [3] in text form of any kind; e.g. news articles, web pages, learning objects, electronic books, even textual metadata. Document clustering groups similar documents to form a coherent cluster while documents that are different are separated into different clusters [6]. The quality of document clustering in both centralized and decentralized environments can be improved by using an advanced clustering framework.

    Most conventional document clustering methods are designed for central execution [4], maintaining a single large repository of documents on which the clustering analysis is performed. They require clustering to be performed on a dedicated node and are not suitable for deployment over large-scale distributed networks [21]. These methods assume that the data is memory-resident, which makes them unable to cope with the increasing complexity of distributed settings. There is thus a pressing need for mining knowledge from distributed resources [4], and specialized algorithms for distributed clustering have to be developed [1], [8] to overcome the challenges of distributed document clustering.

    In order to support data-intensive distributed applications [13], Hadoop, an open source implementation of MapReduce, is used for processing large datasets. MapReduce is a functional programming model [7] for distributed processing over several machines. The key idea behind the MapReduce framework is to map the dataset into a collection of <key, value> pairs, and then reduce all pairs with the same key [10]. Each machine runs a map function that takes a part of the input and maps it to <key, value> pairs [28]; these pairs are then sent to a machine that applies the reduce function, which combines the results for further processing. The outputs of the reduce function are then fed into the appropriate map function to begin the next round of processing.
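    As a purely illustrative single-process sketch of this map/shuffle/reduce flow (this is not the paper's Hadoop code; the function and variable names are ours), a term-count job over two toy documents can be written as:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a <term, 1> pair for every term occurrence in every document."""
    pairs = []
    for doc_id, text in documents.items():
        for term in text.split():
            pairs.append((term, 1))
    return pairs

def shuffle(pairs):
    """Group all pairs with the same key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine all values that share a key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

docs = {"d1": "data mining and text mining", "d2": "text clustering"}
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["mining"] == 2 and counts["text"] == 2
```

    In the real framework the map and reduce calls run on different machines and the shuffle happens over the network; the data flow, however, is exactly this.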

    In this work, a hybrid PSO-KMeans (PKMeans) distributed document clustering method is formulated for better speedup and accuracy of document clustering. Along with these conceptual and algorithmic changes, there is also a need to adopt one of the emerging frameworks for distributed analysis and storage. Today's enterprises build analytical applications on Big Data infrastructure [18]. One of these evolving technologies is the MapReduce methodology, whose open source implementation is Hadoop. The proposed document clustering method is based on the MapReduce methodology, which improves the performance of document clustering and enables the handling of large document datasets.

    Section 2 highlights related work in the area of MapReduce based distributed document clustering using PSO. Section 3 describes the proposed methodology, the experimental setup and the document sets used for analysis. Result analysis and discussions are done at Section 4. The paper concludes at Section 5.

    II. RELATED WORKS

    A brief review of MapReduce-based distributed document clustering is given in this section. Andrew W. McNabb et al. (2007) proposed [2] a method for optimizing large volumes of data by parallelizing the PSO algorithm so that individual function evaluations run concurrently. The results indicate that with more particles than processors, the performance of the MapReduce system improves; however, the system is not able to handle dynamic changes in the particles. Ibrahim Aljarah and Simone A. Ludwig (2012) proposed [9] a parallel particle swarm optimization clustering algorithm based on the MapReduce methodology, evaluated on synthetic and real datasets. The experimental results reveal that scalability and speedup increase with growing dataset size while the clustering quality is maintained. An efficient DBSCAN algorithm based on MapReduce was proposed [22] by Yaobin et al. This algorithm solves the problems of existing parallel DBSCAN algorithms; its scalability and efficiency are improved by a fully parallelized implementation that removes any sequential processing bottleneck. Z. Weizhong et al. introduced [23] a parallel K-means clustering algorithm based on MapReduce. The Map function assigns each data point to a centroid based on distance calculations, and the Reduce function recalculates each centroid as the weighted average of the points in its cluster. The final centroids are determined using MapReduce iterative refinement.

    Ping et al. in [17] proposed a document clustering algorithm using MapReduce, based on K-means. MapReduce is used to read the document collection and to calculate the term frequency and inverse document frequency. The algorithm clusters text documents with improved performance and accuracy. Another K-means clustering algorithm using MapReduce was suggested by Li et al. in [15]. This algorithm merges K-means with the ensemble learning method bagging; it solves the outlier problem and is efficient for large data. S. Nair et al. proposed a modification of self-organizing map (SOM) clustering using Hadoop MapReduce to improve its performance; experiments with large real datasets show an improvement in efficiency. S. Papadimitriou et al. proposed [19] a MapReduce framework to solve co-clustering problems; using MapReduce for co-clustering provides good solutions and scales well to large datasets. Yang et al. in [11] proposed a big data clustering method based on the MapReduce framework, in which parallel clustering is performed using an ant colony approach. Data analysis is improved by combining ant colony clustering with MapReduce, and the algorithm showed acceptable accuracy with improved efficiency.

    The proposed method is a hybrid algorithm based on MapReduce framework. This method utilizes optimization approach to generate optimal centroids for clustering. The global search ability of PSO algorithm improves the performance of the proposed method.

    III. PROPOSED APPROACH

    3.1 Overview of the Process

    The different steps followed in this methodology are,

    1. Choosing a document dataset on which to perform the analysis. A variety of document datasets are publicly available and used for text mining research.

    2. Document preprocessing, which reduces the number of attributes. The input text documents are transformed into a set of terms that can be included in the vector model.

    3. Document vector representation, which represents each document as a vector by determining term weights. A term weight is an entry in the Document-Term Matrix (DTM), which is constructed using the MapReduce methodology.

    4. Clustering of the document vectors using a hybrid PSO-KMeans clustering algorithm based on the MapReduce methodology (MR-PKMeans).

    The overview of the process is shown in fig.1.

    3.2 Document vector representation based on mapreduce

    The pre-processed documents are represented as vectors using the vector space model. Documents have to be transformed from their full-text version into document vectors [24] that describe their content. Let D = {d1, d2, ..., dn} be a set of documents and let T = {t1, t2, ..., tm} be the set of distinct terms occurring in D. The documents are represented in a Document-Term Matrix (DTM), each entry of which is a term weight. Terms that appear frequently in a small number of documents but rarely in the other documents tend to be more relevant and specific to that group of documents, and are useful for finding similar documents. To capture such terms and reflect their importance, the tf*idf weighting scheme is used. For each index term, the term frequency (tf) in each document and the inverse document frequency (idf) are calculated to determine the term weight based on Equation 3.1,

    w(t, d) = tf(t, d) × log(n / df(t))    (3.1)

    where n is the total number of documents and df(t) is the number of documents in which term t appears.

    Map and Reduce function

    The input to the Map function is the document dataset stored in HDFS. The dataset is split and <key, value> pairs are generated for each individual text document, where the key is the docID and the value is the set of terms in the document: <docID, term>. The Map function determines the frequency of each term in the document. The lists of term frequencies over the document collection are given as input to the Reduce function, which combines them to form the Document-Term Matrix. The matrix is then normalized by including the inverse document frequency along with the term frequency for each entry, as determined by Equation 3.1. The pseudo code for DTM construction based on MapReduce is shown in fig.2.
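    A compact single-process sketch of this DTM construction (the function names and data layout are ours; the real implementation distributes these loops as map and reduce tasks over HDFS splits) is:

```python
import math
from collections import Counter

def build_dtm(documents):
    """Build a tf-idf weighted Document-Term Matrix.

    documents: dict mapping docID -> list of preprocessed terms.
    Each entry dtm[docID][term] = tf(t, d) * log(n / df(t)), following
    Equation 3.1, with df(t) the number of documents containing term t.
    """
    n = len(documents)
    df = Counter()
    for terms in documents.values():
        df.update(set(terms))  # each term counted once per document
    dtm = {}
    for doc_id, terms in documents.items():
        tf = Counter(terms)
        dtm[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return dtm

docs = {"d1": ["apple", "apple", "banana"], "d2": ["banana", "cherry"]}
dtm = build_dtm(docs)
# "banana" occurs in every document, so its idf = log(2/2) = 0
```

    Note how a term present in every document receives weight zero, which is exactly the behaviour Equation 3.1 is designed to produce.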

    3.3 Proposed enhanced mapreduce based distributed document clustering method

    A multidimensional document vector space is modeled as the problem space in the proposed document clustering method; each term in the document dataset represents one dimension of this space. The proposed enhanced model includes two modules: an MR-PSO module and an MR-KMeans module. Multiple MapReduce jobs are chained together to improve the quality and accuracy of the clusters while reducing execution time.

    Fig.1 Overview of the process

    Fig.2 Pseudo code for document vector representation based on MapReduce

    3.3.1 MR-PSO module

    PSO is an iterative global search method. In PSO, each particle's location in the multidimensional problem space represents one solution to the problem, so a number of candidate clustering solutions for the document collection are considered as a swarm. A different problem solution is generated whenever the location of a particle changes. Each particle moves through the search space by following the best particles [12], looking for its personal best position. The velocity of each particle governs the convergence to an optimal solution, and the inertia weight, the particle's personal experience and the global experience influence the movement of each particle in the problem space. The objective function assigns a fitness value [14] to each particle based on its position, and this value is to be optimized. The fitness function is defined according to the properties of the problem at hand, and the quality of the clustering solution depends on the fitness function chosen. Here it represents the average distance between the document vectors and the cluster centroids, and the fitness evaluation is based on Equation 3.2,

    f = (1/Nc) Σ_{i=1..Nc} [ (1/Pi) Σ_{j=1..Pi} d(ti, nij) ]    (3.2)

    where d(ti, nij) is the distance between document nij (the j-th document of cluster i) and the cluster centroid ti, Pi is the number of documents in cluster i, and Nc is the number of clusters. The velocity and position of each particle are updated based on the following equations,

    vid(t+1) = w·vid(t) + c1·rand1·(pid − xid(t)) + c2·rand2·(pgd − xid(t))    (3.3)

    xid(t+1) = xid(t) + vid(t+1)    (3.4)

    where vid is the velocity and xid the position of particle i in dimension d, w is the inertia weight, pid is the personal best position at which the particle experienced its best fitness value, pgd is the global best position at which any particle experienced the best global fitness value, c1 and c2 denote the acceleration coefficients, and rand1 and rand2 are random values [25] in the interval (0, 1). The personal best position, global best position and velocity of each particle are updated based on Equations 3.3 and 3.4. Two major operations are performed in this module: fitness evaluation and particle centroid updates. The PSO module for generating optimal centroids for clustering is summarized as follows,

    FOR each particle
        Initialize the particle with k document vectors from the document collection as its cluster centroid vectors
    END

    LOOP
        FOR each particle
            Assign each document vector to the closest centroid vector
            Calculate the fitness value based on Equation 3.2
            IF the fitness value is better than the previous personal best fitness value (pBest), set the current value as the new pBest
            Update the particle position and velocity according to Equations 3.3 and 3.4 respectively to generate the next solution
        END
    UNTIL the maximum number of iterations is reached

    Choose the particle with the best fitness value among all the particles as the global best (gBest); the gBest value gives the optimal cluster centroids.
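    A minimal sketch of one round of these particle updates, assuming positions are flat lists of centroid coordinates (the fitness evaluation and pBest bookkeeping that the module's Map and Reduce functions perform are omitted; all names here are ours):

```python
import random

def pso_step(particles, w=0.72, c1=1.7, c2=1.7):
    """Apply Equations 3.3 and 3.4 once to every particle.

    Each particle is a dict with a position 'x', a velocity 'v', a personal
    best position 'pbest' and its fitness 'pbest_fit' (lower is better).
    """
    # the global best is the best personal best across the swarm
    gbest = min(particles, key=lambda p: p["pbest_fit"])["pbest"]
    for p in particles:
        for d in range(len(p["x"])):
            r1, r2 = random.random(), random.random()
            p["v"][d] = (w * p["v"][d]
                         + c1 * r1 * (p["pbest"][d] - p["x"][d])
                         + c2 * r2 * (gbest[d] - p["x"][d]))  # Equation 3.3
            p["x"][d] += p["v"][d]                            # Equation 3.4
    return particles

p = {"x": [1.0, 2.0], "v": [0.0, 0.0], "pbest": [1.0, 2.0], "pbest_fit": 0.5}
pso_step([p])
# a particle sitting at its own pbest (= gbest) with zero velocity stays put
```

    The update pulls each particle toward both its own best position and the swarm's best position, which is the source of PSO's global search ability exploited by the proposed method.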

    Map and Reduce function

    The input to the Map function is the tf-idf representation of the documents stored in HDFS. The Map function splits the documents, and <key, value> pairs are generated for each individual text document in the dataset. In this first module the MapReduce job is used to generate optimal centroids using PSO. The map function evaluates the fitness of each particle in the swarm. All the information about a particle, such as its particleID, cluster vector (C), velocity vector (V), personal best value (PB) and global best value (GB), is determined.

    The particleID serves as the key and the corresponding content as the value; the collections of <key, value> pairs contained in files are referred to as blocks. The particle swarm is retrieved from distributed storage. For each particleID the map function extracts the centroid vectors and calculates the average distance between the centroid vectors and the document vectors, returning a new fitness value based on the global best values to the reduce function. The reduce function aggregates the values with the same key and updates the particle position and velocity, then emits the global best centroids as the optimal centroids to be stored in HDFS for the next module. The pseudo code for the determination of the optimal centroids is summarized in Fig.3.

    3.3.2 MR-KMeans module

    For the first iteration the clustering process obtains the optimal initial cluster centroids from PSO; for subsequent iterations it obtains the cluster centroids from the output of the previous MapReduce job. The clustering process runs as a MapReduce program for similarity calculation, assignment of documents to clusters and recalculation of the new cluster centroids. The KMeans algorithm iterates until it meets the convergence criterion, here the maximum number of iterations. The similarity measurement is based on the extended Jaccard coefficient, which compares the summed weight of the shared terms to the summed weight of the terms present in either of the two documents but not shared,

    SIM(ta, tb) = (ta · tb) / (|ta|² + |tb|² − ta · tb)    (3.5)

    where ta and tb are n-dimensional term vectors over the term set. The recalculation of each new cluster centroid as the mean of the document vectors in that cluster is determined using Equation 3.6,

    cj = (1/nj) Σ_{dj ∈ Qj} dj    (3.6)

    where nj is the number of document vectors within cluster Qj and dj is a document vector belonging to cluster Qj.
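    Equations 3.5 and 3.6 can be sketched directly (illustrative helper names; document vectors are plain lists of term weights):

```python
def jaccard_similarity(a, b):
    """Extended Jaccard coefficient between two term-weight vectors (Eq. 3.5)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def centroid(cluster):
    """Mean of the document vectors in a cluster (Eq. 3.6)."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

sim = jaccard_similarity([1.0, 0.0], [1.0, 0.0])
mean = centroid([[0.0, 0.0], [2.0, 2.0]])
# identical vectors have similarity 1; the centroid of the pair above is [1.0, 1.0]
```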

    The KMeans clustering method for generating document clusters is summarized as follows,

    o Retrieve the cluster centroids vector from PSO module.

    o Determine the cluster similarity using the Jaccard similarity given in equation 3.5 and assign each document vector to the cluster centroid with which it has maximum similarity

    Fig.3 Pseudo code for the determination of optimal centroids based on MapReduce

    o Recalculate the cluster centroids using equation 3.6 as the mean of the document vectors in the cluster.

    Map and Reduce function

    The input to the Map function includes two parts: the input document dataset stored in HDFS, and the centroids, obtained from PSO or from the previous iteration, stored in the centers directory in HDFS. First, the input document dataset is split into a series of <key, value> pairs on each node, where the key is the docID and the value is the content of the document. Each Mapper calculates the similarity between each document vector and the cluster centroids using the Jaccard coefficient and assigns the document vector to the cluster centroid with the maximum similarity; this is repeated for all document vectors. A list of <key, value> pairs with the centroidID as the key and the cluster contents as the value is passed to the Reduce function, which updates each centroid by aggregating all the document vectors with the same key. The new centroidID and the cluster contents are stored in HDFS to be used in the next round of the MapReduce job. The pseudo code for the MapReduce job that performs K-Means clustering using the optimal centroids is given in Fig.4.

    Fig.4 Pseudo code for proposed hybrid method for document clustering based on MapReduce
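    A single-process sketch of one such MapReduce round (the similarity function is passed in, the names are ours, and the real job runs the map over HDFS splits rather than an in-memory list):

```python
from collections import defaultdict

def kmeans_map(doc_vectors, centroids, similarity):
    """Map: emit <centroidID, document vector> for the most similar centroid."""
    pairs = []
    for vec in doc_vectors:
        best = max(range(len(centroids)),
                   key=lambda c: similarity(vec, centroids[c]))
        pairs.append((best, vec))
    return pairs

def kmeans_reduce(pairs):
    """Reduce: recompute each centroid as the mean of its assigned vectors."""
    clusters = defaultdict(list)
    for cid, vec in pairs:
        clusters[cid].append(vec)
    return {cid: [sum(col) / len(vecs) for col in zip(*vecs)]
            for cid, vecs in clusters.items()}

dot = lambda a, b: sum(x * y for x, y in zip(a, b))  # stand-in similarity
new_centroids = kmeans_reduce(
    kmeans_map([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], dot))
# each vector stays with its matching centroid, so the centroids are unchanged
```

    In the proposed method the stand-in similarity above would be the extended Jaccard coefficient of Equation 3.5, and the emitted centroids feed the next MapReduce round.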

    3.4 Experimental setup

    The distributed environment is set up as a Hadoop cluster to evaluate the performance of the proposed clustering algorithm. The Hadoop version used is 0.20.2. Each node of the cluster has an Intel i7 CPU at 3 GHz, 1 TB of local hard disk storage reserved for HDFS [26] and 8 GB of main memory. All nodes are connected by a standard gigabit Ethernet network in a flat network topology. Table 3.1 shows the configuration of the Hadoop cluster. Parallel jobs are submitted to the Hadoop (MapReduce) environment; Hadoop is run in a pseudo-distributed mode in which each Hadoop daemon runs as a separate process.

    3.4.1 Document datasets

    The Reuters Corpus is a benchmark document dataset [6] that contains newspaper articles and has been widely used for evaluating clustering algorithms. It covers a wide range of international topics including business and finance, lifestyle, politics, sports, etc. The articles are categorized along several dimensions such as topic, industry and region, and the corpus occupies 3,804 MB.

    Reuters Corpus Volume 1 (RCV1), introduced by Lewis et al., consists of about 800,000 documents in XML format for text categorization research [27]. They were collected from Reuter's newswire over a one-year period, and the corpus is available at [27]. Table 3.2 summarizes the characteristics of the dataset.

    3.4.2. Document dataset preparation

    To represent the documents, the document datasets are preprocessed using common procedures: stemming with the default Porter's algorithm [16], stopword removal, pruning, removal of punctuation, whitespace and numbers, and conversion to lowercase. Each of the extracted terms represents one dimension of the feature space. In this work the feature weighting scheme used is the combination of term frequency and inverse document frequency (tf-idf) given in Equation 3.1. It is based on the idea that terms appearing frequently in a small number of documents but rarely in the other documents tend to be more relevant and specific to that group of documents.
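    A rough sketch of this preprocessing (Porter stemming is omitted to keep the sketch dependency-free, and the stopword list is a tiny illustrative subset of a real one):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def preprocess(text, min_len=3):
    """Lowercase, strip punctuation/numbers/whitespace, remove stopwords,
    and prune very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())  # keeps only alphabetic runs
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]

terms = preprocess("The 2 Clusters of Documents, clustered IN 2017!")
# -> ['clusters', 'documents', 'clustered']
```

    With stemming added, 'clusters' and 'clustered' would collapse to a single term, which is what keeps the feature space dimensions in Table 3.3 manageable.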

    IV. PERFORMANCE ANALYSIS AND DISCUSSION

    A variety of evaluation metrics exist for evaluating the performance of the proposed clustering algorithm. Here, the accuracy of the proposed clustering algorithm is determined using the SSE (Sum of Squared Errors), and its performance is analyzed with respect to execution time.

    4.1 Analysis of clustering accuracy

    Clustering accuracy is determined using the internal similarity criterion SSE, choosing the cluster solution at which the decrease in SSE slows down dramatically. SSE measures the intra-cluster similarity of the documents within a cluster.

    The within-cluster sum of squared errors (within-SSE), used to determine cohesion, is

    within-SSE = Σ_{i=1..k} Σ_{x ∈ Si} ||x − ci||²

    The between-cluster sum of squares (BSS), used to determine cluster separation, is

    BSS = Σ_{i=1..k} |Si| · ||c − ci||²

    where |Si| is the size of cluster i, ci its centroid and c the overall mean of the document vectors; the total SSE = within-SSE + BSS. To determine the impact of the particle swarm optimization (PSO) algorithm on the accuracy of the clusters, and in order to make a fair comparison of accuracy, a highly recommended PSO setting [20] is used for the analysis: an initial swarm size of 100 particles, inertia weight w = 0.72 and acceleration coefficients c1 = c2 = 1.7. Let k be the number of clusters to be generated. For the given dataset the algorithm is executed for a range of k. The MR-KMeans algorithm is repeated with random initial centroids and the mean purity value is recorded; similarly, the proposed hybrid algorithm with PSO-generated centroids is executed on the dataset.
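    The two SSE components can be sketched as follows (illustrative names; clusters map centroid IDs to their member vectors):

```python
def within_sse(clusters, centroids):
    """Sum of squared Euclidean distances from each vector to its centroid."""
    return sum(sum((vi - ci) ** 2 for vi, ci in zip(v, centroids[cid]))
               for cid, vectors in clusters.items() for v in vectors)

def between_ss(clusters, centroids, overall_mean):
    """BSS: cluster sizes times squared distance of centroids from the mean."""
    return sum(len(vs) * sum((m - c) ** 2
                             for m, c in zip(overall_mean, centroids[cid]))
               for cid, vs in clusters.items())

clusters = {0: [[0.0, 0.0], [2.0, 0.0]], 1: [[5.0, 5.0]]}
centroids = {0: [1.0, 0.0], 1: [5.0, 5.0]}
wsse = within_sse(clusters, centroids)
# cluster 0 contributes (0-1)^2 + (2-1)^2 = 2; cluster 1 contributes 0
```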

    Table 3.1 Experimental environment

    Table 3.2 Characteristics of dataset for evaluation

    Table 3.3 Initial feature space dimensions using terms after preprocessing

    Fig.5 SSE for 50 clusters

    Results show that cluster counts greater than 30 have no substantial impact on the total SSE. The clustering solution generated is more stable due to the good initial cluster centroids provided by the particle swarm optimization algorithm. Fig 5 describes the SSE for 50 clusters; the results show that the clusters achieve an 81% reduction in the sum of squared errors for the generation of 50-cluster solutions.

    Fig. 6 shows the performance comparison of the algorithms considering accuracy. Results show that the accuracy of the centralized version of the hybrid method is 77% for the generation of 50 clusters. Results for the comparison of accuracy using the internal index for both centralized and the proposed distributed method shows a comparable performance improvement for the distributed hybrid algorithm with increase in the number of clusters.

    Fig.6 Comparison of accuracy

    Fig.7 Comparison of execution time

    4.2 Analysis of execution time

    To analyze the execution time of the proposed hybrid method, the number of documents in the test dataset is increased and the execution time is plotted against the number of documents. The experiment is repeated to obtain the mean execution time for each number of documents. Results show that for a relatively small number of documents the performance of the proposed MapReduce hybrid algorithm is only comparable with that of the centralized version of the hybrid algorithm, but there is a tremendous improvement in the speedup of the system as the number of documents increases. Fig.7 shows that the execution time of the proposed method falls increasingly below that of the centralized version as the number of documents grows.

    Table 3.4 and Fig.8 show the comparison of execution times on different datasets as the number of Hadoop nodes varies. It can be seen from the graph that there is a tremendous reduction in execution time for the proposed hybrid clustering on the different datasets compared to the centralized K-Means clustering method and the distributed MR-KMeans. Execution time is measured in seconds (s).

    The execution times of MR-KMeans and the proposed hybrid method are comparable. The PSO algorithm normally requires a large number of iterations to converge, but in the proposed distributed method the maximum number of MR-PSO iterations is limited to 25, and the optimal centroids are generated within those iterations. The computational cost of PSO is reduced because it is implemented on distributed infrastructure, namely HDFS (the Hadoop Distributed File System) in Hadoop.

    4.3 Performance comparison with existing centralized clustering algorithms

    The performance of the efficient hybrid distributed document clustering method is compared with the performance of existing methods on a centralized node. All the centralized existing methods of clustering utilize Euclidean distance as the similarity measure to determine the closeness between the document vectors.

    4.3.1 Analysis of clustering accuracy

    The standard centralized K-Means algorithm converges to a stable clustering result within 50 iterations, but PSO requires more iterations to converge. For comparison purposes, KMeans and PSO are each limited to 50 iterations for document clustering. PSO is able to assign documents to the appropriate cluster based on the fitness value of the documents [44]. The fitness of a particle is determined by the average distance between the documents and the cluster centroids: the smaller the average distance, the greater the fitness. The average cluster accuracy on the document datasets Reuters and RCV1 is 70% and 65% for the centralized KMeans clustering algorithm, and 73% and 68% for the centralized PSO clustering method. The KPSO clustering method allows 25 iterations for KMeans to converge and 25 iterations for PSO to generate an optimal solution. Experimental results show that the KPSO clustering method assigns documents to clusters with average accuracy values, over all 50 clusters, of 81% and 79% for Reuters and RCV1 respectively. The comparison of clustering quality with the existing centralized clustering methods is shown in fig.9.

    The proposed hybrid distributed document clustering method based on MapReduce methodology provides a tremendous increase in the clustering accuracy.

    4.3.2 Analysis of execution time

    Table 3.6 describes the comparison of the execution time of the proposed MapReduce-based hybrid method with the existing centralized methods. It demonstrates that the distributed hybrid method shows a significant improvement in execution time compared to the other existing centralized clustering methods. The corresponding graph is given in Fig.10, which describes the execution time analysis of the proposed hybrid method against the other existing centralized clustering methods.

    Table 3.4 Comparison of execution time of proposed hybrid method

    Table 3.5 Comparison of clustering accuracy with existing centralized methods

    Table 3.6 Comparison of execution time with existing centralized methods

    Fig.8 Comparison of execution time of proposed MapReduce based (MR-PKMeans)clustering method with other clustering methods

    Fig.9 Comparison of clustering quality with existing centralized clustering method

    From Fig.10 it can be seen that there is a tremendous reduction in execution time for the proposed Hybrid method compared to centralized clustering algorithms. The proposed Hybrid method determines optimized related document clusters with improved speedup.

    Fig.10 Comparison of execution time with existing centralized clustering methods

    V. CONCLUSION AND FUTURE WORK

    In this paper, design and implementation of a hybrid PSO KMeans clustering (MR-PKMeans) algorithm using MapReduce framework was proposed. The PKMeans clustering algorithm is an effective method for document clustering; however, it takes a long time to process large data sets. Therefore, MR-PKMeans was proposed to overcome the inefficiency of PKMeans for big data sets. The proposed method can efficiently be parallelized with MapReduce to process very large data sets. In MR-PKMeans the clustering task that is formulated by KMeans algorithm utilizes the best centroids generated by PSO. The global search ability of optimization algorithm generates results with good accuracy and reduced execution time. Future work is to apply MR-PKMeans in real application domains.

    [1] Aboutbl Amal Elsayed and Elsayed Mohamed Nour, “A Novel Parallel Algorithms for Clustering Documents Based on the Hierarchical Agglomerative Approach”,Int’l Journal of Computer Science & Information Technology, vol.3, issue 2, pp.152, Apr.2011.

    [2] Andrew W. McNabb, Christopher K. Monson,and Kevin D. Seppi, “Parallel PSO Using MapReduce”,IEEE Congress on Evolutionary Computation (CEC 2007), pp.7-14, 2007.

    [3] Anna Huang, “Similarity Measures for Text Document Clustering”,Proceedings of the New Zealand Computer Science Research Student Conference,pp. 49-56, 2008.

    [4] B.A. Shboul and S.H. Myaeng. “Initializing K-Means using Genetic Algorithms”,World Academy of Science, Engineering and Technology,vol.54,2009

    [5] Cui X., & Potok T. E., “Document Clustering Analysis Based on Hybrid PSO + Kmeans Algorithm”,Journal of Computer Sciences,Special Issue, pp. 27-33, 2005.

    [6] D.D. Lewis, “Reuters-21578 text categorization test collection, distribution 1.0”, http://www.research.att.com /lewis, 1999.

    [7] E. Alina, I. Sungjin, and M. Benjamin, “Fast clustering using MapReduce,” inProceedings of KDD’11. NY, USA: ACM, pp. 681–689, 2011.

    [8] Eshref Januzaj, Hans-Peter Kriegel and Martin Pfeifle, “Towards Effective and Efficient Distributed Clustering”, Workshop on Clustering Large Data Sets, Melbourne, 2003.

    [9] Ibrahim Aljarah and Simone A. Ludwig, “Parallel Particle Swarm Optimization Clustering Algorithm based on MapReduce Methodology”,Fourth World Congress on Nature and Biologically Inspired Computing (NaBIC), pp. 105-111,2012.

    [10] J. Wan, W. Yu, and X. Xu, “Design and implementation of distributed document clustering based on MapReduce”,Proceedings of the 2nd symposium on International Computer Science and Computational Technology, pp.278–280,2009.

    [11] J. Yang and X. Li, “MapReduce based method for big data semantic clustering,” inProceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, SMC’13. Washington, DC, USA: IEEE Computer Society, pp.2814–2819, 2013.

    [12] J.J. Li, X. Yang, Y.M. Ju and S.T. Wu, “Survey of PSO clustering algorithms”,Application Research of Computers,vol. 26, no.12, Dec.2009.

    [13] Kathleen Ericsson and Shrideep Pallickara, “On the performance of high dimensional data clustering and classification algorithms”,Future Generation Computer Systems,June 2012

    [14] J. Kennedy and R.C. Eberhart, “Particle swarm optimization”, IEEE International Conference on Neural Networks, pp. 1942-1948, 1995.

    [15] L. Guang, W. Gong-Qing, H. Xue-Gang, Z. Jing, L. Lian, and W. Xindong, “K-means clustering with bagging and MapReduce”, in Proceedings of the 2011 44th Hawaii International Conference on System Sciences, Washington, DC, USA: IEEE Computer Society, pp. 1-8, 2011.

    [16] M.F. Porter, “An algorithm for suffix stripping”, Program: Electronic Library and Information Systems, vol. 14, issue 3, pp. 130-137, 1980.

    [17] Ping Zhou, Jingsheng Lei, Wenjun Ye, “Large-Scale Data Sets Clustering Based on MapReduce and Hadoop”, Journal of Computational Information Systems, vol. 7, no. 16, pp. 5956-5963, 2011.

    [18] S. Nair and J. Mehta, “Clustering with Apache Hadoop”, Proceedings of the International Conference and Workshop on Emerging Trends in Technology (ICWET’11), New York, NY, USA: ACM, pp. 505-509, 2011.

    [19] S. Papadimitriou and J. Sun, “DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-scale End-to-end Mining”, Proceedings of the Eighth IEEE International Conference on Data Mining, pp. 512-521, 2008.

    [20] Y.H. Shi and R.C. Eberhart, “Parameter selection in particle swarm optimization”, The 7th Annual Conference on Evolutionary Programming, San Diego, CA, 1998.

    [21] Souptik Datta, K. Bhaduri, Chris Giannella, Ran Wolff and Hillol Kargupta, “Distributed Data Mining in Peer-to-Peer Networks”, IEEE Internet Computing, vol. 10, no. 4, pp. 18-26, August 2006.

    [22] Y. He, H. Tan, W. Luo, S. Feng, and J. Fan, “MR-DBSCAN: A Scalable MapReduce-based DBSCAN Algorithm for Heavily Skewed Data”, Frontiers of Computer Science, vol. 8, no. 1, pp. 83-99, 2014.

    [23] Z. Weizhong, M. Huifang, and H. Qing, “Parallel k-means clustering based on MapReduce”, Proceedings of CloudCom’09, Berlin, Heidelberg: Springer-Verlag, pp. 674-679, 2009.

    [24] Y. K. Patil and V. S. Nandedkar, “HADOOP: A New Approach for Document Clustering”, Int’l Journal of Advanced Research in IT and Engineering, pp. 1-8, 2014.

    [25] Nailah Al-Madi, Ibrahim Aljarah and Simone A. Ludwig, “Parallel Glowworm Swarm Optimization Clustering Algorithm based on MapReduce”, IEEE Symposium on Swarm Intelligence, pp. 1-8, 2014.

    [26] C.L. Philip Chen and Chun-Yang Zhang, “Data-intensive applications, challenges, techniques and technologies: A survey on Big Data”, Information Sciences, 2014.

    [27] D.D. Lewis, Y. Yang, T. Rose, and F. Li, “RCV1: A New Benchmark Collection for Text Categorization Research”, Journal of Machine Learning Research, pp. 361-397, 2004.

    [28] Dawen Xia, Binfeng Wang, Yantao Li, Zhuobo Rong, and Zili Zhang, “An Efficient MapReduce-Based Parallel Clustering Algorithm for Distributed Traffic Subarea Division”, Discrete Dynamics in Nature and Society, 2015.

    [29] Yajun Cui, et al., “Parallel Spectral Clustering Algorithm Based on Hadoop”, arXiv preprint arXiv:1506.00227, 2015.
