Free download of the English paper "Robust Clustering Using Outlier-Sparsity Regularization" with Persian translation
Paper title (Persian) | Robust clustering using outlier-sparsity regularization |
عنوان انگلیسی مقاله | Robust Clustering Using Outlier-Sparsity Regularization |
Related disciplines | Computer Engineering; Algorithms and Computation |
Keywords | (Block) coordinate descent, clustering, expectation-maximization (EM) algorithm, group-Lasso, K-means, kernel methods, mixture models, robustness, sparsity |
Format of free files | The English paper and the free Persian translation are available for download in PDF format; the translation can also be purchased and downloaded in Word format |
Translation quality | The quality of this paper's translation is average |
Publisher | IEEE |
Journal | IEEE TRANSACTIONS ON SIGNAL PROCESSING |
Publication year | 2012 |
Product code | F844 |
Free English paper (PDF) | Download the English paper for free |
Free Persian translation (PDF) | Download the paper translation for free |
Purchase the Word-format translation | Purchase the paper translation in Word format |
Search translated papers | Search translations of computer engineering papers |
Paper contents: Abstract |
Excerpt from the Persian translation of the paper: 1. Introduction |
Excerpt from the English paper: I. INTRODUCTION

CLUSTERING aims to partition a set of data into subsets, called clusters, such that data assigned to the same cluster are similar in some sense. Working with unlabeled data and under minimal assumptions makes clustering a challenging, yet universal tool for revealing data structures in a gamut of applications such as DNA microarray analysis and bioinformatics, (social) network analysis, image processing, and data mining [18], [35]. Moreover, clustering can serve as a pre-processing step for supervised learning in applications where labeling data one-at-a-time is costly.

Multiple interpretations across disciplines of what a cluster is have led to an abundance of application-specific algorithms [35]. Among the algorithms which cluster data represented by vectors, K-means and Gaussian mixture model (GMM-)based clustering are two popular schemes [26], [35]. K-means relies on the Euclidean distance as a similarity measure, thereby yielding partitions that minimize the within-cluster scatter [18]. Contrastingly, soft (a.k.a. fuzzy) K-means is well-suited for overlapping clusters by allowing each datum to belong to multiple clusters [2]. GMM-based clustering considers data drawn from a probability density function (pdf), where each class-conditional pdf corresponds to a cluster [35]. Clustering then arises as a by-product of a maximum likelihood (ML) estimation framework for the GMM parameters, which are typically obtained through the expectation-maximization (EM) algorithm [11]. Kernel methods have been devised to enable clustering of nonlinearly separable clusters [29], [30].

Notwithstanding their popularity, K-means and GMM-based clustering are sensitive to inconsistent data, termed outliers, due to their functional dependency on the Euclidean distance [20]. Outliers appear infrequently in the data, emerging either due to reading errors or because they belong to rarely seen and, hence, markedly informative phenomena. However, even a few outliers can render clustering unreliable: cluster centers and model parameter estimates can be severely biased, and thus the data-to-cluster assignment is deteriorated. This motivates robustifying clustering approaches against outliers at affordable computational complexity in order to unravel the underlying structure in the data.

Several robust clustering approaches have been investigated [16]. Those more relevant to the framework developed here include possibilistic clustering, which builds on fuzzy K-means by measuring the so-called typicality of each datum with respect to each cluster to decide whether a datum is an outlier [24], [28]. However, possibilistic clustering is sensitive to initialization, and can output the same cluster more than once. Similar to [19], the noise clustering method of [10] introduces an additional cluster intended to capture all outliers, and its centroid is heuristically assumed to be equidistant from all non-outlying data. To mitigate centroid bias, the α-cut method performs K-means steps, but cluster centroids are estimated using only the α-percentage of the data assigned to each cluster [36]. Other robust alternatives include sequential clustering approaches which identify a single cluster at a time, and remove its points from the dataset [22], [38]. A minimum-volume ellipsoid containing a predetermined fraction of the data is identified per step in [22]; while [38] combines Huber's ε-contaminated model with a GMM [20]. However, sequentially removing points can hinder the underlying data structure.
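As background for the outlier-aware model introduced next (a standard formulation, not part of the excerpt itself), the within-cluster scatter that hard K-means minimizes can be written as

```latex
\min_{\{m_k\}_{k=1}^{K},\; c}\;
\sum_{n=1}^{N} \left\| x_n - m_{c(n)} \right\|_2^2 ,
\qquad c(n) \in \{1,\dots,K\},
```

where m_k is the centroid of cluster k and c(n) assigns datum x_n to one cluster. Because the loss grows quadratically with the residual, a single grossly corrupted x_n can drag a centroid arbitrarily far, which is exactly the sensitivity the excerpt describes.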
Inspired by robust statistics, clustering methods based on the ℓ1-distance (K-medians), Tukey's biweighted function, and trimmed means have also been proposed [4], [15], [23]; but they are all limited to linearly separable clusters. A clustering approach identifying clusters of arbitrary shape using kernel functions was developed in [1]. Even though resilient to outliers, this method targets density estimation, while the number of clusters identified depends critically on a grid search over a kernel parameter. Robust GMM-based clustering approaches introduce outlier-aware pdfs, and the ML problem arising is typically solved via EM-like algorithms [27], [31].

The first contribution of the present work is to introduce a data model for clustering that explicitly accounts for outliers via a deterministic outlier vector per datum (Section II). A datum is deemed an outlier if its corresponding outlier vector is nonzero. Translating the fact that outliers are rare to sparsity in the outlier vector domain leads to a neat connection between clustering and the compressed sensing (CS) paradigm [7]. Building on this model, an outlier-aware clustering methodology is developed for clustering both from the deterministic (K-means) and the probabilistic (GMM) perspectives.

The second contribution of this work comprises various iterative clustering algorithms developed for robust hard K-means, soft K-means, and GMM-based clustering (Section III). The algorithms are based on a block coordinate descent (BCD) iteration and yield closed-form updates for each set of optimization variables. In particular, estimating the outliers boils down to solving a group-Lasso problem [37], whose solution is computed in closed form. The novel robust clustering algorithms operate at an affordable computational complexity of the same order as that of their non-robust counterparts.

Several contemporary applications in bioinformatics, (social) network analysis, image processing, and machine learning call for outlier-aware clustering of high-dimensional data, or involve nonlinearly separable clusters. To accommodate these clustering needs, the novel robust clustering algorithms are kernelized in Section IV; and this is the third contribution of our work. The assumed model not only enables such a kernelization for both K-means and the probabilistic setups, but it also yields iterative algorithms with closed-form updates.

In Section V, the algorithms developed are tested using synthetic as well as real datasets from handwritten digit recognition systems and social networks. The results corroborate the effectiveness of the methods. Conclusions are drawn in Section VI.
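The excerpt contains no code, but the methodology it describes (one outlier vector per datum, BCD updates, and a closed-form group-Lasso step) suggests the following minimal NumPy sketch of robust hard K-means. The function name, the parameter `lam`, and the random initialization are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch (not the authors' code) of robust hard K-means via BCD:
#   min_{M, c, O}  sum_n ||x_n - o_n - m_{c(n)}||^2 + lam * sum_n ||o_n||_2
# The group-Lasso penalty on the rows of O drives most outlier vectors to zero.
import numpy as np

def robust_kmeans(X, K, lam, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    N, p = X.shape
    M = X[rng.choice(N, size=K, replace=False)]  # initial centroids
    O = np.zeros((N, p))                         # one outlier vector per datum
    for _ in range(n_iters):
        # 1) Assign each outlier-compensated datum to its nearest centroid.
        D = ((X - O)[:, None, :] - M[None, :, :]) ** 2
        c = D.sum(axis=2).argmin(axis=1)
        # 2) Update each centroid as the mean of its outlier-compensated points.
        for k in range(K):
            if np.any(c == k):
                M[k] = (X - O)[c == k].mean(axis=0)
        # 3) Update outliers: a per-datum group-Lasso problem whose closed-form
        #    solution is block soft-thresholding of the residuals at level lam/2.
        R = X - M[c]
        norms = np.maximum(np.linalg.norm(R, axis=1, keepdims=True), 1e-12)
        O = np.maximum(0.0, 1.0 - lam / (2.0 * norms)) * R
    return M, c, O, np.linalg.norm(O, axis=1) > 0  # last output flags outliers
```

Choosing `lam` large enough shrinks every outlier vector to zero and the iteration reduces to ordinary K-means; the extra thresholding step costs only O(Np) per iteration, consistent with the excerpt's claim that the robust algorithms keep the same order of complexity as their non-robust counterparts.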