Zhipeng Gao, Yan Yang, Chen Zhao, Zijia Mo
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract: The rapid growth of modern mobile devices generates a large amount of distributed data, which is extremely valuable for learning models. Unfortunately, training a model by collecting all of these original data on a centralized cloud server is not applicable due to data privacy and communication cost concerns, hindering artificial intelligence from empowering mobile devices. Moreover, these data are not independently and identically distributed (Non-IID) because they originate in different contexts, which deteriorates model performance. To address these issues, we propose a novel distributed learning algorithm based on hierarchical clustering and Adaptive Dataset Condensation, named ADC-DL, which learns a shared model by collecting the synthetic samples generated on each device. To tackle the heterogeneity of data distributions, we propose an entropy-TOPSIS comprehensive tiering model for hierarchical clustering, which distinguishes clients in terms of their data characteristics. Subsequently, synthetic dummy samples are generated based on the hierarchical structure using adaptive dataset condensation, whose procedure is adjusted adaptively according to the tier of each client. Extensive experiments demonstrate that ADC-DL outperforms existing algorithms in terms of prediction accuracy and communication costs.
Keywords: distributed learning; Non-IID data partition; hierarchical clustering; adaptive dataset condensation
With the dramatic development of the Internet of Things (IoT) in recent years, many sensor-driven intelligent systems such as smart homes, smart transportation and smart buildings [1] are becoming more and more prevalent in our lives. These intelligent systems are superior to humans in terms of safety, timeliness and energy conservation, and have substantially improved the productivity and convenience of society [2]. In most intelligent systems, machine learning is desired to train on a massive amount of data from multiple data sources. However, data deposited in various devices and locations is constrained from being shared with servers or other devices due to privacy concerns and expensive communication costs [3]. Therefore, there is a tendency to explore solutions for implementing machine learning in distributed data scenarios under these limitations.
There are two predominant methods that empower clients to securely share their local information [4]. In the first approach, represented by federated learning [5], data sources share the models or gradients learned on local data. In the other approach, represented by coreset construction [4, 6], data sources share local summaries of their original data, which are aggregated to train a global model. Neither of these two methods requires the clients to share their data directly [7]. Although extensive previous research on these two approaches has achieved data security and reduced communication costs to a certain extent, there are still many deficiencies for further investigation. For federated learning, studies have demonstrated that publicly shared gradients leak information about the original data [8, 9], which renders frequent transmission of the enormous parameters of modern deep neural networks inappropriate. In previous studies of the second approach, the coreset is the most pervasive method for generating local summaries. However, such methods have the following weaknesses: 1) they rely on the selection of clustering centers and the presence of representative samples, which does not guarantee that the generated local summaries are sufficiently valid; 2) some of them are less effective when the local datasets on clients are heterogeneous because they blindly determine the local centers [4].
In terms of accuracy, communication overhead and data privacy, each of these two approaches has its own strengths. Generally, federated learning can achieve slightly better accuracy than coreset-based learning. However, modern neural networks usually have a huge number of parameters, and their frequent transmission results in a massive communication overhead. Therefore, coreset-based learning significantly outperforms federated learning in communication overhead. In terms of security, a study has measured the degree of privacy leakage of the two methods by membership inference attacks [7], indicating that when we need a target model of maximum accuracy, federated learning is more desirable in terms of privacy protection. However, if we can tolerate a mild loss of target model accuracy, coreset-based learning becomes preferable because it can achieve the same accuracy as federated learning with less privacy leakage and a much lower communication cost. In summary, each of the two methods has its own strengths and weaknesses, and the desirable method differs across situations.
As mentioned above, a critical vulnerability of existing approaches is the performance degradation when the data distribution is not independent and identically distributed, which is a considerable challenge in the edge-based learning scenario [10]. Numerous distributed learning algorithms have demonstrated superior performance in situations where the data on clients are independent and identically distributed (IID). However, data is typically unbalanced and Non-IID, as it originates in different contexts. Most existing distributed learning algorithms (e.g., [5, 11-14]) assume the data partitions are IID or have conducted extremely limited research on Non-IID data partitions. Therefore, addressing the Non-IID quagmire is substantial and imperative in distributed learning.
In this paper, we propose ADC-DL, a communication-efficient distributed learning method with hierarchical clustering and Adaptive Dataset Condensation. ADC-DL generates the local summaries by adaptive dataset condensation, which does not depend on the presence of representative samples, ensuring the validity of the local summaries. In ADC-DL, the edge clients train representative synthetic samples adaptively based on the hierarchical structure. Subsequently, all the clients transmit their synthetic samples to the cloud server for global model training. On the one hand, ADC-DL adjusts the size of local summaries adaptively based on our proposed comprehensive hierarchical model, which improves the model performance in the situation of Non-IID data partitions. On the other hand, it is communication-efficient and privacy-preserving by transmitting synthetic dummy summaries instead of the original data.
We are the first to employ dataset condensation to solve the Non-IID dilemma in the scenario of distributed data. Our main contributions are summarized as follows:
• We design a distributed learning approach with hierarchical clustering and adaptive dataset condensation, which is communication-efficient, privacy-preserving and appropriate for Non-IID data partitions.
• We propose the entropy-TOPSIS comprehensive tiering model for hierarchical clustering to distinguish clients according to their data characteristics. The implementation and purpose of our tiering model are innovative in comparison to existing tiering methods.
• We propose adaptive dataset condensation to adjust the size of synthetic samples based on the hierarchical structure, which improves the performance of ADC-DL on Non-IID data partitions.
• We evaluate the performance of ADC-DL by conducting extensive experiments on various datasets and data distribution schemes, which confirm that our ADC-DL outperforms other mainstream methods.
The remainder of this article is organized as follows. In Section II, we introduce the related work. The principle and framework of the proposed ADC-DL, especially the entropy-TOPSIS comprehensive tiering model and adaptive dataset condensation, are presented in Section III. In Section IV, we evaluate the performance of ADC-DL compared with other distributed training methods. Section V concludes the paper.
Distributed learning is one of the most valuable and promising research directions in large-scale machine learning [15], as it avoids the necessity of gathering all the original data into a single server for central processing, saving time and energy [16]. Originally, it was implemented by computing and transmitting the outputs of local models or the models themselves [17]. Kittler et al. design a series of efficient classifier combination rules (e.g., the majority vote rule) to combine all the local outputs into a unique result [18]. Wolpert proposes employing a high-level model to combine the low-level classifiers to improve the accuracy of distributed learning [19]. However, such methods need to define a uniform representation into which the different classifiers are translated, which is challenging to conduct.
Sharing models and sharing representative summaries are two of the most predominant approaches in recent studies. McMahan et al. propose the most prevalent federated learning algorithm, federated averaging (FedAvg) [5]. FedAvg has been certified to work well empirically, especially for nonconvex problems. However, the accuracy of FedAvg degrades dramatically when the data distribution is Non-IID.
In attempting to solve the quagmire of Non-IID data partitions, Zhao et al. first investigated the causes of global model degradation in data heterogeneity scenarios [20]. They also designed a strategy to improve model performance on Non-IID data by establishing a small subset of data that is globally shared among all the edge devices. Experiments indicate that the accuracy of the global model is improved; however, this method results in additional communication costs.
Research on data summarization has inspired the approach of collecting data summaries in distributed learning. A variety of data summarization techniques have been proposed in the literature and are currently implemented in practice (e.g., coresets and sketches [21, 22]). Many researchers have recently adopted this approach for distributed learning. Lu et al. develop a robust coreset construction algorithm based on k-means clustering to reduce the communication overhead in distributed learning [4]. However, the model accuracy of this algorithm is not satisfactory.
Our proposed ADC-DL is a novel distributed learning algorithm based on data summarization. It effectively improves the accuracy of the central model under the premise of privacy security and low communication costs, especially in the setting of Non-IID data distributions.
In this section, we first present the architecture overview of our proposed ADC-DL. Subsequently, we illustrate the two most essential components of ADC-DL: the hierarchical clustering model and adaptive dataset condensation.
Let D_k and N_k denote the original dataset and the condensed dataset of client k respectively, where the number of samples in N_k is considerably smaller than that in D_k. Our ADC-DL learns a mapping Ψ: D_k → N_k by matching θ_D and θ_N, the parameters obtained by minimizing the training loss over the original dataset D_k and the condensed dataset N_k respectively, while adaptively adjusting the number of samples in N_k according to the hierarchical structure. Eventually, the server aggregates all the N_k and constructs the global model.
The framework of ADC-DL primarily consists of the following three parts: hierarchical clustering, adaptive dataset condensation, and aggregation of all the condensed data. The architecture overview is displayed in Figure 1. The main procedure of ADC-DL can be summarized as follows:
• Step 1: Hierarchical Clustering [Figure 1(a)]. All the clients calculate the characteristics of their data distribution locally. The client data score S_k and the hierarchical structure are subsequently acquired by the entropy-TOPSIS comprehensive tiering model.
Figure 1. Overview of the ADC-DL framework, which consists of three parts: (a) hierarchical clustering: evaluate and acquire the client data score and the hierarchical structure via the entropy-TOPSIS comprehensive tiering model; (b) adaptive dataset condensation: condense the data of each client adaptively into a small set of synthetic samples; (c) aggregation of all the condensed data: the server trains the model on the aggregated data and shares it with all the clients.
• Step 2: Global Model Downloading and Adaptive Dataset Condensation [Figure 1(b)]. The server initializes the global model parameters and distributes the model to all the clients. After receiving the global model from the server, each participating client condenses its original data. The size of the synthetic samples is adaptively determined according to the client's position in the hierarchical structure and its data distribution characteristics.
• Step 3: Aggregation of Synthetic Samples and Model Training [Figure 1(c)]. The server aggregates all the synthetic samples and trains the initialized model. Finally, the trained model is transmitted to all the participating clients.
The first two steps will be discussed in detail in the following subsections. After completing these two steps, the server aggregates all the synthetic samples into a single training set, trains the global model on it, and distributes the model to the clients.
Algorithm 1 depicts the training process of ADC-DL. At the beginning of training, the server initializes the parameters of the global model and shares the model with all the clients. Lines 2-3 obtain the hierarchical structure by the entropy-TOPSIS tiering model. In lines 4-7, each client computes its synthetic samples N_k in parallel and transmits them to the server. The implementation details of line 5 are shown in Algorithm 2. Line 8 aggregates all the synthetic samples, and line 10 updates the parameters of the global model θ_{i+1}, where η is the learning rate.
Algorithm 1. Training process of ADC-DL. The M clients are indexed by k, E is the number of local training epochs on the server, and P_{θ_0} is the probability distribution over randomly initialized weights.
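To make the procedure of Algorithm 1 concrete, the following is a minimal Python/PyTorch sketch of the server-side flow. The callables `tiering_model` and `adaptive_condense` stand for the components described in the next subsections; all names are illustrative and this is not the authors' implementation.

```python
import torch

def train_adc_dl(clients, model, loss_fn, tiering_model, adaptive_condense,
                 epochs_E, lr_eta):
    """Sketch of Algorithm 1 (illustrative names, not the authors' code)."""
    # Lines 2-3: hierarchical clustering via the entropy-TOPSIS tiering model.
    scores, tiers = tiering_model(clients)

    # Lines 4-7: each client condenses its local data and uploads the
    # synthetic samples N_k (run in parallel in the paper; sequential here).
    synthetic_sets = [adaptive_condense(c, model, tier)
                      for c, tier in zip(clients, tiers)]

    # Line 8: aggregate all synthetic samples into a single training set.
    xs = torch.cat([x for x, _ in synthetic_sets])
    ys = torch.cat([y for _, y in synthetic_sets])

    # Lines 9-10: E epochs of gradient descent with learning rate eta.
    opt = torch.optim.SGD(model.parameters(), lr=lr_eta)
    for _ in range(epochs_E):
        opt.zero_grad()
        loss_fn(model(xs), ys).backward()
        opt.step()
    return model  # the trained global model is shared with all clients
```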
Algorithm 2. Adaptive dataset condensation.
As shown in Figure 2, ADC-DL stratifies all the clients to facilitate the adaptive dataset condensation described in the following subsection.
Figure 2. Hierarchical clustering of the clients.
Our idea of hierarchical clustering is primarily inspired by the following observations: 1) clients with more extensive data samples eventually contribute more to the trained global model; 2) clients with diverse degrees of Non-IID and imbalance perform differently in model training. Our method of hierarchical clustering considers these factors comprehensively.
Several approaches for layering clients have been proposed in recent studies. Zhang et al. stratify the clients according to the computational speed of edge devices for the implementation of asynchronous federated learning [23]. Briggs et al. stratify the clients by evaluating the similarity between the global model and the local model [24]. Our approach is highly distinct from these existing approaches. On the one hand, the basis of stratification is different: our stratification is based on the data distribution characteristics of each client. On the other hand, the purpose is different: our layering provides the foundation for the subsequent adaptive dataset condensation, reflecting the suitability of individual clients for dataset condensation.
Based on the above analysis, we primarily take the following three factors into consideration when layering the clients: (1) the Non-IID level of the client's data, (2) the number of data samples on the client, and (3) the balance level of the client's data. All these factors have a substantial impact on the process of adaptive condensation on the client side.
According to [20], when all the clients start from the same initialization as in the centralized setting, the Non-IID degree of client k can be quantified by the earth mover's distance (EMD) between its local label distribution and the reference label distribution over the C classes. EMD has been proved to be a good metric for quantifying the Non-IID degree of data [20]; consequently, we follow this definition and employ EMD_k to indicate the Non-IID degree in our paper. The amount of data on a client is another influential factor affecting the size of the condensed dataset, and we represent it by |D_k|. The balance degree of a client's data is a further factor to consider: we perceive that a client with balanced data is favorable for model training, so we take the distance between the data distribution of client k and the balanced (uniform) distribution as its balanced distance (BD_k).
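The original equations are not reproduced above; as a reference, the following LaTeX sketch gives conventional forms consistent with the EMD definition in [20] and with the balanced distance described here, where p_k(y = i) denotes the proportion of class-i samples on client k and p is the reference label distribution. Treat these as assumptions rather than the paper's exact equations.

```latex
% Non-IID degree of client k, following the EMD definition in [20]:
\mathrm{EMD}_k = \sum_{i=1}^{C} \bigl\lVert\, p_k(y=i) - p(y=i) \,\bigr\rVert ,
% Balanced distance of client k: distance of its label distribution from
% the uniform (perfectly balanced) distribution (our assumption):
\mathrm{BD}_k = \sum_{i=1}^{C} \Bigl\lvert\, p_k(y=i) - \tfrac{1}{C} \,\Bigr\rvert .
```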
Subsequently, we set up the entropy-TOPSIS comprehensive tiering model to jointly consider these factors. Entropy-TOPSIS combines the entropy weight method and TOPSIS, which can objectively weight the factors and evaluate the quality of each sample by its closeness to the ideal solution [25]. To construct a uniform metric across clients, we introduce the client data score S_k, which represents the evaluation of a client's suitability for model training. After standardizing the three factors EMD_k, |D_k| and BD_k of each client, we calculate the entropy of each factor over the M clients and then derive the weight of each factor from its entropy.
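A minimal LaTeX sketch of the standard entropy-weight computation is given below, with z_{ki} denoting the standardized value of factor i for client k; this is the conventional form of the entropy weight method and is stated as an assumption, since the paper's exact equations are not reproduced here.

```latex
p_{ki} = \frac{z_{ki}}{\sum_{k=1}^{M} z_{ki}}, \qquad
E_i = -\frac{1}{\ln M} \sum_{k=1}^{M} p_{ki} \ln p_{ki}, \qquad
w_i = \frac{1 - E_i}{\sum_{j} (1 - E_j)},
% so that the weighted value used by TOPSIS below is x_{ki} = w_i \, z_{ki}.
```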
We employ x_{ki} to represent the weighted value of factor i for client k. The ideal client C+ and the negative-ideal client C- are then determined factor-wise, where I = {|D|} contains the larger-the-better factor and I' = {EMD, BD} contains the smaller-the-better factors. We then evaluate the data score of each client by calculating its relative closeness to the ideal client.
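The following LaTeX sketch shows the standard TOPSIS steps that match this description; it is given as an assumption about the omitted equations rather than a reproduction of them.

```latex
% Ideal and negative-ideal clients, determined factor-wise:
C^{+}_i = \begin{cases} \max_k x_{ki}, & i \in I \\ \min_k x_{ki}, & i \in I' \end{cases}
\qquad
C^{-}_i = \begin{cases} \min_k x_{ki}, & i \in I \\ \max_k x_{ki}, & i \in I' \end{cases}
% Distances of client k to the ideal and negative-ideal clients:
d^{+}_k = \sqrt{\textstyle\sum_i (x_{ki} - C^{+}_i)^2}, \qquad
d^{-}_k = \sqrt{\textstyle\sum_i (x_{ki} - C^{-}_i)^2},
% Client data score as the relative closeness to the ideal client:
S_k = \frac{d^{-}_k}{d^{+}_k + d^{-}_k} \in [0, 1].
```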
According to S_k, the participating clients can be partitioned into multiple tiers {tier_1, tier_2, ..., tier_N}, where the clients in tier_1 are the most suitable for training the global model and the clients in tier_N are the least suitable.
Inspired by DC [26], we propose adaptive dataset condensation, which adaptively adjusts the condensation process according to the tier of each client so as to be applicable to Non-IID data.
The goal of dataset condensation is to generate a small set of synthetic samples that achieves performance comparable to the original dataset for a specific model, as shown in Figure 3. This goal can be formulated as a minimization problem between two sets of gradients computed from the training loss over the original and synthetic datasets.
Figure 3. The purpose of dataset condensation is to generate a small set of synthetic samples, which can achieve comparable performance with the original data for a specific model.
Suppose client k has |D| pairs of training samples and labels, on which we want to train a neural network with parameters θ. Generally, θ can be learned by minimizing a loss function l over the training set, yielding the parameters θ_D. Our goal is to acquire a synthetic dataset N with far fewer samples than the raw dataset; the model parameters θ_N trained on N are learned in the same way by minimizing the training loss over N.
We expect the test accuracy of the models with θ_D and θ_N to be comparable. To accomplish this, the model trained on N should converge to a solution similar to that of the model trained on D in the parameter space, i.e., the goal is to minimize d(θ_N, θ_D), where d(·, ·) is a distance function. Furthermore, to achieve the optimal solution, it is not sufficient to make only the final θ_N close to θ_D; the θ_N of every optimization step should follow a path similar to that of θ_D. Therefore, the formulation in Eq. (10) can be simplified to matching θ_N and θ_D at each of the T optimization steps, where T is the total number of optimization steps.
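Since the equations themselves are omitted above, the following LaTeX sketch restates the condensation objective in the gradient-matching form of DC [26]; we assume the paper's Eq. (10) and its step-wise simplification take this conventional form.

```latex
% Parameters learned on the real and synthetic datasets:
\theta_D = \arg\min_{\theta} \mathcal{L}^{D}(\theta), \qquad
\theta_N = \arg\min_{\theta} \mathcal{L}^{N}(\theta), \qquad
\mathcal{L}^{S}(\theta) = \frac{1}{|S|} \sum_{(x, y) \in S} l\bigl(\phi_{\theta}(x), y\bigr).
% Step-wise matching over a training trajectory of T steps, realized in DC [26]
% by matching the gradients of the two losses at every step:
\min_{N} \; \mathbb{E}_{\theta_0 \sim P_{\theta_0}} \left[ \sum_{t=0}^{T-1}
d\Bigl( \nabla_{\theta} \mathcal{L}^{N}(\theta_t), \; \nabla_{\theta} \mathcal{L}^{D}(\theta_t) \Bigr) \right].
```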
Thus, we can acquire the condensed dataset N, which accomplishes performance comparable to the original dataset for a specific model. Note that the condensed dataset N consists of synthetic summaries for training the specific deep neural network rather than a subset of the original samples. Consequently, the raw data of the clients are never exposed during the transmission of these synthetic samples, which adequately protects client privacy.
The size of the synthetic sample set of each client is a crucial parameter of the condensation process and has a substantial impact on the performance of the model trained on the synthetic samples. It is evident that the test accuracy improves as |N| increases, because the synthetic dataset then contains more information from the original dataset. However, the communication costs increase simultaneously with the enlargement of |N|. Therefore, we propose that the size of the synthetic sample set should be adaptively adjusted according to the data characteristics and tier of the client. Specifically, we propose Eq. (12) to calculate the number of synthetic samples, where the parameter λ_i is determined according to the hierarchical structure, with λ_i ∈ [0.2, 1]; a and b represent the range of the client data score S_k, where a is the lousiest client score and b is the most desirable one. The other two parameters μ_1 and μ_2 control the range of |N_k| for each client. The experiment in Section IV shows that this way of determining |N_k| is quite effective.
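Since Eq. (12) itself is not reproduced above, the sketch below shows only one plausible linear instantiation consistent with the quantities described (λ_i ∈ [0.2, 1] from the tier, the score range [a, b], and μ_1 = 0.3, μ_2 = 9.7 as used in Section IV); it is our assumption, not the paper's exact formula.

```python
def condensed_size_per_class(score_k, a, b, lam_tier, mu1=0.3, mu2=9.7):
    """One plausible reading of Eq. (12): map the client data score S_k
    linearly onto [mu1, mu1 + mu2] images per class and scale it by the
    tier-dependent factor lam_tier in [0.2, 1]. Illustrative only."""
    normalized = (score_k - a) / (b - a)        # 0 = lousiest client, 1 = best
    size = lam_tier * (mu1 + mu2 * normalized)  # roughly 0.06 .. 10 per class
    return max(1, round(size))                  # keep at least one image per class
```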
The implementation of adaptive dataset condensation is presented in Algorithm 2. Lines 2-3 initialize the synthetic samples N_k, where the size of N_k is determined by tier_k and Eq. (12). In lines 4-5, l is the number of outer-loop iterations and t is the number of inner-loop iterations. In lines 6-13, stochastic gradient descent is performed to update N_k. Line 15 updates the parameters of the model trained on N_k. Line 18 returns the final N_k.
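The following is a condensed PyTorch sketch of the gradient-matching loop at the core of Algorithm 2, written after the public DC formulation [26]; it matches gradients on full batches rather than per class, and all names (`condense`, `model_fn`, etc.) are ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

def condense(real_x, real_y, model_fn, n_per_class, num_classes,
             outer_loops=500, inner_loops=1, lr_img=0.1, lr_net=0.01):
    """Gradient-matching condensation sketch (after DC [26])."""
    # Lines 2-3: initialize |N_k| synthetic images per class (random noise here).
    syn_x = torch.randn(num_classes * n_per_class, *real_x.shape[1:],
                        requires_grad=True)
    syn_y = torch.arange(num_classes).repeat_interleave(n_per_class)
    opt_img = torch.optim.SGD([syn_x], lr=lr_img)

    for _ in range(outer_loops):                    # line 4: fresh theta_0 ~ P_theta0
        net = model_fn()
        opt_net = torch.optim.SGD(net.parameters(), lr=lr_net)
        for _ in range(inner_loops):                # line 5: steps along the trajectory
            # Lines 6-13: match gradients of real and synthetic losses, update N_k.
            g_real = torch.autograd.grad(
                F.cross_entropy(net(real_x), real_y), net.parameters())
            g_syn = torch.autograd.grad(
                F.cross_entropy(net(syn_x), syn_y), net.parameters(),
                create_graph=True)
            match = sum(((gs - gr.detach()) ** 2).sum()
                        for gs, gr in zip(g_syn, g_real))
            opt_img.zero_grad()
            match.backward()
            opt_img.step()
            # Line 15: advance theta by training the network on the synthetic data.
            opt_net.zero_grad()
            F.cross_entropy(net(syn_x.detach()), syn_y).backward()
            opt_net.step()
    return syn_x.detach(), syn_y                    # line 18: the condensed set N_k
```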
In this section, we compare the performance of ADC-DL with a variety of other distributed training algorithms. We evaluate the prediction accuracy and communication costs respectively based on public datasets.
Datasets. The datasets used in our experiments include: MNIST [27] with 60,000 training images and 10,000 test images in 10 classes; Fashion-MNIST [28] with 60,000 training images and 10,000 test images in 10 classes; SVHN [29] with more than 600,000 real-world images in 10 classes; and CIFAR10 [30] with 50,000 training images and 10,000 test images in 10 classes. In order to analyze the performance of ADC-DL under different data distributions, we investigate four schemes with different client data settings: (a) IID distribution, where the data of each client are independently and identically distributed; (b) mildly skewed Non-IID distribution, where the data are sorted by class and then divided into 70 partitions, and each client receives 1-7 partitions randomly; (c) highly skewed Non-IID distribution, analogous to scheme (b) except that the dataset is divided into 20 partitions and each client receives 1 or 2 partitions; (d) extremely skewed Non-IID distribution, where each client contains data from only one class. The major distinction among these schemes is the degree of data heterogeneity and balance. Scheme (a) is the independent and identically distributed setting, under which most distributed learning algorithms achieve satisfactory performance. The degree of data heterogeneity increases from scheme (b) to scheme (d), rendering it progressively harder for distributed algorithms to achieve favorable performance. We partition each dataset across 30 clients for evaluation in our experiments.
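As a concrete illustration of scheme (b), the following sketch builds such a shard-based partition; the shard bookkeeping (shuffling order, reuse of shards when clients request more than are available) is our own simplification, since the paper does not specify these details.

```python
import numpy as np

def shard_partition(labels, num_clients=30, num_shards=70,
                    min_shards=1, max_shards=7, seed=0):
    """Mildly skewed Non-IID partition: sort samples by class, cut them into
    `num_shards` contiguous shards, and give each client 1-7 random shards."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                    # sample indices sorted by class
    shards = np.array_split(order, num_shards)    # class-sorted shards
    perm = rng.permutation(num_shards)            # random shard hand-out order
    client_idx, s = [[] for _ in range(num_clients)], 0
    for k in range(num_clients):
        for _ in range(rng.integers(min_shards, max_shards + 1)):
            client_idx[k].extend(shards[perm[s % num_shards]].tolist())
            s += 1                                # shards wrap around if exhausted
    return client_idx                             # list of sample indices per client
```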
Model. Regarding the models, we employ the standard ConvNet architecture [31] in our approach. The CNN is a prevalent architecture typically deployed for image recognition. It can be denoted as [W, N, A, P] × D, where D is the number of duplicate blocks and each block has a convolutional layer with W filters, a normalization layer N, an activation layer A and a pooling layer P. In our setting, D is set to 3, and each block has 128 filters followed by ReLU and average pooling modules. A linear classifier is placed at the end of the ConvNet.
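A minimal PyTorch sketch of such a [W, N, A, P] × D network with D = 3 and W = 128 is given below; the choice of instance-style normalization and the default input sizes are our assumptions, since the text does not fix them.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One [W, N, A, P] block: 3x3 conv (W filters), normalization,
    # ReLU activation, and 2x2 average pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(out_ch, out_ch, affine=True),  # instance-norm style; our assumption
        nn.ReLU(inplace=True),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )

class ConvNet(nn.Module):
    """Sketch of the [W, N, A, P] x D ConvNet with D = 3 and W = 128."""
    def __init__(self, channels=1, num_classes=10, img_size=28, width=128, depth=3):
        super().__init__()
        blocks, in_ch, feat = [], channels, img_size
        for _ in range(depth):
            blocks.append(conv_block(in_ch, width))
            in_ch, feat = width, feat // 2          # pooling halves the spatial size
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Linear(width * feat * feat, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```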
Experimental Platform and Hyperparameters. We use Python 3.8.3 and PyTorch 1.6.0 to construct our distributed learning framework, and a single GeForce RTX 2080 Ti GPU for hardware acceleration. In the process of adaptive dataset condensation for each client, we set the learning rate for updating synthetic samples η_img = 0.1, the learning rate for updating network parameters η_net = 0.01, the batch size for the raw dataset bs_1 = 256, the batch size for training networks bs_2 = 256, and the number of training iterations it = 500. In determining |N_k| adaptively, we empirically set the parameters in Eq. (12) to μ_1 = 0.3 and μ_2 = 9.7 to keep the size of each client's synthetic dataset within a reasonable range.
In this subsection, we conduct experiments to evaluate the prediction performance under various datasets and schemes. We compare ADC-DL with two mainstream privacy-protecting distributed training methods: sharing models and sharing local summaries. Subsequently, we investigate the convergence performance of ADC-DL when some clients fail. Finally, we perform a set of ablation experiments to verify the improvement of our ADC-DL.
Comparison to federated averaging. First, we compare ADC-DL with the distributed learning method of federated averaging (FedAvg). We employ the federated averaging setting in [32], which conducts a global aggregation at each training epoch. All the clients transmit their local models to the server after each epoch of local gradient descent, the server aggregates the models by weighted averaging, and it then distributes the aggregated model back to the clients. In this setting of federated learning, the final aggregated model is equivalent to the model trained on the union of all the local data [32].
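For clarity, the weighted-averaging step of this FedAvg baseline can be sketched as below; the helper name and the use of state dicts are our own, and the sketch mirrors the standard FedAvg rule rather than the exact code used in [32].

```python
import copy

def fedavg_aggregate(global_model, client_states, client_sizes):
    """Weighted average of client parameters, performed once per epoch.
    client_states: list of state_dicts uploaded by the clients.
    client_sizes:  number of local samples held by each client (the weights)."""
    total = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:
        new_state[key] = sum(state[key].float() * (n / total)
                             for state, n in zip(client_states, client_sizes))
    global_model.load_state_dict(new_state)       # redistributed to all clients
    return global_model
```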
The prediction accuracy and training epochs of ADC-DL and FedAvg are presented in Table 1. As expected, ADC-DL achieves superior performance to FedAvg in most cases in terms of test accuracy and training epochs. In particular, ADC-DL adjusts the size of synthetic samples according to the data distribution, making it more appropriate for scenarios with data distribution skew.
Table 1. Prediction performance comparison between ADC-DL and FedAvg.
Table 2. Results of ablation experiments for the adaptive scheme. The communication costs for methods with fixed |N_k| are constant for each average EMD. ADC-DL achieves an excellent trade-off between accuracy and communication overhead compared to the methods without the adaptive component.
Comparison to coreset methods. Previously, data summaries in distributed learning were generally constructed by coresets. Therefore, we evaluate ADC-DL against benchmarks including the RCC-kmeans algorithm in [4] and its distributed extensions, i.e., CDCC [6] and DRCC, on different data distributions of four datasets (MNIST, Fashion-MNIST, SVHN and CIFAR10) to demonstrate the advantages of ADC-DL.
Intuitively, each data sample in a coreset represents a set of data samples from the raw dataset [3]. RCC-kmeans is a centralized clustering algorithm with no communication between clients during the computation of the coreset points. CDCC and DRCC are distributed clustering algorithms: they first perform local cluster analysis on individual clients, then transmit partial clustering results to other clients and aggregate them into the final clustering result. Our algorithm has a remarkable advantage over these methods: they compute a generalized data summary for a distributed dataset, considering only the characteristics of the data itself and not the structure of the task model, whereas ADC-DL computes the data summary for a specific model, which is not valid for other models but achieves better performance for the particular model. We parameterize CDCC with k = 2 according to the evaluated k-means problem [6]. The other coreset parameters follow the same setting as [4] to achieve an ideal result. Figure 4 shows the accuracy results of ADC-DL and the benchmarks.
Figure 4. Evaluation of ADC-DL and benchmarks on different datasets.
Figure 5. (a) Convergence performance with respect to the proportion of working devices on the IID data partition of MNIST. A small proportion of device failures has only a slight impact on prediction accuracy. (b) Convergence performance with respect to the proportion of working devices on the highly skewed Non-IID data partition of MNIST. Device failures cause a distinct decline in prediction accuracy.
Figure 6. The model accuracy with respect to the communication costs on MNIST when the data distribution is Non-IID. The X-axis is in log scale.
As Figure 4 illustrates, the prediction performance of ADC-DL is remarkably higher than that of distributed learning based on sharing coresets in all cases, and this dominance is more pronounced on the simpler datasets. In general, the model accuracy degrades as the Non-IID degree of the data gradually increases from scheme (a) to (d). Compared with the other algorithms, the accuracy decline of ADC-DL with the increase of the Non-IID level is acceptable. RCC-kmeans is the best-performing algorithm apart from ADC-DL, and its accuracy is basically constant as the data distribution changes. The CDCC algorithm, which blindly selects k coreset points as local centers at each client, severely suffers from highly heterogeneous data when the classification task is challenging [4]. Although DRCC can customize its configuration to the local dataset and achieve robust performance, its prediction performance is less than ideal. Summing up the above analysis, ADC-DL is relatively outstanding in terms of accuracy and the robustness of performance.
Figure 7. The model accuracy with respect to the communication costs on CIFAR10 when the data distribution is Non-IID. The X-axis is in log scale.
Performance evaluation when some clients malfunction. In distributed settings, there are distinct differences among clients in terms of network connectivity and battery power, which exacerbates challenges such as stragglers [33]. When some devices break down and go offline from the system, they fail to calculate and share their synthetic samples, so the prediction accuracy of the global model is considerably affected. We evaluate the performance of ADC-DL in the face of this challenge, testing the accuracy for device dropping rates of 0%, 10%, ..., 90%. To reveal the relationship between model accuracy and dropping rate more clearly, we use a 3D plot. As shown in Figure 5, the two coordinates on the horizontal plane represent the epoch of model training and the proportion of devices working in order, and the vertical coordinate represents the model accuracy. It can be seen that ADC-DL achieves an acceptable result even if some clients fail when the data distribution is IID. The impact of offline devices on model performance is more pronounced when the data distribution is Non-IID: the model accuracy deteriorates dramatically to below 80% when more than 20% of the devices are out of order, which is unacceptable for the classification task.
Comparison of accuracy vs. communication costs. We show the comparison of model accuracy vs. uploaded bytes in communication on MNIST and CIFAR10 in Figure 6 and Figure 7. The benchmarks include FedAvg, CDCC and DRCC mentioned above, as well as another communication-efficient distributed learning method named SAPS-PSGD [34]. The experimental results show that our proposed ADC-DL can condense the clients' data effectively according to the data distribution, achieving a target model accuracy with lower communication costs on both MNIST and CIFAR10.
On the Non-IID data distribution of MNIST, as shown in Figure 6, to achieve 91.6% test accuracy, SAPS-PSGD requires about 92 MB of communication overhead and FedAvg requires about 286.6 MB, while our proposed ADC-DL needs only 24.1 MB, which is about 3.8× and 11.9× smaller than SAPS-PSGD and FedAvg, respectively. CDCC and DRCC can only achieve around 80% accuracy with more than 100 MB. On the Non-IID data distribution of CIFAR10, as shown in Figure 7, to achieve 60% test accuracy, ADC-DL requires 68 MB, which is 1.64× and 5.5× smaller than SAPS-PSGD (109 MB) and FedAvg (372.1 MB), respectively. The CDCC and DRCC algorithms have higher communication costs and cannot achieve acceptable accuracy. From the above analysis, we conclude that the adaptive scheme based on the hierarchical structure leads to lower communication costs than SAPS-PSGD, FedAvg and the CDCC/DRCC algorithms.
Ablation experiment of our improvement. To further validate the effectiveness of our adaptive scheme, we conduct a set of ablation experiments to contrast the results obtained with ADC-DL against those obtained by directly applying DC to distributed learning. We set |N_k| of all the clients to 10, 100 and 500 respectively for dataset condensation as the control experiments. The prediction accuracy and communication costs are computed on MNIST for each average EMD.
Table 2 presents the accuracy and communication costs of each algorithm as the average EMD rises. Generally, an increase of the average EMD leads to a noticeable degradation in the performance of the algorithms based on dataset condensation, because of the imbalance and variable quality of each class in the union of synthetic samples. It can be observed that when simply applying DC to distributed learning, there is a remarkable improvement (about 4%-7%) in model accuracy as |N_k| rises from 10 to 100. However, when |N_k| rises from 100 to 500, the model accuracy improves only slightly (less than 1%) while the communication costs increase approximately five times. Our ADC-DL dramatically lowers the communication overhead while achieving acceptable accuracy; that is, our algorithm implements an excellent trade-off between accuracy and communication overhead. In general, our proposed scheme of adaptively adjusting the number of synthesized samples is effective and outperforms the non-adaptive algorithms in terms of accuracy and communication overhead.
In this paper, we propose a communication-efficient distributed learning algorithm (ADC-DL) with hierarchical clustering and adaptive dataset condensation. We construct a hierarchical structure to better tackle the challenge of data heterogeneity. Our proposed adaptive dataset condensation ensures data privacy and low communication costs in the process of distributed learning. Experiments on different datasets and data distribution schemes demonstrate that our ADC-DL outperforms other algorithms in terms of prediction accuracy and communication costs. In addition, ablation experiments verify the superiority of the adaptive scheme compared to DC. In this paper, we condense the client data by class; in future research, we plan to assign each synthetic sample a soft label to increase the condensation ratio and further reduce the communication costs.
ACKNOWLEDGEMENT
This work is supported by the General Program of National Natural Science Foundation of China (62072049).