Fengda ZHANG, Kun KUANG, Long CHEN, Zhaoyang YOU, Tao SHEN, Jun XIAO, Yin ZHANG, Chao WU, Fei WU, Yueting ZHUANG, Xiaolin LI
1 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
2 School of Public Affairs, Zhejiang University, Hangzhou 310027, China
3 Tongdun Technology, Hangzhou 310000, China
4 Institute of Basic Medicine and Cancer, Chinese Academy of Sciences, Hangzhou 310018, China
5 ElasticMind.AI Technology Inc., Hangzhou 310018, China
Abstract: To leverage the enormous amount of unlabeled data on distributed edge devices, we formulate a new problem in federated learning called federated unsupervised representation learning (FURL), which aims to learn a common representation model without supervision while preserving data privacy. FURL poses two new challenges: (1) data distribution shift (non-independent and identically distributed, non-IID) among clients would make local models focus on different categories, leading to inconsistency of representation spaces; (2) without unified information among the clients in FURL, the representations across clients would be misaligned. To address these challenges, we propose the federated contrastive averaging with dictionary and alignment (FedCA) algorithm. FedCA is composed of two key modules: a dictionary module that aggregates the representations of samples from each client and shares them with all clients for consistency of the representation space, and an alignment module that aligns the representations of each client to a base model trained on public data. We adopt the contrastive approach for local model training. Through extensive experiments with three evaluation protocols in IID and non-IID settings, we demonstrate that FedCA outperforms all baselines by significant margins.
Key words: Federated learning; Unsupervised learning; Representation learning; Contrastive learning
Federated learning (FL) was proposed as a paradigm that enables distributed clients to collaboratively train a shared model while preserving data privacy (McMahan et al., 2017). Specifically, in each round of FL, clients obtain the global model and update it on their own private data to generate local models, and then the central server aggregates these local models into a new global model. Most existing works focus on supervised FL, in which clients train their local models with supervision. However, the data generated on edge devices are typically unlabeled. Therefore, learning a common representation model for various downstream tasks from decentralized and unlabeled data while keeping private data on devices, i.e., federated unsupervised representation learning (FURL), remains an open problem.
It is natural to consider combining FL with unsupervised approaches, i.e., letting clients train their local models via unsupervised methods. There is a large body of highly successful work on unsupervised representation learning. In particular, contrastive learning methods train models by reducing the distance between representations of positive pairs (e.g., different augmented views of the same image) and increasing the distance between negative pairs (e.g., augmented views from different images), and they have been outstandingly successful in practice (van den Oord et al., 2019; Chen T et al., 2020; Chen XL et al., 2020; He KM et al., 2020). However, their success relies heavily on abundant data for representation training; for example, contrastive learning methods need a large number of negative samples for training (Sohn, 2016; Chen T et al., 2020). Moreover, few of these unsupervised methods take the problem of data distribution shift into account, which is a common practical problem in FL. Hence, it is difficult to simply combine FL with unsupervised approaches for FURL.
In FL applications, however, the data collected by each client are limited, and the data distributions of clients may differ from one another (Jeong et al., 2018; Yang Q et al., 2019; Sattler et al., 2020; Kairouz et al., 2021; Zhao et al., 2022). Hence, we face the following challenges in combining FL with unsupervised approaches for FURL:
1. Inconsistency of representation spaces
In FL, the limited data of each client lead to variation of the data distribution from client to client, resulting in inconsistency of the representation spaces encoded by different local models (Kuang et al., 2020). For example, as shown in Fig. 1a, client 1 has only images of cats and dogs, while client 2 has only images of cars and planes. Then, the model trained locally on client 1 encodes only a feature space of cats and dogs and fails to map cars or planes to appropriate representations, and the same goes for the model trained on client 2. Intuitively, the performance of the global model aggregated from these inconsistent local models may fall short of expectations.
2. Misalignment of representations
Even if the training data of the clients are independent and identically distributed (IID) and the representation spaces encoded by different local models are consistent, there may be misalignment between representations due to randomness in the training process. For instance, for a given input set, the representations generated by one model may be equivalent to those generated by another model only up to a rotation by a certain angle, as shown in Fig. 1b. It should be noted that the misalignment between local models may have drastic detrimental effects on the performance of the aggregated model.
To address these challenges, we propose a contrastive loss based FURL algorithm called federated contrastive averaging with dictionary and alignment (FedCA), which consists of two main novel modules: a dictionary module for addressing the inconsistency of representation spaces and an alignment module for aligning the representations across clients. Specifically, the dictionary module, which is maintained by the server, aggregates abundant representations of samples from the clients, and these can be shared with each client for local model optimization. In the alignment module, we first train a base model on a small public dataset (e.g., a subset of the STL-10 dataset (Coates et al., 2011)) and then require all local models to mimic the base model so that the representations generated by different local models can be aligned. Overall, in each round, FedCA involves two stages: (1) clients train local representation models on their own unlabeled data via contrastive learning with the two modules mentioned above, and then generate local dictionaries; (2) the server aggregates the trained local models to obtain a shared global model and integrates the local dictionaries into a global dictionary.
To the best of our knowledge, FedCA is the first algorithm designed for the FURL problem. Our experimental results show that FedCA performs better than naive methods that solely combine FL with unsupervised approaches. We believe that FedCA will serve as a critical foundation for this novel and challenging problem.
FL enables distributed clients to train a shared model collaboratively while keeping private data on devices (McMahan et al., 2017). Li T et al. (2020) added a proximal term to the loss function to keep local models close to the global model. Wang HY et al. (2020) proposed a layer-wise FL algorithm to deal with the permutation invariance of neural network parameters. However, existing works focus only on the consistency of parameters, while we emphasize the consistency of representations in this study. Some works also focus on reducing the communication cost of FL (Konečný et al., 2017). To further protect the data privacy of clients, cryptography technologies have been applied to FL (Bonawitz et al., 2017).
Learning high-quality representations is important and essential for various downstream tasks (Zhou et al., 2017; Duan et al., 2018). There are two main types of unsupervised representation learning methods: generative and discriminative (Zhuang YT et al., 2017; Lei et al., 2020; Zhu et al., 2020). Generative approaches learn representations by generating pixels in the input space (Hinton and Salakhutdinov, 2006; Kingma and Welling, 2014; Radford et al., 2016). Discriminative approaches train a representation model by performing pretext tasks, where labels are generated for free from unlabeled data (Pathak et al., 2017; Gidaris et al., 2018). Among them, contrastive learning methods achieve excellent performance (van den Oord et al., 2019; Chen T et al., 2020; Chen XL et al., 2020; He KM et al., 2020). The contrastive loss was proposed by Hadsell et al. (2006). Wu ZR et al. (2018) proposed an unsupervised contrastive learning approach based on a memory bank to learn visual representations. Wang TZ and Isola (2020) pointed out two key properties, namely closeness and uniformity, related to the contrastive loss. Other works have applied contrastive learning to videos (Sermanet et al., 2018; Tian et al., 2020), natural language processing (NLP) (Mikolov et al., 2013; Logeswaran and Lee, 2018; Yang ZL et al., 2019), audio (Baevski et al., 2020), and graphs (Hassani and Ahmadi, 2020; Qiu et al., 2020).
Before FL was proposed, there had been some works on unsupervised representation learning in the distributed/decentralized setting, which are easily portable to the FL setting (Kempe and McSherry, 2008; Liang et al., 2014; Shakeri et al., 2014; Raja and Bajwa, 2016; Wu SX et al., 2018). However, different from deep learning methods, the convergence of these methods is limited by the size of the data, and it is difficult for them to achieve good performance on downstream tasks (Lyu, 2020; Pan, 2020; Zhuang YT et al., 2020).
Some concurrent works (van Berlo et al., 2020; Jin et al., 2020) also focus on FL from unlabeled data with deep learning methods. Different from these works, which simply combine FL with unsupervised approaches, we explore and identify the main challenges in FURL and design an algorithm to deal with these challenges. There are some later works aiming to solve our proposed problem (Zhuang WM et al., 2021b). For example, Sattler et al. (2021) proposed to use unlabeled auxiliary data in FL via federated distillation techniques.
To the best of our knowledge, our work is the first to combine contrastive learning with FL, and it has inspired some later works (He CY et al., 2021; Ji et al., 2021; Shi et al., 2022). Li QB et al. (2021) conducted contrastive learning at the model level to correct local training. Wu YW et al. (2021) proposed to exchange the features of clients to provide diverse contrastive data to each client. Zhuang WM et al. (2021a) focused on the unsupervised setting in FL by designing a dynamically contrastive module with an effective communication protocol. Zhuang WM et al. (2022) proposed a new method to tackle the non-IID data problem in FL and filled the gap between FL and self-supervised approaches based on Siamese networks.
In this section, we discuss the primitives needed for our approach. The symbols and their corresponding meanings are given in Table 1.
In FL, each client $u \in U$ has a private dataset $D_u$ of training samples, and our aim is to train a shared model while keeping private data on devices. There are many algorithms designed for aggregation in FL (Li T et al., 2020; Wang HY et al., 2020), and we point out that our approach does not depend on the way of aggregation. Here, for simplicity, we introduce a standard and popular aggregation method named FedAvg (McMahan et al., 2017). In round $t$ of FedAvg, the server randomly selects a subset of clients $U_t \subseteq U$, and each client $u \in U_t$ locally updates the global model with parameters $\theta_t$ on dataset $D_u$ via the stochastic gradient descent rule to generate the local model:
$$\theta_{t+1}^{u} = \theta_t - \eta \nabla_{\theta} L(D_u, \theta_t),$$
where $\eta$ is the step size and $L(D_u, \theta_t)$ is the loss function of client $u$ in round $t$. Then the server gathers the parameters of the local models $\{\theta_{t+1}^{u} \mid u \in U_t\}$ and aggregates these local models via a weighted average to generate a new global model:
$$\theta_{t+1} = \sum_{u \in U_t} \frac{|D_u|}{\sum_{v \in U_t} |D_v|}\, \theta_{t+1}^{u}.$$
The training process above is repeated until the global model converges.
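To make the update and aggregation rules concrete, the following is a minimal PyTorch-style sketch of one FedAvg round. The helper names (`local_update`, `fedavg_aggregate`), the SGD optimizer choice, and the loader/loss interfaces are illustrative assumptions, not code from the original work.

```python
import copy
import torch

def local_update(global_model, loader, loss_fn, lr=1e-3, epochs=1):
    # One client's local update: copy the global weights and run SGD on its private data.
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model, batch)
            loss.backward()
            optimizer.step()
    return model.state_dict(), len(loader.dataset)

def fedavg_aggregate(local_states, num_samples):
    # Weighted average of client parameters; weights are proportional to local dataset sizes.
    total = float(sum(num_samples))
    avg = copy.deepcopy(local_states[0])
    for key in avg:
        avg[key] = sum(state[key] * (n / total) for state, n in zip(local_states, num_samples))
    return avg
```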
Unsupervised contrastive representation learning methods learn representations from unlabeled data by reducing the distance between representations of positive samples and increasing the distance between representations of negative samples. Among them, SimCLR achieves outstanding performance and can be applied to FL easily (Chen T et al., 2020). SimCLR randomly samples a minibatch of $N$ samples and applies two random data augmentations to each sample to obtain $2N$ views. Typically, the views augmented from the same image are treated as positive samples, and the views augmented from different images are treated as negative samples (Dosovitskiy et al., 2014). The loss function for a positive pair of samples $(i, j)$ is defined as follows:
$$\ell_{i,j} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)},$$
where $\tau$ is the temperature and $\mathbb{1}_{[k \neq i]} \in \{0, 1\}$ is an indicator function that equals 1 if and only if $k \neq i$. $\mathrm{sim}(\cdot, \cdot)$ measures the similarity of two representations of samples (e.g., cosine similarity). The model (consisting of a base encoder network $f$ to extract representation $h$ from augmented views and a projection head $g$ to map representation $h$ to $z$) is trained by minimizing the loss function above. Finally, we use representation $h$ to perform downstream tasks.
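As an illustration of the loss above, here is a minimal PyTorch sketch of the standard NT-Xent formulation, assuming the $2N$ projections are arranged so that rows $2k$ and $2k+1$ come from the same image; it is a sketch of the common formulation, not the authors' code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    # z: (2N, d) projections; z[2k] and z[2k+1] are two views of the same image.
    z = F.normalize(z, dim=1)                      # cosine similarity via dot products of unit vectors
    sim = z @ z.t() / temperature                  # (2N, 2N) pairwise similarities
    sim.fill_diagonal_(float('-inf'))              # exclude k == i from the denominator
    n2 = z.size(0)
    targets = torch.arange(n2, device=z.device) ^ 1  # the positive of 2k is 2k+1 and vice versa
    return F.cross_entropy(sim, targets)
```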
In this section, we analyze the two challenges mentioned above and detail the dictionary module and the alignment module designed for these challenges. We then introduce the FedCA algorithm for FURL.
FURL aims to learn a shared model that maps data to representation vectors such that similar samples are mapped to nearby points in the representation space, so that the features are well clustered by class. However, the presence of non-IID data presents a great challenge to FURL. Since the local dataset $D_u$ of a given client $u$ likely contains samples of only a few classes, the local models may encode inconsistent spaces, degrading the performance of the aggregated model.
To empirically verify this, we visualize the representations of images from CIFAR-10 via the t-distributed stochastic neighbor embedding (t-SNE) method. To be specific, we split the training data of CIFAR-10 into five non-IID sets, each consisting of 10 000 samples from two classes. Then, the FedAvg algorithm is combined solely with the unsupervised approach (SimCLR) to learn representations from these sets. We use the local model in the 20th round of the client that has only samples of class 0 and class 1 to extract features from the test set of CIFAR-10 and visualize the representations after dimensionality reduction by t-SNE (Fig. 2a). We find that the scattered representations of samples from class 0 and class 1 are spread over a very large area of the representation space, and it is difficult to distinguish samples of class 0 and class 1 from the others. This suggests that the local model encodes a representation space of samples of class 0 and class 1 only, and it cannot map samples of other classes to suitable positions. The visualization results support our hypothesis that the representation spaces encoded by different local models are inconsistent in a non-IID setting.
We argue that the cause of the inconsistency is that clients can use only their own data to train the local models, while the distribution of data varies from client to client. To address this issue, we design a dictionary module (Fig. 3b). Specifically, in each communication round, clients use the global model (including the encoder and the projection head) to obtain the normalized projections of their own samples and send these projections to the server along with the trained local models. Then, the server gathers the normalized projections into a shared dictionary. For each client, the global dictionary $\mathrm{dict}$ with $K$ projections is treated as a normalized projection set of negative samples for local contrastive learning. Specifically, in the local training process, for a given minibatch $x_{\mathrm{batch}}$ with $N$ samples, we randomly augment it twice to obtain $x_i$ and $x_j$, and generate the normalized projections $z_i$ and $z_j$. Then we calculate the following:
$$\mathrm{logits} = \mathrm{concat}\!\left(\left[\,z_i z_j^{\top},\ z_i\, \mathrm{dict}^{\top}\right],\ \mathrm{dim}=1\right),$$
where concat(·) denotes concatenation, the size of logits is $N \times (N+K)$, and dim = 1 means that the two parts are concatenated along the first dimension. Now, we turn the unsupervised problem into an $(N+K)$-way classification problem and define $\mathrm{label} = (0, 1, \ldots, N-1)$ as a class indicator, i.e., the positive of the $i$th sample lies in the $i$th column. Then the loss function is given as follows:
$$L_{\mathrm{con}} = \mathrm{CE}\!\left(\mathrm{logits}/\tau,\ \mathrm{label}\right),$$
where CE denotes the cross-entropy loss and $\tau$ is the temperature term.

Fig. 3 Illustration of FedCA: (a) overview of FedCA (in each round, clients generate local models and dictionaries, and then the server gathers them to obtain the global model and dictionary); (b) local update of the model (clients update local models by contrastive learning with the dictionary and alignment modules); (c) local update of the dictionary (clients generate local dictionaries via temporal ensembling). In (b), $x_{\mathrm{other}}$ is a sample different from sample $x$, $x_{\mathrm{align}}$ is a sample from the additional public dataset for alignment, $f$ is the encoder, and $g$ is the projection head
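A minimal PyTorch sketch of the dictionary-based contrastive loss described above; the shapes follow the text (logits of size $N \times (N+K)$ with in-batch positives on the diagonal), while the function name and interface are assumptions.

```python
import torch
import torch.nn.functional as F

def dict_contrastive_loss(z_i, z_j, global_dict, temperature=0.5):
    # z_i, z_j: (N, d) projections of two augmented views; global_dict: (K, d) shared negatives.
    z_i, z_j = F.normalize(z_i, dim=1), F.normalize(z_j, dim=1)
    l_batch = z_i @ z_j.t()                       # (N, N): diagonal = positive pairs, off-diagonal = in-batch negatives
    l_dict = z_i @ global_dict.t()                # (N, K): similarities to dictionary negatives
    logits = torch.cat([l_batch, l_dict], dim=1)  # (N, N + K)
    labels = torch.arange(z_i.size(0), device=z_i.device)  # positive for row i sits in column i
    return F.cross_entropy(logits / temperature, labels)
```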
Note that in each round, the shared dictionary is generated by the global model from the previous round, but the projections of local samples are encoded by the current local models. The resulting inconsistencies in representations may affect the function of the dictionary module, especially in a non-IID setting. We use temporal ensembling to alleviate this problem (Fig. 3c). To be specific, each client maintains a local ensemble dictionary consisting of the projection set $Z_u$. In each round $t$, client $u$ uses the trained local model to obtain projections $z_u^t$ and accumulates them into the ensemble dictionary by updating
$$Z_u^t = \alpha Z_u^{t-1} + (1-\alpha)\, z_u^t,$$
and then the normalized ensemble projection is given as
$$\hat{z}_u^t = \frac{Z_u^t}{1-\alpha^t},$$
where $\alpha \in [0, 1)$ is a momentum parameter and $Z_u^0 = 0$.
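A small sketch of this per-client temporal ensembling, following the update rule above; the class name, per-sample indexing, and the final L2 normalization step are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class EnsembleDictionary:
    """Per-client ensemble of sample projections accumulated across communication rounds."""
    def __init__(self, num_samples, dim, alpha=0.5):
        self.alpha = alpha
        self.t = 0
        self.Z = torch.zeros(num_samples, dim)   # Z_u^0 = 0

    def update(self, z_t):
        # z_t: (num_samples, dim) projections produced by this round's trained local model.
        self.t += 1
        self.Z = self.alpha * self.Z + (1.0 - self.alpha) * z_t
        z_hat = self.Z / (1.0 - self.alpha ** self.t)   # startup bias correction
        return F.normalize(z_hat, dim=1)                # normalized local dictionary sent to the server
```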
We visualize the representations encoded by the local models trained via federated contrastive learning with the dictionary module, in the same setting as the vanilla federated unsupervised approach. As shown in Fig. 2b, we find that the points of class 0 and class 1 are clustered in a small subspace of the representation space, which means that the dictionary module works well, as expected.
Due to the randomness in the training process, there might be differences between the representations generated by two models trained on the same dataset, even though these two models encode consistent spaces. The misalignment of representations may have an adverse effect on model aggregation.
To verify this, we use the angle between two representation vectors of the same image encoded by different models to measure the degree of difference in representations. Then we record the angles between representations generated by different local models in FL on CIFAR-10. We split the training data of CIFAR-10 into five IID sets randomly, and each set consists of 10 000 samples from all 10 classes. We randomly select two local models trained by the vanilla federated unsupervised approach (FedSimCLR is used as an example) and use them to obtain normalized representations on the test set of CIFAR-10. As shown in Fig. 4a, there is always a large difference in the angle (beyond 20°) between the representations encoded by the local models during the learning process.
Fig. 4 Box plots of the angles between the representations encoded by local models on the CIFAR-10 dataset in FL with an IID setting: (a) FedSimCLR; (b) FedCA. FL: federated learning; IID: independent and identically distributed
We introduce an alignment module to tackle this challenge. As shown in Fig. 3b, we prepare an additional public dataset $D_{\mathrm{align}}$ of small size and train a model $g_{\mathrm{align}}(f_{\mathrm{align}}(\cdot))$ (called the alignment model) on it. The local models are then trained via the contrastive loss with a regularization term that requires them to replicate the outputs of the alignment model on the alignment dataset. For a given client $u$, the alignment loss is defined as follows:
$$L_{\mathrm{align}} = \frac{1}{|D_{\mathrm{align}}|} \sum_{x \in D_{\mathrm{align}}} \left\| g_u\!\left(f_u(x)\right) - g_{\mathrm{align}}\!\left(f_{\mathrm{align}}(x)\right) \right\|_2^2.
$$
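A minimal PyTorch sketch of this regularizer, under the assumption (as in the reconstruction above) that alignment is enforced by a mean-squared error between the local and alignment models' outputs on a public alignment batch; the function names are illustrative, and the exact form in the original work may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(f_u, g_u, f_align, g_align, x_align):
    # Push the local model's projections toward those of the fixed alignment model
    # on public alignment data (an assumed MSE form of "replicating the outputs").
    with torch.no_grad():                      # the alignment model is frozen during local training
        z_ref = g_align(f_align(x_align))
    z = g_u(f_u(x_align))
    return F.mse_loss(z, z_ref)
```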
We also calculate the angles between the representations of the local models trained via federated contrastive learning with the alignment module (3200 images randomly sampled from the STL-10 dataset are used for alignment), in the same setting as the vanilla federated unsupervised approach. As shown in Fig. 4b, the angles can be kept within 10° after 10 training rounds, suggesting that the alignment module helps align the local models.
From the above, the total loss function for the local model update is given as follows:
$$L = L_{\mathrm{con}} + \beta L_{\mathrm{align}},$$
where $\beta$ is a scale factor controlling the influence of the alignment module. We now have a complete algorithm, named FedCA, that can handle the challenges of FURL well, as shown in Fig. 3.
Algorithm 1 summarizes the proposed approach. In each round, clients update the local models with the contrastive loss and the alignment loss, and then generate local dictionaries. The server aggregates the local models into a global model and updates the global dictionary.
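The sketch below mirrors this round structure in PyTorch-style pseudocode; the `client.local_update` interface, the concatenation-based dictionary update, and all parameter names are assumptions made for illustration and are not taken verbatim from Algorithm 1.

```python
import copy
import torch

def fedca_round(global_model, global_dict, clients, beta=0.01):
    # One communication round: local training with the shared dictionary, then aggregation.
    local_states, local_dicts, sizes = [], [], []
    for client in clients:
        # Assumed interface: trains on private data with the contrastive (dictionary) loss plus
        # beta * alignment loss, and returns updated weights, the ensembled local dictionary,
        # and the local dataset size.
        state, local_dict, n = client.local_update(copy.deepcopy(global_model), global_dict, beta)
        local_states.append(state)
        local_dicts.append(local_dict)
        sizes.append(n)
    total = float(sum(sizes))
    new_state = copy.deepcopy(local_states[0])
    for key in new_state:                        # FedAvg-style weighted average of parameters
        new_state[key] = sum(s[key] * (n / total) for s, n in zip(local_states, sizes))
    global_model.load_state_dict(new_state)
    new_dict = torch.cat(local_dicts, dim=0)     # the server gathers local dictionaries into the global one
    return global_model, new_dict
```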
FURL aims to learn a representation model from decentralized and unlabeled data. In this section, we present an empirical study of FedCA.
5.1.1 Baselines
AutoEncoder is a generative method that learns representations in an unsupervised manner by encoding the input into a reduced representation and reconstructing an output as close as possible to the original input (Hinton and Salakhutdinov, 2006). Predicting rotation is a proxy task of self-supervised learning in which samples are rotated by random multiples of 90° and the model predicts the rotation angle (Gidaris et al., 2018). We solely combine FedAvg with AutoEncoder (named FedAE), predicting rotation (named FedPR), and SimCLR (named FedSimCLR), separately, and use them as baselines for FURL.
5.1.2 Datasets
The CIFAR-10/CIFAR-100 dataset (Krizhevsky, 2009) consists of 60 000 32×32 color images in 10/100 classes, with 6000/600 images per class; both CIFAR-10 and CIFAR-100 contain 50 000 training images and 10 000 test images. The MiniImageNet dataset (Deng et al., 2009; Vinyals et al., 2016) is extracted from the ImageNet dataset and consists of 60 000 84×84 color images in 100 classes. We split it into a training dataset with 50 000 samples and a test dataset with 10 000 samples. We implement FedCA and the baseline methods on the three datasets above in PyTorch (Paszke et al., 2019).
5.1.3 Federated setting
We deploy our experiments in a simulated FL environment, with one centralized node as the server and five distributed nodes as the clients. The number of local epochs is $E = 5$, and in each round all clients obtain the global model and execute local training, i.e., the proportion of selected clients is $C = 1$. For each dataset, we consider two federated settings: IID and non-IID. In the IID setting, each client randomly samples 10 000 images from the entire training dataset, while in the non-IID setting, samples are split among clients by class, which means that each client has 10 000 samples from 2/20/20 classes of CIFAR-10/CIFAR-100/MiniImageNet.
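One way to reproduce this class-based non-IID partition is sketched below; the function name, the random seed, and the NumPy-based interface are illustrative assumptions rather than the authors' script.

```python
import numpy as np

def split_non_iid(labels, num_clients=5, classes_per_client=2, samples_per_client=10000):
    # Partition a labeled dataset so that each client holds samples from only a few classes.
    labels = np.asarray(labels)
    classes = np.unique(labels)
    rng = np.random.default_rng(0)
    rng.shuffle(classes)
    partitions = []
    for c in range(num_clients):
        client_classes = classes[c * classes_per_client:(c + 1) * classes_per_client]
        idx = np.where(np.isin(labels, client_classes))[0]
        rng.shuffle(idx)
        partitions.append(idx[:samples_per_client])   # index array of this client's samples
    return partitions
```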
5.1.4 Training details
We compare our approach with the baseline methods on different encoders, including a five-layer convolutional neural network (CNN) (Krizhevsky et al., 2012) and ResNet-50 (He KM et al., 2016). The encoder maps input samples to representations with 2048 dimensions, and a multilayer perceptron (MLP) then translates the representations into 128-dimensional vectors used to calculate the contrastive loss. Adam is used as the optimizer, with an initial learning rate of $1\times10^{-3}$ and a weight decay of $1\times10^{-6}$. We train the models for 100 epochs with a minibatch size of 128. We set the dictionary size $K = 1024$, the momentum term of temporal ensembling $\alpha = 0.5$, and the scale factor $\beta = 0.01$. Furthermore, 3200 images randomly sampled from the STL-10 dataset are used for the alignment module. Data augmentation for contrastive representation learning includes random cropping and resizing, random color distortion, random flipping, and Gaussian blurring.
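A torchvision sketch of such an augmentation pipeline for 32×32 inputs follows; the paper lists the augmentation types but not their magnitudes, so the crop size, jitter strengths, probabilities, and blur kernel below are assumptions.

```python
from torchvision import transforms

# Assumed magnitudes for CIFAR-scale images; only the augmentation types are stated in the paper.
contrastive_augment = transforms.Compose([
    transforms.RandomResizedCrop(32),                                        # random cropping and resizing
    transforms.RandomHorizontalFlip(),                                       # random flipping
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),  # random color distortion
    transforms.GaussianBlur(kernel_size=3),                                  # Gaussian blurring
    transforms.ToTensor(),
])
```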
5.2.1 Linear evaluation
We first study our method by linear classification on a fixed encoder to verify the representations learned in FURL. We run FedCA and the baseline methods to learn representations on CIFAR-10, CIFAR-100, and MiniImageNet without labels, separately, in a federated setting. Then, we fix the encoder and train a linear classifier with supervision on the entire dataset. We train this classifier with Adam as the optimizer for 100 epochs and report the top-1 classification accuracy on the test sets of CIFAR-10, CIFAR-100, and MiniImageNet.
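A minimal PyTorch sketch of this linear evaluation protocol; the function signature, device handling, and data loader interfaces are assumptions, while the feature dimension and training hyperparameters follow the values stated above.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, train_loader, test_loader, feat_dim=2048, num_classes=10,
                      epochs=100, lr=1e-3, device="cuda"):
    # Freeze the FURL-trained encoder and fit a single linear layer with labels.
    encoder.eval().to(device)
    for p in encoder.parameters():
        p.requires_grad = False
    classifier = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                h = encoder(x)                 # frozen 2048-d representation
            loss = criterion(classifier(h), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    correct = total = 0
    with torch.no_grad():                      # top-1 accuracy on the test set
        for x, y in test_loader:
            x, y = x.to(device), y.to(device)
            pred = classifier(encoder(x)).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total
```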
As shown in Table 2, federated averaging combined with contrastive learning works better than with the other unsupervised approaches. Moreover, our method outperforms all of the baseline methods, owing to the modules designed for FURL, as expected.
Table 2 Top-1 accuracies of algorithms for FURL on linear evaluation
5.2.2 Semi-supervised learning
In federated scenarios, the private data at the clients may be only partly labeled, so we can learn a representation model without supervision and fine-tune it on the labeled data. We assume that the ratio of labeled data at each client is 1% or 10%. First, we train a representation model in the FURL setting. Then, we fine-tune it (followed by an MLP consisting of a hidden layer and a rectified linear unit (ReLU) activation function) on the labeled data for 100 epochs, with Adam as the optimizer and a learning rate of $1\times10^{-3}$.
Table 3 reports the top-1 accuracies of the various methods on CIFAR-10, CIFAR-100, and MiniImageNet. We observe that the accuracy of the global model trained by federated supervised learning on the limited labeled data is quite poor, and using the representation model trained in FURL as the initial model improves performance considerably. Our method outperforms the other approaches, suggesting that FURL benefits from the designed modules of FedCA, especially in the non-IID setting.
Table 3 Top-1 accuracies of algorithms for FURL on semi-supervised learning
5.2.3 Transfer learning
A main goal of FURL is to learn a representation model from decentralized and unlabeled data for personalized downstream tasks. To verify whether the features learned in FURL are transferable, we use the models trained in FURL as initial models and then train an MLP together with the encoder on other datasets. CIFAR images (32×32×3) are resized to the MiniImageNet size (84×84×3) when we fine-tune the model learned from MiniImageNet on CIFAR. We train for 100 epochs with Adam as the optimizer and set the learning rate to $1\times10^{-3}$.
Table 4 shows that the model trained by FedCA achieves excellent performance and outperforms all of the baseline methods in the non-IID setting.
Table 4 Top-1 accuracies of algorithms for FURL on transfer learning
5.3.1 Alignment and dictionary modules
We perform an ablation study on CIFAR-10 in the IID and non-IID settings to demonstrate the effectiveness of the alignment and dictionary modules (with temporal ensembling). We implement (1) FedSimCLR, (2) federated contrastive learning with only the alignment module, (3) federated contrastive learning with only the dictionary module, (4) federated contrastive learning with only the dictionary module based on temporal ensembling, and (5) FedCA, and then a linear classifier is used to evaluate the performance of the frozen representation model with supervision. Fig. 5 shows the results.
Fig. 5 Ablation study of the modules designed for FURL by linear classification on CIFAR-10 (ResNet-50). FURL: federated unsupervised representation learning; IID: independent and identically distributed
We observe that the alignment module improves the performance by 1.4% in both the IID and non-IID settings. With the help of the dictionary module (without temporal ensembling), the accuracy increases by 2.5% and 2.7% under the IID and non-IID settings, respectively. Moreover, we note that the representation model learned in FURL benefits more from the temporal ensembling technique in the non-IID setting than in the IID setting, probably because the features learned in the IID setting are already stable enough that temporal ensembling plays a far less important role there. The model achieves excellent performance when we combine federated contrastive learning with the alignment module and the dictionary module based on temporal ensembling, which suggests that the two modules work collaboratively and help tackle the challenges in FURL.
5.3.2 Coefficient of alignment loss
To explore the effect of the alignment loss coefficient $\beta$, we run our algorithm on the CIFAR-10 dataset (IID setting, five-layer CNN) with different values of the hyperparameter $\beta$.
The results are shown in Table 5. We find that the value of $\beta$ has only a slight effect on the performance of the federated representation model. The reason for the performance differences may be that a small value of $\beta$ cannot make the local models become aligned, so the performance of the aggregated model is degraded, whereas a large value of $\beta$ limits the contribution of the contrastive loss, so the quality of the learned representations cannot be guaranteed. We suggest that, in practice, an appropriate value of $\beta$ be selected on a small subset of data before formal federated training.
Table 5 Ablation study for coefficient of alignment loss β
We formulate a significant and challenging problem, termed federated unsupervised representation learning (FURL), and identify its two main challenges (inconsistency of representation spaces and misalignment of representations). In this paper, we propose a contrastive learning based FL algorithm named FedCA, composed of a dictionary module and an alignment module, to tackle these challenges. Owing to these two modules, FedCA enables distributed local models to learn consistent and aligned representations while protecting data privacy. Our experimental results demonstrate that FedCA outperforms algorithms that solely combine FL with unsupervised approaches and provides a stronger baseline for FURL.
In future work, we plan to extend FedCA to cross-modal scenarios where different clients may have data in different modalities, such as images, videos, text, and audio.
Contributors
All authors contributed to the study conception and design. Fengda ZHANG, Chao WU, and Yueting ZHUANG proposed the motivation. Fengda ZHANG, Kun KUANG, and Long CHEN designed the method. Fengda ZHANG, Zhaoyang YOU, and Tao SHEN performed the experiments. Fengda ZHANG drafted the paper, and all authors commented on previous versions of the paper. Jun XIAO, Yin ZHANG, Fei WU, and Xiaolin LI revised the paper. All authors read and approved the final version.
Compliance with ethics guidelines
Fei WU and Yueting ZHUANG are editorial board members of Frontiers of Information Technology & Electronic Engineering. Fengda ZHANG, Kun KUANG, Long CHEN, Zhaoyang YOU, Tao SHEN, Jun XIAO, Yin ZHANG, Chao WU, Fei WU, Yueting ZHUANG, and Xiaolin LI declare that they have no conflict of interest.
Data availability
The data that support the findings of this study are openly available in public repositories.