Deep Domain-Adversarial Anomaly Detection With One-Class Transfer Learning

2023-03-09 01:04:22 Wentao Mao, Gangsheng Wang, Linlin Kou, and Xihui Liang

IEEE/CAA Journal of Automatica Sinica, 2023, Issue 2


Abstract—Despite the great success of transfer learning techniques in anomaly detection, it is still challenging to transfer detection rules well when only the preferred (normal) data are available, as in anomaly detection with one-class classification, especially for data with a large distribution difference. To address this challenge, a novel deep one-class transfer learning algorithm with domain-adversarial training is proposed in this paper. First, by integrating a hypersphere adaptation constraint into a domain-adversarial neural network, a new hypersphere adversarial training mechanism is designed. Second, an alternating optimization method is derived to seek the optimal network parameters while pushing the hyperspheres built in the source domain and target domain to be as identical as possible. By transferring the one-class detection rule during the adaptive extraction of domain-invariant feature representations, end-to-end anomaly detection with one-class classification is enhanced. Furthermore, a theoretical analysis of the model reliability, as well as strategies for avoiding invalid and negative transfer, is provided. Experiments are conducted on two typical anomaly detection problems, i.e., image recognition detection and online early fault detection of rolling bearings. The results demonstrate that the proposed algorithm outperforms the state-of-the-art methods in terms of detection accuracy and robustness.

I. INTRODUCTION

ANOMALY detection is a general term used to describe the task of discovering atypical data against expected patterns [1]. Various anomaly detection cases can be found in the real world. For instance, anomaly detection can avoid social panic by identifying fake news in social media [2]. In the field of information security, the leakage of private information can be avoided through anomaly detection, also called intrusion detection [3]. Fraud detection likewise protects society from financial crime and other malicious acts of fraud [4], [5]. There are various implementations of anomaly detection, including anomaly measure-dependent feature learning, ranking models, prior-driven models, softmax likelihood models, and so on [6]. Binary or multi-class classification techniques can be applied to distinguish between two or more classes when the training set contains objects from all the classes. In some real-world applications, however, such as the monitoring of wind gearboxes, motor failure prediction, or the operational status of a shield machine, there are very few, if any, examples of catastrophic system states, while only the statistics of normal operation are known. As an effective technique to solve this problem, one-class classification adopts preferred data, or normal data, to construct a detection model for identifying anomalous data, and has been widely used in various fields such as industrial condition monitoring, network intrusion detection, etc. To distinguish it from the other implementations, we use the term one-class anomaly detection [7], [8] to denote one-class classification for anomaly detection.

In the past decade, machine learning techniques have been successfully applied to solve the problem of one-class anomaly detection. Some traditional machine learning algorithms, such as support vector machine (SVM) [9], one-class SVM (OCSVM) [10], isolation forest (iForest) [11], and so on, have been proven effective in constructing detection models based on handcrafted features. Recently, thanks to their good capability of self-adaptive feature extraction, deep neural networks have also shown their validity in addressing such a problem. Developing an effective deep learning model for the disparate tasks of one-class anomaly detection is becoming a research hotspot and is receiving growing attention from industry and academia.

In general, deep anomaly detection methods rely on massive data for model training. But in actual applications, this premise is too strict. For example, in industrial condition monitoring, it usually takes a long period to accumulate enough data for training a deep neural network. If the number of representative data is insufficient, model bias as well as over-fitting will result. It is natural to borrow historically accumulated data from the offline stage to facilitate model training. However, there may exist a large difference between the offline and online working conditions, e.g., a test environment in the laboratory vs. the actual online environment. This phenomenon is called working condition drift [12]. Even for components with the same manufacturing specifications, the data collected in the online and offline stages easily have different distribution characteristics. Since this violates the i.i.d. precondition, the detection model built with the offline data would not be suitable for the online data.

As a kind of machine learning, transfer learning learns knowledge from a source data distribution (called the source domain) and transfers it to a related but different task (called the target domain). In recent years, transfer learning with deep neural networks, i.e., deep transfer learning, has become a promising tool in various fields such as object recognition, health management, and fault diagnosis [13]–[15]. Presently, transfer learning techniques have been utilized to train a trustworthy one-class classification model for anomaly detection when representative data are lacking, which is also named one-class transfer learning [16]. With only one class of data (normal data) available, this technology aims to adapt the previously learned detection rule to the new system rather than retraining a new rule after gathering a sufficient number of data. This operation could reduce the model bias and avoid a long wait for data acquisition. Here we take handwritten character recognition as an example for better elaboration, as shown in Fig. 1. When the number of character images is insufficient and the image quality is low, direct modeling using such images will reduce the detection accuracy. If we introduce some identical character images with different writing styles for auxiliary modeling, the detection accuracy could still be decreased by the drift of shape and pixels, which indicates different data distribution characteristics. Therefore, it is necessary to transfer the detection rule learnt in the source domain, rather than data information, to the target domain. For realizing this task, however, the current one-class transfer learning methods face some challenges (see Section II): 1) Weak representation of the detection rule, since most current methods work on shallow models and handcrafted features; 2) Limited transfer effect for data with a large distribution difference, since current methods generally pay more attention to the transition of model parameters rather than the detection rule; 3) Lack of theoretical analysis of the model reliability, since the current methods merely focus on model construction.

Fig. 1. Schematic diagram of deep one-class transfer learning for handwritten character recognition. The key idea is the transfer of the detection rule with self-adaptive feature extraction.

Following the aforementioned discussions, the ideas to improve the performance of one-class transfer learning should be: 1) Building an appropriate adaptation model to accurately transfer the one-class detection rule, which can improve the model adaptation capability across different domains; 2) Extracting domain-invariant feature representations adaptively, which can improve the feature adaptation capability for data with different distribution characteristics. Based on these ideas, this paper proposes a new deep one-class transfer learning algorithm with domain-adversarial training, called deep domain-adversarial anomaly detection (DAAD). The target of DAAD is to learn a transferable anomaly detection model that can borrow (not directly train) the detection rule from different but related datasets. By adopting the domain-adversarial neural network as the baseline framework, DAAD first constructs a new one-class adaptation constraint based on the hyperspheres that enclose the network representations of the data in the source domain and target domain. Moreover, DAAD proposes a deep hypersphere adversarial training mechanism to transfer the detection rule while obtaining the optimal domain-invariant feature representation. This mechanism still belongs to feature adaptation, but unlike traditional domain adaptation techniques that extract common features without a specific purpose, it realizes feature adaptation under the guidance of transferring detection rules. Furthermore, this paper provides a theoretical analysis of model reliability, and presents some strategies for avoiding invalid transfer and negative transfer. Finally, the experimental results on the MNIST-USPS/Office-Home image datasets and the IEEE PHM Challenges 2012/XJTU-SY rolling bearing datasets demonstrate the effectiveness of the proposed algorithm.

The main contributions of this paper can be summarized as:

1) This paper proposes a new deep transfer learning algorithm for one-class anomaly detection. Unlike current anomaly detection methods with transfer learning, this algorithm can realize the transition of the one-class detection rule in the process of adaptively extracting domain-invariant feature representations. The end-to-end detection performance can then be enhanced. To the best of the authors' knowledge, work on one-class transfer learning for anomaly detection using the adversarial-based strategy is still in its infancy.

2) This paper provides strategies for avoiding invalid transfer and negative transfer in theory. Based on the proposed DAAD algorithm, this paper not only analyzes how an invalid transfer occurs in the process of model training, but also identifies the factors that influence the transition of the detection rule. According to the literature survey, this work is the first attempt at a reliability analysis of one-class transfer learning for anomaly detection.

The rest of this paper is organized as follows. Section II reviews preliminary works on anomaly detection and transfer learning. Section III elaborates the implementation of the proposed DAAD algorithm. Section IV is devoted to a detailed reliability analysis of the transfer learning process in DAAD. Section V shows the empirical results on two kinds of datasets, i.e., image datasets and rolling bearing datasets, followed by a conclusion in the last section.

II. PRELIMINARY WORK

Since only normal data are needed, one-class anomaly detection is well suited to dealing with the problem of data imbalance in real applications. One-class anomaly detection can be divided into two problems: 1) Classification with normal and abnormal classes. Here the abnormal samples need to be extracted from unlabeled data by means of some information metrics, e.g., mapping-convergence [17], and a binary classifier can then be built. 2) One-class classification with only normal data. Currently, most one-class anomaly detection algorithms adopt the latter strategy, and a representative algorithm is support vector domain description (SVDD) [18]. Rooted in the traditional SVM, SVDD finds the smallest hypersphere in a feature space that encloses the normal data. The optimal hypersphere can be viewed as the detection rule.

In recent years, deep learning techniques have provided an end-to-end detection scheme for one-class anomaly detection [6], [19], [20]. As one of the earlier works, Ruff et al. [21] proposed Deep SVDD, which utilizes a deep neural network to find the optimal hypersphere with self-adaptive feature extraction. Aytekin et al. [22] adopted a stacked autoencoder network to cluster the encoded sequence by using the K-means algorithm and an L2 regularizer, and then recognized anomalies. Zheng et al. [23] utilized a generative adversarial network (GAN) to achieve anomaly detection by generating synthetic data and distinguishing abnormal data from the synthetic data via a one-class classifier. Compared to shallow models, deep anomaly detection algorithms can offer adaptive feature representation and do not explicitly impose a prior on the data.

When representative data are lacking, anomaly detection usually fails to train a trustworthy model. While transfer learning has had many successes in anomaly detection [24]–[26], these works mainly adopt shallow models. Xiao et al. [16] built an uncertain one-class transfer learning model with SVM (UOCT-SVM) to transfer domain knowledge from multiple source domains, and adopted an iterative optimization algorithm to reduce the impact of uncertain information. Xue and Beauseroy [27] used one-class SVM to extract detection rules and constructed a kernel adaptation method to implement rule transition for multiple anomaly detection tasks. To directly control the transfer effect, this method introduced a parameter μ to adjust the weight ratio of the historical task and the current task in the training data. Sindagi and Srivastava [28] proposed an adaptive SVDD algorithm for one-class anomaly detection to incrementally reduce the domain shift. To select transferable features for one-class transfer learning, Chen and Liu [29] learned a regression model from the normal distribution to the abnormal distribution in the source domain, which can not only distinguish the abnormal data but also predict the abnormal distribution in the target domain. Xie et al. [30] proposed a transfer learning-based one-class recommendation model. This model adopted SVM and dictionary representation to learn and transfer user preference information from streaming data at different stages, thus improving the recommendation effect with insufficient data. However, the abovementioned methods mainly transfer the model parameters based on handcrafted features. As a result, the transfer effect of the detection rule is relatively weak. Moreover, although these works can choose tunable parameters [27] or design iterative optimization algorithms [26] to control the uncertainty of the transfer process, a theoretical analysis of the reliability of one-class transfer learning is not yet available, including the algorithmic performance boundary, invalid transfer, negative transfer, etc.

Deep neural networks have also been proven promising in transferring the feature representation learnt on large-scale, fully-supervised data to other tasks with insufficient training data. In terms of transfer channel, deep transfer learning techniques can be divided into four strategies [31]: instance-based strategy, mapping-based strategy, network-based strategy, and adversarial-based strategy. Nowadays, the network-based strategy is popular in transfer learning for anomaly detection, mainly by freezing and fine-tuning the network parameters [32], [33]. This strategy was initially designed for labeled data. For one-class classification, which has no labeled anomalous data, it is feasible to pre-train a network with normal data by means of unsupervised learning techniques and then fine-tune a one-class classifier on top of the pretrained network. As a typical method, Zhou et al. [34] first pre-trained a variational autoencoder (VAE) network with normal data and then fine-tuned a deep SVDD model whose model parameters come from the VAE network. For the unsupervised anomaly detection of images, Bergmann et al. [35] proposed a knowledge distillation method to extract and transfer the in-depth representation information in normal data. Specifically, the distillation model is composed of a large pretrained (teacher) network and a small (student) network. The student network is trained with the normal data from the target domain to match the teacher network. Knowledge is then transferred from the teacher network to the student network by minimizing the difference between the two networks' outputs. It is worth noting that the networks are not fixed to a specific form. One can certainly use deep SVDD to build a teacher network and a student network for one-class classification.

To tackle unlabeled anomalous data, self-supervised learning (SSL) techniques have been introduced into anomaly detection. Sohn et al. [36] proposed a two-stage one-class classification framework. It first employed data augmentation and contrastive learning to learn supervised information directly from normal data and then built a one-class classifier. Although this method does not involve transfer learning, one can easily extend it to the transfer learning case by pre-training an SSL network in the source domain and fine-tuning the one-class classification network in the target domain. There is another critical issue, feature deterioration, in anomaly detection with pretrained deep features. To prevent premature feature collapse, Reiss et al. [37] proposed a pretrained anomaly detection adaptation (PANDA) method by introducing an elastic weight consolidation (EWC) regularization into the deep SVDD architecture. This regularization utilizes the diagonal of the Fisher information matrix to measure the change between each pretrained network weight and its fine-tuned value. Despite being able to prevent all input samples from being mapped to the same point, this work requires strong pretrained feature extractors. SSL is then recommended for learning strong features with limited normal training data.

In summary, these methods generally seek feature adaptation and parameter fine-tuning for transfer learning. The one-class transfer learning is supposed to be achieved in an aligned feature space. But the obtained common features or fine-tuned parameters do not guarantee that the target task can learn the detection rules well from the source tasks. More measures are required to realize effective transfer of detection rules based on feature adaptation or parameter fine-tuning. Moreover, these methods heavily depend on the quality of the target domain data. The detection performance will be reduced significantly with noise interference or insufficient representative samples.

Fig. 2. Model structure of the proposed algorithm DAAD.

Inspired by GAN, the adversarial-based strategy with domain adversarial training is more suitable for data with a large distribution difference than the other three strategies. For example, Mao et al. [38] proposed a structured domain adversarial neural network (DANN) that transfers structured output information to improve the accuracy and stability of bearing fault diagnosis. Li et al. [39] proposed a new domain generalization framework, MMD-AAE, which employed the maximum mean discrepancy (MMD) and domain adversarial training to adapt the data distribution of different domains. However, traditional adversarial-based methods usually need to simultaneously evaluate the discriminative loss of all domains. They mainly tackle transfer learning problems with multiple classes and cannot be directly applied to one-class transfer learning. According to the literature survey, research on one-class transfer learning with deep learning is still in its infancy. A typical work is that of Yang et al. [40], who proposed an adversarial-based anomaly detection method named invariant representation anomaly detection (IRAD). This method adopts two encoder networks: one is private to the source domain and the other is shared by the two domains. The two encoder networks are optimized in adversarial training to extract domain-invariant feature representations while making the source features extracted by the two encoders as different as possible. However, the training procedure of this model is less robust, as the model's performance can be degraded by insufficient representative samples and over-training, which was also observed by the authors of [40] in their experiments. To detect anomalous log records, Han and Yuan [41] proposed a transferable log anomaly detection (LogTAD) framework with an adversarial domain adaptation technique. Inspired by deep SVDD, LogTAD tries to enclose all the log sequences from both source and target systems in a hypersphere with the same center. With a long short-term memory (LSTM) encoder, LogTAD seeks to make log data from different systems have a similar distribution so that the model can recognize anomalies from multiple systems. Despite satisfactory results on log data, LogTAD would suffer from low-quality data in the target domain. For instance, irregular noise in the target domain data will easily enlarge the hypersphere, biasing the detection rule. Evidently, direct adversarial training cannot effectively transfer detection rules.

In general, research on one-class transfer learning for anomaly detection has the following limitations: 1) Current deep transfer learning techniques do not work well in the transition of the one-class detection rule, which is supposed to be effectively integrated into the adaptive feature extraction. 2) Most current methods adopt the strategy of model parameter transfer, but they are not yet robust for data with a large distribution difference. These limitations are the initial motivation of this paper.

III. THE PROPOSED ALGORITHM

This section proposes a new one-class transfer learning algorithm, DAAD, for anomaly detection. DAAD adopts the DANN network as the baseline framework, and designs a hypersphere adversarial training mechanism to transfer the detection rule in the process of domain adaptation. The model structure of this algorithm is shown in Fig. 2; it is mainly composed of a domain-shared encoder, a one-class detection classifier, a domain label discriminator, and a hypersphere adaptation loss. The essential idea is transferring the one-class detection rule based on extracting domain-invariant feature representations. Since the detection rule in a one-class anomaly detection task can be described in the form of a hypersphere, the hypersphere adaptation loss is designed to realize the transition of the one-class detection rule. The implementation of each part will be elaborated in the following subsections.

A. Implementations

The training process of Fig. 2 runs in an adversarial learning mechanism. The source domain data and target domain data are fed into the feature extractor of the encoder (blue part). For one-class classification, each domain only has normal data. The network's shared weights first learn discriminative information from the input data. Then the last hidden state of the feature extractor is passed to a fully-connected layer. The output of the fully-connected layer is mapped to a one-class detection classifier (purple part) and yields the hypersphere empirical loss on the two domains' data. The output is also mapped to a discriminator (brown part) and yields the adversarial loss. With a hypersphere adaptation constraint, the detection classifier also yields the hypersphere adaptation loss. To promote convergence, the source features and target features from the fully-connected layer are used to calculate the maximum mean discrepancy (MMD) loss. Finally, backward propagation is run iteratively to reach equilibrium, i.e., making the shared weights as indiscriminate as possible with respect to the shift between the two domains while as discriminative as possible for the one-class classification. The implementation of each part in Fig. 2 is as follows.

1) Domain-Shared Encoder (G): composed of a feature extractor and a fully-connected layer, and used to extract domain-invariant feature representations by mapping the source domain data and target domain data into a common feature space.

The input sample sets of the two domains can be denoted by $X_S$ and $X_T$, where the superscripts $S$ and $T$ indicate the source domain and target domain, respectively. For the input space $\mathcal{X}\subseteq\mathbb{R}^d$ and output space $\mathcal{F}\subseteq\mathbb{R}^m$, the encoder $G$ aims to seek a neural network $\phi(\cdot;W_f):\mathcal{X}\to\mathcal{F}$ to map the input samples into an $m$-dimensional feature space $\mathcal{F}$, where $W_f=(w_1,\ldots,w_L)$ and $w_l$ is the network weight of the $l$-th hidden layer. Here $G$ can be any type of deep neural network, such as a convolutional neural network (CNN), a stacked autoencoder, and so on, expressed as

$$\phi(x;W_f)=\sigma\big(w_L\cdot\sigma(w_{L-1}\cdots\sigma(w_1\cdot x))\big) \qquad (1)$$

where "·" indicates the linear operation (e.g., matrix multiplication or convolution) and $\sigma(\cdot)$ is the activation function. The specific model of $G$ should be chosen according to the application requirement. It is worth noting that the network weight of $G$ does not contain the bias term $b$, which may cause invalid transfer. Please refer to Section IV for more theoretical analysis.
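The bias-free, layer-by-layer mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the fully-connected layers, the ReLU activation, the helper names `matvec` and `encode`, and the hand-picked values in `W_f` are all assumptions made for the example.

```python
def relu(v):
    """Element-wise ReLU activation (an assumed choice of sigma)."""
    return [max(0.0, a) for a in v]


def matvec(w, x):
    """Linear operation w . x for a weight matrix stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def encode(x, weights, activation=relu):
    """Bias-free encoder: apply sigma(w_l . z) for each layer in turn."""
    z = x
    for w in weights:
        z = activation(matvec(w, z))
    return z


# Hypothetical 3 -> 2 -> 2 network; the weights are illustrative only.
W_f = [
    [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]],
    [[1.0, -1.0], [0.0, 2.0]],
]
features = encode([2.0, 1.0, 1.0], W_f)
print(features)
```

Because no bias is added anywhere, the only way this network can produce a constant output for every input is for a whole layer of weights to vanish, which is the situation the reliability analysis in Section IV is concerned with.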

2) One-Class Detection Classifier: used to obtain the source hypersphere and target hypersphere based on the common features from $G$.

With the network weight $W_f$ of $G$, DAAD minimizes the volumes of the source hypersphere and target hypersphere in the feature space $\mathcal{F}$, whose radii and centers are $R=\{R_S,R_T\}$ and $C=\{C_S,C_T\}$, respectively. Then the empirical loss $\mathcal{L}_E$ can be built as [21]

where $\nu$ is the penalty factor that controls the trade-off between the target hypersphere and source hypersphere, and the remaining symbols denote the sample numbers of the source domain and target domain, respectively. It is clear that minimizing (2) can not only minimize the volumes of the two hyperspheres built with the features of $G$, but also penalize the samples located outside the hyperspheres.
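To make the two effects of this loss concrete (shrinking the spheres and penalizing outside samples), the sketch below evaluates a soft-boundary hypersphere loss in the style of Deep SVDD [21] on each domain and sums the two. The exact form of (2) is not recoverable from this copy, so the per-domain factors `nu_s` and `nu_t`, the equal weighting of the two domains, and all function names are assumptions.

```python
def sq_dist(z, c):
    """Squared Euclidean distance between a feature vector and a center."""
    return sum((zi - ci) ** 2 for zi, ci in zip(z, c))


def soft_boundary_loss(features, center, radius, nu):
    """R^2 plus the averaged hinge penalty on features outside the sphere."""
    n = len(features)
    slack = sum(max(0.0, sq_dist(z, center) - radius ** 2) for z in features)
    return radius ** 2 + slack / (nu * n)


def empirical_loss(feat_s, feat_t, c_s, c_t, r_s, r_t, nu_s=0.1, nu_t=0.1):
    """Sum of source and target hypersphere losses (assumed weighting)."""
    return (soft_boundary_loss(feat_s, c_s, r_s, nu_s)
            + soft_boundary_loss(feat_t, c_t, r_t, nu_t))


# Toy features: one source point lies outside its sphere, no target point does.
feat_s = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
feat_t = [[0.0, 1.0], [0.0, 2.0]]
loss = empirical_loss(feat_s, feat_t, [0.0, 0.0], [0.0, 0.0], 1.0, 2.0,
                      nu_s=0.5, nu_t=0.5)
print(loss)
```

In this toy case the target domain contributes only its squared radius, while the source domain pays an extra penalty for the single feature at distance 3 from the center.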

3) Domain Label Discriminator (D): used to identify which domain the data come from.

The discriminator $D$ is trained using the source domain data and target domain data together in an adversarial training process. Then the domain loss function $\mathcal{L}_D$ can be built as

where $W_d$ is the network weight of the discriminator $D$. Maximizing (3) can make $D$ more discriminative. Once $D$ is unable to recognize which domain the data belong to, the features produced with $W_f$ can be viewed as the domain-invariant feature representation.
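One common way to write a domain loss that the discriminator maximizes is the log-likelihood of the correct domain labels, as in standard domain-adversarial training. The sketch below assumes that form; it takes the discriminator's predicted probabilities as given, and the function name and the `eps` stabilizer are hypothetical.

```python
import math


def domain_log_likelihood(p_source, p_target, eps=1e-12):
    """Average log-likelihood of the correct domain labels.

    p_source and p_target hold the discriminator's predicted probability
    of "source" for source-domain and target-domain features respectively.
    """
    ll_s = sum(math.log(p + eps) for p in p_source) / len(p_source)
    ll_t = sum(math.log(1.0 - p + eps) for p in p_target) / len(p_target)
    return ll_s + ll_t


# A discriminator that separates the domains well scores higher than one
# that outputs 0.5 everywhere, i.e. one facing domain-invariant features.
confident = domain_log_likelihood([0.9, 0.8], [0.1, 0.2])
confused = domain_log_likelihood([0.5, 0.5], [0.5, 0.5])
print(confident, confused)
```

Under this assumed form, the all-0.5 case attains the value 2 log 0.5, which is what the discriminator is driven towards when the encoder succeeds in hiding the domain.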

4) Hypersphere Adaptation Loss: used to control the degree of transition of one-class detection rules by evaluating the distribution difference between the target hypersphere and source hypersphere in the feature space.

With the hypersphere form, the transition of the detection rule indicates that the hyperspheres constructed in the target domain and source domain should be as close as possible. Geometrically, the radii and centers of the two hyperspheres should be as identical as possible. Here, by adopting the Euclidean distance to evaluate the geometric divergence between the two hyperspheres, a hypersphere adaptation loss $\mathcal{L}_{C,R}$ is built as

where $C_S$, $C_T$, $R_S$, $R_T$ are the centers and radii of the source hypersphere and target hypersphere, respectively. Minimizing (4) pushes the two hyperspheres to overlap geometrically. Consequently, the data of the two domains have identical distribution characteristics under the feature representation given by $W_f$.
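A minimal sketch of such a geometric penalty is given below. It assumes the loss is the squared Euclidean gap between the centers plus the squared gap between the radii; the original expression may weight or square the terms differently, and the function name is hypothetical.

```python
def hypersphere_adaptation_loss(c_s, c_t, r_s, r_t):
    """Squared center gap plus squared radius gap (assumed form)."""
    center_gap = sum((a - b) ** 2 for a, b in zip(c_s, c_t))
    radius_gap = (r_s - r_t) ** 2
    return center_gap + radius_gap


# Two clearly separated spheres versus two coincident ones.
apart = hypersphere_adaptation_loss([0.0, 0.0], [3.0, 4.0], 1.0, 2.0)
aligned = hypersphere_adaptation_loss([1.0, 1.0], [1.0, 1.0], 1.5, 1.5)
print(apart, aligned)
```

The penalty vanishes only when both the centers and the radii coincide, which matches the requirement that the two hyperspheres overlap geometrically.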

Moreover, considering that the discriminator $D$ does not converge easily when tackling data with a large distribution difference, we add an MMD regularizer in the fully-connected layer of $G$. Unlike the Kullback-Leibler divergence, MMD provides a non-parametric metric to measure the divergence of two distributions in a reproducing kernel Hilbert space (RKHS) with low computational cost [42]. Specifically, we map the original data into an RKHS via the function $\phi(\cdot)$ in $G$, and calculate the distance between the two domains. The corresponding loss function $\mathcal{L}_{MMD}$ is
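As a rough illustration of such a term, the sketch below computes the squared MMD under a linear kernel, which reduces to the squared distance between the mean source feature and the mean target feature. The kernel choice is an assumption made to keep the example short; richer kernels such as the Gaussian kernel are common in practice, and the function names are hypothetical.

```python
def mean_vector(features):
    """Column-wise mean of a list of equally sized feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def linear_mmd2(feat_s, feat_t):
    """Squared MMD with a linear kernel: ||mean(feat_s) - mean(feat_t)||^2."""
    mu_s, mu_t = mean_vector(feat_s), mean_vector(feat_t)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Domains with different mean features versus domains with equal means.
shifted = linear_mmd2([[0.0, 0.0], [2.0, 0.0]], [[1.0, 3.0], [1.0, 5.0]])
matched = linear_mmd2([[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [1.0, 2.0]])
print(shifted, matched)
```

The second case shows the limitation of the linear kernel: the two feature sets differ in spread but share a mean, so the value is zero. A characteristic kernel would still register that difference.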

Finally, the optimization function of DAAD is

where $\lambda$ and $\mu$ are the regularization parameters. As indicated in Fig. 2, solving (6) requires minimizing $\mathcal{L}_E$, $\mathcal{L}_{C,R}$, and $\mathcal{L}_{MMD}$ and maximizing $\mathcal{L}_D$ alternately. With the optimal $W_f$, this process is devoted to obtaining the best one-class classification performance on the two domains while pushing the data distribution and geometric shape of the two domains to be identical. As (6) is built on the form of the one-class hypersphere, we name the optimization process of this problem the deep hypersphere adversarial training mechanism.

In this way, the one-class detection rule can be transferred from the source domain to the target domain via the hypersphere adaptation. Although the baseline framework is DANN and the domain adaptation is still realized by seeking a domain-invariant feature representation, the common features are extracted under the direction of hypersphere adaptation. Through the overlapped hyperspheres in the feature space, the one-class detection rule can then be transferred. The target domain detector can then be updated by means of this rule to achieve better anomaly detection performance with an insufficient number of representative data.

B. Optimization

To solve (6), we adopt an alternating optimization method to optimize $W_f$, $W_d$ and $R_S$, $R_T$ alternately. First, fix $R_S$, $R_T$ and optimize $W_f$, $W_d$; second, fix the updated $W_f$, $W_d$ and optimize $R_S$, $R_T$. Starting from the initialization of $W_f$, $W_d$ and $R_S$, $R_T$, these two steps run iteratively until convergence. Since (6) is a min-max problem, it needs to be decomposed as

Specifically, the four variables can be optimized by using a stochastic gradient descent (SGD) algorithm, as
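The update rules themselves do not survive in this copy. Under the usual gradient descent/ascent treatment of a min-max objective, and writing $\mathcal{L}$ for the objective in (6) and $\eta$ for a learning rate (the symbol $\eta$ and the exact grouping of terms are assumptions), the two alternating steps would take roughly the following form:

```latex
\begin{aligned}
\text{Step 1 (fix } R_S, R_T\text{):}\quad
  & W_f \leftarrow W_f - \eta\,\frac{\partial \mathcal{L}}{\partial W_f}, \qquad
    W_d \leftarrow W_d + \eta\,\frac{\partial \mathcal{L}_D}{\partial W_d},\\
\text{Step 2 (fix } W_f, W_d\text{):}\quad
  & R_S \leftarrow R_S - \eta\,\frac{\partial \mathcal{L}}{\partial R_S}, \qquad
    R_T \leftarrow R_T - \eta\,\frac{\partial \mathcal{L}}{\partial R_T}.
\end{aligned}
```

The ascent step on $W_d$ reflects the statement that maximizing (3) makes $D$ more discriminative, while the descent steps on $W_f$, $R_S$, and $R_T$ reflect the minimization of the remaining terms.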

C. Anomaly Detection

For a test sample $x_{test}$, the way to determine whether it is an anomaly is similar to SVDD. Here we define the anomaly score by calculating the distance from $x_{test}$ to the hypersphere center of the target domain, as follows:

$$Score(x_{test})=\|\phi(x_{test};W_f)-C_T\|^2-R_T^2 \qquad (9)$$

where $\phi(\cdot)$ is the mapping function in $G$. $Score(x_{test})<0$ means that the sample $x_{test}$ lies inside the hypersphere, indicating that it is a normal sample. Otherwise, the sample $x_{test}$ will be judged as an anomaly. This process depends only on $W_f$, $C_T$, $R_T$, and is straightforward with low computational cost. Therefore, DAAD is well suited to online anomaly detection with streaming data.
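The decision rule can be sketched as follows, assuming the score is the squared distance to the target center minus the squared target radius, as in Deep SVDD [21]. The inputs are taken to be features already produced by the encoder, and the function names are hypothetical.

```python
def anomaly_score(feature, center_t, radius_t):
    """Squared distance to the target center minus the squared radius."""
    d2 = sum((f - c) ** 2 for f, c in zip(feature, center_t))
    return d2 - radius_t ** 2


def is_anomaly(feature, center_t, radius_t):
    """A negative score means inside the hypersphere, i.e. normal."""
    return anomaly_score(feature, center_t, radius_t) >= 0.0


# Hypothetical target hypersphere: center at the origin, radius 2.
center_t, radius_t = [0.0, 0.0], 2.0
score_in = anomaly_score([1.0, 1.0], center_t, radius_t)
score_out = anomaly_score([3.0, 0.0], center_t, radius_t)
print(score_in, is_anomaly([1.0, 1.0], center_t, radius_t))
print(score_out, is_anomaly([3.0, 0.0], center_t, radius_t))
```

Each test sample needs one forward pass through the encoder and one distance computation, which is why the rule is cheap enough for streaming data.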

IV. RELIABILITY ANALYSIS

In this section, we theoretically analyze the reliability of the DAAD model, since an improperly formulated network or parameter will cause an uninformative solution, thus leading to invalid or negative transfer. Because the transition of the detection rule in DAAD relies on the form of the hypersphere, the hypersphere radius may shrink to zero if the feature extractor $G$ realizes a constant mapping function, a phenomenon vividly named "hypersphere collapse" [21]. The DAAD model may converge to two completely overlapped hyperspheres with zero network weights, even with normal parameter settings. Zero weights are a feasible solution for realizing hypersphere adaptation, but they are meaningless. We therefore borrow the concept of "hypersphere collapse" from [21] and extend it to the case of one-class classification with transfer learning. Moreover, the hypersphere adversarial training mechanism does not work if extreme regularization parameters are set. Focusing on these two typical cases and inspired by [21], we have the following propositions.

Proposition 1 (Invalid transfer with all-zero weights): If any network weight $W\in\{W_f,W_d\}$ is zero, DAAD maps the source domain data $X_S$ and target domain data $X_T$ into the same outputs. Consequently, the optimal solution of DAAD has the hypersphere radii $R_S=R_T=0$, which implies invalid transfer.

Proof: For every configuration of $W_f$ and $W_d$, the loss of DAAD satisfies $L_{loss}\ge 0$. Since the feature extractor $G$ is domain-shared, the weight $W_f$ and bias term (if any) are shared by $X_S$ and $X_T$. If $W_f$ is zero, the output of $G$ is constant, i.e., $\phi(x_S;W_f)=\phi(x_T;W_f)=c_0\in\mathcal{F}$ for any $x_S\in X_S$ and $x_T\in X_T$, and the hypersphere center becomes the constant point $c_0$. In this case, with any $W\in\{W_f,W_d\}$ being zero, the loss of DAAD remains zero, i.e., $L_{loss}=0$. Then we have the hypersphere radii $R_S=R_T=0$. The hypersphere therefore cannot transfer the detection rule, and invalid transfer is inevitable.

In Proposition 1, DAAD finally obtains a trivial solution that causes hypersphere collapse with the radius equal to zero. The zero network weight $W$ would cause negative transfer with hypersphere collapse. To prevent this feature deterioration, Proposition 1 implies the following strategy: keep the weight $W$ of the feature extractor $G$ non-zero, and adopt a mini-batch SGD algorithm to train the DAAD model. Because $W$ is non-zero, the output of the feature extractor remains non-zero. Since the data in each batch differ from those in other batches, the features of each batch keep varying across iterations, i.e., $\phi(x_S;W_f)\ne\phi(x_T;W_f)$. As the hypersphere centers $C_S$, $C_T$ in each data batch can be calculated as the mean of the source domain features and target domain features, respectively, we can guarantee that $C_S\ne c_0$ and $C_T\ne c_0$. It is found empirically that this strategy can effectively avoid invalid transfer and make DAAD converge faster.
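The collapse described in Proposition 1, and the batch-mean center used to counter it, can be illustrated with a small experiment. The single-layer bias-free encoder, the ReLU activation, the use of the largest distance to the batch mean as a stand-in for the radius, and all names below are assumptions made for the sketch.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def encode(x, weights):
    """Bias-free encoder: sigma(w_l . z) applied layer by layer."""
    z = x
    for w in weights:
        z = relu(matvec(w, z))
    return z


def batch_center(features):
    """Hypersphere center taken as the mean feature of the batch."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def enclosing_radius(features, center):
    """Smallest radius around the center that covers every feature."""
    return max(math.sqrt(sum((f - c) ** 2 for f, c in zip(z, center)))
               for z in features)


batch = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
zero_w = [[[0.0, 0.0], [0.0, 0.0]]]   # all-zero weights
live_w = [[[1.0, 0.0], [0.0, 1.0]]]   # non-zero (identity) weights

collapsed = [encode(x, zero_w) for x in batch]
alive = [encode(x, live_w) for x in batch]
r_collapsed = enclosing_radius(collapsed, batch_center(collapsed))
r_alive = enclosing_radius(alive, batch_center(alive))
print(r_collapsed, r_alive)
```

With zero weights every sample lands on the same point and the enclosing radius is zero, so the sphere carries no detection rule. With non-zero weights the features differ within the batch and the radius stays positive.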

Proposition 2 (Invalid transfer with bias term): In DAAD, there are the network mappings φ(x_S; W_f) and φ(x_T; W_f) for any x_S ∈ X_S and x_T ∈ X_T. If the feature extractor G has the bias terms B = {b_1, ..., b_L} for L hidden layers, DAAD may learn a constant solution c, i.e., φ(x_S; W_f, B) = φ(x_T; W_f, B) = c, which causes hypersphere collapse, i.e., R_S = R_T = 0.

Proof: For any input data x, the output of the layer l ∈ {1, ..., L} with the weight w_l ∈ W_f and bias term b_l ∈ B can be expressed as

z_l(x) = σ(w_l · z_{l−1}(x) + b_l)

where σ(·) is the activation function and z_{l−1}(x) is the output of the previous layer. Once w_l = 0, the output becomes z_l(x) = σ(b_l). In this case, the network weights can be optimized to W_f = 0 so that φ(x_S; W_f, B) = φ(x_T; W_f, B) = c, where c depends only on the bias term B. This also means that B can be chosen to keep c a constant vector for the whole network G, which yields the hypersphere radii R_S = R_T = 0.

Proposition 2 implies that the existence of bias terms can push DAAD to learn a constant mapping function with all-zero weights. Any input data from the source domain and the target domain would then be mapped onto the same hypersphere center. Obviously, the detection rule is unavailable even if the loss of DAAD in (6) reaches its minimum, and invalid transfer will occur.
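The role of the bias term can be checked numerically. The sketch below is illustrative only: it assumes a two-layer fully connected extractor with ReLU activations, all weights set to zero, and arbitrary bias values. The helper names `layer` and `forward` are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def layer(z_prev, W, b):
    """One layer: activation of (W z_prev + b)."""
    return relu([sum(w * z for w, z in zip(row, z_prev)) + bi
                 for row, bi in zip(W, b)])

def forward(x, weights, biases):
    """Apply the layers in order, starting from the raw input x."""
    z = x
    for W, b in zip(weights, biases):
        z = layer(z, W, b)
    return z

ZERO = [[0.0, 0.0], [0.0, 0.0]]
weights = [ZERO, ZERO]               # all-zero weights, as in Proposition 2
biases = [[0.3, -0.1], [1.0, 0.5]]   # arbitrary bias values (assumption)

x_source = [1.0, 2.0]    # made-up source sample
x_target = [5.0, -1.0]   # made-up target sample

# Both inputs are mapped to the same constant vector, which depends
# only on the bias of the last layer.
print(forward(x_source, weights, biases))
print(forward(x_target, weights, biases))
```

Because the output no longer depends on the input, a hypersphere of radius zero around that constant encloses all data from both domains.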

Proposition 3 (Invalid and negative transfer with extreme parameter setting): In DAAD, the penalty factor ν ∈ (0, +∞) in (2) controls the effect of transferring the detection rule and directly determines the performance boundary of DAAD. If the value of ν approaches 0, DAAD will suffer invalid transfer. If ν is set to +∞, negative transfer may occur.

If the value of ν_T approaches 0, the minimization of L_E forces the corresponding term to decay towards zero (but not exactly to zero). In this case, the value of ν approaches +∞, and the loss of DAAD relies mainly on the source domain data. If the source domain data and the target domain data have a large distribution difference, negative transfer will happen. In contrast, if the value of ν_S approaches 0, the minimization of L_E likewise forces the corresponding term to decay towards zero. In this case, the value of ν approaches 0. Since DAAD then uses almost only the target domain data for training, invalid transfer will occur.

Besides ν, another key parameter is λ in (6), which controls the degree of overlap (or deviation) between the two hyperspheres. Unlike ν, which controls the volumes of the source hypersphere and the target hypersphere, λ merely affects the degree to which the detection rule is transferred, not the expression of the rule itself. Too large a value of λ pushes the two hyperspheres to be as close and overlapped as possible. Too small a value of λ tolerates a certain deviation in hypersphere adaptation. Neither case, however, constitutes invalid transfer or negative transfer.

V.EXPERIMENTS

In this section, we run comparative experiments on two typical anomaly detection problems: image recognition anomaly detection [20] and online early fault detection of rolling bearings [43]. These two problems pose quite different challenges. Image recognition anomaly detection, which includes handwritten character anomaly detection and object recognition anomaly detection, uses image datasets and serves as a benchmark experiment for a straightforward visual comparison. Early fault detection of rolling bearings uses vibration signals and validates the effectiveness of the proposed algorithm on a practical engineering problem. For handwritten character anomaly detection, we choose two widely used datasets, MNIST and USPS [40], for testing. The images in MNIST are relatively easy to recognize, whereas the images in USPS are small-scale and each character image is too vague to be recognized easily. Because the MNIST and USPS datasets are relatively simple, a network structure with only a few layers is sufficient. For object recognition anomaly detection, we choose a prevailing and challenging benchmark, the Office-Home dataset [44], to evaluate the transfer effect of DAAD with a deeper structure. Please note that the choice of the encoder in DAAD mainly depends on the characteristics and data complexity of the application. Detection accuracy can then be calculated on these image datasets to quantify the comparison. For online early fault detection of rolling bearings, we adopt two widely used bearing datasets, the IEEE PHM Challenge 2012 dataset [45] and the XJTU-SY dataset [46], to verify the effectiveness of DAAD. Both datasets contain whole-life degradation signals generated by run-to-failure experiments. Since the specific location of fault occurrence cannot be determined, DAAD aims not only to find the early detection location but also to reduce the false alarm rate.

The programming environment is Python 3.7 running on Windows with an RTX 3090 graphics card, an i7-12700K processor and 32 GB of memory.

A.Results of Handwritten Character Detection

1) Competing Methods: In this section, we introduce ten anomaly detection methods for comparison, of which six use non-transfer learning and four use deep transfer learning. These ten methods include:

a) Shallow models: OC-SVM [10] and SVDD [18] are typical one-class anomaly detection algorithms with the kernel trick. A Gaussian kernel is used with the width parameter 0.1. iForest [11] is an unsupervised anomaly detection algorithm based on data cutting. The number of trees is set to 100, and each tree randomly selects 256 samples for training. For a fair comparison, OC-SVM, iForest and SVDD adopt the same deep convolutional autoencoder (DCAE) features as DAAD.

b) Deep models without transfer learning: Deep SVDD [21] is the state-of-the-art (SOTA) deep one-class classification algorithm. It replaces the kernel mapping in the traditional SVDD with a deep neural network to achieve end-to-end anomaly detection. The penalty coefficient ν in deep SVDD is set to 0.01. In this experiment, we build two baseline algorithms based on deep SVDD. One is the classical deep SVDD trained with the data of only one domain, named deep SVDD (1). The other is deep SVDD trained with all data from the source domain and the target domain (as a single-domain input), named deep SVDD (2). We also introduce a CNN-based self-supervised learning method for anomaly detection [36], named SSL. This method utilizes data augmentation and contrastive learning to extract discriminative information, and can be viewed as a SOTA method for one-class classification with small-scale data.

c) Deep models with transfer learning: We also introduce four SOTA deep transfer learning-based anomaly detection methods for comparison. The first is IRAD [40], which first utilizes a cross-domain encoder trained by adversarial learning to extract a domain-invariant representation and then trains an anomaly detector on this representation. Since IRAD also adopts an adversarial learning strategy, it can be viewed as the work most closely related to DAAD. The other three methods utilize an autoencoder [34], knowledge distillation [35], and SSL [36] to pre-train a network and then fine-tune deep SVDD on top of the pre-trained network. To stay in line with DAAD, we replace the variational autoencoder with DCAE in the autoencoder-based method [34]. For short, we name them DCAE pre-train, KD pre-train and SSL pre-train, respectively. We think these methods can provide a comprehensive comparison.

2) Experimental Settings: The purpose of this experiment is to verify the effectiveness of DAAD in one-class transfer learning. Both the MNIST and USPS datasets are composed of 10 classes. MNIST is composed of 70 000 pictures with the size of 28×28 pixels, while USPS is composed of 20 000 pictures with the size of 16×16 pixels.

For DAAD, a deep convolutional autoencoder (DCAE) is employed as the feature extractor G in Fig.2, since DCAE is suitable for extracting image features and has a better capability of noise reduction. Specifically, DCAE consists of three convolutional modules (Conv), whose sizes are 32×(3×3×1), 32×(3×3×1) and 16×(3×3×1), followed by two fully-connected (FC) layers. Each convolutional module consists of 32 (or 16) convolutional filters with the size 3×3 and one channel. The step size of maximum pooling (Max_pool) is set to 2, the activation function is ReLU (rectified linear unit), and batch normalization (BN) is employed. The structure and algorithm settings are listed in Table I.

We set up 10 sets of experiments, named MNIST～USPS for simplicity. Each set uses one digit from 0–9 as the object. The MNIST dataset is set as the source domain with 5000 pictures of each character. These pictures all have a clear character shape and are easy to classify correctly. The USPS dataset is set as the target domain with 500 blurred and vague pictures of each character. Such pictures are not easy to recognize correctly via one-class classification. In this experiment, we first choose one character (e.g., the number 0) as the normal class, and feed the pictures of this character in the source domain and the target domain into DAAD for training. We then choose some new pictures of the same character in the target domain as the test data. To simulate anomalies, we add pictures of another character (e.g., the number 1) to the test data.

In this experiment, AUC (area under curve) is adopted as the evaluation metric. For the six deep learning-based methods and DAAD, which are all randomly initialized, we repeat the experiment 10 times and take the average AUC as the final result.
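The evaluation protocol above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: AUC is computed with the rank-based (Mann-Whitney) formulation, a higher score is assumed to mean "more anomalous", and the scores are made up. The function names `auc` and `mean_auc` are hypothetical.

```python
def auc(scores_normal, scores_anomalous):
    """Probability that a random anomaly scores higher than a random
    normal sample; ties count as one half."""
    wins = 0.0
    for a in scores_anomalous:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(scores_anomalous) * len(scores_normal))

def mean_auc(runs):
    """Average AUC over repeated runs; each run is a pair
    (scores_normal, scores_anomalous)."""
    return sum(auc(n, a) for n, a in runs) / len(runs)

# Two made-up runs standing in for repeated random initializations.
runs = [
    ([0.1, 0.2, 0.3], [0.25, 0.9]),  # one anomaly falls below a normal score
    ([0.1, 0.2, 0.3], [0.8, 0.9]),   # perfectly separated
]
print(mean_auc(runs))
```

Averaging over repeated runs reduces the influence of a single lucky or unlucky random initialization on the reported value.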

TABLE I STRUCTURE AND ALGORITHM SETTINGS OF DAAD

3) Experimental Results: The comparative results are listed in Table II. It is clear that the AUC values of DAAD are much larger than those of the other algorithms. This results from the effective transition of the detection rules obtained from a large number of high-quality images in the MNIST dataset. The four deep transfer learning algorithms outperform the non-transfer learning methods, which shows the necessity of transfer learning. SSL does not always achieve better results than the other non-transfer learning methods. The reason is that the low-quality images in the USPS dataset cannot provide effective supervised information for model training. IRAD also gets lower accuracy than the two deep SVDD methods on the classes 3, 6 and 8. We speculate that this is caused by the adversarial training procedure, which is less robust when representative samples are insufficient and overtraining occurs. We also observe that deep SVDD (2) does not obtain stable results, which indicates that the decision boundaries built with all training samples are not distinct enough from anomalies. To support this point, we visualize the feature distribution of the target domain data by deep SVDD (2), SSL pre-train, IRAD, and DAAD, as shown in Fig.3. It is clear that the features extracted by DAAD are more separable than those extracted by the other comparison methods.

To visually demonstrate the detection effect, Fig.4 shows five groups of detection results by DAAD, deep SVDD (1) and IRAD. The recognition error of DAAD is significantly lower than that of deep SVDD (1), which again shows the effectiveness of one-class transfer learning for anomaly detection. It is worth noting that, even though the transfer learning strategy is adopted, DAAD still has some misclassification cases in Fig.4(c). For instance, the character 3 is recognized as 2, and the character 9 is recognized as 8. Also, the character 2 is misjudged as an anomaly. The reasons are as follows: 1) the shapes of characters 2 and 3 are relatively close, resulting in a decrease of detection accuracy; 2) in the USPS dataset, some images have low quality (see the last row in Fig.4), which also reduces the detection accuracy. Nevertheless, compared with deep SVDD (1) and IRAD, DAAD can recognize the characters with irregular shapes to a considerable extent, which shows that the classifier in the target domain has effectively learned the detection rules extracted from the clear character images.

Moreover, the random initialization of the network weights may affect the detection accuracy and convergence of DAAD. Fig.5 shows the box plot of the AUC values listed in Table II. For each experiment, the value of AUC always stays within a small fluctuation range. This indicates that the random initialization of the network weights has very little effect on DAAD, so the DAAD model has good stability.

From Proposition 3, the penalty factor ν and the regularization parameter λ both play a vital role in hypersphere adversarial training. We select the optimal parameters via a simple grid search with cross validation. Fig.6 visualizes the performance of DAAD with different values of these two parameters. Besides AUC, we also adopt the hypersphere adaptation loss L_{C,R} in (4), which directly indicates the effect of transferring the one-class detection rule. The hypersphere adaptation losses with different values of ν all decrease quickly, while the AUC increases significantly and converges around the 200th iteration. Moreover, we find that the AUC is relatively sensitive to the value of λ. Too large or too small a value of λ leads to a drastic decrease of AUC. Empirically, we choose ν = 1 and λ = 0.5 as the optimal parameters in our experiment. These results also confirm the effect of the hypersphere adversarial training mechanism.

B.Results of Object Recognition Anomaly Detection

1) Experimental Settings: We further evaluate the performance of DAAD with a deeper network structure. Since the characters in the MNIST and USPS datasets are relatively simple, we introduce a more complex image dataset, the Office-Home object recognition dataset [44], for evaluation. The images in this dataset are in color and larger in size, and the objects all come from the real world and are more diverse than the characters in the MNIST and USPS datasets. We believe this dataset can provide a more comprehensive evaluation of DAAD. Specifically, we again conduct 10 sets of anomaly detection experiments with “Produce” as the source domain and “Clip” as the target domain, as shown in Fig.7.

For each set of experiments, the same class in “Produce” and “Clip” is chosen as the normal data, and the remaining classes are randomly selected as anomalies for detection. For each normal class in “Produce”, 200 images are generated through data augmentation as source domain data, while 20 images in “Clip” are chosen as target domain data. Also, 20 images from the remaining classes are chosen as anomalies. Unlike the experimental setting in the previous section, a deeper neural network, ResNet-50, which has 50 layers with a residual mapping structure, is employed as the encoder of DAAD. For a fair comparison, all deep SVDD models and IRAD also employ ResNet-50 as the encoder.

2) Experimental Results: Table III presents the average AUC results of the 10 sets of experiments, where the first column gives the normal class in each experiment. With these real-world objects, DAAD still obtains the highest AUC value on eight of the ten classes, which shows that DAAD can effectively solve a more challenging transfer task by extending the encoder to a deeper structure.

TABLE II COMPARATIVE AUC VALUES BY THE TOTAL 11 METHODS FOR THE 10 SETS OF EXPERIMENTS FROM MNIST～USPS (%)

Fig.3.Feature distribution of the target domain data for the second experiment in Table II.

Fig.8 provides the results of the ablation experiments that remove L_D, L_{C,R} and L_MMD, given in (3)–(5), from DAAD. From Fig.8(a), it is clear that the extracted features of the normal class and the abnormal class are nearly distinguishable. In each ablation experiment, however, a certain degree of overlap always exists between the two classes, as shown in Figs.8(b)–8(d), so that no discriminative decision boundary can be found. In terms of numerical results, the AUC values of all ablation experiments decrease by about 10%. The visual comparison verifies the key role of L_D, L_{C,R} and L_MMD in DAAD.
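To make the MMD term in the ablation more concrete, the following sketch computes the standard biased estimate of the squared maximum mean discrepancy between two feature sets. It is only an illustration: the exact form of (5), the kernel and its bandwidth are not given in this section, so the Gaussian kernel, the bandwidth of 1.0 and the toy feature values are assumptions, and the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(A, B, bandwidth):
    """Average kernel value over all pairs (a, b) with a in A, b in B."""
    total = sum(gaussian_kernel(a, b, bandwidth) for a in A for b in B)
    return total / (len(A) * len(B))

def mmd2(X, Y, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(X, X, bandwidth)
            + mean_kernel(Y, Y, bandwidth)
            - 2.0 * mean_kernel(X, Y, bandwidth))

# Made-up source and target features with a large distribution shift.
F_S = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
F_T = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]

print(mmd2(F_S, F_S))  # identical sets give zero discrepancy
print(mmd2(F_S, F_T))  # shifted sets give a large discrepancy
```

Minimizing such a quantity over the extractor parameters pulls the two feature distributions together, which is the usual motivation for an MMD-based adaptation term.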

We further evaluate the influence of the sample number on the performance of DAAD, as shown in Fig.9. DAAD gets a much higher AUC value than the other methods when fewer target domain samples are available. Deep SVDD (1) starts to outperform DAAD only with 150 samples or more. This phenomenon is not surprising, because sufficient target domain data can provide enough domain knowledge for anomaly detection. Deep SVDD (2), however, is always inferior to the other two methods because of the distribution discrepancy between the data from the two domains. In addition, DAAD performs better than the other transfer anomaly detection methods as the number of training samples increases. This comparison again verifies the promising capability of one-class transfer learning by hypersphere adversarial training.

Fig.4.Five groups of detection results in Table II.In each subfigure,the left column is the detected normal character,and the right column is the detected anomalous character.It is clear that the number of pictures incorrectly recognized by DAAD is much smaller than the number by deep SVDD(1) and IRAD.

Fig.5.Box plot of the AUC values obtained by DAAD in Table II.

Fig.6.Performance of DAAD with different parameter settings on MNIST～USPS ((a) and (b) are the change of hypersphere adversarial loss (in logarithmic form) and AUC with different ν values;(c) is the change of AUC with different λ values).

Fig.7.Examples of the 10 classes from “Produce”and “Clip”in the Office-Home dataset [44].

TABLE III COMPARATIVE AUC VALUES BY THE TOTAL 11 METHODS FOR THE 10 SETS OF EXPERIMENTS FROM OFFICE-HOME (%)

C.Results of Bearing Early Fault Detection

As a typical rotating element, a rolling bearing generally goes through three running states in a whole-life degradation process: normal state, early fault state and fast degradation. Early fault detection of rolling bearings aims to distinguish the early fault state from the normal state, which is a typical one-class anomaly detection problem. For online early fault detection in non-stop scenarios, one can build a one-class anomaly detection model based on the initial normal state data. However, the online data are usually insufficient. In addition, due to running-in and lubrication, some normal state data may be recognized as early fault data, i.e., false alarms. Moreover, the online working condition tends to differ from the offline working condition, i.e., the phenomenon of working condition drift [12]. In this case, the distributions of the online data and the offline data are not identical, even if the bearings have the same manufacturing specifications. Since this violates the i.i.d. precondition, the offline data are not directly applicable to training an online model. Although various transfer learning techniques have demonstrated their effectiveness in online early fault detection [12], [33], they are generally not robust enough to eliminate false alarms. It is better to transfer the anomaly detection rules from the offline data, i.e., one-class transfer learning, than to model directly on the offline data. The schematic diagram of the problem is illustrated in Fig.10.

Fig.8.Feature distribution of the target domain data on the Office-Home dataset ((a) is from the full DAAD model; (b)–(d) are the ablations of L_D, L_{C,R} and L_MMD in DAAD, respectively. The term “w/o” means “without”. Principal component analysis (PCA) is utilized here for visualization).

Fig.9.Change of AUC (%) by different methods with the increase of target domain samples from the Office-Home dataset.

1) Experimental Settings: Here we briefly introduce the two datasets. The IEEE PHM Challenge 2012 dataset was collected on the PRONOSTIA test platform, shown in Fig.11(a), and provides the data of 17 bearings under three working conditions. The three conditions have the engine speeds 1800 rpm, 1650 rpm and 1500 rpm and the loads 4000 N, 4200 N and 5000 N, respectively. The XJTU-SY dataset is provided by the Institute of Design Science and Basic Component at Xi’an Jiaotong University (XJTU), China and the Changxing Sumyoung Technology Co., Ltd. (SY), China. The test platform is shown in Fig.11(b) and includes three working conditions, each of which contains 5 bearings. The three conditions have the engine speeds 2100 rpm, 2250 rpm and 2400 rpm and the loads 12 kN, 11 kN and 10 kN, respectively.

As stated before, the different working conditions in each dataset lead to different distributions of the monitoring data. To simulate online detection, we set the first working condition of each dataset as the source domain (offline data), and choose one bearing (the target bearing) in the second working condition as the target domain (online data). Specifically, we select the first 500 normal state samples from each bearing under the first working condition to construct a one-class classifier. Please note that [43] has shown that the first 500 samples of all bearings in these two datasets are in the normal state. We also select the first 100 samples of the target bearing (see Table IV) under the second working condition to construct a one-class classifier for early fault detection. To demonstrate the necessity of transfer learning, we provide the monitoring signals of the source domain and the target domain in Fig.12. The distinction between the normal state and the early fault state in the source domain is visually clear, which means that the one-class detection rule can be effectively learned. However, there are irregular fluctuations in the normal state data of the target domain, so using only the first 100 samples to construct a one-class classifier would not work well. Transferring the detection rule from the source domain is therefore expected to help improve the detection accuracy and reduce false alarms. The specific experimental settings of the two datasets are shown in Table IV.

2) Results of One-Class Transfer Learning: We first present the effect of DAAD on the IEEE PHM Challenge 2012 dataset. Fig.13 shows the feature distribution of the training data before and after running DAAD. It is clear that the original data of the source domain and the target domain have a large distribution difference. This indicates that the hypersphere trained with the source domain data cannot be directly applied to the target domain data. Through the domain adaptation of DAAD, the feature distributions tend to become identical, and the probability density function (PDF) curves become aligned. The identical data distribution helps reduce the hypersphere adaptation loss and facilitates the transition of the detection rules.

Based on the common features obtained in Fig.13, we sequentially feed the online data of the target bearing into DAAD to determine whether an early fault occurs. Fig.14 shows the effectiveness of DAAD in the transition of the detection rule. Before transfer learning, we run deep SVDD (1) to build two detection models with the source domain data and the target domain data individually. Obviously, the two hyperspheres are not consistent. After running DAAD, the hyperspheres of the source domain and the target domain tend to overlap, indicating that the detection rules have been effectively transferred. For the newly arrived samples, there is no obvious boundary between the normal data and the faulty data before transfer learning. Some pink dots even appear in the hypersphere of the target domain, leading to false alarms. Through the domain adaptation of DAAD, most of the normal samples are correctly recognized (i.e., they are included in the hypersphere), while the separability of the faulty samples is significantly improved. This again verifies the performance of DAAD in the transition of one-class detection rules.

Fig.15 shows the detailed detection results. To facilitate understanding, Fig.15 also shows the feature distribution of the online data, using the features illustrated in Fig.13. In Fig.15(a), a clear overlap between the normal samples and the faulty samples exists, indicating that the features are not sensitive to the early fault before the transfer. As shown in Fig.15(b), the DAAD features have better discriminative capability. This can be attributed to the effective transition of the one-class detection rule by DAAD, which improves the feature sensitivity to faulty samples. The effect is also reflected in Fig.15(c), where the anomaly score (see (14)) jumps significantly at around the 159th sample. Obviously, DAAD can identify the early fault more accurately and also effectively avoid false alarms.

Fig.10.Schematic diagram of one-class transfer learning for online early fault detection of rolling bearings.The distinction between normal state and early fault state in the source domain is clear while data fluctuation in the target domain exists.The transfer of one-class detection rules across different working conditions will help to improve the detection effect of target bearing.

Fig.11.Test platform.

The detection results on the XJTU-SY dataset are similar to those on the IEEE PHM Challenge 2012 dataset. Due to space limitations, the intermediate results are not given here. Fig.16 provides the online feature distribution and the detection results of deep SVDD (1) and DAAD for comparison. Compared with the feature distribution before the transfer, the distinction between the normal state samples and the early fault samples is significantly improved after the transfer, with fewer false alarms. However, unlike in Fig.15, the detection results in Fig.16 are not improved very much. The main reason is that the online normal state data have a stable trend with no large noise interference. Therefore, direct modeling by deep SVDD (1) using the online data can achieve better detection results, but a number of false alarms remain. Conversely, DAAD can recognize the early fault well with almost zero false alarms. The alarm location of the early fault by DAAD is around the 632nd sample.

3) Comparative Experiments: Besides the comparative methods used in Section V-A, we introduce 6 SOTA early fault detection methods and 8 typical anomaly detection methods for comparison. We now have 21 representative methods for comparison, as listed in Table V. These methods include not only signal analysis-based and anomaly detection methods but also early fault detection methods with and without deep transfer learning. We think these 21 methods can provide a comprehensive comparison. Since Methods 8–13 have been introduced in Section V-A, here we mainly give a brief summary of the newly added methods.

Method 1 is the SOTA early fault detection method based on signal analysis.It mainly uses Bandwidth EMD to decompose the signal and compare it with the fault characteristic frequency.

Methods 2–7 are the combinations of three typical anomaly detection algorithms (LOF, SVDD and iForest) and two features (kurtosis and stacked denoising autoencoder (SDAE)). Kurtosis is an effective time-domain statistical feature for early faults.
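As an illustration of why kurtosis is sensitive to early faults, the sketch below computes the (non-excess) kurtosis of a signal snapshot as the fourth central moment divided by the squared variance. Whether the compared methods use this form or the excess form (minus 3) is not stated here, so the choice is an assumption, and the two toy signals are made up.

```python
def kurtosis(signal):
    """Fourth central moment divided by the squared variance."""
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n  # variance
    m4 = sum((x - mean) ** 4 for x in signal) / n  # fourth central moment
    return m4 / (m2 * m2)

# A smooth alternating signal versus one containing a single sharp impulse,
# standing in for a healthy snapshot and a snapshot with an incipient defect.
smooth = [1.0, -1.0] * 4
impulsive = [0.0] * 9 + [10.0]

print(kurtosis(smooth))     # low kurtosis for the regular signal
print(kurtosis(impulsive))  # much higher kurtosis once an impulse appears
```

Impulses produced by a local defect inflate the fourth moment far more than the variance, which is why the ratio rises sharply.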

Methods 15 and 16 have been discussed in Section II and can be viewed as the SOTA anomaly detection methods with deep domain adaptation. In the implementation, Method 15 adopts the SSL network [36] to generate pre-trained features in order to adapt to the target distribution, and the regularization parameter λ of elastic weight consolidation is set to 100. For a fair comparison, Method 16, which also takes an adversarial training strategy to obtain the domain-invariant feature representation, keeps the same encoder structure as DAAD.

Methods 17 and 18 are the SOTA early fault detection methods with non-transfer deep learning.Method 17 uses a sliding window to match the SDAE feature of early fault data with the sequentially-collected data block.Method 18 uses LSTM to predict the trend of online data,and realizes online detection according to the degree of deviation.

TABLE IV EXPERIMENTAL SETTINGS OF THE IEEE PHM CHALLENGE 2012 DATASET AND XJTU-SY DATASET

Fig.12.Monitoring signals in the IEEE PHM Challenge 2012 dataset.The source domain contains the seven bearings under the first working condition,and the target domain has the first bearing (Bearing2_1) under the second working condition.The left column is the raw vibration signal and the right column is the corresponding RMS curve.

Fig.13.Data distribution characteristics of the source domain and target domain in the IEEE PHM Challenge 2012 dataset ((a) and (b) are before transfer learning;(c) and (d) are after running DAAD.The left column is the scatter plot of the features,while the right column is the corresponding PDF curves).

Fig.14.Effect of transferring one-class detection rules in the IEEE PHM Challenge 2012 dataset ((a) and (b) are the features of the training data extracted by deep SVDD (1) and DAAD respectively;(c) and (d) are the detection effects on the online test data by deep SVDD (1) and DAAD respectively.It is clear that the features extracted by DAAD have enhanced separability between the normal samples and faulty samples of the target bearing).

Fig.15.Detection results on the target bearing in the IEEE PHM Challenge 2012 dataset ((a) and (b) are the feature distributions of online data extracted by deep SVDD (1) and DAAD respectively,and (c) is the corresponding detection results.The samples with the anomaly score below 0 are recognized as in normal state,otherwise in early fault state).

Fig.16.Detection results on the target bearing in the XJTU-SY dataset ((a) and (b) are the feature distributions of online data extracted by deep SVDD (1)and DAAD respectively,and (c) is the corresponding detection results).

TABLE V COMPARATIVE RESULTS OF TOTAL 22 METHODS ON THE IEEE PHM CHALLENGE 2012 BEARING DATASET AND XJTU-SY BEARING DATASET

Methods 19–21 are the SOTA early fault detection methods with deep transfer learning. Method 19 uses sparse dictionary coding to reconstruct errors, and calculates the K-nearest distance to achieve fault detection across working conditions. Method 20 is a typical method that employs a cross-dataset fine-tuning strategy to achieve deep transfer learning for early fault detection. Method 21 is a typical unsupervised deep transfer learning method based on SDAE features and MMD distance for bearing fault detection.

Following [49], we adopt two evaluation metrics in Table V: 1) detection location, which is the index of the signal snapshot where an early fault occurs, and 2) number of false alarms, which is the number of anomalies detected before the early fault occurs. A method with an earlier detection location and a smaller number of false alarms is better. Since multiple anomalies could be recognized, we need an alarm strategy to determine the specific detection location. We set the location with five consecutive anomalies as the final result. The detection results of DAAD and the 21 comparative methods are also listed in Table V. DAAD gets the smallest number of false alarms and almost the earliest detection location on the two datasets. Although Methods 2–7 obtain an earlier detection location than DAAD on the XJTU-SY dataset, they raise many false alarms and their detection results on the IEEE PHM Challenge 2012 dataset are poor. This is because the traditional anomaly detection methods only use the initial data of the target bearing for model training, so the detection results depend directly on the distribution characteristics of these data. Since the target bearing Bearing2_1 in the IEEE PHM Challenge 2012 dataset has large fluctuations in the normal state, the detection results are poor.
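The alarm strategy and the two metrics can be sketched as follows. This is a minimal illustration with several assumptions: a sample counts as anomalous when its anomaly score is not below 0 (the rule stated for Fig.15), the detection location is taken as the first index of the first run of five consecutive anomalies, and the score sequence is made up. The function names are hypothetical.

```python
def detection_location(scores, threshold=0.0, consecutive=5):
    """Index where the first run of `consecutive` anomalies starts,
    or None if no such run exists."""
    run = 0
    for i, s in enumerate(scores):
        if s >= threshold:          # score not below threshold: anomaly
            run += 1
            if run == consecutive:
                return i - consecutive + 1
        else:
            run = 0                 # run broken by a normal sample
    return None

def count_false_alarms(scores, location, threshold=0.0):
    """Number of anomalies flagged before the detection location."""
    end = len(scores) if location is None else location
    return sum(1 for s in scores[:end] if s >= threshold)

# Made-up anomaly scores: isolated spikes followed by a sustained fault.
scores = [-1.0, -1.0, 0.5, -1.0, -1.0, 0.2, 0.3, -1.0,
          1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

loc = detection_location(scores)
print(loc, count_false_alarms(scores, loc))
```

Requiring several consecutive anomalies keeps isolated spikes in the normal state from triggering the alarm, while those spikes are still counted as false alarms.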

Fig.17.Feature distribution and the corresponding detection results on the IEEE PHM Challenge 2012 dataset.

For a straightforward understanding, we also provide the feature distributions of DAAD, deep SVDD (2), DCAE pre-train, IRAD, PANDA and LogTAD. Here deep SVDD (2) utilizes the target domain data and the source domain data as a single input. Due to space limitations, we only provide the results of DCAE pre-train, which are very similar to the results of KD pre-train and SSL pre-train. For a better understanding, we also plot the detection results of DAAD, which are the same as in Fig.15(c). From Fig.17(a), the training data from the two domains share the same hypersphere boundary geometrically, but the faulty test samples cannot be separated well. This indicates that deep SVDD (2) is incapable of extracting discriminative features and shows again the necessity of one-class transfer learning. Although DCAE pre-train and PANDA conduct domain adaptation, the decision boundary between the normal samples and the faulty samples is still not clear, and some false alarms are raised. This means that the fluctuation in the normal state of the target domain data can make the extracted features less discriminative for the early fault. Although the two adversarial training-based methods, IRAD and LogTAD, get more discriminative features (see Figs.17(c) and 17(e)), the distinction is not large enough, and a small number of normal samples are mixed with faulty samples. LogTAD in particular recognizes the fault occurrence at the earliest location but still raises several false alarms, being less robust than DAAD. Several samples are located in the boundary area, and the distinction is not as clear as for DAAD, as shown in Fig.17(e). The comparison demonstrates that the adversarial learning strategy in IRAD and LogTAD is aimless. The features of the normal class are not guided to shrink towards the normal samples, which verifies the effectiveness of hypersphere adaptation. These results verify that transferring model parameters or performing domain adaptation alone cannot effectively improve the detection performance, especially the detection robustness, for data with unexpected noise interference. Conversely, as shown in Fig.17(f), DAAD can extract sufficiently discriminative features that are stable in the normal state but very sensitive to the early fault. This shows the effectiveness of the transition of the one-class detection rule.

In this comparison, the performance gap between DAAD, deep SVDD (2) and LogTAD is quite interesting, since the last two methods both build a hypersphere to enclose the data in the source domain and target domain. We speculate this is due to the hypersphere adaptation loss in (4). Superficially, the hypersphere adaptation merely serves as a constraint in DAAD. Actually, it acts like a conductor in an orchestra for seeking a domain-invariant feature representation. By enforcing the two domains' hyperspheres to be as identical as possible, this constraint provides a targeted direction of feature extraction, which outperforms most traditional domain adaptation techniques that learn features without purpose. As a result, the features extracted by DAAD can fit the normal class data in a more compact form than the features directly extracted from the data of the source domain and target domain together, i.e., the approach taken by deep SVDD (2) and LogTAD. Although LogTAD also adopts a hypersphere constraint plus an adversarial training strategy, it merely pushes the two domains' data to be close to the same hypersphere center, i.e., mapping both domains' data into the same hypersphere [41]. Once the target domain data are irregularly noisy and have outliers, the volume of the hypersphere will be enlarged. Then the detection rule in the source domain would not be well transferred to the target domain. That is why the features by deep SVDD (2) and LogTAD are less distinctive than the DAAD features, as seen in Figs. 17(a) and 17(e) vs. Fig. 17(f). Discriminative features certainly lead to a separable decision boundary with higher detection accuracy and a lower false alarm rate.
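The geometric argument above can be illustrated with a small numerical sketch. It is not the hypersphere adaptation loss in (4), whose exact form is not reproduced here. It assumes a simplified hypersphere (mean center, maximum-distance radius) fitted to synthetic 2-D features, and the helper names `fit_hypersphere`, `hypersphere_gap` and `shared_sphere_radius` are hypothetical. The sketch shows that a single sphere fitted to both domains together grows when the target data contain outliers, while a gap between per-domain spheres gives an explicit quantity that a constraint could drive toward zero.

```python
import numpy as np

def fit_hypersphere(features):
    """Assumed simplification: center = mean, radius = largest distance to center."""
    center = features.mean(axis=0)
    radius = np.linalg.norm(features - center, axis=1).max()
    return center, float(radius)

def hypersphere_gap(src_feats, tgt_feats):
    """Illustrative discrepancy between two per-domain spheres (not Eq. (4))."""
    c_s, r_s = fit_hypersphere(src_feats)
    c_t, r_t = fit_hypersphere(tgt_feats)
    return float(np.sum((c_s - c_t) ** 2) + (r_s - r_t) ** 2)

def shared_sphere_radius(src_feats, tgt_feats):
    """Radius of one sphere fitted to both domains pooled together."""
    _, radius = fit_hypersphere(np.vstack([src_feats, tgt_feats]))
    return radius

# Synthetic stand-ins for extracted features of normal samples.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 2))
target_clean = rng.normal(0.0, 1.0, size=(200, 2))
# Two artificial outliers emulate irregular noise in the target domain.
target_noisy = np.vstack([target_clean, [[8.0, 8.0], [-9.0, 7.0]]])

r_clean = shared_sphere_radius(source, target_clean)
r_noisy = shared_sphere_radius(source, target_noisy)
gap_same = hypersphere_gap(source, source)
gap_noisy = hypersphere_gap(source, target_noisy)

print(f"shared radius, clean target: {r_clean:.2f}")
print(f"shared radius, noisy target: {r_noisy:.2f}")
print(f"sphere gap, identical domains: {gap_same:.2f}")
print(f"sphere gap, noisy target: {gap_noisy:.2f}")
```

In this toy setting, the pooled sphere is inflated by the two outliers, which loosens the boundary for every sample, whereas the per-domain gap exposes the mismatch directly. In a trained network the features themselves would be updated, which this sketch does not attempt.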

Beyond Table V, we further provide the visual comparison between DAAD and the other SOTA methods in Fig. 18. Method 10 (SSL) is much inferior to DAAD, which indicates that noise interference would generate improper supervised information for anomaly detection. We also observe that Methods 11-13 perform worse than DAAD, which demonstrates again that transferring model parameters may extract less discriminative features. Although Methods 19-21 can obtain relatively good results, they still raise high numbers of false alarms. These methods also adopt a parameter-based transfer strategy, and the extracted features are not robust to the early fault. In contrast, DAAD can overcome such challenges by transferring the one-class detection rule from the source domain. Compared with the deep transfer learning methods, DAAD can significantly reduce false alarms while keeping a relatively early detection location. This is because the proposed hypersphere adversarial training mechanism in DAAD is beneficial to transferring the detection rule. As a result, the learned one-class detection rule from the source domain improves the detection accuracy and robustness on the target domain.
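To make the two quantities compared above concrete, the following minimal sketch computes a detection location and a false alarm count from a sequence of per-sample decisions, using the convention of Fig. 18 (1 for normal, 0 for anomalous). The alarm strategy used in the experiments is not specified in this section, so the rule here (an alarm is raised at the start of the first run of `consecutive` anomalous samples) is an assumption, and `detection_location`, `count_false_alarms` and the ground-truth index `fault_start` are hypothetical names.

```python
def detection_location(labels, consecutive=3):
    """Index where the first run of `consecutive` anomalous labels (0) starts.

    Assumed alarm rule for illustration only; returns None if no such run exists.
    """
    run = 0
    for i, y in enumerate(labels):
        run = run + 1 if y == 0 else 0
        if run == consecutive:
            return i - consecutive + 1
    return None

def count_false_alarms(labels, fault_start):
    """Number of samples flagged anomalous (0) before the assumed fault onset."""
    return sum(1 for y in labels[:fault_start] if y == 0)

# Toy decision sequence: one isolated false alarm, then a sustained fault.
labels = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]
location = detection_location(labels, consecutive=3)
false_alarms = count_false_alarms(labels, fault_start=6)

print("detection location:", location)
print("false alarms before fault onset:", false_alarms)
```

Requiring several consecutive anomalous decisions trades a slightly later alarm for fewer spurious ones, which is why a method with isolated false alarms can still report an early detection location.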

VI.CONCLUSIONS

To achieve an effective transition of the one-class detection rule, this paper proposes a new transfer learning-based anomaly detection algorithm with adversarial training. This algorithm constructs a deep hypersphere adversarial training mechanism to realize end-to-end one-class anomaly detection with deep domain adaptation. The specific conclusions are as follows:

1) Compared with the current transfer learning methods with domain adaptation, the proposed algorithm is more suitable for the problem of anomaly detection with one-class classification by integrating the idea of hypersphere adaptation.

Fig. 18. Visual comparison between DAAD and typical anomaly detection methods listed in Table V. (a) and (c) are the results on the IEEE PHM Challenge 2012 dataset; (b) and (d) are the results on the XJTU-SY dataset. A sample whose y-axis label is 1 is recognized as being in the normal state, while a sample with label 0 is anomalous. The number in the red frame is the detection location according to a specific alarm strategy.

2) Compared with the current transfer learning methods for one-class classification that mainly transfer model parameters, the proposed algorithm can obtain better detection performance through transferring the one-class detection rule in the adaptive extraction of a domain-invariant feature representation.

3) A large difference in data distribution caused by noise interference or low-quality samples in the target domain reduces detection robustness and easily leads to false alarms. Unlike most end-to-end anomaly detection methods, which would not build a clear decision boundary, the proposed method can get a more robust detection model through transferring the previously learned one-class detection rule.

In future work, we plan to study unsupervised anomaly detection with streaming data. The temporal information will be of major concern in the extraction of the detection rule. The model should also be developed in an incremental form to update with sequentially arriving data. Moreover, adaptive extraction of anomalous patterns, even unseen patterns, is another interesting problem in the study of one-class transfer learning.

APPENDIX

The objective function in (6) has four loss functions. Here we calculate the derivative of each loss respectively. We first rewrite the four loss functions in an equivalent form.
