Tao Li, Qi Qian, Yongjun Ren, Yongzhen Ren and Jinyue Xia
1 College of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, 210044, China
2 College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, 210044, China
3 College of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, 210044, China
4 College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, 210044, China
5 International Business Machines Corporation (IBM), NY, USA
Abstract: The Internet of Things (IoT) is applied across industry, agriculture, environment, transportation, logistics, security, and other infrastructure, where it has effectively promoted intelligent development. Although the IoT has grown steadily in recent years, many problems remain to be overcome in terms of technology, management, cost, policy, and security. Users must constantly weigh the benefits of trusting IoT products against the risk of leaking private data. To avoid the leakage and loss of user data, this paper develops a hybrid algorithm that combines a kernel function and a random perturbation method with non-negative matrix factorization (NMF). The algorithm realizes personalized recommendation while protecting users' private data during the recommendation process. Compared with the NMF-based privacy-preserving algorithm, the new algorithm does not need detailed information about the data, only the relationships between data points, and it can process data points with negative features. Experiments show that the new algorithm produces recommendation results of reasonable accuracy while preserving users' personal privacy.
Keywords: IoT; kernel method; privacy-preserving; personalized recommendation; random perturbation
As an emerging technology, the Internet of Things has a complex architecture, lacks unified standards, and faces prominent security issues. To obtain better and more convenient services, users inevitably have to provide more and more detailed personal information. At the same time, more and more personal information is made public, which erodes personal privacy. In particular, government agencies and public service agencies are increasingly publishing data that contain personal information. Whether it is privacy in data publishing or location privacy in location-based services, the protection of users' personal information is particularly important [1,2]. To solve this problem, privacy protection technology is used to protect users' information. Collaborative filtering can help produce effective personalized recommendations [3,4]. A personalized recommendation system collects user ratings for items and searches for users with similar interests to help users decide which product to buy [5–7]. There has been considerable progress in personalized recommendation and privacy preservation, which is summarized as follows.
Personalized recommendation is a key technology that today's Internet relies on. In 2014, Min et al. [8] proposed a model-based collaborative filtering recommendation algorithm that avoids the impact of time factors. Gupta et al. [9] proposed a collaborative filtering prediction method based on demographics. In 2015, Zheng et al. [10] proposed an algorithm that optimizes neighbor selection and predicts audience ratings. In the same year, Xing et al. [11] applied case-based reasoning to collaborative filtering with a forgetting mechanism to address the cold start problem, inferred the user's scores for unrated items, and then recommended the top-N items. In 2016, Qin et al. [12] proposed a collaborative filtering recommendation algorithm based on weighted item categories. In 2017, Li et al. [13] proposed a user item-type rating list based on user ratings and collaborative filtering algorithms. Zayet et al. [14] proposed a new similarity measure weighting technique. In 2018, Yu et al. [15] developed an improved algorithm that relies on historical order data for restaurants. Fan et al. [16] proposed a reliability-based similarity metric; the two similarity predictions are combined to predict the user's rating more accurately and to improve the reliability of the similarity calculation. In 2019, Zhang et al. [17] proposed a novel algorithm, the Fuzzy Rough Set Theory Based Collaborative Filtering Algorithm, which provides efficient advisory tools for the younger generation. In the same year, Kumarl et al. [18] applied the Apriori algorithm to e-commerce. By mining the sales data in a transaction database for interesting associations between the products that customers purchase, it helps merchants formulate marketing strategies, arrange shelves to guide sales, and attract more customers. Also in 2019, Zhang [19] proposed a novel service recommendation approach designed for interactive service composition scenarios, which mines implicit user interests and service correlations for service recommendation.
In terms of privacy preservation, in 2014, Wang et al. [20] proposed a privacy-preserving method for data mining. In 2015, Mochizuki et al. [21] proposed a method for updating similarity in privacy-preserving collaborative filtering. Zhu et al. [22] proposed a simple and effective privacy-preserving framework that applies data perturbation technology. In 2016, Xia et al. [23] proposed a privacy-preserving method that encrypts images to avoid leaking sensitive information. An improved recommendation system based on random perturbation technology was proposed to preserve privacy while making effective recommendations [24]. In 2017, Yang et al. [25] protected user privacy by directly perturbing the original data set. Badsha et al. [26] proposed a privacy-preserving user-based CF technique based on homomorphic encryption, which can determine the similarity between users and then generate recommendations without revealing any private information. In 2018, Badsha et al. [27] developed a privacy protection protocol based on past history.
Based on the above developments in personalized recommendation and privacy protection, privacy-preserving personalized recommendation technology for IoT mobile services is still at an early stage, and most existing results do not address effective recommendation under privacy protection [28,29]. To make up for the shortcomings of recommendation methods in existing IoT mobile services, this paper proposes a new kernel-based personalized recommendation algorithm. It incorporates random perturbation techniques so that the perturbed user data both protects user privacy and enables effective personalized recommendation.
The rest of this paper is organized as follows. The related work section introduces preliminary knowledge related to the new algorithm, including the basic idea of the kernel method. The third section gives the main steps of the new kernel method. The fourth section presents the experimental results and their analysis. Finally, the fifth section summarizes this paper and briefly outlines follow-up research.
NMF seeks a non-negative basis matrix $W$ and a non-negative coefficient matrix $H$ that satisfy $R \approx WH$. All components after decomposition are non-negative, and a non-linear dimensionality reduction is achieved at the same time [30–32]. NMF gives a good explanation of the local characteristics of objects, and it can be used to discover image features in a database for rapid and automatic recognition, to discover the semantic relevance of documents for automatic indexing and information extraction, to identify genes in DNA array analysis, and so on [33–36].
The raw matrix $R_{n \times m}$ is decomposed into the product of two matrices $W_{n \times t}$ and $H_{t \times m}$:
$$R_{n \times m} \approx W_{n \times t} H_{t \times m} \tag{1}$$
where the value of $t$ is usually chosen to satisfy $(n+m)t < nm$, so that the raw matrix $R$ is compressed to $WH$. Each column of $R$ represents an observation, and each row represents a feature. By replacing the raw matrix with the coefficient matrix $H$, a reduced-dimension representation of the data features is obtained, which reduces the storage space [37–39].
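The factorization $R \approx WH$ described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the paper does not say which NMF solver it uses, so the sketch assumes the standard multiplicative update rules, and the function name `nmf`, the toy matrix, and the iteration count are all hypothetical.

```python
import numpy as np

def nmf(R, t, n_iter=300, seed=0, eps=1e-9):
    """Approximate a non-negative R (n x m) by W (n x t) @ H (t x m)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = rng.random((n, t)) + eps
    H = rng.random((t, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates (assumed solver): factors stay non-negative.
        H *= (W.T @ R) / (W.T @ W @ H + eps)
        W *= (R @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative matrix of exact rank 3 (rows = features, columns = observations).
rng = np.random.default_rng(1)
R = rng.random((20, 3)) @ rng.random((3, 30))
n, m = R.shape
t = 3

W, H = nmf(R, t)
rel_error = np.linalg.norm(R - W @ H) / np.linalg.norm(R)
saves_storage = (n + m) * t < n * m   # (20 + 30) * 3 = 150 < 600
print(W.shape, H.shape, saves_storage, round(rel_error, 4))
```

Storing $W$ and $H$ takes $(n+m)t$ numbers instead of $nm$, which is why the condition $(n+m)t < nm$ guarantees compression.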
The literature [40–43] points out that the kernel method maps the data into a new space by a kernel function and then uses non-negative matrix factorization for feature extraction and analysis in the new space. In addition, the kernel method does not require knowledge of the details of the data, only the relationships between data points. Finally, the kernel method can handle data points with negative features [44,45].
The kernel method first maps the data:
$$R \rightarrow \phi(R) \tag{2}$$
Then ordinary NMF is performed in the new space:
$$\phi(R) \approx W_\phi H \tag{3}$$
Because the specific mapping function $\phi$ is not known, both sides are multiplied on the left by $(\phi(R))^T$ to obtain:
$$(\phi(R))^T \phi(R) \approx (\phi(R))^T W_\phi H \tag{4}$$
The left side of the above equation is the definition of the kernel. The detailed mapping does not need to be known, only the inner products. The kernel can be chosen as a Gaussian kernel, a polynomial kernel, and so on. Replacing the above formula with the kernel function gives:
$$K \approx YH \tag{5}$$
where $K = (\phi(R))^T \phi(R)$ and $Y = (\phi(R))^T W_\phi$.
A kernel function maps data from a low-dimensional space to a new space, so that data that cannot be linearly separated become linearly separable. It can usually be described as:
$$k(x, y) = \langle \phi(x), \phi(y) \rangle \tag{6}$$
It can be seen that the factorization of the kernel matrix $K$ into $Y$ and $H$ has the same form as ordinary non-negative matrix factorization, so the same method can be used to solve it.
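Given a new matrix $R_{\mathrm{new}}$, a new coefficient matrix is obtained:
$$H_{\mathrm{new}} = Y^{-1} K_{\mathrm{new}} \tag{7}$$
where $Y^{-1} = (W_\phi)^{-1} ((\phi(R))^T)^{-1}$ and $K_{\mathrm{new}} = (\phi(R))^T \phi(R_{\mathrm{new}})$.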
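A minimal sketch of the kernel factorization and the projection of new data follows. It assumes a Gaussian kernel with a hypothetical width parameter `gamma`, the same multiplicative-update NMF solver as a stand-in for whatever solver the authors used, and the Moore-Penrose pseudo-inverse in place of $Y^{-1}$, because $Y$ is not square in general. All function and variable names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.1):
    """K[i, j] = exp(-gamma * ||A[:, i] - B[:, j]||^2); samples are columns."""
    sq = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * (A.T @ B)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nmf(K, t, n_iter=300, seed=0, eps=1e-9):
    """Multiplicative-update NMF (assumed solver): K (m x m) ~ Y (m x t) @ H (t x m)."""
    rng = np.random.default_rng(seed)
    Y = rng.random((K.shape[0], t)) + eps
    H = rng.random((t, K.shape[1])) + eps
    for _ in range(n_iter):
        H *= (Y.T @ K) / (Y.T @ Y @ H + eps)
        Y *= (K @ H.T) / (Y @ H @ H.T + eps)
    return Y, H

rng = np.random.default_rng(0)
R = rng.normal(size=(8, 12))        # 8 features x 12 observations, negatives allowed
R_new = rng.normal(size=(8, 2))     # two new observations

K = gaussian_kernel(R, R)           # plays the role of (phi(R))^T phi(R)
Y, H = nmf(K, t=4)
K_new = gaussian_kernel(R, R_new)   # plays the role of (phi(R))^T phi(R_new)
H_new = np.linalg.pinv(Y) @ K_new   # pseudo-inverse stands in for Y^{-1}
print(K.shape, Y.shape, H.shape, H_new.shape)
```

Note that `R` contains negative entries, yet `K` is strictly positive, so the non-negativity requirement of NMF is met in the kernel space. Only inner products (kernel values) are used; the mapping $\phi$ itself never appears in the code.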
The new algorithm is mainly composed of two parts: the data privacy protection process and the personalized recommendation process. The privacy protection process hides data using random perturbation technology; the personalized recommendation process obtains recommendation results from the similarity between data.
The flow diagram of the privacy-preserving process for user data is shown in Fig. 1. Privacy preservation hides the raw data through random perturbation techniques. The specific steps are as follows:
Figure 1: Privacy protection process
First, random perturbation techniques are used for privacy-preserving data mining. An intuitive method is to add a random number $d$, so that $a+d$ appears in the database instead of the raw data $a$, where $d$ follows a certain distribution. Although no specific operation can be performed on a single value $a$, if the entire database is of interest rather than a single value, the corresponding aggregate operations can still be performed. The raw data is randomly perturbed so that the server's database stores the perturbed data instead of the raw data, thereby hiding information and preserving privacy.
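The claim that aggregate operations survive perturbation can be checked numerically. The sketch below is an illustration only: the synthetic ratings, the sample size, and the normal noise with assumed parameters `mu` and `sigma` are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                                   # assumed noise parameters
a = rng.integers(1, 6, size=100_000).astype(float)     # synthetic raw ratings in 1..5
d = rng.normal(mu, sigma, size=a.size)                 # random perturbation
stored = a + d                                         # what the server keeps

per_value_error = np.abs(stored - a).mean()            # individual values are obscured
mean_estimate = stored.mean() - mu                     # aggregate is still recoverable
mean_error = abs(mean_estimate - a.mean())
print(round(per_value_error, 3), round(mean_error, 4))
```

Each stored value differs noticeably from its raw value, while the mean over the whole database is recovered closely, because the noise averages out over many samples.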
Consider a recommendation system with $n$ items and $m$ users. For privacy reasons, the server should not know the raw rating of each user or which items the user rated. To this end, random data is generated from a uniform or a normal distribution. With a uniform distribution, every user produces random values in the interval $[\mu-\alpha, \mu+\alpha]$. With a normal distribution, every user produces random values with mean $\mu$ and standard deviation $\sigma$. Each user hides the rating data before sending it to the server. The specific data hiding steps are as follows.
1. The server determines the distribution of the perturbation data (uniform or normal) and the corresponding parameters ($\alpha$, $\mu$, $\sigma$), and informs each user of this information.
2. Each user fills the unrated items with the mean of their existing ratings.
3. Each user $u$ uses the chosen distribution to generate $n$ random numbers $d_{ui}$, $i = 1, 2, \ldots, n$, and adds them to the rating data to obtain a vector of hidden information:
$$r'_u = (r_{u1} + d_{u1},\ r_{u2} + d_{u2},\ \ldots,\ r_{un} + d_{un}) \tag{8}$$
4. Finally, the server uses Eq. (8) to form a rating matrix $R'$ of hidden information.
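The data hiding steps above can be sketched as follows. This is a hedged illustration: the function name `hide_ratings`, the use of `NaN` to mark unrated items, the default parameter values, and the tiny example data are all assumptions made for the sketch.

```python
import numpy as np

def hide_ratings(ratings, dist="uniform", mu=0.0, alpha=1.0, sigma=1.0, rng=None):
    """Return one user's perturbed rating vector; NaN marks an unrated item."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.asarray(ratings, dtype=float).copy()
    # Fill unrated items with the mean of the user's existing ratings.
    r[np.isnan(r)] = np.nanmean(r)
    # Draw one random number per item from the distribution chosen by the server.
    if dist == "uniform":
        d = rng.uniform(mu - alpha, mu + alpha, size=r.size)
    elif dist == "normal":
        d = rng.normal(mu, sigma, size=r.size)
    else:
        raise ValueError(f"unknown distribution: {dist}")
    # Add the noise to obtain the hidden vector that is sent to the server.
    return r + d

rng = np.random.default_rng(42)
users = [
    [5.0, np.nan, 3.0, np.nan],
    [np.nan, 2.0, 4.0, 4.0],
    [1.0, 1.0, np.nan, 5.0],
]
# The server stacks the hidden vectors column by column (items x users).
R_hidden = np.column_stack([hide_ratings(u, rng=rng) for u in users])
print(R_hidden.shape)
```

Because filling and perturbation both happen on the user's side, the server never sees which items were actually rated or what the raw ratings were.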
The flow diagram of the recommendation process is shown in Fig. 2. The feature matrix is obtained by matrix decomposition of the hidden data obtained above. The features of a new user are then calculated from the new user's data and the feature matrix, neighbors are found by similarity, and finally their ratings are weighted to obtain the recommendation. The specific steps are as follows.
Figure 2: User data recommendation process
The steps of NMF for privacy preservation are as follows:
1. The rating matrix $R'$ with hidden information is factorized into $W$ and $H$.
2. Given a new user with rating data $R_{\mathrm{new}}$, the user's predicted rating for item $i$ is calculated. First, use Eq. (1) to obtain the new feature matrix:
$$H_{\mathrm{new}} = W^{-1} R_{\mathrm{new}} \tag{9}$$
3. Then compare the new feature matrix $H_{\mathrm{new}}$ with the $H$ obtained from NMF, and use similarity to make the recommendation. Here the Euclidean distance can be used; for two N-dimensional vectors $x_1$ and $x_2$:
$$d(x_1, x_2) = \sqrt{\sum_{j=1}^{N} (x_{1j} - x_{2j})^2} \tag{10}$$
4. Finally, by setting the number of neighbors, similar users are found and, compared against the raw data, their ratings are weighted to obtain the rating of item $i$ by user $u$.
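Steps 2 to 4 can be sketched as a single function. Several details are assumptions, since the paper does not spell them out: the pseudo-inverse is used for the projection because $W$ is not square, neighbors are weighted by inverse distance, and the name `predict_rating` and the toy matrices are hypothetical.

```python
import numpy as np

def predict_rating(W, H, R_known, r_new, item, k=3):
    """Predict a new user's rating for `item` from the k nearest existing users."""
    # Step 2: project the new user's rating vector into the feature space.
    h_new = np.linalg.pinv(W) @ r_new
    # Step 3: Euclidean distance between h_new and every user's column of H.
    dist = np.linalg.norm(H - h_new[:, None], axis=0)
    # Step 4: take the k nearest users and weight their ratings
    # by inverse distance (assumed weighting scheme).
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + 1e-9)
    return float(weights @ R_known[item, nearest] / weights.sum())

# Toy factors: 6 items, 5 users, t = 2 latent features.
rng = np.random.default_rng(0)
W = rng.random((6, 2))
H = rng.random((2, 5))
R = W @ H

# A "new" user identical to existing user 0 should receive user 0's rating.
pred = predict_rating(W, H, R, R[:, 0], item=2, k=3)
print(round(pred, 4), round(R[2, 0], 4))
```

In this sanity check the new user coincides with user 0, so the distance to that neighbor is essentially zero and its weight dominates the prediction.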
The kernel method improves on the basic idea of NMF. The privacy-preserving step is the same as for NMF and yields a rating matrix of hidden information. The hidden matrix is then decomposed with the kernel method as follows.
1. The hidden matrix is mapped to a new space through a nonlinear mapping. This paper selects the Gaussian kernel function for the mapping, and NMF is then performed to obtain the matrices $Y$ and $H$.
2. Given a new user and its rating information for the items, $R_{\mathrm{new}}$, the user's predicted rating for item $i$ is calculated during the recommendation process. First, use Eq. (7):
$$H_{\mathrm{new}} = Y^{-1} K_{\mathrm{new}} \tag{11}$$
3. Then compare the new feature matrix $H_{\mathrm{new}}$ with $H$, and use similarity to make the recommendation. Here, the Euclidean distance of Eq. (10) can be used.
4. Finally, by setting the number of neighbors, similar users are found and, compared against the raw data, their ratings are weighted to obtain the rating of item $i$ by user $u$.
This algorithm consists of two parts: offline data preparation and online data processing. The first part, offline data preparation, mainly performs the non-negative matrix decomposition of the hidden matrix. Because this part is prepared offline in advance of the second part, it does not affect online recommendation generation, so its time complexity is not considered. The second part includes the calculation of $H_{\mathrm{new}}$, the calculation of Euclidean distances, the neighbor search, and recommendation generation. The time complexity of the $H_{\mathrm{new}}$ calculation is $O(tm)$, that of the Euclidean distance is $O(tn)$, that of the neighbor search is $O(n \log n)$, and that of recommendation generation is $O(kn)$. So the time complexity of the second part is $O(tm)+O(tn)+O(n \log n)+O(kn)$. Since $k$ and $t$ are constants and $m$ and $n$ have the same order of magnitude, the time complexity of the second part is $O(n \log n)$. In conclusion, the online time complexity of the algorithm is $O(n \log n)$.
MovieLens is one of the oldest recommendation systems. The MovieLens dataset contains rating data for multiple movies from multiple users, as well as movie metadata and user attribute information. This data set is often used as a test data set for recommendation systems and machine learning algorithms. The rating file contains each user's ratings of movies. The data in our experiments consists of 1,000,000 ratings from 6,040 users on 3,952 movies, and every user has rated at least 20 movies.
The mean absolute deviation is a statistic that describes the degree of data dispersion. The mean absolute error (MAE) is:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |p_i - q_i| \tag{12}$$
where $p_i$ is the rating predicted by the recommendation system for item $i$, $q_i$ is the real rating of item $i$, and $N$ is the number of items.
As Eq. (12) shows, the prediction accuracy is measured by the MAE between the predicted ratings and the real ratings; a smaller MAE means a more accurate prediction.
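The MAE is straightforward to compute. The helper name `mean_absolute_error` and the three example ratings below are illustrative only.

```python
import numpy as np

def mean_absolute_error(predicted, actual):
    """MAE = (1/N) * sum(|p_i - q_i|) over N rated items."""
    p = np.asarray(predicted, dtype=float)
    q = np.asarray(actual, dtype=float)
    if p.shape != q.shape:
        raise ValueError("predicted and actual must have the same shape")
    return float(np.mean(np.abs(p - q)))

# Worked example: absolute errors are 1.0, 0.5 and 0.0, so MAE = 1.5 / 3 = 0.5.
mae = mean_absolute_error([4.0, 3.5, 2.0], [5.0, 3.0, 2.0])
print(mae)
```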
This paper demonstrates the effectiveness of the proposed algorithm by comparing the kernel method privacy-preserving algorithm with the non-negative matrix factorization privacy-preserving algorithm. First, the mean of each user's rated movies is used to fill that user's unrated movies. Then, each user generates $N$ random values from a Gaussian distribution and adds them to hide the raw data, which creates a new privacy-preserving rating matrix.
The experiment verifies the performance of the recommendation system by predicting the ratings of movies whose ratings are known and comparing the real ratings with the predicted ones. That is, the rating of the movie to be predicted is set to null, and the prediction is made using the hidden matrix and the kernel method-based privacy-preserving algorithm proposed in this paper. Each matrix decomposition yields a result close to the hidden matrix, but the results differ from run to run. Therefore, this paper performs 20 matrix decompositions and takes the mean of the predictions as the predicted movie rating, which narrows the variation in the results.
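The effect of averaging over 20 decompositions can be illustrated with a toy matrix. This sketch makes simplifying assumptions: it uses a multiplicative-update NMF solver with different random initializations, a small synthetic non-negative matrix in place of the hidden rating matrix, and the reconstructed entry $(WH)_{ij}$ as a stand-in for the full neighbor-based prediction.

```python
import numpy as np

def nmf(R, t, seed, n_iter=300, eps=1e-9):
    """Multiplicative-update NMF (assumed solver) with a seed-dependent start."""
    rng = np.random.default_rng(seed)
    W = rng.random((R.shape[0], t)) + eps
    H = rng.random((t, R.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ R) / (W.T @ W @ H + eps)
        W *= (R @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-in for the hidden matrix: non-negative, exact rank 2.
rng = np.random.default_rng(7)
R_hidden = rng.random((10, 2)) @ rng.random((2, 12))
item, user = 0, 0

# Each run starts from a different random initialization, so results differ slightly.
preds = []
for seed in range(20):
    W, H = nmf(R_hidden, t=2, seed=seed)
    preds.append((W @ H)[item, user])
preds = np.array(preds)

mean_pred = preds.mean()   # value reported as the prediction
spread = preds.std()       # run-to-run variation that averaging smooths out
print(round(mean_pred, 4), round(spread, 6))
```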
First, this paper examines the influence of the decomposition dimension $t$ on the prediction accuracy, with $t$ ranging from 5 to 15. Experiments are performed on 6,040 users and 3,952 items using non-negative matrix factorization and the kernel method respectively. The result is shown in Fig. 3.
Figure 3: The relationship between t and the prediction accuracy
In Fig. 3, as $t$ increases, the MAE remains steady, indicating that the value of $t$ has no significant influence on the prediction accuracy. The likely reason is that when the user or item data is sufficiently rich, the sample mean and variance of the perturbed data converge to their expected values. Therefore, when the user or item data is relatively rich, $t$ has no significant impact on the prediction accuracy of the system. In general, because hidden information is used for prediction, the accuracy of non-negative matrix factorization and the kernel method is slightly lower than that obtained by using the original rating data directly. However, the error between the perturbed and unperturbed cases is small, and a certain recommendation accuracy can be achieved.
In Fig. 3, the prediction accuracy is high when $t$ is 6 and 9. The following compares the MAE for different numbers of neighbors $k$ when $t$ is 6 and 9, respectively.
The MAE for different numbers of neighbors $k$ when $t$ is 6 is shown in Fig. 4.
Figure 4: The relationship between prediction accuracy and the number of neighbors k when t is 6 after decomposition
The MAE for different numbers of neighbors $k$ when $t$ is 9 is shown in Fig. 5.
Figure 5: The relationship between prediction accuracy and the number of neighbors k when t is 9 after decomposition
Because non-negative matrix factorization and kernel method predictions mainly use the similarity between new users and existing rating users to make recommendations, the impact of the number of neighbors on prediction accuracy is investigated. Using the 6,040 by 3,952 data set, the number of neighbors is varied and the corresponding prediction results are obtained. The experimental results are shown in Fig. 6.
In Fig. 6, as the number of neighbors $k$ increases, the MAE remains steady, indicating that the prediction accuracy is not sensitive to $k$. The likely reason is that when the user or item data is sufficiently rich, the sample mean and variance of the perturbation data converge to their expected values. Therefore, as $k$ increases, the prediction accuracy of the system tends to be stable. It can also be seen that the prediction accuracy of the kernel method is slightly higher than that of non-negative matrix factorization. The kernel method maps the data to a new space in which the data range is reduced, so the corresponding error is also smaller. Therefore, compared with non-negative matrix factorization, the predictive performance of the kernel method is relatively better.
Figure 6: The relationship between k and the prediction accuracy
In Fig. 6, the prediction accuracy is high when $k$ is 9 and 10. The following compares the prediction accuracy for different values of $t$ when $k$ is 9 and 10, respectively.
The MAE for different values of $t$ when $k$ is 9 is shown in Fig. 7.
The MAE for different values of $t$ when $k$ is 10 is shown in Fig. 8.
Figure 7: Relationship between prediction accuracy and the decomposition dimension t when the number of neighbors k is 9
Figure 8: Relationship between prediction accuracy and the decomposition dimension t when the number of neighbors k is 10
To examine the influence of the dispersion of the perturbation data on the prediction accuracy, the variance of the perturbation data was varied on the 6,040 by 3,952 data set and the corresponding predictions were examined. The experimental results are shown in Fig. 9.
Figure 9: Relationship between variance and prediction accuracy
The degree of perturbation of the raw data has a considerable impact on the prediction. When the degree of perturbation is small, the prediction of the recommendation system is more accurate.
In the above experiments, the results on the perturbed data using the kernel method are slightly better than those of non-negative matrix factorization. Moreover, from the previous validity analysis, regardless of the perturbation distribution, the mean absolute error of the same algorithm should converge to the error on the unperturbed data for large samples.
This paper proposes a kernel method-based privacy-preserving collaborative filtering algorithm that is easy to implement and maintains recommendation effectiveness. The algorithm mainly involves two parameters: the dimension $t$ in the matrix decomposition and the number of neighbors $k$ of the current user. The experimental results in Fig. 3 show that the algorithm is relatively insensitive to $t$, and small changes have little impact on the results; the experimental results in Fig. 6 show that the algorithm is also insensitive to $k$, and small changes in its value can be ignored. In summary, these two parameters can be adjusted according to the actual application scenario. Compared with the NMF-based privacy-preserving algorithm, the prediction accuracy is improved. Compared with direct recommendation on non-hidden data, the accuracy is reduced, but the performance is acceptable as a trade-off for privacy preservation. The limitation of this algorithm is that when the number of users and items is large, the NMF or kernel NMF (KNMF) matrix decomposition takes a relatively long time. However, because the matrix decomposition can be carried out offline, it has little effect on the online neighbor search and recommendation generation.
This paper developed an algorithm that combines kernel non-negative matrix factorization with random perturbation technology. The algorithm is privacy-preserving, which enables an Internet of Things service system to collect the data needed for personalized recommendation while protecting the privacy of users. The experimental analysis demonstrates that, while preserving users' privacy, the kernel method is not sensitive to the number of neighbors $k$ or the intermediate dimension $t$. It improves recommendation accuracy, achieves effective recommendation, and meets the needs of the recommendation system. There are still some problems in the algorithm that need further research, such as the effect of the number of users or items in kernel non-negative matrix factorization.
Funding Statement: This research was supported by the National Natural Science Foundation of China under Grant No. 61772280; by the China Special Fund for Meteorological Research in the Public Interest under Grant GYHY201306070; and by the Jiangsu Province Innovation and Entrepreneurship Training Program for College Students under Grant No. 201910300122Y.
Conflicts of Interest: We declare that there are no conflicts of interest regarding the publication of this article.
Computers, Materials & Continua, 2021, Issue 1