Liang WANG, Shunjiu HUANG, Lina ZUO, Jun LI, Wenyuan LIU
1School of Cyber Security and Computer, Hebei University, Baoding 071000, China
2Xiong’an Intelligent City Innovation Federation, Xiong’an 071700, China
3School of Information Science and Engineering, Yanshan University, Qinhuangdao 066000, China
Abstract: The problem of data right confirmation is a long-term bottleneck in data sharing. Existing methods for confirming data rights lack credibility owing to poor supervision, and work only with specific data types because of their technical limitations. The emergence of blockchain is followed by some new data-sharing models that may provide improved data security. However, few of these models perform well enough in confirming data rights because the data access could not be fully under the control of the blockchain facility. In view of this, we propose a right-confirmable data-sharing model named RCDS that features symbol mapping coding (SMC) and blockchain. With SMC, each party encodes its digital identity into the byte sequence of the shared data by generating a unique symbol mapping table, whereby declaration of data rights can be content-independent for any type and any volume of data. With blockchain, all data-sharing participants jointly supervise the delivery and the access to shared data, so that granting of data rights can be openly verified. The evaluation results show that RCDS is effective and practical in data-sharing applications that are conscientious about data right confirmation.
Key words: Data right confirmation; Symbol mapping coding; Blockchain; Data sharing; Traitor tracing; Access control
The growth of the digital economy relies on trusted data sharing, in which data right confirmation (DRC) remains a challenge. Data sharing is not equivalent to transferring data ownership. The ownership and use rights of the shared data should be correctly confirmed; otherwise, no one will be willing to share information with others. Unfortunately, transaction repudiation (Zhang R et al., 2023) and data piracy (Barni and Bartolini, 2004) are still the worst adversaries of DRC. To withstand them, we need a dependable DRC scheme that integrates more credible methods of traitor tracing (Zhang LY et al., 2020) and access control.
However, the shortcomings of existing DRC approaches in traditional data-sharing models cannot be ignored. First, most of those models depend on trusted third parties (TTPs) (Coffey and Saidha, 1996), which may not be fully trustworthy. Second, some models are built on staged encryption (Ali et al., 2016), which is usually inefficient when dealing with large volumes of data. Third, digital watermarking based DRC schemes work only for sharing certain types of data (Wang HL et al., 2018), which could be detrimental to the availability of the ubiquitous undistorted data. The absence of watermark forgery supervision mechanisms is also a major limitation of those schemes.
Recently, researchers proposed using blockchain as a distributed TTP to record data-sharing processes, aiming at providing DRC services (Zha et al., 2020; Zhao et al., 2021). Those blockchain-based data-sharing models support DRC to some extent, but they have limited control over the data-sharing processes. Moreover, those models either store excessive data on blockchain or require complicated encryption computation, and thus often result in high costs and low benefits in practice.
Considering the above deficiencies, we propose a new data-sharing model, RCDS, which combines symbol mapping coding (SMC) and blockchain. SMC is a method that encodes data holders’ fingerprints into the byte sequence of the data copy. It allows data right granting to be independent of data content, so RCDS can work on any type of data. Blockchain in RCDS distinctively records the key mapping elements generated by SMC, and offers reliable evidence when confirming data rights. In RCDS, the blockchain witnesses the whole process of data sharing, and endows traitor tracing and access control with provable credibility.
Existing DRC schemes for data sharing can be divided into three groups: TTP-based schemes, phased encryption schemes, and blockchain-based schemes. We review them briefly below.
1.2.1 TTP-based schemes
TTP-based schemes usually employ third parties to supervise the communication among data-sharing participants and to offer DRC proofs (Zhu ZM and Jiang, 2016; Frattolillo, 2017; Ganesh et al., 2017). TTPs in these schemes play the role of evidence verifier. Once disputes occur, a TTP will provide testimony for arbitration. The problem is that the so-called TTP may not be fully trustworthy. Coffey and Saidha (1996) first proposed a TTP-based scheme for the general non-repudiation problem. In this scheme, the third party was relied upon without reserve, which may increase the possibility of collusion attacks. To solve this problem, Zhu ZM and Jiang (2016) introduced an anti-collusion data-sharing model based on an asymmetric cryptosystem and the Dolev-Yao model. However, this model could not resist man-in-the-middle or data-tampering attacks because servers did not verify user registration (Ganesh et al., 2017). Moreover, the single point of failure is a problem that cannot be ignored under the centralized architecture. In summary, TTP-based schemes cannot meet the requirements of DRC because data can easily be tampered with and third parties are not fully trusted.
1.2.2 Phased encryption schemes
A phased encryption scheme implements reliable data sharing usually by encrypting shared data in phases (Ali et al., 2016; Zaghloul et al., 2020). In this scheme, shared data are divided into several parts and delivered piece by piece. Once a piece is received, an ACK must be returned from the receiver to the sender before the next piece can be sent. Then, the ACK is used as non-repudiation evidence. The drawback of this scheme is that too many rounds of communication could be needed in one data-sharing instance. Ali et al. (2016) proposed a two-stage non-repudiation protocol, but the problem of low efficiency remained. Furthermore, the TTPs assumed by these schemes are unrealistic.
1.2.3 Blockchain-based schemes
Blockchain has been introduced into data sharing as an infrastructure to provide irrefutable evidence of DRC (Huckle and White, 2017; Gong et al., 2019; Zha et al., 2020). Blockchain’s features are often used to trace data sources and their sharing histories (Saini et al., 2021; Zhao et al., 2021). Blockchain and digital watermarking are combined to improve the security of copyright protection and data source tracing (Wang HL et al., 2018; Qian et al., 2019). Wang HL et al. (2018) proposed a combinatorial model, using first a data holding proof method to audit data integrity and then a digital watermarking scheme to confirm the origin of the shared data. However, this degree of combination is still insufficient to fight against data piracy, because access control is not emphasized.
Some researchers combined blockchain with encryption systems to improve access control in data sharing (Ersoy et al., 2021; Sifah et al., 2021), but this approach is usually not efficient when faced with large volumes of data. Wang S et al. (2022) proposed a big-data sharing scheme that uses smart contracts to execute access rules. However, encryption and decryption require a lot of time for a large amount of data, resulting in low efficiency.
In addition, zero knowledge proof (ZKP) and non-fungible token (NFT) inspired researchers to develop some featured DRC methods. Some new studies constructed ZKP models to prove the ownership or use right of assets by third parties (Cao and Zhao, 2021; Sun et al., 2021; Lin et al., 2022). Other studies mainly generated unique digital certificates (i.e., NFTs) to claim ownership of specific data assets and achieved secure data sharing by selling those NFTs. However, the above techniques have serious shortcomings in DRC data sharing. For example, ZKP construction often has low efficiency and poor scalability (Giacomelli et al., 2016; Parno et al., 2016), and NFT is still a technically immature concept (Okonkwo, 2021).
The contributions of our work to research on DRC in the process of data sharing are as follows:
1.We propose SMC to make fingerprint encoding suited to any type of data content.It prevents fingerprint forgery and enhances the credibility of DRC in the process of data sharing.
2.RCDS enables data-sharing processes to be supervised on blockchain publicly by empowering trusted traitor tracing and access control.
Here,we first introduce relevant technical concepts that will be used in the RCDS model.
1. Transaction repudiation. There are currently two types of transaction repudiation (Chen et al., 2022; Wang L et al., 2022) in the process of data sharing: repudiation of sending (RoS) and repudiation of receipt (RoR). RoS occurs in a situation like this: Alice (the data sender) fabricates a record in which she sent data to Bob (the data receiver), thereby imposing responsibility for data security on Bob. RoR is another case: Bob denies the fact that he received data from Alice, thereby shirking his responsibility for data security. Non-repudiation requires that no participant be able to deny his/her own behavior.
2. Data piracy. Data piracy refers to the acts of reproducing and redistributing data copies without the consent or authorization of data owners. It is tricky because of the replicability of the data. Antipiracy requires some control over access to data.
3. Traitor tracing. Traitor tracing is a common countermeasure against transaction repudiation. It does not prevent users from denying what they have done, but tracks and obtains evidence of what they have done. It is often a strategy of pre-deterrence and post-accountability rather than prevention, and DRC is the underlying logic of this strategy.
4. Access control. Access control refers to policies that prevent unauthorized access to data. Authentication and fine-grained data encryption are common methods for access control, and are generally premised on DRC.
Blockchain technology is capable of ensuring the security of data transmission and access by multi-party co-maintenance and cryptography,especially consistency achievement of data storage,tamper-resistance of records,and anti-repudiation of data delivery (Zhu LH et al.,2019).Therefore,blockchain systems are expected to facilitate data sharing in a more credible way than traditional cloud-and exchange-based systems(Gai et al.,2018).Generally,multi-party co-maintenance and redundant storage provide blockchain systems with decentralization.At the same time,tamper-resistance and anti-repudiation embody the reliability of data sharing.
SMC is the innovative underpinning of our work.For ease of understanding,Table 1 lists the notations used in this paper.
Table 1 Notations used in this paper
Given a data object, SMC works by dividing the byte sequence of the data object into symbols and recoding them using a generated symbol mapping table (SMT). SMT maps each symbol into two different types of digital codes: one is called plain code (which is used to encode symbols) and the other is called hidden code (which is used to carry fingerprints). Unlike ordinary encryption, SMC uses SMT instead of a secret key. To make SMT difficult to guess, we generate it using one-way mapping.
Specifically, an SMT includes a symbol set, a plain code set, and a hidden code set. Each symbol will be linked to at least one plain code, so will each hidden code. Let S be the symbol set, P the plain code set, and H the hidden code set. An SMT should meet the following two irreversible surjections:
Table 2 is an explanatory SMT. Given the string AAEIOU, the code “0x0041 0xE410 0xE452 0xE4C1 0xE4F0 0x004D” mapped with Table 2 carries the hidden string WORLD!. In this way, we can generate an SMT for any given data content to hide a string that is used as a fingerprint.
Table 2 Schematic symbol mapping table
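The AAEIOU example can be made concrete with a minimal sketch that models an SMT as two lookup tables. The six codes are the ones quoted in the text; the positional pairing (the i-th code carries the i-th symbol and the i-th hidden character) is an assumption made for illustration, and the function names are hypothetical.

```python
# Toy symbol mapping table (SMT) built from the six codes quoted in the text.
# Assumption: the i-th code carries the i-th symbol and the i-th hidden character.
CODES = [0x0041, 0xE410, 0xE452, 0xE4C1, 0xE4F0, 0x004D]

# Private part: plain code -> symbol (two plain codes map to "A").
SMT_PLUS = dict(zip(CODES, "AAEIOU"))
# Public part: plain code -> hidden code (one fingerprint character).
SMT_MINUS = dict(zip(CODES, "WORLD!"))


def decode_symbols(codes):
    """Recover the original symbols; this needs the private part."""
    return "".join(SMT_PLUS[c] for c in codes)


def read_hidden(codes):
    """Read the hidden string; this needs only the public part."""
    return "".join(SMT_MINUS[c] for c in codes)


print(decode_symbols(CODES))  # AAEIOU
print(read_hidden(CODES))     # WORLD!
```

Note that the symbol A is reachable from two different plain codes, which is what allows one symbol to carry different hidden characters at different positions.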
RCDS is a data-sharing model that achieves reliable DRC by identifying fingerprints in data copies.Fig.1 shows the working principle of this model,and we describe its workflow in detail below.
Fig. 1 Overview of RCDS
1.DistributorDgenerates an SMTDfor data objectdwith his/her fingerprintFDand private key hashh(SKD),encodesdintowith the generated SMTD,and records,FD,,and DDes onto blockchain BC.
2.Dsendsto userUthrough the off-chain channel,and uploads the transaction record to BC.
3.Authorized userUobtainsand DDes from BC.
4.Ugenerates an SMTUfor data objectwith his/her fingerprintFUand private key hashh(SKU),encodesintowith the generated SMTU,redacts queryqaccording to DDes,and recordsFU,,andqonto BC.
5. U sends SMT−_U to D through the off-chain channel, and uploads the transaction record to BC.
6.Dobtains,FU,andqfrom BC.
7. D encapsulates the access control policy into a decoder and uploads the decoder to BC.
8. U obtains the decoder from BC.
9.The arbiter judges the repudiation behavior with the interactive records on the blockchain and fingerprint verification results.
In this model, a data delivery process between D and U can be formalized as the following steps:
When an authorized user wants to access d, he/she should generate a query statement q according to the data description DDes and apply for a decoder from the distributor. The user then obtains the data he/she needs through the decoder, rather than directly obtaining the whole data from the encoded copy. Therefore, we design φ4, which encapsulates the access control policy and publishes only minimal query interfaces that do not disclose the original data. We also design φ5 to ensure that only the correct user can query information from the data.
The arbiter can use the tester on the data copy to identify whether the holder of d is an authorized user or a distributor:
Each of the above testers returns either negative or positive. Positive means that a fingerprint is detected from the data copy, and negative means the opposite. From Fig. 1, it is easy to understand that each party of data delivery holds a unique copy of the data, in which the party’s own fingerprint is embedded, and that the user’s copy of data is confidential to the distributor. φ6 is the fingerprint identification algorithm that is the core of DRC. Any node in the blockchain can act as an arbiter to identify the fingerprints from d to check users’ permissions.
In this model,we use the blockchain instead of a TTP,and use it to provide a complete evidence chain for each data-sharing process.Nodes of the blockchain network are exactly participants in data-sharing activities.Through consensus,they jointly maintain the consistency of the blockchain.The blockchain immutably records all key elements of each data delivery transaction to make the model credible.
In RCDS,the sharing of a data object involves a series of processes including SMT generation,fingerprint embedding,fingerprint identification,and data query.Based on implementing these processes,the following subsections present the construction of RCDS.
Unlike image watermarks, fingerprints are embedded with the help of an SMT, which works to encode and decode the raw data like a codebook. Fingerprints are encoded in the form of hidden codes along with the generation of SMTs. In turn, these fingerprints can be identified by parsing the encoded data with these SMTs. An SMT must be associated, one to one, with the corresponding data, so fingerprint identification can be unambiguous. With this in mind, we design φ1 as described in Algorithm 1, where an SMT is divided into two parts: the private part, SMT+, consisting of 〈sθ, pρ〉, and the public part, SMT−, consisting of 〈pρ, hη〉.
In Algorithm 1, the value of θ should be a random number and meet two conditions. The first condition is SMT− ≤ τ, where τ is an upper limit artificially designed for SMT− in practice. The second condition is |d^F| ≥ λ|F|, ensuring that there is enough encoding space of d^F at the side of U. To ensure enough encoding space, we assume that the explicit code space is twice the symbol space. According to the given τ and λ, we can set the range of θ to satisfy the following conditions:
The function symbolize(·) is used to randomize the value ofθwithin the range defined by inequality (1):
Givenγ ∈(0,1],the fingerprint redundancy?should be evaluated in the following range to obtain sufficient strength of fingerprint embedding:
Inequality (2) conforms to the constraint of being over perfect;i.e.,if?>|S|were true,each symbol of the data object would match more than one fingerprint,which would add to the size of SMT.
It is clear that?|F|=η|S|,so the range ofηcan be calculated with inequality (3):
Within this range,the function customize(·) is used to obtain a random value forη.
The function obfuscate(·) increases the redundancy of SMT to makeθharder to guess.The value ofχshould not exceed the encoding space specified byρ,so it can be calculated by
The value of c900 is derived from the third five-digit hexadecimal string in h(SK). It is obtained by adding the first four digits together and multiplying the sum by the fifth digit, so that the result never exceeds (15+15+15+15)×15 = 900.
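The derivation of c900 can be sketched in a few lines. The sketch below assumes that h(·) is SHA-256 and that the hexadecimal digest is split into consecutive five-digit groups counted from the left; neither choice is stated in the text, and the function names are hypothetical.

```python
import hashlib


def c900_from_group(group: str) -> int:
    """Sum the first four hex digits and multiply by the fifth (maximum 60 * 15 = 900)."""
    digits = [int(ch, 16) for ch in group]
    return sum(digits[:4]) * digits[4]


def c900(private_key: bytes) -> int:
    # Assumption: h(SK) is the SHA-256 digest of the private key, written in hex.
    digest_hex = hashlib.sha256(private_key).hexdigest()
    # Assumption: the "third five-digit string" is hex digits 10..14 (0-based).
    return c900_from_group(digest_hex[10:15])


print(c900_from_group("fffff"))  # 900
print(c900_from_group("1234a"))  # 100
```

Because the value depends only on h(SK), it is reproducible by the key holder but hard for other parties to predict.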
The parameter ρ specifies the encoding space of the plain code and the expansion multiple of d. For example, when θ=1 and ρ=2, a 2 GB d^F will be obtained from a 1 GB d. Therefore, ρ is usually calculated to meet the minimum requirement of the encoding space (line 3 in Algorithm 1).
Specifically, because the secrecy of θ should be maintained to make SMT difficult to forge, h(SK) in Algorithm 1 makes the outputs of symbolize(·) (line 1 in Algorithm 1), customize(·) (line 4 in Algorithm 1), and obfuscate(·) (line 14 in Algorithm 1) user-dependent. This makes θ more difficult to guess, because no users will reveal their private keys.
To facilitate authentication, we calculate the hash value of SMT− and record h(SMT−) on the blockchain, so that everyone in the network can obtain and verify SMT− through h(SMT−) and finally authenticate the corresponding data object through fingerprinting.
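The on-chain check of the public part can be illustrated as follows. This is only a sketch under stated assumptions: SHA-256 stands in for h(·), the public part is modeled as a dictionary from plain codes to hidden codes, and the canonical serialization (sorted items dumped as JSON) is a hypothetical choice.

```python
import hashlib
import json


def smt_digest(smt_minus: dict) -> str:
    """Hash a canonical serialization so that key order does not matter."""
    canonical = json.dumps(sorted(smt_minus.items())).encode()
    return hashlib.sha256(canonical).hexdigest()


def verify_smt(received: dict, on_chain_digest: str) -> bool:
    """Accept a received public part only if it matches the recorded digest."""
    return smt_digest(received) == on_chain_digest


recorded = smt_digest({0x0041: 87, 0xE410: 79})        # written to the chain once
print(verify_smt({0xE410: 79, 0x0041: 87}, recorded))  # True
print(verify_smt({0x0041: 87, 0xE410: 80}, recorded))  # False
```

Recording only the digest keeps the on-chain footprint small, while any later substitution of the public part is detectable.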
A data object should firmly carry its holder’s fingerprint before it can be used. For a single delivery, the data holders include the distributor and the authorized user of the data copy. A fingerprint must be able to uniquely bind to a publicly verifiable identity. A reasonable way to make such a fingerprint is to generate it from the public key that can uniquely identify a data holder. Another factor that should not be overlooked is the size of the fingerprint. For a data object that is not very large, a fingerprint should not be too long, to avoid losing its robustness of embedding. Therefore, a viable way of fingerprint generation is to use the hash of the data holder’s public key as a fingerprint and keep |F| ≤ |d| true.
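A minimal sketch of this fingerprint generation is given below. SHA-256 is an assumption (the text does not name the hash function), and `make_fingerprint` is a hypothetical helper.

```python
import hashlib


def make_fingerprint(public_key: bytes, data_len: int) -> bytes:
    """Use the hash of the holder's public key as the fingerprint F."""
    fingerprint = hashlib.sha256(public_key).digest()  # assumption: SHA-256
    # Enforce |F| <= |d| so that the data object can carry the fingerprint.
    if len(fingerprint) > data_len:
        raise ValueError("fingerprint is longer than the data object")
    return fingerprint


fp = make_fingerprint(b"holder-public-key-bytes", data_len=1024)
print(len(fp))  # 32
```

Anyone who knows the holder's public key can recompute the same fingerprint, which is what makes the binding publicly verifiable.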
To enhance the strength of fingerprint embedding,it is better for a data object to be fully overlaid with redundant fingerprints.As inequality(2)requires,anη-partition (Ptη) of the redundant fingerprints should be done to fit|S|(line 6 in Algorithm 1).
In addition,the fingerprint of each user will be recorded on the blockchain and open to the network,so that everyone in the network can use these fingerprints to check the identities of the shared data.
The embedding of redundant fingerprints is actually an encoding on data objects, and it is a two-step process. First, D pre-codes d through φ2 to embed F_D. d^{F_D} cannot be directly read and thus can be directly delivered to U. Then, U finally codes d^{F_D} through φ2 to embed F_U. d^{F_DU} allows U to query data through a dedicated decoder. Algorithm 2 details φ2.
Algorithm 2 realizes a full coverage strategy in the encoding process. In the strategy, plain codes replace all the symbols of a data object, and hidden codes cover the duplicates of one fingerprint. Doing so can protect the original data from being leaked and make the cost of destroying a fingerprint unacceptable, because the data will not be recovered as usable if all the fingerprints hidden behind the data are broken. Another strategy in Algorithm 2 is randomized coding (line 6). It randomly chooses a plain code from SMT to match up with each hidden code, blocking the sequential guessing of the original symbols.
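The two strategies, full coverage and randomized coding, can be illustrated with a toy encoder. This is a simplified sketch and not Algorithm 1 or Algorithm 2: symbols are single bytes, plain codes are random 16-bit integers, the fingerprint is repeated cyclically over the data, and all names are hypothetical.

```python
import random
from itertools import cycle


def build_toy_smt(data: bytes, fingerprint: bytes, codes_per_pair: int = 2, seed: int = 0):
    """Give every needed (symbol, hidden code) pair a few unique plain codes."""
    rng = random.Random(seed)
    pairs = sorted(set(zip(data, cycle(fingerprint))))
    pool = rng.sample(range(0x10000), len(pairs) * codes_per_pair)
    return {
        pair: pool[i * codes_per_pair:(i + 1) * codes_per_pair]
        for i, pair in enumerate(pairs)
    }


def encode(data: bytes, fingerprint: bytes, table, rng: random.Random):
    """Replace every symbol; pick one of the candidate plain codes at random."""
    return [rng.choice(table[(s, h)]) for s, h in zip(data, cycle(fingerprint))]


def invert(table):
    """Plain code -> (symbol, hidden code), used here only to check the result."""
    return {code: pair for pair, codes in table.items() for code in codes}


data, fingerprint = b"AAEIOU", b"FP"
table = build_toy_smt(data, fingerprint)
encoded = encode(data, fingerprint, table, random.Random(1))
inverse = invert(table)
symbols = bytes(inverse[c][0] for c in encoded)
hidden = bytes(inverse[c][1] for c in encoded)
print(symbols, hidden)  # b'AAEIOU' b'FPFPFP'
```

Every position of the output carries both a data symbol and a fingerprint character, so removing the fingerprint copies also removes the information needed to recover the data.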
It is worth noting that the delivery ofdFDis recorded as a main part of a transaction on the blockchain.Such records will be the compelling evidence for delineating suspects.
The purpose of fingerprint identification is to test whether the data object contains the fingerprint of a specific user. Different from traditional digital watermarking, our method is based on hypothesis testing. In other words, RCDS needs only to verify whether the data object contains specific fingerprints, and usually does not need to extract the fingerprints. We divide the fingerprint recognition algorithm into two parts: forward verification and backward verification. The details are as follows:
Algorithm 3 is a concrete solution to forward verification of φ6. This solution collects the possible hidden codes by processing the plain code part of the data, then divides the collected hidden codes into segments of length |F| to obtain the possible fingerprints, and finally matches them with the user’s fingerprint to obtain the match frequency. By calculating the Pearson correlation coefficient (Pan et al., 2021), at least one full match is needed to pass the verification.
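The collect, split, and match steps of forward verification can be sketched as below. This is a simplified illustration and not Algorithm 3 itself: the Pearson correlation step is omitted, the public part is modeled as a dictionary from plain codes to hidden bytes, and the threshold parameter `min_full_matches` is hypothetical.

```python
def forward_verify(encoded, smt_minus, fingerprint: bytes, min_full_matches: int = 1):
    """Return (passed, match_count) for a sequence of plain codes."""
    # Step 1: collect the hidden codes carried by the plain codes.
    hidden = [smt_minus[c] for c in encoded if c in smt_minus]
    # Step 2: split the hidden codes into segments of length |F|.
    n = len(fingerprint)
    segments = [bytes(hidden[i:i + n]) for i in range(0, len(hidden) - n + 1, n)]
    # Step 3: count the segments that fully match the user's fingerprint.
    matches = sum(segment == fingerprint for segment in segments)
    return matches >= min_full_matches, matches


smt_minus = {1: ord("F"), 2: ord("P"), 3: ord("F"), 4: ord("P"), 5: ord("X")}
print(forward_verify([1, 2, 3, 4], smt_minus, b"FP"))  # (True, 2)
print(forward_verify([5, 2, 5, 4], smt_minus, b"FP"))  # (False, 0)
```

The check is strict about code order, which is why a separate backward verification is needed for data that has been deliberately manipulated.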
When Algorithm 3 returns negative,it does not mean that there is no fingerprint match,because malicious processing on the encoded data might exist.So,we design Algorithm 4 to perform a backward verification,which estimates in reverse a set of possible plain codes of the complete redundant fingerprints,and compares the usage of each byte of these plain codes with that of the encoded data.Considering that a traitor may perform some special malicious operations on the data,such as adding 1 to each binary bit of the plain code,we analyze the difference between the bytes of the plain code and those of the malicious data to identify the embedded fingerprints.
The most complicated part of forward verification and backward verification is parameter optimization,which has a great impact on the effect of fingerprint identification.The parameterεis the lower limit ratio of fingerprint redundancy.It indicates that the minimum number of duplicate fingerprints should be detected correctly.However,the forward verification in Algorithm 3 is strict with the sequence of plain codes,so it is sensitive only to the correctdF.When the forward verification of Algorithm 3 fails,the possibility of including fingerprints is not ruled out.So,at this time,reverse verification is another way to solve this situation.It uses the bytes of the complete string of redundant fingerprints as a function and observes to what extent the data to be processed conforms to this function.The parameter?sets the upper limit of this range.Adjustingεand?will have a decisive impact on the fingerprinting.According to inequality (1),we can do the following derivation:
Obviously,if the above condition is not met,it is very likely that a deliberately manipulated piece of data is being managed.In Algorithms 3 and 4,parametersεand?are fine-tuned to increase the sensitivity of fingerprinting when the condition is not met.
For the data to be shared with RCDS,we use a secret encapsulation method to control the decoding ofdF.This method integrates data encryption and access control policies.
In Fig.1,DDes is transmitted together with the encoded data.To enable users to obtain the data they need quickly and accurately,this description should be as detailed as possible,and at least include the parts shown in Fig.2.
Fig.2 DDes format
After an authorized user obtains the encoded data and DDes, he/she encodes his/her fingerprint into the data. Then, he/she needs to send a query statement q to the data distributor according to DDes to apply for access. To help distributors better encapsulate decoders, we design the following query primitives for the encapsulation:
First,the SELECT statement is the most frequently used and powerful query for relational databases.We use query primitives in structured query language (SQL) to query data for relational data described in DDes.See Fig.3 for the specific query statement.
Fig.3 Relational data query primitives
Second,to query file type data,we expand the query statements used in relational data by studying the literature (Wu,2009),so that it can accurately search in file type data.These extensions are manifested mainly in the WHERE clause.The WHERE clause describes the conditions that the target object should meet.The specific expression is shown in Fig.4,in which the keywords are explained in Table 3.
Fig.4 File data query primitives
Table 3 Explanation of the file query primitives
For the query statement given above,we will further illustrate the language description ability of the query statement through a query example (mainly for file type data,because relational data are similar to an SQL query).
Fig.5 shows an example of using file data query primitives.It is a query for a specific artwork and expresses the following meaning:there are two objects in the artwork;one is a dog and the features of its color are similar to those of picture dog.jpg,and the other is a house and the features of its shape are similar to those of picture house.jpg;the house is located on the left side of the dog.
Fig.5 An example of using file data query primitives
As shown in Fig.6,whenDreceivesq,he/she should judge it to determine whether it complies with the access control policy.The judging of this part is based mainly on the following:
Fig.6 Access control:(a) D phase;(b) U phase
1.Judge the origin ofq:check whetherqis sent by an authorized user.
2.Check the query ofq:judge whether all datasets can be obtained by combining all local application queriesq.If not,continue with the following operations.
3.Judge whetherqcomplies with the access rules in DDes.
The access rules in DDes define what data can be accessed and what data cannot. For example, in an employee information table of a relational database, a user’s name is private information; salary and department cannot be accessed at the same time; and name combined with diagnosis record, or name combined with disease, involves personal privacy and cannot be accessed.
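The combination rules in this example can be expressed as a simple check on the columns requested by q. The sketch below covers only the "cannot be accessed together" rules; the rule encoding and the column names are hypothetical and do not reflect the actual DDes format.

```python
# Each entry lists columns that must not appear together in one query.
FORBIDDEN_COMBINATIONS = [
    frozenset({"salary", "department"}),
    frozenset({"name", "diagnosis_record"}),
    frozenset({"name", "disease"}),
]


def complies(requested_columns) -> bool:
    """Reject a query that requests every column of any forbidden combination."""
    requested = set(requested_columns)
    return not any(combo <= requested for combo in FORBIDDEN_COMBINATIONS)


print(complies(["salary"]))                  # True
print(complies(["salary", "department"]))    # False
print(complies(["name", "disease", "age"]))  # False
```

A check of this kind only inspects a single query; detecting that several queries together reveal a forbidden combination, as step 2 above requires, would need the history of earlier queries as well.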
If the check ofqis not qualified,Dwill not return any decoder;otherwise,Duses functionφ3to decrypt the private part of user application data fromand reassemble it into a newThenDuses functionφ4to encapsulate the decoder?,including,public parameters,and access control policies of data transmission.The decoder provides only the necessary interface for users to obtain the data slots they need.
We need to ensure that U cannot obtain the contents of the decoder through other means. Some basic requirements for encapsulating the decoder include: (1) it should be invisible and unable to be disassembled; (2) it should be small enough to be placed on the blockchain; (3) it should be able to check whether the new query of data meets the policy. When the above requirements are met, we can use a smart contract to act as a decoder.
However, smart contracts may have data privacy problems, because data parameters could be disclosed during contract execution. We find some solutions in the literature, and summarize them in the following categories:
1. Split contract (Kosba et al., 2016; Kalodner et al., 2018; Li et al., 2019). The contract involving sensitive information is a private contract or an off-chain contract, and is not disclosed to the public.
2. Define smart contract language (Steffen et al., 2019; Baumann et al., 2020). Permission control is performed on variables, functions, and other elements of the contract that involve sensitive information.
3. Build a smart contract execution framework (Yan et al., 2020). Smart contracts are encrypted, decrypted, and executed in combination with a trusted execution environment (TEE).
In RCDS, we use the certification function and the black box nature of the TEE to have smart contracts executed securely. The core idea of the TEE is to build a hardware security area, and data are calculated only in this security area to ensure their confidentiality and integrity. The running state of a smart contract in the TEE is trusted and cannot be obtained by the outside world, which ensures parameter safety in execution processes.
As shown in Fig.6,whenUreceives the decoder sent byD,he/she will input his/herand.Then,the decoder usesto check whether the user embeds his/her fingerprint in.If the check is qualified,the decoder will returnd?.To this end,we design Algorithm 5,and implement it on the blockchain as a template for a smart contract for authorized users to query data.The first step to ensure query security is to check whether a user embeds his/her own fingerprint inthroughφ6.This largely prevents unauthorized users from using the decoder.After passing theφ6check,the decoder usesandto convertinto,and then recoversd?fromby combining ?and access control policies.
The blockchain network is essential for realizing the aforementioned fingerprint identification and access control, and should ultimately help realize the workflow shown in Fig. 1 and the data distribution model shown in Fig. 7. We design the blockchain network mainly in two parts: data model and consensus mechanism.
Fig.7 Data distribution pattern for right confirmation
To collect data-sharing records correctly,we design the transaction chain data structure shown in Fig.8 based on the following ideas:first,Dcreates an initial transaction txn1 which registers the data to be shared.At the same time,Dembeds the fingerprint in the original data to form,generates the corresponding DDes,and writes,FD,DDes,andinto txn1.When data sharing is required,Dcan transferand DDes to any authorized userU.Next,whenUreceives the data,he/she will execute SMC to obtain data,which contains his/her fingerprint.Meanwhile,q,FU,and the hash digest ofare written into a transaction txn2.Finally,for each upcoming queryqfrom any of the authorized users,the distributor will create a uniquely corresponding decoder ?and write its digest into a transaction txn3,so that the authorized user can access the data object.These transactions are linked together in chronological order to constitute a transaction chain.Different transaction chains are intertwined with the blockchain.The transactions in each transaction chain transmit consensus in the network through the shielded pool (Kappos et al.,2018),and are then packaged into different blocks.From the above transaction process,we can find that every time an authorized user wants to access a data object,he/she applies to the distributor for access permission and this behavior will be verified on chain.Before the digest of the data object is uploaded to the blockchain,a decoder has hadandFUencapsulated in the case where some attackers use fake parameters to defraud the distributor to obtain permission to access data.
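The chronological linking of txn1, txn2, and txn3 can be illustrated with a small data structure. This is a sketch under stated assumptions, not the structure of Fig. 8: linking each transaction to the digest of its predecessor, the use of SHA-256 over a JSON serialization, and all field names and payload values are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Txn:
    kind: str      # "txn1" (registration), "txn2" (receipt), "txn3" (decoder)
    payload: dict  # placeholder fields, not the real transaction format
    prev: str      # digest of the previous transaction; "" for the first one

    def digest(self) -> str:
        body = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(body).hexdigest()


def chain_is_consistent(chain) -> bool:
    """Check that every transaction points at the digest of its predecessor."""
    expected = ""
    for txn in chain:
        if txn.prev != expected:
            return False
        expected = txn.digest()
    return True


txn1 = Txn("txn1", {"F_D": "fp-of-D", "DDes": "description"}, "")
txn2 = Txn("txn2", {"q": "query", "F_U": "fp-of-U"}, txn1.digest())
txn3 = Txn("txn3", {"decoder_digest": "digest-of-decoder"}, txn2.digest())
chain = [txn1, txn2, txn3]
tampered = [replace(txn1, payload={"F_D": "forged", "DDes": "description"}), txn2, txn3]

print(chain_is_consistent(chain))     # True
print(chain_is_consistent(tampered))  # False
```

With such linking, altering the registration record after the fact invalidates every later transaction in the same chain, which is the property that makes the records usable as evidence.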
Fig.8 Transaction chain over blockchain
For the consensus mechanism,we adopt the cascade consensus protocol (CCP) reported in our previous works (Wang L et al.,2020,2021).CCP organically coordinates the periodic data transmission through a consensus process,which makes the data transmission undeniable.Its working mode meets the requirements of the data transmission part of RCDS.
We use Spring Boot to build a simulation blockchain platform (http://www.hbusoftsec.org.cn/files/rcds_bc.zip),which employs the above data model and consensus protocol.Some simulations that we discuss later in this study are conducted on this platform.
In this section,we present a theoretical analysis of RCDS effectiveness in DRC and attack resistance.
BeforeUaccesses the shared data copy,he/she must perform SMC,which will inevitably leave traces on the blockchain.Suppose that the suspected datadFhave been captured byA.Acan perform traitor tracing using the following steps:
Then,the non-repudiation of data sharing with RCDS is analyzed as follows:
1.Both tasks of accessing data and tracing traitors are forced to obtain parameters from the blockchain.Doing so can form two-way containment of the misconduct of bothDandU.IfUprovides fakeorFU,qofUwill fail to pass the censorship;ifDprovides fakeorFD,there will be no way forDto authenticate the ownership ofd.In the interests of both parties,the best strategy is to honestly abide by the rules of data delivery.
2.It is impossible to guess the fingerprints fromdFand SMT?.dFis actually a kind of cipher text ofdbecausedand its SMT are separate from each other.If the identity ofDorUis unknown,the fingerprints indFare completely imperceptible.The invisibility of fingerprints leaves no room for traitors to deny their misbehavior.
3.Dcannot frameU.IfDintends to frameU,he/she has to useandto forge.However,such forgery is impossible in RCDS.First,h(SK) customizesθ.Then,χadds to the random redundancy of plain codes during the generation of SMTs (lines 14—20 in Algorithm 1),which makesθmore difficult to guess.Therefore,Dis unable to forgeat an acceptable cost.
4.Ucannot frameD.IfUintends to frameD,he/she must either delete his/her fingerprint fromor publish a fakeHowever,deleting fingerprints from a data object will only make the data object unusable.Meanwhile,the genuinehas been immutably recorded on the blockchain,and only this one can be used to unlock the data object in decoder ?.Therefore,Uis not able to frameDby breaking fingerprints or undermining the synchronization between fingerprints and SMTs.
5. U cannot frame other users. If U intends to frame another user W, he/she has to embed W’s fingerprint in the data copy with greater strength. Even if W’s fingerprint is embedded in the data copy with φ2, the fingerprint cannot be identified with φ5. This is because the generation parameters of SMTs are fully managed by the blockchain network in a tamper-resistant manner. Moreover, even if W’s fingerprint is mistakenly identified, W can still clear himself/herself of suspicion with the help of D by disproving that the suspected data can be restored to d by using the corresponding SMT? on chain.
RCDS provides data sharing with distributed access control,which increases the difficulty and cost of data piracy.The effect of this access control is analyzed as follows:
1.Dsendsand DDes toUthrough the blockchain.
2.Ugeneratesaccording to the receivedand its ownFU,and then generates the required queryqaccording to DDes and sends it toD.
3.Dsets ? according to the queryqand,and sends it toU.
4.Uuses its own,,and ?to query its application data.
5.Ucannot obtain data that have not been applied for.IfUwants to obtain unapproved data,it must pass ?,butDhas set the judgment condition according toqin ?,soUcannot obtain unapproved data.
6.Other users cannot query data.If other users want to obtain data according to ?,they must haveofU.However,is the private part SMT ofU,so other users cannot obtain the corresponding data.
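The gatekeeping role of the decoder in steps 3 to 6 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the decoder stores the digest of the approved query q and the on-chain digest of U's private SMT part, and grants access only when both match. The `Decoder` class, its fields, and `make_decoder` are hypothetical; the actual decoder encapsulation in RCDS may differ.

```python
import hashlib
import hmac
from dataclasses import dataclass


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Decoder:
    """Hypothetical decoder bound to one approved query and one user."""
    query_digest: str   # digest of the query q approved by D
    smt_digest: str     # digest of U's private SMT part, as recorded on chain

    def allows(self, query: bytes, private_smt: bytes) -> bool:
        # Both conditions must hold: the query is the approved one, and the
        # caller holds the private SMT part whose digest D bound in.
        return (hmac.compare_digest(self.query_digest, digest(query))
                and hmac.compare_digest(self.smt_digest, digest(private_smt)))


def make_decoder(approved_query: bytes, onchain_smt_digest: str) -> Decoder:
    # D never sees U's private SMT part, only its digest.
    return Decoder(digest(approved_query), onchain_smt_digest)


decoder = make_decoder(b"q1", digest(b"private SMT part of U"))
```

With this binding, a different query (item 5) or a different user's SMT (item 6) is rejected by the same check.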
In this subsection, we mainly analyze the impact of various attacks on the DRC capabilities of RCDS.
5.3.1 Denial-of-service attacks
The risk of denial-of-service (DoS) attacks might exist in the blockchain network. However, all nodes in the RCDS consortium must be approved, so they should have no motivation to launch DoS attacks. In addition, these nodes are usually protected by the consortium’s defense in depth, and their safety should be ensured by the systematic security mechanisms of all organizations in the consortium. As a data-sharing model, RCDS focuses mainly on DRC, and many existing DoS defense methods can be used as supporting protection measures for RCDS.
5.3.2 Spoofing attacks
In RCDS, the private keys of both parties are required to generate SMTs, and data must be partially obtained through authoritative decoders. D generates its own SMT and fingerprint based on its private key, and so does U. Beyond that, the query q that U wants to execute is also associated with U’s private key. Because these parameters are all immutably stored on the blockchain, the authenticity of each party is publicly verifiable, so most spoofing attacks can be avoided.
5.3.3 Sybil attacks
In RCDS, we adopt the CCP proposed in our earlier work (Wang L et al., 2020, 2021), which provides two ways to resist Sybil attacks. One is to limit the way in which nodes join the network by forcing them to register valid data assets, that is, “First Share and Then Request.” The other is to delete suspicious nodes from the address list using cascaded message passing. Even if an RCDS system were hit by a Sybil attack, the confidentiality of shared data and the privacy of honest nodes would not be compromised, because the information that reaches consensus on the blockchain, now and in the future, is always desensitized and separated from the raw data.
5.3.4 Eclipse attacks
An eclipse attack is unlikely to succeed in a consortium blockchain network. Even if one did, the victim nodes would simply be quarantined. The CCP that we adopt in RCDS provides a fault tolerance rate of nearly 1/2, which means that honest nodes can still reach consensus as long as they make up more than half of the nodes. Moreover, the node detection of CCP can find failed nodes within at most one consensus round and exclude them from the network, thus preventing eclipse attacks from continuing.
5.3.5 Replay attacks
CCP is essentially a fork-free consensus protocol. In each stage of CCP, the journey of a transaction begins and ends only at the two parties to the transaction, while other nodes are responsible only for verifying and forwarding it. In this process, each node grows its own blockchain, the consistency of which is temporally independent of transactions. Therefore, CCP networks never need to process multiple new blocks simultaneously, which means that a hard fork will not occur. Consequently, replay attacks that exploit hard forks will not work against RCDS.
5.3.6 User identity security
To prevent user identities from being inferred through transaction analysis, we use the shielded pool concept to hide the addresses of both parties. We put all the records of on-chain transactions in the shielded pool; that is, when a transaction is conducted, the addresses of both parties are encrypted right after the transaction enters the shielded pool. In this way, the on-chain pseudonyms and information, such as SMT? and fingerprints, do not reveal users’ real identities. In fact, the scenarios we envision for RCDS are generally consortiums, in which the connection between a user’s real identity and a pseudonym is handled by the off-chain administration of the consortium. The nodes in the blockchain network have therefore already been authorized and are known to the consortium, so an attacker gains nothing by analyzing identities.
From the above analysis,we find that the credibility of the RCDS model depends mainly on the acceptable performance of the fingerprint identification and the blockchain operations.Therefore,we conducted a group of simulations to evaluate the performance of RCDS mainly in correctness,robustness,and efficiency,and assess the efficiency of the blockchain network in terms of delay and throughput.
In this subsection,we introduce the configuration for the fingerprint identification and blockchain operation tests.
6.1.1 Configuration for fingerprinting simulations
We programmed a testbed (http://www.hbusoftsec.org.cn/files/rcds_fingerprint.zip) in a Java SE v1.8 environment to simulate different fingerprint identification processes and to test the fingerprint identification algorithms of RCDS on a single server (CPU: x64, 2.6 GHz, 6 cores, 12 logical processors; RAM: 16 GB; HDD: 1 TB; SSD: 256 GB).
To be compatible with common blockchain technologies, the testbed adopted SHA-256 as the hash digest algorithm and ECDSA as the asymmetric signature scheme. The parameters used in the testbed were preset as follows:
The parameter τ refers to the storage space limitation of SMT?; we first assume that it is 1 MB in the simulations. The parameter λ refers to the symbol length factor, whose purpose is to embed enough fingerprints in the hidden code that they cannot be easily erased. We first assume λ = 1024 in the simulations. The parameter γ refers to the number of fingerprints that can be embedded in each character. It should be no more than 1: if γ exceeded 1, one plain code would correspond to multiple fingerprints, which would make these fingerprints very easy to guess and erase. Allowing for this, we initially assume γ = 0.3 in the simulations.
The main performance metrics of the simulations are correctness, robustness, and efficiency. Correctness reflects the ability of RCDS to successfully identify data fingerprints. Robustness reflects the ability of RCDS to resist malicious data processing. Efficiency reflects the execution speed of RCDS.
The assessment terms we will use throughout the evaluation are defined as follows:
6.1.2 Configuration for blockchain simulations
We built a blockchain network (http://www.hbusoftsec.org.cn/files/rcds_bc.zip) using the Spring Boot framework (JDK version: 1.8) for the performance testing. We then ran it on a server (CPU: Intel Xeon Platinum 8269CY Cascade Lake, 2.5 GHz, 12 cores; bandwidth: 1 Gb/s; memory: 4 GB; ESSD: 40 GB) by instantiating six containers and using multi-threaded programming to simulate the communication between peer nodes.
We measured the ability of RCDS to successfully identify fingerprints from data objects.
1.Metrics.Accuracy reflects how correct RCDS is.Precision and Sensitivity metrics inversely correlate with the false alarm rate and the missed alarm rate,respectively.These three metrics are defined as
2.Settings.The fingerprinting effects of RCDS under different parameters were observed in the simulations.We simulated random datasets,in which the values ofεand?were variable in the test.We set the parameters for this test as follows:d,randomly generated;number of users,50;|d|,1 MB;ε,0.2—0.7;?,10—16.The test was calculated 10 times and the results were averaged.
3.Results.The results of the test are shown in Fig.9.
Fig.9 Correctness of RCDS:(a) ?=10;(b) ?=12;(c) ?=14;(d) ?=16
4.Discussion.As shown in Fig.9,the horizontal axis indicates the variation ofε,the vertical axis shows the values of the observed metrics,and the four subgraphs show the results for different values of?.
RCDS clearly achieves perfect sensitivity when identifying fingerprints in randomly generated datasets that are free of malicious processing. In terms of Accuracy and Precision, the results when ? was at most 12 were generally better than those at ? = 14, which means that constraining the threshold of backward verification within a measurable range can ensure that the system works correctly. We also found that when ε ≥ 0.4, the fingerprint identification rate reached 100%, which gives the baseline of ε in forward verification.
To sum up,the simulation results showed that RCDS can correctly identify users’fingerprints from encoded data which were free of malicious processing.
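For reference, Accuracy, Precision, and Sensitivity can be computed from identification counts as in the sketch below. It assumes the standard confusion-matrix definitions, with TP, TN, FP, and FN denoting true positives, true negatives, false positives (false alarms), and false negatives (missed alarms). The function name is illustrative.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int):
    """Return (accuracy, precision, sensitivity) under the standard definitions.

    accuracy    = (TP + TN) / (TP + TN + FP + FN)
    precision   = TP / (TP + FP)   # falls as false alarms increase
    sensitivity = TP / (TP + FN)   # falls as missed alarms increase
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, sensitivity


# Example with made-up counts: 10 false alarms and no missed alarms.
accuracy, precision, sensitivity = confusion_metrics(tp=40, tn=50, fp=10, fn=0)
```

With no missed alarms, sensitivity is 1 regardless of how many false alarms occur, which is why perfect sensitivity can coexist with lower Accuracy and Precision.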
We measured RCDS’s ability to correctly verify fingerprints from data objects when there was potentially malicious data processing.
1.Metrics.Same as those in Section 6.2.
2. Settings. The simulations evaluated the ability of RCDS to resist the potential presentation attacks described below.
(1) Deletion attack. An adversary may delete a few bytes from dF in an attempt to make the fingerprints undetectable. In the simulations, one third of the bytes of dF were deleted at random.
(2) Swap attack. An adversary may swap some pairs of bytes in dF to disarrange the order of the codes. In the simulations, we swapped every two adjacent bytes.
(3) Padding attack. An adversary may add random bytes to dF, trying to reduce the fingerprint recall rate. In the simulations, we appended noisy bytes to dF to extend its size to 2|dF|.
(4) Negation attack. An adversary may negate some bytes in dF to obfuscate the codes. In the simulations, we negated half of the bytes.
(5) Reversion attack. An adversary may reverse the order in which dF is stored, seeking to desensitize the program to correct fingerprints. In the simulations, we completely reversed dF.
We chose the above attack models because each of them represents a class of content processing.φ5of RCDS has a strong adaptability because it does not depend on the type of data content.For each attack model above,we observed the influence of〈ε,?〉(ε ∈{0.4,0.5},? ∈{12,14}) on the fingerprinting effect.The test was calculated 10 times,and the results were averaged.
3.Results.The results of the test are shown in Fig.10.
Fig.10 Robustness of RCDS:(a) ε=0.4, ?=12;(b) ε=0.5, ?=12;(c) ε=0.4, ?=14;(d) ε=0.5, ?=14
Fig.11 Symbol mapping generation runtime (SMRT) of RCDS:(a) |d|=1 MB;(b) |d|=8 MB;(c) |d|=16 MB;(d) |d|=32 MB
4.Discussion.As shown in Fig.10,the horizontal axis lists the five attack types mentioned in the settings,the vertical axis shows the values of observed metrics,and the four subgraphs show the results based on different values ofεand?.
RCDS clearly worked best against these attacks when ε = 0.5 and ? = 14, which indicates the parameter threshold for operating the system in a malicious environment. Increasing ? increased the sensitivity of RCDS under these five types of attacks. In addition, the results under the negation attack were not as good when ? < 14. The reason is that the negation attack essentially manipulates the plain code of dF, while the forward verification phase of Algorithm 3 must be relatively rigorous in interpreting the plain code. However, raising ? clearly increased the recognition rate against this attack. Note that the effect of fingerprinting is generally not influenced by the volumes and types of data objects, because RCDS targets the byte sequences of the data objects. All the above results support the acceptable reliability of RCDS.
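The five attack models listed in the settings can be sketched as byte-level transformations in Python. This is a minimal sketch under stated assumptions, not the testbed code: "negate" is taken to mean the bitwise complement of a byte, "every two adjacent bytes" is taken to mean non-overlapping pairs, and the function names are illustrative.

```python
import random


def deletion_attack(data: bytes, rng: random.Random) -> bytes:
    """Delete one third of the bytes at random positions."""
    drop = set(rng.sample(range(len(data)), len(data) // 3))
    return bytes(b for i, b in enumerate(data) if i not in drop)


def swap_attack(data: bytes) -> bytes:
    """Swap bytes pairwise: (0,1), (2,3), ...; a trailing odd byte stays put."""
    out = bytearray(data)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return bytes(out)


def padding_attack(data: bytes, rng: random.Random) -> bytes:
    """Append random noise so that the result is twice the original size."""
    noise = bytes(rng.randrange(256) for _ in range(len(data)))
    return data + noise


def negation_attack(data: bytes, rng: random.Random) -> bytes:
    """Bitwise-complement half of the bytes, chosen at random."""
    out = bytearray(data)
    for i in rng.sample(range(len(out)), len(out) // 2):
        out[i] ^= 0xFF
    return bytes(out)


def reversion_attack(data: bytes) -> bytes:
    """Reverse the byte order completely."""
    return data[::-1]
```

Each function operates on raw bytes only, which matches the observation that the attacks, like the fingerprinting itself, do not depend on the type of data content.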
We measured the efficiency of RCDS in running SMT generation and fingerprint identification.
1.Metrics.The major performance costs of RCDS come from the activities of SMT generation and fingerprint identification.The symbol mapping generation runtime(SMRT)and the fingerprint identification runtime (FIRT) were logged when observing these two types of activities.
2. Settings. The inputs most closely related to the efficiency of RCDS are θ, η, and |d|. The impact of changes in these inputs on SMRT and FIRT was observed in the simulations through several tests. According to inequalities (1) and (3), we calculated the ranges of θ and η based on different assignments of |d|. Table 4 lists their values for this simulation. Each test was run 10 times, and the results were averaged.
Table 4 Settings for the efficiency simulation
3.Results.Figs.11 and 12 show SMRT and FIRT,respectively.
4. Discussion. In Figs. 11 and 12, the horizontal axis indicates the variation of the hidden code length η, and the vertical axis shows the values of the observed metrics. The four subgraphs show the SMRT and FIRT for different amounts of data, and the series in each subgraph represent different values of θ.
As seen in the results,we can adjust the value ofθto fit any size of|d|,while the efficiency of RCDS will not be affected.In other words,the size of|d|has almost no influence on the runtime,which indicates that RCDS is scalable in terms of data volume.At the same time,SMRT and FIRT decreased and tended to be stable with the increase ofθwhen the data volume was the same.The reason is that the possibility of symbol repeating indecreases with the increase ofθ.
At the same time,whenθwas constant,the results showed that a smaller value ofηusually led to less runtime,as smaller hidden code lengths always require less searching effort.
With regard to performance expansion,the values ofθandηaffected RCDS’s storage and communication performance.From|d|=θ|S|and,we know that the larger theθ,the smaller the,and the less storage and communication pressure there will be.
In addition,in Fig.12,we can notice that some values were abnormally large.This was caused by the backward verification of the fingerprint (Algorithm 4),and was consistent with the theoretical basis of RCDS’s full fingerprint coverage strategy(i.e.,?|F|=η|S|).
Fig.12 Fingerprint identification runtime (FIRT) of RCDS:(a) |d|=1 MB;(b) |d|=8 MB;(c) |d|=16 MB;(d) |d|=32 MB
We tested the average delay and throughput of the blockchain network when the RCDS model was working.
1.Metrics.The average delay indicates mainly how fast a single transaction is confirmed,and the throughput reflects the number of transactions completed per unit of time.
2. Settings. We set the number of nodes to each value in {3, 4, 5} and varied the number of transactions from 5000 to 25 000.
3.Results.The results are shown in Fig.13.
Fig.13 Comparison of average latency (a) and throughput (b) (tps:transactions per second)
4.Discussion.The duration of transaction generation and the time consumed by consensus are the main factors affecting blockchain efficiency.From Fig.13,we learn that the throughput changed little with increased numbers of nodes and transactions,and that the average delay was within an acceptable range.This means that RCDS does not cost much to run if an appropriate blockchain network is deployed,and that DRC is quite feasible on such a model.
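The two metrics used in this test can be computed from per-transaction timestamps as in the following sketch. It assumes that the delay of a transaction is the time from submission to confirmation and that throughput is the number of confirmed transactions divided by the total elapsed time. These definitions and the function names are illustrative assumptions, not the measurement code used in the simulations.

```python
def average_delay(submitted, confirmed):
    """Mean time in seconds between submission and confirmation."""
    if len(submitted) != len(confirmed) or not submitted:
        raise ValueError("need one confirmation time per submitted transaction")
    return sum(c - s for s, c in zip(submitted, confirmed)) / len(submitted)


def throughput(submitted, confirmed):
    """Confirmed transactions per second (tps) over the whole run."""
    elapsed = max(confirmed) - min(submitted)
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(confirmed) / elapsed


# Made-up timestamps in seconds for three transactions.
submitted = [0.0, 1.0, 2.0]
confirmed = [0.5, 1.5, 4.0]
print(average_delay(submitted, confirmed))  # 1.0
print(throughput(submitted, confirmed))     # 0.75
```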
In this work, we propose RCDS, a right-confirmable data-sharing model. By using SMC, RCDS encodes raw data without distortion and is thus suitable for DRC regardless of the types and volumes of shared data. By employing blockchain, RCDS imposes credible supervision on DRC through whole-network consensus. Furthermore, RCDS combines SMC and blockchain into a systematic mechanism with which data access can be fully under control throughout the sharing process. These features of RCDS make it possible to launch trusted traitor tracing and access control, better supporting forensics on acts of transaction repudiation and data piracy.
Our work has some limitations that call for further research. First, RCDS does not provide a fingerprint extraction function, which might further improve the credibility of fingerprint identification. Second, it is necessary to design a unified and effective access control strategy for different data content types, which can make the decoder encapsulation more secure. Third, a more complete user identity privacy protection scheme is required to ensure the security of user identities. Finally, the encapsulation strategy in this model is implemented by smart contracts, whose security always requires close scrutiny. We hope that dealing with the above issues will lead to the emergence of more effective DRC data-sharing models.
Contributors
Liang WANG designed the research.Shunjiu HUANG conducted the simulations and drafted the paper.Lina ZUO processed the data and helped organize the paper.Jun LI performed the formal analysis.Wenyuan LIU supervised the research.Liang WANG revised and finalized the paper.
Compliance with ethics guidelines
Liang WANG,Shunjiu HUANG,Lina ZUO,Jun LI,and Wenyuan LIU declare that they have no conflict of interest.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Frontiers of Information Technology & Electronic Engineering, 2023, Issue 8