Luyan Zhang,Huihui Li,Lei Meng,Jiankang Wang*
National Key Facility for Crop Gene Resources and Genetic Improvement,Institute of Crop Sciences,Chinese Academy of Agricultural Sciences,Beijing 100081,China
ABSTRACT
Linkage analysis [1–5] and the subsequent construction of genetic linkage maps [6–8] are essential for many genetic studies, including genetic mapping, gene fine mapping, and map-based cloning [9–14]. The first genetic map was constructed by Sturtevant [15] and consisted of six sex-linked factors in Drosophila. Today, linkage maps with tens to hundreds of markers, or even more, are common (e.g., [16] in bread wheat; [17] in rice). An accurate linkage map is also fundamental for marker-assisted or gene-based selection in animal and plant breeding [18–20].
Two general steps are involved in map construction with genetic markers. First, markers are assigned to individual linkage groups. When markers are distributed over all chromosomes of the species, the number of groups should be equal to the number of pairs of homologous chromosomes. Physical map information may be used to assign genetic markers to specified chromosomes, greatly facilitating the marker-grouping procedure. Second, the order of markers in each linkage group is estimated by an ordering algorithm. Advances in biotechnology have led to the availability of large numbers of markers, allowing the construction of high-density genetic maps. However, map construction using large numbers of markers also requires efficient algorithms. Approximate algorithms include seriation [6], maximum-likelihood multi-locus linkage analysis [21], the evolutionary strategy algorithm [22], uni-directional growth [23], the minimum spanning tree of a graph, as implemented in MSTMap [24], and local changes based on a greedy initial route, as implemented in Lep-MAP [25].
Given n cities and the distances between them, a salesman is required to visit each city once and only once, starting from any city and returning to the original place of departure. Which route should he choose in order to minimize the total distance traveled? This problem is referred to as the traveling salesman problem (TSP), one of the most challenging and widely studied optimization problems in mathematics [26,27]. TSP is a classical non-deterministic polynomial (NP)-hard problem in combinatorial mathematics [27,28]. Theoretically, the best route of a TSP can be found by comparison of all possible solutions. However, the computing time of this exact method increases either exponentially with n or according to very high-order monomial functions of n [27]. Heuristic (or approximate) procedures have been developed to solve TSP with large numbers (several hundreds) of cities and can produce answers close to the optimal solution. The best among these procedures is the k-Optimal (abbreviated as k-Opt) algorithm [26–29].
The construction of a genetic linkage map can be treated as a TSP when markers are treated as cities and estimated recombination frequencies between markers are treated as distances [22,23]. But the two problems differ in several respects. First, the distance between any pair of marker loci is estimated by linkage analysis in limited-size genetic populations and is thus subject to large sampling error [2,4]. Second, the solution of a TSP is a closed route, whereas the solution of a linkage map is an open-ended route. The best solution for a TSP may not represent the best order of markers on a linkage map. Third, whereas the solution of a TSP can be viewed as a two-dimensional graph, marker order on a genetic map is linear and one-dimensional.
Climer and Zhang [30] pointed out that optimization for an open route (such as in map construction) can be converted into optimization for a closed route (such as for TSP) by the addition of a virtual city whose distance to every other city is equal to a constant C. Monroe et al. [31] presented a tool called TSPmap that implemented both approximate and exact TSP solvers to generate linkage maps. Some algorithms have been proposed [24,25] to handle missing markers and to detect and remove genotyping error. These algorithms substantially reduce the negative effects of missing markers and error on linkage map construction. We modified the TSP k-Opt algorithm for map construction, using open route length to identify the shortest maps, and proposed a procedure for removing genotyping error. The algorithm has been implemented in three software packages: QTL IciMapping for 20 bi-parental populations [32], GACD for clonal F1 and double-cross F1 populations [8], and GAPL for multi-parental populations consisting of pure lines [33].
However,the efficiency of using TSP-solving algorithms in genetic map construction(MAP)problems has not been fully investigated.Our objectives in this study were(1)to evaluate the efficiency of the k-Opt algorithm(where k=2 or 3)in linkage map construction;(2)to compare the 2-Opt algorithm with methods in other software packages,taking missing marker and genotyping error into consideration;and(3)to develop a unified graphical user interface for constructing high-density linkage maps in a wide range of genetic mapping populations.
We have previously developed three integrated software packages for linkage map construction and quantitative trait locus (QTL) mapping with various genetic populations (Table 1). The first, QTL IciMapping, is designed for genetic analysis mainly of bi-parental populations and features ten functions ([32]; Table 1). The second, GACD, has four functions designed for double-cross F1 and clonal F1 populations ([8]; Table 1). The third, GAPL, also has four functions but is designed for pure-line populations derived from four to eight homozygous parents ([33]; Table 1). In these packages, core modules for recombination frequency estimation, marker grouping, marker ordering, QTL mapping, etc. were written in Fortran 90/95, and the user interface was written in C#. The software runs on Windows XP/Vista/7/8/10, with Microsoft .NET Framework 2.0 (x86)/3.0/3.5/4.0. Input data can be formatted in plain text, MS Excel 2003, and MS Excel 2007 (Microsoft Corporation). The three packages mentioned above are freely available from http://www.isbreeding.net.
Estimation of pairwise recombination frequency (REC) in bi-parental populations is described in [4]. When four fixed lines are used as parents, the linkage phase in parents is known, but for the F1 between two clonal parents, phase must be deduced from linkage analysis of progeny. Estimation of REC in the two kinds of F1 populations and haploid building in clonal F1 is described in [5]. Once linkage phase is determined, a clonal F1 becomes identical to a double-cross F1 of four fixed lines. Estimation of recombination frequency in DH and RIL populations derived from four to eight parents is described in [33]. During the estimation of REC, the logarithm of odds (LOD) score and the mapping distance (DIS) in centiMorgans (cM) between any two markers are estimated simultaneously.
REC,LOD,and DIS provide the required information for map construction.Once REC,LOD,and DIS for each pair of markers have been estimated,population-specific information(population type,size,and marker type)is not needed for the next step of map construction.Accordingly,the MAP functionality in QTL IciMapping,CDM functionality in GACD,and PLM functionality in GAPL(Table 1)share the same interface for parameter setting in grouping,ordering,and rippling and the same interface for user control.The general procedures of map construction and user control have been covered in[32]and are not detailed here.
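The grouping step itself is not detailed in this paper. As a minimal sketch, assuming that two markers are placed in the same linkage group whenever their pairwise LOD reaches a user-chosen threshold (single-linkage clustering), grouping could look as follows. The function name `group_markers` and the default threshold of 3.0 are illustrative assumptions, not the interface of QTL IciMapping, GACD, or GAPL.

```python
def group_markers(lod, threshold=3.0):
    """Single-linkage grouping (assumed rule): i and j are joined if lod[i][j] >= threshold."""
    n = len(lod)
    parent = list(range(n))  # union-find forest over marker indices

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if lod[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Only the pairwise LOD matrix is needed here, which is in line with the statement above that population-specific information is no longer required once REC, LOD, and DIS have been estimated.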
Table 1–Genetic populations and major functionalities in three integrated software packages for linkage map construction and QTL mapping.
REC is the principal measure of distance between two chromosomal loci and can be estimated in various genetic mapping populations. In genetics, REC is defined as the population proportion of crossovers between two genetic loci during one meiosis [3]. In theory, REC should be in the range of [0, 0.5) for two linked markers (REC = 0 is called complete linkage), and should be equal to 0.5 for two unlinked markers. Owing to random sampling error, the estimated value may be greater or less than 0.5 for unlinked markers, or close to 0.5 for linked markers. The LOD score is a statistic, based on a likelihood ratio test, that is normally used to test the linkage relationship between two markers [4,5]. In testing linkage relationship against independent inheritance, the sampling distribution of 2 ln(10) × LOD (the likelihood ratio statistic) approaches the χ² distribution with one degree of freedom when the population size is large. Closer linkage results in a higher LOD value for a given mapping population. Thus, LOD may also be considered a measure of distance between two markers. Because a higher LOD indicates closer linkage, the negative LOD value was used as the distance when minimizing the route length in marker ordering.
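To make the relationship between REC, LOD, and the likelihood ratio test concrete, the following is a minimal sketch for the simplest case only: two markers in a DH population with no missing data, where each individual is either recombinant or non-recombinant. The function names are hypothetical, and the estimators actually used in the three packages for other population types are those described in [4,5,33].

```python
import math

def rec_and_lod_dh(n_recombinant, n_total):
    """Estimate REC and LOD for two markers in a DH population (no missing data)."""
    r = n_recombinant / n_total
    # log10 likelihood at the estimated r; terms with a zero count contribute 0.
    log_l1 = 0.0
    if n_recombinant > 0:
        log_l1 += n_recombinant * math.log10(r)
    if n_recombinant < n_total:
        log_l1 += (n_total - n_recombinant) * math.log10(1.0 - r)
    # log10 likelihood under independent inheritance (r = 0.5).
    log_l0 = n_total * math.log10(0.5)
    return r, log_l1 - log_l0

def lod_p_value(lod):
    """Upper-tail probability of 2*ln(10)*LOD under chi-square with 1 df."""
    lrt = 2.0 * math.log(10.0) * lod
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lrt / 2.0))
```

For example, `rec_and_lod_dh(10, 100)` returns an REC of 0.1 together with a LOD of about 16, and a pair with exactly half recombinants gives a LOD of zero.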
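Assume that the order of three markers on one chromosome is 1-2-3 and that r12, r23, and r13 are the pairwise REC values. When there is no crossover interference between marker intervals 1–2 and 2–3, their relationship is given by Eq. (1):
$$r_{13} = r_{12} + r_{23} - 2r_{12}r_{23}, \quad \text{or equivalently} \quad 1 - 2r_{13} = (1 - 2r_{12})(1 - 2r_{23}) \tag{1}$$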
Clearly, REC is non-additive. From the equivalent expression of Eq. (1), it can be seen that Haldane's mapping function [1] can be used to convert the non-interference REC to the additive mapping distance (DIS), i.e., m = -50 ln(1 - 2r), where r is the REC between two markers and m is DIS in cM. After the Haldane transformation, we have m13 = m12 + m23 for the three markers in Eq. (1), and a REC value of 0.01 is about 1 cM of mapping distance. Intuitively, DIS too can be considered a measure of distance between two markers.
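The additivity of DIS can be checked numerically. The sketch below implements the Haldane function m = -50 ln(1 - 2r) quoted above and its inverse, and assumes that Eq. (1) takes the standard no-interference form r13 = r12 + r23 - 2 r12 r23. The function names are chosen here for illustration.

```python
import math

def haldane_cm(r):
    """Haldane mapping function: REC (0 <= r < 0.5) to distance in cM."""
    return -50.0 * math.log(1.0 - 2.0 * r)

def haldane_rec(m):
    """Inverse Haldane function: distance in cM to REC."""
    return 0.5 * (1.0 - math.exp(-m / 50.0))

def combine_rec(r12, r23):
    """REC between the outer markers under no interference (assumed form of Eq. (1))."""
    return r12 + r23 - 2.0 * r12 * r23

# REC is non-additive, but the Haldane distances add up.
r13 = combine_rec(0.10, 0.20)
print(r13, haldane_cm(r13), haldane_cm(0.10) + haldane_cm(0.20))
```

The last two printed values agree up to floating-point rounding, whereas r13 is smaller than the plain sum 0.10 + 0.20.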
A detailed description of the k-Opt algorithm in TSP can be found in [26–29]. Only a brief description is given below for convenience. k-Opt begins with a predefined closed route, called an initial route. 2-Opt breaks the initial route in any two intervals, resulting in two segments. A new route is formed by exchanging the start and end points of the two segments. If the new route is shorter than the initial one, it will be used as a new initial route for further improvement. 3-Opt breaks the initial route in any three intervals, resulting in three segments. Several new routes can be formed by exchanging start and end points of the three segments. The shortest route is selected and compared with the initial route.
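The 2-Opt step described above can be sketched as follows. This is a deliberately naive illustration, not the Fortran implementation in the software packages: breaking a route in two intervals and exchanging the segment end points is expressed as reversing the sub-route between the two break points, and the full route length is recomputed for every candidate. The `length` argument is an assumption introduced here so that another length criterion can be plugged in.

```python
def closed_length(route, dist):
    """Length of the closed route, including the return from the last city to the first."""
    n = len(route)
    return sum(dist[route[i]][route[(i + 1) % n]] for i in range(n))

def two_opt(route, dist, length=closed_length):
    """Apply 2-Opt moves (segment reversals) until no move shortens the route."""
    best = list(route)
    best_len = length(best, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                # Reverse the segment between positions i and j (one 2-Opt move).
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = length(candidate, dist)
                if cand_len < best_len - 1e-12:
                    # The shorter route becomes the new initial route.
                    best, best_len = candidate, cand_len
                    improved = True
    return best, best_len
```

For four cities on the corners of a unit square, the self-crossing route 0-2-1-3 is improved to a route along the perimeter of length 4.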
A better route in MAP is determined by the open route length.An open route is formed by breaking the TSP closed route in the longest interval.During route improvement of k-Opt in MAP,the optimal algorithm using open route length to update the initial route was called 2-OptMAP when k=2,and 3-OptMAP when k=3(Table S1).The algorithm in which one virtual marker is added and the closed route length is used to identify better routes was called 2-OptTSP when k=2 and 3-OptTSP when k=3(Table S1).Distances between the virtual marker and all others were set to twice the maximum distance between any two markers.An open route was formed by removing the virtual marker from a TSP closed route.The distance between two markers can be represented by REC,LOD,or DIS.In Table S1,different names are given for different measures of marker distance.
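The two ways of obtaining an open route can be sketched as below, under the definitions given in this paragraph: breaking a closed route in its longest interval, or adding one virtual marker at twice the maximum pairwise distance and removing it afterwards. The function names are hypothetical, and the sketch assumes non-negative distances such as REC or DIS.

```python
def open_length(order, dist):
    """Length of an open route: sum of distances between adjacent markers."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def open_from_closed(route, dist):
    """Break a closed route in its longest interval to obtain an open route."""
    n = len(route)
    k = max(range(n), key=lambda i: dist[route[i]][route[(i + 1) % n]])
    return route[k + 1:] + route[:k + 1]

def add_virtual_marker(dist):
    """Append a virtual marker at distance 2 * (maximum pairwise distance) from all markers."""
    n = len(dist)
    c = 2.0 * max(max(row) for row in dist)
    extended = [list(row) + [c] for row in dist]
    extended.append([c] * n + [0.0])
    return extended

def open_from_virtual(route, virtual):
    """Remove the virtual marker from a closed route, leaving an open route."""
    k = route.index(virtual)
    return route[k + 1:] + route[:k]
```

With the virtual marker in place, every closed route is longer than the corresponding open route by the same constant (twice the virtual distance), so ranking closed routes on the extended matrix ranks the underlying open routes in the same way.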
To reduce time in route improvement,the initial route was normally constructed by the nearest neighbor(NN)algorithm,starting from each marker in MAP,and was called the NN route.Starting from one NN route,different names are given in Table S2 for the improved routes using closed or open route length in the k-Opt algorithm.Different NN routes might end with different optimal routes,from which the best optimal route was chosen(Table S2).
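A nearest neighbor initial route, started from each marker in turn, can be sketched as follows. The tie-breaking rule (lowest marker index) and the helper `best_nn_route`, which simply returns the shortest open NN route, are assumptions made for illustration. In the procedure described above, each NN route would additionally be improved by k-Opt before the best route is chosen.

```python
def nearest_neighbor_route(start, dist):
    """Greedy route: from the current marker, always move to the nearest unvisited marker."""
    route = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        last = route[-1]
        # Ties are broken by the lower marker index (an arbitrary choice).
        nxt = min(unvisited, key=lambda j: (dist[last][j], j))
        route.append(nxt)
        unvisited.remove(nxt)
    return route

def best_nn_route(dist):
    """Build one NN route per starting marker and return the shortest open route."""
    def open_length(order):
        return sum(dist[a][b] for a, b in zip(order, order[1:]))
    routes = [nearest_neighbor_route(s, dist) for s in range(len(dist))]
    return min(routes, key=open_length)
```

Starting from a marker in the middle of the chromosome usually yields a route that has to jump back across the map, which is why different start points can give different NN routes.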
Three bi-parental populations were simulated (Table S3). The first consisted of 1000 doubled haploids derived from the F1 hybrid between two homozygous parents, and was called the DH population; the second consisted of 1000 F2 individuals derived from selfing the F1 hybrid, and was called the F2 population. In both of these populations, 3000 markers were evenly distributed over one chromosome. In the simulation, DIS between two adjacent markers was set at 0.1 cM and the predefined map length was 300 cM.
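The populations in this study were generated with the simulation function in QTL IciMapping. Purely to illustrate the idea, a DH population with evenly spaced markers could be simulated as in the sketch below, which assumes no crossover interference and converts the marker spacing to REC with the inverse Haldane function. The function name, the 0/1 genotype coding, and the `seed` parameter are illustrative assumptions.

```python
import math
import random

def simulate_dh(n_individuals, n_markers, spacing_cm, seed=0):
    """Simulate DH genotypes (0/1 = the two parental types) on one chromosome."""
    rng = random.Random(seed)
    # REC between adjacent markers from the inverse Haldane function.
    r = 0.5 * (1.0 - math.exp(-spacing_cm / 50.0))
    population = []
    for _ in range(n_individuals):
        allele = rng.randint(0, 1)  # parental type at the first marker
        genotype = [allele]
        for _ in range(n_markers - 1):
            if rng.random() < r:  # a crossover occurs in this interval
                allele = 1 - allele
            genotype.append(allele)
        population.append(genotype)
    return population
```

Calling `simulate_dh(1000, 3000, 0.1)` would mimic the dimensions of the DH population described above, although the resulting data would of course differ from the datasets used in the study.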
Owing to the limited size of the mapping population, the estimated REC between two markers could be zero, even if the two markers were not completely linked by definition. In the simulated DH and F2 populations, the estimated REC was zero for 1092 and 433 pairs of adjacent markers, respectively (Fig. 1). When REC was estimated at zero between two markers, one of them was considered redundant in the population. Map construction problems were named MAP1906 in the DH population and MAP2567 in the F2 population, after redundant markers were removed (Table S3). These two problems, together with their subsets, were used to estimate and compare the proportion of correct order and the time spent in marker ordering. To further evaluate the effects of missing markers and genotyping error in map construction, two missing-value rates, 5% and 15%, and two genotyping error rates, 1% and 2%, were randomly assigned to marker genotype data in MAP1906 (Table S3).
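The number of zero-REC intervals reported for the DH population can be roughly anticipated with a back-of-the-envelope calculation. The sketch below assumes that individuals are independent and that an adjacent pair has an estimated REC of zero exactly when none of the individuals is recombinant in that interval. It applies to the DH case only, and the function name is hypothetical.

```python
import math

def expected_zero_rec_intervals(n_individuals, n_markers, spacing_cm):
    """Expected number of adjacent marker pairs with no recombinant in a DH population."""
    r = 0.5 * (1.0 - math.exp(-spacing_cm / 50.0))  # inverse Haldane function
    p_zero = (1.0 - r) ** n_individuals             # no recombinant among all individuals
    return (n_markers - 1) * p_zero

print(expected_zero_rec_intervals(1000, 3000, 0.1))
```

For 1000 individuals and 3000 markers spaced at 0.1 cM this gives roughly 1100 intervals, which is of the same magnitude as the number observed in the simulated DH population.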
In the third population,5000 markers were considered to be evenly distributed on one chromosome of 500 cM in length.This population consisted of 500 doubled haploids derived from an F1hybrid(Table S3).This map construction problem was named MAP5000,which was used mainly to investigate the time spent by 2-Opt to order ultrahigh-density markers.The three bi-parental populations were generated by the simulation function in QTL IciMapping.The datasets are available from http://www.isbreeding.net/TSP4MAP/.
Genotyping error was corrected only for double crossovers observed in single individuals in the mapping population. Three-locus genotype frequency was used to determine the probability of error; see, for example, [9] for a bi-parental DH population, [12] for a double-cross F1 population, and [13] for a pure-line population from four homozygous parents. If the marker type at marker locus q was different from the types at its two flanking markers, a random number between 0 and 1 was generated and compared with the conditional probability of the marker type at locus q given the marker types of the flanking markers. When the random number was larger, the marker type at locus q was treated as an error, and was then assigned as missing for the estimation of recombination frequency.
Fig.1–Distribution of number of marker intervals by estimated recombination frequency in simulated DH and F2 populations.
For example, in a bi-parental DH population, supposing the genotypes at the two flanking markers of locus q were AA and BB, the probabilities of QQ and qq at locus q were (1-r1)(1-r2)/(1-r) and 1-(1-r1)(1-r2)/(1-r), respectively, where r1, r2, and r were the recombination frequencies between the left marker and locus q, between locus q and the right marker, and between the two flanking markers. If the marker type at locus q was QQ, no error was assumed. If the marker type was qq, a random number between 0 and 1 was generated and compared with 1-(1-r1)(1-r2)/(1-r). If the random number was larger, qq was assumed to be an error and was replaced by a missing value to remove the putatively erroneous double crossover.
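The correction rule for the DH case can be sketched as follows. The sketch assumes that the conditional probability of the double-crossover type is 1-(1-r1)(1-r2)/(1-r) and that, in the absence of interference, r = r1 + r2 - 2 r1 r2. Function names, the genotype coding, and the use of `None` for a missing value are illustrative assumptions.

```python
import random

def p_double_crossover_dh(r1, r2):
    """P(middle locus differs | both flanking markers are of the same parental type), DH population."""
    r = r1 + r2 - 2.0 * r1 * r2  # REC between the flanking markers, no interference assumed
    return 1.0 - (1.0 - r1) * (1.0 - r2) / (1.0 - r)

def correct_locus(left, mid, right, r1, r2, rng=random):
    """Return the middle genotype, or None (missing) if it is judged to be an error."""
    if left == right and mid != left:  # apparent double crossover in one individual
        # Keep the genotype with probability equal to the double-crossover probability.
        if rng.random() > p_double_crossover_dh(r1, r2):
            return None
    return mid
```

Because the conditional probability is very small for closely linked markers, almost every apparent double crossover is replaced by a missing value under this rule, while a small fraction is retained as genuine.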
In simulation, the true marker order is predefined and every marker is nonredundant. For high-density markers located on a chromosome of fixed size, the estimated recombination frequency between two closely linked markers can be zero, and one of them becomes redundant in the population. Redundant markers in simulated mapping populations can be removed before map construction. In MAP1906 and MAP2567, redundant markers have been removed, so that the estimated recombination frequency is nonzero for every pair of markers. When error correction is not applied, the proportion of correct order and the computing time can be used to compare the accuracy and efficiency of various ordering methods.
When k-Opt was compared with three other ordering methods,namely MSTMap[24],Lep-MAP[25]and TSPmap[31],missing marker and genotyping error were added in simulated populations,and error correction was applied in each ordering method.Although there were no redundant markers in MAP1906,after error correction some markers might have a value of zero for estimated recombination frequency and thereby become redundant.Redundant markers in the mapping populations were assigned to bins,and a bin map could be constructed.The marker map was the same as the bin map when each bin contained only one marker.If the bin number was lower than the marker number,a wrong correction was indicated,resulting in a binning error.The number of binning errors was defined as the number of non-redundant markers minus the bin number.Any inconsistent order in the bin map was counted and was called bin map error.For example,if the correct order was 1-2-3-4-5,two errors were counted for order 1-3-2-4-5,or 1-2-3-5-4,or 1-4-3-2-5.The total error,expressed as the sum of binning error and bin map error,was used to compare k-Opt with MSTMap,Lep-MAP,and TSPmap.The parameters in MSTMap and Lep-MAP were set to their defaults.In TSPmap,the“optimize”function was used to estimate the probability of genotyping error and the“tspOrder”function was used to order the markers.
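The error counts defined above can be sketched as follows. The paper does not spell out the counting rule for bin map error, so the sketch assumes one reading that reproduces the three examples given: the number of positions at which the constructed order differs from the correct order. Taking the minimum over both orientations is a further assumption, made because a linkage map read from either end is the same map.

```python
def binning_errors(n_nonredundant_markers, n_bins):
    """Number of non-redundant markers minus the number of bins after error correction."""
    return n_nonredundant_markers - n_bins

def bin_map_errors(order, true_order):
    """Positions at which the order differs from the true order, in the better orientation."""
    forward = sum(a != b for a, b in zip(order, true_order))
    backward = sum(a != b for a, b in zip(reversed(order), true_order))
    return min(forward, backward)

# The three example orders from the text each give two errors.
for order in ([1, 3, 2, 4, 5], [1, 2, 3, 5, 4], [1, 4, 3, 2, 5]):
    print(order, bin_map_errors(order, [1, 2, 3, 4, 5]))
```

The total error used in the comparison is then the sum of the two counts.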
One major purpose of linkage map construction is to locate genes affecting phenotypic traits of interest, and then to use the identified marker–gene associations in marker-assisted selection or gene cloning. Thus, for MAP, correct marker order is probably more important than the length of the constructed map. The proportions of correct orders estimated for subsets of MAP1906 are presented in Table 2. A set of 50 markers randomly chosen from the 1906 was highly unevenly distributed on the predefined map of 300 cM. The proportion of correct orders ranged from 0.928 to 0.965 for the 12 ordering methods. When the number of random markers was 70, the proportions of correct order ranged from 0.995 to 0.997, higher than the values for marker subsets of 50. When the number of random markers was 100, the proportion of correct order was 100%. When 400 markers were randomly chosen, the proportion of correct order was also close to 100% for all ordering methods, indicating that correct order could be achieved from any NN initial route.
From Table 2,it can be easily seen that the proportions of correct order from k-OptMAP were equal to or slightly higher than those from k-OptTSP for most cases,indicating that open route length is a slightly better criterion than closed route length for identifying the shortest route in map construction.But the difference between k-OptMAP and k-OptTSP was minor.3-Opt and 2-Opt gave similar proportions of correct order,whether REC,LOD,or DIS was used as a distance.REC gave the highest proportions of correct order,followed by LOD and DIS.For marker numbers from 100 to 300,REC,LOD and DIS always gave the correct order.Thus,when more and more evenly distributed markers were included on a fixed-length linkage map,there was a higher chance for k-Opt to achieve correct order from one NN route,regardless of the measure of marker distance.But for marker numbers higher than 300,the proportions of correct order were slightly decreased.For a fixed-length chromosome,more markers result in higher marker density.When marker density is high,recombination frequencies between neighboring markers are smaller,making it harder to identify the correct marker order.
Table 2–Proportion of correct order estimated from 1000 random samples of markers in MAP1906.
Proportions of correct order for nine sample sizes of random markers in MAP2567 are shown in Table 3. As with MAP1906 in Table 2, k-OptMAP gave slightly higher proportions than k-OptTSP and the difference between 2-Opt and 3-Opt was minor. Again, REC gave the highest proportion of correct order, followed by LOD and DIS. For marker numbers greater than 80, LOD, REC, and DIS always gave the correct order when k-OptMAP was applied and almost always when k-OptTSP was applied. For larger marker numbers, the proportions of correct order decreased slightly. For fewer markers, MAP2567 resulted in a higher proportion of correct order than did MAP1906 (see marker numbers 50 to 80 in Tables 2 and 3). This finding may be explained by the higher accuracy of recombination frequency estimation in an F2 population than in a DH population [4]. It can be concluded that population type (DH or F2 in this study) had little effect on the comparison of ordering methods. Accordingly, only the results from MAP1906 are described below.
Table S4 presents the proportion of correct order for two levels of missing marker in MAP1906,and Table S5 for two levels of genotyping error.As with MAP1906 with no missing markers or error,differences among 2-OptMAP,3-OptMAP,2-OptTSP,and 3-OptTSP were minor.When the number of random markers was below 100,REC still gave the highest proportion of correct orders,followed by LOD and DIS.However,this trend was not seen for larger marker numbers.
Clearly, the proportion of correct orders shown in Tables S4 and S5 was lower than that presented in Table 2 for any marker number and ordering method, indicating the negative effects of missing markers and genotyping error on marker ordering. The more missing markers or genotyping errors in a population, the lower was the ordering accuracy observed. The effects of missing markers and genotyping error on linkage map construction can be explained by the reduced accuracy of estimating recombination frequency. Missing markers reduce the amount of information that can be used in recombination frequency estimation. The effect of randomly missing markers may be quantified by the reduced population size, similar to the effect in QTL detection [34]. Each genotyping error introduces one additional crossover event when the marker is located at either end of the chromosome and two crossover events when the marker is located in the middle of the chromosome. Thus, recombination frequency cannot be properly estimated when erroneous markers are present. Intuitively, genotyping error will have a much larger effect than missing markers on linkage analysis and subsequent map construction.
Let P be the proportion of correct order obtained from one NN initial route followed by k-Opt improvement, as reduced by missing markers and genotyping error. When n initial routes and improvements are applied, the probability of arriving at the correct order at least once is 1-(1-P)^n. When P = 0.3, 1-(1-P)^n is 0.9718 for n = 10, and 0.9992 for n = 20. Thus, the correct order may still be identified with high probability by increasing the number of initial NN routes when missing markers and genotyping error are present in the mapping population.
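The calculation above is easy to reproduce and to invert. The following sketch assumes, as the formula does, that the outcomes from different initial routes are independent; `routes_needed` is a hypothetical helper that returns the smallest number of initial routes reaching a target probability.

```python
import math

def prob_correct_order(p, n):
    """Probability that at least one of n independent routes yields the correct order."""
    return 1.0 - (1.0 - p) ** n

def routes_needed(p, target):
    """Smallest n such that prob_correct_order(p, n) >= target (0 < p < 1, 0 < target < 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

print(prob_correct_order(0.3, 10), prob_correct_order(0.3, 20))
print(routes_needed(0.3, 0.99))
```

With P = 0.3, for instance, 13 initial routes are enough to reach a probability of at least 0.99 under this independence assumption.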
Time spent in finding the optimal solution must be taken into consideration when the number of markers is large. Table 4 shows the time spent by k-OptMAP and k-OptTSP in solving MAP1906 and subsets of random markers in MAP1906. Time was recorded on a personal computer, a Lenovo X1 Carbon (Windows 10, Intel Core i7-6600U CPU @ 2.60 GHz). Compared with route improvement in the k-Optimal algorithm, the time to find NN initial routes was minor. Each value in Table 4 is the time spent when one marker was used as the start point in constructing the NN route, averaged over all markers for 2-Opt and over 50 randomly selected markers for 3-Opt. For example, for MAP300, 2-Opt was applied to 300 NN routes, and the best 2-Opt route was determined from the 300 2-Opt routes. The total time divided by 300 is shown in Table 4. 3-Opt was applied to 50 NN routes randomly selected from the 300 NN routes, and the best 3-Opt route was determined from the 50 3-Opt routes. The total time divided by 50 is shown in Table 4.
Table 3–Proportion of correct order estimated from 1000 random samples of markers in MAP2567.
It can be seen from Table 4 that 2-OptMAP required slightly less time than 2-OptTSP and that 3-OptMAP required slightly more time than 3-OptTSP.The difference between k-OptMAP and k-OptTSP was minor.REC required the least computing time,followed by LOD and DIS.As marker number increased,time spent increased rapidly.Compared with 2-Opt,3-Opt required much longer running time,making it less likely to be useful when marker number is high.When there are more than 2000 markers in one linkage group,it may take quite a long time to run 2-Opt on all NN routes(although actually this is not necessary in MAP).But the best marker order may still be identified by running a few NN routes,due to the high proportion of correct orders for each NN route(Table 2,Tables S4 and S5).
Ordering accuracy and time spent for 2-Opt,Lep-MAP,MSTMap,and TSPmap are presented in Tables 5,6 and 7.In these tables,open route length was used to determine the best order and REC was used as the measure of marker distance in 2-Opt.2-Opt was applied to 20 NN routes randomly selected,and the best 2-Opt route was determined from 20 2-Opt routes.True map length was calculated from the predefined marker order under no missing markers and no genotyping error.Error correction was applied for each ordering method.
Table 5 shows results observed with no genotyping error for MAP1906. The map length of the correct order was close to 300 cM. 2-Opt achieved the marker order closest to the true length and required the least computing time for all subsets of random markers. Time spent by Lep-MAP was 56 to 205 times as great as that for 2-Opt, time spent by TSPmap was 3 to 20 times as great, and time spent by MSTMap was 2 to 14 times as great. For two numbers of randomly selected markers, 100 and 400, bin number after error correction was the same as marker number for 2-Opt. For larger numbers of randomly selected markers, bin number was lower than marker number, indicating the presence of wrong correction. When both binning error and bin map error were considered (total error in Table 5), 2-Opt and TSPmap had the lowest error rates.
Table 6 shows results observed when 5% of marker data points were assumed to be missing in MAP1906. After random assignment of missing values, more markers became redundant. MAP1781 was formed by removing additional redundant markers, and subsets randomly selected from the 1781 markers were used to compare the three ordering methods. It can be seen that missing values had little effect on true map length. 2-Opt and TSPmap achieved marker orders closer to the true length, except for marker number 100. 2-Opt required the least computing time for all subsets of random markers. Time spent by Lep-MAP and TSPmap was much longer than that spent by 2-Opt and MSTMap. As no genotyping error was present in MAP1781, bin number should be equal to marker number. Thus, when error correction was applied, wrong correction occurred for all three ordering methods, but the error rate from MSTMap was highest. When both binning error and bin map error were considered, 2-Opt again had the lowest error rate.
Table 7 shows results observed when 1% genotyping error was assumed in MAP1906. After the random assignment of genotyping error, no markers became redundant, and subsets randomly selected from the 1906 markers were used to compare the three ordering methods. Genotyping error showed a great effect on the true map length of the predefined order. MSTMap achieved the marker order closest to the true length, but it had the largest number of binning errors. When both binning error and bin map error were considered, the performance of 2-Opt was poorer than that of Lep-MAP, but better than that of TSPmap and much better than that of MSTMap. With respect to time spent, 2-Opt was much faster than Lep-MAP and TSPmap. Comparing Tables 5 and 7, it can be concluded that the ordering procedure took longer when genotyping error was present.
Table 5–Comparison of 2-Opt with Lep-MAP,MSTMap and TSPmap,using MAP1906 with no missing marker and no genotyping error.
Table 6–Comparison of 2-Opt with Lep-MAP, MSTMap and TSPmap, using MAP1906 with 5% missing marker.
In summary, after error correction 2-Opt achieved better order in shorter time when there was no genotyping error (Tables 5 and 6). When genotyping error was present, the map length from 2-Opt was greater than the true length, owing possibly to undercorrection of genotyping error (Table 7). The map length from Lep-MAP was shorter than the true length, and the bin number from MSTMap was the lowest, owing possibly to overcorrection of genotyping error in the two methods. In actual mapping populations, it is hard to tell before genetic analysis whether genotyping error is present and, if present, how high the error rate is. It remains an open question how genotyping error could be properly identified and corrected without affecting single or double crossovers that actually occurred in the mapping population.
Table 7–Comparison of 2-Opt with Lep-MAP, MSTMap and TSPmap, using MAP1906 with 1% genotyping error.
Based on major outcomes from this study,ordering methods using the TSP k-Optimal algorithm have been modified as shown in Fig.2.For three steps(grouping,ordering,and rippling)in map construction,REC,LOD and DIS can each be selected as criteria of distance,but REC is set as the default.For ordering when k-Optimality is selected,users can modify the criteria of distance,and then select one from“2-OptTSP”,“3-OptTSP”,“2-OptMAP”and“3-OptMAP”(Fig.2).“2-OptTSP”and“3-OptTSP”represent the addition of one virtual marker and the use of closed route length to identify optimal routes for 2-Opt and 3-Opt,respectively.“2-OptMAP”and“3-OptMAP”represent the use of open route length to identify optimal routes for 2-Opt and 3-Opt,respectively.
Users have three options to choose initial routes for k-Opt improvement.When the first option is selected,a number of random NN routes is needed(Fig.2).Each NN route is followed by the selected k-Opt algorithm,and the best optimal route is returned.When the second option is selected,the previous route is improved by the selected k-Opt algorithm,and the final optimal route is returned.When the third option is selected,the shortest NN route is determined,and then followed by the selected k-Opt algorithm.When all parameters have been set up,users just click the“Ordering”button in the interface to compute the order of each marker group.The user interfaces implemented in QTL IciMapping,GACD,and GAPL facilitate the efficient construction of linkage maps in many of the mapping populations commonly used in genetic studies.
The number of available markers has increased greatly in recent years. Large numbers of markers complicate the procedure of linkage map construction, reduce construction accuracy, and require efficient algorithms for map construction. In addition, missing markers and genotyping error are commonly present in sequencing data, further complicating map construction. In this study, we showed that the algorithm used for solving TSP can be modified and used for constructing high-density linkage maps even in the presence of missing markers and genotyping errors.
Fig.2–Unified graphical user interface of linkage map construction implemented in three software packages:QTL IciMapping,GACD,and GAPL.
The k-Opt algorithm is to date the most successful approximate algorithm for solving the TSP with thousands of cities [27]. When connecting the start and end points of a linkage map, or adding one virtual marker with a fixed distance from other markers, genetic map construction may be roughly treated as a TSP. Ultrahigh-density genetic maps can be constructed by this method in a reasonable period of time, each map having hundreds or several thousands of markers. In map construction, different initial routes always resulted in the same optimal route, indicating that the best optimal route could be identified from just a few initial routes. As a criterion for judging better routes in the k-Opt algorithm, open route length gave a slightly higher proportion of correct marker order than closed route length. REC and LOD gave similar proportions of correct order, and both were higher than that from DIS (Table 2). 2-Opt took much less time than 3-Opt (Table 4). For MAP5000 (Table S3), when the number of initial routes was set at 50, 2-OptMAP spent 4.66, 4.15, and 11.73 s for one route when REC, LOD, and DIS were used as distance, respectively, on a personal computer with a 2.60 GHz CPU. 2-OptTSP spent 4.73, 4.35, and 14.39 s for one route. Correct order was achieved for both methods regardless of whether REC, LOD, or DIS was used as distance. In conclusion, for high-density markers, the most suitable method would be 2-Opt using open route length as the criterion to determine better routes, and using REC or LOD as the measure of distance between markers.
Different measures of marker distance did not always produce identical orders. The numbers of correct orders estimated from 1000 random samples of 50 random markers in MAP1906 are shown in Fig. 3, with 2-OptMAP used as the ordering method for illustration. In the 1000 runs, REC, LOD, and DIS produced 964, 950, and 929 correct orders, respectively, and 906 orders were correct under all three measures. Although DIS gave the fewest correct orders, 12 of the correct orders achieved by DIS were not achieved by REC or LOD. Thus, there is no guarantee that the correct order will be generated by the measure that gives the highest proportion of correct orders.
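To make the three measures concrete, the sketch below computes each from an estimated recombination frequency r between two markers. It rests on assumptions not stated in this section: the LOD formula is the standard one for a population with two genotypes per locus (such as backcross or DH) of n individuals, and the Haldane and Kosambi mapping functions are shown as two common ways of converting r to a genetic distance, without implying which one was used for DIS here. The function names are illustrative.

```python
import math

def lod_score(r, n):
    """LOD of linkage against r = 0.5, assuming two genotypes per locus."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("r must lie in [0, 0.5]")
    linked = (1 - r) * math.log10(1 - r)
    if r > 0:  # r * log10(r) tends to 0 as r tends to 0
        linked += r * math.log10(r)
    return n * (linked + math.log10(2))

def neg_lod(r, n):
    """Negative LOD, so that tighter linkage means a shorter 'distance'."""
    return -lod_score(r, n)

def haldane_cm(r):
    """Haldane mapping function (no crossover interference), in cM."""
    return -50.0 * math.log(1 - 2 * r)

def kosambi_cm(r):
    """Kosambi mapping function (partial interference), in cM."""
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))

# Tabulate the measures for a hypothetical population of 200 individuals.
for r in (0.01, 0.05, 0.10, 0.20):
    print(r, round(neg_lod(r, 200), 2),
          round(haldane_cm(r), 2), round(kosambi_cm(r), 2))
```

All three measures increase monotonically with r, but their scales differ in shape, so the sums that make up a route length can rank two candidate orders differently when estimates are noisy.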
MSTMap can handle backcross, DH, haploid, RIL, and advanced RIL populations, but is not suitable for F2, F3, and other populations with three genotypes at each locus [24]. Populations with heterozygous genotypes are essential for estimating dominance and epistatic effects [11]. MSTMap also cannot accommodate multi-parental populations. Lep-MAP handles many types of population, including DH, F2, and RIL, but cannot be used for multi-parental populations [25]. TSPmap handles many types of population, including DH, F2, and RIL, as well as multi-parental populations [31]. The TSP k-Opt algorithm has been implemented for 20 bi-parental populations [32], clonal F1 and double-cross F1 populations [8], and multi-parental pure-line populations [33]. Thus, the ordering method tested in this study can be readily used to build high-quality linkage maps with high-density markers in a wide range of genetic populations, especially in plants.
Currently, physical maps in most species are generated from a limited number of individuals. If the parents of a genetic mapping population are not the individuals used for physical map construction, the order of genes (and markers) in the mapping population may not be exactly the same as in the physical map. We have observed that linkage maps in some bi-parental populations become unreasonably long if the marker order is assumed to be the same as in the physical map. This observation indicates that the two maps do not always have the same order. For breeding purposes, we often need crossover and recombination information rather than physical distance in base pairs.
Fig. 3 – Numbers of correct order estimated from 1000 random samples of the first 50 markers in MAP1906. Measures were recombination frequency, negative logarithm of odds score, and genetic distance. 2-OptMAP was used as an example.
If used properly, physical map information can help to construct better genetic maps. First, physical map information can be employed to anchor genetic markers to specified chromosomes, greatly facilitating the marker grouping procedure. Second, even though physical and linkage maps do not have exactly the same marker order, many markers should have the same order on both maps. Taking this order as known information can improve the quality of the linkage map. In the user interface described in this paper, the software can take the known order of a number of markers into consideration and perform ordering for only those markers whose order is not known. This ordering method is called “By Anchor Order” in Fig. 2.
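As an illustration of ordering under a known anchor order, the sketch below keeps the anchored markers in their given sequence and places each remaining marker at the position that adds the least to the open route length. This cheapest-insertion rule is an assumption chosen for brevity; it is not claimed to be how the “By Anchor Order” option is implemented, and the function name and toy data are hypothetical.

```python
def insert_with_anchors(anchors, free, dist):
    """Place unanchored markers without changing the order of the anchors."""
    route = list(anchors)
    for m in free:
        best_pos, best_cost = 0, float("inf")
        for pos in range(len(route) + 1):
            left = route[pos - 1] if pos > 0 else None
            right = route[pos] if pos < len(route) else None
            # Increase in open route length if m is placed at this position.
            cost = 0.0
            if left is not None:
                cost += dist[left][m]
            if right is not None:
                cost += dist[m][right]
            if left is not None and right is not None:
                cost -= dist[left][right]
            if cost < best_cost:
                best_pos, best_cost = pos, cost
        route.insert(best_pos, m)
    return route

# Toy example (assumed data): markers 1, 3, and 2 are anchored in that order.
positions = [21, 0, 42, 7, 3, 30, 15, 8]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
print(insert_with_anchors([1, 3, 2], [0, 4, 5, 6, 7], toy_dist))
```

Because the anchors are never moved, only the unanchored markers are searched over. A refinement step such as 2-Opt restricted to moves that preserve the anchor sequence could follow, which is again an assumption rather than a description of the software.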
In the other direction, a linkage map may also help to correct some errors and impute missing data points on a physical map. Accurate linkage maps and their construction will continue to be an important task in genetics. Physical and linkage maps serve different purposes in genetics. Neither will ever replace the other.
Author contributions
LZ conducted the simulation experiment and wrote the draft. HL wrote the code for recombination frequency estimation in QTL IciMapping. LM developed the software interface. JW designed the research and revised the manuscript. All the authors discussed the results and finalized the manuscript.
Declaration of competing interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (31861143003) and HarvestPlus (part of the CGIAR Research Program on Agriculture for Nutrition and Health, http://www.harvestplus.org/).
Appendix A. Supplementary data
Supplementary data for this article can be found online at https://doi.org/10.1016/j.cj.2020.03.005.