Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Contig names lost after FastaAlternateReferenceMaker

3 comments

  • Can Kockan

    Looking at your original reference index, Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.fai, the contigs appear to use the 1, 2, ..., X, Y, MT convention, unless I'm missing something. Using a VCF with the "chr" prefix added could therefore be the problem. As far as I know, this tool also accepts an interval list, so it might be worth running a quick, small test with the original primary assembly, the unaltered VCF, and a very small known interval where you want to introduce a SNP into the reference.
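One way to check for the naming mismatch described above before running the tool is to compare the contig names in the reference index with those used in the VCF. The following is a minimal sketch, assuming an uncompressed VCF and a standard tab-separated .fai index; the function name and the VCF file name in the usage comment are hypothetical.

```shell
# List contigs that appear in the VCF body but not in the reference index.
# Arguments: <reference.fa.fai> <calls.vcf> (uncompressed VCF assumed).
vcf_contigs_missing_from_fai() {
    awk -F'\t' '
        NR == FNR { ref[$1] = 1; next }           # first file: remember .fai contig names
        /^#/      { next }                        # skip VCF header lines
        !($1 in ref) && !seen[$1]++ { print $1 }  # report each unknown contig once
    ' "$1" "$2"
}

# Usage (hypothetical VCF name):
#   vcf_contigs_missing_from_fai Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.fai calls.vcf
```

If this prints names such as chr1 while the index lists 1, the VCF and the reference follow different conventions and one of them needs renaming first.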

  • ZYF

    Don't worry, Kheira, I got you.

    The problem is the -L BED file. If you specify this argument, the output FASTA will be broken. To fix this, first make sure the contigs in the first column of the VCF all use the "chr1" format, then remove the -L argument.

    The output FASTA will then have headers like the ones below:

    ❯ rg '>' masked.fa
    1:>1 chr1:1-248956422
    4149260:>2 chr10:1-133797422
    6379213:>3 chr11:1-135086622
    8630653:>4 chr12:1-133275309
    10851906:>5 chr13:1-114364328
    12757978:>6 chr14:1-107043718
    14542039:>7 chr15:1-101991189
    16241891:>8 chr16:1-90338345
    17747527:>9 chr17:1-83257441
    19135146:>10 chr18:1-80373285
    20474705:>11 chr19:1-58617616
    21451662:>12 chr2:1-242193529
    25488222:>13 chr20:1-64444167
    26562301:>14 chr21:1-46709983
    27340798:>15 chr22:1-50818468
    28187778:>16 chr3:1-198295559
    31492703:>17 chr4:1-190214555
    34662931:>18 chr5:1-181538259
    37688580:>19 chr6:1-170805979
    40535367:>20 chr7:1-159345973
    43191140:>21 chr8:1-145138636
    45610118:>22 chr9:1-138394717
    47916706:>23 chrMT:1-16569
    47916984:>24 chrX:1-156040895
    50517655:>25 chrY:1-57227415
    51471447:>26 KI270728.1:1-1872759
    51502661:>27 KI270727.1:1-448248
    51510133:>28 KI270442.1:1-392061
    51516669:>29 KI270729.1:1-280839
    51521351:>30 GL000225.1:1-211173
    51524872:>31 KI270743.1:1-210658
    51528384:>32 GL000008.2:1-209709
    51531881:>33 GL000009.2:1-201709
    51535244:>34 KI270747.1:1-198735
    51538558:>35 KI270722.1:1-194050
    51541794:>36 GL000194.1:1-191469
    51544987:>37 KI270742.1:1-186739
    51548101:>38 GL000205.2:1-185591
    51551196:>39 GL000195.1:1-182896
    51554246:>40 KI270736.1:1-181920
    51557279:>41 KI270733.1:1-179772
    51560277:>42 GL000224.1:1-179693
    51563273:>43 GL000219.1:1-179198
    51566261:>44 KI270719.1:1-176845
    51569210:>45 GL000216.2:1-176608
    51572155:>46 KI270712.1:1-176043
    51575091:>47 KI270706.1:1-175055
    51578010:>48 KI270725.1:1-172810
    51580892:>49 KI270744.1:1-168472
    51583701:>50 KI270734.1:1-165050
    51586453:>51 GL000213.1:1-164239
    51589192:>52 GL000220.1:1-161802
    51591890:>53 KI270715.1:1-161471
    51594583:>54 GL000218.1:1-161147
    51597270:>55 KI270749.1:1-158759
    51599917:>56 KI270741.1:1-157432
    51602542:>57 GL000221.1:1-155397
    51605133:>58 KI270716.1:1-153799
    51607698:>59 KI270731.1:1-150754
    51610212:>60 KI270751.1:1-150742
    51612726:>61 KI270750.1:1-148850
    51615208:>62 KI270519.1:1-138126
    51617512:>63 GL000214.1:1-137718
    51619809:>64 KI270708.1:1-127682
    51621939:>65 KI270730.1:1-112551
    51623816:>66 KI270438.1:1-112505
    51625693:>67 KI270737.1:1-103838
    51627425:>68 KI270721.1:1-100316
    51629098:>69 KI270738.1:1-99375
    51630756:>70 KI270748.1:1-93321
    51632313:>71 KI270435.1:1-92983
    51633864:>72 GL000208.1:1-92689
    51635410:>73 KI270538.1:1-91309
    51636933:>74 KI270756.1:1-79590
    51638261:>75 KI270739.1:1-73985
    51639496:>76 KI270757.1:1-71251
    51640685:>77 KI270709.1:1-66860
    51641801:>78 KI270746.1:1-66486
    51642911:>79 KI270753.1:1-62944
    51643962:>80 KI270589.1:1-44474
    51644705:>81 KI270726.1:1-43739
    51645435:>82 KI270735.1:1-42811
    51646150:>83 KI270711.1:1-42210
    51646855:>84 KI270745.1:1-41891
    51647555:>85 KI270714.1:1-41717
    51648252:>86 KI270732.1:1-41543
    51648946:>87 KI270713.1:1-40745
    51649627:>88 KI270754.1:1-40191
    51650298:>89 KI270710.1:1-40176
    51650969:>90 KI270717.1:1-40062
    51651638:>91 KI270724.1:1-39555
    51652299:>92 KI270720.1:1-39050
    51652951:>93 KI270723.1:1-38115
    51653588:>94 KI270718.1:1-38054
    51654224:>95 KI270317.1:1-37690
    51654854:>96 KI270740.1:1-37240
    51655476:>97 KI270755.1:1-36723
    51656090:>98 KI270707.1:1-32032
    51656625:>99 KI270579.1:1-31033
    51657144:>100 KI270752.1:1-27745
    51657608:>101 KI270512.1:1-22689
    51657988:>102 KI270322.1:1-21476
    51658347:>103 GL000226.1:1-15008
    51658599:>104 KI270311.1:1-12399
    51658807:>105 KI270366.1:1-8320
    51658947:>106 KI270511.1:1-8127
    51659084:>107 KI270448.1:1-7992
    51659219:>108 KI270521.1:1-7642
    51659348:>109 KI270581.1:1-7046
    51659467:>110 KI270582.1:1-6504
    51659577:>111 KI270515.1:1-6361
    51659685:>112 KI270588.1:1-6158
    51659789:>113 KI270591.1:1-5796
    51659887:>114 KI270522.1:1-5674
    51659983:>115 KI270507.1:1-5353
    51660074:>116 KI270590.1:1-4685
    51660154:>117 KI270584.1:1-4513
    51660231:>118 KI270320.1:1-4416
    51660306:>119 KI270382.1:1-4215
    51660378:>120 KI270468.1:1-4055
    51660447:>121 KI270467.1:1-3920
    51660514:>122 KI270362.1:1-3530
    51660574:>123 KI270517.1:1-3253
    51660630:>124 KI270593.1:1-3041
    51660682:>125 KI270528.1:1-2983
    51660733:>126 KI270587.1:1-2969
    51660784:>127 KI270364.1:1-2855
    51660833:>128 KI270371.1:1-2805
    51660881:>129 KI270333.1:1-2699
    51660927:>130 KI270374.1:1-2656
    51660973:>131 KI270411.1:1-2646
    51661019:>132 KI270414.1:1-2489
    51661062:>133 KI270510.1:1-2415
    51661104:>134 KI270390.1:1-2387
    51661145:>135 KI270375.1:1-2378
    51661186:>136 KI270420.1:1-2321
    51661226:>137 KI270509.1:1-2318
    51661266:>138 KI270315.1:1-2276
    51661305:>139 KI270302.1:1-2274
    51661344:>140 KI270518.1:1-2186
    51661382:>141 KI270530.1:1-2168
    51661420:>142 KI270304.1:1-2165
    51661458:>143 KI270418.1:1-2145
    51661495:>144 KI270424.1:1-2140
    51661532:>145 KI270417.1:1-2043
    51661568:>146 KI270508.1:1-1951
    51661602:>147 KI270303.1:1-1942
    51661636:>148 KI270381.1:1-1930
    51661670:>149 KI270529.1:1-1899
    51661703:>150 KI270425.1:1-1884
    51661736:>151 KI270396.1:1-1880
    51661769:>152 KI270363.1:1-1803
    51661801:>153 KI270386.1:1-1788
    51661832:>154 KI270465.1:1-1774
    51661863:>155 KI270383.1:1-1750
    51661894:>156 KI270384.1:1-1658
    51661923:>157 KI270330.1:1-1652
    51661952:>158 KI270372.1:1-1650
    51661981:>159 KI270548.1:1-1599
    51662009:>160 KI270580.1:1-1553
    51662036:>161 KI270387.1:1-1537
    51662063:>162 KI270391.1:1-1484
    51662089:>163 KI270305.1:1-1472
    51662115:>164 KI270373.1:1-1451
    51662141:>165 KI270422.1:1-1445
    51662167:>166 KI270316.1:1-1444
    51662193:>167 KI270340.1:1-1428
    51662218:>168 KI270338.1:1-1428
    51662243:>169 KI270583.1:1-1400
    51662268:>170 KI270334.1:1-1368
    51662292:>171 KI270429.1:1-1361
    51662316:>172 KI270393.1:1-1308
    51662339:>173 KI270516.1:1-1300
    51662362:>174 KI270389.1:1-1298
    51662385:>175 KI270466.1:1-1233
    51662407:>176 KI270388.1:1-1216
    51662429:>177 KI270544.1:1-1202
    51662451:>178 KI270310.1:1-1201
    51662473:>179 KI270412.1:1-1179
    51662494:>180 KI270395.1:1-1143
    51662515:>181 KI270376.1:1-1136
    51662535:>182 KI270337.1:1-1121
    51662555:>183 KI270335.1:1-1048
    51662574:>184 KI270378.1:1-1048
    51662593:>185 KI270379.1:1-1045
    51662612:>186 KI270329.1:1-1040
    51662631:>187 KI270419.1:1-1029
    51662650:>188 KI270336.1:1-1026
    51662669:>189 KI270312.1:1-998
    51662687:>190 KI270539.1:1-993
    51662705:>191 KI270385.1:1-990
    51662723:>192 KI270423.1:1-981
    51662741:>193 KI270392.1:1-971
    51662759:>194 KI270394.1:1-970

    I think the GATK team should add an explanatory note to the FastaAlternateReferenceMaker documentation, for example: "If you want to generate a masked FASTA for general use, please skip the -L argument."

    Wait, it's still not right. We don't want the contigs renamed to 1, 2, 3, ...; that will cause trouble. We can use sed to fix it, but why not keep the original names in FastaAlternateReferenceMaker? To use the masked FASTA as a drop-in replacement, the contig names need to stay unmodified.

    BTW, the sed command line to get the contig names back in my situation is:

    sed 's/>\([0-9]*\) \(.*\):\([0-9]*\)-\([0-9]*\)/>\2/g' output.fa
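As an illustration of what the command above does, here is a slightly stricter variant written as a filter. It is only a sketch: it assumes every header has the form ">N name:start-end" shown in the listing, and the function name is hypothetical. The pattern is anchored to the whole header line, so sequence lines pass through untouched.

```shell
# Rewrite headers such as ">1 chr1:1-248956422" back to ">chr1".
# Reads FASTA on stdin and writes FASTA on stdout.
restore_contig_names() {
    # -E enables extended regular expressions (GNU and BSD sed).
    # The greedy (.+) keeps everything up to the final ":start-end" suffix,
    # so contig names that themselves contain ":" are preserved.
    sed -E 's/^>[0-9]+ (.+):[0-9]+-[0-9]+$/>\1/'
}

# Usage:
#   restore_contig_names < output.fa > output.renamed.fa
```

After renaming, any existing index or sequence dictionary for the FASTA no longer matches and would need to be regenerated, for example with samtools faidx and gatk CreateSequenceDictionary.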

    I hope someone sees this.

  • Gökalp Çelik

    Hi all. This is the default behavior of FastaAlternateReferenceMaker, so depending on your purpose you may need to change the contig names after running it.

    If you wish to avoid this behavior, you can open an issue on the GATK GitHub page and request a feature.

    https://github.com/broadinstitute/gatk 

    Alternatively, you may try

    bcftools consensus

    which does not change sequence names unless a region is provided. Keep in mind, however, that its replacement behavior may differ from that of FastaAlternateReferenceMaker, so it is up to you to check the results.
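A minimal sketch of the bcftools route, assuming a bgzip-compressed VCF and a plain FASTA reference; the wrapper name and the file names in the usage comment are hypothetical, and the output should still be compared against what FastaAlternateReferenceMaker produces.

```shell
# Hypothetical wrapper: apply variants from a bgzipped VCF to a reference.
# Arguments: <reference.fa> <calls.vcf.gz> <output.fa>
make_consensus() {
    ref=$1; vcf=$2; out=$3
    for f in "$ref" "$vcf"; do
        [ -r "$f" ] || { echo "cannot read input: $f" >&2; return 1; }
    done
    # bcftools consensus needs an indexed VCF; build an index if none exists.
    [ -e "$vcf.tbi" ] || [ -e "$vcf.csi" ] || bcftools index "$vcf" || return 1
    # -f: reference FASTA, -o: output FASTA.
    bcftools consensus -f "$ref" -o "$out" "$vcf"
}

# Usage (hypothetical file names):
#   make_consensus reference.fa calls.vcf.gz consensus.fa
```

The wrapper checks its inputs first so that a missing file fails with a clear message before bcftools is called.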

    I hope this helps. 

    EDIT: I just opened a PR for this behavior. Please check back to see whether it has been merged.

    https://github.com/broadinstitute/gatk/pull/8865 

