This document describes how the Sentieon DNAseq pipeline can be configured to match different versions of the GATK germline pipeline. If you have any additional questions, please contact Sentieon technical support at support@sentieon.com.
FASTQ files for NA12878 were downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140127_D00360_0011_AHGV6ADXX/Project_RM8398/
The hg38 reference and the other databases were downloaded from the GATK resource bundle.
Arguments | File |
---|---|
fasta | Homo_sapiens_assembly38.fasta |
known_Mills | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz |
known_1000G | 1000G_phase1.snps.high_confidence.hg38.vcf.gz |
known_dbsnp | dbsnp_146.hg38.vcf.gz |
calling_intervals_list | wgs_calling_regions.hg38.interval_list |
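
The commands in the rest of this document refer to these files through shell variables such as `$fasta` and `$known_dbsnp`. The following is a minimal sketch of how those variables might be defined before running any step. The `bundle_dir` location, the FASTQ file names, and the `check_inputs` helper are illustrative assumptions and are not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical location of the downloaded GATK resource bundle; adjust as needed.
bundle_dir="${bundle_dir:-/data/gatk_bundle/hg38}"

# Variables used by the commands in this document (file names from the table above).
fasta="$bundle_dir/Homo_sapiens_assembly38.fasta"
known_Mills="$bundle_dir/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
known_1000G="$bundle_dir/1000G_phase1.snps.high_confidence.hg38.vcf.gz"
known_dbsnp="$bundle_dir/dbsnp_146.hg38.vcf.gz"
calling_intervals_list="$bundle_dir/wgs_calling_regions.hg38.interval_list"

# Hypothetical FASTQ names; the real names depend on the downloaded lanes.
fastq1="${fastq1:-NA12878_R1.fastq.gz}"
fastq2="${fastq2:-NA12878_R2.fastq.gz}"

# Report every input that is missing or empty; return non-zero if any is absent.
check_inputs() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_inputs "$fasta" "$known_Mills" "$known_1000G" "$known_dbsnp" "$calling_intervals_list" || exit 1` before the first step stops the run early instead of failing hours into alignment.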
BWA 0.7.15-r1140:
bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
samtools sort -o sorted.bam
samtools index sorted.bam
Sentieon:
sentieon bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
sentieon util sort -i - \
-r $fasta -o sorted.bam --sam2bam
GATK 3.7/3.8 (Picard):
java -jar picard.jar MarkDuplicates \
I=sorted.bam \
O=deduplicated.bam \
M=duplication.metrics \
REMOVE_DUPLICATES=true \
CREATE_INDEX=true
GATK 4:
gatk MarkDuplicates \
-I sorted.bam \
-O deduplicated.bam \
-M duplication.metrics \
--REMOVE_DUPLICATES true \
--CREATE_INDEX true
Sentieon:
sentieon driver -r $fasta -i sorted.bam \
--algo LocusCollector --fun score_info score.txt.gz
sentieon driver -r $fasta -i sorted.bam \
--algo Dedup --rmdup --score_info score.txt.gz deduped.bam
GATK 3.7/3.8:
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--knownSites $known_Mills \
--knownSites $known_1000G \
--knownSites $known_dbsnp \
-o bqsr.grp
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R $fasta \
-I deduplicated.bam \
-BQSR bqsr.grp \
-o recalibrated.bam
GATK 4:
gatk BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--known-sites $known_Mills \
--known-sites $known_1000G \
--known-sites $known_dbsnp \
-O bqsr.grp
gatk ApplyBQSR \
-R $fasta \
-I deduplicated.bam \
--bqsr-recal-file bqsr.grp \
-O recalibrated.bam
Sentieon*:
sentieon driver -r $fasta \
-i deduped.bam \
--algo QualCal \
-k $known_dbsnp \
-k $known_1000G \
-k $known_Mills \
recal_data.table
*Sentieon variant callers can apply the recalibration on the fly, given the pre-recalibration BAM and the recalibration table. If a recalibrated BAM file is needed, it can be generated with the ReadWriter algo.
# This step is optional
sentieon driver -i deduped.bam -q recal_data.table --algo ReadWriter recaled.bam
Command line:
GATK 3.7/3.8:
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-ERC GVCF \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-o output.g.vcf.gz
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
--variant output.g.vcf.gz \
--dbsnp $known_dbsnp \
-o output.vcf.gz
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz
Results:
Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|---|---|---|
INDEL | 848723 | 848238 | 485 | 874360 | 538 | 0.999429 | 0.999385 | 0.999407 |
SNP | 4001821 | 4000797 | 1024 | 4005753 | 1033 | 0.999744 | 0.999742 | 0.999743 |
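
The metric columns can be recomputed from the raw counts, which is a useful sanity check when repeating the comparison. The sketch below assumes Recall = TP / TRUTH TOTAL and Precision = (QUERY TOTAL - FP) / QUERY TOTAL; these formulas are inferred from the numbers in the table rather than stated in the original text, and the `metrics` function name is hypothetical.

```shell
#!/usr/bin/env bash
# Recompute Recall, Precision and F1 from raw counts.
# Arguments: TRUTH_TOTAL TRUTH_TP QUERY_TOTAL QUERY_FP
# Assumption: Precision = (QUERY_TOTAL - FP) / QUERY_TOTAL.
metrics() {
    awk -v tt="$1" -v tp="$2" -v qt="$3" -v fp="$4" 'BEGIN {
        recall    = tp / tt
        precision = (qt - fp) / qt
        f1        = 2 * precision * recall / (precision + recall)
        printf "%.6f %.6f %.6f\n", recall, precision, f1
    }'
}

metrics 848723 848238 874360 538      # INDEL row: 0.999429 0.999385 0.999407
metrics 4001821 4000797 4005753 1033  # SNP row:   0.999744 0.999742 0.999743
```

Both rows reproduce the Recall, Precision and F1_Score values listed in the table.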
Command line:
GATK 4.0:
gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-ERC GVCF \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz
Sentieon:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz
Results:
Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|---|---|---|
INDEL | 849960 | 846375 | 3585 | 874364 | 2434 | 0.995782 | 0.997216 | 0.996499 |
SNP | 4003643 | 3998527 | 5116 | 4005750 | 3319 | 0.998722 | 0.999171 | 0.998947 |
Command line:
GATK 4.1:
gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-ERC GVCF \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz
Sentieon*:
sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
-d $known_dbsnp \
--genotype_model multinomial \
output.vcf.gz
*Sentieon uses the option --genotype_model multinomial to match the output of the default newQual model in GATK 4.1.
Results:
Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|---|---|---|
INDEL | 855716 | 850790 | 4926 | 894426 | 10869 | 0.994243 | 0.987848 | 0.991035 |
SNP | 3999272 | 3990379 | 8893 | 4006624 | 11826 | 0.997776 | 0.997048 | 0.997412 |
Computing environment:
Stage (runtime, h:mm:ss) | Sentieon | GATK 3.8 | GATK 4.0 | GATK 4.1 |
---|---|---|---|---|
Alignment | 2:42:44 | 5:38:35 | 5:49:39 | 5:45:39 |
Dedup | 0:06:16 | 4:04:25 | 2:11:43 | 2:06:32 |
BQSR | 0:10:10 | 4:17:09 | 1:39:57 | 1:40:06 |
HaplotypeCaller | 0:41:02 | 3:21:37 | 6:56:53 | 5:37:52 |
GenotypeGVCFs | 0:00:55 | 2:04:08 | 2:02:55 | 2:05:22 |
Total | 3:41:07 | 19:25:54 | 18:41:07 | 17:15:31 |
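
The totals in the table can be turned into relative speedups with a few lines of shell. This is a sketch that assumes the entries are wall-clock durations in h:mm:ss; the `to_seconds` and `speedup` helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Convert an h:mm:ss duration to seconds (awk reads "09" as decimal 9,
# which avoids the octal pitfall of bash arithmetic).
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print how many times longer the second duration is than the first.
speedup() {
    awk -v fast="$(to_seconds "$1")" -v slow="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", slow / fast }'
}

sentieon_total="3:41:07"
speedup "$sentieon_total" "19:25:54"   # GATK 3.8 total: 5.27
speedup "$sentieon_total" "18:41:07"   # GATK 4.0 total: 5.07
speedup "$sentieon_total" "17:15:31"   # GATK 4.1 total: 4.68
```

The same helpers can be applied to individual stages, for example `speedup 0:06:16 4:04:25` for the Dedup row.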