预约演示
预约演示

Match the results of GATK germline pipeline with Sentieon

Introduction

This documents describes the capabilities of Sentieon DNAseq pipeline matching different versions of GATK germline pipelines. If you have any additional questions, please contact the technical support at Sentieon Inc. at support@sentieon.com.

Input and references

Fastq files of NA12878 were downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST _NA12878_HG001_HiSeq_300x/140127_D00360_0011_AHGV6ADXX/Project_RM8398/

Hg38 and other databases were downloaded from GATK resource bundle.

ArgumentsFile
fastaHomo_sapiens_assembly38.fasta
known_MillsMills_and_1000G_gold_standard.indels.hg38.vcf.gz
known_1000G1000G_phase1.snps.high_confidence.hg38.vcf.gz
known_dbsnpdbsnp_146.hg38.vcf.gz
calling_intervals_listwgs_calling_regions.hg38.interval_list

Data preprocessing

Step 1. BWA alignment

BWA 0.7.15-r1140:

bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
samtools sort -o sorted.bam
samtools index sorted.bam

Sentieon:

sentieon bwa mem -M -Y -K 10000000 \
-R '@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA' \
$fasta $fastq1 $fastq2 | \
sentieon util sort -i - \
-r $fasta -o sorted.bam --sam2bam

Step 2. PCR duplicate removal

GATK3.7/3.8(Picard):

java -jar picard.jar MarkDuplicates \
I=sorted.bam \
O=deduplicated.bam \
M=duplication.metrics \
REMOVE_DUPLICATES=true \
CREATE_INDEX=true

GATK4:

gatk MarkDuplicates \
-I sorted.bam \
-O deduplicated.bam \
-M duplication.metrics \
--REMOVE_DUPLICATES true \
--CREATE_INDEX true

Sentieon:

sentieon driver -r $fasta -i sorted.bam \
--algo LocusCollector --fun score_info score.txt.gz
sentieon driver -r $fasta -i sorted.bam \
--algo Dedup --rmdup --score_info score.txt.gz deduped.bam

Step 3. Base Quality Score Recalibration

GATK 3.7/3.8:

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--knownSites $known_Mills \
--knownSites $known_1000G \
--knownSites $known_dbsnp \
-o bqsr.grp
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R $fasta \
-I deduplicated.bam \
-BQSR bqsr.grp \
-o recalibrated.bam

GATK 4:

gatk BaseRecalibrator \
-I deduplicated.bam \
-R $fasta \
--known-sites $known_Mills \
--known-sites $known_1000G \
--known-sites $known_dbsnp \
-O bqsr.grp
gatk ApplyBQSR \
-R $fasta \
-I deduplicated.bam \
--bqsr-recal-file bqsr.grp \
-O recalibrated.bam

Sentieon*:

sentieon driver -r $fasta \
-i deduped.bam \
--algo QualCal \
-k $known_dbsnp \
-k $known_1000G \
-k $known_Mills \
recal_data.table

*Sentieon variant callers can perform the recalibration on the fly using a pre-recalibration bam plus the recalibration table. Recalibrated bam can be generated by the ReadWriter algo.

# This step is optional
sentieon driver -i deduped.bam -q recal_data.table --algo ReadWriter recaled.bam

Germline variant caller

GATK 3.7/3.8:

Command line:

GATK 3.7/3.8:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-ERC GVCF \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-o output.g.vcf.gz
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
--variant output.g.vcf.gz \
--dbsnp $known_dbsnp \
-o output.vcf.gz

Sentieon:

sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz

Results:

TypeTRUTHQUERYMETRIC
TOTALTPFNTOTALFPRecallPrecisionF1_Score
INDEL8487238482384858743605380.9994290.9993850.999407
SNP400182140007971024400575310330.9997440.9997420.999743

GTK 4.0

Command line:

GTK 4.0

gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-ERC GVCF \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz

Sentieon:

sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
--call_conf 10 \
--emit_conf 10 \
-d $known_dbsnp \
output.vcf.gz

Results:

TypeTRUTHQUERYMETRIC
TOTALTPFNTOTALFPRecallPrecisionF1_Score
INDEL849960846375358587436424340.9957820.9972160.996499
SNP400364339985275116400575033190.9987220.9991710.998947

GTK 4.1

Command line:

GTK 4.1

gatk HaplotypeCaller \
-R $fasta \
-L $calling_intervals_list \
-I recalibrated.bam \
-ERC GVCF \
-O output.g.vcf.gz
gatk GenotypeGVCFs \
-R $fasta \
-L $calling_intervals_list \
-V output.g.vcf.gz \
--dbsnp $known_dbsnp \
-O output.vcf.gz

Sentieon*:

sentieon driver -r $fasta \
-i deduped.bam \
-q recal_data.table \
--interval $calling_intervals_list \
--algo Haplotyper \
--emit_mode gvcf \
output.g.vcf.gz
sentieon driver -r $fasta \
--interval $calling_intervals_list \
--algo GVCFtyper \
-v output.g.vcf.gz \
-d $known_dbsnp \
--genotype_model multinomial \
output.vcf.gz

*Sentieon uses the option --genotype_model multinomial to match the output of the default newQual model in GATK 4.1.

Results:

TypeTRUTHQUERYMETRIC
TOTALTPFNTOTALFPRecallPrecisionF1_Score
INDEL8557168507904926894426108690.9942430.9878480.991035
SNP3999272399037988934006624118260.9977760.9970480.997412

Runtime

Computing environment:

  • Google Compute Engine
  • n1-standard-32 (32 vCPUs, 120 GB memory)
  • Local SSD Scratch Disk 2x375G
  • centos-7-v20190619
StageSentieonGATK3.8GATK4.0GATK4.1
Alignment2:42:445:38:355:49:395:45:39
Dedup0:06:164:04:252:11:432:06:32
BQSR0:10:104:17:091:39:571:40:06
HaplotypeCaller0:41:023:21:376:56:535:37:52
GenotypeGVCFs0:00:552:04:082:02:552:05:22
Total3:41:0719:25:5418:41:0717:15:31

Documentation on how to execute the Broad Institute GATK Best Practices using Sentieon.