Tutorial¶
Use Chorus2 to design oligo probes for plant genome¶
In this tutorial, we will build oligo probe set for Arabidopsis genome.
Run Chorus2¶
Run Chorus2 with Docker¶
Download Reference Genome file
$ wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas
$ docker run -v $PWD:/home/chorus -e CHORUS_USER=$USER -e CHORUS_UID=$UID \
forrestzhang/docker-chorus -i TAIR10_chr_all.fas -g TAIR10_chr_all.fas -t 12
Please wait unit all precess done. There are some logs:
forrest /home/chorus
use local user: forrest
Adding group 'forrest' (GID 1000) ...
Done.
Adding user 'forrest' ...
Adding new user 'forrest' (1000) with group 'forrest' ...
Creating home directory '/home/forrest' ...
Copying files from '/etc/skel' ...
/home/chorus exists
2.2.3
########################################
bwa version: /opt/software/bwa/bwa 0.7.12-r1044
jellyfish version: /opt/software/jellyfish/bin/jellyfish 2.2.3
genome file: TAIR10_chr_all.fas
input file: TAIR10_chr_all.fas
5' labeled R primer:
result output folder: /home/chorus/probes
threads number: 12
homology: 75
dtm: 10
########################################
...
...
14300000 / 14326857
14310000 / 14326857
14320000 / 14326857
Job finshed!!
When process done:
$ ls -lt probes/
total 1741428
-rw-r--r-- 1 root root 280927981 Aug 24 17:44 TAIR10_chr_all.fas_all.bed
-rw-r--r-- 1 root root 62050561 Aug 24 17:44 TAIR10_chr_all.fas.bed
-rw-r--r-- 1 root root 94 Aug 24 17:30 TAIR10_chr_all.fas.len
-rw-r--r-- 1 root root 1031512169 Aug 24 17:22 TAIR10_chr_all.fas_tmp_probe.fa
-rw-r--r-- 1 root root 59833928 Aug 24 17:19 TAIR10_chr_all.fas.sa
-rw-r--r-- 1 root root 7535 Aug 24 17:18 TAIR10_chr_all.fas.amb
-rw-r--r-- 1 root root 682 Aug 24 17:18 TAIR10_chr_all.fas.ann
-rw-r--r-- 1 root root 29916939 Aug 24 17:18 TAIR10_chr_all.fas.pac
-rw-r--r-- 1 root root 119667836 Aug 24 17:18 TAIR10_chr_all.fas.bwt
-rw-r--r-- 1 root root 121183059 Aug 24 17:17 TAIR10_chr_all.fas
-rw-r--r-- 1 root root 78102510 Aug 24 17:17 TAIR10_chr_all.fas_17mer.jf
TAIR10_chr_all.fas.bed is the probe file contained non-overlapped probes.
TAIR10_chr_all.fas_all.bed is the probe file contained all probes. This file can be used for ChorusNGSfilter.
TAIR10_chr_all.fas.len is the length info of the given genome chromosomes. This file can be imported into ChorusPBGUI for probe selection.
TAIR10_chr_all.fas_17mer.jf is the binary file created by jellyfish count using 17-mer.
TAIR10_chr_all.fas_tmp_probe.fa contains all candidate probe sequences filtered by jellyfish.
.bwt, .pac, .ann, .amb, .sa files are bwa index files.
$ more probes/TAIR10_chr_all.fas.bed
1 52 96 TCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATACCTA
1 211 255 TTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTC
1 346 390 CCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACT
1 426 470 TTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAAT
1 496 540 TCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGG
1 551 595 TAGCATTTATTCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAG
There are four columns in each row, first column is chromosome name, second is oligo start site, third is oligo end site, the last one is oligo probe sequence. You can use excel or text editor to open this file.
Run Chorus2 in terminal¶
Make a project folder
$ cd ~
$ mkdir sampleproject
$ cd sampleproject
Download reference genome
$ wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas
Test chorus2 software
$ Chorus2 -h
usage: Chorus2 [-h] [--version] [-j JELLYFISH] [-b BWA] -g GENOME -i INPUT
[-s SAVED] [-p PRIMER] [-t THREADS] [-l LENGTH]
[--homology HOMOLOGY] [-d DTM] [--skipdtm SKIPDTM]
[--step STEP] [--docker DOCKER] [--ploidy PLOIDY]
Chorus2 Software for Oligo FISH probe design
optional arguments:
-h, --help show this help message and exit
--version show program\'s version number and exit
-j JELLYFISH, --jellyfish JELLYFISH
The path where Jellyfish software installed
-b BWA, --bwa BWA The path where BWA software installed
-g GENOME, --genome GENOME
Fasta format genome file, should include all sequences
from genome
-i INPUT, --input INPUT
Fasta format input file, can be whole genome, a
chromosome or one region from genome
-s SAVED, --save SAVED
The output folder for saving results
-p PRIMER, --primer PRIMER
A specific 5\' labeled R primer for PCR reaction. For
example: CGTGGTCGCGTCTCA. (Default is none)
-t THREADS, --threads THREADS
Number of threads or CPUs to use. (Default: 1)
-l LENGTH, --length LENGTH
The probe length. (Default: 45)
--homology HOMOLOGY The maximum homology(%) between target sequence and
probe, range from 50 to 100. (Default: 75)
-d DTM, --dtm DTM The minimum value of dTm (hybrid Tm - hairpin Tm),
range from 0 to 37. (Default: 10)
--skipdtm SKIPDTM skip calculate dtm, for oligo longer than 50.
--step STEP The step length for k-mer searching in a sliding
window, step length>=1. (Default: 5)
--docker DOCKER Only used in Docker version of Chorus
--ploidy PLOIDY The ploidy of the given genome (test version).
(Default: 2)
Example:
Chorus2 -i TAIR10_chr_all.fas -g TAIR10_chr_all.fas -t 4 \
-j /opt/software/jellyfish/bin/jellyfish -b /opt/software/bwa/bwa -s sample
Run chorus2 software
$ Chorus2 -i TAIR10_chr_all.fas -g TAIR10_chr_all.fas -t 12
When job finish, the oligo probes will output to ‘probes’ folder (Default, can be changed using -s)
$ cd sample
$ ls -lt *
total 1741428
-rw-r--r-- 1 root root 280927981 Aug 24 17:44 TAIR10_chr_all.fas_all.bed
-rw-r--r-- 1 root root 62050561 Aug 24 17:44 TAIR10_chr_all.fas.bed
-rw-r--r-- 1 root root 94 Aug 24 17:30 TAIR10_chr_all.fas.len
-rw-r--r-- 1 root root 1031512169 Aug 24 17:22 TAIR10_chr_all.fas_tmp_probe.fa
-rw-r--r-- 1 root root 59833928 Aug 24 17:19 TAIR10_chr_all.fas.sa
-rw-r--r-- 1 root root 7535 Aug 24 17:18 TAIR10_chr_all.fas.amb
-rw-r--r-- 1 root root 682 Aug 24 17:18 TAIR10_chr_all.fas.ann
-rw-r--r-- 1 root root 29916939 Aug 24 17:18 TAIR10_chr_all.fas.pac
-rw-r--r-- 1 root root 119667836 Aug 24 17:18 TAIR10_chr_all.fas.bwt
-rw-r--r-- 1 root root 121183059 Aug 24 17:17 TAIR10_chr_all.fas
-rw-r--r-- 1 root root 78102510 Aug 24 17:17 TAIR10_chr_all.fas_17mer.jf
TAIR10_chr_all.fas.bed is the probe file contained non-overlapped probes.
TAIR10_chr_all.fas_all.bed is the probe file contained all probes. This file can be used for ChorusNGSfilter.
TAIR10_chr_all.fas.len is the length info of the given genome chromosomes. This file can be imported into ChorusPBGUI for probe selection.
TAIR10_chr_all.fas_17mer.jf is the binary file created by jellyfish count using 17-mer.
TAIR10_chr_all.fas_tmp_probe.fa contains all candidate probe sequences filtered by jellyfish.
.bwt, .pac, .ann, .amb, .sa files are bwa index files.
$ more probes/TAIR10_chr_all.fas.bed
1 52 96 TCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATACCTA
1 211 255 TTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTC
1 346 390 CCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACT
1 426 470 TTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAAT
1 496 540 TCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGG
1 551 595 TAGCATTTATTCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAG
There are four columns in each row, first column is chromosome name, second is oligo start site, third is oligo end site, the last one is oligo probe sequence. You can use excel or text editor to open this file.
Further filter using ChorusNGSfilter
To further filter putative repetitive sequences, a kmer-based method can be performed to detect repeats by running ChorusNGSfilter. Before running ChorusNGSfilter, a set of whole-genome shotgun sequencing data is required. Here we download the shotgun reads of Arabidopsis with the accession number SRR5658649.
$ wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR565/009/SRR5658649/SRR5658649_1.fastq.gz
$ wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR565/009/SRR5658649/SRR5658649_2.fastq.gz
$ ChorusNGSfilter -i SRR5658649_1.fq.gz,SRR5658649_2.fq.gz -z gz \
-g TAIR10_chr_all.fas -t 12 \
-p probes/TAIR10_chr_all.fas_all.bed -o probes/TAIR10_chr_all_SRR5658649.bed
After running NGS filtering, three files (*.jf, *.bw, *.bed) will output to working directory.
TAIR10_chr_all_SRR5658649.bed.jf is the binary file created by jellyfish count using given k-mer (Default is 17).
TAIR10_chr_all_SRR5658649.bed.bw is a bigwig file contained all score infomation generated from NGS library.
TAIR10_chr_all_SRR5658649.bed is the the probe file contained all probes as well as k-mer score and strand. This file should be further selected by ChorusNGSselect.
$ more probes/TAIR10_chr_all_SRR5658649.bed
1 12 56 AAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCT 455128 +
1 18 62 TAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCT 346 +
1 24 68 CTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAAT 343 +
1 36 80 ATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG 319 +
1 42 86 AATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATGAATCCC 315 +
1 48 92 TAAATCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATA 294 +
There are six columns in each row, first four columns are the same as TAIR10_chr_all.fas_all.bed. The fifth column is the k-mer score, last column is target strand of probes.
Automatic probe selection using ChorusNGSselect
Probes should be filtered by kmer score, the process can be done by ChorusNGSselect.
$ ChorusNGSselect -i probes/TAIR10_chr_all_SRR5658649.bed \
-o probes/TAIR10_chr_all_SRR5658649_filter.bed
ChorusNGSselect will generate a final filtered probe file, it looks this:
$ more probes/TAIR10_chr_all_SRR5658649_filter.bed
1 36 80 ATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG 319 +
1 66 110 CGGGTTTAGGGAATTAGGTATTTAGGGATTCATGGATGTAGGATT 221 -
1 215 259 AGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTCACTT 293 +
1 245 289 ATAACAAATGAAGATAAACCATCCATAGCTAAGTGAAGGAAAGAA 291 -
1 347 391 CTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTG 237 +
1 425 469 TTTTCCCACAAAAGATAAATTGATAGAACAAACATTTCCACAAAG 360 -
The final probes can be synthesized directly for oligo-FISH or imported into ChorusPBGUI for further selection.
Run Chorus2 with GUI¶
Make a project folder
$ cd ~
$ mkdir sampleproject
$ cd sampleproject
Download reference genome
$ wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas
Run ChorusGUI
$ ChorusGUI
Set your own parameters and click Run to start the design process.
When job finish, the oligo probes will output to Sample Folder where you set.
Further filter using ChorusNGSfilter
The same process as “Run Chorus2 in terminal”
Automatic probe selection using ChorusNGSselect
The same process as “Run Chorus2 in terminal”
Run ChorusPBGUI
After filtering the probes, users can select suitable number of probes in specific regions for their FISH experiments using ChorusPBGUI easily.
$ ChorusPBGUI
Use ChorusNoRef to design oligo probes without a reference genome¶
In this tutorial, we will design oligo probes for two wild potato species, S. etuberosum and S. jamesii, the two species do not have reference genomes.
Run ChorusNoRef¶
Make a project folder
$ cd ~
$ mkdir sampleproject
$ cd sampleproject
Download Genome file of close related species
$ wget http://solanaceae.plantbiology.msu.edu/data/potato_dm_v404_all_pm_un.fasta.zip
$ unzip potato_dm_v404_all_pm_un.fasta.zip
Download shotgun sequences of all species (at least 5x reads)
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR534/006/SRR5349606/SRR5349606_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR534/006/SRR5349606/SRR5349606_2.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR534/003/SRR5349573/SRR5349573_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR534/003/SRR5349573/SRR5349573_2.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR534/004/SRR5349574/SRR5349574_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR534/004/SRR5349574/SRR5349574_2.fastq.gz
SRR5349606 is from S. tuberosum (DM404), SRR5349573 is from S. etuberosum, SRR5349574 is from S. jamesii.
Run Chorus2, ChorusNGSfilter and ChorusNGSselect to design probes in related species
Run Chorus2
$ Chorus2 -i potato_dm_v404_all_pm_un.fasta -g potato_dm_v404_all_pm_un.fasta -t 12
Run ChorusNGSfilter
$ ChorusNGSfilter -g potato_dm_v404_all_pm_un.fasta -i SRR5349606_1.fastq.gz,SRR5349606_2.fastq.gz -t 12 \
-p probes/potato_dm_v404_all_pm_un.fasta_all.bed -o potato_dm_v404_all_pm_un.fasta_kmer.bed
Run ChorusNGSselect
$ ChorusNGSselect -i potato_dm_v404_all_pm_un.fasta_kmer.bed -o potato_dm_v404_all_pm_un.fasta_kmerfiltered.bed
Run ChorusNoRef to design probes in target species
$ ChorusNoRef -g potato_dm_v404_all_pm_un.fasta -p potato_dm_v404_all_pm_un.fasta_kmerfiltered.bed \
-r1 SRR5349573_1.fastq.gz,SRR5349574_1.fastq.gz -r2 SRR5349573_2.fastq.gz,SRR5349574_2.fastq.gz \
-n etuberosum,jamesii -t 12
Check the designed probes
Output files will be saved to “noRefprobes” folder. 5 files generated.
$ ls -lh noRefprobes
-rw-rw-r-- 1 liu liu 5.2M 6月 28 17:14 etuberosum_indel_probe.txt
-rw-rw-r-- 1 liu liu 104M 6月 28 17:14 etuberosum_jamesii_cns_probe.csv
-rw-rw-r-- 1 liu liu 62M 6月 28 17:14 etuberosum_probe.txt
-rw-rw-r-- 1 liu liu 5.2M 6月 28 17:14 jamesii_indel_probe.txt
-rw-rw-r-- 1 liu liu 62M 6月 28 17:14 jamesii_probe.txt
drwxrwxr-x 2 liu liu 4.0K 6月 28 16:25 tmp
etuberosum_probe.txt and jamesii_probe.txt are probes with SNPs or identical compared to DM. etuberosum_indel_probe.txt and jamesii_indel_probe.txt are probes with indels compared to DM. etuberosum_jamesii_cns_probe.csv is consensus probes among three species after quality filter.
For probes with SNPs:
$ head -n 3 etuberosum_probe.txt
chr00 130544 130588 AGATTTTGCCCATTCTCATGACGCTTTTGTGATTTCAAAACTTTG 366 + AGATTTAGCTCATTTTCATGGCGATTTTGTGATTTCAAGACTTTG 4
chr00 129321 129365 AATACTATTAGATGATGACTAAGAGTAATGCTAGTGTATATAAAT 262 - CTTTATATACACTAGCATTACTCTTAGTCATCATCTAATATTATT 3
chr00 174138 174182 TTATAGTTGTCTAGGATGGAAGGGTTCTTGATTCACTGGTGTTGA 341 - TCAACACTAGCGAATCAAGAACCCTTCCATCCTAGACAACTATAA 2
column 1-3 is location of this probe base on reference genome, column 4 is probe from reference genome, column 5 is kmer score, column 6 is strand, column 7 is probe for S. etuberosum, column 8 is how many copy can be found in etuberosum illumina reads.
For probes with indels:
$ head -n 3 etuberosum_probe.txt
chr00 298161 298205 TGATGAAGGTGAAAGTAGCATAGATCATGGGGAGTTGTTTGGATT 456 + etuberosum TGATGAAGGTGAAAGTAGCATAGTGCATAGATCATGGGGAGTTGTTTGGATT
chr00 298247 298291 GAATGATGAGTCAATCTGATAATTCATAGAATCAAATTTGTATGA 281 + etuberosum GAGTGATGAGTCAATCCATAAAGGCACCTGATAATTCATAGAATCAAATTTGTATTA
chr00 298193 298237 TCTTTAATTTACACCATAAAGTTTACTCACAAAATCCAAACAACT 495 - etuberosum AGTTGTTTGGATTTTGTGAAGAGAGCAGTAAACTTTATGGTGTAAATAAAAGA
column 1-3 is location of this probe based on reference genome, column 4 is probe from reference genome, column 5 is kmer score, column 6 is strand, column 7 is sample name, column 8 is probe for S. etuberosum.
For consensus probes:
$ head -n 3 etuberosum_jamesii_cns_probe.csv
chrom,start,end,refseq,etuberosum,jamesii,consensusprobe,consensusscore,consensussite,consensusdiff
chr00,130544,130588,AGATTTTGCCCATTCTCATGACGCTTTTGTGATTTCAAAACTTTG,AGATTTAGCTCATTTTCATGGCGATTTTGTGATTTCAAGACTTTG,AGATTTAACCCATTTTCATGGCGCTTTTGTAATTTCAAGACTTTG,
AGATTTAGCCCATTTTCATGGCGCTTTTGTGATTTCAAGACTTTG,0.9407407407407408,37,8
chr00,129321,129365,AATACTATTAGATGATGACTAAGAGTAATGCTAGTGTATATAAAT,CTTTATATACACTAGCATTACTCTTAGTCATCATCTAATATTATT,CTTTATATACACTAGCATTACTCTTAGTCATCATCTAATATTGCT,
CTTTATATACACTAGCATTACTCTTAGTCATCATCTAATATTAAT,0.7407407407407407,11,35
column 1-3 is location of probe based on reference genome, column 4-6 are probes in DM, etuberosum and jamesii, respectively. consensusprobe means the consensus probe among three species. consensusscore is calculated with fomula:
(probe length * number of species - number of difference) / (probe length * number of species)
Consensussite means all identical nt in cns probe. consensusdiff means how many nt different compare with cns probe.