GuideScan

A generalized CRISPR guideRNA design tool.

Help


FAQs

  1. Where can I go to learn more about the GuideScan software?
  2. How can I cite GuideScan?
  3. How do I input genomic coordinates into GuideScan's input box?
  4. What gene symbols do you support for the specific genomic assemblies?
  5. What type of files can I upload to the GuideScan site?
  6. What happens if I upload a BED file to GuideScan?
  7. What happens if I upload a GTF file to GuideScan?
  8. What happens if I upload a Fasta file to GuideScan?
  9. What happens if I upload a TXT file to GuideScan?
  10. How can I convert my raw sequence data into Fasta file format for upload to GuideScan?
  11. What does the score reported by GuideScan for Cas9 databases signify?
  12. How are the specificity scores computed?
  13. What are the characteristics of the GuideScan databases?
  14. How does GuideScan search for gRNAs?
  15. How does GuideScan sort results?
  16. How does GuideScan display results?
  17. What's contained in the GuideScan site's Cas9 output table?
  18. What's contained in the GuideScan site's Cpf1 output table?
  19. What's contained in the GuideScan download?
  20. How do I interpret the GuideScan off-target information?
  21. I have a sequence. How do I convert it to coordinates for input into GuideScan?
  22. I have coordinates for an older genome assembly (hg19, mm9). How can I convert these coordinates to the newest assembly?
  23. What is represented with the GuideScan annotations?
  24. What does it mean if GuideScan returned no gRNAs for query region(s)?
  25. I get a newline character error when I enter a list input inside the input box, how do I get around this?
  26. Where can I find the GuideScan source code?
  27. I don't want to install the dependencies required by GuideScan. Where can I find the Docker image?
    • This sounds appealing, but I haven't used Docker images before. Can you show me how to use the Docker image?
  28. How can I get the databases the GuideScan site runs off of?
  29. How can I query the databases locally using the GuideScan software?

Citation

GuideScan software for improved single and paired CRISPR guide RNA design.
Perez AR, Pritykin Y, Vidigal JA, Chhangawala S, Zamparo L, Leslie CS, Ventura A.
Nat Biotechnol. 2017 Mar 6. doi: 10.1038/nbt.3804. PMID: 28263296

Textbox Input

Coordinates must be of form: chr#:StartCoordinate-EndCoordinate

Enter multiple coordinates one per line such as:

chr4:312000-315000
chr4:313000-317000
chr4:315000-319000

Gene symbols appropriate for the specified organism are also accepted. After entering coordinates and/or gene symbols, select Parameters and press 'guide me'.

hg38 is the genome assembly for Homo sapiens (human)
mm10 is the genome assembly for Mus musculus (mouse)
danRer10 is the genome assembly for Danio rerio (zebrafish)
dm6 is the genome assembly for Drosophila melanogaster (fruit fly)
ce11 is the genome assembly for Caenorhabditis elegans (worm)
SacCerv is the genome assembly (SacCer3) for Saccharomyces cerevisiae (yeast)

At most 1000 lines can be entered, and query regions cannot be larger than 2500 kilobases. These restrictions do not apply to the command line version of GuideScan.
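The input rules above can be checked before pasting a query into the textbox. The following is a minimal sketch, not GuideScan's own validation code: the function names and the regular expression are assumptions made for illustration, and any line that is not a coordinate is simply treated as a presumed gene symbol.

```python
import re

# Assumed pattern for chr#:StartCoordinate-EndCoordinate, e.g. chr4:312000-315000.
# Allowing names such as chrX is an assumption, not something stated by the site.
COORD_RE = re.compile(r"^(chr\w+):(\d+)-(\d+)$")

MAX_LINES = 1000             # line limit stated for the textbox
MAX_REGION_BP = 2_500_000    # 2500 kilobases, as stated for the textbox


def parse_coordinate(line):
    """Return (chrom, start, end) for a coordinate line, or None otherwise."""
    match = COORD_RE.match(line.strip())
    if match is None:
        return None
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is greater than end: {line!r}")
    return chrom, start, end


def check_query(text):
    """Split textbox input into coordinates and presumed gene symbols."""
    # splitlines() copes with both Unix and Windows line endings.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) > MAX_LINES:
        raise ValueError("more than 1000 lines")
    coords, symbols = [], []
    for ln in lines:
        parsed = parse_coordinate(ln)
        if parsed is None:
            symbols.append(ln)  # not a coordinate, so assume a gene symbol
        elif parsed[2] - parsed[1] > MAX_REGION_BP:
            raise ValueError(f"region larger than 2500 kb: {ln}")
        else:
            coords.append(parsed)
    return coords, symbols
```

For example, `check_query("chr4:310000-320000\nTP53")` separates the coordinate from the gene symbol, so each can be reviewed before submission.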

Gene Symbols

GuideScan allows a genome to be queried by gene symbol. To facilitate this type of query, we release BED files with the gene symbols and coordinates of each gene for the respective genome assemblies. An example of such a query, with a mix of genomic coordinates and gene symbols, for the hg38 assembly of the human genome would be as follows:


chr4:310000-320000
TP53

BED files:
hg38 mm10 dm6 danRer10 ce11 SacCerv

Upload file

Upload a BED file, GFF/GTF file, or TXT file to be processed. Standard BED and GTF formats are expected. For a TXT file, the site expects a single column of genomic coordinates of the form chr#:StartCoordinate-EndCoordinate, with one genomic coordinate per line. Once the file is uploaded, press 'guide me' to process it. Processing follows the parameters detailed in the 'Parameters' section.

At most 1000 lines can be entered, and query regions cannot be larger than 100 kilobases. These restrictions do not apply to the command line version of GuideScan.

If the file does not conform to BED, GFF/GTF, or the specified TXT format, it will not be processed.

BED file format

BED files have three required fields and nine optional fields. GuideScan needs only the required fields to process the file. The fourth field of a BED file holds the name of the line and can serve as a unique identifier for it. GuideScan checks whether all values in the fourth field are unique. If they are, these values are used as unique identifiers in the output file. If there is even one duplicate, arbitrary naming is used instead. Arbitrary naming is also used if only the three required fields are present.
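The naming rule described above can be sketched as follows. This is an illustration under stated assumptions rather than GuideScan's implementation: the function name and the `region_N` label format are hypothetical, and the arbitrary names GuideScan actually generates may look different.

```python
def label_bed_regions(bed_lines):
    """Return (label, chrom, start, end) tuples for the lines of a BED file.

    The fourth field is used as the label only when every line has one and
    all of them are unique; otherwise arbitrary labels are generated.
    """
    rows = [
        ln.split()
        for ln in bed_lines
        if ln.strip() and not ln.startswith(("#", "track", "browser"))
    ]
    names = [fields[3] for fields in rows if len(fields) >= 4]
    use_names = len(names) == len(rows) and len(set(names)) == len(names)

    labeled = []
    for index, fields in enumerate(rows, start=1):
        # "region_N" is a hypothetical stand-in for an arbitrary name.
        label = fields[3] if use_names else f"region_{index}"
        labeled.append((label, fields[0], int(fields[1]), int(fields[2])))
    return labeled
```

A file whose name column contains a single repeated value falls back to arbitrary labels for every line, not only for the duplicated ones.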

GFF/GTF file format

GFF files have nine required tab-delimited fields. GTF files have the same nine required fields plus two more that associate the line sequence with its genomic source. GFF files are not required to have any unique fields and the additional two fields present in the GTF file are not required to be unique to the line they sit on. Consequently, if GuideScan determines a file is a GFF or GTF file (by verifying that field one has chromosome information, field four has the start coordinate, and field five has the end coordinate) it will enact arbitrary labeling.

Fasta file format

Fasta files begin with a > symbol followed by a description string, all on one line. The raw sequence data is listed below this > line. A Fasta file can contain multiple sequence regions provided each region is delimited by its own > descriptor. If GuideScan determines that an uploaded file is a Fasta file, it will locally align the sequence to the specified genome using the UCSC Genome Browser BLAT tool. The resulting targets are processed further by GuideScan only if BLAT returns one or more perfect alignments to the target genome.

TXT file format

TXT files can also be accepted by GuideScan provided they consist of one genomic coordinate per line. The genomic coordinates must take the form chr#:start-end. If more than one column is present in the TXT file it will not be processed, even if the first column corresponds to the prescribed format. Arbitrary labeling of gRNAs will always be enacted if GuideScan determines a TXT file has been uploaded.

Transform Raw Sequence data into a Fasta file

Raw sequence data can be processed by GuideScan provided it is given in the format of a Fasta file. Sequence data can be converted to Fasta format manually or through the following web tool. The resulting Fasta file may then be uploaded and processed by GuideScan.
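For readers who prefer to do the conversion with a script, a minimal sketch is given below. The function name, the default description string, and the 60-character line width are illustrative choices, not requirements stated by GuideScan.

```python
def to_fasta(sequence, description="query_sequence", width=60):
    """Wrap a raw DNA sequence into a single-record Fasta string."""
    # Remove whitespace and line breaks, then normalise to upper case.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    invalid = set(cleaned) - set("ACGTN")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    # Break the sequence into fixed-width lines below the > descriptor.
    body = "\n".join(
        cleaned[i:i + width] for i in range(0, len(cleaned), width)
    )
    return f">{description}\n{body}\n"
```

The returned string can be written to a file and uploaded. For several sequences, call the function once per sequence with distinct descriptions and concatenate the results.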

Specificity Score (Cas9)

The specificity of each Cas9 gRNA is determined by aggregating the cutting frequency determination (CFD) scores described in Doench et al. Nature Biotechnology, 2016. Each gRNA in the web interface had its offtargets enumerated out to a distance of three mismatches. For a specific gRNA, the CFD score was computed for each offtarget within three mismatches of the target site. The resulting CFD scores were combined into a single composite score using the aggregate summation procedure proposed by Hsu et al. CFD scores were originally defined for Cas9 gRNAs with a complementary region of 20nt, so specificity scores are only available for Cas9 databases with gRNAs of length 20nt.
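The exact aggregation formula is not spelled out on this page. As a rough illustration only, the sketch below assumes the composite score takes the form 1 / (1 + sum of offtarget CFD scores), which is one common way of applying a Hsu-style summation to CFD scores. The function name is hypothetical, and the CFD scores themselves are assumed to have been computed elsewhere.

```python
def aggregate_specificity(offtarget_cfd_scores):
    """Combine per-offtarget CFD scores into one specificity value.

    Assumed form: 1 / (1 + sum of CFD scores). A gRNA with no enumerated
    offtargets scores 1.0, and the score falls as offtargets accumulate.
    """
    total = 0.0
    for score in offtarget_cfd_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"CFD scores are expected in [0, 1]: {score}")
        total += score
    return 1.0 / (1.0 + total)
```

Under this assumed form, two offtargets with CFD scores of 0.5 each give a specificity of 0.5, while a gRNA with no offtargets within three mismatches gives 1.0.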

Score (Cas9)

The scores reported by GuideScan for Cas9 databases are the Rule Set 2 scores described in Doench et al. Nature Biotechnology, 2016. Rule Set 2 scores were originally defined for Cas9 gRNAs with a complementary region of 20nt, so scores are only available for Cas9 databases with gRNAs of length 20nt. While rare, some gRNAs in the Cas9 databases have a * as their score. These gRNAs have predicted cutting efficiency scores below zero and would likely be inefficiently transcribed from U6 promoters due to premature termination.

Parameters

Genome

GuideScan processed genome databases for Cas9 and Cpf1.

  • All gRNAs are unique in the genome (no gRNA occurs more than once in the genome with NGG (Cas9) or TTTN (Cpf1) PAMs, and no gRNA occurs in the genome with the NAG (Cas9) PAM).
  • All gRNAs have no potential offtarget cut sites 1 mismatch away.
  • All gRNAs are distinct from any other potential gRNA by at least 2 mismatches.

How to Search for Genomic Regions

Search for gRNAs within or flanking a queried region.

  • Within reports gRNAs found within the queried coordinates
  • Flanking reports gRNAs found within a specified flanking distance upstream and downstream of the queried coordinates. The flanking distance can be set to a maximum of 100kb. If a flanking distance greater than 100kb is entered, the GuideScan site will default the flanking distance to 100kb. In the GuideScan package this parameter can be set to any flanking distance.
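The flanking behaviour described above can be sketched as a small helper. This is an illustration, not GuideScan code: the function name is hypothetical, and the exact boundary convention (whether the query start and end positions themselves are included) and the clamping at position 0 are assumptions.

```python
MAX_FLANK_BP = 100_000  # 100kb cap applied by the site


def flanking_regions(chrom, start, end, flank):
    """Return the (left, right) regions searched around a query region."""
    flank = min(flank, MAX_FLANK_BP)  # larger requests default to 100kb
    left = (chrom, max(start - flank, 0), start)
    right = (chrom, end, end + flank)
    return left, right
```

For example, a request for a 250kb flank around chr4:312000-315000 would be reduced to 100kb on each side under this sketch.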

Sort Results by

Order all gRNAs by Fewest Offtargets, Coordinates Closest to Query Boundary, Specificity, or Cutting Efficiency Score (computed by Rule Set 2, described in Doench et al. Nature Biotechnology, 2016).

  • Order by Fewest Offtargets reports the gRNAs with the fewest offtargets at 2 and 3 mismatches first and the most offtargets last.
  • Order by Coordinates Closest to Query Boundary with the Within parameter selected reports the gRNA appearing closest to the start coordinate first and the gRNA closest to the end coordinate last. If order by Coordinates Closest to Query Boundary is selected with the Flanking parameter then the gRNAs left flanking the query region are reported with the gRNA appearing closest to the start coordinate first and the gRNAs right flanking the query region are reported with the gRNA appearing closest to the end coordinate first.
  • Order by Cutting Efficiency Score reports the gRNAs with the highest cutting efficiency scores first and the gRNAs with the lowest cutting efficiency scores last. This option is only available for Cas9 databases.
  • Order by Specificity reports the gRNAs with the highest specificity scores first and the gRNAs with the lowest specificity scores last. This option is only available for Cas9 databases.

Results Display

Display all found gRNAs or have GuideScan choose n optimal gRNAs based on the Sort Results by parameter.

  • Display All Guides displays all gRNAs ordered by the Sort Results by parameter.
  • Display Top will automatically select the top n gRNAs for each query region using a double sort technique according to the following algorithm.
    • If Top is selected with Sort Results by set to Fewest Offtargets, the gRNA list is first sorted by Fewest Offtargets, the top n are taken, and these are then re-sorted by Cutting Efficiency Score (Cas9). If Cutting Efficiency Scores are not present (Cpf1), the re-sort is done according to Coordinates Closest to Query Boundary. This second sorted list is reported.
    • If Top is selected with Sort Results by set to Coordinates Closest to Query Boundary, the gRNA list is first sorted by Coordinates Closest to Query Boundary, the top n are taken, and these are then re-sorted by Fewest Offtargets. This second sorted list is reported.
    • If Top is selected with Sort Results by set to Cutting Efficiency Score, the gRNA list is first sorted by Cutting Efficiency Score, the top n are taken, and these are then re-sorted by Fewest Offtargets. This second sorted list is reported. This option is only available for Cas9 databases.
    • If Top is selected with Sort Results by set to Specificity, the gRNA list is first sorted by Specificity score, the top n are taken, and these are then re-sorted by Cutting Efficiency Score. This second sorted list is reported. This option is only available for Cas9 databases.
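The double sort described above can be sketched in a few lines. This is a minimal illustration and not the site's code: the dictionary keys (`offtargets`, `distance`, `efficiency`, `specificity`) and the function names are hypothetical, and `distance` is assumed to already hold each gRNA's distance to the relevant query boundary.

```python
# Each key function orders the "best" gRNA first for that criterion.
SORT_KEYS = {
    "offtargets": lambda g: g["offtargets"],     # fewest first
    "coordinates": lambda g: g["distance"],      # closest first
    "efficiency": lambda g: -g["efficiency"],    # highest first
    "specificity": lambda g: -g["specificity"],  # highest first
}


def second_sort_for(first, has_efficiency=True):
    """Pick the criterion used to re-sort the top n gRNAs."""
    if first == "offtargets":
        # Cpf1 databases have no efficiency scores, so fall back.
        return "efficiency" if has_efficiency else "coordinates"
    if first in ("coordinates", "efficiency"):
        return "offtargets"
    if first == "specificity":
        return "efficiency"
    raise ValueError(f"unknown sort criterion: {first}")


def display_top(guides, n, sort_by, has_efficiency=True):
    """Take the top n gRNAs by sort_by, then re-sort them."""
    top = sorted(guides, key=SORT_KEYS[sort_by])[:n]
    second = second_sort_for(sort_by, has_efficiency)
    return sorted(top, key=SORT_KEYS[second])
```

Note that the second sort only reorders the n gRNAs already selected. It never brings back a gRNA that the first sort excluded.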



gRNA Output Table (Cas9)

The offtarget table produced from a textbox query lists the first 500 gRNAs. To access all the gRNAs, download the accompanying file.

  1. The first column represents the genomic coordinates of the target sites in the queried coordinate region.
  2. The second column displays the gRNA.
  3. The third column represents the number of offtargets for the gRNA that were enumerated at 2 and 3 mismatches. No gRNA has duplicate sites (with NGG and NAG) anywhere in the genome or an offtarget 1 mismatch away. In other words, all gRNAs are unique with the NGG PAM, do not occur with the NAG PAM, and are distinct from all other gRNAs in the genome by at least 2 mismatches.
  4. The fourth column shows how many offtargets for the gRNA are found at 2 and 3 mismatches respectively and is read as (2 mismatches: number of enumerated offtargets at 2 mismatches | 3 mismatches : number of enumerated offtargets at 3 mismatches).
  5. The fifth column displays on-target cutting efficiency score.
  6. The sixth column displays on-target cutting specificity score.
  7. The seventh column reports exonic annotation if the gRNA overlaps an exon. If the gRNA does not overlap an exon then a * is reported.


gRNA Output Table (Cpf1)

The offtarget table produced from a textbox query lists the first 500 gRNAs. To access all the gRNAs, download the accompanying file.

  1. The first column represents the genomic coordinates of the target sites in the queried coordinate region.
  2. The second column displays the gRNA.
  3. The third column represents the number of offtargets for the gRNA that were enumerated at 2 and 3 mismatches. No gRNA has duplicate sites (with the TTTN PAM) anywhere in the genome or an offtarget 1 mismatch away. In other words, all gRNAs are unique with the TTTN PAM and are distinct from all other gRNAs in the genome by at least 2 mismatches.
  4. The fourth column shows how many offtargets for the gRNA are found at 2 and 3 mismatches respectively and is read as (2 mismatches: number of enumerated offtargets at 2 mismatches | 3 mismatches : number of enumerated offtargets at 3 mismatches).
  5. The fifth column reports exonic annotation if the gRNA overlaps an exon. If the gRNA does not overlap an exon then a * is reported.

Download

After pressing 'guide me', either the uploaded file or the coordinates entered in the textbox will be processed, and an Excel file of all the output will be generated for download.

Offtargets

The Offtargets page for a given gRNA displays a list of all enumerated offtargets at 2 and 3 mismatches. Up to 1000 offtargets and their coordinates are displayed. If the gRNA offtarget overlaps an exon, the RefSeq exon annotation is reported; otherwise a * is reported.

The offtargets are reported in the format 2:a|3:b. This is to be interpreted as follows: at 2 mismatches there are 'a' enumerated potential offtargets, and at 3 mismatches there are 'b' enumerated potential offtargets.
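When working with the downloaded output in a script, the 2:a|3:b summary can be split into counts per mismatch distance. The sketch below is a hypothetical helper written for this page, not part of the GuideScan package.

```python
def parse_offtarget_summary(summary):
    """Turn a summary such as '2:4|3:17' into {2: 4, 3: 17}."""
    counts = {}
    for part in summary.split("|"):
        mismatches, count = part.split(":")
        counts[int(mismatches)] = int(count)
    return counts
```

For example, `parse_offtarget_summary("2:4|3:17")` returns a dictionary mapping 2 mismatches to 4 offtargets and 3 mismatches to 17 offtargets, and summing its values gives the total number of enumerated offtargets.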

For a particular offtarget sequence, at most 10 coordinates of such an offtarget sequence are reported.


DNA Sequence to Genomic Coordinates

To convert a DNA sequence to genomic coordinates accepted by GuideScan:

  1. Go to UCSC Genome Browser BLAT tool
  2. Insert or upload genomic sequence into BLAT. Ensure the proper organism and genome assembly are selected. In this example we investigate a human DNA sequence from the hg38 assembly.



  3. Click 'submit'
  4. Select the result with the highest (ideally 100%) IDENTITY percentage. Construct the genomic coordinate by:
    • begin coordinate by writing chr
    • append value in CHRO column to chr
    • append a colon
    • append value in START column
    • append a hyphen
    • append value in END column
    • Overall result for our example would be chr9:133279212-133280861
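The construction in step 4 can also be written as a short helper, which is convenient when converting many BLAT hits. The function name is hypothetical, and the values are assumed to have been copied from the CHRO, START, and END columns of the chosen BLAT result.

```python
def blat_hit_to_coordinate(chrom_value, start, end):
    """Build a GuideScan coordinate from BLAT CHRO, START and END values."""
    chrom = str(chrom_value)
    # Prepend chr unless the value already carries it.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return f"{chrom}:{int(start)}-{int(end)}"
```

With the values from the example above, `blat_hit_to_coordinate("9", 133279212, 133280861)` gives chr9:133279212-133280861.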


Converting Genomic Coordinates to Newest Assembly

To convert genomic coordinates from an older genome assembly (e.g., hg19) to a newer genome assembly (e.g., hg38):

  • Go to UCSC LiftOver tool
  • Ensure your Original Genome and Original Assembly correspond to the older genomic assembly and your New Genome and New Assembly correspond to the newer genome assembly
  • Paste or upload a file with your genomic coordinates
  • Enter Submit or Submit File respectively

Annotations

Annotations represent exonic annotations for a given organism taken from UCSC Table Browser tool. If the coordinates of a gRNA or its offtargets overlap the coordinates of an exon, then the RefSeq name(s) of the overlapping exon are reported in the gRNA Output Table or Offtargets page for the given gRNA respectively. If no exon is overlapped then '*' is returned.
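The overlap rule described above amounts to a simple interval test. The sketch below is an illustration only: the function name and the exon tuple layout are hypothetical, and treating both intervals as closed (end positions included) is an assumption about the coordinate convention.

```python
def annotate(chrom, start, end, exons):
    """Return names of exons overlapping a region, or '*' if there are none.

    exons is an iterable of (chrom, start, end, name) tuples.
    """
    names = [
        name
        for exon_chrom, exon_start, exon_end, name in exons
        if exon_chrom == chrom and exon_start <= end and start <= exon_end
    ]
    return ",".join(names) if names else "*"
```

A gRNA or offtarget that overlaps several exons would be reported with all of their names under this sketch, matching the "name(s)" wording above.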

No gRNAs in Queried Region

No gRNAs will be returned for a queried region when:

  • no NGG PAM sequences are found in the queried region thereby yielding no gRNAs
  • all potential gRNAs have perfect occurrences elsewhere with the NGG and/or NAG PAM sequence
  • all potential gRNAs in a queried region have offtargets within 1 mismatch
  • the queried region has uncertain sequence composition (e.g., NNNNNNN)

Newline Errors from Input Box

The input box is sensitive to newline characters, which unfortunately are hard to detect from the user end. If you encounter newline character errors that seem nonsensical, please convert your query into a .txt file with one coordinate or gene symbol per line and upload the resulting file. The upload function handles newline characters better and is equally fast.

GuideScan code repository

The GuideScan source code can be found in an online repository. New versions of the code will periodically be pushed. The source code is under an MIT software license.

GuideScan Docker Image

A GuideScan Docker image is also provided for command line usage at xerez/guidescan on Docker Hub. New Docker images will be created for new versions of GuideScan. GuideScan functionality can be accessed through the Docker image by calling the package entry points.

Using the Docker Image

  1. Download the appropriate Docker package and Docker Toolbox for your system.
  2. Pull xerez/guidescan from Docker Hub.


  3. Inspect the Image file.


  4. Run the Image file. This will create an object known as a Container, in which the software from the Image file is running. Running the command below will drop you inside the Container.


  5. Get the Container ID.


  6. Copy data from your local system into the Container. Use the Container ID for this copy and use the filepath shown in the picture below inside the Container.


  7. Verify the data was copied into the Container.


  8. Call the GuideScan software through its entry points.
    • guidescan_processer -h ----generate CRISPR databases from FASTA file(s)----
    • guidescan_bamdata -h ----modify existing CRISPR GuideScan database----
    • guidescan_guidequery -h ----query a GuideScan database----
    • guidescan_cutting_efficiency_processer -h ----compute and insert Rule Set 2 cutting efficiency scores into a Cas9 20mer GuideScan database----
    • guidescan_cutting_specificity_processer -h ----compute and insert specificity scores into a Cas9 20mer GuideScan database----


  9. The output will be written to the directory where the entry point command was run.


  10. The output can be copied back to the local system from the Container. Note that closing the Container will cause a loss of data. To save the data in the Container, refer to the next step.


  11. The Container with data can be saved as a new Image file. This will allow a user to refer to GuideScan outputs by simply running the new Image file in a new Container.


  12. This will create a new Image file with a new Image ID.


  13. For completeness we can drop into a new Container using the new Image file and verify that our data is still there.


Download Cas9 databases

Given the size of the posted databases, the HTTP download protocol may be insufficient and can lead to corrupted database files. As an alternative:

  1. Right click the database BAM and Index hyperlinks
  2. Select 'Copy Link Address' Option


  3. Open a terminal
  4. Change to directory where you desire to host the database files
  5. Type: wget into the terminal command line and paste the copied link


  6. Ensure that the md5sum of the downloaded file is identical to the sum listed on the site for the given BAM file database
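On systems without the md5sum command, the check in step 6 can be done with Python's standard hashlib module. The sketch below is one way to do it. The function names are illustrative, and the file is read in chunks because the database files are large.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed_md5):
    """Compare a downloaded file against the sum listed on the site."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

To use it, pass the path of the downloaded BAM file and the md5 value copied from the database listing for that assembly. A result of False suggests the download was corrupted and should be repeated.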


GuideScan Cas9 site databases

ce11 BAM Index md5 BAM: 3780137f73a1d4aaaf6ac78220d6e099
dm6 BAM Index md5 BAM: f1d820ef3671a43a4e1b6d3dfbea5447
hg38 BAM Index md5 BAM: 82b24c3fa8deeb21c4913f89674c5304
mm10 BAM Index md5 BAM: af3287c3ca8b856d207f7463d4c010dd
SacCerv BAM Index md5 BAM: fc80c926ba2c6bc9edceabbbab7b3552
danRer10 BAM Index md5 BAM: 9bbf0620779ccb0123f3910a9c977efe

GuideScan Cpf1 site databases

ce11 BAM Index md5 BAM: 42337ad8a0659004dce41c7e1829a174
dm6 BAM Index md5 BAM: 47864fac778a540a7d788e0063bfa147
hg38 BAM Index md5 BAM: 8893be82aba772b01638652c8bfeeb51
mm10 BAM Index md5 BAM: b45fc0ecf4cefdb457d2e2cb70c9807e
SacCerv BAM Index md5 BAM: 8ce9400f0ce078db490fc0fe770a0e8a
danRer10 BAM Index md5 BAM: c10946f304bdb47bd5b1c74ac20f743b

Query GuideScan databases locally

To query the databases using the GuideScan software, use the package entry point guidescan_guidequery. Run guidescan_guidequery -h to see its options.