Download

Frequently asked questions

Publications

Mailing lists

User's manual

Acknowledgments

Development

Wiki

Manual

NAME
Ray - assemble genomes in parallel using the message-passing interface

SYNOPSIS
mpiexec -n 80 Ray -k 31 -p l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq -o test

mpiexec -n 80 Ray Ray.conf # with commands in a file

mpiexec -n 80 Ray -k 31 -detect-sequence-files SampleDirectory # auto-detection

mpiexec -n 10 Ray -mini-ranks-per-rank 7 Ray.conf # with mini-ranks

DESCRIPTION:

The Ray genome assembler is built on top of the RayPlatform, a generic plugin-based
distributed and parallel compute engine that uses the message-passing interface
for passing messages.

Ray targets several applications:

- de novo genome assembly (with Ray vanilla)
- de novo meta-genome assembly (with Ray Méta)
- de novo transcriptome assembly (works, but not tested a lot)
- quantification of contig abundances
- quantification of microbiome consortia members (with Ray Communities)
- quantification of transcript expression
- taxonomy profiling of samples (with Ray Communities)
- gene ontology profiling of samples (with Ray Ontologies)

- compare DNA samples using words (Ray -run-surveyor ...; see Ray Surveyor options)

-help
Displays this help page.

-version
Displays Ray version and compilation options.

Run Ray in pure MPI mode

mpiexec -n 80 Ray ...

Run Ray with mini-ranks on 10 machines, 8 cores / machine (MPI and IEEE POSIX threads)

mpiexec -n 10 Ray -mini-ranks-per-rank 7 ...

Run Ray on one core only (still needs MPI)

Ray ...

Using a configuration file

Ray can be launched with
mpiexec -n 16 Ray Ray.conf
The configuration file can include comments (starting with #).

K-mer length

-k kmerLength
Selects the length of k-mers. The default value is 21.
It must be odd because reverse-complement vertices are stored together.
The maximum length is defined at compilation by CONFIG_MAXKMERLENGTH
Larger k-mers utilise more memory.

Inputs

-detect-sequence-files SampleDirectory
Detects files in a directory automatically.
This option can generate these commands automatically for you: LoadPairedEndReads (-p) and LoadSingleEndReads (-s)

-p leftSequenceFile rightSequenceFile [averageOuterDistance standardDeviation]
Provides two files containing paired-end reads.
averageOuterDistance and standardDeviation are automatically computed if not provided.
LoadPairedEndReads is equivalent to -p

-i interleavedSequenceFile [averageOuterDistance standardDeviation]
Provides one file containing interleaved paired-end reads.
averageOuterDistance and standardDeviation are automatically computed if not provided.

-s sequenceFile
Provides a file containing single-end reads.
LoadSingleEndReads is equivalent to -s

Outputs

-o outputDirectory
Specifies the directory for outputted files. Default is RayOutput
Other name: -output

Ray Surveyor options

-run-surveyor
Runs Ray Surveyor to compare samples.
See Documentation/Ray-Surveyor.md
This workflow generates:
RayOutput/Surveyor/SimilarityMatrix.tsv is a similarity Gramian matrix based on shared DNA words
RayOutput/Surveyor/DistanceMatrix.tsv is a distance matrix (kernel-based).
-read-sample-graph SampleName SampleGraphFile
Reads a sample graph (generated with -write-kmers)

Assembly options (defaults work well)

-disable-recycling
Disables read recycling during the assembly
reads will be set free in 3 cases:
1. the distance did not match for a pair
2. the read has not met its mate
3. the library population indicates a wrong placement
see Constrained traversal of repeats with paired sequences.
Sébastien Boisvert, Élénie Godzaridis, François Laviolette & Jacques Corbeil.
First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, March 26-27 2011, Vancouver, BC, Canada.

-debug-recycling
Debugs the recycling events

-ignore-seeds
Disables assembly by ignoring seeds.

-merge-seeds
Merges seeds initially to reduce running time.

-disable-scaffolder
Disables the scaffolder.

-minimum-seed-length minimumSeedLength
Changes the minimum seed length, default is 100 nucleotides

-minimum-contig-length minimumContigLength
Changes the minimum contig length, default is 100 nucleotides

-color-space
Runs in color-space
Needs csfasta files. Activated automatically if csfasta files are provided.

-use-maximum-seed-coverage maximumSeedCoverageDepth
Ignores any seed with a coverage depth above this threshold.
The default is 4294967295.

-use-minimum-seed-coverage minimumSeedCoverageDepth
Sets the minimum seed coverage depth.
Any path with a coverage depth lower than this will be discarded. The default is 0.

Distributed storage engine (all these values are for each MPI rank)

-bloom-filter-bits bits
Sets the number of bits for the Bloom filter
Default is auto bits (adaptive), 0 bits disables the Bloom filter.

-hash-table-buckets buckets
Sets the initial number of buckets. Must be a power of 2 !
Default value: 268435456

-hash-table-buckets-per-group buckets
Sets the number of buckets per group for sparse storage
Default value: 64, Must be between >=1 and <= 64

-hash-table-load-factor-threshold threshold
Sets the load factor threshold for real-time resizing
Default value: 0.75, must be >= 0.5 and < 1

-hash-table-verbosity
Activates verbosity for the distributed storage engine

Biological abundances

-search searchDirectory
Provides a directory containing fasta files to be searched in the de Bruijn graph.
Biological abundances will be written to RayOutput/BiologicalAbundances
See Documentation/BiologicalAbundances.txt

-one-color-per-file
Sets one color per file instead of one per sequence.
By default, each sequence in each file has a different color.
For files with large numbers of sequences, using one single color per file may be more efficient.

Taxonomic profiling with colored de Bruijn graphs

-with-taxonomy Genome-to-Taxon.tsv TreeOfLife-Edges.tsv Taxon-Names.tsv
Provides a taxonomy.
Computes and writes detailed taxonomic profiles.
See Documentation/Taxonomy.txt for details.

-gene-ontology OntologyTerms.txt Annotations.txt
Provides an ontology and annotations.
OntologyTerms.txt is fetched from http://geneontology.org
Annotations.txt is a 2-column file (EMBL_CDS handle & gene ontology identifier)
See Documentation/GeneOntology.txt
Other outputs

-enable-neighbourhoods
Computes contig neighborhoods in the de Bruijn graph
Output file: RayOutput/NeighbourhoodRelations.txt

-amos
Writes the AMOS file called RayOutput/AMOS.afg
An AMOS file contains read positions on contigs.
Can be opened with software with graphical user interface.

-write-kmers
Writes k-mer graph to RayOutput/kmers.txt
The resulting file is not utilised by Ray.
The resulting file is very large.

-graph-only
Exits after building graph.

-write-read-markers
Writes read markers to disk.

-write-seeds
Writes seed DNA sequences to RayOutput/Rank.RaySeeds.fasta

-write-extensions
Writes extension DNA sequences to RayOutput/Rank.RayExtensions.fasta

-write-contig-paths
Writes contig paths with coverage values
to RayOutput/Rank.RayContigPaths.txt

-write-marker-summary
Writes marker statistics.

Memory usage

-show-memory-usage
Shows memory usage. Data is fetched from /proc on GNU/Linux
Needs __linux__

-show-memory-allocations
Shows memory allocation events

Algorithm verbosity

-show-extension-choice
Shows the choice made (with other choices) during the extension.

-show-ending-context
Shows the ending context of each extension.
Shows the children of the vertex where extension was too difficult.

-show-distance-summary
Shows summary of outer distances used for an extension path.

-show-consensus
Shows the consensus when a choice is done.

Checkpointing

-write-checkpoints checkpointDirectory
Write checkpoint files

-read-checkpoints checkpointDirectory
Read checkpoint files

-read-write-checkpoints checkpointDirectory
Read and write checkpoint files

Message routing for large number of cores

-route-messages
Enables the Ray message router. Disabled by default.
Messages will be routed accordingly so that any rank can communicate directly with only a few others.
Without -route-messages, any rank can communicate directly with any other rank.
Files generated: Routing/Connections.txt, Routing/Routes.txt and Routing/RelayEvents.txt
and Routing/Summary.txt

-connection-type type
Sets the connection type for routes.
Accepted values are debruijn, hypercube, polytope, group, random, kautz and complete. Default is debruijn.
torus: a k-ary n-cube, radix: k, dimension: n, degree: 2*dimension, vertices: radix^dimension
polytope: a convex regular polytope, alphabet is {0,1,...,B-1} and the vertices is a power of B
hypercube: a hypercube, alphabet is {0,1} and the vertices is a power of 2
debruijn: a full de Bruijn graph a given alphabet and diameter
kautz: a full de Kautz graph, which is a subgraph of a de Bruijn graph
group: silly model where one representative per group can communicate with outsiders
random: Erdős–Rényi model
complete: a full graph with all the possible connections
With the type debruijn, the number of ranks must be a power of something.
Examples: 256 = 16*16, 512=8*8*8, 49=7*7, and so on.
Otherwise, don't use debruijn routing but use another one
With the type kautz, the number of ranks n must be n=(k+1)*k^(d-1) for some k and d

-routing-graph-degree degree
Specifies the outgoing degree for the routing graph.
See Documentation/Routing.txt

Hardware testing

-test-network-only
Tests the network and returns.

-write-network-test-raw-data
Writes one additional file per rank detailing the network test.

-exchanges NumberOfExchanges
Sets the number of exchanges

-disable-network-test
Skips the network test.

Debugging

-verify-message-integrity
Checks message data reliability for any non-empty message.
add '-D CONFIG_SSE_4_2' in the Makefile to use hardware instruction (SSE 4.2)

-write-scheduling-data
Writes RayPlatform scheduling information to RayOutput/Scheduling/

-write-plugin-data
Writes data for plugins registered with the RayPlatform API to RayOutput/Plugins

-run-profiler
Runs the profiler as the code runs. By default, only show granularity warnings.
Running the profiler increases running times.

-with-profiler-details
Shows number of messages sent and received in each methods during in each time slices (epochs). Needs -run-profiler.

-debug
Turns on -run-profiler and -with-profiler-details for debugging

-show-communication-events
Shows all messages sent and received.

-show-read-placement
Shows read placement in the graph during the extension.

-debug-bubbles
Debugs bubble code.
Bubbles can be due to heterozygous sites or sequencing errors or other (unknown) events

-debug-seeds
Debugs seed code.
Seeds are paths in the graph that are likely unique.

-debug-fusions
Debugs fusion code.

-debug-scaffolder
Debug the scaffolder.

FILES

Input files

Note: file format is determined with file extension.

.fasta
.fa
.fasta.gz (needs HAVE_LIBZ=y at compilation)
.fa.gz (needs HAVE_LIBZ=y at compilation)
.fasta.bz2 (needs HAVE_LIBBZ2=y at compilation)
.fa.bz2 (needs HAVE_LIBBZ2=y at compilation)
.fastq
.fq
.fastq.gz (needs HAVE_LIBZ=y at compilation)
.fq.gz (needs HAVE_LIBZ=y at compilation)
.fastq.bz2 (needs HAVE_LIBBZ2=y at compilation)
.fq.bz2 (needs HAVE_LIBBZ2=y at compilation)
export.txt
qseq.txt
.sff (paired reads must be extracted manually)
.csfasta (color-space reads)
.csfa (color-space reads)

Outputted files

Scaffolds

RayOutput/Scaffolds.fasta
The scaffold sequences in FASTA format
RayOutput/ScaffoldComponents.txt
The components of each scaffold
RayOutput/ScaffoldLengths.txt
The length of each scaffold
RayOutput/ScaffoldLinks.txt
Scaffold links

Contigs

RayOutput/Contigs.fasta
Contiguous sequences in FASTA format
RayOutput/ContigLengths.txt
The lengths of contiguous sequences

Summary

RayOutput/OutputNumbers.txt
Overall numbers for the assembly

de Bruijn graph

RayOutput/CoverageDistribution.txt
The distribution of coverage values
RayOutput/CoverageDistributionAnalysis.txt
Analysis of the coverage distribution
RayOutput/degreeDistribution.txt
Distribution of ingoing and outgoing degrees
RayOutput/kmers.txt
k-mer graph, required option: -write-kmers
The resulting file is not utilised by Ray.
The resulting file is very large.

Assembly steps

RayOutput/SeedLengthDistribution.txt
Distribution of seed length
RayOutput/Rank.OptimalReadMarkers.txt
Read markers.
RayOutput/Rank.RaySeeds.fasta
Seed DNA sequences, required option: -write-seeds
RayOutput/Rank.RayExtensions.fasta
Extension DNA sequences, required option: -write-extensions
RayOutput/Rank.RayContigPaths.txt
Contig paths with coverage values, required option: -write-contig-paths

Paired reads

RayOutput/LibraryStatistics.txt
Estimation of outer distances for paired reads
RayOutput/LibraryData.xml
Frequencies for observed outer distances (insert size + read lengths)

Partition

RayOutput/NumberOfSequences.txt
Number of reads in each file
RayOutput/SequencePartition.txt
Sequence partition

Ray software

RayOutput/RayVersion.txt
The version of Ray
RayOutput/RayCommand.txt
The exact same command provided
RayOutput/RaySmartCommand.txt
The smart command generated by Ray

AMOS

RayOutput/AMOS.afg
Assembly representation in AMOS format, required option: -amos

Communication

RayOutput/NetworkTest.txt
Latencies in microseconds
RayOutput/RankNetworkTestData.txt
Network test raw data

DOCUMENTATION

- mpiexec -n 1 Ray -help|less (always up-to-date)
- This help page (always up-to-date)
- The directory Documentation/
- Manual (Portable Document Format): InstructionManual.tex (in Documentation)
- Mailing list archives: http://sourceforge.net/mailarchive/forum.php?forum_name=denovoassembler-users

AUTHOR
Written by Sébastien Boisvert.

REPORTING BUGS
Report bugs to denovoassembler-users@lists.sourceforge.net
Home page:

COPYRIGHT
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You have received a copy of the GNU General Public License
along with this program (see LICENSE).

Ray 2.3.1

Ray is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License.
This website is also available at sebhtml.github.io/Ray.web.