Created by bryan on 4/17/15.
A wrapper around the attrTuple (key) and value pair. Includes the attrTuple-type explicitly, rather than embedding the corresponding information in the type of 'value', because otherwise it'd be difficult to extract the correct type for Byte and NumericSequence values.
Roughly analogous to Picard's SAMTagAndValue.
The string key associated with this pair.
An enumerated value representing the type of the 'value' parameter.
The 'value' half of the pair.
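A minimal sketch of the three-part shape described above (key, explicit type, value). All names here (TagTypeSketch, AttributeSketch, the "XB" tag) are hypothetical and chosen for illustration only; the point is that two attributes can carry the same runtime value and still differ in their declared type.

```scala
// Hypothetical names throughout; only the key/type/value shape comes from the text above.
object TagTypeSketch extends Enumeration {
  val Character, Integer, Float, Text, Byte, NumericSequence = Value
}

// The string key, the explicit type tag, and the untyped value.
case class AttributeSketch(tag: String, tpe: TagTypeSketch.Value, value: Any)

object AttributeExample {
  // Both attributes hold the runtime value 7; only `tpe` tells them apart.
  val asInteger = AttributeSketch("NM", TagTypeSketch.Integer, 7)
  val asByte = AttributeSketch("XB", TagTypeSketch.Byte, 7)
}
```

Without the explicit tpe field, a consumer looking only at the value could not tell whether 7 was meant as an Integer or a Byte.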
Coding sequence (CDS) annotations; each should be a subset of an Exon for a particular Transcript.
The standard DNA alphabet with A,T,C, and G
An exon model (here represented as a value of the Exon class) is a representation of a single exon from a transcript in genomic coordinates.
NOTE: we're not handling shared exons here
the (unique) identifier of the transcript to which the exon belongs
The region (in genomic coordinates) to which the exon maps
A trait for values (usually regions or collections of regions) that can be subsetted or extracted out of a larger region string -- for example, exons or transcripts which have a sequence defined in terms of their coordinates against a reference chromosome. Passing the sequence of the reference chromosome to a transcript's 'extractSequence' method will return the sequence of the transcript.
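The extraction idea above can be sketched as follows. This is an illustration under stated assumptions, not the actual trait: the signature of extractSequence, the ExtractRegion type, and the SimpleTranscript class are all hypothetical, coordinates are taken to be 0-based half-open, and strand is ignored.

```scala
// Hypothetical region type: 0-based, half-open [start, end).
case class ExtractRegion(refName: String, start: Long, end: Long)

trait ExtractableSketch {
  def regions: Seq[ExtractRegion]

  // Concatenate the reference substrings covered by each region, in coordinate order.
  // Assumes referenceSequence is the full sequence of a single chromosome.
  def extractSequence(referenceSequence: String): String =
    regions
      .sortBy(_.start)
      .map(r => referenceSequence.substring(r.start.toInt, r.end.toInt))
      .mkString
}

// A transcript-like value whose sequence is the concatenation of its exon regions.
case class SimpleTranscript(regions: Seq[ExtractRegion]) extends ExtractableSketch
```

For a reverse-strand transcript, a real implementation would also need to reverse-complement the result; that step is left out here.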
A 'gene model' is a small, hierarchical collection of objects: Genes, Transcripts, and Exons. Each Gene contains a collection of Transcripts, and each Transcript contains a collection of Exons, and together they describe how the genome is transcribed and translated into a family of related proteins (or other RNA products that aren't translated at all).
This review, Gerstein et al. "What is a gene, post-ENCODE? History and updated definition" Genome Research (2007) http://genome.cshlp.org/content/17/6/669.full
is a reasonably good overview both of what the term 'gene' has meant in the past as well as where it might be headed in the future.
Here, we aren't trying to answer any of these questions about "what is a gene," but rather to provide the routines necessary to _re-assemble_ hierarchical models of genes that have been flattened into features (GFF, GTF, or BED).
A name, presumably unique within a gene dataset, of a Gene
Common names for the gene, possibly shared with other genes (for historical or ad hoc reasons)
The strand of the Gene (this is from data, not derived from the Transcripts' strand(s), and we leave open the possibility that a single Gene will have Transcripts in _both_ directions, e.g. anti-sense transcripts)
The Transcripts that are part of this gene model
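One way to picture the re-assembly of flattened features into a Gene / Transcript / Exon hierarchy is the grouping below. Every type and field name is hypothetical (FlatFeature, GeneModelSketch, and so on), and the sketch assumes each exon feature already carries its gene and transcript identifiers, as GTF attributes typically do.

```scala
// Hypothetical flattened feature row, roughly what a GTF line might reduce to.
case class FlatFeature(featureType: String, geneId: String, transcriptId: String,
                       start: Long, end: Long)

case class ExonSketch(transcriptId: String, start: Long, end: Long)
case class TranscriptSketch(id: String, geneId: String, exons: Seq[ExonSketch])
case class GeneSketch(id: String, transcripts: Seq[TranscriptSketch])

object GeneModelSketch {
  // Group exon features by gene, then by transcript, to rebuild the hierarchy.
  def reassemble(features: Seq[FlatFeature]): Seq[GeneSketch] = {
    val exons = features.filter(_.featureType == "exon")
    exons.groupBy(_.geneId).toSeq.sortBy(_._1).map { case (geneId, geneExons) =>
      val transcripts = geneExons.groupBy(_.transcriptId).toSeq.sortBy(_._1).map {
        case (txId, txExons) =>
          val sorted = txExons.sortBy(_.start).map(f => ExonSketch(txId, f.start, f.end))
          TranscriptSketch(txId, geneId, sorted)
      }
      GeneSketch(geneId, transcripts)
    }
  }
}
```

Strand, names, and CDS/UTR features are omitted to keep the grouping step visible.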
An interval is a region on a coordinate space that has a defined width. This can be used to express a region of a genome, a transcript, a gene, etc.
Creates a multi-reference-region collection of NonoverlappingRegions -- see the scaladocs to NonoverlappingRegions.
The evaluation of a regionJoin takes place with respect to a complete partition on the total space of the genome. NonoverlappingRegions is a class to compute the value of that partition, and to allow us to assign one or more elements of that partition to a new ReferenceRegion (see the 'regionsFor' method).
NonoverlappingRegions takes, as input, an 'input-set' of regions. These are arbitrary ReferenceRegions, which may be overlapping, identical, disjoint, etc. The input-set of regions _must_ all be located on the same reference chromosome (i.e. must all have the same refName); the generalization to reference regions from multiple chromosomes is in MultiContigNonoverlappingRegions, below.
NonoverlappingRegions produces, internally, a 'nonoverlapping-set' of regions. This is basically the set of _distinct unions_ of the input-set regions.
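The following sketch shows one plausible reading of the description above: overlapping input regions are merged into a nonoverlapping set, and a regionsFor lookup returns the merged regions that a query touches. The Span type and the merge helper are hypothetical, and the real class may compute its partition differently.

```scala
// Hypothetical region type: 0-based, half-open [start, end).
case class Span(refName: String, start: Long, end: Long) {
  def overlaps(o: Span): Boolean =
    refName == o.refName && start < o.end && o.start < end
}

object NonoverlappingSketch {
  // Merge overlapping regions; all inputs must share one refName.
  def merge(input: Seq[Span]): Vector[Span] = {
    require(input.map(_.refName).distinct.size <= 1, "regions must share a refName")
    input.sortBy(_.start).foldLeft(Vector.empty[Span]) { (acc, r) =>
      acc.lastOption match {
        // r overlaps the last merged region: extend that region instead of adding a new one.
        case Some(last) if r.start < last.end =>
          acc.init :+ last.copy(end = math.max(last.end, r.end))
        case _ => acc :+ r
      }
    }
  }

  // The merged regions that overlap the query region.
  def regionsFor(merged: Seq[Span], query: Span): Seq[Span] =
    merged.filter(_.overlaps(query))
}
```

Because the intervals are half-open, two regions that merely touch (one ends where the next starts) are kept separate in this sketch.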
This class is similar to SingleReadBucket, except it breaks the reads down further.
Rather than stopping at primary/secondary/unmapped, this will break it down further into whether they are paired or unpaired, and then whether they are the first or second of the pair.
This is useful as this will usually map a single read in any of the sequences.
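A rough sketch of the finer-grained bucketing described above. The SimpleRead fields and the category labels are assumptions made for illustration; the real class works on full read records rather than on a handful of flags.

```scala
// Hypothetical, heavily simplified read: just the flags the bucketing needs.
case class SimpleRead(name: String, mapped: Boolean, primary: Boolean,
                      paired: Boolean, firstOfPair: Boolean)

object ReadBucketSketch {
  // Combine the pairing state with the mapping state into one category label.
  def category(r: SimpleRead): String = {
    val mapping =
      if (!r.mapped) "unmapped"
      else if (r.primary) "primaryMapped"
      else "secondaryMapped"
    val pairing =
      if (!r.paired) "unpaired"
      else if (r.firstOfPair) "pairedFirst"
      else "pairedSecond"
    s"$pairing/$mapping"
  }

  // Group the reads by category.
  def bucket(reads: Seq[SimpleRead]): Map[String, Seq[SimpleRead]] =
    reads.groupBy(category)
}
```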
Builds a dictionary containing record groups. Record groups must have a unique name across all samples in the dictionary. This dictionary provides numerical IDs for each group; these IDs are only consistent when referencing a single dictionary.
Throws an assertion error if there are multiple record groups with the same name.
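The uniqueness check and the per-dictionary numeric IDs can be sketched like this. The class and method names are hypothetical, and assigning IDs by sorted name is an assumption; the text above only says that IDs are consistent within a single dictionary.

```scala
// Hypothetical record group: a sample plus a group name.
case class RecordGroupSketch(sample: String, name: String)

class RecordGroupDictionarySketch(groups: Seq[RecordGroupSketch]) {
  // Names must be unique across all samples; a duplicate raises an AssertionError.
  assert(groups.map(_.name).distinct.size == groups.size,
    "record group names must be unique")

  // Assumption: IDs follow sorted name order, so they are stable for this dictionary only.
  private val ids: Map[String, Int] = groups.map(_.name).sorted.zipWithIndex.toMap

  def getIndex(name: String): Int = ids(name)
}
```

The same group name can map to a different ID in another dictionary, which is why IDs should not be compared across dictionaries.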
Represents a contiguous region of the reference genome.
The name of the sequence (chromosome) in the reference genome
The 0-based residue-coordinate for the start of the region
The 0-based residue-coordinate for the first residue after the start which is not in the region -- i.e. [start, end) define a 0-based half-open interval.
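To make the 0-based half-open convention concrete, here is a small sketch. RegionSketch and its methods are hypothetical stand-ins, not the real class; they only illustrate how [start, end) behaves.

```scala
// Hypothetical region with 0-based, half-open coordinates [start, end).
case class RegionSketch(referenceName: String, start: Long, end: Long) {
  require(start >= 0 && end >= start, "need 0 <= start <= end")

  // With a half-open interval, the width is simply end - start.
  def width: Long = end - start

  // `end` itself is the first position NOT in the region.
  def contains(pos: Long): Boolean = pos >= start && pos < end

  def overlaps(other: RegionSketch): Boolean =
    referenceName == other.referenceName && start < other.end && other.start < end
}
```

A region with start 10 and end 20 covers positions 10 through 19, has width 10, and does not overlap a region that starts at 20.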
Utility class within the SequenceDictionary; represents unique reference name-to-id correspondence
A symbol in an alphabet
a character which represents the symbol
a character which represents the complement of the symbol
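A symbol with a label and a complement can be sketched as below, using the standard DNA alphabet as the example. SymbolSketch, DnaSketch, and reverseComplement are hypothetical names, and reverse complement is shown only as one example of an operation an alphabet might offer.

```scala
// Hypothetical symbol: its own character plus the character of its complement.
case class SymbolSketch(label: Char, complement: Char)

object DnaSketch {
  // The standard DNA alphabet: A pairs with T, C pairs with G.
  val symbols: Seq[SymbolSketch] = Seq(
    SymbolSketch('A', 'T'), SymbolSketch('T', 'A'),
    SymbolSketch('C', 'G'), SymbolSketch('G', 'C'))

  private val complementOf: Map[Char, Char] =
    symbols.map(s => s.label -> s.complement).toMap

  // Reverse the string, then complement each base.
  // Throws NoSuchElementException for characters outside A, T, C, G.
  def reverseComplement(s: String): String =
    s.reverse.map(c => complementOf(c.toUpper))
}
```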
A transcript model (here represented as a value of the Transcript class) is a simple, hierarchical model containing a collection of exon models as well as an associated gene identifier, transcript identifier, and a set of common names (synonyms).
the (unique) identifier of the Transcript
Common names for the transcript
The (unique) identifier of the gene to which the transcript belongs
The set of exons in the transcript model; each of these contain a reference region whose coordinates are in genomic space.
the set of CDS regions (the subset of the exons that are coding) for this transcript
UnTranslated Regions
SequenceDictionary contains the (bijective) map between Ints (the referenceId) and Strings (the referenceName) from the header of a BAM file, or the combined result of multiple such SequenceDictionaries.
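The bijective id/name mapping can be sketched as a pair of lookup tables with a uniqueness check on both sides. The names below are hypothetical, and combining dictionaries is modeled as a plain union that rejects conflicts; the real class may reconcile conflicting ids in some other way.

```scala
// Hypothetical record: a reference id paired with a reference name.
case class SeqRecordSketch(id: Int, name: String)

class SeqDictSketch(input: Seq[SeqRecordSketch]) {
  val records: Seq[SeqRecordSketch] = input.distinct

  // Bijective: no id maps to two names, and no name maps to two ids.
  require(records.map(_.id).distinct.size == records.size, "ids must be unique")
  require(records.map(_.name).distinct.size == records.size, "names must be unique")

  private val byId: Map[Int, String] = records.map(r => r.id -> r.name).toMap
  private val byName: Map[String, Int] = records.map(r => r.name -> r.id).toMap

  def name(id: Int): Option[String] = byId.get(id)
  def id(name: String): Option[Int] = byName.get(name)

  // Assumption: combining is a plain union; conflicting entries fail the checks above.
  def ++(other: SeqDictSketch): SeqDictSketch =
    new SeqDictSketch(records ++ other.records)
}
```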
Note: VariantContext inherits its name from the Picard VariantContext, and is not related to the SparkContext object. If you're looking for the latter, see org.bdgenomics.adam.rdd.variation.VariationContext
An alphabet of symbols and related operations