This function builds the minimum files required for Shiny. Precomputed clusters must be provided. In the anndata object these will be stored using the term "cluster". If hierarchy[-1] is anything other than "cluster", then any existing "cluster" column will be overwritten by hierarchy[-1]. Values can be provided without colors and ids (e.g., "cluster") or with them (e.g., "cluster_label" + "cluster_color" + "cluster_id"). In the latter case cluster_colors is ignored and colors are taken directly from the metadata. Values of cluster_id will be overwritten to match dendrogram order. (NOTE: Some functionality with metadata colors is still under development.)

Usage

buildTaxonomy(
  title = "AIT",
  meta.data,
  hierarchy,
  counts = NULL,
  normalized.expr = NULL,
  highly_variable_genes = NULL,
  marker_genes = NULL,
  ensembl_id = NULL,
  gene.meta.data = NULL,
  cluster_stats = NULL,
  embeddings = NULL,
  number.of.pcs = 30,
  dend = NA,
  taxonomyDir = getwd(),
  cluster_colors = NULL,
  default_embedding = NULL,
  uns.variables = list(),
  subsample = 2000,
  reorder.dendrogram = FALSE,
  add.dendrogram.markers = FALSE,
  addMapMyCells = TRUE,
  save.normalized.data = TRUE,
  check.taxonomy = TRUE,
  print.messages = TRUE,
  ...
)

Arguments

title

The file name to assign for the Taxonomy h5ad (default="AIT"; recommended to create your own title!).

meta.data

Metadata corresponding to the count matrix. Rownames must be equal to colnames of counts. "clusters" must be provided (see hierarchy[-1] and notes).

hierarchy

List of term_set_labels in the Taxonomy, ordered from broadest to finest (e.g., neighborhood, class, subclass, supertype).

counts

A count matrix in sparse format (dgCMatrix). buildTaxonomy can work with count matrices that have cells as rows or as columns, so long as counts has both row names AND column names.
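
As a minimal sketch of these input requirements (toy data; the gene and cell names are invented for illustration), the block below builds a small dgCMatrix with both row and column names plus a matching meta.data frame, then runs the name checks described above. It assumes genes as rows and cells as columns.

```r
library(Matrix)

set.seed(1)
genes <- paste0("gene", 1:5)   # hypothetical gene names
cells <- paste0("cell", 1:6)   # hypothetical cell names

# Toy count matrix stored as a sparse dgCMatrix (genes x cells).
counts <- Matrix(
  matrix(as.numeric(rpois(30, lambda = 2)), nrow = 5,
         dimnames = list(genes, cells)),
  sparse = TRUE
)

# One metadata row per cell; row names mirror the cell names in counts.
meta.data <- data.frame(
  cluster = rep(c("A", "B"), each = 3),
  row.names = cells
)

# Checks one might run before passing these objects to buildTaxonomy.
stopifnot(
  inherits(counts, "dgCMatrix"),
  !is.null(rownames(counts)),
  !is.null(colnames(counts)),
  identical(rownames(meta.data), colnames(counts))
)
```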

highly_variable_genes

Set of features defined as highly variable genes OR a number of binary genes to calculate (we recommend ~1000 to ~5000 genes for <100 to ~5000 cell types). If a feature list is provided, supply it either as a named list of vectors, or as a single vector (in which case the name "highly_variable_genes_standard" will be used). "highly_variable_genes_standard" will also be used for calculated variable genes. Optional input, but for proper mapping we strongly recommend including either highly_variable_genes or marker_genes.

marker_genes

Set of features defined as marker genes. Provide either as a named list of vectors, or as a single vector (in which case the name "marker_genes_mode.name" will be used).
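
The accepted shapes for gene sets can be sketched as follows. The gene symbols and the set name "highly_variable_genes_custom" are hypothetical and chosen only for illustration; the same single-vector and named-list shapes apply to marker_genes.

```r
# Option 1: a single character vector (stored under the default name).
hvg_single <- c("Gad1", "Slc17a7", "Pvalb")

# Option 2: a named list of character vectors, one entry per gene set.
hvg_named <- list(
  highly_variable_genes_standard = c("Gad1", "Slc17a7", "Pvalb"),
  highly_variable_genes_custom   = c("Sst", "Vip")   # hypothetical set name
)

# Option 3 (highly_variable_genes only): a number of genes to calculate.
hvg_count <- 2000

# Basic shape checks on the list form.
stopifnot(
  !is.null(names(hvg_named)),
  all(nzchar(names(hvg_named))),
  all(vapply(hvg_named, is.character, logical(1)))
)
```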

ensembl_id

A vector of Ensembl IDs corresponding to the gene symbols in counts.

gene.meta.data

Either NULL (default) or a data frame of additional gene information to include in the var component of anndata.

cluster_stats

A matrix of median gene expression by cluster. Cluster names must exactly match meta.data$cluster. If provided, it will be saved to "varm$cluster_id_median_expr_mode".
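
One way to compute such a matrix in base R is sketched below with toy data. The orientation (genes as rows, clusters as columns) is an assumption, since the description above does not state it, and the gene and cell names are invented.

```r
# Toy expression matrix: 4 genes x 6 cells (hypothetical names).
expr <- matrix(
  1:24, nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

# Cluster assignment per cell, in the same order as the columns of expr.
cluster <- c("A", "A", "A", "B", "B", "B")

# Median expression of each gene within each cluster.
# The result has genes as rows and clusters as columns (assumed orientation).
cluster_stats <- sapply(
  split(seq_along(cluster), cluster),
  function(idx) apply(expr[, idx, drop = FALSE], 1, median)
)

# Column names must match the cluster names used in the metadata.
stopifnot(setequal(colnames(cluster_stats), unique(cluster)))
```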

embeddings

Dimensionality reduction coordinates as a data.frame with 2 columns, or a string with the column name for marker_genes or variable_genes from which a UMAP should be calculated. If coordinates are provided, rownames must be equal to colnames of counts. Provide either as a named list or as a single data.frame (in which case the name "default_standard" will be used). Embeddings are not required, but inclusion of at least one embedding is strongly recommended.

number.of.pcs

Number of principal components to use for calculating UMAP coordinates (default=30). This is only used if embeddings corresponds to a variable gene column from which a UMAP should be calculated.

dend

Existing dendrogram associated with this taxonomy (e.g., one calculated elsewhere). If provided, dend must be a BINARY tree. Can also be a string with the column name for marker_genes or variable_genes from which a dendrogram should be calculated. If NULL, or if the function cannot determine which gene set is intended, no dendrogram will be calculated. Failure to define a dendrogram will prevent some mapping algorithms from working properly! The default is to build a dendrogram using the first highly_variable_genes or marker_genes set.

taxonomyDir

The location to save Shiny objects, e.g. "/allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/10x_seq/NHP_BG_20220104/"

cluster_colors

An optional named character vector where the values correspond to colors and the names correspond to cell types in hierarchy[-1]. If this vector is incomplete, a warning is thrown and it is ignored. cluster_colors can also be provided in the metadata (see notes).
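
A short sketch of a complete cluster_colors vector, with a check for missing clusters that could be run beforehand. The cluster names and hex colors are hypothetical.

```r
# Cluster names as they appear in the finest hierarchy level (hypothetical).
clusters <- c("A", "B", "C")

# Named character vector: names are clusters, values are colors.
cluster_colors <- c(A = "#E41A1C", B = "#377EB8", C = "#4DAF4A")

# Any cluster without a color makes the vector incomplete.
missing_colors <- setdiff(clusters, names(cluster_colors))
is_complete <- length(missing_colors) == 0

if (!is_complete) {
  warning("No color supplied for: ", paste(missing_colors, collapse = ", "))
}
```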

default_embedding

A string indicating which embedding to use for calculations. Default (NULL) is to take the first one provided in embeddings.

uns.variables

If provided, a list of additional variables to be included in the uns. See Notes for schema variables not otherwise accounted for.

subsample

The number of cells to retain per cluster (default = 2000).
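
To illustrate what retaining at most N cells per cluster means, here is a minimal base R sketch on toy data. How buildTaxonomy selects cells internally is not described here; random sampling without replacement is an assumption made for this example, and a small cap is used so the effect is visible.

```r
set.seed(42)

# Toy cluster labels: 10 cells in A, 3 in B, 6 in C (hypothetical).
cluster <- rep(c("A", "B", "C"), times = c(10, 3, 6))
names(cluster) <- paste0("cell", seq_along(cluster))

subsample <- 5  # cap per cluster (the documented default is 2000)

# For each cluster, keep all cells if at or under the cap, else sample down.
keep <- unlist(
  lapply(split(names(cluster), cluster), function(ids) {
    if (length(ids) > subsample) sample(ids, subsample) else ids
  }),
  use.names = FALSE
)

# Cells retained per cluster after subsampling.
kept_per_cluster <- table(cluster[keep])
```

Clusters that already have fewer cells than the cap (B in this toy example) are kept in full.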

reorder.dendrogram

Should the dendrogram attempt to match a preset order? (Default = FALSE). If TRUE, the dendrogram attempts to match the celltype factor order as closely as possible (if celltype is a character vector rather than a factor, this will sort clusters alphabetically, which is not ideal).
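
The difference between character and factor ordering can be seen in a small base R example. The cell type names are hypothetical.

```r
# Cell types as a plain character vector (hypothetical names).
celltype_chr <- c("Exc", "Inh", "Astro", "Exc")

# Without factor levels, the only available order is alphabetical.
alphabetical_order <- sort(unique(celltype_chr))
# "Astro" "Exc" "Inh"

# As a factor, the levels carry an explicit preset order.
celltype_fct <- factor(celltype_chr, levels = c("Exc", "Inh", "Astro"))
preset_order <- levels(celltype_fct)
# "Exc" "Inh" "Astro"
```

Converting the cluster column to a factor with explicit levels before building the taxonomy is therefore one way to control the order that the dendrogram tries to match.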

add.dendrogram.markers

If TRUE (default=FALSE), will also add dendrogram markers to prep the taxonomy for tree mapping.

addMapMyCells

If TRUE (default), will also prep this taxonomy for hierarchical mapping.

save.normalized.data

If TRUE (default), will save normalized data when writing out the h5ad file. Otherwise, will remove normalized data to save space (in which case it will be recalculated automatically upon loadTaxonomy).

check.taxonomy

Should the taxonomy be checked to see if it follows the AIT schema? (default=TRUE)

print.messages

If check.taxonomy is TRUE, should any messages be written to the screen in addition to the log file? (default=TRUE)

...

Additional variables to be passed to addDendrogramMarkers.

Additional uns.variables:

dataset_purl: Link to molecular data if not present in X or raw.X.
batch_condition: Keys defining batches for normalization/integration algorithms. Used for cellxgene.
reference_genome: Reference genome used to align molecular measurements.
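
A sketch of how these entries might be assembled into the uns.variables list. All values shown are hypothetical placeholders, and the exact formats expected by the schema are not specified here.

```r
# Named list of extra uns entries; every value below is a placeholder.
uns.variables <- list(
  dataset_purl     = "https://example.org/my-dataset",  # hypothetical URL
  batch_condition  = c("donor", "library_prep"),        # hypothetical metadata keys
  reference_genome = "GRCh38"                           # hypothetical genome label
)

# All entries should be named so they can be stored under uns by key.
stopifnot(
  is.list(uns.variables),
  !is.null(names(uns.variables)),
  all(nzchar(names(uns.variables)))
)
```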

Value

AIT anndata object in the specified format (only if return.anndata=TRUE)