This vignette demonstrates a basic workflow for accessing and analysing single-cell RNA-seq data from the CELLxGENE repository using {laminr}. CZ CELLxGENE Discover is a standardised collection of scRNA-seq datasets, and LaminDB makes it easy to query and access data in this repository. We will go through the steps of finding and downloading a dataset using {laminr}, performing some simple analysis using {Seurat} and saving the results to your own LaminDB database.
Before we begin, please take some time to check out the Getting Started vignette (vignette("laminr", package = "laminr")). In particular, make sure you have run the commands in the “Initial Setup” section.
Once that is done, we can load the {laminr} library.
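The loading step is a single call. This sketch assumes {laminr} has already been installed as described in the Getting Started vignette.

```r
# Attach {laminr} so that connect() is available in this session
library(laminr)
```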
The first thing we need to do is connect to the LaminDB database. For this tutorial, we will connect to a default instance (where we will store results) and to the CELLxGENE instance that we will search for datasets.
We will start by connecting to your default LaminDB instance. You can set the default instance using the lamin CLI on the command line:
lamin connect <owner>/<name>
Once a default instance has been set, we can connect to it with {laminr}:
db <- connect()
#> → connected lamindb: laminlabs/lamindata
db
#> lamindata
#> Core registries
#> $Run
#> $User
#> $Param
#> $ULabel
#> $Feature
#> $Storage
#> $Artifact
#> $Transform
#> $Collection
#> $FeatureSet
#> $ParamValue
#> $FeatureValue
#> Additional modules
#> bionty
#> wetlab
This gives us an object we can use to interact with the database.
Note that only the default instance can create new records. This tutorial assumes you have access to an instance where you have permission to add data.
Before we start, we will track the code that is run in this notebook.
db$track("I8BlHXFXqZOG0000", path = "example_workflow.Rmd")
#> → created Transform('I8BlHXFX'), started new Run('CKjEjAr6') at 2024-11-20 11:17:56 UTC
Tip: The ID should be obtained by running db$track(path = "example_workflow.Rmd") and copying the ID from the output.
We can connect to other instances by providing a slug to the connect() function. Instances connected to in this way can be used to query data but cannot make any changes. Let’s connect to the CELLxGENE instance:
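A minimal sketch of that call follows. The slug "laminlabs/cellxgene" is an assumption based on the public CELLxGENE instance hosted on Lamin Hub; the object name cellxgene matches the one used in the rest of this vignette.

```r
library(laminr)

# Connect to the public CELLxGENE instance (slug assumed, see above).
# Instances connected by slug can be queried but not modified.
cellxgene <- connect("laminlabs/cellxgene")
```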
In Lamin, artifacts are objects that contain information (single-cell data, images, data frames etc.) as well as associated metadata. You can see what artifacts are available using the database instance object.
cellxgene$Artifact$df(limit = 5)
#> id suffix X_accessor n_objects visibility key uid size hash
#> 1 2846 tiledbsoma 290 1 cell-census/2023-12-15/soma FYMewVq5twKMDXVy0000 635848093433 Mfyw8VuqftX5REITfQH_yg
#> 2 3665 tiledbsoma 330 1 cell-census/2024-07-01/soma FYMewVq5twKMDXVy0001 870700998221 bzrXBPNvitSVKvb3GG38_w
#> 3 1270 .h5ad AnnData NA 1 cell-census/2023-07-25/h5ads/7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad tczTlSHFPOcAcBnfyxKA 1297573950 UlsVvBz9kMzn2r9RdoAAOg
#> 4 2840 .ipynb <NA> NA 0 <NA> JIIPyQX5l9qELPl42d75 36297 gNdUkonYgQJP_Mi3xLzt_g
#> 5 2842 .html <NA> NA 0 <NA> Whyxwf3k2GjJwTPCl1FK 716529 BDGZac3qU3oLVFpO035Qhg
#> description n_observations is_latest X_hash_type type created_at X_key_is_virtual updated_at version
#> 1 Census 2023-12-15 68683222 FALSE md5-d dataset 2024-07-12T12:12:16.091881+00:00 FALSE 2024-09-17T13:00:13.714256+00:00 2023-12-15
#> 2 Census 2024-07-01 115556140 TRUE md5-d dataset 2024-07-16T12:52:01.424629+00:00 FALSE 2024-09-17T13:01:23.739635+00:00 2024-07-01
#> 3 Supercluster: Hippocampal CA1-3 74979 FALSE md5-n <NA> 2023-11-28T21:46:12.685907+00:00 FALSE 2024-01-24T07:10:21.725547+00:00 2023-07-25
#> 4 Source of transform G69jtgzKO0eJ6K79 NA FALSE md5 <NA> 2024-01-29T08:32:13.311741+00:00 TRUE 2024-01-29T08:32:13.311792+00:00 0
#> 5 Report of run UAAiLAi0BrLvlKnsuvP3 NA FALSE md5 <NA> 2024-01-29T08:32:18.346499+00:00 TRUE 2024-01-30T09:12:06.027928+00:00 1
This is useful, but it’s not the nicest or easiest way to find a particular dataset. Instead, we will use the Lamin Hub website to find the data we want to load.
There, we filter for .h5ad files and search for “renal cell carcinoma”.
Once we have the artifact ID, we can load information about the artifact, similar to what we see on the website. Notice that we use a slightly different command to what we copied from the website.
artifact <- cellxgene$Artifact$get("7dVluLROpalzEh8mNyxk")
artifact
#> Artifact(uid='7dVluLROpalzEh8mNyxk', description='Renal cell carcinoma, pre aPD1, kidney Puck_200727_12', key='cell-census/2023-12-15/h5ads/02faf712-92d4-4589-bec7-13105059cf86.h5ad', id=1742, run_id=22, hash='YNYuokfAoDFxdaRILjmU9w', size=13997860, suffix='.h5ad', storage_id=2, version='2023-12-15', _accessor='AnnData', is_latest=TRUE, transform_id=16, _hash_type='md5-n', created_at='2024-01-11T09:13:23.143694+00:00', created_by_id=1, updated_at='2024-01-24T07:17:47.009288+00:00', visibility=1, n_observations=17612, _key_is_virtual=FALSE)
So far we have only retrieved the metadata about this object. To download the data itself we need to run another command.
adata <- artifact$load()
#> ℹ 's3://cellxgene-data-public/cell-census/2023-12-15/h5ads/02faf712-92d4-4589-bec7-13105059cf86.h5ad' already exists at '/home/rcannood/.cache/lamindb/cellxgene-data-public/cell-census/2023-12-15/h5ads/02faf712-92d4-4589-bec7-13105059cf86.h5ad'
adata
#> AnnData object with n_obs × n_vars = 17612 × 23254
#> obs: 'n_genes', 'n_UMIs', 'log10_n_UMIs', 'log10_n_genes', 'Cell_Type', 'cell_type_ontology_term_id', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'donor_id', 'is_primary_data', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
#> var: 'gene', 'n_beads', 'n_UMIs', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
#> uns: 'Cell_Type_colors', 'schema_version', 'title'
#> obsm: 'X_spatial'
This dataset has been stored as an AnnData object. In the next sections we will convert it to a Seurat object and perform some simple analysis.
There are various approaches for converting between different single-cell objects, some of which are described in the Interoperability chapter of the Single-cell Best Practices book.
Because we already have the data loaded in memory, the simplest option is to extract the information we need and create a new Seurat object.
seurat <- SeuratObject::CreateSeuratObject(
  # AnnData stores cells x genes, so transpose to genes x cells
  counts = Matrix::t(adata$X),
  meta.data = adata$obs
)
#> Warning: Data is of class dgRMatrix. Coercing to dgCMatrix.
seurat
#> An object of class Seurat
#> 23254 features across 17612 samples within 1 assay
#> Active assay: RNA (23254 features, 0 variable features)
#> 1 layer present: counts
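The transposition in the call above is needed because AnnData stores cells in rows and genes in columns, while a Seurat counts matrix has genes in rows and cells in columns. The toy example below illustrates this with a small made-up matrix. The names toy_X, toy_counts, cell1 and geneA are hypothetical and not taken from the dataset, and the example only needs the {Matrix} package.

```r
library(Matrix)

# Toy cells x genes matrix in row-compressed form, mimicking adata$X
# (values and names are made up for illustration)
toy_X <- as(
  Matrix(
    c(1, 0, 2,
      0, 3, 0),
    nrow = 2, byrow = TRUE, sparse = TRUE,
    dimnames = list(c("cell1", "cell2"), c("geneA", "geneB", "geneC"))
  ),
  "RsparseMatrix"
)

# Transpose to genes x cells, the orientation used for the counts argument
toy_counts <- Matrix::t(toy_X)

dim(toy_X)       # cells x genes
dim(toy_counts)  # genes x cells
```

After transposing, the row names are the gene names and the column names are the cell names, which is how the feature and sample counts reported for the Seurat object should be read.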
We could perform any normal analysis using {Seurat}, but as an example we will calculate marker genes for each of the annotated cell types. To make things a bit quicker, we only test the first 1000 genes, but if you have a few minutes you can get results for all features.
# Set cell identities to the provided cell type annotation
SeuratObject::Idents(seurat) <- "Cell_Type"
# Normalise the data
seurat <- Seurat::NormalizeData(seurat)
#> Normalizing layer: counts
# Test for marker genes
markers <- Seurat::FindAllMarkers(
  seurat,
  features = SeuratObject::Features(seurat)[1:1000]
)
#> Calculating cluster Epithelial
#> Calculating cluster Fibroblast
#> For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
#> (default method for FindMarkers) please install the presto package
#> --------------------------------------------
#> install.packages('devtools')
#> devtools::install_github('immunogenomics/presto')
#> --------------------------------------------
#> After installation of presto, Seurat will automatically use the more
#> efficient implementation (no further action necessary).
#> This message will be shown once per session
#> Calculating cluster Myeloid
#> Calculating cluster Tumor
#> Warning: The following tests were not performed:
#> Warning: When testing Epithelial versus all:
#> Cell group 1 has fewer than 3 cells
# The output is a data.frame
head(markers)
#> p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
#> ENSG00000164283 1.030703e-89 2.7485040 0.205 0.048 2.396797e-85 Fibroblast ENSG00000164283
#> ENSG00000116016 3.606838e-38 2.0721038 0.152 0.051 8.387340e-34 Fibroblast ENSG00000116016
#> ENSG00000074800 5.097282e-25 -0.9810317 0.185 0.366 1.185322e-20 Fibroblast ENSG00000074800
#> ENSG00000112715 6.663398e-18 -1.1826785 0.078 0.202 1.549507e-13 Fibroblast ENSG00000112715
#> ENSG00000140416 1.844156e-17 -0.6994000 0.175 0.326 4.288400e-13 Fibroblast ENSG00000140416
#> ENSG00000125810 8.916133e-15 1.8102270 0.057 0.019 2.073358e-10 Fibroblast ENSG00000125810
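As a small illustration of working with this data frame, the sketch below rebuilds the six rows printed above (values rounded) and keeps only the genes that are upregulated in the cluster. The thresholds of avg_log2FC > 1 and p_val_adj < 0.05 are arbitrary choices for this example, not recommendations from the vignette, and the names top_markers and up_markers are hypothetical.

```r
# The six rows printed by head(markers), with values rounded for brevity
top_markers <- data.frame(
  avg_log2FC = c(2.749, 2.072, -0.981, -1.183, -0.699, 1.810),
  p_val_adj = c(2.40e-85, 8.39e-34, 1.19e-20, 1.55e-13, 4.29e-13, 2.07e-10),
  cluster = "Fibroblast",
  gene = c(
    "ENSG00000164283", "ENSG00000116016", "ENSG00000074800",
    "ENSG00000112715", "ENSG00000140416", "ENSG00000125810"
  )
)

# Keep significant markers with higher expression in the cluster
# (thresholds are illustrative assumptions)
up_markers <- subset(top_markers, p_val_adj < 0.05 & avg_log2FC > 1)
up_markers$gene
```

With these rows, three of the six genes pass the filter; the same subset() call can be applied to the full markers data frame.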
Now that we have our results, we can save them to the LaminDB instance.
seu_path <- tempfile(fileext = ".rds")
saveRDS(seurat, seu_path)
db$Artifact$from_df(
  markers,
  description = "Marker genes for renal cell carcinoma dataset"
)$save()
#> → returning existing artifact with same hash: Artifact(uid='uo8EpZG3uVBHDAq60000', is_latest=True, description='Marker genes for renal cell carcinoma dataset', suffix='.parquet', type='dataset', size=11537, hash='N-D0s0VzXjS8IsIpuIF_Jw', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=2, transform_id=173, run_id=319, created_by_id=28, created_at=2024-11-20 10:19:51 UTC)
db$Artifact$from_path(
  seu_path,
  description = "Seurat object for renal cell carcinoma dataset"
)$save()
#> ... uploading file60c2f2dc66740.rds: 0.0%... uploading file60c2f2dc66740.rds: 100.0%
Finally, we can close the connection to the database.
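The call for this step is not shown here. A hedged sketch, assuming the instance object provides a finish() method that pairs with the db$track() call used earlier and marks the tracked run as finished:

```r
# Assumes `db` is the default instance object created with connect() earlier,
# and that it exposes finish() as the counterpart of track()
db$finish()
```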
You can render this notebook to HTML:
In RStudio, click the “Knit” button
From the command line, run:
Or use the rmarkdown package in R:
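For the R option, a minimal sketch using rmarkdown::render() is given here. It assumes the notebook is saved as example_workflow.Rmd (the path passed to db$track() earlier) in the current working directory.

```r
# Render the notebook to HTML; the file name matches the path used in db$track()
rmarkdown::render("example_workflow.Rmd", output_format = "html_document")
```

By default the HTML file is written next to the input file.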
And then save it to your LaminDB instance using the lamin CLI: