1. R的上限决定于R本身和R基础c++包
(1) 有些问题已经碰到了R的基础包的上限,比如稀疏矩阵:
- R long vectors not supported yet: memory.c:3888 #1095, https://github.com/chanzuckerberg/cellxgene-census/issues/1095
解释是:稀疏矩阵 dgCMatrix
包的上限达到了。在Seurat V5中替换为 测试中的Census
包。或者数据太大时,建议用python。
the reason you see this error is due to a limitation on the size of sparse matrices forced by dgCMatrix
, which is the default sparse matrix class used by Seurat. See (for example) satijalab/seurat#4380 for more details. The dataset you’re trying to query is large enough to hit that limit and would fail a Seurat conversion even outside the Census.
This limitation can be removed by using Seurat v5, since it allows to use a sparse matrix class that is not dgCMatrix. Currently Seurat v5 isn’t supported by the Census or TileDB-SOMA, which is the backend library used by the Census
, although it’s on the roadmap. We will publish an update when it will be available. In the meanwhile, those large datasets/slices can be analyzed with Python
, which doesn’t have this limitation.
(2) Rinlinedfuns.h:537 不支持长向量
数据整合(IntegrateData)、细胞分群(FindClusters)、subset时遇到的错误:long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
- https://github.com/satijalab/seurat/issues/4380
- https://github.com/satijalab/seurat/issues/5278
- https://github.com/satijalab/seurat/issues/7340
- https://github.com/satijalab/seurat/issues/7987
2. 使用Seurat4的 leiden 算法: 要调用py包
数据量足够大时,要设置 method = “igraph”,这会导致速度极慢,原来2分钟,现在要30分钟。
library("reticulate")
use_python("/home/wangjl/soft/python3/python-3.10.14/bin/python3", required = T)
py_config()
library(Seurat)
sce1_alt4 <- FindClusters(sce1, resolution = 0.2, algorithm =4, method = "igraph")
# 报错:Error in unlist(object) :
# long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
# In addition: There were 50 or more warnings (use warnings() to see the first 50)
warnings()
#Warning messages:
# 1: In asMethod(object) :
# sparse->dense coercion: allocating vector of size 198.8 GiB
#2: In deparse(expr, width.cutoff, ...) :
# NAs introduced by coercion to integer range
#3: In paste(deparse(expr, width.cutoff, ...), collapse = collapse) :
# NAs introduced by coercion to integer range
#4: In cmpfun(f) : NAs introduced by coercion to integer range
#5: In paste0("L", idx) : NAs introduced by coercion to integer range
#6: In paste0("L", idx) : NAs introduced by coercion to integer range
#
# Github issue: Leiden 实现有问题,只能用小数据集,大数据集需要是指定很慢的 method = "igraph"
# https://github.com/satijalab/seurat/issues/7340
#9:07->9:37 res=0.2 和 算法1 没啥变化
DimPlot(sce1, label=T)
DimPlot(sce1_alt4, label=T)
saveRDS(sce1_alt4, paste0(outputRoot, keyword, "_02_1_Leiden.Seurat.Rds"))
3.从R中导出RSS矩阵,py分群后输出文件再导入R
(0) Seurat 怎么包装的py包,重点关注参数设置
https://zhuanlan.zhihu.com/p/521391359
else if (algorithm == 4) {
ids <- RunLeiden(
object = object,
method = method,
partition.type = "RBConfigurationVertexPartition",
initial.membership = initial.membership,
node.sizes = node.sizes,
resolution.parameter = r,
random.seed = random.seed,
n.iter = n.iter
)
}
#
(1) 从R导出邻接图到文件
# 计算邻接图(如果尚未计算)
seurat_obj <- FindNeighbors(seurat_obj)
# 提取邻接矩阵
adjacency_matrix <- seurat_obj@graphs$RNA_snn # 这里 'RNA_snn' 是默认的图名称
# 将邻接矩阵转换为数据框并导出:报错!
# write.csv(as.data.frame(as.matrix(adjacency_matrix)), paste0(outputRoot, keyword, "_adjacency_matrix.csv") )
# Error: cannot allocate vector of size 198.8 Gb 服务器500G内存,但是现在剩余不够200G了
# 导出为稀疏矩阵格式
library(Matrix)
writeMM(adjacency_matrix, paste0(outputRoot, keyword, "_adjacency_matrix.mtx") )
# /home/wangjl/data4/others/hanlu/output/BCMA3/GSE2719150_adjacency_matrix.mtx
# 导出细胞名称(列名)
writeLines(colnames(adjacency_matrix), paste0(outputRoot, keyword, "_cells.csv") )
# /home/wangjl/data4/others/hanlu/output/BCMA3/GSE2719150_cells.csv
(2) 在 Python 中使用 leidenalg 包进行细胞分群,并导出结果到csv
import pandas as pd
from scipy.io import mmread
import numpy as np
import leidenalg
import igraph as ig
def time_now():
import time
return "["+time.strftime("%Y/%m/%d %H:%M:%S", time.localtime())+"]"
# Python 包对细胞分群
# v0.2 add msg
# v0.3 add time to msg
# RNA_snnFle: 输入1 邻接矩阵: RNA_snn 的稀疏矩阵格式 adjacency_matrix.mtx
# cellidFile: 输入2 细胞名字(无标题): cells.csv #其实没啥用,就是为了输入、输出时凑形式列
# outputFile: 输出csv文件,单列带标题(Leiden): cell_clusters.csv
def py_FindClusters(RNA_snnFle, cellidFile, outputFile, resolution=0.8, UNDIRECTED=True):
#1. 读取稀疏邻接矩阵
sparse_matrix = mmread(RNA_snnFle).tocsc()
# 读取细胞名称
cells = pd.read_csv(cellidFile, header=None).squeeze().values
# 确保细胞数量与稀疏矩阵的行数匹配
if len(cells) != sparse_matrix.shape[0]:
raise ValueError("细胞名称数量与稀疏矩阵的行数不匹配。")
print(time_now(), ">1.Read File: ok")
#2. 创建图
# 获取边的索引和权重
rows, cols = sparse_matrix.nonzero()
weights = sparse_matrix.data
if UNDIRECTED:
# 创建无向图
g = ig.Graph()
g.add_vertices(sparse_matrix.shape[0]) # 添加细胞作为顶点
g.add_edges(zip(rows, cols)) # 添加边
g.es['weight'] = weights # 添加边权重
print(time_now(), ">2.Graph: UNDIRECTED ok")
else:
#有向图
g = ig.Graph.DataFrame(pd.DataFrame({'source': rows, 'target': cols, 'weight': weights}))
print(time_now(), ">2.Graph: DIRECTED ok")
#3. 使用 Leiden 算法进行聚类: 大概2min
partition = leidenalg.find_partition(g, leidenalg.RBConfigurationVertexPartition, resolution_parameter=resolution)
print(time_now(), ">3.partition: ok")
#4. 将聚类结果添加到 DataFrame 中
cell_clusters = pd.DataFrame(index=cells)
cell_clusters['leiden'] = partition.membership
# 导出细胞分类结果
cell_clusters.to_csv(outputFile)
print(time_now(), ">4.saved to: ", outputFile)
#
pre_fix="/home/wangjl/data4/others/hanlu/output/BCMA3/GSE2719150_"
RNA_snn=pre_fix+"adjacency_matrix.mtx"
cellid=pre_fix+"cells.csv"
outputFile=pre_fix +"cell_clusters.csv"
#
py_FindClusters(RNA_snn, cellid, pre_fix+"res0.6_cell_clusters.csv", resolution=0.8)
# 大概2min
(3) 细胞分群结果导入到R
cell_clusters <- read.csv( paste0(outputRoot, keyword, "_cell_clusters.csv"), row.names = 1)
sce1$leiden <- cell_clusters$leiden # 假设列名为 'leiden'
# 可视化
table(sce1$leiden)
DimPlot(sce1, label = T, group.by = "leiden") + ggtitle("Leiden, res=0.8")
Ref
- https://github.com/vtraag/leidenalg
- http://blog.dawneve.cc/index.php?k=scSeq&id=2_1#0
- https://zhuanlan.zhihu.com/p/712714461