基于R的单细胞分析岌岌可危: 数据太大时，建议用python | 手动py leiden细胞分群后导回R

biomooc

已于 2024-12-14 14:53:06 修改

阅读量623

点赞数 6

分类专栏：单细胞 Python R 文章标签： r语言 python

于 2024-12-10 09:44:37 首次发布

本文链接：https://blog.csdn.net/wangjunliang/article/details/144364172

版权

R 同时被 3 个专栏收录

79 篇文章

订阅专栏

单细胞

19 篇文章

订阅专栏

Python

4 篇文章

订阅专栏

1. R的上限决定于R本身和R基础c++包

(1) 有些问题已经碰到了R的基础包的上限，比如稀疏矩阵：

R long vectors not supported yet: memory.c:3888 #1095, https://github.com/chanzuckerberg/cellxgene-census/issues/1095

解释是：稀疏矩阵 dgCMatrix 包的上限达到了。在Seurat V5中替换为测试中的Census包。或者数据太大时，建议用python。
the reason you see this error is due to a limitation on the size of sparse matrices forced by dgCMatrix, which is the default sparse matrix class used by Seurat. See (for example) satijalab/seurat#4380 for more details. The dataset you’re trying to query is large enough to hit that limit and would fail a Seurat conversion even outside the Census.

This limitation can be removed by using Seurat v5, since it allows to use a sparse matrix class that is not dgCMatrix. Currently Seurat v5 isn’t supported by the Census or TileDB-SOMA, which is the backend library used by the Census, although it’s on the roadmap. We will publish an update when it will be available. In the meanwhile, those large datasets/slices can be analyzed with Python, which doesn’t have this limitation.

(2) Rinlinedfuns.h:537 不支持长向量

数据整合(IntegrateData)、细胞分群(FindClusters)、subset时遇到的错误：long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537

https://github.com/satijalab/seurat/issues/4380
https://github.com/satijalab/seurat/issues/5278
https://github.com/satijalab/seurat/issues/7340
https://github.com/satijalab/seurat/issues/7987

2. 使用Seurat4的 leiden 算法: 要调用py包

数据量足够大时，要设置 method = “igraph”，这会导致速度极慢，原来2分钟，现在要30分钟。

library("reticulate")
use_python("/home/wangjl/soft/python3/python-3.10.14/bin/python3", required = T)
py_config()

library(Seurat)
sce1_alt4 <- FindClusters(sce1, resolution = 0.2, algorithm =4, method = "igraph")
# 报错：Error in unlist(object) : 
#   long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
# In addition: There were 50 or more warnings (use warnings() to see the first 50)
warnings()
#Warning messages:
#  1: In asMethod(object) :
#  sparse->dense coercion: allocating vector of size 198.8 GiB
#2: In deparse(expr, width.cutoff, ...) :
#  NAs introduced by coercion to integer range
#3: In paste(deparse(expr, width.cutoff, ...), collapse = collapse) :
#  NAs introduced by coercion to integer range
#4: In cmpfun(f) : NAs introduced by coercion to integer range
#5: In paste0("L", idx) : NAs introduced by coercion to integer range
#6: In paste0("L", idx) : NAs introduced by coercion to integer range
#
# Github issue: Leiden 实现有问题，只能用小数据集，大数据集需要是指定很慢的 method = "igraph"
# https://github.com/satijalab/seurat/issues/7340
#9:07->9:37 res=0.2 和 算法1 没啥变化

DimPlot(sce1, label=T)
DimPlot(sce1_alt4, label=T)

saveRDS(sce1_alt4, paste0(outputRoot, keyword, "_02_1_Leiden.Seurat.Rds"))

3.从R中导出RSS矩阵，py分群后输出文件再导入R

(0) Seurat 怎么包装的py包，重点关注参数设置

https://zhuanlan.zhihu.com/p/521391359
  else if (algorithm == 4) {
	ids <- RunLeiden(
	  object = object,
	  method = method,
	  partition.type = "RBConfigurationVertexPartition",
	  initial.membership = initial.membership,
	  node.sizes = node.sizes,
	  resolution.parameter = r,
	  random.seed = random.seed,
	  n.iter = n.iter
	)
  }
#

(1) 从R导出邻接图到文件

# 计算邻接图（如果尚未计算）
seurat_obj <- FindNeighbors(seurat_obj)

# 提取邻接矩阵
adjacency_matrix <- seurat_obj@graphs$RNA_snn  # 这里 'RNA_snn' 是默认的图名称

# 将邻接矩阵转换为数据框并导出：报错！
	# write.csv(as.data.frame(as.matrix(adjacency_matrix)), paste0(outputRoot, keyword, "_adjacency_matrix.csv") )
	# Error: cannot allocate vector of size 198.8 Gb 服务器500G内存，但是现在剩余不够200G了

# 导出为稀疏矩阵格式
library(Matrix)
writeMM(adjacency_matrix, paste0(outputRoot, keyword, "_adjacency_matrix.mtx") )
# /home/wangjl/data4/others/hanlu/output/BCMA3/GSE2719150_adjacency_matrix.mtx

# 导出细胞名称（列名）
writeLines(colnames(adjacency_matrix), paste0(outputRoot, keyword, "_cells.csv") )
# /home/wangjl/data4/others/hanlu/output/BCMA3/GSE2719150_cells.csv

(2) 在 Python 中使用 leidenalg 包进行细胞分群，并导出结果到csv

import pandas as pd
from scipy.io import mmread
import numpy as np
import leidenalg
import igraph as ig

def time_now():
  import time
  return "["+time.strftime("%Y/%m/%d %H:%M:%S", time.localtime())+"]"

# Python 包对细胞分群
# v0.2 add msg
# v0.3 add time to msg
# RNA_snnFle: 输入1 邻接矩阵: RNA_snn 的稀疏矩阵格式 adjacency_matrix.mtx
# cellidFile: 输入2 细胞名字(无标题): cells.csv #其实没啥用，就是为了输入、输出时凑形式列
# outputFile: 输出csv文件，单列带标题(Leiden): cell_clusters.csv
def py_FindClusters(RNA_snnFle, cellidFile, outputFile, resolution=0.8, UNDIRECTED=True):
	#1. 读取稀疏邻接矩阵
	sparse_matrix = mmread(RNA_snnFle).tocsc()

	# 读取细胞名称
	cells = pd.read_csv(cellidFile, header=None).squeeze().values

	# 确保细胞数量与稀疏矩阵的行数匹配
	if len(cells) != sparse_matrix.shape[0]:
		raise ValueError("细胞名称数量与稀疏矩阵的行数不匹配。")
	print(time_now(), ">1.Read File: ok")
	
	#2. 创建图
	# 获取边的索引和权重
	rows, cols = sparse_matrix.nonzero()
	weights = sparse_matrix.data

	if UNDIRECTED:
		# 创建无向图
		g = ig.Graph()
		g.add_vertices(sparse_matrix.shape[0])  # 添加细胞作为顶点
		g.add_edges(zip(rows, cols))  # 添加边
		g.es['weight'] = weights  # 添加边权重
		print(time_now(), ">2.Graph: UNDIRECTED ok")
	else:
		#有向图
		g = ig.Graph.DataFrame(pd.DataFrame({'source': rows, 'target': cols, 'weight': weights}))
		print(time_now(), ">2.Graph: DIRECTED ok")

	#3. 使用 Leiden 算法进行聚类: 大概2min
	partition = leidenalg.find_partition(g, leidenalg.RBConfigurationVertexPartition, resolution_parameter=resolution)
	print(time_now(), ">3.partition: ok")

	#4. 将聚类结果添加到 DataFrame 中
	cell_clusters = pd.DataFrame(index=cells)
	cell_clusters['leiden'] = partition.membership

	# 导出细胞分类结果
	cell_clusters.to_csv(outputFile)
	print(time_now(), ">4.saved to: ", outputFile)
#
pre_fix="/home/wangjl/data4/others/hanlu/output/BCMA3/GSE2719150_"
RNA_snn=pre_fix+"adjacency_matrix.mtx"
cellid=pre_fix+"cells.csv"
outputFile=pre_fix +"cell_clusters.csv"
#
py_FindClusters(RNA_snn, cellid, pre_fix+"res0.6_cell_clusters.csv", resolution=0.8)
# 大概2min

(3) 细胞分群结果导入到R

cell_clusters <- read.csv( paste0(outputRoot, keyword, "_cell_clusters.csv"), row.names = 1)
sce1$leiden <- cell_clusters$leiden  # 假设列名为 'leiden'

# 可视化
table(sce1$leiden)
DimPlot(sce1, label = T, group.by = "leiden") + ggtitle("Leiden, res=0.8")