转录组基本分析流程

发表于 2022-01-26 分类于生信分析

转录组差异表达分析的最基本流程和命令行整理。

查询本机线程数与核数

因为刚接触生信的时候年少无知（？）不敢乱用组里的服务器，所以下面所有转录组分析我都是用自己的电脑跑的，如果像我一样在本地运行Linux系统一定要注意不要超过本机最高的线程数！

打开cmd，输入以下命令：

1
2
3

wmic
cpu get numberofcores #获取核数
cpu get numberoflogicalprocessors #获取线程数

rawdata质量检测

（一般需要先组装参考基因组，但由于当时已经有了参考基因组，这部分的转录组分析直接从reads的比对开始）

1	fastqc -o /path/to/output/ /input/data_1.fq /input/data_2.fq

在/path/to/output/文件夹里会生成每个样品的指控报告

将gff注释转成gtf格式

（使用cufflinks的gffread功能）

1	gffread /path/to/ann.gff -T -o /path/to/ann.gtf

构建index文件

（使用hisat2软件，事先建一个具有可读写权限的文件夹/path/to/index并在其中操作）

1
2
3

hisat2_extract_exons.py ann.gtf > exons.txt #提取外显子信息到名叫exons的txt文档
hisat2_extract_splice_sites.py ann.gtf > ss.txt #提取可变剪贴信息到名叫ss的txt文档
hisat2-build -p 2 --ss ss.txt --exon exons.txt refseq.fa data #构建index索引，里面的小文件命名为data

将样品的reads比对到index索引中

（在刚刚建立的index/data文件夹里使用hisat2软件）

1
2
3

seqkit fq2fa data_1.fq -o data_1.fa #用seqkit软件将fq转换为fa
seqkit fq2fa data_2.fq -o data_2.fa
hisat2 -f -x /path/to/index/data -1 /path/to/data_1.fa -2 /path/to/data_2.fa -S /path/to/out.sam

转换sam文件到bam文件

（节省存储空间，使用samtools）

1	samtools view -bS out.sam > out.bam

对bam文件进行排序

（因为是双端测序，所以需要按照基因名称进行排序，使用samtools sort功能）

1	samtools sort out.bam -o out_sorted.bam

对排序后的bam文件进行索引

（使用samtools index功能，生成bai格式文件）

1	samtools index index_out_sorted.bam

查看bam文件比对效率

（Windows下使用IGV软件进行可视化）

对回帖bam文件进行质量评估

（使用samtools flagstat功能）

1 2	samtools flagstat out_sorted.bam > out_sorted.flagstat cat out_sorted.flagstat

对bam文件进行counts计数

（使用htseq软件）

1	htseq-count -s no -r name -f bam -i transcript_id /path/to/out_sorted.bam /path/to/ann.gtf >/path/to/matrix.count 2> /path/to/counts.log

RStudio安装Bioconductor

1
2
3

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.10")

合并表达矩阵

（使用RStudio，以四个样本为例）

> control1<-read.table("1.count",sep = "\t",col.names = c("transcript_id","control1"))
> head(control1)
> control2<-read.table("2.count",sep = "\t",col.names = c("transcript_id","control2"))
> treat1<-read.table("3.count",sep = "\t",col.names = c("transcript_id","treat1"))
> treat2<-read.table("4.count",sep = "\t",col.names = c("transcript_id","treat2"))
> raw_count <- merge(merge(control1, control2, by="transcript_id"), merge(treat1, treat2, by="transcript_id"))
# merge()函数只能从两个数据框中选择，不能超过两个，如果样本重复超过2个，需要进行多次merge操作
> head(raw_count)
> tail(raw_count)
> raw_count_filt <- raw_count[-1:-5,]
> head(raw_count_filt)
> readcount<-raw_count_filt[ ,-1]
> write.csv(readcount, file='readcount.csv')

差异表达分析

> library(tidyverse)
> library(DESeq2)
> mycounts<-read.csv("readcount+.csv")
> head(mycounts)
> rownames(mycounts)<-mycounts[,1]
> mycounts<-mycounts[,-1] #清掉第一列无用的序号
> head(mycounts)
> condition <- factor(c(rep("control",2),rep("treat",2)), levels = c("control","treat"))
> condition
> colData <- data.frame(row.names=colnames(mycounts), condition)
> colData
> dds <- DESeqDataSetFromMatrix(mycounts, colData, design= ~ condition)
> dds <- DESeq(dds)
> dds
> res = results(dds, contrast=c("condition", "control", "treat"))
> res= results(dds)
> res = res[order(res$pvalue),]
> head(res)
> summary(res) #统计差异表达分析结果

out of 5403 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 63, 1.2%
LFC < 0 (down)     : 15, 0.28%
outliers [1]       : 0, 0%
low counts [2]     : 4503, 83%
(mean count < 1502)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> write.csv(res,file="hahaha.csv")
> table(res$padj<0.05)
FALSE  TRUE 
  878    22

绘制PCA图

（在已完成上面差异表达基因分析的基础上进行）

1 2	> vsdata <- vst(dds, blind=FALSE) > plotPCA(vsdata, intgroup="condition")

绘制热图

（在已完成上面差异表达基因分析的基础上进行）

> library("pheatmap")
> select<-order(rowMeans(counts(dds, normalized = TRUE)), decreasing = TRUE)[1:20]
> df <- as.data.frame(colData(dds)[,c("condition","sizeFactor")])
> ntd <- normTransform(dds)
> pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE, cluster_cols=FALSE, annotation_col=df)