用Python做candidate genes的随机性检验+计算p值

最近在做的一个小工程,很多人都向我提了这个建议,即当一个基因组中包含的基因过多的时候,如果通过几个生物学重复来从overlapped的基因列表中筛选目标基因,会面临一个问题,就是无法确保这样选出的一个基因到底是真的“目标基因”,还是在那一堆基因里碰巧被选中的?

为了排除这种随机性,需要通过计算机模拟来生成一个基因overlap次数的分布,然后通过p值来判断这个基因是否通过了这个检验。

参考的文献:

Repeated Genetic Targets of Natural Selection Underlying Adaptation of Fishes to Changing Salinity | Integrative and Comparative Biology | Oxford Academic (oup.com)

文献里的图例:

example.png

p值的定义

The P-value is the probability of seeing a sample proportion at least as extreme as the one observed from the data if the null hypothesis is true.

https://courses.lumenlearning.com/wm-concepts-statistics/chapter/hypothesis-test-for-a-population-proportion-2-of-3

P.S. 上面这篇文献的随机检验设计得非常好,完全就是这里p值定义的教科书,对巩固基础概念的理解很有用。

代码思路

输入文件

这里我可以手动准备的输入文件是如下的几个:

整个基因组上所有基因的名称列表

1
2
3
chr1	1	100	gene_ID1
chr1 201 300 gene_ID2
chr1 301 500 gene_ID3

想要进行检验的目标基因列表以及它们在实际replicate中重复出现的次数(csv或用tab分隔的文本文档)

1
2
3
gene_ID1,2
gene_ID2,2
gene_ID3,3

输出文件

我想要得到的输出文件是以每个目标基因为行名,以重复出现次数(overlapping time)为列名的表格,例如:

1
2
3
4
gene,count0,count1,count2,count3
gene_ID1,8300,1674,24,2
gene_ID2,8321,1667,11,1
gene_ID3,8254,1724,22,0

代码的整体布局

读入输入文件→创建空输出表格→模拟随机采样三次(实际生物学重复有几次这里就重现几次)→循环上一步一百万次,每次循环中一个目标基因被随机采样到几个重复,就在输出表格的相应位置累计加1→按多种情况计算p值→写入输出文件

实际代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
import random
import pandas
import numpy


# Input data reading
totalgene = pandas.read_table(r'gene_all.bed', usecols=[3], header=None) # Read in the 4th column of ncbi gene list (gene names)
querygene = pandas.read_csv(r'querygene_with_count.csv', usecols=[0], header=None) # Read in the candidate gene names (noted as "query genes" here)
querygenecount = pandas.read_csv(r'querygene_with_count.csv', usecols=[1], header=None) # Read in the numbers that one candidate gene appears in all 3 replicates (noted as "query genes count" here)


# Format converting
totalgene_array = numpy.array(totalgene.stack()) # Convert the gene list into array using pandas
totalgene_list = totalgene_array.tolist() # Convert to list format
querygene_array = numpy.array(querygene.stack())
querygene_list = querygene_array.tolist()
querygenecount_array = numpy.array(querygenecount.stack())
querygenecount_list = querygenecount_array.tolist()


# Creating empty output dataframe
df = pandas.DataFrame()
df.insert(loc=0, column='gene', value=querygene_list)
df.insert(loc=1, column='querygenecount', value=querygenecount_list)
df.insert(loc=2, column='count0', value=0)
df.insert(loc=3, column='count1', value=0)
df.insert(loc=4, column='count2', value=0)
df.insert(loc=5, column='count3', value=0)
df.insert(loc=6, column='pval', value=0)
print (df)


# Simulating random candidate genes selection for 1,000,000 times
i = 0
while i < 1000000:
R1 = random.sample(totalgene_list, 100) # Simulate R1 candidate genes selection
R2 = random.sample(totalgene_list, 200) # Simulate R2 candidate genes selection
R3 = random.sample(totalgene_list, 300) # Simulate R3 candidate genes selection
R1R2 = [a for a in R1 if a in R2] # Overlapping genes between R1 and R2
R1R3 = [b for b in R1 if b in R3] # Overlapping genes between R1 and R3
R2R3 = [c for c in R2 if c in R3] # Overlapping genes between R2 and R3
R1R2R3 = [d for d in R1R2 if d in R2R3] # Overlapping genes among R1, R2 and R3
setR1 = set(R1) # Converting list format into set to do calculations
setR2 = set(R2)
setR3 = set(R3)
set_totalgene = set(totalgene_list)
selected0 = list(set_totalgene - setR1 - setR2 - setR3)
#selected1 = [f for f in R1 or R2 or R3]
selected2 = [e for e in R1R2 or R1R3 or R2R3]


# Calculating number of times that the query genes appear in randomly selected gene lists, and making changes to output file
df.count0 = numpy.where(df.gene.isin(selected0), df.count0+1, df.count0)
#df.count1 = numpy.where(df.gene.isin(selected1), df.count1+1, df.count1)
df.count2 = numpy.where(df.gene.isin(selected2), df.count2+1, df.count2)
df.count3 = numpy.where(df.gene.isin(R1R2R3), df.count3+1, df.count3)

i += 1


# Calculating sppearance time for those genes that were only selected once in each simulation
df.count1 = 1000000 - df.count0 - df.count2 - df.count3


# Calculating p-values for each gene based on their observed numbers of appearance in empirical data
for oricount in df.querygenecount:
if oricount == 2:
df.pval = ((df.count2/1000000) + (df.count3/1000000))
elif oricount == 3:
df.pval = df.count3/1000000
df = df.round({'pval':5}) # Keeping certain number of decimal places


# Output result into either standard output or csv file
#print (df)
df.to_csv('randomisation_test_distribution_1000000_pval.csv')


P.S. 这里算p值的步骤不太完善,两种不同情况无法区分开,所以输出后还要手动修改重新算一次。