用Python做candidate genes的随机性检验+计算p值

发表于 2023-12-13 分类于生信分析

当基因组中包含的基因过多时，对多个生物学重复中筛选出的candidate genes进行随机分布检验，并用相应的p值判断其显著性（算p值功能待完善）。

最近在做的一个小工程，很多人都向我提了这个建议，即当一个基因组中包含的基因过多的时候，如果通过几个生物学重复来从overlapped的基因列表中筛选目标基因，会面临一个问题，就是无法确保这样选出的一个基因到底是真的“目标基因”，还是在那一堆基因里碰巧被选中的？

为了排除这种随机性，需要通过计算机模拟来生成一个基因overlap次数的分布，然后通过p值来判断这个基因是否通过了这个检验。

参考的文献：

Repeated Genetic Targets of Natural Selection Underlying Adaptation of Fishes to Changing Salinity | Integrative and Comparative Biology | Oxford Academic (oup.com)

文献里的图例：

p值的定义

The P-value is the probability of seeing a sample proportion at least as extreme as the one observed from the data if the null hypothesis is true.

https://courses.lumenlearning.com/wm-concepts-statistics/chapter/hypothesis-test-for-a-population-proportion-2-of-3

P.S. 上面这篇文献的随机检验设计得非常好，完全就是这里p值定义的教科书，对巩固基础概念的理解很有用。

代码思路

输入文件

这里我可以手动准备的输入文件是如下的几个：

整个基因组上所有基因的名称列表

1
2
3

chr1	1	100	gene_ID1
chr1	201	300	gene_ID2
chr1	301	500	gene_ID3

想要进行检验的目标基因列表以及它们在实际replicate中重复出现的次数（csv或用tab分隔的文本文档）

1
2
3

gene_ID1,2
gene_ID2,2
gene_ID3,3

输出文件

我想要得到的输出文件是以每个目标基因为行名，以重复出现次数（overlapping time）为列名的表格，例如：

gene,count0,count1,count2,count3
gene_ID1,8300,1674,24,2
gene_ID2,8321,1667,11,1
gene_ID3,8254,1724,22,0

代码的整体布局

读入输入文件→创建空输出表格→模拟随机采样三次（实际生物学重复有几次这里就重现几次）→循环上一步一百万次，每次循环中一个目标基因被随机采样到几个重复，就在输出表格的相应位置累计加1→按多种情况计算p值→写入输出文件

实际代码

import random
import pandas
import numpy


# Input data reading
totalgene = pandas.read_table(r'gene_all.bed', usecols=[3], header=None)  # Read in the 4th column of ncbi gene list (gene names)
querygene = pandas.read_csv(r'querygene_with_count.csv', usecols=[0], header=None)  # Read in the candidate gene names (noted as "query genes" here)
querygenecount = pandas.read_csv(r'querygene_with_count.csv', usecols=[1], header=None)  # Read in the numbers that one candidate gene appears in all 3 replicates (noted as "query genes count" here)


# Format converting
totalgene_array = numpy.array(totalgene.stack())  # Convert the gene list into array using pandas
totalgene_list = totalgene_array.tolist()  # Convert to list format
querygene_array = numpy.array(querygene.stack())
querygene_list = querygene_array.tolist()
querygenecount_array = numpy.array(querygenecount.stack())
querygenecount_list = querygenecount_array.tolist()


# Creating empty output dataframe
df = pandas.DataFrame()
df.insert(loc=0, column='gene', value=querygene_list)
df.insert(loc=1, column='querygenecount', value=querygenecount_list)
df.insert(loc=2, column='count0', value=0)
df.insert(loc=3, column='count1', value=0)
df.insert(loc=4, column='count2', value=0)
df.insert(loc=5, column='count3', value=0)
df.insert(loc=6, column='pval', value=0)
print (df)


# Simulating random candidate genes selection for 1,000,000 times
i = 0
while i < 1000000:
    R1 = random.sample(totalgene_list, 100)  # Simulate R1 candidate genes selection
    R2 = random.sample(totalgene_list, 200)  # Simulate R2 candidate genes selection
    R3 = random.sample(totalgene_list, 300)  # Simulate R3 candidate genes selection
    R1R2 = [a for a in R1 if a in R2]  # Overlapping genes between R1 and R2
    R1R3 = [b for b in R1 if b in R3]  # Overlapping genes between R1 and R3
    R2R3 = [c for c in R2 if c in R3]  # Overlapping genes between R2 and R3
    R1R2R3 = [d for d in R1R2 if d in R2R3]  # Overlapping genes among R1, R2 and R3
    setR1 = set(R1)  # Converting list format into set to do calculations
    setR2 = set(R2)
    setR3 = set(R3)
    set_totalgene = set(totalgene_list)
    selected0 = list(set_totalgene - setR1 - setR2 - setR3)
    #selected1 = [f for f in R1 or R2 or R3]
    selected2 = [e for e in R1R2 or R1R3 or R2R3]


    # Calculating number of times that the query genes appear in randomly selected gene lists, and making changes to output file
    df.count0 = numpy.where(df.gene.isin(selected0), df.count0+1, df.count0)
    #df.count1 = numpy.where(df.gene.isin(selected1), df.count1+1, df.count1)
    df.count2 = numpy.where(df.gene.isin(selected2), df.count2+1, df.count2)
    df.count3 = numpy.where(df.gene.isin(R1R2R3), df.count3+1, df.count3)
    
    i += 1


# Calculating sppearance time for those genes that were only selected once in each simulation
df.count1 = 1000000 - df.count0 - df.count2 - df.count3


# Calculating p-values for each gene based on their observed numbers of appearance in empirical data
for oricount in df.querygenecount:
    if oricount == 2:
        df.pval = ((df.count2/1000000) + (df.count3/1000000))
    elif oricount == 3:
        df.pval = df.count3/1000000
df = df.round({'pval':5})  # Keeping certain number of decimal places


# Output result into either standard output or csv file
#print (df)
df.to_csv('randomisation_test_distribution_1000000_pval.csv')

P.S. 这里算p值的步骤不太完善，两种不同情况无法区分开，所以输出后还要手动修改重新算一次。