
Pingouin: a statistics package built on pandas and NumPy

大邓和他的Python | 2020-07-28 13:26

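Pingouin is a Python 3 statistics package built on pandas and NumPy. Its main features include:

  • Analysis of variance (ANOVA)
  • Multiple linear regression
  • Mediation analysis
  • Chi-squared tests
  • Q-Q plots
  • Bayes factors
  • Reliability and validity tests
  • and more

I am a statistics novice and do not understand all of these myself, and many more functions are not listed here. Readers with a stronger statistics background can browse the full API at https://pingouin-stats.org/api.html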

Installation

pip3 install pingouin

Quick start

Generate the sample data x and y

import numpy as np

# Fix the random seed so that every run produces the same data
np.random.seed(666)

n = 30
mean = [4, 5]
cov = [(1, 0.6), (0.6, 1)]

x, y = np.random.multivariate_normal(mean, cov, n).T

x
array([3.04817645, 2.54387965, 4.56033188, 4.40504338, 3.77876203,
3.87177128, 3.4546112 , 4.47317551, 5.23133856, 5.40273745,
5.19344217, 3.37061786, 3.23980982, 2.85574177, 4.67728276,
4.31935242, 4.39440207, 3.87458876, 4.91426293, 3.13673286,
3.73459839, 4.18708647, 5.48558345, 3.7066784 , 3.73400287,
3.49664637, 3.95954844, 2.61545452, 5.11352964, 5.62666503])
y
array([4.47747109, 4.35695696, 5.46239455, 4.56091782, 4.07534588,
4.03904897, 3.79549165, 5.06121364, 5.71635355, 6.60772697,
6.94890455, 5.13347618, 5.41207983, 3.38254684, 5.49705058,
5.93394729, 4.65224366, 4.59491971, 5.17926604, 4.25844527,
5.72809738, 5.14997732, 5.27606588, 4.94570454, 6.02889647,
5.85451666, 4.90231286, 4.69242625, 4.69367432, 6.71644528])
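
To see why x and y come out correlated, it helps to turn the covariance matrix into a correlation. The sketch below assumes the covariance matrix is [(1, 0.6), (0.6, 1)], as in the data-generation step; the helper name `cov_to_corr` is made up for illustration and is not part of NumPy or Pingouin.

```python
import math


def cov_to_corr(cov):
    """Correlation implied by a 2x2 covariance matrix (hypothetical helper)."""
    var_x, cov_xy = cov[0]
    _, var_y = cov[1]
    return cov_xy / math.sqrt(var_x * var_y)


# Assumed covariance matrix: unit variances, covariance 0.6
cov = [(1, 0.6), (0.6, 1)]
rho = cov_to_corr(cov)
print(rho)
```

With unit variances the covariance and the correlation coincide, so a sample of 30 points drawn this way should give a Pearson r somewhere near 0.6.
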
import matplotlib.pyplot as plt

plt.hist(x, bins=10)


plt.hist(y, bins=10)


1. T-test

import pingouin as pg

pg.ttest(x, y)

                T  dof       tail     p-val           CI95%   cohen-d     BF10     power
T-test  -4.597628   58  two-sided  0.000024  [-1.47, -0.58]  1.187102  786.346  0.994771
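
Some of the columns in this output are tied to one another, and checking them by hand is a good way to learn what they mean. The sketch below assumes a classic Student t-test on two independent samples of 30 values each (the sizes used above); it recovers the degrees of freedom and the magnitude of Cohen's d from the reported T value. The variable names are illustrative.

```python
import math

n_x, n_y = 30, 30      # sample sizes of x and y
t_stat = -4.597628     # T value copied from the output above

# Student's t-test with pooled variance: dof = n_x + n_y - 2
dof = n_x + n_y - 2

# For two independent samples, |d| = |T| * sqrt(1/n_x + 1/n_y)
cohen_d = abs(t_stat) * math.sqrt(1 / n_x + 1 / n_y)

print(dof, round(cohen_d, 4))
```

Both results agree with the dof and cohen-d columns printed by pg.ttest.
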

2. Pearson correlation

pg.corr(x, y)

          n        r         CI95%       r2    adj_r2     p-val    BF10     power
pearson  30  0.60149  [0.31, 0.79]  0.36179  0.314515  0.000439  82.116  0.955747
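
The r2 and adj_r2 columns follow directly from r and n. A minimal check, assuming the adjustment uses the denominator n - 3 (an assumption on my part, chosen because it reproduces the value printed above):

```python
n = 30
r = 0.60149            # Pearson r copied from the output above

r2 = r ** 2            # share of the variance in y explained by x
# Adjusted r2, assuming the denominator n - 3 (see the note above)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 3)

print(round(r2, 5), round(adj_r2, 4))
```
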

3. Robust correlation

# Add an outlier
x[5] = 18
# Use Shepherd's pi correlation
pg.corr(x, y, method="shepherd")

           n  outliers         r         CI95%        r2    adj_r2     p-val     power
shepherd  30         1  0.569458  [0.26, 0.77]  0.324283  0.274229  0.001263  0.926066

4. Normality tests

pg.normality(x)

          W      pval  normal
0  0.970533  0.553863    True
pg.normality(y)

          W      pval  normal
0  0.985161  0.939893    True
pg.multivariate_normality(np.column_stack((x, y)))
(True, 0.6257634649268228)

5. Q-Q plot

import numpy as np
import pingouin as pg

np.random.seed(666)

x = np.random.normal(size=50)
ax = pg.qqplot(x, dist='norm')

6. One-way ANOVA

# Load the dataset
df = pg.read_dataset('mixed_anova')
df.sample(10)

       Scores     Time       Group  Subject
142  6.502562  January  Meditation       52
55   5.355380  January     Control       25
70   4.714565     June     Control       10
167  6.586494     June  Meditation       47
169  7.388138     June  Meditation       49
107  5.031982   August  Meditation       47
135  4.837971  January  Meditation       45
163  5.483801     June  Meditation       43
37   5.177205  January     Control        7
4    4.779411   August     Control        4
# Run the ANOVA
aov = pg.anova(data=df,
               dv='Scores',      # dependent variable
               between='Group',  # between-subject factor
               detailed=True)
aov

   Source          SS   DF        MS         F   p-unc       np2
0   Group    5.459963    1  5.459963  5.243656  0.0232  0.028616
1  Within  185.342729  178  1.041251       NaN     NaN       NaN
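
Every column after SS and DF in this ANOVA table can be derived from those two. The sketch below redoes the arithmetic with the sums of squares and degrees of freedom printed above; the variable names are illustrative.

```python
# Values copied from the ANOVA table above
ss_between, df_between = 5.459963, 1
ss_within, df_within = 185.342729, 178

ms_between = ss_between / df_between          # mean square = SS / DF
ms_within = ss_within / df_within
f_value = ms_between / ms_within              # F = MS_between / MS_within
np2 = ss_between / (ss_between + ss_within)   # partial eta-squared

print(round(ms_within, 6), round(f_value, 4), round(np2, 6))
```
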

7. Repeated-measures ANOVA

pg.rm_anova(data=df,
            dv='Scores',
            within='Time',
            subject='Subject',
            detailed=True)

   Source          SS   DF        MS         F     p-unc       np2       eps
0    Time    7.628428    2  3.814214  3.912796  0.022629  0.062194  0.998751
1   Error  115.027023  118  0.974805       NaN       NaN       NaN       NaN
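
In a repeated-measures design the degrees of freedom depend on both the number of conditions and the number of subjects. A small sketch, assuming 3 time points (August, January, June) and 60 subjects; the subject count is my assumption, chosen because it is consistent with the DF column above:

```python
k, n = 3, 60                     # assumed: 3 time points, 60 subjects

df_time = k - 1                  # degrees of freedom of the within factor
df_error = (k - 1) * (n - 1)     # degrees of freedom of the error term

# Sums of squares copied from the table above
ss_time, ss_error = 7.628428, 115.027023
f_value = (ss_time / df_time) / (ss_error / df_error)
np2 = ss_time / (ss_time + ss_error)   # partial eta-squared

print(df_time, df_error, round(f_value, 4), round(np2, 4))
```
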

8. Two-way ANOVA with interaction (mixed design)

# Compute the two-way mixed ANOVA and print it as a table
aov = pg.mixed_anova(data=df,
                     dv='Scores',
                     between='Group',
                     within='Time',
                     subject='Subject',
                     correction=False,
                     effsize="np2")
pg.print_table(aov)
=============
ANOVA SUMMARY
=============

Source          SS    DF1    DF2     MS      F    p-unc    np2      eps
-----------  -----  -----  -----  -----  -----  -------  -----  -------
Group        5.460      1     58  5.460  5.052    0.028  0.080      nan
Time         7.628      2    116  3.814  4.027    0.020  0.065    0.999
Interaction  5.167      2    116  2.584  2.728    0.070  0.045      nan
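
Partial eta-squared can also be recovered from F and the two degrees of freedom, using the identity np2 = F*DF1 / (F*DF1 + DF2). The sketch applies it to the three rows of the summary above. Because the printed F values are rounded to three decimals, the results can only be expected to match to about three decimals.

```python
# (F, DF1, DF2) copied from the ANOVA summary above
rows = {
    "Group": (5.052, 1, 58),
    "Time": (4.027, 2, 116),
    "Interaction": (2.728, 2, 116),
}

np2 = {}
for source, (f_value, df1, df2) in rows.items():
    # np2 = SS_effect / (SS_effect + SS_error) = F*DF1 / (F*DF1 + DF2)
    np2[source] = f_value * df1 / (f_value * df1 + df2)
    print(f"{source:<12}{np2[source]:.3f}")
```
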

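9. Multiple linear regression

# `data` is assumed to be a pandas DataFrame with numeric columns 'X', 'Z' and 'Y';
# the article does not show how it was built
pg.linear_regression(data[['X', 'Z']], data['Y'])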

       names       coef        se          T      pval       r2    adj_r2   CI[2.5%]  CI[97.5%]
0  Intercept   2.916901  1.444715   2.019015  0.053516  0.26855  0.214368  -0.047409   5.881210
1          X   0.610580  0.202261   3.018775  0.005487  0.26855  0.214368   0.195575   1.025584
2          Z  -0.007227  0.192089  -0.037624  0.970264  0.26855  0.214368  -0.401361   0.386907
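
Two relationships hold in every row of a regression table like this one: T is the coefficient divided by its standard error, and the 95% confidence interval is symmetric around the coefficient. A quick check for the X row, using only numbers printed above:

```python
# Row "X" copied from the regression output above
coef, se = 0.610580, 0.202261
ci_low, ci_high = 0.195575, 1.025584

t_value = coef / se                   # T = coef / se
ci_center = (ci_low + ci_high) / 2    # should be (almost) equal to coef
# Half-width of the interval divided by se gives the critical t value used
t_crit = (ci_high - ci_low) / 2 / se

print(round(t_value, 4), round(ci_center, 4), round(t_crit, 3))
```

The implied critical value comes out near 2.05, a little above the 1.96 of the normal approximation, as expected for a small sample.
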

10. Mediation analysis

pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)

       path       coef        se      pval   CI[2.5%]  CI[97.5%]  sig
0     Z ~ X  -0.287032  0.191454  0.145006  -0.679207   0.105142   No
1     Y ~ Z  -0.165299  0.209888  0.437572  -0.595235   0.264637   No
2     Total   0.612654  0.191099  0.003354   0.221205   1.004103  Yes
3    Direct   0.610580  0.202261  0.005487   0.195575   1.025584  Yes
4  Indirect   0.002074  0.042262  0.976000  -0.088619   0.092009   No
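
The rows of the mediation table are linked by simple arithmetic: the indirect effect equals the total effect minus the direct effect. It also equals the product of the X-to-Z path and the coefficient of Z when Y is regressed on both X and Z (the Z row of the regression output in section 9). A minimal check with the printed numbers:

```python
# Coefficients copied from the outputs above
total, direct = 0.612654, 0.610580
a = -0.287032      # path Z ~ X
b = -0.007227      # coefficient of Z in the regression of Y on X and Z (section 9)

indirect_by_difference = total - direct
indirect_by_product = a * b

print(round(indirect_by_difference, 6), round(indirect_by_product, 6))
```

Both routes give about 0.002074, the value in the Indirect row. Its confidence interval spans zero, which is why the table marks the indirect effect as not significant.
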

Pingouin and pandas

Many of Pingouin's statistical functions can be called directly as methods of a pandas.DataFrame, for example:

import pingouin as pg

# Example 1 | ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)

# Example 2 | Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])

# Example 3 | Partial correlation matrix
data.pcorr()

The Pingouin functions available as pandas.DataFrame methods are:

  • pingouin.anova()
  • pingouin.ancova()
  • pingouin.rm_anova()
  • pingouin.mixed_anova()
  • pingouin.welch_anova()
  • pingouin.pairwise_ttests()
  • pingouin.pairwise_tukey()
  • pingouin.pairwise_corr()
  • pingouin.partial_corr()
  • pingouin.pcorr()
  • pingouin.rcorr()
  • pingouin.mediation_analysis()

