Common DataFrame Methods (Part 1) | Machine Learning | "The Road to Learning Python" | Python Technical Forum-380玩彩网官网入口

  • Some common methods

    • head: returns the first five rows by default; pass an integer to choose how many rows to return
      >>> df = pd.DataFrame({'animal':['alligator', 'bee', 'falcon', 'lion',
      ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
      >>> df
            animal
      0  alligator
      1        bee
      2     falcon
      3       lion
      4     monkey
      5     parrot
      6      shark
      7      whale
      8      zebra
      >>> df.head()
            animal
      0  alligator
      1        bee
      2     falcon
      3       lion
      4     monkey
      >>> df.head(3)
            animal
      0  alligator
      1        bee
      2     falcon
    • tail: returns the last five rows by default; pass an integer to choose how many rows to return
      >>> df.tail()
         animal
      4  monkey
      5  parrot
      6   shark
      7   whale
      8   zebra
      >>> df.tail(3)
        animal
      6  shark
      7  whale
      8  zebra
    • shape: the number of rows and columns of the DataFrame
      >>> df.shape
      (9, 1)
    • info: shows the index, column data types, and memory usage
      >>> df.info()
      <class 'pandas.core.frame.DataFrame'>
      RangeIndex: 9 entries, 0 to 8
      Data columns (total 1 columns):
      animal    9 non-null object
      dtypes: object(1)
      memory usage: 152.0 bytes
    • mean: the mean of each column
      >>> df = pd.DataFrame(np.random.rand(5,5))
      >>> df
                0         1         2         3         4
      0  0.987926  0.556055  0.774863  0.926501  0.029973
      1  0.635812  0.698311  0.402425  0.727675  0.048129
      2  0.001094  0.329329  0.364231  0.754038  0.405464
      3  0.975270  0.388988  0.598047  0.355597  0.189753
      4  0.171976  0.334893  0.931219  0.967504  0.323952
      >>> df.mean()
      0    0.554416
      1    0.461516
      2    0.614157
      3    0.746263
      4    0.199454
      dtype: float64
    • count: the number of non-null values in each column
      >>> df.count()
      0    5
      1    5
      2    5
      3    5
      4    5
      dtype: int64
    • max: the maximum of each column
      >>> df.max()
      0    0.987926
      1    0.698311
      2    0.931219
      3    0.967504
      4    0.405464
      dtype: float64
    • min: the minimum of each column
      >>> df.min()
      0    0.001094
      1    0.329329
      2    0.364231
      3    0.355597
      4    0.029973
      dtype: float64
    • median: the median of each column
      >>> df.median()
      0    0.635812
      1    0.388988
      2    0.598047
      3    0.754038
      4    0.189753
      dtype: float64
    • std: the standard deviation of each column (the sample standard deviation, ddof=1, by default)
      >>> df.std()
      0    0.453900
      1    0.161072
      2    0.241820
      3    0.242105
      4    0.165573
      dtype: float64
    • corr: the pairwise correlation coefficients between columns
      >>> df.corr()
                0         1         2         3         4
      0  1.000000  0.517374  0.142778 -0.401999 -0.836540
      1  0.517374  1.000000 -0.262423  0.076483 -0.882558
      2  0.142778 -0.262423  1.000000  0.458608 -0.044045
      3 -0.401999  0.076483  0.458608  1.000000  0.032442
      4 -0.836540 -0.882558 -0.044045  0.032442  1.000000
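
The summary methods above can be cross-checked against a manual computation. The following is a minimal sketch, not part of the original article: the fixed seed, the use of `np.random.default_rng`, and the variable names are assumptions made so the numbers are reproducible from run to run.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator replaces np.random.rand so results are repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 5)))

n_rows, n_cols = df.shape          # (rows, columns)
col_means = df.mean()              # one value per column
col_counts = df.count()            # non-null values per column
col_std = df.std()                 # sample standard deviation (ddof=1)
corr = df.corr()                   # 5x5 matrix of pairwise correlations

# Manual equivalents, computed on the underlying NumPy array.
manual_means = df.sum() / n_rows
manual_std = df.to_numpy().std(axis=0, ddof=1)

print(df.head(2))                  # first two rows only
print(np.allclose(col_means, manual_means), np.allclose(col_std, manual_std))
```

Each of these methods works column by column, which is why the results are indexed by the column labels 0 to 4.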
  • Getting data

    • Slicing can be used with all of the methods below

    • df[column]: selects one column; the result is a Series

      >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],index=['cobra', 'viper', 'sidewinder'],columns=['max_speed', 'shield'])
      >>> df
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df['max_speed']
      cobra         1
      viper         4
      sidewinder    7
      Name: max_speed, dtype: int64
    • df[[column1,column2]]: selects several columns; the result is a DataFrame

      >>> df[['max_speed','shield']]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
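    • df.loc[row_label,column_label]: selects by index label; giving both a row and a column label returns a single element, while giving only a row or only a column returns that row or column as a Series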

      >>> df.loc["cobra","max_speed"]
      1
      >>> df.loc["cobra":,"max_speed":]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df.loc["cobra":,]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      
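    • df.iloc[0,0]: selects by integer position; giving both a row and a column position returns a single element, while giving only a row or only a column returns that row or column as a Series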

      >>> df.iloc[0,0]
      1
      >>> df.iloc[0:,0:]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df.iloc[0:,]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
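    • df.ix[0,0]: combines loc and iloc, accepting either positions or index labels; giving only a row or only a column returns a Series. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so this example only runs on older versions

      >>> df.ix['viper',1:2]
      shield    5
      Name: viper, dtype: int64
      >>> df.ix['viper',0:1]
      max_speed    4
      Name: viper, dtype: int64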
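    • df.values[:,:]: returns the underlying data as a NumPy array, indexed by position; selecting a single row or column returns that row or column as an array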

      >>> df.values[0:,:]
      array([[1, 2],
             [4, 5],
             [7, 8]], dtype=int64)
      >>> df.values[:1,:]
      array([[1, 2]], dtype=int64)
    • df[df[column]>10]: selects the rows that satisfy the condition

      >>> df[df['max_speed']>0]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
    • df.sort_values([columns1,columns2],ascending=[False,True]): sorts by a column in ascending or descending order; with two or more columns, rows are sorted by the first column and ties are broken by the following columns, each with its own ascending flag

      >>> df.sort_values('max_speed')
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df.sort_values('max_speed',ascending=False)
                  max_speed  shield
      sidewinder          7       8
      viper               4       5
      cobra               1       2
      >>> df.sort_values(['max_speed','shield'],ascending=False)
                  max_speed  shield
      sidewinder          7       8
      viper               4       5
      cobra               1       2
    • df.groupby([columns1,columns2]): groups by one or more columns and returns a GroupBy object

      >>> df.groupby('max_speed')
      <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001df64584588>
      >>> df.groupby('max_speed').mean()
                 shield
      max_speed
      1               2
      4               5
      7               8
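
Because `.ix` is unavailable in current pandas releases, the same selections can be written with `loc` and `iloc`. The sketch below is an illustration rather than the author's code: it reuses the cobra/viper/sidewinder frame from above, and the variable names and the `> 3` threshold are assumptions. It also shows a boolean row filter and a sort with a separate direction per column.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

# Row by label, columns by position: translate the positions into labels first.
by_label = df.loc["viper", df.columns[1:2]]

# Purely positional version of the same selection.
by_pos = df.iloc[1, 1:2]

# A boolean mask keeps the rows where the condition is True.
fast = df[df["max_speed"] > 3]

# max_speed descending; ties (none in this data) broken by shield ascending.
ordered = df.sort_values(["max_speed", "shield"], ascending=[False, True])

print(by_label.to_dict(), by_pos.to_dict())
print(list(fast.index), list(ordered.index))
```

Keeping label-based and position-based access in separate calls avoids the ambiguity that `.ix` had when an index itself contained integers.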
  • Data cleaning

    • df.columns = ['a','b','c','d']: assigns new names to all columns
      >>> df.columns=['a','b']
      >>> df
                  a  b
      cobra       1  2
      viper       4  5
      sidewinder  7  8
    • df.rename(mapper,axis): renames row labels or column labels; axis selects rows ('index') or columns ('columns')
      >>> df.rename(index=str,columns={'a':'a','b':'b'})
                  a  b
      cobra       1  2
      viper       4  5
      sidewinder  7  8
      >>> df.rename(str.lower,axis='columns')
                  a  b
      cobra       1  2
      viper       4  5
      sidewinder  7  8
      >>> df.rename({'cobra':'a','viper':'b','sidewinder':'c'},axis='index')
         a  b
      a  1  2
      b  4  5
      c  7  8
    • df.set_index('column_one'): uses the given column as the index
      >>> df = pd.DataFrame({'month': [1, 4, 7, 10],
      ...                    'year': [2012, 2014, 2013, 2014],
      ...                    'sale': [55, 40, 84, 31]})
      >>> df
         month  year  sale
      0      1  2012    55
      1      4  2014    40
      2      7  2013    84
      3     10  2014    31
      >>> df.set_index('month')
             year  sale
      month
      1      2012    55
      4      2014    40
      7      2013    84
      10     2014    31
    • df.reset_index: resets the row index to the default integer index; the old index becomes a column
      >>> df = pd.DataFrame([('bird', 389.0),
      ...                    ('bird', 24.0),
      ...                    ('mammal', 80.5),
      ...                    ('mammal', np.nan)],
      ...                   index=['falcon', 'parrot', 'lion', 'monkey'],
      ...                   columns=('class', 'max_speed'))
      >>> df
               class  max_speed
      falcon    bird      389.0
      parrot    bird       24.0
      lion    mammal       80.5
      monkey  mammal        NaN
      >>> df.reset_index()
          index   class  max_speed
      0  falcon    bird      389.0
      1  parrot    bird       24.0
      2    lion  mammal       80.5
      3  monkey  mammal        NaN
    • df.isnull: checks every element for a missing value and returns a boolean DataFrame of the same shape, True where the value is missing and False otherwise; df.isna is an alias. An empty string does not count as missing
      >>> df = pd.DataFrame({'age': [5, 6, np.nan],
      ...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
      ...                             pd.Timestamp('1940-04-25')],
      ...                    'name': ['alfred', 'batman', ''],
      ...                    'toy': [None, 'batmobile', 'joker']})
      >>> df
         age       born    name        toy
      0  5.0        NaT  alfred       None
      1  6.0 1939-05-27  batman  batmobile
      2  NaN 1940-04-25              joker
      >>> df.isnull()
           age   born   name    toy
      0  False   True  False   True
      1  False  False  False  False
      2   True  False  False  False
      >>> df.isna()
           age   born   name    toy
      0  False   True  False   True
      1  False  False  False  False
      2   True  False  False  False
    • df.notnull: the inverse of isnull; returns True where a value is present and False where it is missing; df.notna is an alias
      >>> df.notna()
           age   born  name    toy
      0   True  False  True  False
      1   True   True  True   True
      2  False   True  True   True
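    • df.dropna(axis): drops every row or column that contains a missing value; axis=0 drops rows, axis=1 drops columns. It returns a new DataFrame and leaves df unchanged
      >>> df.dropna()
         age       born    name        toy
      1  6.0 1939-05-27  batman  batmobile
      >>> df
         age       born    name        toy
      0  5.0        NaT  alfred       None
      1  6.0 1939-05-27  batman  batmobile
      2  NaN 1940-04-25              joker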
    • df.fillna(n): replaces every missing value in the DataFrame with n
      >>> df.fillna('hahaha')
            age                 born    name        toy
      0       5               hahaha  alfred     hahaha
      1       6  1939-05-27 00:00:00  batman  batmobile
      2  hahaha  1940-04-25 00:00:00              joker
    • df.replace('a','b'): replaces every 'a' in the DataFrame with 'b'
      >>> df = pd.DataFrame({'a': [0, 1, 2, 3, 4],
      ...                    'b': [5, 6, 7, 8, 9],
      ...                    'c': ['a', 'b', 'c', 'd', 'e']})
      >>> df.replace(0, 5)
         a  b  c
      0  5  5  a
      1  1  6  b
      2  2  7  c
      3  3  8  d
      4  4  9  e
    • df[['a','b']].astype(type): converts the data type of the selected columns, that is, the dtype of each underlying Series
      >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
      >>> df.rename(index=str, columns={"a": "a", "b": "c"})
         a  c
      0  1  4
      1  2  5
      2  3  6
      >>> df = pd.DataFrame([('bird', 389.0),('bird', 24.0),('mammal', 80.5),('mammal', np.nan)],index=['falcon', 'parrot', 'lion', 'monkey'],columns=('class', 'max_speed'))
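
The `astype` entry above has no matching example, so here is a minimal sketch of how it might fit into a cleaning pass together with `isna` and `fillna`. The small frame, the column names, and the fill value 0 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one integer column, one float column with a gap, one text column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, np.nan, 6.0], "c": ["x", "y", "z"]})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Fill only column "b"; a dict maps column name -> replacement value.
filled = df.fillna({"b": 0})

# Convert selected columns; a dict maps column name -> target dtype.
converted = filled.astype({"a": "float64", "b": "int64"})

# Selecting columns first and then calling astype returns a new DataFrame
# containing just those columns; df itself is not modified.
subset = df[["a"]].astype("float64")

print(missing_per_column.to_dict())
print(converted.dtypes)
```

Casting "b" to an integer type only works here because the gap was filled first; a column that still contains NaN cannot be converted to int64.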
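  • Merging data

    • df.append(df2): appends the rows of df2 to the end of df, aligning on column names; if the two DataFrames have different column names, the result contains all columns and fills the missing entries with NaN. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is used instead
      >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))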
      >>> df
         a  b
      0  1  2
      1  3  4
      >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('ab'))
      >>> df2
         a  b
      0  5  6
      1  7  8
      >>> df.append(df2)
         a  b
      0  1  2
      1  3  4
      0  5  6
      1  7  8
      >>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('bc'))
      >>> df3
         b  c
      0  5  6
      1  7  8
      >>> df.append(df3,sort=True) # sort=True sorts the column labels of the result
           a  b    c
      0  1.0  2  NaN
      1  3.0  4  NaN
      0  NaN  5  6.0
      1  NaN  7  8.0
      >>> df.append(df3,sort=True,ignore_index=True) # ignore_index=True renumbers the row index
           a  b    c
      0  1.0  2  NaN
      1  3.0  4  NaN
      2  NaN  5  6.0
      3  NaN  7  8.0
    • pd.concat([df1,df2],axis=1): concatenates df2 onto df1 along the chosen axis; axis=0 (the default) stacks rows, axis=1 places columns side by side
      >>> df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])
      >>> df1
        letter  number
      0      a       1
      1      b       2
      >>> df2 = pd.DataFrame([['c', 3], ['d', 4]],columns=['letter', 'number'])
      >>> df2
        letter  number
      0      c       3
      1      d       4
      >>> pd.concat([df1,df2])
        letter  number
      0      a       1
      1      b       2
      0      c       3
      1      d       4
      >>> pd.concat([df1,df2],axis=1)
        letter  number letter  number
      0      a       1      c       3
      1      b       2      d       4
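    • df1.join(df2,on=columns,how='outer'): joins the columns of df2 onto df1, matching on the index by default; how selects the join type ('left' by default)
      >>> df = pd.DataFrame({'key': ['k0', 'k1', 'k2', 'k3', 'k4', 'k5'],'a': ['a0', 'a1', 'a2', 'a3', 'a4', 'a5']})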
      >>> df
        key   a
      0  k0  a0
      1  k1  a1
      2  k2  a2
      3  k3  a3
      4  k4  a4
      5  k5  a5
      >>> df2 = pd.DataFrame({'key': ['k0', 'k1', 'k2'],'b': ['b0', 'b1', 'b2']})
      >>> df2
        key   b
      0  k0  b0
      1  k1  b1
      2  k2  b2
      >>> df.join(df2,lsuffix='_caller',rsuffix='_other') # lsuffix is the suffix added to overlapping column names from the left DataFrame; rsuffix does the same for the right
        key_caller   a key_other    b
      0         k0  a0        k0   b0
      1         k1  a1        k1   b1
      2         k2  a2        k2   b2
      3         k3  a3       NaN  NaN
      4         k4  a4       NaN  NaN
      5         k5  a5       NaN  NaN
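    • df1.merge(df2,left_on='column1',right_on='column2'): database-style join of df1 and df2 on the given key columns; overlapping non-key columns get the suffixes _x and _y
      >>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})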
      >>> df1
        lkey  value
      0  foo      1
      1  bar      2
      2  baz      3
      3  foo      5
      >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
      >>> df2
        rkey  value
      0  foo      5
      1  bar      6
      2  baz      7
      3  foo      8
      >>> df1.merge(df2,left_on='lkey',right_on='rkey')
        lkey  value_x rkey  value_y
      0  foo        1  foo        5
      1  foo        1  foo        8
      2  foo        5  foo        5
      3  foo        5  foo        8
      4  bar        2  bar        6
      5  baz        3  baz        7
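
On pandas versions where `DataFrame.append` is no longer available, the append examples above can be expressed with `pd.concat`. The sketch below is one possible translation, not the author's code; the `left` and `right` frames and all variable names are hypothetical. It also contrasts `join`, which matches on the index of the other frame, with `merge`, which matches on columns.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("ab"))
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list("bc"))

# Counterpart of df.append(df3, sort=True, ignore_index=True).
stacked = pd.concat([df, df3], sort=True, ignore_index=True)

# Hypothetical frames sharing a "key" column.
left = pd.DataFrame({"key": ["k0", "k1", "k2"], "a": ["a0", "a1", "a2"]})
right = pd.DataFrame({"key": ["k0", "k1"], "b": ["b0", "b1"]})

# join looks up left["key"] in the index of the other frame,
# so "key" is moved into the index of right first.
joined = left.join(right.set_index("key"), on="key")

# merge matches the "key" columns directly; how="left" keeps every row of left.
merged = left.merge(right, on="key", how="left")

print(stacked)
print(joined)
print(merged)
```

Both `joined` and `merged` keep the row for k2 and leave its `b` entry missing, since `right` has no matching key.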

This article was first published on the 380玩彩网官网入口 website.
