pandas note

数据分析神器Pandas。

教程参考[1] [2] [3].

基本

Pandas 基于两种数据类型,series 和 dataframe。series 是一种一维的数据类型,其中的每个元素都有各自的标签。dataframe 可以看作是一个 series 的字典。

  • Series有一个与数组关联的数据标签,被叫做 索引 。最简单的Series是由一个数组的数据构成:
1
2
3
4
5
6
7
In [4]: obj = Series([4, 7, -5, 3])
In [5]: obj
Out[5]:
0 4
1 7
2 -5
3 3
  • Dataframe
    axis=0表示行,axis=1表示列
    行、列的index,分别为index, columns,例如:
    df.columns
  • loc是根据标签索引;iloc是index索引:
    df.iloc(1) # 第二行
    df.iloc(:,1) # 第二列
    df.loc([0], ['size') # 第一行、「size」列
  • df.head/tail
    df is the abbre. of dataframe.
    Show first 5 line:df.head(5)

  • Changing column labels so you can getting a column by label .

    1
    2
    3
    df.columns = ['water_year','rain_octsep', 'outflow_octsep',
    'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']
    df.rain_octsep # 相当于df['rain_octsep']
  • len(df)

  • filter
    1
    2
    3
    4
    5
    6
    7
    8
    # Using a series of booleans to filter
    df[df.rain_octsep < 1000]
    # Filtering by multiple conditionals
    df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)]
    # 这里不能用 and 关键字,因为会引发操作顺序的问题。必须用 & 和圆括号。
    # Filtering by string methods
    df[df.water_year.str.startswith('199')]
    # 注意,你必须用 .str.[string method] ,而不能直接在字符串上调用字符方法。

添加(add data)

add a row

ref: Merge, join, and concatenate
In short, there are several methods to add data to a Dataframe: Merge, join, concatenate and append.
But only loc can add data inplace!!! All other methods will return a new dataframe and without ‘inplace’ arg

  • append

    • Dataframe: add a row
    1
    2
    3
    4
    5
    6
    7
    # method 1
    df = DataFrame(columns=('lib', 'qty1', 'qty2'))#生成空的pandas表
    for i in range(5):#插入一行
    df.loc[i] = [randint(-1,1) for n in range(3)]
    # method 2 append from a python dict
    # ref : [pandas.DataFrame.append](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html)
    df.append(your_dict, ignore_index=True)
    • python pandas 怎样高效地添加一行数据?
      简而言之:

      Pandas/numpy structures are fundamentally not suited for efficiently growing. They work best when they are created with a fixed size and stay that way.

    • add a column
      df[1]=np.arange(1,11,2) # add a column of which name is 1

add dict to DataFrame

  • add one-D dict to one line of a DataFrame
    1
    df.loc[len(df.index)] = mydict

删除/修改

  • 引用以及修改

    • Dataframe
      df[1] # columns=1的列
      df[1:2] # 第二行
      df.iloc[-8:, 1:] # 取倒数八行、不要第一列
      df.iat[1,1] #选择已知标量(单个元素)
    • Series: 修改
      s3 = pd.Series( [3,5,7], index = list('ABC'),name='s3' ), 那么修改方法为:
      s3['B'] = 6或者s3[1] = 6
      即:通过label和index都可以修改
  • drop

    • Dataframe: pandas的DataFrame对象如何清空?
      df.drop(df.index,inplace=True)
    • Series: drop以及重新添加
      s3.drop('A')
      s3.index.append(pd.Index(['A']) )

statistic/arithmetic operation

  • add/sub/mul/div/floordiv/pow/mod/

    • fill_value
      Note: It substitute a fill_value for missing data in one of the inputs.
    • div
      div, divide is the same as truediv[4]
      df/a is the same as df.truediv(a), then a/df is the same as df.truediv(a).
  • info, describe # describe: basic statistical information of dataset.

    1
    2
    pd.options.display.float_format = '{:,.3f}'.format # Limit output to 3 decimal places.
    df.describe()
  • sum, mean, std, corr # correlation between two columns(pearson correlation)
    注意std默认是除以N-1而不是N的;

  • min, max, median, idxmin, idxmax # Index of the largest value

写入/读取

  • df.to_csv('D:\\a.csv', sep=',', header=True, index=True)
  • df.to_json('D:\\a.json')
  • df.to_html('D:\\a.html')
  • df.to_sql('tableName', con=dbcon, flavor='mysql')
  • df.to_dict

  • 直接读取html中的表格
    pd.read_html("https://coinmarketcap.com/currencies/bitcoin/")
  • df.from_dict

数据清洗、分析

清洗

  • drop(), dropna(), fillna(), replace(), 重复值处理: duplicated(),unique(),drop_duplictad()
  • 转格式
    pd.to_datetime() : Convert argument to datetime, so you can df.sort_values() by time.

分析

  • groupby
    a6214["Value_IN(ETH)"].groupby(a6214["From"]) # 按照转账人分类,得到的结果为SeriesGroupBy类型

显示

打印dataframe设置

  • 完整打印table
    当一行较长时,完整显示一行:'display.max_rows', None, 'display.max_columns', None
    不自动换行:expand_frame_repr', False
    1
    2
    3
    conn = sqlite3.connect('./order_record.db', detect_types=sqlite3.PARSE_DECLTYPES)
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'expand_frame_repr', False): # more options can be specified also
    print(pd.read_sql_query("SELECT * FROM ALL_ORDER", conn))

显示的精度

pandas.set_option("display.precision", 10)

reference


  1. 十分钟快速入门 Pandas

  2. 10 Minutes to pandas

  3. indexnext |previous |Python For Data Analysis: Pandas入门

  4. difference between dataframe div and dataframe divide in pandas

谢谢~