数据分析神器Pandas。

教程参考^[1] ^[2] ^[3].

基本

Pandas 基于两种数据类型，series 和 dataframe。series 是一种一维的数据类型，其中的每个元素都有各自的标签。dataframe 可以看作是一个 series 的字典。

Series有一个与数组关联的数据标签，被叫做索引。最简单的Series是由一个数组的数据构成：

In [4]: obj = Series([4, 7, -5, 3])    
In [5]: obj   
Out[5]:   
0 4   
1 7   
2 -5   
3 3

Dataframe
axis=0表示行，axis=1表示列
行、列的index，分别为index, columns，例如：
df.columns
loc是根据标签索引；iloc是index索引:
df.iloc(1) # 第二行
df.iloc(:,1) # 第二列
df.loc([0], ['size') # 第一行、「size」列

df.head/tail
df is the abbre. of dataframe.
Show first 5 line：df.head(5)

Changing column labels so you can getting a column by label .

1
2
3

df.columns = ['water_year','rain_octsep', 'outflow_octsep',
              'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']
df.rain_octsep # 相当于df['rain_octsep']

len(df)

filter

# Using a series of booleans to filter
df[df.rain_octsep < 1000]
# Filtering by multiple conditionals
df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)] 
# 这里不能用 and 关键字，因为会引发操作顺序的问题。必须用 & 和圆括号。
# Filtering by string methods
df[df.water_year.str.startswith('199')]
# 注意，你必须用 .str.[string method] ，而不能直接在字符串上调用字符方法。

添加(add data)

add a row

ref: Merge, join, and concatenate
In short, there are several methods to add data to a Dataframe: Merge, join, concatenate and append.
But only loc can add data inplace!!! All other methods will return a new dataframe and without ‘inplace’ arg

append

Dataframe: add a row

# method 1
df = DataFrame(columns=('lib', 'qty1', 'qty2'))#生成空的pandas表  
for i in range(5):#插入一行
    df.loc[i] = [randint(-1,1) for n in range(3)]  
# method 2 append from a python dict
# ref : [pandas.DataFrame.append](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html)
df.append(your_dict, ignore_index=True)

python pandas 怎样高效地添加一行数据？
简而言之：

Pandas/numpy structures are fundamentally not suited for efficiently growing. They work best when they are created with a fixed size and stay that way.
add a column
df[1]=np.arange(1,11,2) # add a column of which name is 1

add dict to DataFrame

add one-D dict to one line of a DataFrame
1
df.loc[len(df.index)] = mydict

删除/修改

引用以及修改
- Dataframe
  df[1] # columns=1的列
  df[1:2] # 第二行
  df.iloc[-8:, 1:] # 取倒数八行、不要第一列
  df.iat[1,1] #选择已知标量（单个元素）
- Series: 修改
  s3 = pd.Series( [3,5,7], index = list('ABC'),name='s3' ), 那么修改方法为：
  s3['B'] = 6或者s3[1] = 6
  即：通过label和index都可以修改
drop
- Dataframe: pandas的DataFrame对象如何清空？
  df.drop(df.index,inplace=True)
- Series: drop以及重新添加
  s3.drop('A')
  s3.index.append(pd.Index(['A']) )

statistic/arithmetic operation

add/sub/mul/div/floordiv/pow/mod/
- fill_value
  Note: It substitute a fill_value for missing data in one of the inputs.
- div
  div, divide is the same as truediv^[4]
  df/a is the same as df.truediv(a), then a/df is the same as df.truediv(a).

info, describe # describe: basic statistical information of dataset.

1 2	pd.options.display.float_format = '{:,.3f}'.format # Limit output to 3 decimal places. df.describe()

sum, mean, std, corr # correlation between two columns(pearson correlation)
注意std默认是除以N-1而不是N的；
min, max, median, idxmin, idxmax # Index of the largest value

写入/读取

写

df.to_csv('D:\\a.csv', sep=',', header=True, index=True)
df.to_json('D:\\a.json')
df.to_html('D:\\a.html')
df.to_sql('tableName', con=dbcon, flavor='mysql')
df.to_dict

读

直接读取html中的表格
pd.read_html("https://coinmarketcap.com/currencies/bitcoin/")
df.from_dict

数据清洗、分析

清洗

drop(), dropna(), fillna(), replace(), 重复值处理: duplicated()，unique()，drop_duplictad()
转格式
pd.to_datetime() : Convert argument to datetime, so you can df.sort_values() by time.

分析

groupby
a6214["Value_IN(ETH)"].groupby(a6214["From"]) # 按照转账人分类，得到的结果为SeriesGroupBy类型

显示

打印dataframe设置

完整打印table
当一行较长时，完整显示一行：'display.max_rows', None, 'display.max_columns', None
不自动换行：expand_frame_repr', False

1
2
3

conn = sqlite3.connect('./order_record.db', detect_types=sqlite3.PARSE_DECLTYPES)
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'expand_frame_repr', False):  # more options can be specified also
    print(pd.read_sql_query("SELECT * FROM ALL_ORDER", conn))

显示的精度

pandas.set_option("display.precision", 10)

pandas note

基本