数据分析神器Pandas。
基本
Pandas 基于两种数据类型,series 和 dataframe。series 是一种一维的数据类型,其中的每个元素都有各自的标签。dataframe 可以看作是一个 series 的字典。
- Series有一个与数组关联的数据标签,被叫做 索引 。最简单的Series是由一个数组的数据构成:
|
|
- Dataframe
axis=0表示行,axis=1表示列
行、列的index,分别为index, columns,例如:
df.columns
- loc是根据标签索引;iloc是index索引:
df.iloc(1) # 第二行
df.iloc(:,1) # 第二列
df.loc([0], ['size') # 第一行、「size」列
-
df.head/tail
df
is the abbre. of dataframe.
Show first 5 line:df.head(5)
-
Changing column labels so you can getting a column by label .
123df.columns = ['water_year','rain_octsep', 'outflow_octsep','rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']df.rain_octsep # 相当于df['rain_octsep'] -
len(df)
- filter 12345678# Using a series of booleans to filterdf[df.rain_octsep < 1000]# Filtering by multiple conditionalsdf[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)]# 这里不能用 and 关键字,因为会引发操作顺序的问题。必须用 & 和圆括号。# Filtering by string methodsdf[df.water_year.str.startswith('199')]# 注意,你必须用 .str.[string method] ,而不能直接在字符串上调用字符方法。
添加(add data)
add a row
ref: Merge, join, and concatenate
In short, there are several methods to add data to a Dataframe: Merge, join, concatenate and append.
But only loc can add data inplace!!! All other methods will return a new dataframe and without ‘inplace’ arg
-
append
- Dataframe: add a row
1234567# method 1df = DataFrame(columns=('lib', 'qty1', 'qty2'))#生成空的pandas表for i in range(5):#插入一行df.loc[i] = [randint(-1,1) for n in range(3)]# method 2 append from a python dict# ref : [pandas.DataFrame.append](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html)df.append(your_dict, ignore_index=True)-
python pandas 怎样高效地添加一行数据?
简而言之:Pandas/numpy structures are fundamentally not suited for efficiently growing. They work best when they are created with a fixed size and stay that way.
-
add a column
df[1]=np.arange(1,11,2) # add a column of which name is 1
add dict to DataFrame
- add one-D dict to one line of a DataFrame 1df.loc[len(df.index)] = mydict
删除/修改
-
引用以及修改
- Dataframe
df[1] # columns=1的列
df[1:2] # 第二行
df.iloc[-8:, 1:] # 取倒数八行、不要第一列
df.iat[1,1] #选择已知标量(单个元素)
- Series: 修改
s3 = pd.Series( [3,5,7], index = list('ABC'),name='s3' )
, 那么修改方法为:
s3['B'] = 6
或者s3[1] = 6
即:通过label和index都可以修改
- Dataframe
-
drop
- Dataframe: pandas的DataFrame对象如何清空?
df.drop(df.index,inplace=True)
- Series: drop以及重新添加
s3.drop('A')
s3.index.append(pd.Index(['A']) )
- Dataframe: pandas的DataFrame对象如何清空?
statistic/arithmetic operation
-
add/sub/mul/div/floordiv/pow/mod/
fill_value
Note: It substitute afill_value
for missing data in one of the inputs.- div
div, divide is the same as truediv[4]
df/a
is the same asdf.truediv(a)
, thena/df
is the same asdf.truediv(a)
.
-
info, describe # describe: basic statistical information of dataset.
12pd.options.display.float_format = '{:,.3f}'.format # Limit output to 3 decimal places.df.describe() -
sum, mean, std, corr # correlation between two columns(pearson correlation)
注意std默认是除以N-1而不是N的; -
min, max, median, idxmin, idxmax # Index of the largest value
写入/读取
写
df.to_csv('D:\\a.csv', sep=',', header=True, index=True)
df.to_json('D:\\a.json')
df.to_html('D:\\a.html')
df.to_sql('tableName', con=dbcon, flavor='mysql')
df.to_dict
读
- 直接读取html中的表格
pd.read_html("https://coinmarketcap.com/currencies/bitcoin/")
df.from_dict
数据清洗、分析
清洗
drop(), dropna(), fillna(), replace()
, 重复值处理:duplicated(),unique(),drop_duplictad()
- 转格式
pd.to_datetime
() : Convert argument to datetime, so you candf.sort_values()
by time.
分析
- groupby
a6214["Value_IN(ETH)"].groupby(a6214["From"]) # 按照转账人分类,得到的结果为SeriesGroupBy类型
显示
打印dataframe设置
- 完整打印table
当一行较长时,完整显示一行:'display.max_rows', None, 'display.max_columns', None
不自动换行:expand_frame_repr', False
123conn = sqlite3.connect('./order_record.db', detect_types=sqlite3.PARSE_DECLTYPES)with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'expand_frame_repr', False): # more options can be specified alsoprint(pd.read_sql_query("SELECT * FROM ALL_ORDER", conn))
显示的精度
pandas.set_option("display.precision", 10)