Python features a dynamic type system and automatic memory management and supports multiple programming paradigms.
module
asyncio
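Note: asynchronous generators and asynchronous comprehensions were not added until Python 3.6.
Four common kinds of functions in Python:
- regular functions
- generator functions
Since Python 3.5 (3.6 for generators), the async keyword turns a regular function or a generator function into an asynchronous function or an asynchronous generator.
- asynchronous functions (coroutines)
- asynchronous generators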
coroutine
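- Calling an asynchronous function directly does not return its result; it returns a coroutine object:

    print(async_function())
    # <coroutine object async_function at 0x102ff67d8>

- A coroutine has to be driven by something else. One way is to call the send method of the coroutine object to send it a value:

    print(async_function().send(None))

  Unfortunately, this call raises an exception:

    StopIteration: 1

  When a generator or coroutine returns normally it raises StopIteration, and the original return value is stored in the value attribute of that exception. Catching the exception gives the coroutine's real return value:

    def run(coroutine):
        try:
            coroutine.send(None)
        except StopIteration as e:
            return e.value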
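The pieces above can be put together into one runnable sketch. The body of `async_function` is an assumption (the notes never show it); it simply returns 1 so that the `StopIteration: 1` message matches. The `run` helper only works for coroutines that finish without ever suspending.

```python
async def async_function():
    # Hypothetical stand-in for the async_function used in the notes.
    return 1


def run(coroutine):
    """Drive a coroutine that never suspends and return its result."""
    try:
        coroutine.send(None)
    except StopIteration as e:
        # The coroutine's return value travels in the exception.
        return e.value


result = run(async_function())
print(result)
```

A coroutine that awaits something which really suspends would need a loop around `send`, which is essentially what an event loop provides.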
await
await: suspends the execution of the coroutine until the awaitable it takes completes and returns the result.
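The await syntax may only appear inside a function declared with async; anywhere else it is a SyntaxError.
How it works:
The object after await must be an Awaitable, or implement the corresponding protocol.
The source of the Awaitable abstract class shows that any class implementing the __await__ method produces instances that are Awaitable.
The Coroutine class inherits from Awaitable and also implements the send, throw, and close methods, so awaiting the coroutine object returned by calling an asynchronous function is legal.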
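To illustrate the `__await__` protocol, here is a minimal sketch of a custom awaitable. The class name `Ready` and its behaviour are hypothetical; it delegates to `asyncio.sleep(0)` so that it yields control to the event loop once before producing its value.

```python
import asyncio
from collections.abc import Awaitable


class Ready:
    """Hypothetical awaitable: yields to the loop once, then returns a value."""

    def __init__(self, value):
        self.value = value

    def __await__(self):
        # __await__ must return an iterator; here we delegate to the
        # iterator of an existing awaitable and then return our value.
        yield from asyncio.sleep(0).__await__()
        return self.value


async def main():
    return await Ready(42)


answer = asyncio.run(main())
is_awaitable = isinstance(Ready(0), Awaitable)
print(answer, is_awaitable)
```

`Ready` does not inherit from `Awaitable`; implementing `__await__` is enough for the `isinstance` check to succeed.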
For more details see ref1.
event
- run_until_complete

    loop = asyncio.get_event_loop()
    ...
    loop.run_until_complete()  # it's blocking

- Event loop is closed
  A Jupyter project is a single process, and a process has only one event loop by default. For example:

    >>> import asyncio
    >>> asyncio.get_event_loop().close()
    >>> asyncio.get_event_loop().is_closed()
    True
    >>> asyncio.get_event_loop().run_until_complete(asyncio.sleep(1))
    .....
    RuntimeError: Event loop is closed

The event loop machinery in asyncio is complicated. To quote the well-known article "I don't understand Python's Asyncio" by the author of Flask:
On the surface it looks like each thread has one event loop but that’s not really how it works.
- if you are the main thread an event loop is created when you call asyncio.get_event_loop()
- if you are any other thread, a runtime error is raised from asyncio.get_event_loop()
- You can at any point asyncio.set_event_loop() to bind an event loop with the current thread. Such an event loop can be created with the asyncio.new_event_loop() function.
- Event loops can be used without being bound to the current thread.
- asyncio.get_event_loop() returns the thread bound event loop, it does not return the currently running event loop.
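See the original article for a more detailed discussion; in short, there are many pitfalls here.
To avoid the problem in the earlier example, the article "在一段Python程序中使用多次事件循环" (using the event loop several times in one Python program) suggests the following:
Create a new event loop with asyncio.new_event_loop and install it as the global event loop with asyncio.set_event_loop. The asynchronous event loop can then be run many times. It is best, though, to save the default loop returned by asyncio.get_event_loop and restore it once the event loop has finished.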
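A minimal sketch of that approach follows. The helper name `run_in_fresh_loop` and the coroutine `answer` are hypothetical. Unlike the advice above, the sketch unbinds the loop with `set_event_loop(None)` instead of saving and restoring the default one, because the behaviour of `asyncio.get_event_loop()` without a running loop differs between Python versions.

```python
import asyncio


async def answer():
    await asyncio.sleep(0)
    return 42


def run_in_fresh_loop(coro):
    """Run coro on a brand-new event loop, then close that loop."""
    loop = asyncio.new_event_loop()
    try:
        asyncio.set_event_loop(loop)
        return loop.run_until_complete(coro)
    finally:
        # Unbind and close, so a closed loop is never left installed.
        asyncio.set_event_loop(None)
        loop.close()


first = run_in_fresh_loop(answer())
second = run_in_fresh_loop(answer())  # works again: each call gets its own loop
print(first, second)
```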
gather
ref: Getting values from functions that run as asyncio tasks
In addition, run_until_complete also returns the function's result.
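A small sketch of both points: `asyncio.gather` collects the return values of several coroutines, and `run_until_complete` hands that result back to the caller. The coroutine `square` is a hypothetical example.

```python
import asyncio


async def square(x):
    await asyncio.sleep(0)
    return x * x


async def main():
    # gather returns the results in the order the awaitables were passed in.
    return await asyncio.gather(square(1), square(2), square(3))


loop = asyncio.new_event_loop()
try:
    squares = loop.run_until_complete(main())  # the return value of main()
finally:
    loop.close()

print(squares)  # [1, 4, 9]
```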
Tips
- asyncio.run() was added in Python 3.7.
- asyncio.sleep(1)
  The sleep() of asynchronous I/O is itself a coroutine. Do not use time.sleep() in asynchronous I/O code: time.sleep() blocks the whole thread.
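The difference can be seen in a short sketch (the `worker` coroutine and its names are hypothetical). Because `asyncio.sleep` suspends only the coroutine that awaits it, the worker with the shorter delay finishes first even though it was started second; with `time.sleep` the first worker would block the thread until it was done.

```python
import asyncio


async def worker(name, delay, finished):
    await asyncio.sleep(delay)  # suspends only this coroutine
    finished.append(name)


async def main():
    finished = []
    await asyncio.gather(
        worker("slow", 0.2, finished),
        worker("fast", 0.01, finished),
    )
    return finished


order = asyncio.run(main())
print(order)
```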
RuntimeError: This event loop is already running
problem
This problem happens because: "The kernel itself runs on an event loop, and as of Tornado 5.0, it's using the asyncio event loop. So the asyncio event loop is always running in the kernel."
ref: Can't invoke asyncio event_loop after tornado 5.0 update
Solution:
pip3 install tornado==4.5.3
and restart notebook
aiohttp
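ref: 异步爬虫: async/await 与 aiohttp的使用,以及例子 (asynchronous crawlers: using async/await with aiohttp, with examples)
- get

    async with aiohttp.get('https://github.com') as r:
        await r.text()

- timeout

    with aiohttp.Timeout(0.001):
        async with aiohttp.get('https://github.com') as r:
            await r.text()

- session
  A session supports many operations, such as post, get, put, and head.

Note: aiohttp.get and aiohttp.Timeout belong to early aiohttp releases; current releases issue requests through a ClientSession.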
atexit
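ref: 深入理解python中的atexit模块 (a deep dive into Python's atexit module). A very good introduction; worth consulting if you need more detail.
Introduction to the atexit module
The atexit module defines a register function that registers an exit handler with the Python interpreter. The handler runs automatically when the interpreter terminates normally and is typically used for resource cleanup. atexit runs the handlers in the reverse of their registration order: if A, B, and C are registered, they run in the order C, B, A when the interpreter terminates.
Note: if the program crashes abnormally or exits through os._exit(), the registered exit handlers are not called.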
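The reverse-order behaviour can be checked with a small sketch. It starts a child interpreter (via `sys.executable`) so that the handlers really run at interpreter exit; the child script and the handler labels A, B, C are illustrative assumptions.

```python
import subprocess
import sys

# Script for the child interpreter: register three handlers, then exit normally.
CHILD = """
import atexit
for name in ("A", "B", "C"):
    atexit.register(print, name)
"""

result = subprocess.run(
    [sys.executable, "-c", CHILD],
    capture_output=True,
    text=True,
    check=True,
)
exit_order = result.stdout.split()
print(exit_order)  # ['C', 'B', 'A']
```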
heapq (Heap queue algorithm)
- nlargest/nsmallest

    import heapq
    nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
    heapq.nlargest(3, nums)  # [42, 37, 23]
    # or more complicated:
    portfolio = [
        {'name': 'IBM', 'shares': 100, 'price': 91.1},
        {'name': 'AAPL', 'shares': 50, 'price': 543.22},
        {'name': 'FB', 'shares': 200, 'price': 21.09},
        {'name': 'HPQ', 'shares': 35, 'price': 31.75},
        {'name': 'YHOO', 'shares': 45, 'price': 16.35},
        {'name': 'ACME', 'shares': 75, 'price': 115.65}
    ]
    cheap = heapq.nsmallest(3, portfolio, key=lambda s: s['price'])
re module
compile
[Python Regex Flags](http://xahlee.info/python/python_regex_flags.html)
- compile(r'[0-9]+')
  Usage:

    import re
    pattern = re.compile(r'[0-9]+')
    # match() only matches at the start of the string; use search() to find the first number anywhere
    first_number = pattern.match(your_str).group(0)

- another example

    import re
    mys = 'abe(ac)ad)'
    p1 = re.compile(r'[(](.*?)[)]', re.S)
    match_list = re.findall(p1, mys)  # findall returns a list of the matched strings
other func
- replace(str1, str2)
- split()
  The split() method of string objects only suits very simple splitting. It does not allow multiple delimiters, nor an uncertain amount of whitespace around the delimiters.

    >>> line = 'asdf fjdk; afed, fjek,asdf, foo'
    >>> re.split(r'[;,\s]\s*', line)
    ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

  When using re.split(), pay particular attention to whether the regular expression contains a parenthesized capture group. If it does, the matched text also appears in the result list.

    >>> fields = re.split(r'(;|,|\s)\s*', line)
    >>> fields
    ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
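If parentheses are needed for grouping but the delimiters should not show up in the result, a non-capturing group `(?:...)` is one way to do it. A minimal sketch using the same sample line:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# (?:...) groups the alternatives without capturing them,
# so the delimiters are dropped from the result.
fields = re.split(r'(?:,|;|\s)\s*', line)
print(fields)
```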
sys module
- sys.version / sys.version_info (sys.version is a string; sys.version_info is a named-tuple object)
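Because `sys.version_info` behaves like a tuple, it can be compared directly against a version tuple. A small sketch; the variable names are illustrative:

```python
import sys

info = sys.version_info

# Tuple comparison: True on Python 3.7 and later.
has_asyncio_run = info >= (3, 7)

# Named fields are available as attributes.
label = f"{info.major}.{info.minor}.{info.micro}"
print(label, has_asyncio_run)
```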
os module
1. about path
- about path

    os.getcwd()
    os.chdir('../')
    os.listdir()
    os.chdir('../')
    os.path.exists(path)  # determine whether a file or dir exists

- os.remove
  Removes a file; if the file does not exist, an error (FileNotFoundError) is raised.
- os.rmdir
  Removes an empty directory.
- os.path.splitext(path)
  Splits the pathname path into a pair (root, ext).
- os.path.basename(your_path)
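The path helpers above can be tried safely inside a temporary directory. This is a sketch; the file name `report.txt` is an arbitrary assumption.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "report.txt")
    with open(path, "w") as f:
        f.write("hello")

    name = os.path.basename(path)       # final path component
    root, ext = os.path.splitext(name)  # split off the extension
    listing = os.listdir(workdir)
    existed = os.path.exists(path)

    os.remove(path)
    try:
        os.remove(path)  # the file is already gone
    except FileNotFoundError:
        removed_twice = False
    else:
        removed_twice = True

print(name, root, ext, listing, existed, removed_twice)
```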
2. os.popen
- The shell command is executed in the background, and you can't change that.
- It is implemented with subprocess.Popen, so why not use subprocess directly?
time & calendar
- 'datetime' object & 'timedelta' object

    >>> datetime.strptime("2018-11-28T19:57:01.522Z", FMT)
    datetime.datetime(2018, 11, 28, 19, 57, 1, 522000)
    >>> datetime.strptime("2018-11-28T19:56:55.557Z", FMT) - datetime.strptime("2018-11-28T19:57:01.522Z", FMT)
    datetime.timedelta(-1, 86394, 35000)

- time.time()
  Return the current time in seconds since the Epoch.
- mktime(tupletime)
- localtime()
  Convert seconds since the Epoch to a time tuple expressing local time.
  When 'seconds' is not passed in, convert the current time instead.
- strftime()
  strftime(format[, tuple])  # -> string
- strptime()
  strptime(string, format)  # -> struct_time
  Parse a string to a time tuple according to a format specification.
- time.sleep(secs)
- calendar.isleap(year)
- calendar.weekday(year, month, day)
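The datetime session above leaves `FMT` undefined. The sketch below assumes `FMT = "%Y-%m-%dT%H:%M:%S.%fZ"`, which matches the sample timestamps, and also exercises the two calendar helpers.

```python
import calendar
from datetime import datetime

# Assumed format string; the notes do not show the original FMT.
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

start = datetime.strptime("2018-11-28T19:57:01.522Z", FMT)
end = datetime.strptime("2018-11-28T19:56:55.557Z", FMT)

delta = end - start              # a negative timedelta, normalized to days=-1
seconds = delta.total_seconds()  # easier to read than the normalized fields

leap = calendar.isleap(2018)
weekday = calendar.weekday(2018, 11, 28)  # Monday is 0, Sunday is 6
print(delta, seconds, leap, weekday)
```

`total_seconds()` is usually the most convenient way to read a negative `timedelta`, since its normalized form (`-1 day, 86394 seconds, ...`) is hard to interpret at a glance.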
subprocess module
1. For cmd that needn’t stdout
subprocess.run("cp standard_py/*py .", shell=True, check=True)
- shell=True
  You can pass a single command string instead of a sequence of args.
- check=True
  Raises CalledProcessError if the shell command exits with a non-zero status.
2. For cmd needing stdout
An example:
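A minimal sketch of capturing stdout, assuming a child Python interpreter (`sys.executable`) as a portable stand-in for a real shell command:

```python
import subprocess
import sys

completed = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    stdout=subprocess.PIPE,   # capture stdout instead of inheriting it
    universal_newlines=True,  # decode the captured bytes into str
    check=True,               # raise CalledProcessError on a non-zero exit
)
output = completed.stdout
print(repr(output))
```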
- universal_newlines=True
  From the Python docs, on stdout: "Captured stdout from the child process. A bytes sequence, or a string if run() was called with universal_newlines=True. None if stdout was not captured."
- stdout=PIPE
  Without this argument, the child's stdout is printed as the stdout of the Python script instead of being captured. From the Python docs: "This [i.e. run()] does not capture stdout or stderr by default. To do so, pass PIPE for the stdout and/or stderr arguments."
glob module
The glob module finds all the pathnames matching a specified pattern.
- glob.glob(pathname, *, recursive=False)
- glob.iglob(pathname, *, recursive=False)
  Return an iterator which yields the same values as glob().
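A short sketch of both functions on a throwaway directory tree; the file names are arbitrary assumptions. With `recursive=True`, the `**` pattern matches zero or more directory levels.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "pkg"))
    for rel in ("a.py", "b.txt", os.path.join("pkg", "c.py")):
        open(os.path.join(root, rel), "w").close()

    # Non-recursive: only the top-level .py file matches.
    top_level = sorted(glob.glob(os.path.join(root, "*.py")))

    # Recursive: iglob yields matches lazily.
    recursive = sorted(glob.iglob(os.path.join(root, "**", "*.py"), recursive=True))

    top_names = [os.path.relpath(p, root) for p in top_level]
    all_names = [os.path.relpath(p, root) for p in recursive]

print(top_names, all_names)
```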
collections module
- Counter

    >>> Counter([1,2,2,2,2,3,3,3,4,4,4,4])
    Counter({2: 4, 4: 4, 3: 3, 1: 1})

- deque (double-ended queue, pronounced 'deck')
  doc: deque([iterable[, maxlen]])
  methods: pop (popleft) / append / extend; clear / copy (a shallow copy) / insert / remove; count / index; reverse / rotate
- python doc: OrderedDict
  This dict records the insertion order. Since Python 3.7 the built-in dict preserves insertion order as well; OrderedDict still differs in having order-sensitive equality and methods such as move_to_end.
  NOTE: an OrderedDict may look like a list of tuples, but it definitely is not!

    import collections
    dic = collections.OrderedDict()
    dic['k1'] = 'v1'
Standard Usage:
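A minimal sketch of typical OrderedDict usage, continuing the `k1`/`v1` naming from the note above. The extra keys are illustrative, not from the original notes.

```python
from collections import OrderedDict

dic = OrderedDict()
dic['k1'] = 'v1'
dic['k2'] = 'v2'
dic['k3'] = 'v3'

dic.move_to_end('k1')             # k1 becomes the most recent entry
keys_after_move = list(dic)
oldest = dic.popitem(last=False)  # pop from the front (FIFO)

# Equality between two OrderedDicts is order-sensitive;
# comparison with a plain dict is not.
same_as_dict = OrderedDict(a=1, b=2) == dict(b=2, a=1)
same_as_reordered = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)
print(keys_after_move, oldest, same_as_dict, same_as_reordered)
```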
PIL
    img = Image.open('origin.png')  # supports many formats

Note: like .htm and .html, .jpg and .jpeg are the same format; they are just two spellings.
- font

    font = ImageFont.truetype("FreeMono.ttf", 28, encoding="unic")

    img2.save('./test_image_data/cat_001_blur.jpeg', 'jpeg')
    # save('path', 'format')

- resize(), rotate(), convert(mode='your_mode')
- Coordinates: (0, 0) is the upper left corner.

    draw = ImageDraw.Draw(img)
    # img is from:
    # img = Image.open('./test_image_data/cat_001.jpg')
    draw.text((width - add_width, 0), number, font=font, fill=fillcolor)  # the first parameter is the start point of the text
random
- random.seed(a=None, version=2)
functions for integers
- random.randrange(start, stop[, step])
- random.randint(a, b)
return a random integer N such that a <= N <= b. Alias for randrange(a, b+1).
real-valued distributions
- random.random()
Return the next random floating point number in the range [0.0, 1.0).
- random.uniform(a, b)
- random.gauss(mu, sigma)
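A short sketch of the functions listed above. Seeding with the same value (42 here, an arbitrary choice) makes the sequence reproducible, which is the main reason to call `random.seed` explicitly.

```python
import random

random.seed(42)
first = [random.randint(1, 6) for _ in range(5)]

random.seed(42)  # same seed, same sequence
second = [random.randint(1, 6) for _ in range(5)]

even = random.randrange(0, 10, 2)   # one of 0, 2, 4, 6, 8
unit = random.random()              # float in [0.0, 1.0)
between = random.uniform(2.5, 7.5)  # float between the two bounds
noise = random.gauss(0, 1)          # mean 0, standard deviation 1
print(first, second, even, unit, between, noise)
```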
shutil
The shutil module offers a number of high-level operations on files and collections of files. In particular, functions are provided which support file copying and removal.
shutil.copyfile(src, dst)
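A minimal sketch of `shutil.copyfile`, using a temporary directory so nothing is left behind; the file names are arbitrary assumptions. `copyfile` copies the file contents only (no metadata) and returns the destination path.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "src.txt")
    dst = os.path.join(workdir, "dst.txt")

    with open(src, "w") as f:
        f.write("payload")

    returned = shutil.copyfile(src, dst)  # dst must be a full file name, not a directory

    with open(dst) as f:
        copied = f.read()

print(copied)
```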
logging[1] [2]
- A simple example:[2]

    import logging
    logging.warning('Watch out!')  # will print a message to the console
    logging.info('I told you so')  # will not print anything

  The result is: WARNING:root:Watch out!
- Used in program:
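- warnings.warn() vs. logging.warning()
  logging.warning() just logs something at the WARNING level, while warnings.warn() emits a Warning, which may be printed, ignored, or raised as an exception depending on the active warnings filter. (logging.warn() is only a deprecated alias of logging.warning().)
  ref: Python warnings.warn() vs. logging.warning()
- filter
  The usual approach is to subclass logging.Filter and implement its filter method.
  The official docs say that since version 3.2 a plain function can serve as a filter, but they give no example program.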
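As a sketch of the function-as-filter form, the snippet below sets up a named logger the way a program typically would and attaches a plain function to the handler. The logger name `demo`, the filter `drop_secrets`, and the messages are all hypothetical.

```python
import io
import logging


def drop_secrets(record):
    """Hypothetical filter: reject records whose message mentions 'password'."""
    return "password" not in record.getMessage()


stream = io.StringIO()  # stands in for a file or the console
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
handler.addFilter(drop_secrets)  # a plain callable, no logging.Filter subclass

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the records away from the root logger

logger.info("user logged in")
logger.warning("password=hunter2")  # dropped by the filter

captured = stream.getvalue()
print(captured)
```

A falsy return value from the filter drops the record; a truthy one lets it through.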
functools
wraps
ref: What does functools.wraps do?
A plain decorator changes the attributes of the function it wraps (for example, once func_1 is decorated, its name is replaced).
In short: every decorator should use functools.wraps.
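The effect is easy to see side by side. In this sketch the decorator names `plain` and `preserving` and the function `func_2` are hypothetical; `func_1` follows the name used above.

```python
import functools


def plain(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def preserving(func):
    @functools.wraps(func)  # copies __name__, __doc__, etc. onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain
def func_1():
    """Docstring of func_1."""


@preserving
def func_2():
    """Docstring of func_2."""


names = (func_1.__name__, func_2.__name__)
docs = (func_1.__doc__, func_2.__doc__)
print(names, docs)
```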
pickle
pickle.load reads data from a file-like object; correspondingly, pickle.dump also writes to a file-like object.[3]
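A minimal round trip, using `io.BytesIO` as the file-like object so that no file is created on disk. The sample data is arbitrary; a file opened with `open(path, "wb")` or `open(path, "rb")` works the same way.

```python
import io
import pickle

data = {"name": "IBM", "shares": 100, "prices": [91.1, 92.3]}

buffer = io.BytesIO()      # any binary file-like object will do
pickle.dump(data, buffer)  # write the pickled bytes

buffer.seek(0)             # rewind before reading back
restored = pickle.load(buffer)
print(restored)
```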
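- TIPS
  An object that holds bs4.element.Tag instances cannot necessarily be pickled.
  Perhaps this is because pickle has to traverse the data and bs4.element.Tag objects may contain cycles? Or perhaps my data simply contained too many bs4.element.Tag objects?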
BeautifulSoup4
- beautifulsoup4
- Documentation
- find_all(text="your_text") does not work well!
- The 'html5lib' engine is slower but more robust than 'lxml'!
- soup.prettify() and <tag>.prettify()
  Format HTML.
The four kinds of objects[4]
- Tag: simply put, each of the tags in the HTML.
  How to extract hyperlinks?
  A hyperlink is not stored as visible text. To get the link string, read the href attribute:

    if isinstance(i, bs4.element.Tag) and i.has_attr('href'):
        con_str.append(i['href'])

- NavigableString: print(type(soup.p.string)), that's it.
- BeautifulSoup: the entire content of a document. Most of the time it can be treated as a special Tag.
- Comment: a comment.
Selenium
- initialize

    chromePath = r'/usr/local/bin/chromedriver'
    wd = webdriver.Chrome(executable_path=chromePath)

- usage

    wd.find_element_by_id('login_pwd')
    wd.find_element_by_class_name('radiocheck').click()
    wd.find_element_by_xpath('//*[@id="sendpck"]/img')

  NOTE:
  1. You can search for an element and then click() it.
  2. You can get the xpath from the page's source view.
  3. These find_element_by_* calls are the Selenium 3 API; Selenium 4 uses find_element(By.ID, ...) and its siblings instead.
- login website
  Please refer to the login function in download_voice.ipynb
TIPS
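- Network drop
  Sometimes, after the network goes down, selenium can no longer reconnect, even though visiting the site manually in a browser works.
  In that case, manually refreshing the page that selenium is trying to visit may solve the problem.
  My guess is that selenium keeps the failed state of the previous visit, so it still cannot work properly once the network is back.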