Python Data Processing Tips

 2 years ago
source link: https://allenwind.github.io/blog/10568/

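Quickly building a frequency table, alternating between two iterators, enumerating all subsets, human-readable byte sizes, and sliding windows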

Quickly counting character and word frequencies

import itertools
import collections

# X is an iterable of texts; chaining them yields a single stream of items to count
chars = collections.Counter(itertools.chain(*X))

When the dataset is not large, we usually build the frequency table with code like the following:

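import itertools
import collections

# X = [text1, text2, ..., textn]; two placeholder texts stand in for real data here
X = ["hello world", "hello python"]
# chain(*X) flattens the texts: strings yield characters, token lists yield words
words = collections.Counter(itertools.chain(*X))
print(words.most_common(20))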

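When the dataset is very large, the code above becomes far too slow. For large datasets the frequency table can be counted in parallel; one implementation that runs Counter across multiple processes is the count-in-parallel project.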
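The idea of counting in parallel can be sketched as follows. This is a minimal illustration and not the code of the count-in-parallel project: the names `count_chunk`, `chunked`, and `count_in_parallel`, as well as the default `processes` and `chunk_size` values, are assumptions made for this example. Each worker counts one chunk of texts, and the partial Counters are merged in the parent process.

```python
import collections
import itertools
import multiprocessing


def count_chunk(texts):
    """Count the items of one chunk of texts (runs in a worker process)."""
    return collections.Counter(itertools.chain.from_iterable(texts))


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_in_parallel(texts, processes=2, chunk_size=1000):
    """Count every chunk in a worker process, then merge the partial Counters."""
    total = collections.Counter()
    with multiprocessing.Pool(processes) as pool:
        # imap_unordered is fine here because merging Counters is order-independent
        for partial in pool.imap_unordered(count_chunk, chunked(texts, chunk_size)):
            total.update(partial)
    return total
```

On platforms that start workers by spawning a fresh interpreter, call `count_in_parallel(X)` under an `if __name__ == "__main__":` guard. Whether this beats the single-process version depends on how the cost of pickling the chunks compares with the counting itself.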

Alternating between two iterators

During model training, this can be used to emit positive and negative samples alternately.

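import itertools

def i1():
    # even numbers, standing in for one class of samples
    for i in range(0, 20, 2):
        yield i

def i2():
    # odd numbers, standing in for the other class
    for i in range(1, 20, 2):
        yield i

def combine_iters():
    # cycle() repeats each iterator's items endlessly
    ii1 = itertools.cycle(i1())
    ii2 = itertools.cycle(i2())

    while True:
        yield next(ii1)
        yield next(ii2)

iiters = combine_iters()
for i in range(100):
    print(next(iiters))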

Enumerating all subsets

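import numpy as np

def all_subgroup(es):
    # each integer in [0, 2**n) acts as a bitmask selecting one subset
    size = 2**len(es)
    for i in range(size):
        bins = list(np.binary_repr(i))
        idx = (np.array(bins) == "1")
        # binary_repr drops leading zeros, so align the mask with the tail of es
        r = len(es) - len(idx)
        yield es[r:][idx]

es = np.array(["a", "b", "c", "e", "f"])
for e in all_subgroup(es):
    print(e)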
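For comparison, the same subsets can be enumerated without NumPy. The sketch below swaps the bitmask approach for `itertools.combinations`, following the well-known powerset recipe; the function name `all_subsets` is an assumption made for this example.

```python
import itertools


def all_subsets(items):
    """Yield every subset of `items` as a tuple, smallest subsets first."""
    items = list(items)
    for k in range(len(items) + 1):
        yield from itertools.combinations(items, k)


subsets = list(all_subsets(["a", "b", "c"]))
print(len(subsets))  # 8
print(subsets[:4])   # [(), ('a',), ('b',), ('c',)]
```

Unlike the bitmask version, this yields subsets grouped by size and works on any iterable, at the cost of returning tuples instead of arrays.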

Human-readable byte sizes

def humanize_bytes(bytesize, precision=3):
    # (threshold, suffix) pairs, largest unit first
    abbrevs = (
        (1 << 50, 'PB'),
        (1 << 40, 'TB'),
        (1 << 30, 'GB'),
        (1 << 20, 'MB'),
        (1 << 10, 'kB'),
        (1, 'bytes')
    )
    if bytesize == 1:
        return '1 byte'
    # pick the largest unit that does not exceed bytesize
    for factor, suffix in abbrevs:
        if bytesize >= factor:
            break
    return '%.*f %s' % (precision, bytesize / factor, suffix)
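
# Separate from byte formatting: this appears to be a classmethod of a paste/file
# class that is not shown here. It resizes an image and records the new file size.
import os
from PIL import Image
import cropresize2

def rsize(cls, old_paste, weight, height):
    assert old_paste.is_image, TypeError('Unsupported Image Type.')
    rst = cls(old_paste.filename, old_paste.mimetype, 0)
    # keep the source file open only while the image is read and saved
    with open(old_paste.path, 'rb') as f:
        im = Image.open(f)
        img = cropresize2.crop_resize(im, (int(weight), int(height)))
        img.save(rst.path)
    filestat = os.stat(rst.path)
    rst.size = filestat.st_size
    return rst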

Time series and sliding windows

import numpy as np

_row = lambda x: x

def series2X(series, size, func=_row):
    # Convert a time series into sliding windows of length `size`
    X = np.array([series[i:i+size] for i in range(len(series)-size+1)])
    return np.apply_along_axis(func, 1, X)

def series2Xy(series, size, func=_row):
    # Convert a time series into (window, next value) pairs for one-step prediction
    X = np.array([series[:-1][i:i+size] for i in range(len(series)-size)])
    y = np.array(series[size:])
    return np.apply_along_axis(func, 1, X), y
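To make the two output shapes concrete, here is a small sketch on a toy series. It uses `numpy.lib.stride_tricks.sliding_window_view` (available from NumPy 1.20) instead of the list comprehension above, which is a substitution made for this example; the toy series and `size = 3` are likewise assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(6)  # toy series: 0, 1, 2, 3, 4, 5
size = 3

# All windows of length `size`, as a read-only view on the series
X = sliding_window_view(series, size)

# One-step supervised form: drop the last window (it has no following value)
# and use the value right after each remaining window as its label
X_train, y = X[:-1], series[size:]

print(X.tolist())        # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(X_train.tolist())  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(y.tolist())        # [3, 4, 5]
```

Because the result is a view rather than a copy, it avoids materializing every window, but it must be copied before being modified in place.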

Obtaining a random seed

import os
import numpy as np

# np.random.seed only accepts integers in [0, 2**32 - 1], so draw 4 bytes
bs = os.urandom(4)
seed = int.from_bytes(bs, "big")

np.random.seed(seed)
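If more than 32 bits of entropy are wanted, NumPy's newer `Generator` API is an option, since `np.random.default_rng` accepts arbitrarily large non-negative integers as seeds. The following is a minimal sketch; the helper name `make_rng` and the 16-byte default are assumptions made for this example.

```python
import os
import numpy as np


def make_rng(nbytes=16):
    """Return (seed, Generator), seeded from `nbytes` of OS entropy."""
    seed = int.from_bytes(os.urandom(nbytes), "big")
    return seed, np.random.default_rng(seed)


seed, rng = make_rng()
print(seed)           # log the seed so the run can be reproduced later
print(rng.random(3))  # three floats in [0, 1)
```

Passing the logged seed back to `np.random.default_rng` recreates the same stream of random numbers.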

When reposting, please include the address of this article: https://allenwind.github.io/blog/10568
For more articles, see: https://allenwind.github.io/blog/archives/

