Python I/O notes

Files fall into two main kinds: text files and binary files (for example images and video).

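In Python 2, a file that is not ASCII-encoded has to be opened in binary mode and then decoded by hand (Python 3's open() takes an encoding argument instead). A Python 2 session: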

>>> f = open('/Users/michael/gbk.txt', 'rb')
>>> u = f.read().decode('gbk')
>>> u
u'\u6d4b\u8bd5'
>>> print u
测试

Alternatively, the standard-library codecs module can do the decoding automatically:


import codecs
with codecs.open('/Users/michael/gbk.txt', 'r', 'gbk') as f:
    f.read() # u'\u6d4b\u8bd5'
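
In Python 3 the built-in open() accepts an encoding argument, so neither manual decoding nor codecs is needed. A minimal sketch, assuming a GBK-encoded file; the temporary file and the name gbk_path are made up for the demo:

```python
import os
import tempfile

# Create a small GBK-encoded file to read back (demo data only).
fd, gbk_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write('测试'.encode('gbk'))

# Python 3: open() decodes while reading, so read() returns str.
with open(gbk_path, 'r', encoding='gbk') as f:
    text = f.read()

print(text)
os.remove(gbk_path)
```

Leaving out encoding makes open() fall back to a platform-dependent default, so naming it explicitly is the safer habit.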

¶1. Reading files

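>>> f.read(size)     # read at most size characters (or bytes)
>>> f.readlines()    # for comparison: a list with one string per line
['1\n', '00:00:00,000 --> 00:00:02,060\n', 'you\n']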

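read()
Reads data and returns it as a str (text mode) or bytes (binary mode).
size is a number and may be positive or negative; when it is omitted or negative, the whole file is read.
At the end of the file it returns an empty string (empty bytes in binary mode).
readline()
Reads a single line from the file.
readlines()
Like read(), it reads everything that is left, but returns a list with the content split into lines at the separator \n.
All three methods keep the newline character in what they return, so it has to be stripped by hand.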

with open('filepath', 'r') as f:
    list1 = f.readlines()
for i in range(len(list1)):
    list1[i] = list1[i].rstrip('\n')
## results of read() and readline() look clean when printed, because print
## renders the newline, but the character is still there and needs the same treatment
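
To make the differences between the three methods concrete, here is a small sketch that runs them one after another on the same file. The temporary file and its contents are made up for the demo:

```python
import os
import tempfile

# Demo file with three lines (contents are made up).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', newline='') as f:
    f.write('ab\ncd\nef\n')

with open(path, 'r') as f:
    first_two = f.read(2)         # at most 2 characters: 'ab'
    rest_of_line = f.readline()   # continues from the current position: '\n'
    remaining = f.readlines()     # the lines that are left, newlines included
    at_eof = f.read()             # nothing left, so an empty string

# str.splitlines() drops the line endings in one step.
with open(path, 'r') as f:
    stripped = f.read().splitlines()

print(first_two, repr(rest_of_line), remaining, repr(at_eof), stripped)
os.remove(path)
```

All three calls share one file position, which is why readline() here returns only the tail of the first line.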

¶1.1 The traditional way to read

A file object created with open() has to be closed promptly once you are done with it.

file = open("infile")
while True:
    line = file.readline()
    if not line:  # an empty string means end of file
        break
    pass  # process the line here
file.close()

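## read line by line
file = open("sample.txt")
longest = max(len(line) for line in file)  # length of the longest line; this consumes the iterator
file.seek(0)  # rewind, otherwise the loop below sees no lines
for line in file:
    pass  ## TODO
file.close()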

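Of these approaches, readlines() is the fastest, at the cost of loading the whole file into memory.
readline() is somewhat faster than the fileinput module.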
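
The explicit close() calls in this section are easy to forget, and they are skipped if an exception is raised first. A with block closes the file automatically. A minimal sketch on a made-up temporary file:

```python
import os
import tempfile

# Demo file (contents are made up).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('short\nthe longest line\nmid\n')

# The with block closes the file even if the body raises.
with open(path) as f:
    lengths = [len(line.rstrip('\n')) for line in f]

print(f.closed)       # True: closed as soon as the block ended
print(max(lengths))   # 16
os.remove(path)
```

Iterating over the file object reads one line at a time, so this also works for files too large to load with readlines().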

¶1.2 Finding the IPs that two files have in common

import bisect

with open('1.txt', 'r') as f1:
    list1 = f1.readlines()
for i in range(0, len(list1)):
    list1[i] = list1[i].strip('\n')

with open('2.txt', 'r') as f2:
    list2 = f2.readlines()
for i in range(0, len(list2)):
    list2[i] = list2[i].strip('\n')
    # list2[i].strip()

list2.sort()
length_2 = len(list2)
same_data = []

for i in list1:
    pos = bisect.bisect_left(list2, i)
    if pos < len(list2) and list2[pos] == i:
        same_data.append(i)
same_data = list(set(same_data))
print(same_data)
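
When the order of the result does not matter, a set intersection gives the same answer as the bisect search with less code. A minimal sketch; the function name common_lines and the sample IPs are made up for illustration:

```python
def common_lines(lines_a, lines_b):
    """Return the sorted entries present in both iterables of lines."""
    set_a = {line.strip() for line in lines_a if line.strip()}
    set_b = {line.strip() for line in lines_b if line.strip()}
    return sorted(set_a & set_b)

# Stand-in data; in practice pass two open file objects instead.
ips_1 = ['10.0.0.1\n', '10.0.0.2\n', '10.0.0.2\n', '192.168.1.1\n']
ips_2 = ['192.168.1.1\n', '10.0.0.2\n', '172.16.0.5\n']
print(common_lines(ips_1, ips_2))   # ['10.0.0.2', '192.168.1.1']
```

The sort here is a plain string sort, so the addresses come out in lexicographic order rather than numeric order.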

¶1.3 Read a file, filter it, and write it out sorted

result = list()
with open('file', 'r') as f:
    for line in f.readlines():
        line = line.strip()
        if not len(line) or line.startswith('#'):
            continue
        result.append(line)
result.sort()
with open('output', 'w') as out:
    out.write('\n'.join(result))

¶2. Writing files

with open('test.txt', 'w') as f1:
    f1.writelines(['1\n','2\n','3\n'])

with open('/Users/macbook/test/txt', 'w') as f:
    f.write('')

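writelines() is the counterpart of readlines(): it takes a list of strings and adds no newlines, so each string must already end with one.
To write a text file in a specific encoding, follow the codecs example above: write unicode strings and let codecs convert them to the target encoding.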
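
In Python 3 the same result can be had from the built-in open() by passing encoding. A minimal sketch, assuming GBK as the target encoding; the temporary file and sample lines are made up:

```python
import os
import tempfile

fd, out_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

lines = ['第一行', '第二行']
# writelines() adds nothing, so append the newline to each item.
with open(out_path, 'w', encoding='gbk', newline='\n') as f:
    f.writelines(line + '\n' for line in lines)

# Read the raw bytes back to confirm what was stored on disk.
with open(out_path, 'rb') as f:
    raw = f.read()

print(raw.decode('gbk'))
os.remove(out_path)
```

newline='\n' keeps the line endings identical on every platform; without it, text mode translates '\n' to the platform's own line ending.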

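¶3. file_obj.seek(offset, whence=0)

offset: how far to move, counted in bytes for binary files.
whence: the reference point; 0 is the start of the file, 1 the current position, 2 the end of the file.
Nonzero offsets with whence 1 or 2 only work on files opened in binary mode; Python 3 text-mode files accept only seek(0, 1) and seek(0, 2) for those values. In append mode ('a'), seeking does not change where writes go: they normally still land at the end of the file.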
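
A short sketch of the three whence values, together with tell() to report the position. It opens the file in binary mode so that relative offsets are allowed; the temporary file and its ten bytes are made up for the demo:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'0123456789')

with open(path, 'rb') as f:
    f.seek(3)                  # whence=0: 3 bytes from the start
    a = f.read(2)              # b'34', position is now 5
    f.seek(2, os.SEEK_CUR)     # whence=1: skip 2 bytes forward, to 7
    b = f.read(1)              # b'7', position is now 8
    f.seek(-3, os.SEEK_END)    # whence=2: 3 bytes before the end, at 7
    c = f.read()               # b'789'
    end = f.tell()             # 10, the file size

print(a, b, c, end)
os.remove(path)
```

os.SEEK_CUR and os.SEEK_END are the named constants for 1 and 2.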

¶4. String encoding


## pass encoding to read a file stored in a given encoding; errors='ignore' skips bytes that cannot be decoded

with open('test.txt', encoding='utf-8', errors='ignore') as f:
    text = f.read()
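
The errors argument takes the same values as bytes.decode(), which makes the handlers easy to compare without a file. A minimal sketch; the sample bytes are made up and deliberately not valid UTF-8:

```python
raw = 'caf\u00e9'.encode('latin-1')   # b'caf\xe9', not valid UTF-8

ignored = raw.decode('utf-8', errors='ignore')    # bad byte dropped
replaced = raw.decode('utf-8', errors='replace')  # bad byte becomes U+FFFD

try:
    raw.decode('utf-8')               # default is errors='strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

'ignore' silently loses data, so 'replace' is often the better choice when you want to see that something went wrong.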

¶5. Swapping the two columns of each line

with open('test.txt', 'r') as input, open('output.txt', 'w') as output:
    line = input.readline()
    while line:
        col1, col2 = line.strip().split(',')
        output.write('{},{}\n'.format(col2, col1))
        line = input.readline()
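
Splitting on ',' breaks as soon as a field contains a quoted comma. A sketch of the same swap using the csv module, which handles quoting; the function name swap_columns and the in-memory sample are made up for illustration:

```python
import csv
import io

def swap_columns(src, dst):
    """Copy two-column CSV rows from src to dst with the columns swapped."""
    writer = csv.writer(dst, lineterminator='\n')
    for col1, col2 in csv.reader(src):
        writer.writerow([col2, col1])

# StringIO objects stand in for the input and output files.
src = io.StringIO('a,1\nb,2\n')
dst = io.StringIO()
swap_columns(src, dst)
print(dst.getvalue())
```

With real files, open both with newline='' as the csv documentation recommends.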

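¶6. Serializing to files

Serialization turns an in-memory object into a form that can be stored or transmitted. Python calls this pickling; other languages call it serialization, marshalling, or flattening.
pickle produces a Python-specific binary format. It is not a JSON tool; JSON files are handled by the json module (see ¶9). Only load pickles from sources you trust, since unpickling can run arbitrary code.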

## Python 2 has pickle and the faster cPickle; Python 3 has only pickle
import pickle
lines = ['I love you', 
    'I still love you',
    'I hate you']
## serialize and save; note that the file must be opened in binary mode
with open('lines.pkl','wb') as f:
    ## dump serializes the object into a file-like object
    pickle.dump(lines, f)
## read the file back and deserialize
with open('lines.pkl', 'rb') as f:
    lines_back = pickle.load(f)

print(lines_back)

dumps serializes a picklable object to bytes
dump serializes an object into a file-like object
load deserializes an object from a file-like object
loads deserializes an object from bytes
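
The in-memory pair dumps/loads works the same way as dump/load, only without a file. A minimal round-trip sketch; the sample record is made up:

```python
import pickle

record = {'name': 'demo', 'scores': [1, 2, 3], 'pair': (4, 5)}

blob = pickle.dumps(record)      # object -> bytes
restored = pickle.loads(blob)    # bytes -> a new, equal object

print(type(blob).__name__)       # bytes
print(restored == record)        # True
print(restored is record)        # False: it is a copy
```

Unlike JSON, pickle preserves Python-specific types such as the tuple in this record.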

¶7. Sorting

Read a file and sort its rows by chosen columns.

import csv

with open('file', newline='') as src:
    data = list(csv.reader(src, delimiter=','))
## sort by the first column, then numerically by the second
sortedlist = sorted(data, key=lambda x: (x[0], int(x[1])))
with open('test.csv', 'w', newline='') as f:
    fileWriter = csv.writer(f, delimiter=',')
    for row in sortedlist:
        fileWriter.writerow(row)
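
csv.reader yields every field as a string, so the sort key decides whether a column is ordered as text or as a number. A small sketch of the difference on made-up in-memory data, using operator.itemgetter for the plain text sort:

```python
import csv
import io
from operator import itemgetter

raw = 'b,10\na,9\na,100\n'
rows = list(csv.reader(io.StringIO(raw)))

# Plain string sort on both columns: '100' sorts before '9'.
by_text = sorted(rows, key=itemgetter(0, 1))

# Convert the second column to int so that it sorts numerically.
by_number = sorted(rows, key=lambda r: (r[0], int(r[1])))

print(by_text)     # [['a', '100'], ['a', '9'], ['b', '10']]
print(by_number)   # [['a', '9'], ['a', '100'], ['b', '10']]
```

Returning a tuple from the key function is what makes the sort use several columns in order of priority.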

(https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance)

¶8. StringIO and BytesIO

Both read and write in memory instead of on disk.

from io import StringIO
from io import BytesIO

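f = StringIO()
f.write('hello')
f.write('world!')
print(f.getvalue())
## read: rewind first, because the position sits at the end after writing
f.seek(0)
while True:
    s = f.readline()
    if s == '':
        break
    print(s.strip())

## BytesIO works the same way, but with bytes
f = BytesIO()
f.write('中'.encode('utf-8'))
print(f.getvalue())

## read
f.seek(0)
f.read()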

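When writing to such an object, keep an eye on the current position: reading with f.readline() until it returns an empty string leaves the position at the end, or you can set it directly with f.seek(n).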
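
The position matters because a write starts wherever the position currently is, which may be in the middle of existing data. A minimal sketch with a StringIO that starts with made-up contents:

```python
from io import StringIO

f = StringIO('first\n')   # initial contents; the position starts at 0
start = f.tell()          # 0

f.write('FIRST')          # overwrites the 5 characters from position 0
f.seek(0, 2)              # jump to the end before appending
f.write('second\n')

f.seek(0)                 # rewind to read everything
content = f.read()
print(repr(content))      # 'FIRST\nsecond\n'
```

getvalue() returns the whole buffer regardless of the position, so it is handy for checking the result without rewinding.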

¶9. Reading and writing JSON files in Python

# pretty-print a file from the shell
cat data.json | python -m json.tool

import json
# read
with open('./data.json') as fp:
    data = json.load(fp)

# write
with open('./data.json', 'w') as fp:
    json.dump(data, fp, indent=2)

# dump as a string
print(json.dumps(data, indent=4))

load reads JSON text from a file-like object and deserializes it
loads deserializes a JSON string
dump serializes an object as JSON into a file-like object
dumps returns the JSON as a str
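
A round trip through dumps and loads shows the string form directly. A minimal sketch on a made-up dict; ensure_ascii=False and sort_keys=True are optional settings chosen here to keep non-ASCII text readable and the key order stable:

```python
import json

data = {'name': '测试', 'tags': ['a', 'b'], 'count': 2}

# ensure_ascii=False keeps the characters as they are instead of \uXXXX escapes.
text = json.dumps(data, ensure_ascii=False, sort_keys=True)
escaped = json.dumps('测试')   # default: non-ASCII is escaped
back = json.loads(text)

print(text)      # {"count": 2, "name": "测试", "tags": ["a", "b"]}
print(escaped)   # "\u6d4b\u8bd5"
print(back == data)
```

When writing such text to a file with ensure_ascii=False, open the file with an explicit encoding such as 'utf-8'.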

¶10. Reading and writing CSV files in Python

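import csv
with open('input.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    # read the header row (to just skip it instead: next(reader, None))
    header = next(reader)
    # the three loops below are alternatives: a reader can only be iterated once
    for row in reader:
        print(row)
    # print each row as a list of (column, value) pairs
    for row in reader:
        print(list(zip(header, row)))
    # print each row as a dict
    for row in reader:
        print(dict(zip(header, row)))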


# write
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(['a', 'b', 'c'])
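
The dict(zip(header, row)) pattern above is what csv.DictReader does for you, and csv.DictWriter is its counterpart for writing. A minimal sketch on made-up in-memory data, with StringIO objects standing in for the files:

```python
import csv
import io

src = io.StringIO('name,age\nann,30\nbob,25\n')
rows = list(csv.DictReader(src))   # the header row becomes the dict keys

out = io.StringIO()
# fieldnames sets the column order of the output.
writer = csv.DictWriter(out, fieldnames=['age', 'name'], lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

print(rows[0]['name'])
print(out.getvalue())
```

Because rows are addressed by column name, the output order can differ from the input order without any index juggling.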

¶10.1 Reading CSV from HDFS

#read csv from HDFS
import csv
import subprocess

result = subprocess.run(['hadoop','fs','-text','/path/to/data/part*'], stdout = subprocess.PIPE)
lines = result.stdout.decode().strip().split('\n')
reader = csv.reader(lines)
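
The parsing half of this snippet can be tried without a Hadoop installation. In the sketch below a hard-coded byte string stands in for result.stdout; the data and the assumption that the output is UTF-8 are made up for illustration:

```python
import csv

# Stand-in for result.stdout; real data would come from the hadoop command.
stdout_bytes = b'id,name\n1,"Smith, J"\n2,Lee\n'

# csv.reader accepts any iterable of strings, not only file objects.
lines = stdout_bytes.decode('utf-8').splitlines()
rows = list(csv.reader(lines))

print(rows)   # [['id', 'name'], ['1', 'Smith, J'], ['2', 'Lee']]
```

If fields may contain embedded newlines, wrap the decoded text in io.StringIO(text, newline='') and pass that to csv.reader instead of splitting into lines first.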

References: