Python File Reading and Writing

Files mainly come in two kinds: text files and binary files (e.g. images, videos).

To read a file stored in a non-ASCII encoding, one approach is to open it in binary mode and decode the bytes afterwards (the interactive session below uses Python 2 syntax):

>>> f = open('/Users/michael/gbk.txt', 'rb')
>>> u = f.read().decode('gbk')
>>> u
u'\u6d4b\u8bd5'
>>> print u
测试
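
In Python 3 the decoding can be left to open() itself via the encoding parameter (covered again in section 4); a minimal sketch, assuming the same gbk.txt file:

# Python 3: open in text mode and let open() decode gbk for you
with open('/Users/michael/gbk.txt', 'r', encoding='gbk') as f:
    print(f.read())   # 测试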

Alternatively, the built-in codecs module can do the transcoding automatically:

import codecs
with codecs.open('/Users/michael/gbk.txt', 'r', 'gbk') as f:
    f.read()  # u'\u6d4b\u8bd5'

1. Reading files

>>> f.readlines()
['1\n', '00:00:00,000 --> 00:00:02,060\n', 'you\n']
  • read(size)
    Reads the data and returns a str (text mode) or bytes (binary mode).
    size is an integer; a negative value or no argument reads the whole file.
    At the end of the file an empty string is returned.

  • readline()
    Reads a single line from the file.

  • readlines() is similar to read() in that it reads everything, but it splits the content on the line separator \n and returns a list.

  • All three methods keep the trailing newline; strip it yourself if you do not want it.

with open('filepath', 'r') as f:
    list1 = f.readlines()
    for i in range(len(list1)):
        list1[i] = list1[i].rstrip('\n')
## read() and readline() keep the newline too (print renders it), so strip it there as well
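
The same stripping can be written more compactly with a list comprehension (a minimal sketch):

with open('filepath', 'r') as f:
    list1 = [line.rstrip('\n') for line in f]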

1.1 Traditional reading pattern

A file object opened with open() should be closed promptly once you are done with it.

file = open("infile")
for line in file.readline():
if not line:
break
pass
file.close()

## 按照行读取
file = open("sample.txt")
longest = max(len(line) for line in file)
for line in file:
## TODO
file.close()
  1. readlines is the fastest of the three.
  2. readline is somewhat faster than fileinput (a fileinput sketch follows below).
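
fileinput is mentioned above but never shown; a minimal sketch of reading one or more files with it (the file name is just an example):

import fileinput

for line in fileinput.input(files=['sample.txt']):
    print(line.rstrip('\n'))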

1.2 Finding the IPs common to two files

import bisect

with open('1.txt', 'r') as f1:
    list1 = f1.readlines()
    for i in range(len(list1)):
        list1[i] = list1[i].strip('\n')

with open('2.txt', 'r') as f2:
    list2 = f2.readlines()
    for i in range(len(list2)):
        list2[i] = list2[i].strip('\n')

list2.sort()
same_data = []

## binary-search each IP from list1 in the sorted list2
for ip in list1:
    pos = bisect.bisect_left(list2, ip)
    if pos < len(list2) and list2[pos] == ip:
        same_data.append(ip)

same_data = list(set(same_data))
print(same_data)
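
For this particular task a set intersection is shorter than bisect and usually fast enough; a sketch assuming the same 1.txt and 2.txt:

with open('1.txt') as f1, open('2.txt') as f2:
    ips1 = {line.strip() for line in f1}
    ips2 = {line.strip() for line in f2}
print(sorted(ips1 & ips2))   # IPs present in both files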

1.3 Reading a file, then sorting and writing the result

result = list()
with open('file', 'r') as f:
    for line in f.readlines():
        line = line.strip()
        if not len(line) or line.startswith('#'):
            continue
        result.append(line)
result.sort()
with open('output', 'w') as out:
    out.write('%s' % '\n'.join(result))

2. Writing files

f1 = open('test.txt', 'w')
f1.writelines(['1\n', '2\n', '3\n'])
f1.close()

with open('/Users/macbook/test/txt', 'w') as f:
    f.write('')
  • writelines() is the counterpart of readlines(): it takes a list of strings, and each string needs to include its own newline.

  • To write a text file in a specific encoding, follow the codecs example: write unicode and let codecs convert it to the target encoding (see the sketch below).
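
A minimal sketch of writing a GBK-encoded file, with codecs and with the Python 3 encoding parameter (the output file name is just an example):

import codecs

# codecs converts the text to gbk on write
with codecs.open('gbk_out.txt', 'w', 'gbk') as f:
    f.write(u'测试')

# Python 3 equivalent using open() directly
with open('gbk_out.txt', 'w', encoding='gbk') as f:
    f.write('测试')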

3. f.seek(offset, whence=0)

'''input.txt
123
456
789
'''
## In Python 3 text mode ('r+'), seek with whence=1 and a nonzero offset is not
## allowed, so open the file in binary mode instead.
f = open('input.txt', 'rb+')
f.readline()
## the position is now just past the first line's \n
f.seek(-1, 1)
## step back one byte, onto that \n, and overwrite from there
f.write('hello'.encode('utf-8'))
f.close()
'''
123hello789
'''
  • offset: the offset in bytes
  • whence: the reference point; 0 = start of the file, 1 = current position, 2 = end of the file (see the sketch below)
  • In text mode only absolute seeks are supported (whence=1 or 2 requires a zero offset), so use binary mode for relative seeks; in append mode ('a') writes always go to the end of the file no matter where you seek.
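
A short sketch of whence=2, seeking from the end of a file to read its last few bytes (binary mode, as noted above; input.txt from the example above is assumed):

with open('input.txt', 'rb') as f:
    f.seek(0, 2)               # jump to the end; tell() now returns the file size
    size = f.tell()
    f.seek(-min(4, size), 2)   # back up at most 4 bytes from the end
    print(f.read())            # the last few bytes of the file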

4. String encodings

## Pass encoding to read a file stored in a particular encoding;
## errors='ignore' silently skips characters that cannot be decoded

with open('test.txt', encoding='utf-8', errors='ignore') as f:
    f.read()

5. Swapping two columns in a file

with open('test.txt', 'r') as input, open('output.txt', 'w') as output:
    line = input.readline()
    while line:
        col1, col2 = line.strip().split(',')
        output.write('{},{}\n'.format(col2, col1))
        line = input.readline()

6. Serialization

Serialization is the process of turning an in-memory variable into something that can be stored or transmitted. In Python this is called pickling; other languages call it serialization, marshalling or flattening.
pickle serializes Python objects into a Python-specific binary format; for JSON files use the json module instead (see section 9).

## Python 2 has pickle and the faster cPickle; Python 3 only has pickle
import pickle
lines = ['I love you',
         'I still love you',
         'I hate you']
## Serialize and save; note that binary mode is required
with open('lines.pkl', 'wb') as f:
    ## dump serializes the object into a file-like object
    pickle.dump(lines, f)
## Read the file back and deserialize
with open('lines.pkl', 'rb') as f:
    lines_back = pickle.load(f)

print(lines_back)
  • dumps serializes any object to bytes (see the sketch below)
  • dump serializes an object into a file-like object
  • load deserializes a file-like object back into an object
  • loads deserializes bytes back into an object
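
A minimal sketch of dumps and loads, which work on bytes instead of a file object:

import pickle

data = {'a': 1, 'b': [2, 3]}
blob = pickle.dumps(data)       # serialize to bytes
restored = pickle.loads(blob)   # deserialize back into a dict
assert restored == data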

7. Sorting

Read a file and sort it by selected columns:

import csv

with open('file', newline='') as f:
    data = csv.reader(f, delimiter=',')
    ## sort by the first column, then numerically by the second column
    sortedlist = sorted(data, key=lambda x: (x[0], int(x[1])))

with open('test.csv', 'w', newline='') as f:
    fileWriter = csv.writer(f, delimiter=',')
    for row in sortedlist:
        fileWriter.writerow(row)

(https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance)

8. StringIO and BytesIO

Read and write in memory instead of on disk:

from io import StringIO
from io import BytesIO

f = StringIO()
f.write('hello')
f.write(' world!')
print(f.getvalue())
## Read: initialize the buffer with content (writing above left the position at the end)
f = StringIO('hello\nworld!')
while True:
    s = f.readline()
    if s == '':
        break
    print(s.strip())

## BytesIO holds bytes instead of str
f = BytesIO()
f.write('中'.encode('utf-8'))
print(f.getvalue())

## Read: rewind first
f.seek(0)
f.read()

After writing, the stream position sits at the end of the buffer, so a plain f.read() returns nothing; rewind with f.seek(0) (or any f.seek(n)) before reading, and use f.tell() to check the current position.
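
StringIO is also convenient for handing in-memory text to APIs that expect a file object; a sketch feeding it to csv.reader:

import csv
from io import StringIO

buf = StringIO('a,b,c\n1,2,3\n')
for row in csv.reader(buf):
    print(row)   # ['a', 'b', 'c'], then ['1', '2', '3']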

9. Reading and writing JSON files in Python

# Pretty-print on the command line:
#   cat data.json | python -m json.tool

import json
#read
with open('./data.json') as fp:
    data = json.load(fp)

#write
with open('./data.json', 'w') as fp:
    json.dump(data, fp, indent=2)

#dump as a string
print(json.dumps(data, indent=4))
  • load reads from a file-like object and deserializes the JSON
  • loads deserializes a JSON string
  • dump writes JSON into a file-like object
  • dumps returns a str (see the sketch below)
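
By default dumps escapes non-ASCII characters; ensure_ascii=False keeps them readable, which matters for the Chinese text used elsewhere in these examples (a minimal sketch):

import json

data = {'name': '测试'}
print(json.dumps(data))                      # {"name": "\u6d4b\u8bd5"}
print(json.dumps(data, ensure_ascii=False))  # {"name": "测试"}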

10. Reading and writing CSV files in Python

import csv
with open('input.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    # the first row is the header; next() consumes (skips) it
    header = next(reader, None)
    for row in reader:
        # print the row as a list
        print(row)
        # print the row paired with the header as a list of tuples
        print(list(zip(header, row)))
        # print the row as a dict keyed by the header
        print(dict(zip(header, row)))
#write
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(['a', 'b', 'c'])
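
csv.DictReader pairs each row with the header automatically, replacing the manual zip(header, row) above (a minimal sketch):

import csv

with open('input.csv', 'r', newline='') as f:
    for row in csv.DictReader(f):
        print(row)   # each row is a dict keyed by the header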

10.1 Reading CSV from HDFS

#read csv from HDFS (subprocess.run requires Python 3.5+)
import csv
import subprocess

result = subprocess.run(['hadoop', 'fs', '-text', '/path/to/data/part*'], stdout=subprocess.PIPE)
lines = result.stdout.decode().strip().split('\n')
reader = csv.reader(lines)
