python – Pandas error when reading multiple csv files into HDF5…

Posted by a contributor in 源码 (Source Code), 2019-09-26

Using Python 3 and pandas 0.12.

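I am trying to write multiple csv files (7.9 GB in total) to an HDF5 store for later processing. Each csv file contains about a million rows and 15 columns; the data types are mostly strings, with some floats. However, when I try to read the csv files, I get the following error: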

Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:/Python33/lib/site-packages/pandas/io/parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:/Python33/lib/site-packages/pandas/io/parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:/Python33/lib/site-packages/pandas/io/parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas/parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas/parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done

Edit:

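I managed to find a file that produces this problem. I think it is reading an EOF character, but I have not been able to get past it. Given the size of the combined files, I think checking every single character in every string is too cumbersome. (Even then, I would not be sure what to do.) As far as I have checked, there are no strange characters in the csv files that could raise the error.
I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.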
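The message "EOF inside string" means the tokenizer saw an opening quote character and reached the end of the file before finding the closing one. One plausible cause (an assumption, not confirmed by the post) is a stray `"` somewhere in a field. Rather than inspecting every character by hand, a minimal sketch like the following scans a file line by line and reports the lines that contain an odd number of quote characters. The function name `find_unbalanced_quote_lines` and its parameters are hypothetical, introduced only for illustration.

```python
def find_unbalanced_quote_lines(path, quotechar='"', encoding='utf-8'):
    """Return (line_number, text) pairs for lines with an odd number of quote characters.

    Hypothetical helper: an odd count is only a hint that a quote is unmatched.
    """
    suspects = []
    # errors='replace' keeps the scan going even if a byte cannot be decoded
    with open(path, 'r', encoding=encoding, errors='replace') as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.count(quotechar) % 2 == 1:
                suspects.append((line_number, line.rstrip('\n')))
    return suspects
```

Calling `find_unbalanced_quote_lines('some_file.csv')` on the file named in the traceback would give a short list of candidate lines to inspect. A file that legitimately contains quoted fields spanning several lines would also be flagged, so the result is a starting point rather than a verdict.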

My code is as follows:

# -*- coding: utf-8 -*-

import pandas as pd
import os
from glob import glob


def list_files(path=os.getcwd()):
    ''' List the matching csv files in the specified path '''
    # Use the path argument, which the glob pattern previously ignored
    list_of_files = sorted(glob(os.path.join(path, '2013-06*.csv')))
    return list_of_files


def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)

    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and appending them to the store, node name 'ta_data'
        for chunk in reader:
            # append() adds rows to the table; to_hdf without append=True
            # would replace the node on every call and keep only the last chunk
            store.append('ta_data', chunk)

    # Return the stored table
    return store.select('ta_data')

to_hdf()

Edit

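If I go into the csv file that raises the CParserError EOF… and manually delete all lines after the line that causes the problem, the csv file is read correctly. However, everything I delete is blank lines.
Strangely, when I manually correct the faulty csv files, they load into the store fine individually. But when I again use a list of multiple files, the 'false' files still return the error.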
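The behaviour described above would be consistent with an unmatched quote character at or near the reported line, although the post does not confirm this. If no field in the data relies on quoting to protect embedded delimiters, one option worth trying is to tell the parser to treat quote characters as ordinary text through the `quoting` parameter of `pd.read_csv`. The sketch below uses a made-up in-memory sample, not the original data.

```python
import csv
import io

import pandas as pd

# Made-up sample: the second data row contains a stray, unmatched double quote.
sample = 'id,name,value\n1,alpha,0.5\n2,"beta,1.5\n3,gamma,2.5\n'

# With quoting disabled, the quote is kept as a literal character in the field
# instead of opening a quoted string that never closes.
frame = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE)
print(frame)
```

The same argument can be combined with `chunksize` in the reader loop. The trade-off is that a quoted field containing a comma would then be split into two columns, so this only fits data where quotes carry no structural meaning.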

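This article was written by a contributor. Article URL: https://blog.isoyu.com/archives/python-jiangduogecsvwenjianduqudaohdf5shidepandas.html
Licensed under the Creative Commons Attribution 4.0 International License. Unless marked as a repost or attributed to another source, all content on this site is original or translated; please credit the source before reposting. Last edited: September 26, 2019 at 10:21 PM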
