
python - Convert large csv to sparse matrix for use in sklearn

I have a ~30 GB matrix (~1.7 GB compressed | 180K rows x 32K columns) saved in CSV format. I would like to convert it to a sparse format so I can load the full dataset into memory for machine learning with sklearn. The populated cells contain floats less than 1. One caveat: the target variable is stored as the last column of the matrix. What is the best way to make this large matrix usable in sklearn? That is, how can you convert the ~30 GB CSV into a scipy sparse format without ever loading the original dense matrix into memory?

Pseudocode

  1. Remove target variable (keep order intact)
  2. Convert ~30 GB matrix to sparse format (Help!!)
  3. Load sparse format into memory and target variable to run machine learning pipeline (How would I do this?)


1 Reply


You can build the sparse matrix row by row in memory fairly easily:

import numpy as np
import scipy.sparse as sps

input_file_name = "something.csv"
sep = ","  # must match the delimiter actually used in the file

def _process_data(row_array):
    # Hook for any per-row processing you need, e.g. splitting off
    # the target variable stored in the last column.
    return row_array

sp_data = []
with open(input_file_name) as csv_file:
    for row in csv_file:
        # Parse one line of text into a 1-D float array.
        data = np.fromstring(row, sep=sep)
        data = _process_data(data)
        # A 1-D array becomes a 1 x n_columns sparse matrix;
        # zeros are dropped, so only populated cells are stored.
        data = sps.coo_matrix(data)
        sp_data.append(data)

# Stack the per-row matrices into one n_rows x n_columns sparse matrix.
sp_data = sps.vstack(sp_data)
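
To cover steps 1 and 3 of your pseudocode, here is a minimal sketch that continues from the snippet above: it splits off the target column and feeds the rest to sklearn. The CSR slicing is standard scipy; SGDRegressor is just an illustrative placeholder for whatever estimator you actually use.

import numpy as np
from sklearn.linear_model import SGDRegressor  # placeholder estimator

# COO does not support slicing; CSR does, and most sklearn
# estimators accept CSR input directly.
sp_data = sp_data.tocsr()

# Assumption from the question: the target variable is the last column.
X = sp_data[:, :-1]
y = np.asarray(sp_data[:, -1].todense()).ravel()  # dense 1-D target vector

model = SGDRegressor()
model.fit(X, y)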

The resulting matrix will also be easy to write to HDF5, which is a far better format than a text file for storing numbers at this scale.
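
For example (a sketch; file names and dataset keys are arbitrary): scipy can serialize the matrix directly with save_npz, or you can store the three CSR component arrays in HDF5 with h5py and rebuild the matrix on load.

import h5py
import scipy.sparse as sps

csr = sp_data.tocsr()

# Option 1: scipy's native sparse serialization (single .npz file).
sps.save_npz("matrix.npz", csr)
# later: csr = sps.load_npz("matrix.npz")

# Option 2: HDF5 via h5py, storing the raw CSR components.
with h5py.File("matrix.h5", "w") as f:
    f.create_dataset("data", data=csr.data)
    f.create_dataset("indices", data=csr.indices)
    f.create_dataset("indptr", data=csr.indptr)
    f.attrs["shape"] = csr.shape

# Rebuild the matrix from the stored components.
with h5py.File("matrix.h5", "r") as f:
    csr = sps.csr_matrix(
        (f["data"][:], f["indices"][:], f["indptr"][:]),
        shape=tuple(f.attrs["shape"]),
    )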

