Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

python - TFIDF for Large Dataset

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a corpus of around 8 million news articles, and I need their TF-IDF representation as a sparse matrix. I have been able to do this with scikit-learn for a relatively small number of samples, but I believe it can't be used for a dataset this large, because it loads the input matrix into memory first, which is expensive.

Does anyone know the best way to extract TF-IDF vectors for large datasets?

python

1 Reply

replied Oct 24, 2021 by 深蓝 (71.8m points)

Gensim has an efficient tf-idf model and does not need to have everything in memory at once.
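
To see why tf-idf does not need everything in memory, here is a minimal standard-library sketch of the two-pass idea. It is not Gensim's implementation: the function names, the toy documents, and the plain `tf * ln(N / df)` weighting are assumptions made for illustration, and libraries such as Gensim apply their own weighting and normalization. The only state kept between passes is the document-frequency table, whose size depends on the vocabulary rather than on the number of articles.

```python
import math
from collections import Counter


def stream_docs():
    # Hypothetical stand-in for reading one article at a time from disk or a database.
    for text in ("the cat sat", "the dog sat", "the dog barked"):
        yield text.split()


def document_frequencies(docs):
    """Pass 1: count, for each term, how many documents contain it."""
    df = Counter()
    n_docs = 0
    for tokens in docs:
        df.update(set(tokens))  # each term counts once per document
        n_docs += 1
    return df, n_docs


def tfidf_vectors(docs, df, n_docs):
    """Pass 2: yield one sparse {term: weight} dict per document."""
    for tokens in docs:
        tf = Counter(tokens)
        yield {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}


df, n_docs = document_frequencies(stream_docs())
# Materialized as a list only because the demo is tiny; with a large corpus
# each vector would be written out as soon as it is produced.
vectors = list(tfidf_vectors(stream_docs(), df, n_docs))
```

Because the second pass yields one document at a time, each vector can be appended to a file or to a sparse-matrix builder and then discarded.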

Your corpus simply needs to be an iterable, so the whole corpus never has to be held in memory at one time.

According to the comments, the make_wiki script runs over Wikipedia in about 50 minutes on a laptop.

CC BY-SA 3.0