Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
280 views
in Technique[技术] by (71.8m points)

parquet - Apache Drill has bad performance against SQL Server

I tried using apache-drill to run a simple join-aggregate query and the speed wasn't really good. my test query was:

SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category

Where facts has about 422,000 rows and product has 600 rows. the grouping comes back with 4 rows.

First I tested this query on SqlServer and got a result back in about 150ms.

With drill I first tried to connect directly to SqlServer and run the query, but that was slow (about 5 sec).

Then I tried saving the tables into json files and reading from them, but that was even slower, so I tried parquet files.

I got the result back in the first run in about 3 sec. next run was about 900ms and then it stabled at about 500ms.

From reading around, this makes no sense and drill should be faster! I tried "REFRESH TABLE METADATA", but the speed didn't change.

I was running this on windows, through the drill command line.

Any idea if I need some extra configuration or something?

Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.

SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process so the scan and join is very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan, send it to all the drillbit processes, which then lookup the data sources, access the data using the connectors, run the query, return the results to the first node for aggregation, and then you receive the final output.

Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.

If you have a small dataset stored on a single machine then any relational database will be faster than Drill.

The fact that Drill gets you results in 500ms with Parquet is actually impressive considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.

Use Apache Drill when you:

  • Have 10s of billions of rows or more
  • Have data spread across many machines
  • Have unstructured data like JSON stored in files without a standard schema
  • Want to split the query across many machines to run in faster in parallel
  • Want to access data from different databases and file systems
  • Want to join data across these different data sources

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...