Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
463 views
in Technique[技术] by (71.8m points)

pyspark - Spark SQL security considerations

What are the security considerations when accepting and executing arbitrary spark SQL queries?

Imagine the following setup:

Two files on hdfs are registered as tables a_secrets and b_secrets:

# must only be accessed by clients with access to all of customer a' data
spark.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")

# must only be accessed by clients with access to all of customer b's data
spark.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")

These two views, I could secure using simple hdfs file permissions. But say I have the following logical views of these tables, that I'd like to expose:

# only access for clients with access to customer a's account no 1
spark.sql("SELECT * FROM a_secrets WHERE account = 1").createTempView("a1_secrets")

# only access for clients with access to customer a's account no 2
spark.sql("SELECT * FROM a_secrets WHERE account = 2").createTempView("a2_secrets")


# only access for clients with access to customer b's account no 1
spark.sql("SELECT * FROM b_secrets WHERE account = 1").createTempView("b1_secrets")

# only access for clients with access to customer b's account no 2
spark.sql("SELECT * FROM b_secrets WHERE account = 2").createTempView("b2_secrets")

Now assume I receive an arbitrary (user, pass, query) set. I get a list of accounts the user can access:

groups = get_groups(user, pass)

and extract the logical query plan of the user's query:

spark.sql(query).explain(true)

giving me a query plan along the lines of (this exact query plan is made up)

== Analyzed Logical Plan ==
account: int, ... more fields
Project [account#0 ... more fields]
+- SubqueryAlias a1_secrets
   +- Relation [... more fields]
      +- Join Inner, (some_col#0 = another_col#67)
         :- SubqueryAlias a2_secrets
         :  +- Relation[... more fields] csv
== Physical Plan ==
... InputPaths: hdfs:/customer_a/secrets.csv ...

Assuming I can parse a logical query plan to determine exactly which tables and files are being accessed, is it safe to grant access to the data produced by the query? I'm thinking of potential problems like:

  • Are there ways to access registered tables without them showing up in a logical query plan?
  • Are the ways to load new data and register it as tables through pure spark SQL? (input to spark.sql(1))?
  • Do users have access to any sql functions with side effects (that modifies or accesses unathorized data)?
  • Are there ways to register UDFs/execute arbitrary code purely through spark.sql(1)?

To summarise: Can I safely accept arbitrary SQL, register it with df = spark.sql(1), analyse data access using df.explain(True), and then return results using e.g. df.collect()?

Edits: - 23 Jan 15:29: edited to include an "EXPLAIN" prefix in

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

TL;DR You should never execute any untrusted code on your Spark cluster.

Are the ways to load new data and register it as tables through pure spark SQL?

Yes. CREATE TABLE can be executed using sql method so if as long as users have permissions to access filesystem they can create tables.

Are there ways to register UDFs/execute arbitrary code purely through spark.sql(1)?

Yes, as long they can control classpath which, can be modified with SQL.

spark.sql("""add jar URI""")

Do users have access to any sql functions with side effects (that modifies or accesses unathorized data)?

Effectively yes (by extension of the previous point).

Can I safely accept arbitrary SQL,

No.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...