
How to pass a Python package to a Spark job and invoke the main file from the package with arguments

I have Python code with a structure like:

Project1
--src
----util.py
----job1.py
----job2.py
--config
----config1.json
----config2.json

I want to run job1 in Spark, but I cannot invoke job1.py on its own because it depends on other files such as util.py and job2.py, as well as the config files, so I need to pass the complete package as an input to Spark.

I tried running spark-submit job1.py, but it fails because dependencies like job2.py and util.py are not available to the executors.

Based on the Spark documentation, I see that --files is an option for this, but it works by passing all the filenames to spark-submit, which looks hard to maintain if the number of files in the codebase grows in the future.
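For illustration, that would mean something like the following, listing every file by hand (a sketch of the approach described above, not a command I have working):

spark-submit --files src/util.py,src/job2.py,config/config1.json src/job1.py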

Another option I see is passing a zip of the code with the --archives option, but it still fails because the files inside the zip cannot be referenced.

Can anyone suggest another way to run such a codebase in Spark?



1 Reply


Specific to your question, you need to use --py-files to include the Python files that should be made available on the PYTHONPATH.
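For a layout like yours, one way this might look (a sketch, assuming job1.py imports util and job2 as top-level modules; --py-files accepts .py, .zip, and .egg files):

cd src && zip ../deps.zip util.py job2.py && cd ..
spark-submit --py-files deps.zip --files config/config1.json src/job1.py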

I just ran into a similar problem, where I wanted to run a module's main function from a module inside an egg file.

The wrapper code below can be used to run main for any module via spark-submit. For this to work, you drop it into a Python file whose name is the fully qualified package and module name. The wrapper then uses its own filename to identify which module to run. This makes for a more natural way of executing packaged modules without needing to add extra arguments (which can get messy).

Here's the script:

"""
Wrapper script to use when running Python packages via egg file through spark-submit.

Rename this script to the fully qualified package and module name you want to run.
The module should provide a ``main`` function.

Pass any additional arguments to the script.

Usage:

  spark-submit --py-files <LIST-OF-EGGS> <PACKAGE>.<MODULE>.py <MODULE_ARGS>
"""
import os
import importlib


def main():
    # The wrapper's own filename (e.g. "mypackage.mymodule.py") identifies
    # the module to run: strip the ".py" extension to get the module name.
    filename = os.path.basename(__file__)
    module_name = os.path.splitext(filename)[0]
    # Import the target module (made available via --py-files) and run it.
    module = importlib.import_module(module_name)
    module.main()


if __name__ == '__main__':
    main()

You won't need to modify any of this code. It's all dynamic and driven by the filename.

As an example, if you save this as mypackage.mymodule.py and use spark-submit to run it, the wrapper will import mypackage.mymodule and call main() on that module. All command-line arguments are left intact and will be picked up naturally by the module being executed.
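To make that concrete, here is a minimal sketch of what the packaged module itself might look like (the module body and the --module-arg1 option are illustrative, not prescribed; the only requirement is that the module expose a main function):

# mypackage/mymodule.py, packaged into mypackage.egg
import argparse


def main():
    # spark-submit passes everything after the wrapper filename straight
    # through, so sys.argv is intact by the time this runs.
    parser = argparse.ArgumentParser()
    parser.add_argument('--module-arg1')  # illustrative argument only
    args = parser.parse_args()
    print('module-arg1 =', args.module_arg1)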

You will need to include any egg files and other supporting files in the command. Here's an example:

spark-submit --py-files mypackage.egg mypackage.mymodule.py --module-arg1 value1
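If you don't already have an egg, a standard setuptools project can build one (a sketch, assuming a setup.py exists for mypackage):

python setup.py bdist_egg

The resulting egg lands under dist/ with a versioned name (e.g. mypackage-0.1-py3.7.egg); pass that path to --py-files.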
