
python - PySpark logging from the executor

What is the correct way to access Spark's log4j logger from an executor when using PySpark?

It's easy to do so in the driver, but I cannot figure out how to access the logging functionality on the executors so that I can log locally and let YARN collect the local logs.

Is there any way to access the local logger?

The standard logging procedure is not enough on its own, because I cannot access the Spark context from the executor.
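
For reference, this is roughly what I mean by "easy in the driver": a minimal sketch that reaches the JVM's log4j through the Py4J gateway (the logger name "my_app" is arbitrary):

log4j = spark._jvm.org.apache.log4j  # Py4J gateway; exists only on the driver
driver_logger = log4j.LogManager.getLogger("my_app")
driver_logger.info("Logged through the JVM's log4j, driver side only")

Anything like this fails inside a map function, since the executors' Python workers have no JVM gateway to call.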


1 Reply


You cannot use a local log4j logger on executors. The Python workers spawned by the executor JVMs have no "callback" connection to the Java side; they just receive commands. But there is a way to log from executors using standard Python logging and have YARN capture those logs.

Place a Python module on your HDFS that configures logging once per Python worker and proxies the logging functions (name it logger.py):

import os
import logging
import sys

class YarnLogger:
    @staticmethod
    def setup_logger():
        # LOG_DIRS is set by YARN in every container's environment and
        # lists the container's local log directories.
        if 'LOG_DIRS' not in os.environ:
            sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled\n')
            return

        # Log to a pyspark.log file in the first log directory, where
        # YARN's log aggregation will pick it up.
        log_file = os.path.join(os.environ['LOG_DIRS'].split(',')[0], 'pyspark.log')
        logging.basicConfig(
            filename=log_file,
            level=logging.INFO,
            format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s',
            datefmt='%Y-%m-%d %H:%M:%S')

    def __getattr__(self, key):
        # Proxy attribute lookups (info, warning, error, ...) straight
        # to the standard logging module.
        return getattr(logging, key)

YarnLogger.setup_logger()
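
(Uploading the module is a one-off step outside Spark; for example, hdfs dfs -put logger.py hdfs:///path/to/logger.py with the standard HDFS CLI, matching the placeholder path used below.)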

Then import this module inside your application:

# Ship the module to the executors so their Python workers can import it.
spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
import logger
logger = logger.YarnLogger()

Now you can use it inside your PySpark functions like the normal logging library:

def map_sth(s):
    # Runs on an executor; the message goes to that worker's local pyspark.log.
    logger.info("Mapping %s", s)
    return s

spark.range(10).rdd.map(map_sth).count()
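
The same proxy works from anything else that runs on the executors; for instance, a per-partition sketch (the function name and message here are illustrative):

def process_partition(rows):
    # Runs once per partition on an executor; logging was already
    # configured there when logger.py was first imported.
    rows = list(rows)
    logger.info("Processing a partition with %d rows", len(rows))
    return rows

spark.range(10).rdd.mapPartitions(process_partition).count()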

The pyspark.log file will be visible in the Resource Manager UI and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId .....

