PySpark Considerations
Extracting raw data from HDFS
I have tested multiple ways to get data out of HDFS. There are two situations:
The data fits into the local node's RAM:
# will produce a local csv
spark.sql("select * from my_table where true") \
    .toPandas() \
    .to_csv("myLocalFile.csv", encoding="utf8")

# otherwise the spark shell keeps running
exit(0)
# run the python script:
PYTHONSTARTUP=my/python/path/prog.py pyspark --master yarn [...]
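If a standalone job is preferable to the PYTHONSTARTUP trick, the same extraction can be submitted with spark-submit. A minimal sketch, assuming Spark 2.x or later and reusing the table name and file name from above (both placeholders):

# extract_to_local_csv.py -- submit with: spark-submit --master yarn extract_to_local_csv.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract_to_local_csv").getOrCreate()

# toPandas() collects the whole result on the driver, so it must fit in driver RAM
spark.sql("select * from my_table where true") \
    .toPandas() \
    .to_csv("myLocalFile.csv", encoding="utf8", index=False)

spark.stop()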
The dataset does not fit into RAM:
# will produce some csv files on hdfs
spark.sql("select * from my_table where true") \
    .write \
    .format("csv") \
    .save("/output/dir/on/hdfs/")

# otherwise the spark shell keeps running
exit(0)
# run the python script:
PYTHONSTARTUP=my/python/path/prog.py pyspark --master yarn [...]

# merge the part files onto the local filesystem
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.csv
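When a single output file is acceptable, the getmerge step can be avoided by coalescing the result to one partition before writing. A sketch, reusing the placeholders above; note that coalesce(1) funnels every row through a single task, so it trades write parallelism for convenience:

# write one part file on hdfs instead of many
spark.sql("select * from my_table where true") \
    .coalesce(1) \
    .write \
    .format("csv") \
    .save("/output/dir/on/hdfs/")
exit(0)

The output directory then contains a single part-* file that hadoop fs -copyToLocal can fetch directly.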