Working with Databricks Workspace Files
source link: https://jdhao.github.io/2023/11/18/databricks-workspace-files/
Some observations and findings from working with Databricks workspace files.
How to read/access workspace files
For regular Python
The behavior of accessing workspace files also differs depending on the Databricks Runtime (DBR) version. Consider the following code:
```python
with open('/Workspace/Users/<user-email>/path/to/file') as f:
    content = f.readlines()
    print(content)
```
In DBR 10.4, I get the following error:
```
FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Users/<user-email>/path/to/file'
```
Since DBR 11.3, we can access files under the Databricks workspace using their absolute paths (source here). So the above code should work as expected and print the file content. However, this does not apply to notebooks under the workspace (source here). I think this is fine, because most people don't need to read notebooks directly.
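Since the error on older runtimes is just a generic `FileNotFoundError`, a small wrapper can make the failure mode clearer. This is my own sketch, not a Databricks API:

```python
# Sketch of a helper (not a Databricks API) that wraps open() and turns the
# FileNotFoundError seen on runtimes older than DBR 11.3 into a hint about
# the version requirement.
def read_workspace_file(path: str) -> list[str]:
    try:
        with open(path) as f:
            return f.readlines()
    except FileNotFoundError:
        raise RuntimeError(
            f"{path} not found -- direct /Workspace access requires DBR 11.3+"
        ) from None
```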
Since DBR 14.0, as discussed later, the current working directory is changed to the folder where the notebook is run, so you can additionally use relative paths to access workspace files. For example, if there is a test.py in the same folder as the notebook, you can run the following code without error:
```python
with open('./test.py', 'r') as f:
    content = f.readlines()
    print(content)
```
For Spark code
It is also possible to access workspace files from Spark code. However, there are two requirements:
- you must use the fully-qualified path for the workspace file, e.g., `file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv`
- the cluster can't be in shared access mode; otherwise, you will see the following error when trying to access the workspace files:

```
java.lang.SecurityException: Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden
```
If both conditions are satisfied, you should be able to run the following code without error:
```python
df = spark.read.csv("file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv", header=True)
display(df)
```
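The `file:` prefix is easy to forget, so a tiny helper (my own sketch, not a Databricks API) can normalize a workspace path into the fully-qualified form that `spark.read` expects:

```python
def to_spark_local_path(workspace_path: str) -> str:
    """Prefix a /Workspace path with 'file:' for spark.read (hypothetical helper)."""
    if workspace_path.startswith("file:"):
        return workspace_path  # already fully qualified
    if not workspace_path.startswith("/Workspace"):
        raise ValueError(f"not a workspace path: {workspace_path}")
    return "file:" + workspace_path

print(to_spark_local_path("/Workspace/Users/<user-name>/data/MOCK_DATA.csv"))
# file:/Workspace/Users/<user-name>/data/MOCK_DATA.csv
```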
Comparison
Yeah, Databricks just makes things f*king complicated. I am pulling my hair out trying to figure out these rules and cases. Here is a comparison table (hopefully it makes things easier to understand):
| DBR version | `open()` with absolute path | `open()` with relative path | `spark.read` with absolute path | `spark.read` with relative path |
|---|---|---|---|---|
| DBR 11.3 single user | ✅ | ❌ (cwd is not the workspace folder) | ✅ | ❌ (path must be absolute) |
| DBR 11.3 shared | ✅ | ❌ (cwd is not the workspace folder) | ❌ | ❌ (path must be absolute) |
| DBR 14.1 single user | ✅ | ✅ | ✅ | ❌ (path must be absolute) |
| DBR 14.1 shared | ✅ | ✅ | ❌ | ❌ (path must be absolute) |
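The table can also be expressed as a small lookup function. This is my own sketch, covering only the versions and access modes tested above:

```python
def workspace_access_support(dbr_version: str, shared: bool) -> dict:
    """Summarize the comparison table (sketch based on DBR 11.3/14.1 tests).

    dbr_version is a string like "14.1" or "14.1.x-scala2.12".
    """
    major, minor = (int(x) for x in dbr_version.split(".")[:2])
    open_abs = (major, minor) >= (11, 3)   # /Workspace paths work since 11.3
    open_rel = (major, minor) >= (14, 0)   # cwd is the notebook folder since 14.0
    return {
        "open_absolute": open_abs,
        "open_relative": open_rel,
        "spark_absolute": open_abs and not shared,  # shared mode forbids local fs
        "spark_relative": False,                    # spark.read needs absolute paths
    }

print(workspace_access_support("14.1.x-scala2.12", shared=True))
```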
Current working directory
In older DBR versions, when you run Python code, the current working directory is /databricks/driver.
To check the DBR version and your current working directory, use this:
```python
import os

print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
print(os.path.abspath('./'))
```
On my DBR 10.4 cluster (single user access mode; the directory is different if you use shared access mode), I see the following output:
10.4.x-scala2.12
/databricks/driver
Starting in DBR 14.0, the current working directory is changed to the directory where the notebook runs (source here). On a DBR 14.1 cluster, I see the following output:
14.1.x-scala2.12
/Workspace/Users/<user-email>/<current-folder-name>
You can use relative paths to write and read files, but the actual location differs across DBR versions. For example, consider the following code:
```python
with open('./demo.txt', 'w') as f:
    f.write("hello world\n")
```
- If you use 10.4, the file is saved to /databricks/driver/demo.txt on the driver node.
- If you use 11.3, the file is saved to /home/spark-<some-random-string>/demo.txt.
- If you use 14.1, the file is saved to /Workspace/<user-email>/<current-folder>/demo.txt.
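In all three cases the relative path resolves against the current working directory, so you can check where `./demo.txt` will land before writing it:

```python
import os

# A relative path resolves against os.getcwd(); on DBR 10.4 this prints
# /databricks/driver/demo.txt, on DBR 14.1 a path under the notebook's
# workspace folder.
target = os.path.abspath('./demo.txt')
print(target)
assert target == os.path.join(os.getcwd(), 'demo.txt')
```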
- Default current working directory in DBR 14.0: https://learn.microsoft.com/en-us/azure/databricks/files/cwd-dbr-14