Downloading files from Databricks’ DBFS
A quick tutorial on how to access your DBFS instance to download files solely via your browser.
More often than not, you may be interested in downloading data from your Databricks instance. And whilst Databricks provides a UI for retrieving your DataFrame results, sometimes you are interested in data generated on your Databricks instance that is not directly tied to a DataFrame. Typical use cases include simulation results, generated textual data, or even a DataFrame's schema.
By default, Databricks does not provide a way to remotely access or download the files within DBFS. In this quick guide, I'll show you how to access your DBFS data in 2 minutes without any external tools, relying simply on your browser.
1. Storing our output into a file in DBFS
Consider writing a DataFrame schema into a text file so you can process it without the limitations of Databricks' cell output:
from pyspark.sql import DataFrame

# Read the source data ([…] stands for your own input path and options)
base_data: DataFrame = spark.read.json([…])
# Capture the inferred schema as a string
base_schema: str = str(base_data.schema)
Start by writing the file we want to download to DBFS:
dbutils.fs.put("/FileStore/schema_output.txt", base_schema, overwrite=True)
Note: It is important to place the file under the FileStore folder (i.e. /FileStore/{your_path}); the reasoning behind this will be explored in the second step.
2. Downloading the file from DBFS
Databricks does not allow downloading data directly via the DBFS Data UI widget; however, the data within the FileStore folder is exposed via an endpoint, and that is exactly how we will access our file.
2.1 Fetch our Databricks tenant instance URL
Retrieve your Databricks tenant instance URL by accessing the Databricks platform within your Cloud provider. For the sake of this tutorial, we will do so using Azure; however, keep in mind the process is similar across all providers.
Considering the URL below, we are interested in two portions:
- The instance address (the host, everything up to azuredatabricks.net)
- (Optional) The o query parameter, which holds the tenant's workspace ID
https://adb-12345.11.azuredatabricks.net/?o=12345#notebook/9999111/command/1111
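To make the two portions concrete, here is a small sketch that extracts both pieces with Python's standard urllib. The URL and the `parse_workspace_url` helper name are just illustrative placeholders, not part of any Databricks API:

```python
from typing import Optional, Tuple
from urllib.parse import urlparse, parse_qs

def parse_workspace_url(notebook_url: str) -> Tuple[str, Optional[str]]:
    """Extract the instance address and the optional ``o`` (workspace ID)
    query parameter from a Databricks notebook URL."""
    parsed = urlparse(notebook_url)
    # parse_qs returns a dict of lists; "o" is absent on some URLs
    workspace_id = parse_qs(parsed.query).get("o", [None])[0]
    return parsed.netloc, workspace_id

host, workspace_id = parse_workspace_url(
    "https://adb-12345.11.azuredatabricks.net/?o=12345#notebook/9999111/command/1111"
)
# host → "adb-12345.11.azuredatabricks.net", workspace_id → "12345"
```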
2.2 Create your GET request to the file system endpoint
The files endpoint makes the information within the FileStore folder available via a GET request, or simply by opening the URL in your browser.
Note: Again, keep in mind that the data must reside within the FileStore folder or one of its subfolders.
In step 1 we stored our file in the path:
/dbfs/FileStore/schema_output.txt
Hence, to access the file, we insert the path directly into the URL, replacing /dbfs/FileStore with /files:
https://adb-12345.11.azuredatabricks.net/files/schema_output.txt?o=12345
Similarly, should we have stored our file in the path:
/dbfs/FileStore/schema/schema_output.txt
We could access the file via the URL:
https://adb-12345.11.azuredatabricks.net/files/schema/schema_output.txt?o=12345
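The path-to-URL mapping above can be sketched as a small helper. The host, workspace ID, and the `filestore_download_url` name are placeholder values for this tutorial, not an official API:

```python
from typing import Optional

def filestore_download_url(host: str, dbfs_path: str,
                           workspace_id: Optional[str] = None) -> str:
    """Map a FileStore path (in any of its common spellings) to the
    /files download URL exposed by the Databricks workspace."""
    prefixes = ("/dbfs/FileStore/", "dbfs:/FileStore/", "/FileStore/")
    for prefix in prefixes:
        if dbfs_path.startswith(prefix):
            relative_path = dbfs_path[len(prefix):]
            break
    else:
        # Only files under FileStore are reachable via /files
        raise ValueError("The file must live under the FileStore folder")
    url = f"https://{host}/files/{relative_path}"
    if workspace_id is not None:
        url += f"?o={workspace_id}"
    return url

filestore_download_url(
    "adb-12345.11.azuredatabricks.net",
    "/dbfs/FileStore/schema/schema_output.txt",
    "12345",
)
# → "https://adb-12345.11.azuredatabricks.net/files/schema/schema_output.txt?o=12345"
```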
Troubleshooting
Incorrect path error
One of the most frustrating and most cryptic errors is the incorrect path error. Make sure you do not include the FileStore folder in the access path.
HTTP ERROR: 404
Problem accessing /files/FileStore/schema_output.txt. Reason:
Bad Target: GET FileStore/schema_output.txt
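As a quick sanity check, the mistaken URL behind this error can be repaired by stripping the extra FileStore segment. A minimal sketch (the `fix_files_url` helper is illustrative, not a Databricks utility):

```python
def fix_files_url(url: str) -> str:
    """Repair a /files URL that mistakenly kept the FileStore segment."""
    # Replace only the first occurrence, right after the /files root
    return url.replace("/files/FileStore/", "/files/", 1)

fix_files_url(
    "https://adb-12345.11.azuredatabricks.net/files/FileStore/schema_output.txt?o=12345"
)
# → "https://adb-12345.11.azuredatabricks.net/files/schema_output.txt?o=12345"
```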