# Datasets
The bfabric-cli dataset command provides dataset-specific operations including viewing, downloading, and uploading datasets.
## Overview

```bash
bfabric-cli dataset --help
```

Available subcommands:

| Subcommand | Purpose |
|---|---|
| `show` | View dataset details |
| `download` | Download dataset data |
| `upload` | Upload new datasets |
## Showing Datasets

View detailed information about a dataset.

### Basic Usage

```bash
bfabric-cli dataset show [DATASET_ID] [OPTIONS]
```
### Parameters

| Parameter | Required | Description |
|---|---|---|
| `DATASET_ID` | Yes | ID of the dataset to view |
| `--format` | No | Output format (e.g. `yaml`, `json`) |
### Examples

Show dataset details:

```bash
bfabric-cli dataset show 12345
```

Show as YAML (useful when there are many columns):

```bash
bfabric-cli dataset show 12345 --format yaml
```

Show as JSON:

```bash
bfabric-cli dataset show 12345 --format json
```
### Output

The output includes:

- Dataset metadata (ID, name, description, etc.)
- Associated workunit
- Container information
- File details
- Creation/modification timestamps
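With `--format json`, the output is easy to post-process in scripts. A minimal sketch that parses a captured payload; the field names here are hypothetical stand-ins, not the actual B-Fabric schema:

```python
import json

# Hypothetical example of captured `show --format json` output;
# the real field names may differ from this sketch.
raw = '{"id": 12345, "name": "QC Run", "container": {"id": 1234}}'

dataset = json.loads(raw)
print(dataset["id"])               # 12345
print(dataset["container"]["id"])  # 1234
```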
## Downloading Datasets

Download dataset data to a local file.

### Basic Usage

```bash
bfabric-cli dataset download [DATASET_ID] [OUTPUT_FILE] [OPTIONS]
```
### Parameters

| Parameter | Required | Description |
|---|---|---|
| `DATASET_ID` | Yes | ID of the dataset to download |
| `OUTPUT_FILE` | Yes | Local file path for output |
| `--format` | No | Format for the output file: `parquet`, `csv`, or `tsv` |
### Examples

Download as Parquet:

```bash
bfabric-cli dataset download 12345 my_data.parquet --format parquet
```

Download as CSV:

```bash
bfabric-cli dataset download 12345 my_data.csv --format csv
```

Download as TSV:

```bash
bfabric-cli dataset download 12345 my_data.tsv --format tsv
```
### Working with Downloaded Data

Read a Parquet file (Python):

```python
import polars as pl

df = pl.read_parquet("my_data.parquet")
print(df.head())
```

Read a CSV file:

```python
import pandas as pd

df = pd.read_csv("my_data.csv")
print(df.head())
```

Read a TSV file:

```python
import pandas as pd

df = pd.read_csv("my_data.tsv", sep="\t")
print(df.head())
```
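If pandas is not available, the standard library's `csv` module can read both CSV and TSV downloads. A dependency-free sketch; the in-memory sample stands in for a downloaded file:

```python
import csv
import io

# In practice you would use: open("my_data.tsv", newline="")
# Here an in-memory sample stands in for the downloaded file.
sample = "id\tvalue\n1\t3.14\n2\t2.71\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(rows[0]["value"])  # "3.14" -- note: every cell is a string
```

Note that `csv` reads every cell as a string, which is one reason Parquet is preferred for typed data.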
### Notes

- The file format depends on how the dataset was stored in B-Fabric
- Parquet is recommended for large datasets
- Progress is shown during download
## Uploading Datasets

Upload new datasets to B-Fabric from local files.

### Basic Usage

```bash
bfabric-cli dataset upload [FORMAT] [INPUT_FILE] [OPTIONS]
```
### Formats

Available upload formats:

| Format | Command | Description |
|---|---|---|
| CSV | `bfabric-cli dataset upload csv` | Upload from a CSV file |
| TSV | `bfabric-cli dataset upload tsv` | Upload from a TSV file |
| Parquet | `bfabric-cli dataset upload parquet` | Upload from a Parquet file |
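When scripting uploads, the right subcommand can be chosen from the file extension. A sketch that builds (but does not run) the command line; the mapping mirrors the table above:

```python
from pathlib import Path

# Map file extensions to the matching upload subcommand.
SUBCOMMAND_BY_SUFFIX = {".csv": "csv", ".tsv": "tsv", ".parquet": "parquet"}

def upload_command(path: str, container_id: int) -> list[str]:
    """Build a bfabric-cli upload command for a local file."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUBCOMMAND_BY_SUFFIX:
        raise ValueError(f"unsupported file type: {suffix}")
    return ["bfabric-cli", "dataset", "upload", SUBCOMMAND_BY_SUFFIX[suffix],
            path, "--container-id", str(container_id)]

print(upload_command("my_data.parquet", 1234))
```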
### Common Parameters

| Parameter | Required | Description |
|---|---|---|
| `INPUT_FILE` | Yes | Path to the local file to upload |
| `--container-id` | Yes | Container ID to attach the dataset to |
| `--name` | No | Dataset name (default: filename) |
| `--description` | No | Dataset description |
### Examples

Upload CSV:

```bash
bfabric-cli dataset upload csv my_data.csv --container-id 1234
```

Upload Parquet with metadata:

```bash
bfabric-cli dataset upload parquet my_data.parquet \
    --container-id 1234 \
    --name "My Dataset" \
    --description "Analysis results from experiment X"
```

Upload TSV:

```bash
bfabric-cli dataset upload tsv my_data.tsv --container-id 1234
```
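Files produced with a proper CSV writer avoid quoting problems at upload time (e.g. cells that themselves contain commas). A standard-library sketch; the rows are illustrative:

```python
import csv
import io

# Rows to upload; in practice these come from your analysis.
rows = [["sample", "score"], ["A-01", 0.93], ["B-02, replicate", 0.87]]

# Write to an in-memory buffer here; use open("my_data.csv", "w", newline="")
# to produce a real file before uploading it.
buf = io.StringIO()
csv.writer(buf).writerows(rows)  # automatically quotes the cell with a comma
print(buf.getvalue())
```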
### Upload Subcommands

#### Upload CSV

```bash
bfabric-cli dataset upload csv [INPUT_FILE] [OPTIONS]
```

CSV-specific options:

- Column delimiter (default: comma)
- Whether the first row is a header (default: true)

Run `bfabric-cli dataset upload csv --help` for the exact option names.
#### Upload Parquet

```bash
bfabric-cli dataset upload parquet [INPUT_FILE] [OPTIONS]
```

Parquet handles data types automatically, so no additional options are needed.
#### Upload TSV

```bash
bfabric-cli dataset upload tsv [INPUT_FILE] [OPTIONS]
```

TSV-specific options:

- Whether the first row is a header (default: true)

Run `bfabric-cli dataset upload tsv --help` for the exact option name.
### Verifying Upload

After uploading, verify the dataset:

```bash
# Show the new dataset (you'll need the new ID from the upload output)
bfabric-cli dataset show <NEW_DATASET_ID>
```

Or check the container in the B-Fabric web interface:

```
https://<bfabric-url>/project/show.html?id=<container-id>&tab=datasets
```
### Notes

- The container must exist and you must have write permissions for it
- File size is limited by the B-Fabric configuration
- Upload progress is displayed
- The dataset ID is returned after a successful upload
## Workflow Example

A complete workflow for downloading, processing, and re-uploading:

```bash
# 1. Download the dataset
bfabric-cli dataset download 12345 analysis.parquet --format parquet

# 2. Process the data (e.g., with Python or other tools)
# python process_data.py

# 3. Upload the processed dataset
bfabric-cli dataset upload parquet processed.parquet \
    --container-id 6789 \
    --name "Processed Analysis" \
    --description "Processed version of dataset 12345"
```
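Step 2 depends entirely on your analysis; as a dependency-free illustration, the sketch below treats rows as dicts and derives a normalized column (the column names are hypothetical):

```python
def process(rows):
    """Add a normalized intensity column to each row (illustrative)."""
    max_intensity = max(r["intensity"] for r in rows)
    return [dict(r, norm=r["intensity"] / max_intensity) for r in rows]

rows = [{"id": 1, "intensity": 50.0}, {"id": 2, "intensity": 100.0}]
print(process(rows)[0]["norm"])  # 0.5
```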
## Tips and Best Practices

### Choose the Right Format

- **Parquet**: best for large datasets; preserves data types; efficient storage
- **CSV**: universal and easy to share, but slower and carries no type information
- **TSV**: tab-separated; useful for tabular data whose values contain commas or other special characters
### File Size Considerations

```bash
# Check file size before uploading
ls -lh large_file.parquet

# For very large files, consider splitting or using streaming uploads
```
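The same pre-flight check can be scripted; a standard-library sketch (the 1 GiB threshold is an arbitrary example, not an actual B-Fabric limit):

```python
import os
import tempfile

SIZE_LIMIT = 1 << 30  # 1 GiB: an example threshold, not a real B-Fabric limit

def under_limit(path: str) -> bool:
    """Report the file size and return whether it is under the threshold."""
    size = os.path.getsize(path)
    print(f"{path}: {size / 1e6:.1f} MB")
    return size < SIZE_LIMIT

# Demonstrate on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
print(under_limit(f.name))  # True
os.unlink(f.name)
```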
### Naming Conventions

```bash
# Use descriptive names with dates
bfabric-cli dataset upload parquet results_2025-01-20.parquet \
    --container-id 1234 \
    --name "QC Results - 2025-01-20"
```
### Batch Processing

```bash
# Process multiple datasets in a loop
for dataset_id in 12345 12346 12347; do
    bfabric-cli dataset download "$dataset_id" "data_$dataset_id.parquet" --format parquet
    # Process data_$dataset_id.parquet
done
```
## Common Issues

### Upload Fails - Container Not Found

```
Error: Container with ID X not found
```

Solution: verify that the container exists and that you have access:

```bash
bfabric-cli api read project id <container-id>
```

### Download Fails - Format Not Supported

```
Error: Dataset format not supported
```

Solution: check the available formats or use a different one:

```bash
bfabric-cli dataset show <dataset-id>
# Look for format information in the output
```

### Large File Upload Times Out

Solution: for very large files, consider:

- Using a more efficient format (Parquet)
- Uploading during off-peak hours
- Contacting your B-Fabric admin about size limits
## See Also

- API Operations - Generic CRUD operations
- Workunits - Working with workunits (datasets are often linked to workunits)
- Python Dataset API - Using datasets in Python