Manage data
Add data and directories
With DVC, you can version files, ML models, directories, and intermediate results alongside Git without checking the file contents into Git:
$ uv run dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
$ uv run dvc add data/data.xml
This adds the file data/data.xml to data/.gitignore and writes
the meta information to data/data.xml.dvc.
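To make the idea of "meta information" more concrete, the following is a minimal Python sketch of the kind of data a .dvc file records for a tracked file. It assumes DVC's default MD5 hashing over the raw file bytes. The function names file_md5 and pointer_entry are hypothetical, and a real .dvc file may contain additional fields.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pointer_entry(path):
    """Build a dict resembling one output entry of a .dvc file (illustrative only)."""
    path = Path(path)
    return {
        "md5": file_md5(path),        # content hash that identifies this version
        "size": path.stat().st_size,  # file size in bytes
        "path": path.name,            # file name relative to the .dvc file
    }
```

Calling pointer_entry("data/data.xml") on a tracked file should yield values comparable to those in data/data.xml.dvc. Because only this small entry goes into Git, the repository stays small however large the data becomes.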
To version your project data with Git, stage and commit data/.gitignore and data/data.xml.dvc:
$ git add data/.gitignore data/data.xml.dvc
$ git commit -m ":monocle_face: Add data to dvc"
Saving and retrieving data
The tracked data can be copied from the local cache of your project to the configured remote storage location with
$ uv run dvc push
If you want to retrieve the data that matches the .dvc files in your current Git checkout, for example after pulling newer commits, you can do so with
$ uv run dvc pull
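If you want to call these two commands from a script, for example in an automated job, a small wrapper may help. The following is a sketch only: the function dvc_sync and its runner parameter are hypothetical, and it assumes that uv and dvc are available on the PATH and that a remote has already been configured.

```python
import subprocess


def dvc_sync(action, targets=(), runner=subprocess.run):
    """Run `uv run dvc push` or `uv run dvc pull`, optionally limited to targets."""
    if action not in ("push", "pull"):
        raise ValueError(f"unsupported action: {action!r}")
    # Without targets, DVC synchronises everything tracked in the project.
    command = ["uv", "run", "dvc", action, *targets]
    # check=True raises CalledProcessError if DVC exits with a non-zero status.
    return runner(command, check=True)
```

For example, dvc_sync("push") uploads all tracked data, while dvc_sync("pull", ["data/data.xml.dvc"]) retrieves only the file described by that .dvc file. The runner parameter exists so that the command can be replaced with a stand-in during testing.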
Importing and updating data
As an alternative to dvc get, you can also import data and models from
another project using dvc import, for example:
$ uv run dvc import https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'
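This loads the file from the dataset-registry into our data
directory, adds it to data/.gitignore, and creates data/data.xml.dvc.
Unlike dvc get, dvc import also records the source repository in the
.dvc file, so DVC can later check whether the source has changed.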
You can use dvc update to update these data sources before reproducing a
pipeline that depends on them, for example:
$ uv run dvc update data/data.xml.dvc
'data/data.xml.dvc' didn't change, skipping
Deleting data
If you want to remove files or directories from DVC management, you can do so with dvc remove:
$ uv run dvc remove data/data.xml.dvc
You can then use dvc gc -w to delete from the cache everything that is no
longer referenced in the current workspace, including the removed file and
its previous versions.