Manage data

Add data and directories

With DVC, you can version files, ML models, directories, and intermediate results alongside your Git history without checking the file contents into Git:

$ uv run dvc get https://github.com/iterative/dataset-registry \
    get-started/data.xml -o data/data.xml
$ uv run dvc add data/data.xml

This adds data/data.xml to data/.gitignore, so that Git ignores the file itself, and writes its metadata to data/data.xml.dvc.
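
To check the result of dvc add, a small shell function can be used. The following is only a sketch: the name is_dvc_tracked is hypothetical, and it assumes that DVC writes the ignore entry as a slash followed by the file name, which is what current versions appear to do.

```shell
# Hypothetical helper: succeed if a file looks like it is tracked by DVC.
is_dvc_tracked() {
    # $1: path of the data file, for example data/data.xml
    # The metadata file must exist next to the data file ...
    [ -f "$1.dvc" ] || return 1
    # ... and the data file must be listed in the .gitignore of its directory.
    # Assumption: DVC writes the entry as "/<file name>".
    grep -qxF "/$(basename "$1")" "$(dirname "$1")/.gitignore"
}
```

After the commands above, is_dvc_tracked data/data.xml should succeed, while it fails for a file that was never added.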

See also: .dvc Files

To manage different versions of your project data with Git, simply add data/.gitignore and data/data.xml.dvc:

$ git add data/.gitignore data/data.xml.dvc
$ git commit -m ":monocle_face: Add data to dvc"
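
The two steps, dvc add followed by a Git commit of the metadata, can be combined. The function below is a minimal sketch, not part of DVC: the name track_data and the default commit message are assumptions made for illustration.

```shell
# Hypothetical helper: track a file with DVC and commit only its metadata to Git.
track_data() {
    # $1: path of the data file, $2: optional commit message
    # dvc add writes <file>.dvc and updates .gitignore in the file's directory
    uv run dvc add "$1" || return 1
    # Stage the two small text files, not the data itself
    git add "$(dirname "$1")/.gitignore" "$1.dvc" || return 1
    git commit -m "${2:-Add $1 to dvc}"
}
```

It could be called as track_data data/data.xml; the data file itself never enters the Git history.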

Saving and retrieving data

The tracked data can be uploaded from the local DVC cache to the remote storage location with

$ uv run dvc push

To retrieve the data referenced by the .dvc files in your working directory, for example after git pull has brought in newer versions, use

$ uv run dvc pull
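
Because the .dvc files are versioned with Git, an older state of the data can be restored by checking out the corresponding .dvc file and pulling again. The following sketch assumes that this version was pushed to the remote storage earlier; the function name restore_data_version is hypothetical.

```shell
# Hypothetical helper: bring a tracked file back to the state recorded at a Git revision.
restore_data_version() {
    # $1: Git revision (commit, tag or branch), $2: path of the tracked data file
    # Check out the .dvc file as it was at that revision ...
    git checkout "$1" -- "$2.dvc" || return 1
    # ... then fetch the matching content and place it in the workspace.
    uv run dvc pull "$2.dvc"
}
```

For example, restore_data_version HEAD~1 data/data.xml would restore the file as it was one commit earlier.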

Importing and updating data

As an alternative to dvc get, you can also import data and models from another project using dvc import, for example:

$ uv run dvc import https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'

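This downloads the file from the dataset-registry into our data directory, adds it to data/.gitignore, and creates data/data.xml.dvc. Unlike dvc get, dvc import also records the source repository and its revision in the .dvc file, so that the data can be updated later.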

You can use dvc update to update these data sources before reproducing a pipeline that depends on them, for example:

$ uv run dvc update data/data.xml.dvc
'data/data.xml.dvc' didn't change, skipping
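
dvc update also accepts a --rev option to move an import to a specific Git revision of the source repository. A minimal sketch, in which the function name update_import is hypothetical:

```shell
# Hypothetical helper: update an imported file to a given revision of its source repository.
update_import() {
    # $1: .dvc file created by dvc import, $2: Git revision in the source repository
    uv run dvc update --rev "$2" "$1" || return 1
    # Stage the changed .dvc file so that the new revision is recorded in Git
    git add "$1"
}
```

A call such as update_import data/data.xml.dvc main would follow the main branch of the source repository; whether a branch of that name exists there is an assumption.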

Deleting data

To stop tracking files or directories with DVC, use dvc remove:

$ uv run dvc remove data/data.xml.dvc

You can then use dvc gc -w to delete from the cache everything that is no longer referenced in your workspace, including previous versions of the removed files.