.. SPDX-FileCopyrightText: 2020 cusy GmbH
..
.. SPDX-License-Identifier: BSD-3-Clause
Managing data with DVC
======================
For data analysis, and especially for machine learning, it is extremely valuable
to be able to reproduce different versions of analyses that were performed with
different data sets and parameters. However, in order to obtain reproducible
analyses, both the data and the model (including algorithms, parameters,
:abbr:`etc. (et cetera)`) must be versioned. Due to the size of the data,
versioning data for reproducible analyses is a bigger problem than versioning
models. Tools such as `DVC `_ help with data management by
allowing users to transfer data to a remote data storage location using a
:doc:`Git <../git/index>`-like workflow. This simplifies the retrieval of
specific versions of data to reproduce an analysis.
DVC was developed to enable the sharing and traceable management of :abbr:`ML
(Machine Learning)` models and data sets. It uses its own system for storing
files with support for :abbr:`SSH (Secure Shell)` and :abbr:`HDFS (Hadoop
Distributed File System)`, among others.
.. tip::
`cusy seminar: Storing code and data in a versioned and reproducible manner
`_
.. seealso::
* `Get Started with DVC `_
* `Documentation `_
* `Git Repository `_
Comparison with related technologies
------------------------------------
git-annex
~~~~~~~~~
`git-annex `_ focuses more on discovering and
using datasets, which are then easily managed with Git. DVC, on the other hand,
stores the data generated at each step of the pipeline in :file:`.dvc` files,
which can then be managed by Git. DVC also provides handy tools for manipulating
and visualising data pipelines, see for example :doc:`dvc status `.
Finally, :ref:`dvc remote ` can also be used to specify external
dependencies.
Workflow management systems such as Airflow and Luigi
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DVC focuses on data science workflows and modelling; therefore, DVC pipelines
are much lighter, easier to create and modify than with `Airflow
`_ and `Luigi
`_. However, DVC lacks advanced
features such as execution monitoring, optimisation and fault tolerance. DVC is
also a pure command line tool with no graphical user interface, and it does not
run daemons or servers. `CML `_ attempts to fill some of these
gaps in a lightweight manner with GitHub, GitLab, or Bitbucket. However, DVC and
CML are well suited for iterative machine learning processes; and once a good
model has been found with the two, you are still free to integrate the pipeline
into Luigi or Airflow.
Installation
------------
DVC can be installed with :term:`uv`. Please note, however, that you must
explicitly specify the extras. These can be ``[ssh]``, ``[s3]``, ``[gs]``,
``[azure]``, and ``[oss]`` or ``[all]``. For ``ssh``, the command looks like
this:
.. code-block:: console
$ uv add dvc[ssh]
Alternatively, DVC can also be installed via other package managers:
.. tab:: Debian/Ubuntu
.. code-block:: console
$ sudo wget https://dvc.org/deb/dvc.list -O /etc/apt/sources.list.d/dvc.list
$ sudo apt update
$ sudo apt install dvc
.. tab:: macOS
.. code-block:: console
$ brew install iterative/homebrew-dvc/dvc
.. toctree::
:hidden:
init
data
pipeline
params
metrics
experiments
dag
reproduce
integration
fds