Data Version Control (DVC)
- DVC is an open-source tool that serves as a powerful asset in the machine learning project toolkit, with a primary focus on data versioning.
- Data versioning is a critical aspect of any ML project. It allows you to track changes and updates in your datasets over time, ensuring you can always recreate, compare, and reference specific dataset versions used in your experiments.
- In this lab tutorial, we will be utilizing DVC with Google Cloud Storage to enhance data versioning capabilities, ensuring efficient data management and collaboration within your machine learning project.
Creating a Google Cloud Storage Bucket
- Navigate to Google Cloud Console.
- Ensure you’ve created a new project specifically for this lab.
- In the Navigation menu, select “Cloud Storage,” then go to “Buckets,” and click on “Create a new bucket.”
- Assign a unique name to your bucket.
- Select the region as `us-east1`.
- Proceed by clicking "Continue" until your new bucket is successfully created.
- Once the bucket is created, we need credentials to connect the GCP remote to the project. Go to the "IAM & Admin" service and select "Service Accounts" in the left sidebar.
- Click the "Create Service Account" button to create a new service account that you'll use to connect DVC to your project in a bit. Add a name and ID for this service account and keep all the default settings; we've chosen "lab2" for the name. Click "Create and Continue" to reach the permissions settings, select "Owner" in the role dropdown, and click "Continue."
- Then add your user to have access to the service account and click "Done." You'll be redirected to the "Service accounts" page, where you can click "Actions" and go to "Manage keys" for this service account.
- Once you've been redirected, click the "Add Key" button; this brings up the credentials you need to authenticate your GCP account with your project. Download the credentials in JSON format and securely store the file. This file will serve as the authentication mechanism for DVC when connecting to Google Cloud.
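As an aside, the downloaded key can also be exposed through Google Cloud's standard environment variable, which the client libraries (including the one DVC uses for `gs://` remotes) pick up automatically. A minimal sketch, assuming a hypothetical key location of `~/keys/lab2-key.json`:

```shell
# Point Google Cloud client libraries (and therefore DVC's gs:// remote)
# at the downloaded service-account key. The path below is a placeholder;
# use wherever you actually stored the JSON file.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/lab2-key.json"
```

To make this persistent, you would typically add the line to your shell profile rather than exporting it per session.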
Installing DVC with Google Cloud Support
- Ensure you have DVC with Google Cloud support installed on your system by using the following command: `pip install dvc[gs]`
- Note that, depending on your chosen remote storage, you may need to install optional dependencies such as `[s3]`, `[azure]`, `[gdrive]`, `[gs]`, `[oss]`, or `[ssh]`. To include all optional dependencies, use `[all]`.
- Run this command to set up your Google Cloud bucket as the default remote storage: `dvc remote add -d myremote gs://<mybucket>`
- In order for DVC to be able to push and pull data from the remote, you need to have valid GCP credentials.
- Run the following command for authentication, using your remote's name (`myremote` above) and the location of the JSON key file you downloaded earlier: `dvc remote modify myremote credentialpath <YOUR JSON TOKEN LOCATION>`
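Taken together, the setup steps above can be sketched as one sequence. The bucket name and key path are placeholders for your own values; this is remote configuration, so it only has an effect inside an initialized DVC project:

```shell
pip install "dvc[gs]"                             # DVC with Google Cloud support
dvc remote add -d myremote gs://mybucket          # -d marks this remote as the default
dvc remote modify myremote credentialpath ~/keys/lab2-key.json
dvc remote list                                   # sanity check: prints the configured remote
```

Quoting `"dvc[gs]"` avoids shells that treat the square brackets as glob patterns.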
Tracking Data with DVC
- Ensure you have downloaded the required data and placed it in the “data” folder, renaming the file to “CC_GENERAL.csv.”
- To initiate data tracking, execute the following steps:
- Run the `dvc init` command to initialize DVC for your project. This creates a `.dvc/` directory that stores metadata and configuration details. Once the remote is configured, the `.dvc/config` file will look something like this (in our setup the remote is named `lab2` and the bucket `ie7374`):

```
[core]
    remote = lab2
['remote "lab2"']
    url = gs://ie7374
```

- Next, use `dvc add data/CC_GENERAL.csv` to instruct DVC to start tracking this specific dataset.
- To ensure version control, add the generated `.dvc` file to your Git repository with `git add data/CC_GENERAL.csv.dvc`.
- Also, include the `.gitignore` file located in the "data" folder in your Git repository by running `git add data/.gitignore`.
- To complete the process, commit these changes with Git to record the dataset tracking configuration.
- To push your data to the remote storage in Google Cloud, use the following DVC command: `dvc push`
- This command will upload your data to the Google Cloud Storage bucket specified in your DVC configuration, making it accessible and versioned in the cloud.
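The whole tracking flow above can be sketched as a single sequence (file and remote names follow the lab's example; the commands assume an existing Git repository with DVC installed):

```shell
dvc init                                      # one-time setup; creates the .dvc/ directory
dvc add data/CC_GENERAL.csv                   # hash the data, write data/CC_GENERAL.csv.dvc
git add data/CC_GENERAL.csv.dvc data/.gitignore
git commit -m "Track CC_GENERAL.csv with DVC"
dvc push                                      # upload the cached data to the GCS remote
```

Note that Git only ever stores the small `.dvc` pointer file; the dataset itself goes to the bucket.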
Handling Data Changes and Hash Updates
Whenever your dataset undergoes changes, DVC will automatically compute a new hash for the updated file. Here’s how the process works:
- Update the Dataset: Replace the existing “CC_GENERAL.csv” file in the “data” folder with the updated version.
- Update DVC Tracking: Execute `dvc add data/CC_GENERAL.csv` again to update DVC with the new version of the dataset. When DVC computes the hash for the updated file, it will differ from the previous hash, reflecting the changes in the dataset.
- Commit and Push: Commit the changes with Git and push them to your Git repository. This records the update to the dataset, including the new hash.
- Storage in Google Cloud: When you run `dvc push`, DVC uploads the updated dataset to the Google Cloud Storage bucket specified in your DVC configuration. Each version of the dataset is stored as a distinct object within the bucket, organized for easy retrieval.
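The version identity DVC relies on here is simply an MD5 content hash of the file. A DVC-free sketch of why an edit produces a new version, using `md5sum` on a toy file (file name and contents are made up for illustration):

```shell
# Write two versions of a toy dataset and compare their content hashes.
printf 'id,balance\n1,100\n' > toy.csv
md5sum toy.csv               # hash of the first version
printf 'id,balance\n1,250\n' > toy.csv
md5sum toy.csv               # different hash: the content changed
rm toy.csv
```

The same principle means an unchanged file re-added with `dvc add` produces the identical hash, so no new storage is used.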
Reverting to Previous Versions with Hashes
To revert to a previous dataset version:
- Checkout Git Commit: Use Git to check out the specific commit where the desired dataset version was last committed. For example, run `git checkout <commit-hash>`.
- Use DVC: After checking out the Git commit, run `dvc checkout` to retrieve the dataset version corresponding to that commit. DVC uses the stored hash to identify and fetch the correct dataset version associated with that commit.
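If you only want the old data back, without moving your whole working tree to the old commit, a common variant of the steps above restores just the pointer file; `<commit-hash>` remains a placeholder for the commit that contains the older dataset version:

```shell
git checkout <commit-hash> -- data/CC_GENERAL.csv.dvc   # restore only the old .dvc pointer file
dvc checkout data/CC_GENERAL.csv                        # fetch the matching data from cache/remote
```

This keeps the rest of your repository at the latest commit while swapping in the earlier dataset.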
💡Note: Follow this tutorial to learn more about DVC.