Inside DagsHub: The GitHub for data science and machine learning – Analytics India Magazine

  • Lauren
  • January 31, 2022

Data science and machine learning rely on complex mathematical concepts and programming tools to build algorithms that inform business decisions. Collaboration and discussion while building these projects can be a great help to data scientists and machine learning practitioners. Just as GitHub exists for open-source collaboration on software development, DagsHub, a platform launched in 2019, is becoming increasingly popular as a common ground where data scientists and machine learning engineers come together to build their work.

“It is like GitHub for data science and machine learning,” is how DagsHub describes itself. It is a web platform for data version control and collaboration for data scientists and machine learning engineers and is based on open-source tools, optimised for data science and oriented towards the open-source community.

The Tel Aviv-based company was launched in 2019 by Dean Pleban and Guy Smoilovsky. To date, it has raised over three million dollars across two funding rounds, in 2019 and 2020. A few weeks ago, the company launched DagsHub 2.0 and announced that users can now annotate data on DagsHub and hold discussions on any file on the platform.

“Today, we are super excited to launch DagsHub 2.0. We’re updating DagsHub with amazing new #MLOps capabilities, enabling you to close the #data loop efficiently, zero DevOps required, as well as upgrading the data teamwork experience!” – DagsHub (@TheRealDAGsHub), January 4, 2022

Home for open source data science

Data science teams can find it tough to collaborate. Explaining why it started the platform, DagsHub says that data science workflows differ from software development workflows and that existing tools are not suitable for them.

The founders add, “DagsHub was created to be a home for open-source data science, where everyone can contribute and make the research and development process transparent, inclusive and better for everyone; to help developers in the fields of machine learning and data science create and learn from each other. We believe that technology should help us focus on tackling the most interesting and important challenges in life.”

Built on DVC

Data science and machine learning projects often require versioning large files, which Git does not handle well. DagsHub says that Git and git-lfs do not version the data pipeline. As a result, if a step in the data pipeline is modified, the people working on the project will not know that the later stages of the pipeline need to be reproduced.

According to its website, DagsHub is built on Git and DVC, an open-source command-line tool for data and pipeline versioning. A user can send another person a link to their DagsHub repo, and the recipient can then explore the project and download its data and models from any past version, experiment, or branch without running any code.
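To illustrate the pipeline-versioning idea, here is a minimal Python sketch of how a tool could detect that later stages need to be reproduced after an input changes. It is not DVC's or DagsHub's implementation; the stage names, file names, and helper functions are all hypothetical and exist only for this example.

```python
import hashlib

# Hypothetical pipeline: each stage lists its dependencies, which are either
# input files or the names of upstream stages. Stages are listed in run order.
PIPELINE = {
    "prepare": ["raw.csv"],
    "train": ["prepare", "params.yaml"],
    "evaluate": ["train"],
}


def stage_fingerprint(stage, files, pipeline):
    """Hash everything a stage depends on, following upstream stages."""
    digest = hashlib.sha256()
    for dep in pipeline[stage]:
        digest.update(dep.encode())
        if dep in pipeline:
            # Upstream stage: fold in its fingerprint recursively.
            digest.update(stage_fingerprint(dep, files, pipeline).encode())
        else:
            # Input file: fold in a hash of its contents.
            digest.update(hashlib.sha256(files[dep]).hexdigest().encode())
    return digest.hexdigest()


def snapshot(files, pipeline):
    """Record a fingerprint per stage, like a lock file committed with the code."""
    return {stage: stage_fingerprint(stage, files, pipeline) for stage in pipeline}


def stale_stages(recorded, files, pipeline):
    """Return the stages whose current fingerprint differs from the recorded one."""
    current = snapshot(files, pipeline)
    return [stage for stage in pipeline if current[stage] != recorded.get(stage)]


files = {"raw.csv": b"a,b\n1,2\n", "params.yaml": b"lr: 0.1\n"}
lock = snapshot(files, PIPELINE)

# Changing only the training parameters leaves "prepare" up to date.
files["params.yaml"] = b"lr: 0.01\n"
print(stale_stages(lock, files, PIPELINE))  # ['train', 'evaluate']

# Changing the raw data invalidates every stage.
files["raw.csv"] = b"a,b\n1,3\n"
print(stale_stages(lock, files, PIPELINE))  # ['prepare', 'train', 'evaluate']
```

Because each stage's fingerprint includes the fingerprints of the stages before it, a change anywhere upstream shows up in every later stage. Plain file versioning cannot provide this signal on its own.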

Language and library agnostic

The company website lists the features that DagsHub offers for data science and machine learning projects. Some of the most important ones are:

Commenting – One can take notes on model architectures, discuss annotations with others, and review another team member’s contribution to a project.

Version everything – One can explore the relationships between data versions and experiments and see a graph of the project history. On finding the result they want, a user can get the code as well as the configuration with just one command.

Annotations – DagsHub Annotations creates a Label Studio instance with a single click. It is automatically synced with the datasets tracked on DagsHub Storage.

Language and library agnostic – It works for projects using Python, R, Keras and PyTorch.