Kaggle has published a report on the State of Machine Learning and Data Science for 2020. The report is based on survey responses from over two thousand users currently employed as data scientists.
The report and underlying survey were described on Kaggle’s website. Kaggle opened the 35-question survey for 3.5 weeks in October 2020 and collected over 20,000 responses. The Enterprise Executive Summary Report focuses on the 13% of respondents who identified their job title as “data scientist.” The report identifies several key results about data scientist demographics as well as popular data science and machine learning technologies. As with the three previous annual surveys, Kaggle has also released the anonymized response data. According to Kaggle:
“In our fourth year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry.”
The report contains graphs and analysis of several attributes of the survey respondents, grouped into three areas: respondent profile, education, and experience; employment and work environment; and technologies and platforms. The report notes that the “vast majority” of data scientists are under 35 years of age, two-thirds have a graduate degree, and most have fewer than 10 years of coding experience. Around 55% have fewer than 3 years of experience with machine learning.
The survey contained several questions about technology choices. Because these questions allowed multiple answers, the percentages for a given question can total more than 100%. The most popular IDE among data scientists was Jupyter, used by 74% of respondents; Visual Studio was second at 43%, up from 30% in 2019. PyCharm and RStudio were each used by about 30% of respondents. In response to questions about frameworks and libraries, over 80% reported using scikit-learn, and around 50% used Google’s deep-learning framework TensorFlow. PyTorch, another popular deep-learning framework developed by Facebook, was used by 31%, up from 26% in 2019.
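To see why multiple-answer questions can produce totals above 100%, consider the following minimal Python sketch. The respondents and their selections are hypothetical toy values, not figures from the Kaggle data, and the helper name `selection_shares` is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical toy responses, NOT taken from the Kaggle survey data.
# Each respondent may select any number of IDEs.
responses = [
    {"Jupyter", "Visual Studio"},
    {"Jupyter"},
    {"Jupyter", "PyCharm", "RStudio"},
    {"Visual Studio", "PyCharm"},
]


def selection_shares(responses):
    """Return the percentage of respondents who selected each option."""
    counts = Counter(option for picks in responses for option in picks)
    total = len(responses)
    return {option: 100.0 * n / total for option, n in counts.items()}


shares = selection_shares(responses)
for option, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {pct:.0f}%")

# Each share is computed against the number of respondents, not the number
# of selections, so the shares need not sum to 100%.
print(f"Sum of shares: {sum(shares.values()):.0f}%")
```

In this toy example, four respondents make eight selections in total, so the per-option shares add up to 200%.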
The most popular machine learning algorithm was linear regression, used by over 80% of data scientists, with decision trees and gradient boosting in close second and third place, respectively. Various neural network architectures were reported separately, with 43% using a convolutional neural network (CNN), 30% a recurrent neural network (RNN), and 15% a Transformer neural network.
Most data scientists reported using a public cloud provider, led by Amazon Web Services (AWS) at nearly 50%. About one-third reported using Google Cloud Platform (GCP), and 29% used Microsoft Azure. Basic compute infrastructure was the most common service used, with Amazon EC2 used by 40%. Function-as-a-service was also a popular choice: 21% used AWS Lambda, while the GCP and Azure equivalents were at 12% and 9%, respectively. Container services had slightly less adoption, with AWS again the leader at 14%. Just over 17% were not using any cloud platform, down from 25% a year earlier. One Twitter user noted that:
“[This] is most likely to indicate the entire market of cloud computing applications is not saturated yet.”
The raw survey response data is available for download from Kaggle’s site, along with the results from previous years’ surveys.
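Readers who download the response data may want to reproduce a subset such as the report's "data scientist" group. The sketch below shows one way to do this with Python's standard `csv` module. The column names (`respondent_id`, `job_title`), the sample rows, and the helper `share_with_title` are all hypothetical; the actual file layout and column names may differ and should be checked against the downloaded data.

```python
import csv
import io

# Hypothetical stand-in for a downloaded responses file. The real file's
# column names and layout are assumptions here and may differ.
SAMPLE_CSV = """respondent_id,job_title
1,Data Scientist
2,Student
3,Software Engineer
4,Data Scientist
5,Data Analyst
"""


def share_with_title(csv_file, title, column="job_title"):
    """Return (matching_count, total_count, percentage) for one job title."""
    rows = list(csv.DictReader(csv_file))
    wanted = title.strip().lower()
    matching = sum(1 for row in rows if row[column].strip().lower() == wanted)
    pct = 100.0 * matching / len(rows) if rows else 0.0
    return matching, len(rows), pct


matching, total, pct = share_with_title(io.StringIO(SAMPLE_CSV), "Data Scientist")
print(f"{matching} of {total} respondents ({pct:.0f}%)")
```

To run this against a real download, the `io.StringIO(SAMPLE_CSV)` argument could be replaced with a file opened via `open(path, newline="", encoding="utf-8")`, with `column` set to whichever column holds the job title.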