Browsing for something to read? CrowdFlower recently released its latest Data Scientist Report. The SaaS data enrichment company has made an annual tradition out of surveying data scientists to gain insight into their growing, enigmatic field.
2016’s report conveyed an important but simple message — messy data takes up too much of data scientists’ time. The new report echoes that lesson, but builds on it with more comprehensive information.
“Data on the Data”
While it’s interesting to learn about data scientists’ favorite tasks (or what careers they think are sexy), the data itself took center stage this year. The report considers whether data scientists are using structured or unstructured data, how that data is obtained, and what kinds of data they actually work with. One big takeaway: text data is still a huge area of focus. Of the survey respondents, 91% said they work with text data, while image data was a distant second at 33%.
A Growing Crowd
Last year, respondents were asked if they thought there was a “shortage” of data scientists. Interesting, but slightly vague. This year, the report’s authors gave vital context by considering the growth of the field. How’d they do it? They started with last year’s report, in which 25% of data scientists said they had been in their jobs for two years or less. In this year’s answers, that “two years or less” group had grown to 35%, a jump of 10 percentage points. So either more people are graduating with data science degrees, or a whole lot of senior data science folks quit last year.
Training on Top
In keeping with the data-centric focus of this year’s report, data scientists were asked to sound off on the importance of training data in their AI projects. It turns out that clean, reliable training data is incredibly valuable, and that a lack of it is considered a huge roadblock. To put it more strikingly, more respondents said they’d rather break a leg than lose quality training data.
Questions for Next Year
Looking at all the ways 2017’s report improved on its predecessor, we can’t help but put together a wishlist for next year. How do different data scientists define “quality data”? Which types of data — images, text, etc. — are hardest to clean and enrich? And how do scientists working today believe we can grow the field even more constructively? Whether these questions are answered or not, we look forward to learning more about the state of things in 2018.