The weird part of data science

Over the years, my thinking about the data science industry has gone through several phases. This post is my attempt to write those phases down and organize my thoughts; it may not have a very clear message at the end.

At the beginning……

When I first started working in the so-called data science field, I thought it was cool that I could finally become the ultimate idea man: think about the problem, apply the math/modeling, and then success & $ would come. At that time I thought modeling/math was the most important thing, and that I should study even harder.

Then I found that this was not the case. No one at work was interested in reading my math [damn!] or in using a complex solution when it was not needed. Actually, I myself now also hate “unnecessarily” [<- this is a keyword, don’t overlook it] complex solutions. Yeah, I know they sound cool on a resume.

Business is the most important!?

My mind shifted, and I started to think business was the most important thing. I tried to involve myself a little more in the business side and to control/plan projects to gain more “influence”.

It turned out that there are some real insights to be had from business/domain experts, but not many people really think about what they are doing. Even if you follow that route, it will not bring real impact; it will mostly make you a better player at climbing the corporate ladder.

Model becomes cool again

Interestingly, around that time deep learning started to take off and painted a very beautiful picture, one that seemed capable of bringing a revolution. I was interested in it and tried to think through different possibilities.

Then I joined a startup and found that the embarrassing part of data science was having to ask my engineer friends for help with everything, from collecting data to creating pipelines to building something on top of them. Implementation/engineering is the most important!

Engineering is the king

With that in mind, I came to the States to do my master’s in data science. After the previous lessons, I deliberately picked a practicum company that appreciates good engineering practices. Those were the good old days: I learned tons within a short amount of time. Then I started to work in the US……

Deliver first

Different companies are different, but the common core is that they ask you to deliver first and worry about tech debt later. While that may be OK in software engineering, the nature of data science work makes the situation more interesting, and that is what I keep thinking about.

Complexity of DS/ML projects is higher than software of a similar scale

When I think about software, I always think that what I am doing is managing complexity. Demands will change, but at any specific point in time they may stay fixed for a while, which makes it possible to introduce more tools and abstractions to manage that complexity.

But data science has at least two categories of complexity: software and data. This becomes very annoying because, on top of what we experience in software, we also need to deal with uncertain data. Whenever I see a 100+ line query, I already start to feel nervous. The query will follow our instructions and do what we asked it to do, but when it executes, who knows whether something weird is happening in the data itself? Maybe there are some ingestion problems?
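To make that nervousness concrete, here is a minimal sketch of the kind of sanity check that could run on the data before trusting a long query's output. Everything in it is hypothetical and made up for illustration: the `sanity_report` helper, the `order_id` / `user_id` / `amount` columns, and the sample rows. It only counts duplicated keys and missing values, which is far from a complete data quality setup.

```python
from collections import Counter


def sanity_report(rows, key, required):
    """Summarize basic data problems in a list of dict rows (hypothetical helper)."""
    # Count how many times each key value appears; more than once suggests
    # the same record was ingested twice.
    key_counts = Counter(row.get(key) for row in rows)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if k is not None and n > 1
    )
    # Count rows where a required column is absent, None, or an empty string.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in required
    }
    return {
        "row_count": len(rows),
        "duplicate_keys": duplicate_keys,
        "missing": missing,
    }


# Made-up sample rows standing in for a freshly ingested table.
rows = [
    {"order_id": 1, "user_id": "a", "amount": 10.0},
    {"order_id": 2, "user_id": "", "amount": 5.0},    # empty user_id
    {"order_id": 2, "user_id": "b", "amount": 5.0},   # order_id 2 appears twice
    {"order_id": 3, "user_id": "c", "amount": None},  # missing amount
]

report = sanity_report(rows, key="order_id", required=["user_id", "amount"])
print(report)
```

The query itself can be perfectly correct and still aggregate over rows like these, which is why a check of this kind probably belongs next to the query rather than in someone's head.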

This may not sound like a big deal [well, we can fix that], but that is an illusion because 1. the price of fixing these issues can be really big [maybe you need to redo everything] and 2. when you spot one issue, it usually implies many more. One of my favorite quotes, which I want to share with more people, is: “Amateurs talk about strategy and tactics. Professionals talk about logistics and sustainability in warfare.” Applied to DS/ML projects: amateurs talk about fancy algorithms and SOTA; professionals talk about data, business, infrastructure, engineering practice, and sustainability.

So it seems very clear that we should do things right and hold our practices to at least the level of software engineering [the good kind for sure, e.g. well-covered tests, automation tools, organized code; not spaghetti code].

Weird part

But in reality, why do people usually have the impression that DS/ML projects require fewer engineering skills? I have a few hypotheses about this:

  1. Job role: people are often confused about the job duties of data scientists, since the role is a mix of engineer, analyst, product manager, and operations. In reality, the spectrum is very wide, and a significant portion of people work on analytics [things like experimental design, product analytics, etc.] instead of developing algorithms. That doesn’t mean those jobs are less important; as I mentioned above, it is hard to find a good analytics DS who has good business insight and can guide the discussion. But they are very different roles and require different skills, even though people may have the same title.

  2. People: the DS/ML field is relatively new and had fewer hard requirements in the past, which opened up a lot of opportunities for people from outside the field to join. As they climb up in the ecosystem, they start to set the industry standard; after all, everyone has some path dependence.

  3. Project: there are not many high-impact DS/ML projects [I mean those that can become a company’s cash cow]. Many projects are just nice to have, and it makes no big difference whether you do them at a world-class level or with manual dirty scripts.

Therefore, things get tricky in reality: a field that requires lots of investment and infrastructure support to do well is using dirty, rushed methods most of the time. The root issue seems to be ROI: if a project doesn’t have a high ROI, why should we invest time and effort in improving it?

The last point brings my mind back around the loop: business [not office politics] becomes the most important thing again. How do you identify high-ROI projects, or companies with such potential? That’s one of the reasons I have become tired of chasing SOTA models and reading papers recently. Don’t ask what DS/ML algorithms can solve; find which problems are worth solving and big enough to cover my paycheck or make my fortune, ideally a big one :D