Data science is different
Author’s note: Told from the perspective of a naive, junior data scientist, this story was inspired by a similarly titled Vicki Boykis post. Thanks to Vicki and Bryce Codell for edits.
Data Team Onboarding
Install python locally
I entered brew install python into my terminal and got this:
Failure while executing: git clone --depth 1 --branch v0.53.3 https://github.com/caskroom/homebrew-cask.git /Library/Caches/Homebrew/brew-cask--git
Strange. I had never seen this error before in my master’s program or my data science bootcamp. Both places used Anaconda, so I tried that next. After the installation finished, I scrolled down the onboarding doc and discovered this helpful tidbit.
Do not use Anaconda to install python. Our preferred method is to use Homebrew or install directly from python.org.
I took a sip of coffee, strangely expecting caffeine to pacify rather than exacerbate my annoyance. I googled “uninstall Anaconda” expecting a one-line solution, but instead was amused to discover that I needed to install a separate package to uninstall it completely. Did there exist another uninstall package that uninstalled the first uninstall package? Why was this so complicated?
Next, I went to python.org. There were so many different versions. I was afraid of downloading the latest 3.9 version since I wasn’t sure it would play well with tensorflow. Was 3.8 the right choice? Should I get 3.7 just to be safe?
I went back to spamming brew install python. Part of me felt that the error would magically self-resolve if I mashed the keys hard enough. The dullness of the new MacBook keycaps inexplicably worsened my frustration.
I pinged my onboarding buddy, but I knew his meeting schedule likely wouldn’t allow him to respond quickly. My first day started to feel like a wash. I sent him one more SOS ping for good measure and opened up Hacker News. The third post was an interesting Stitch Fix article about their color extraction algorithm, which I proceeded to busy myself with.
My first OKR planning meeting consisted of a product manager and five engineers. They said they’d never had a data scientist on their team before, which quickly became obvious. Their primary objective was to build an “explore” page that allowed customers to discover new products. I think we sold handbags or jewelry or something.
The product manager pitched a simple scheme to order by top selling items in the past month. Surprised that nobody else critiqued this elementary strategy, I decided to speak up.
“This is actually a classic machine learning use case. What if, for example, we took a collaborative filtering approach where we serve users the products that people with similar purchase histories also bought?”
That’s a great idea, remarked the product manager. Can we explore that option?
“Yeah I can get started fleshing this out.” They hired me as a data scientist after all. It would’ve been a shame if we couldn’t find a machine learning use case somewhere.
My script was pretty straightforward. First, I collected a historical dataset of all users and their purchase history. From this dataset, I created a matrix with users as rows and items as columns. For each cell, I calculated how many times a given user had purchased that item. Then I wrote a function that took in a user ID as input and ran a cosine similarity search across all other users in the dataset. Finally, I grabbed the most similar user’s purchase history and returned items in their list that the given user hadn’t purchased. It looked something like this:
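A minimal sketch of the steps just described, not the script itself. It assumes purchases arrive as (user_id, item_id) pairs, and the names build_user_item_counts, cosine_similarity, and recommend are hypothetical. Each row of the user-item matrix is stored as a sparse Counter rather than a dense array.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_user_item_counts(purchases):
    """Map each user to a Counter of item -> purchase count (one matrix row)."""
    matrix = defaultdict(Counter)
    for user_id, item_id in purchases:
        matrix[user_id][item_id] += 1
    return matrix


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[item] for item, count in a.items() if item in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_id, matrix):
    """Return the most similar user's items that user_id has not purchased."""
    target = matrix.get(user_id, Counter())
    others = [u for u in matrix if u != user_id]
    if not others:
        return []
    # Brute-force scan: compare the target against every other user.
    nearest = max(others, key=lambda u: cosine_similarity(target, matrix[u]))
    return sorted(item for item in matrix[nearest] if item not in target)


# Made-up sample data, for illustration only.
purchases = [
    ("ann", "tote"), ("ann", "clutch"),
    ("bob", "tote"), ("bob", "clutch"), ("bob", "satchel"),
    ("cat", "necklace"),
]
print(recommend("ann", build_user_item_counts(purchases)))
```

Every call to recommend compares the given user against every other user in the dataset, so the cost of a single recommendation grows with the total number of users.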
This only took me a few hours to write, but what frustrated me most was working with other teams. To get purchase history data, I emailed the sales analytics team with a clear specification of what I needed for my model. First, they tried to give me aggregate numbers, so I was forced to spend a few days emailing back and forth about why I needed the user-level detail.
Once I finally received the correct data and its corresponding query, I was reminded why I didn’t care for SQL. I found it inelegant as a language and inadequate for my purposes as a scientific tool. Dealing with an analytics team was annoying, but was infinitely more pleasurable than wading through hundreds of tables named “orders_v2” and “orders_final_final”. I told my manager that I’d be much more productive if I had a dedicated data analyst, and she said she’d look into it.
I sent my completed script over to the data engineering team. The next day, they told me they had issues running it on data from a longer time period. I had only tested on a small sample of data on my local machine, so I suggested they use a beefy cloud machine that could handle the scale. Instead, they asked me if I could somehow make my cosine similarity function more efficient.
“Isn’t it the data engineer’s responsibility to own productionization of models? I’ve completed my portion of the project, I don’t know how much more I can contribute,” I replied. Perhaps a little terse, but I didn’t like lazy people who tried to get me to do their job.
Besides, I was getting busier every day. I independently stood up and managed our team’s weekly journal club where we discussed the latest papers on deep learning and general AI technology. Attendance as of late had been flagging, and I started getting the nagging sense that people weren’t reading the papers at all. This company’s lack of intellectual curiosity was disturbing, and I honestly didn’t know where they would have been without me.
The product manager called a team all-hands to discuss why the “explore” product was over a month behind schedule. After some hemming and hawing from the data engineers about the need for integration testing, one of them made a side comment about how getting the collaborative filtering logic to work was a large reason for the delay.
“I delivered my collaborative filtering code to you four weeks ago, as dictated by the product roadmap,” I said as diplomatically as one could while being not so subtly thrown under the bus. “Not to mention the MAP@K score for my algorithm was 0.93.” The product manager nodded, though it was obvious he had no idea what that meant.
Getting the computations to run in a performant way was killing the responsiveness of the app, said a data engineer. On top of that, they claimed that sometimes the logic returned strange results, such as items that the company no longer sold. Was I sure that the upstream data was bug-free as well?
“If you have a question about the underlying data, talk to the sales analytics team. They’re the ones who pulled the data.”
You mean to tell me that you didn’t even write the SQL that powered your code, scoffed one of the sassier data engineers. How do we know any part of what you’ve given us is reliable?
I took a deep breath. “Let me clear this up for everyone in the room here, since there seems to be some serious misalignment on responsibilities. I am a data scientist. I did my master’s thesis on image recommendation systems. I solve the hard problems. If there’s a problem with the algorithm, talk to me. If there’s a problem with the input data, talk to the analytics team. If there’s a problem with performance, talk to the data engineering team. If you want me to write SQL or you want me to stand up a BigQuery instance, so be it, but honestly I feel like it wouldn’t be the best usage of anybody’s time.”
To cut scope, the team decided to revert to the original strategy of ranking by top-selling items. That Friday, I put in my notice. In hindsight, it was never a good fit. Their job description read “Data Scientist” with requirements of “Master’s degree or higher; experience in deep learning frameworks a plus”, but it should have read more like “Associate data engineer”. I was more than glad to get out and find a company where management was actually competent enough to articulate what they were looking for.
Finding a new job wasn’t difficult. I quit, interviewed, and signed an offer within the space of two weeks. On my first day at the new job, I was relieved to find that my workstation came pre-installed with Anaconda plus a host of other applications. I knew that choosing a larger and more established enterprise was a good decision.
We’re really excited to get you started, said my new manager.
“Me too, I can’t tell you how happy I am to be here, especially after my last role,” I confessed. My manager smiled wryly. “So what’s the first thing I’m going to be working on? I took a look at our mobile home page this weekend and actually came up with a list of some improvements we could make to our recommendations system. Do you want me to share that with you?”
Right, so have you heard of NetSuite, he asked.
“Sounds familiar. Isn’t it some kind of web hosting platform?”
No, that’s a different company you’re thinking of. Well, NetSuite is our accounting software and there’s a large initiative to bring our accounting practices into the modern age. See, we have a lot of these one-off Excel data reports we get from a lot of our partners that our accounting team has to manually clean and upload to the system every month. We figured that someone with your data science chops could help us automate this whole process. How does that sound?