Summer at Microsoft: A brief learnings post




As the summer holidays come to an end and the school start date creeps closer, I thought it would be a good idea to reminisce about my 3-month summer internship at Yammer, Microsoft. This might well be my last internship, and I wanted to do it justice with a proper blog post.

A bit of background: I planned to intern at Yammer, Microsoft as a Data & Applied Scientist Intern from May 18th to August 7th at their San Francisco office. Yammer is a social network for the workplace, aimed specifically at organization-wide communication, and comes under the Office 365 bundle. I was super excited to visit sunny California, and then COVID happened, so everything went virtual.

I worked on improving and scaling Yammer’s group recommendation and home feed recommendation models. This involved performing analysis and using the insights from it to improve the models, whether through features, data, or adding entirely new models to the mix. I can’t go into the details of the work I did due to the NDA, nor do I want to bore you with the % improvement numbers that work for recruiters. Instead, I thought I would make this about using Machine Learning in industry and how that differs from academia.

Working on this for 3 months led to a huge shift in how I view machine learning and the different challenges it brings when you try to deploy it. This is completely different from academic research or coursework, where the sole goal is improving algorithms, not necessarily looking at their scalability or their ability to provide the personalization that social media requires.

1. Data matters more than the choice of model

Social media sites generate a lot of data, I mean like a lot of data (think in terms of 100s of TBs), and training on the entire dataset is not feasible. The data has to be subsampled before being fed into the ML model. The type of subsampling you use has a great influence on the performance of the model, and incorrect subsampling can lead to inflated scores. Which subsampling to choose depends on understanding how the product is used and what type of recommendations the user wants. For example, the MSFT data on Yammer is completely different from the EY data, and there are no interactions between users across those networks. Then the question becomes: can we use this information to intelligently reduce the data we train on? A similar consideration applies at prediction time. For each user, it’s not possible to score all the threads, so how you construct the candidate thread pool has a great impact on the content being served to users. ‘Garbage in, garbage out’ is really a thing in data science, and at every juncture, whether infrastructure, feature engineering, or modelling, one should make sure that appropriate data is being used, both in terms of the data distribution and the assumptions being made about the data.
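To make the subsampling idea concrete, here is a minimal sketch of one possible approach: cap the number of rows taken from each network so that one huge tenant doesn't swamp the training set. This is an illustration only, not what was actually done at Yammer; the `network`, `user` and `thread` fields, the network names and the cap of 50 are all made up for the example.

```python
import random
from collections import defaultdict


def subsample_per_network(interactions, max_per_network, seed=0):
    """Keep at most `max_per_network` rows from each network.

    `interactions` is a list of dicts with a hypothetical "network" key.
    """
    rng = random.Random(seed)  # local RNG so the sample is repeatable
    by_network = defaultdict(list)
    for row in interactions:
        by_network[row["network"]].append(row)

    sample = []
    for network in sorted(by_network):  # sorted for a stable output order
        rows = by_network[network]
        if len(rows) > max_per_network:
            rows = rng.sample(rows, max_per_network)
        sample.extend(rows)
    return sample


# Toy data: one large network and one small one (names are invented).
interactions = (
    [{"network": "big_corp", "user": f"u{i}", "thread": f"t{i % 7}"} for i in range(1000)]
    + [{"network": "small_team", "user": f"v{i}", "thread": f"s{i % 3}"} for i in range(20)]
)

sample = subsample_per_network(interactions, max_per_network=50, seed=42)
print(len(sample))
```

Because users in different networks never interact, grouping by network before sampling loses no cross-network signal. Whether a flat cap is the right rule depends on the product; a per-network fraction or a recency window would be equally plausible choices.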

2. Interpretability is important

Your users will ask why they are seeing XYZ post, and debugging will be hell if you have black-box models and no good way of explaining why some data points are scored higher than others. For example: why does user A consistently see posts about or related to XYZ when they clearly like posts related to ABC? Having at least some interpretability in the model also helps in identifying common bugs on the backend side, like a feature getting very high importance because of an incorrect join or data leakage when computing that feature. This wouldn’t be possible if we just used a model that does the desired task without any way for us data scientists to probe into it. Generally, this boils down to a tradeoff between interpretability and performance of the model. I would definitely go with an interpretable model over a model which gives slightly better scores but is entirely a black box.
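One simple way to probe a model is permutation importance: shuffle one feature column at a time and see how much the score drops. The sketch below is a toy version in plain Python, not the method used at Yammer; the `model`, the two features and the accuracy metric are stand-ins I made up for illustration.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    correct = sum(1 for x, y in zip(rows, labels) if model(x) == y)
    return correct / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [x[j] for x in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by a shuffled value.
        shuffled = [x[:j] + (v,) + x[j + 1:] for x, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances


# Toy setup: the label depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(500)]
labels = [int(x[0] > 0.5) for x in rows]


def model(x):
    """Stand-in for a trained model (hypothetical)."""
    return int(x[0] > 0.5)


importances = permutation_importance(model, rows, labels, n_features=2)
print(importances)
```

A feature whose importance towers over everything else is worth a second look: it may be a genuinely strong signal, or it may be leaking the label through a bad join.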

3. Infrastructure is super important (most underappreciated part)

The type of models you build will be restricted by the infrastructure you have. The reason is fairly obvious: though you want to serve the best threads/content possible to users, you want to do this within some predefined latency and without costs going through the roof due to excessively frequent syncing of predictions and training. Hence, the type of infrastructure you have plays a major role in deciding the model and what can go into production. So I definitely recommend looking into the infrastructure side of things, both to appreciate what it does and because there are quick gains to be made in modelling from small changes on the infrastructure side. Some major choices I have seen here are whether to go with batch predictions or a near real-time architecture, the syncing times between training and predictions, and so on. There are many articles on infrastructure design choices and their downstream impacts, so I am going to leave this sub-topic here.

4. Results should be reproducible

In academia, the main focus is building models which get the highest score, and model reproducibility takes a back seat. Though the trend of open-sourcing code is increasing, the open-sourced code often has a lot of minutely tuned hyper-parameters and seed values which beat the metric only on the given dataset and perform poorly on other datasets. I have seen this issue of reproducibility and replicability plague academia intensely.

However, in industry, you want the results of the model you developed to be absolutely reproducible, and the model should consistently deliver high performance in production. This ensures consistent monetary gains for the company, and in my context it builds customers’ confidence that Yammer will consistently serve them relevant content.

The two pillars of reproducibility and replicability should be enforced throughout the ML system, right from data logging, ETL, and feature generation to model training and generating predictions. In real-time systems like Yammer, there is another problem: data drift, and how validation data might not represent the current real-time stream data. It’s a topic for another day I guess :)
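As a small illustration of what enforcing this can look like in code, here is a sketch that pins the random seed in a config and records a hash of both the config and the input data with every run, so two runs can be compared exactly. It is a toy example under my own assumptions; `run_experiment`, the config keys and the "score" are hypothetical and not part of any real pipeline.

```python
import hashlib
import json
import random


def fingerprint(obj):
    """Stable SHA-256 hash of any JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(config, data):
    """Toy 'training' run: a seeded train/validation split and a summary."""
    rng = random.Random(config["seed"])  # all randomness flows from the config
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * config["train_fraction"])
    train, valid = shuffled[:cut], shuffled[cut:]
    return {
        "config_hash": fingerprint(config),
        "data_hash": fingerprint(data),
        "train_size": len(train),
        "valid_mean": sum(valid) / len(valid),
    }


config = {"seed": 7, "train_fraction": 0.8}
data = list(range(100))

first = run_experiment(config, data)
second = run_experiment(config, data)
print(first == second)
```

If two runs disagree, the hashes tell you straight away whether the config or the data changed, which narrows down where to look.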

5. Documentation

Having good documentation about the infrastructure, models and analytics side of things will determine the velocity of the team. You want documentation for every A/B test and experiment that was tried: what it did, whether it failed or succeeded, and then a deep dive into the possible reasons why it didn’t succeed. Though this costs time in the short run, in the long run you will look back and see it as the best thing you invested in. It also helps new employees and interns like me get to quick success xD

6. Final thoughts

I learned a lot about what goes into taking ML models to production, and being able to deploy the ML model I developed for use by 15M monthly active users is an experience in itself. Another thing I would like to add here is that the ML model forms only a small part of the whole ML system, and as data scientists our focus should be on ‘adding business value and improving customer experience’ rather than only on building bigger and more complex ML models.