NLP :- [FB, Stitchfix, StackOverflow]
Hello! A bunch of NLP links ahead!
New advances in natural language processing to better connect people
It’s just another day for Facebook to beat some machine learning benchmark. This post talks about their recent breakthrough in natural language understanding using the Robustly Optimized BERT Pretraining Approach (RoBERTa), which achieves SOTA results.
A key excerpt from the article explaining what’s happening:
“RoBERTa modifies key hyperparameters in BERT, including removing BERT’s next-sentence pretraining objective and training with much larger mini batches and learning rates. We also train for much longer over 10x more data overall, as compared with BERT. This approach led to new state-of-the-art results on the widely used NLP benchmarks, General Language Understanding Evaluation (GLUE) and ReAding Comprehension from Examinations (RACE).”
Basically, it’s “train longer with larger batches” that did the trick. While this is quite an appreciable achievement, what caught my eye is that Facebook also managed to include a diagnostic tool (Winogender, which is designed to test for the presence of gender bias in automated co-reference resolution systems) to measure biases in these models.
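Another of RoBERTa’s changes is dynamic masking: a fresh masking pattern is sampled every time a sequence is fed to the model, instead of fixing one mask during preprocessing as the original BERT did. Here is a minimal sketch of that idea in plain Python — token-level only, as an illustration; the real implementation works on subword IDs inside the training pipeline:

```python
import random

MASK = "[MASK]"

def dynamic_mask(tokens, mask_prob=0.15, seed=None):
    """Return a copy of `tokens` with roughly mask_prob of them replaced
    by [MASK]. Calling this again (e.g. next epoch) draws a new pattern,
    which is the "dynamic" part RoBERTa introduced."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()

# Two epochs see two different masking patterns of the same sentence.
epoch1 = dynamic_mask(tokens, seed=1)
epoch2 = dynamic_mask(tokens, seed=2)
```

With static masking, `epoch1` and `epoch2` would be identical; here each pass over the data gives the model a different prediction task for the same sentence.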
Give Me Jeans not Shoes: How BERT Helps Us Deliver What Clients Want
Stitchfix, a company that heavily uses data science in its business, recently published this article on how they’re using BERT to solve a problem (one that may not be a problem for a human, but is for a machine). This post does a wonderful job of explaining attention-based NLP models, and further shows how Stitchfix is leveraging them to generate new features from text and solve the problem.
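The core trick is representing both a client’s request and the candidate items as dense vectors, then comparing them. Here is a toy sketch of that matching step, assuming the embeddings already exist — the 3-d vectors below are made-up stand-ins for real BERT sentence embeddings, and the item names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "embeddings" standing in for real BERT sentence vectors.
request = [0.9, 0.1, 0.2]            # "give me jeans, not shoes"
items = {
    "slim-fit jeans": [0.8, 0.2, 0.1],
    "leather shoes":  [0.1, 0.9, 0.3],
}

# Pick the item whose embedding is closest to the request.
best = max(items, key=lambda name: cosine(request, items[name]))
```

Because the vectors encode meaning rather than exact words, a request phrased as “denim, please” could still land near the jeans — which is exactly what keyword matching struggles with.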
CROKAGE: A New Way to Search Stack Overflow
CROKAGE - the Crowd Knowledge Answer Generator - is a project by a group of researchers that returns an explanation along with a code snippet when someone queries a programming problem on Stack Overflow.
The architecture of CROKAGE uses fastText to train the word-embedding model on SO’s Q&As. To ensure that the written explanation was succinct and valuable, the team applied NLP to the answers, ranking the most relevant ones by four weighted factors. This definitely seems like a nice step, as many AI enthusiasts envision a day when AI can write code on its own.
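That ranking step can be pictured as a simple weighted sum over per-answer scores. The factor names and weights below are hypothetical illustrations — the article above describes four weighted factors, but these are not CROKAGE’s actual factors or values:

```python
def rank_answers(answers, weights):
    """Sort candidate answers by a weighted sum of their factor scores,
    highest first. Each answer is a dict holding one score per factor."""
    def score(ans):
        return sum(weights[f] * ans[f] for f in weights)
    return sorted(answers, key=score, reverse=True)

# Hypothetical factors and weights, purely for illustration.
weights = {"lexical": 0.4, "semantic": 0.3, "api": 0.2, "quality": 0.1}

answers = [
    {"id": "a1", "lexical": 0.2, "semantic": 0.9, "api": 0.5, "quality": 0.8},
    {"id": "a2", "lexical": 0.9, "semantic": 0.4, "api": 0.6, "quality": 0.3},
]

ranked = rank_answers(answers, weights)
```

Tuning the weights is where the real work lives: it decides whether lexical overlap with the query beats semantic similarity from the embeddings.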
Getting started with Text Preprocessing
Given that we’ve seen so much advanced NLP in the above articles, it’s a good idea for beginners to get started with NLP, and the first step there is understanding text preprocessing. Kaggle Grandmaster SRK recently shared this Kernel, which gives a nice introduction to text preprocessing and can also serve as a handy recipe for your NLP work. The Kernel is based on Python and uses packages like nltk, spacy, and re (for regex).
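As a taste of what such a recipe covers, here is a minimal preprocessing pipeline using only re and the standard library. The tiny stopword set is illustrative; in practice you’d pull a proper list from nltk or spacy:

```python
import re

# A deliberately tiny stopword set, just for demonstration.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and whitespace only
    return [t for t in text.split() if t not in STOPWORDS]

tokens = preprocess("The 2 quick Foxes are jumping over the lazy dog!")
```

Real pipelines usually add steps on top of this — stemming or lemmatization, spelling correction, contraction expansion — which is exactly what the Kernel walks through.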
If you enjoyed this newsletter, share it with your friends and encourage them to sign up. Don’t forget to share your feedback with me.
Solo Business - Not Data Science
Perseverance pays - a wonderful lesson from Harry’s talk, irrespective of whether or not you are planning to start up.
Regards,
AbdulMajed