In the previous post, we covered the fundamental steps behind building your natural language processing (NLP) application.
To recap, every application follows this step-by-step framework:
Problem 🤔: Understand the problem you are trying to solve
Data 📊: Identify the target domain and collect the relevant data
Success Criteria 🏆: Set the right business metric for your problem
Modeling Solution 🏎️: Train and test your NLP model
Deployment ⚡: Deploy the model into production and iterate based on customer feedback
There, we covered the right way to approach your problem and the important things to consider before collecting data, so we will jump right into the remaining steps.
3. Success Criteria
Before going into modeling, you need to set the success criteria by answering the following two questions:
What is the metric you are planning to use?
What is the minimum performance you expect the model to have on your test set?
Metrics
While there are open-source packages implementing all sorts of metrics, the right metric for your problem is not always obvious.
Classification tasks
For classification tasks (such as intent classification or sentiment analysis), choosing the metric is fairly straightforward.
Accuracy is the right metric to start with: it counts the number of correctly predicted examples and normalizes it over the total number of examples in the test set.
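As a quick sanity check, accuracy is a one-liner with scikit-learn (the labels below are made up purely for illustration):

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth labels and model predictions.
y_true = ["positive", "negative", "positive", "neutral", "positive"]
y_pred = ["positive", "negative", "negative", "neutral", "positive"]

# Correct predictions divided by total examples: 4 / 5 = 0.8.
print(accuracy_score(y_true, y_pred))
```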
Keep in mind that not all classes are equal. Misclassifying a hurtful or abusive post as non-abusive can be 100x worse than misclassifying a regular post as abusive. It is worth breaking down performance across the different classes and tracking the number of such critical mistakes separately. This is especially important for highly imbalanced data: it is extremely easy to get high _overall_ accuracy by always classifying text as non-abusive, so track accuracy for each class separately.
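One convenient way to get this per-class breakdown is scikit-learn's classification_report, which prints precision and recall for each class separately. A minimal sketch with toy abusive/non-abusive labels (purely illustrative, not real data):

```python
from sklearn.metrics import classification_report

# Imbalanced toy data: mostly non-abusive, a few abusive posts.
y_true = ["non-abusive"] * 8 + ["abusive"] * 2
y_pred = ["non-abusive"] * 9 + ["abusive"] * 1

# Overall accuracy is 90%, but recall on the abusive class is
# only 50%: one of the two abusive posts slipped through.
print(classification_report(y_true, y_pred))
```

Even though the overall accuracy looks healthy here, the report immediately exposes the weak recall on the rare but critical class.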
I also suggest using a confusion matrix when analyzing the results. It will help you find the patterns of mistakes that the model is making.
In the confusion matrix above, the diagonal elements represent the number of examples where the model's predicted label matches the ground-truth label, while the off-diagonal elements show the number of examples mislabeled by the classifier. We can see that the model frequently (6 out of 16 times, to be exact) confuses versicolor with virginica: a clear pattern in the model's mistakes.
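If you want to reproduce this kind of analysis, here is a minimal sketch that trains a simple classifier on scikit-learn's iris dataset and prints its confusion matrix. The model and split are my assumptions, so your exact counts will differ from the figure:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the iris dataset (setosa, versicolor, virginica).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# A deliberately under-regularized linear SVM, so the matrix
# shows some interesting mistakes to analyze.
clf = SVC(kernel="linear", C=0.01).fit(X_train, y_train)

# Rows are ground-truth labels, columns are predicted labels;
# the diagonal holds the correctly classified examples.
print(confusion_matrix(y_test, clf.predict(X_test)))
```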