> Note: If you are not familiar with machine learning you can start with this post which explains the basic concepts of Machine Learning and the Azure Machine Learning service.
The purpose of this post is to explain how to build an experiment for sentiment analysis using Azure Machine Learning and then publish it as a public API that any application can consume for a particular business scenario (e.g., gathering users' opinions about a product or brand). Since the Text Analytics API already available in the Azure Marketplace works on English text, we decided to create an equivalent in Spanish. To simplify things, we used the sample Twitter sentiment analysis experiment available in the Azure Machine Learning Gallery.
Creating a custom dataset
This is our greatest challenge: creating a valid dataset with Spanish content. The sample experiment we are going to use as the basis for our work relies on an existing dataset, which you can find here. That experiment is based on an original dataset of 1,600,000 tweets classified as negative or positive; the Azure ML Studio sample dataset contains only 10% of this data (160,000 records). In supervised learning, the more training data you have, the more accurate your trained model will be, which is why the first thing we want is a dataset with a considerable amount of data.
As this dataset is in English, the predictive model will learn to process English text. But since we want to create a service using the Spanish language, our data needs to be in Spanish.
To get the data in Spanish we could use Spanish tweets and classify them manually (which would take a long time) or translate the original dataset into Spanish. With the latter option, the hard work of classifying the data is already done, and we can use an automatic translation tool to do the rest for us. Although automatic translation is not 100% accurate, the keywords will still be there, so we will go with this approach to make sure we have a good quantity of training data.
For this reason we created a very simple console application that uses the Bing Translate API to translate our dataset and return it in the correct format.
Once we have the dataset ready, the next step is to upload it to Azure ML studio so it is available to use in the experiments.
To upload the recently created dataset, in the Azure ML portal click NEW, select DATASET, and then click FROM LOCAL FILE. In the dialog box that opens, select the file you want to upload, type a name, and select the dataset type (this is usually inferred automatically). In our case, it is a tab-separated values file (.tsv).
The dataset contains only 2 columns: sentiment_label, which is 0 for a negative sentiment and 4 for a positive one, and tweet_text, which holds the translated tweet content.
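For illustration, a couple of rows in that format (the tweets here are invented examples, not actual rows from the dataset):

```
0	el servicio al cliente fue terrible
4	me encanta este producto
```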
Once the dataset is created, we will take advantage of the existing sample experiment of the Machine Learning Gallery, available here.
Open the experiment by clicking Open in Studio as shown below.
Then, you will be prompted to copy the experiment from the Gallery to your workspace.
At this point let’s remove the Reader module from the experiment and add the custom dataset we created. Connect the dataset to the Execute R Script module.
Run the experiment.
Pre-processing the data
This experiment uses several modules to pre-process the data before analyzing its content (like removing punctuation marks or special characters, or adjusting the data to fit the algorithm used). For more information about the data preprocessing, you can read the information available in the experiment page in the Gallery.
Scoring the model
After running the training experiment, let's create the scoring (predictive) experiment. To do this, point to SET UP WEB SERVICE and select Predictive Web Service [Recommended].
Once the Predictive Experiment is created, we need to update this experiment to make it work as expected. First delete the Filter Based Feature Selection Module and reconnect the Feature Hashing module to the Score Model module.
Delete the connection between the Score Model module and the Web Service Output module by right-clicking it and clicking Delete.
Between those two modules, add a Project Columns module, and then an Execute R Script module. Connect them in sequence and also with the Web Service Output module. The resulting experiment will resemble the following image.
Now let’s configure the Project Columns module. Select it and in the Properties pane, click Launch column selector. In the dialog box that opens, in the row with the Include dropdown, go to the text field and add the four available columns (sentiment_label, tweet_text, Scored Labels, and Scored Probabilities).
Lastly, select the Execute R Script module to configure it. Click inside the R Script text box and replace the existing script with the following:
```r
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame

# Set thresholds for classification
threshold1 <- 0.60
threshold2 <- 0.45

probs <- dataset1[["Scored Probabilities"]]
positives <- which(probs > threshold1)
negatives <- which(probs < threshold2)
neutrals  <- which(probs <= threshold1 & probs >= threshold2)

# One label per row, assigned from the thresholds above
new.labels <- matrix(nrow = nrow(dataset1), ncol = 1)
new.labels[positives] <- "positive"
new.labels[negatives] <- "negative"
new.labels[neutrals]  <- "neutral"

data.set <- data.frame(assigned = new.labels, score = probs)
colnames(data.set) <- c('Sentiment', 'Score')

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set")
```
This will return two columns as the output of the service: Sentiment and Score.
The Sentiment column will be returned as positive, neutral, or negative, and the Score column will contain the scored probability. The classification is based on the defined thresholds and falls into the following 3 categories:
- Less than 0.45: Negative
- Between 0.45 and 0.60: Neutral
- Above 0.60: Positive
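For reference, the same mapping written as a small standalone function (plain Python, independent of Azure ML), mirroring the thresholds in the R script:

```python
THRESHOLD_POS = 0.60
THRESHOLD_NEG = 0.45

def label_sentiment(scored_probability):
    """Map a scored probability to a sentiment label using the same
    thresholds as the Execute R Script module."""
    if scored_probability > THRESHOLD_POS:
        return "positive"
    if scored_probability < THRESHOLD_NEG:
        return "negative"
    # Anything in [0.45, 0.60] (inclusive) is considered neutral
    return "neutral"
```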
Now that everything is set up, we can run the experiment.
Publishing and Testing the Web Service
Once the predictive experiment finishes, click Deploy Web Service. The deployed service screen will appear. Click Test.
In the Enter data to predict dialog box, enter a text in Spanish in the TWEET_TEXT parameter and click the check mark button.
Wait for the web service to predict the results, which will be shown as an alert.
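Beyond the Test dialog, the published service can also be called over HTTP. The sketch below builds a request body in the shape the classic Azure ML request/response service expects; the endpoint URL, API key, input name, and exact column list are assumptions (placeholders) — your service's API help page shows the real values for your deployment.

```python
def build_request(tweet_text):
    """Build a request body for the classic Azure ML request/response
    service. `input1` is the usual default input name; we assume the
    input schema matches the dataset's two columns, with a dummy label."""
    return {
        "Inputs": {
            "input1": {
                "ColumnNames": ["sentiment_label", "tweet_text"],
                "Values": [["0", tweet_text]],
            }
        },
        "GlobalParameters": {},
    }

# Calling the service might look like this (URL and key are placeholders):
#
#   import json, requests
#   resp = requests.post(
#       "https://<region>.services.azureml.net/workspaces/<ws>"
#       "/services/<service-id>/execute?api-version=2.0",
#       headers={"Authorization": "Bearer <api-key>",
#                "Content-Type": "application/json"},
#       data=json.dumps(build_request("me encanta este producto")))
```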
We built a simple test page that uses the generated API to try out the service.
We tested the resulting API with some sample text, and we are pleased with the outcome (the model learned how to classify Spanish texts quite well). Nevertheless, there are some ways to improve the model we have created, such as:
- Trying other training algorithms and comparing their performance
- Improving the input dataset, either by building a brand-new dataset of manually classified Spanish content or by using common sentiment keywords to gather already-classified results.
Given that this is a proof of concept, we consider this to be a successful experiment.