Blogs /linguistics/ en Emoji Prediction /linguistics/2018/05/08/emoji-prediction <span>Emoji Prediction</span> <span><span>Anonymous (not verified)</span></span> <span><time datetime="2018-05-08T19:00:00-06:00" title="Tuesday, May 8, 2018 - 19:00">Tue, 05/08/2018 - 19:00</time> </span> <div> <div class="imageMediaStyle focal_image_wide"> <img loading="lazy" src="/linguistics/sites/default/files/styles/focal_image_wide/public/article-thumbnail/umada-cover-2.jpg?h=6c6c8932&amp;itok=ndwYwiio" width="1200" height="800" alt="umada cover 2"> </div> </div> <div role="contentinfo" class="container ucb-article-tags" itemprop="keywords"> <span class="visually-hidden">Tags:</span> <div class="ucb-article-tag-icon" aria-hidden="true"> <i class="fa-solid fa-tags"></i> </div> <a href="/linguistics/taxonomy/term/84" hreflang="en">Blogs</a> </div> <div class="ucb-article-content ucb-striped-content"> <div class="container"> <div class="paragraph paragraph--type--article-content paragraph--view-mode--default 3"> <div class="ucb-article-text" itemprop="articleBody"> <div><h2>Can you tell from a text whether the writer is happy? Angry? Disappointed? Can you put the writer’s happiness on a 1-5 scale?</h2><hr><p>By Tetsumichi (Telly) Umada<br> Course: Machine Learning and Linguistics (Ling 4100)<br> Advisor: Mans Hulden<br><strong>LURA 2018</strong></p><p>Sentiment analysis in computational linguistics is a general term for techniques that quantify sentiment or mood in a text. Robust tools for sentiment analysis are often very desirable for companies. Imagine that a company has just launched a new product, GizmoX. Now the management wants to know how customers feel about it. Instead of calling or writing each person who bought GizmoX, suppose we could have a program search the web for message-board posts that discuss GizmoX and automatically rate each writer's attitude toward the purchase. Valuable information could be obtained, practically for free. 
Because sentiment analysis is used so widely for this purpose, it is sometimes called <a href="https://en.wikipedia.org/wiki/Sentiment_analysis" rel="nofollow">Opinion Mining</a>.</p><p>Of course, to be <em>really</em> accurate at analyzing sentiment you almost have to have a human in the loop. There are many subtleties in texts that computer algorithms still have a hard time with: detecting sarcasm, for example. But for many practical purposes you don't have to be 100% accurate in your analysis for it to be useful. A sentiment analyzer that gets it right 80% of the time can still be very valuable.</p><h3>Emoji Prediction</h3><p>Emoji prediction is a fun variant of sentiment analysis. When texting your friends, can you tell their emotional state? Are they happy? Could you put an appropriate smiley on each text message you receive? If so, you probably understand their sentiment.</p><p>In this project, we build what's called a classifier that learns to associate emojis with sentences. Although there are many technical details, the principle behind the classifier is very simple: we start with a large number of sentences containing emojis, collected from Twitter messages. Then we extract features from those sentences (words, word pairs, etc.) and train our classifier to associate certain features with their (known) smileys. For example, if the classifier sees the word "happy" in many sentences that are also tagged with a particular smiley, it will learn to classify such messages with that smiley. On the other hand, the word "happy" could be preceded by "not", in which case we shouldn't rely on single words alone to be associated with certain smileys. For this reason, we also look at word sequences, and in this case the classifier would learn that "not happy" is more strongly associated with sadness, outweighing the "happy" part. The classifier looks at the totality of the word sequences found in a sentence and figures out what class of smiley would best characterize that sentence. 
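The feature idea just described can be sketched in a few lines of Python. This is a toy illustration, not the project's actual code; note how the bigram "not happy" survives as its own feature, which lets a classifier outweigh the positive unigram "happy":

```python
# Toy sketch of n-gram feature extraction (not the project's actual code).
# A sentence is represented by its word n-grams, so a classifier can learn
# that the bigram "not happy" signals something different than "happy" alone.

def ngram_features(text, n_max=2):
    """Return the set of all 1..n_max word n-grams in a lowercased text."""
    words = text.lower().split()
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats.add(" ".join(words[i:i + n]))
    return feats

feats = ngram_features("I am not happy about this")
# Both "happy" and "not happy" come out as features, so the classifier can
# let the negated bigram outweigh the positive unigram.
```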
Although the principle is simple, if we have millions of words of text with known smileys associated with the sentences, we can actually learn to do quite well on this task.</p><p>If you don’t want to actually re-create the classifier, you can skip ahead to the Error Analysis section where you'll see how well it does in predicting 7 different smileys after being "trained" on some text.</p><h3>Technical: Quickstart</h3><p>To use this project, you need to install Python 3, Jupyter Notebook, and a few Python libraries.</p><h3>Install</h3><h3>Install python3</h3><p>If you don't have Python 3 on your computer, there are two options:</p><ul><li>Download Python 3 from <a href="https://www.anaconda.com/download/" rel="nofollow">Anaconda</a>, which includes Python, Jupyter Notebook, and the other libraries.</li><li>Download Python 3 from <a href="https://www.python.org/downloads/" rel="nofollow">python.org</a></li></ul><h3>Install packages</h3><p>All packages used for this project are listed in requirements.txt. To install them, run</p><p>$ pip3 install -r requirements.txt<br><br> &nbsp;</p><h3>Download project</h3><p>To download this project repository, run</p><p>$ git clone https://github.com/TetsumichiUmada/text2emoji.git<br><br> &nbsp;</p><h3>Run jupyter notebook</h3><p>To start Jupyter Notebook, move to the project directory with cd path_to/text2emoji, then run</p><p>$ jupyter notebook</p><h3>Project Details</h3><p>The goal of this project is to predict an emoji that is associated with a text message. To accomplish this task, we train and test several supervised machine learning models on the data to predict the sentiment associated with a text message. Then, we represent the predicted sentiment as an emoji.</p><h3>Data Sets</h3><p>The data comes from the <a href="https://github.com/bfelbo/DeepMoji/tree/master/data" rel="nofollow">DeepMoji/data repository</a>. Since the file format is a pickle, we wrote a Python 2 script to convert the pickle to a txt file. 
The data (both pickle and txt files) and scripts are stored in the text2emoji/data directory.</p><p>Among the available data on the repository, we use the PsychExp dataset for this project. In the file, there are 7840 samples, and each line contains a text message and its sentiment label, which is represented as a vector [joy, fear, anger, sadness, disgust, shame, guilt].</p><p>In the txt file, each line is formatted as below:</p><p>[ 1. 0. 0. 0. 0. 0. 0.] Passed the last exam.<br> &nbsp;</p><p>Since the first position of the vector is 1, the text is labeled as an instance of joy.</p><p>For more information about the original data sets, please check <a href="https://github.com/bfelbo/DeepMoji/tree/master/data" rel="nofollow">DeepMoji/data</a> and text2emoji/data.</p><h3>Preprocess and Features</h3><p>How does a computer understand a text message and analyze its sentiment? A text message is a series of words. To be able to process text messages, we need to convert the text into numerical features.</p><p>One method for converting a text into numerical features uses <a href="https://en.wikipedia.org/wiki/N-gram" rel="nofollow">n-grams</a>. An n-gram is a sequence of n words from a given text. A 2-gram (bigram) is a sequence of two words, for instance, "thank you" or "your project", and a 3-gram (trigram) is a three-word sequence like "please work on" or "turn in your".</p><p>For this project, we first convert all the texts into lower case. Then, we create n-grams with a range from 1 to 4 and count how many times each n-gram appears in the text.</p><h3>Models and Results</h3><p>Building a machine learning model mainly involves two steps. The first step is to train a model. After that, we evaluate the model on a separate data set, i.e. we don't evaluate performance on the same data we learned from. For this project, we use four classifiers and train each classifier to see which one works best for this project. 
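Before any training, each line of the txt file has to be turned into a (label, text) pair. Here is a minimal sketch of that step, assuming exactly the layout shown in the Data Sets section above (the helper name is ours, not the project's):

```python
# Sketch: parse one line of the PsychExp txt file into (label, text).
# Assumes the "[ 1. 0. 0. 0. 0. 0. 0.] message" layout shown above.

LABELS = ["joy", "fear", "anger", "sadness", "disgust", "shame", "guilt"]

def parse_line(line):
    """Split a '[ ... ] text' line into (label_name, message_text)."""
    vector_part, text = line.split("]", 1)
    values = [float(v) for v in vector_part.strip("[ ").split()]
    # The position of the 1 in the vector tells us which sentiment it is.
    return LABELS[values.index(1.0)], text.strip()

label, text = parse_line("[ 1. 0. 0. 0. 0. 0. 0.] Passed the last exam.")
# label is "joy" because the 1 sits in the first position of the vector.
```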
To train and test the performance of each model, we split the data set into a "training set" and a "test set", in an 80/20 ratio. By separating the data, we can check that the model generalizes well and can perform well in the real world.</p><p>We evaluate the performance of each model by calculating an accuracy score. The accuracy score is simply the proportion of classifications that were done correctly, and is calculated by dividing the number of correct classifications by the total number of classifications.</p><p>For this project, we tested the following classifiers. Their accuracy scores are summarized in the table below.</p><table><tbody><tr><td><strong>Classifier</strong></td><td><strong>Training Accuracy</strong></td><td><strong>Test Accuracy</strong></td></tr><tr><td><a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html" rel="nofollow">SVC</a></td><td>0.1458890</td><td>0.1410428</td></tr><tr><td><a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html" rel="nofollow">LinearSVC</a></td><td>0.9988302</td><td>0.5768717</td></tr><tr><td><a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" rel="nofollow">RandomForestClassifier</a></td><td>0.9911430</td><td>0.4304813</td></tr><tr><td><a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" rel="nofollow">DecisionTreeClassifier</a></td><td>0.9988302</td><td>0.4585561</td></tr></tbody></table><p>Based on the accuracy scores, the SVC classifier gives poor results: its accuracy of about 0.14 is roughly what always guessing a single class would achieve over seven classes. The LinearSVC classifier works quite well, although we see some overfitting (meaning that the training accuracy is high and the test accuracy is significantly lower). This means the model has difficulty generalizing to examples it hasn't seen.</p><p>We can observe the same phenomenon for the other classifiers. 
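The evaluation procedure described above can be sketched in plain Python. This is a simplified stand-in: a trivial majority-class baseline takes the place of the real scikit-learn classifiers so the example stays self-contained, and the tiny data set is invented for illustration:

```python
# Sketch of the 80/20 split and accuracy computation described above,
# using a trivial majority-class baseline instead of a real classifier.

import random
from collections import Counter

def accuracy(gold, predicted):
    """Proportion of predictions that match the true labels."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

random.seed(0)
data = [("passed the exam today", "joy")] * 40 + [("i failed again", "sadness")] * 10
random.shuffle(data)

split = int(len(data) * 0.8)          # 80% train, 20% test
train, test = data[:split], data[split:]

# "Train": just memorize the most frequent label in the training set.
majority = Counter(label for _, label in train).most_common(1)[0][0]

train_acc = accuracy([l for _, l in train], [majority] * len(train))
test_acc = accuracy([l for _, l in test], [majority] * len(test))
```

A real classifier is only useful if its test accuracy beats this kind of baseline; the gap between train_acc and test_acc is also where overfitting shows up.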
In the error analysis, we therefore focus on the LinearSVC classifier, which performs best.</p><h3>Error Analysis</h3><p>We analyze the classification results from the best performing (LinearSVC) model, using a confusion matrix. A confusion matrix is a table which summarizes the performance of a classification algorithm and reveals the type of misclassifications that occur. In other words, it shows the classifier's confusion between classes. The rows in the matrix represent the true labels and the columns are the predicted labels. A perfect classifier would have big numbers on the main diagonal and zeroes everywhere else.</p><p>The classifier has clearly learned many significant patterns: the numbers along the diagonal are much higher than those off the diagonal. That means true anger most often gets classified as anger, and so on.</p><p>On the other hand, the classifier often misclassifies text messages associated with guilt, shame, and anger. This is perhaps because it is hard to pinpoint specific words or sequences of words that characterize these sentiments. In contrast, messages involving <em>joy</em> are more likely to have words such as "good", "like", and "happy", and the classifier handles such sentiments much better.</p><h3>Future Work</h3><p>To improve on the current results, we probably, first and foremost, need access to more data for training. At the same time, adding more specific features to extract from the text may also help. For example, paying attention to the use of all caps, punctuation patterns, and similar cues would probably improve the classifier.</p><p>A statistical analysis of the features, such as a chi-squared test to identify the most informative tokens, could also provide insight. 
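The chi-squared idea can be sketched as follows. This is an illustrative 2x2 version with made-up counts; a real analysis would score every token in the corpus, e.g. with scikit-learn's feature selection utilities:

```python
# Sketch: score how strongly a token is associated with one class using the
# chi-squared statistic for a 2x2 table of
# (token present / absent) x (class / other classes).

def chi2_score(a, b, c, d):
    """Chi-squared for a 2x2 table: a = token & class, b = token & other,
    c = no token & class, d = no token & other."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator if denominator else 0.0

# Made-up counts: "happy" occurs in 30 of 50 joy messages but 5 of 50 others.
score = chi2_score(30, 5, 20, 45)
# A high score marks "happy" as an informative token for the joy class;
# a token distributed evenly across classes scores near zero.
```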
As in many other tasks, moving from a linear classifier to a deep learning (neural network) model would probably also boost the performance.</p><h3>Example/Demo</h3><p>Here are four example sentences and the emojis the classifier associates them with:</p><p></p><h3>References</h3><ul><li><a href="https://www.media.mit.edu/projects/deepmoji/overview/" rel="nofollow">DeepMoji</a></li><li><a href="https://github.com/bfelbo/DeepMoji" rel="nofollow">DeepMoji GitHub</a></li><li><a href="http://scikit-learn.org/stable/modules/multiclass.html" rel="nofollow">Multiclass and multilabel algorithms</a></li><li><a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html" rel="nofollow">sklearn.svm.SVC</a></li><li><a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html" rel="nofollow">sklearn.svm.LinearSVC</a></li><li><a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" rel="nofollow">sklearn.ensemble.RandomForestClassifier</a></li><li><a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" rel="nofollow">sklearn.tree.DecisionTreeClassifier</a></li></ul></div> </div> </div> </div> </div> <h2> <div class="paragraph paragraph--type--ucb-related-articles-block paragraph--view-mode--default"> <div>Off</div> </div> </h2> <div>Traditional</div> <div>0</div> <div> <div class="imageMediaStyle large_image_style"> <img loading="lazy" src="/linguistics/sites/default/files/styles/large_image_style/public/feature-title-image/umada-cover-2.jpg?itok=b4B_-O4t" width="1500" height="626" alt> </div> </div> <div>On</div> <div>White</div> Wed, 09 May 2018 01:00:00 +0000 Anonymous 1214 at /linguistics Pragmatic Skills in Individuals with Down Syndrome /linguistics/2018/05/08/pragmatic-skills-individuals-down-syndrome <span>Pragmatic Skills in Individuals with Down Syndrome</span> <span><span>Anonymous (not verified)</span></span> <span><time 
datetime="2018-05-08T18:59:18-06:00" title="Tuesday, May 8, 2018 - 18:59">Tue, 05/08/2018 - 18:59</time> </span> <div> <div class="imageMediaStyle focal_image_wide"> <img loading="lazy" src="/linguistics/sites/default/files/styles/focal_image_wide/public/article-thumbnail/sanders-cover.jpg?h=8b48557f&amp;itok=Bua6Q7CD" width="1200" height="800" alt="sanders cover"> </div> </div> <div role="contentinfo" class="container ucb-article-tags" itemprop="keywords"> <span class="visually-hidden">Tags:</span> <div class="ucb-article-tag-icon" aria-hidden="true"> <i class="fa-solid fa-tags"></i> </div> <a href="/linguistics/taxonomy/term/84" hreflang="en">Blogs</a> </div> <div class="ucb-article-content ucb-striped-content"> <div class="container"> <div class="paragraph paragraph--type--article-content paragraph--view-mode--default 3"> <div class="ucb-article-row-subrow row"> <div class="ucb-article-text col-lg d-flex align-items-center" itemprop="articleBody"> <div><h2>Children with Down syndrome demonstrate an intricate profile of strengths and limitations in pragmatic aspects of language.&nbsp;</h2><hr><p>By Amy Sanders<br> Course: Semantics (Ling 3430)<br> Advisor: Prof. Barbara Fox<br><strong>LURA 2018</strong></p><p>The Early Circles program at CU Boulder offers information and coaching to families with children with Down syndrome.&nbsp;During the summer of 2016, I participated in the Early Circles internship through the university’s Speech, Language, and Hearing Clinic (SLHC). The program paired me with a specific family as part of the experience. While working with the child, H.W., I implemented various interaction strategies to promote learning and communication through play. I learned so much about family-centered communication intervention approaches and put this knowledge into practice while working closely with the family. Although the internship only took place over the summer, I built a great connection with the family and now provide childcare for H.W. 
to this day.&nbsp;</p><p>H.W. is three years old and attends a full-day preschool in Boulder. She loves preparing, serving, and eating meals in her play kitchen! She also enjoys spending time in the “real” kitchen with her parents while they cook; recently, she has started adding salt to the pan. H.W. is very chatty, as evidenced by the immense number of signs that she uses. We communicate using both English and some sign language. She is currently taking a climbing class and loves it!&nbsp;</p><p>I immediately thought of my fabulous experiences with H.W. when Dr. Barbara Fox noted that we could choose any topic for our research paper in my course on Semantics (LING 3430), as long as it connected to a class topic. Lucky for me, Dr. Fox had recently lectured about pragmatics! I imagined that many individuals thought of pragmatic skills solely in relation to being “sociable,” but I knew that the area of pragmatics encompassed so much more.&nbsp;</p><p>For my paper, I decided to research pragmatic skills in individuals with Down syndrome. I researched various areas of pragmatics, including narrative skills, topic maintenance, turn-taking, communication repair, and intent. Although I consulted the literature, I also video recorded interactions between H.W. and myself. It was a wonderful experience to review the videos and consider H.W.’s pragmatic skills. Writing this paper was especially meaningful to me because of my connection with H.W. and my interests in both linguistics and Down syndrome.&nbsp;</p><p>After much research, I concluded that there was a lot of variability among individuals with Down syndrome. For example, one study found that children with Down syndrome had strong narrative skills, generating significantly longer and more complex narratives than an expressive-language-matched group (Boudreau &amp; Chapman 2000, p. 1154). Another study showed that children with Down syndrome rarely initiated communication and that turn-taking was infrequent. 
These studies, combined with additional research, suggest that people with Down syndrome demonstrate an intricate profile of strengths and limitations in pragmatic aspects of language.&nbsp;</p><p>Furthermore, some of what I observed while working with H.W. did not align with specific studies that I consulted. For example, while one study noted that children with Down syndrome rarely engaged in turn-taking, H.W. had strong turn-taking skills. Her turn-taking abilities are especially evident when we read books together. Currently, one of her favorite books is titled “I Like Berries, Do You?” by Marjorie W. Pitzer. The book includes pictures of children eating various kinds of food such as chicken or bananas. Each page reads, “I like ________, do you?” When I read the book to H.W., she responds immediately each time I present the question. She waits for me to finish speaking before she responds, indicating that she is aware that when I am finished speaking, it is her turn to talk. Because she knows the word “yeah,” she is able to respond verbally at just the right moment.</p><p>Overall, additional research is needed to better understand pragmatic skills in individuals with Down syndrome at various linguistic stages and age levels. Yet, despite variability, current studies seem to suggest that individuals with Down syndrome have relatively strong pragmatic skills. 
From personal experience working with H.W., I find this to be true.</p></div> </div> <div class="ucb-article-content-media ucb-article-content-media-right col-lg"> <div> <div class="paragraph paragraph--type--media paragraph--view-mode--default"> </div> </div> </div> </div> </div> </div> </div> <h2> <div class="paragraph paragraph--type--ucb-related-articles-block paragraph--view-mode--default"> <div>Off</div> </div> </h2> <div>Traditional</div> <div>0</div> <div> <div class="imageMediaStyle large_image_style"> <img loading="lazy" src="/linguistics/sites/default/files/styles/large_image_style/public/feature-title-image/sanders-cover-small.jpg?itok=XYcoAgdf" width="1500" height="1079" alt> </div> </div> <div>On</div> <div>White</div> Wed, 09 May 2018 00:59:18 +0000 Anonymous 1156 at /linguistics Gender without Bodies /linguistics/2018/05/08/gender-without-bodies <span>Gender without Bodies</span> <span><span>Anonymous (not verified)</span></span> <span><time datetime="2018-05-08T18:50:00-06:00" title="Tuesday, May 8, 2018 - 18:50">Tue, 05/08/2018 - 18:50</time> </span> <div> <div class="imageMediaStyle focal_image_wide"> <img loading="lazy" src="/linguistics/sites/default/files/styles/focal_image_wide/public/article-thumbnail/catchen-cover_0.jpg?h=86f33eae&amp;itok=gm7zoqFJ" width="1200" height="800" alt="catchen cover"> </div> </div> <div role="contentinfo" class="container ucb-article-tags" itemprop="keywords"> <span class="visually-hidden">Tags:</span> <div class="ucb-article-tag-icon" aria-hidden="true"> <i class="fa-solid fa-tags"></i> </div> <a href="/linguistics/taxonomy/term/84" hreflang="en">Blogs</a> </div> <div class="ucb-article-content ucb-striped-content"> <div class="container"> <div class="paragraph paragraph--type--article-content paragraph--view-mode--default 3"> <div class="ucb-article-row-subrow row"> <div class="ucb-article-text col-lg d-flex align-items-center" itemprop="articleBody"> <div><h2>I have loved the film <em>Her </em>since I first 
saw it in the winter of 2012.</h2><hr><p><a href="http://verbs.colorado.edu/kest1439/Catchen-video.mp4" rel="nofollow">Catchen's&nbsp;video essay can be found here!</a></p><p>By Michael Catchen<br> Course: Language, Gender, and Sexuality (Ling 2400)<br> Advisor: Prof. Kira Hall; TA Ayden Parish<br>LURA 2018</p><p>A poster for the theatrical release is staring down at me as I write this. Writer/director Spike Jonze uses an absurd premise—a man starts dating his artificially intelligent computer—to tell a story that is truly first and foremost about human relationships. It is not difficult to see the parallels between <em>Her </em>and <em>Lost In Translation</em>, a 2003 film directed by Sofia Coppola (who is Jonze’s ex-wife). Both films deal with each director’s perspective on their divorce and with the communication breakdowns and periods of growing apart that are responsible for the end of many relationships, both romantic and platonic.</p><p>When thinking about potential research projects for the course<em> Language, Gender, and Sexuality</em> (LING 2400) during the Fall 2017 term, I quickly gravitated to <em>Her </em>as a topic because the operating system/love interest, who names herself Samantha, is undoubtedly a woman, but the process by which she is gendered is murky at best. Traditionally, performance of a gender identity is thought to be expected of a person based on their sex—females are expected to perform the identity “woman”—but Samantha has no body; her femininity is performed strictly through language, because that is the only vessel she has to communicate. 
Samantha’s gendering is the process by which Jonze moves her from the realm of the artificial to the realm of humanity, which is in and of itself a reflection on contemporary American culture’s failure to recognize people separately from their gender identity.</p><p>I decided to&nbsp;look into the process by which Samantha is gendered, and I found that the pivotal scene occurs when the protagonist, whose name is Theo, is initializing his brand new operating system. The OS setup assistant asks whether Theo wants his OS to have a male or female voice, already a problematic question, as it is well accepted in the existing literature that differences between male and female voices are not a result of simple biology, but rather are performed to adjust to social norms of what is considered “male” and “female” speech. With the seemingly innocuous decision to choose a female voice, the operating system is anthropomorphized into not just being female, but also performing femininity. Theo’s first impression on meeting Samantha says it all: “you seem like a person, but you’re just a voice in a computer.”</p><p>The project offered the choice between a written essay and a video essay, the latter being a fairly general category that uses video and a narrator to offer some form of analysis. 
In my experience, video essays are particularly well suited toward film analysis because scenes can be cut into the video, allowing the audience to hear dialogue and see composition directly, rather than relying on the extensive quotation and descriptions of framing that would be necessary in a traditional written essay.</p></div> </div> <div class="ucb-article-content-media ucb-article-content-media-right col-lg"> <div> <div class="paragraph paragraph--type--media paragraph--view-mode--default"> </div> </div> </div> </div> </div> </div> </div> <h2> <div class="paragraph paragraph--type--ucb-related-articles-block paragraph--view-mode--default"> <div>Off</div> </div> </h2> <div>Traditional</div> <div>0</div> <div> <div class="imageMediaStyle large_image_style"> <img loading="lazy" src="/linguistics/sites/default/files/styles/large_image_style/public/feature-title-image/catchen-cover.jpg?itok=QP4-pVwd" width="1500" height="2221" alt> </div> </div> <div>On</div> <div>White</div> Wed, 09 May 2018 00:50:00 +0000 Anonymous 1228 at /linguistics