Natural Language Understanding: Measuring the Semantic Similarity between Sentences (ongoing)

Starting Date:
Prerequisites: Strong knowledge of and experience with Python and PyTorch programming. Familiarity with basic machine learning and deep learning concepts and techniques.
Will results be assigned to the University:

Overview

Design and implement deep neural networks for measuring the semantic similarity between sentence pairs.

Background

Natural language understanding (NLU) is widely viewed as a grand challenge in Artificial Intelligence (AI). An important sub-task in NLU is to measure the semantic similarity between sentence pairs, also known as the Semantic Textual Similarity (STS) task. A good STS model has many applications, for example in search engines (finding the most relevant pages for a given query), question answering (clustering similar answers), document summarization (measuring how much of the information in the source documents is included in the summary), and automatic sentence paraphrasing (checking whether a revised sentence delivers the same meaning as the original). This project focuses on developing STS models using the latest machine learning and artificial intelligence techniques.

Arguably the simplest, yet surprisingly strong, method for measuring the similarity between a pair of sentences is to count how many words or phrases the two sentences share. However, this method can easily fail when synonyms, homonyms and heteronyms are used in the sentences. To tackle this problem, neural text embedding methods have been proposed: they learn to project sentences into a high-dimensional vector space in which semantically similar words (e.g. eggplant and aubergine) are neighbours. These neural methods yield better performance than counting overlapping words. However, recent studies show that neural methods still fail when comparing (i) sentences that use very different words but deliver very similar meanings (e.g. 'President Obama visited Beijing last week' and 'The first African-American US President arrived at Peking a few days ago'), and (ii) sentences that use very similar words but deliver completely different meanings (e.g. 'Hamilton beats Button and wins the game' and 'Button beats Hamilton and wins the game'). In this project, we aim to develop models that alleviate these problems.
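To make the two baselines above concrete, the following is a minimal sketch contrasting a word-overlap (Jaccard) score with a cosine similarity over pretrained sentence embeddings. The sentence-transformers library and the all-MiniLM-L6-v2 encoder are illustrative assumptions, not part of the project specification; any pretrained sentence encoder could be substituted.

import torch
from sentence_transformers import SentenceTransformer  # assumed to be installed


def jaccard_similarity(sent_a: str, sent_b: str) -> float:
    """Word-overlap baseline: fraction of shared words (Jaccard index)."""
    words_a, words_b = set(sent_a.lower().split()), set(sent_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def embedding_similarity(sent_a: str, sent_b: str,
                         model: SentenceTransformer) -> float:
    """Neural baseline: cosine similarity between sentence embeddings."""
    emb = model.encode([sent_a, sent_b], convert_to_tensor=True)
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()


if __name__ == "__main__":
    # Hypothetical choice of encoder, used only for illustration.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    a = "President Obama visited Beijing last week"
    b = "The first African-American US President arrived at Peking a few days ago"
    print(f"Jaccard:   {jaccard_similarity(a, b):.3f}")          # low: few shared words
    print(f"Embedding: {embedding_similarity(a, b, model):.3f}")  # typically higher

On the Obama/Beijing pair from case (i), the overlap score is close to zero because the sentences share almost no words, while an embedding-based score is typically higher; the project's goal is to build models that handle such cases, as well as case (ii), more reliably than either baseline.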

Prerequisites 

Strong knowledge of and experience with Python and PyTorch programming. Familiarity with basic machine learning and deep learning concepts and techniques.