Abstract
Personalization in human-robot interaction (HRI) has been shown to have powerful effects on both users’ perception of robots and objective interaction outcomes. Calling a human user by their name is an important signal that the robot understands the user and remembers information about them, yet it remains an ongoing challenge in HRI research because typical text-to-speech algorithms struggle to correctly pronounce the numerous names that exist even just in the English language. This paper presents a pipeline that fuses text and audio features to extract user information such as names and re-use it with the correct pronunciation. We discuss technical guidelines for implementation and remaining challenges.
Copyright Notice
The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.