Non-verbal interaction and amusement feedback capabilities for entertainment robots

エンターテインメントロボットにおける非言語的相互作用と娯楽フィードバック機能に関する研究

July 2015

Sarah COSENTINO

コセンティノサラ

Waseda University

Graduate School of Advanced Science and Engineering

Department of Integrative Bioscience and Biomedical Engineering

Research on Biorobotics

To my husband and my family.

…and to the creators of Skype!!

ACKNOWLEDGMENTS

“Home is behind, the world ahead, and there are many paths to tread through shadows to the edge of night, until the stars are all alight.”

― J.R.R. Tolkien, The Lord of the Rings

Research is an adventurous saga. It is a continuous series of successes and failures, of exhilaration and misery. The journey is long and difficult, as we follow a path that has not yet been traced to an unknown destination.

This epic mission would hardly be possible without the support of other people.

Here, I would like to express my gratitude to all of you who helped me in any way in my studies and my life, who made this journey worthwhile and meaningful, and who shaped me into a better researcher and, above all, a better person.

First of all, I would like to gratefully thank my supervisor, Prof. Atsuo Takanishi, for his confidence in my ability to complete the PhD course. He motivated me during many tough periods; his bright guidance and warm encouragement, together with his positive attitude, made my experience in his laboratory outstanding.

My particular gratitude goes to Prof. Salvatore Sessa for his guidance, patience, and friendship. He has always been by my side, applying the right amount of pressure to write papers and complete the research. He has been my mentor, encouraging me to grow not only as an engineer but also as a methodical thinker, with continuous and precious advice.

of paramount importance in overcoming technical and logistical issues. I would like to sincerely thank them for their ongoing support. I would also like to express my gratitude to the WB team, and in particular to Weisheng Kong and Di Zhang, for their assistance with the experimental setup and hardware development, and to the biped team, and in particular to Tatsuhiro Kishi and Takuya Otani, in charge of the research on the KOBIAN side.

During the course of my PhD I also had the great opportunity to meet and collaborate with the ST Microelectronics team, provider of the microcontrollers and sensors used in all the Waseda bioinstrumentation devices. This collaboration led to a two-month internship at the ST Microelectronics headquarters in Shinagawa, Tokyo, during which I learned very useful concepts in microcontroller programming and control applications. Special thanks go to Paolo Oteri, Matteo Maravita, Paolo Palma, and all the ST Italian team, who warmly welcomed me and helped me with kindness, friendship, and understanding: “At this table there are two engineers: three who add up to one, and one honoris causa”.

During the course of this PhD, I had the opportunity to participate in the JSPS Strategic Young Researcher Overseas Visits Program for Accelerating Brain Circulation, and to spend 10 months abroad in research laboratories at several universities: Carnegie Mellon University, in Pittsburgh, PA, U.S.; University College London, London, U.K.; and Aix-Marseille Université, Marseille, France. This was another incredible experience that gave me so much, for both professional and personal growth.

I would like to thank with all my heart everyone who made this experience truly unforgettable and so enriching. In the U.S.: Prof. Alex Waibel, Prof. Florian Metze, and senior researcher Susan Burger; Prof. Provine, with his insights on laughter research; Faith Boldt, Eric Riebling, Nikolas Wolfe, and Lara Martin; and the Darlings at Darlington, my American family: George, Naomi, Jasmine, Jared, and Charlie. I hope to see you guys all very soon :D!

In the U.K.: Prof. Nadia Bianchi-Berthouze, Dr. Harry Griffin, Dr. Hane Aung, Dr. Paul Marshall, Louise Gaynor, and the whole UCLIC lab. In the U.K. I had the occasion to participate in the Humor summer school, where I met remarkable people and professors, whom I hope to meet again and with whom I would love to collaborate in the future: Prof. Jessica Milner Davis, Prof. Christie Davies, Prof. Willibald Ruch, Prof. Graeme Ritchie, Dr. Sharon Lockyer, Dr. Elena Hoicka, Dr. Alessandro Valitutti, and Livia Cadei.

In France: special thanks to Prof. Thierry Chaminade, whose projects and methods were very interesting and with whom I hope to collaborate (again) very soon; and of course to all the Italians at Amadeus.

The research described in this thesis would not have been possible without the technical expertise and the scientific acumen of Dr. Klaus Petersen and Dr. Zhuohua Lin, former PhD students at Takanishi laboratory and co-founders of LP-Research, who also developed the first prototypes of WB IMU and WB EMMG.

Special thanks also to Prof. Massimiliano Zecca, who made all of this happen by replying to my first enquiry email, years ago. His habit of answering my academic questions with other questions really helped me to become a critical thinker and a high achiever, and to develop the right degree of academic maturity and independence.

Thanks from the bottom of my heart also go to Mrs Hisako Ohta, for her continuous support, kindness, and patience in helping me solve many problems with regular university affairs. Thanks also to the other members of Takanishi laboratory, especially the students who have contributed to this research.

I would like to express my deepest gratitude to Prof. Shuji Hashimoto, Prof. Tetsunori Kobayashi, Prof. Mitsuo Umezu, and Prof. Masakatsu G. Fujie, who gave me a great deal of advice on how to complete my thesis and provided corrections to my dissertation.

This work has also been supported in part by the Global COE Program "Global Robot Academia", MEXT, Japan; by the JSPS Strategic Young Researcher Overseas Visits Program for Accelerating Brain Circulation; by the MEXT scholarship for overseas students; and by a grant from STMicroelectronics. I would also like to express my gratitude to RoboCasa and the Humanoid Robotics Institute for their support of the research.

Finally, I would like to thank my family. The support and encouragement of my mother, father, and sister have been fundamental to completing this experience. The love and patience of my husband Alessandro, especially his resilience during these long years of separation, have contributed every single day to the accomplishment of this work. I love you, Alessandro.

ABSTRACT

Robots are no longer confined within factory walls and are increasingly entering our daily lives. In the near future, robots are expected to play a major role in society, assisting humans not only in difficult or dangerous tasks but also in everyday chores. In particular, entertainment would be a huge field of application, ranging from simple performance robots to educational, assistive, and personal robots for the elderly or for physically and mentally challenged persons. In this framework, robots will need not only to perform their tasks as well as or better than humans, but also to communicate with humans naturally and smoothly, at the same level of perception.

At the moment, operating a robot requires training, or at the very least an instruction manual to refer to. Most of the cognitive load of communication is placed on the user: it is the user who has to learn and adapt to the robot's language. So far, communication issues have held back the widespread adoption of robots. To promote mainstream robot adoption in the future, the communication load must be shifted onto the robot, which then needs to learn how to communicate with humans. Smooth communication and natural interfaces are not only important for Robotics but also necessary for ensuring the acceptance and penetration of technology among all layers of human society.

To implement a natural way of communication between robot and humans, we must start from the observation of natural human-human communication patterns.

Communication among humans is performed through the simultaneous use of both verbal and non-verbal communication.

Both types of communication present several cognitive challenges for robots. In verbal communication, humans use a single channel, spoken language, to convey information. However, hundreds of languages are currently spoken in the world, each with different phonetic, grammatical, and prosodic rules and a large vocabulary, so handling them all would require virtually unlimited memory and computational capacity. Non-verbal communication, on the other hand, is achieved through several channels: kinesics, haptics, appearance, proxemics, chronemics, paralanguage, silence, and environment management. Some of these depend on the particular culture, and some are culture-independent and virtually universal, characteristic of humans as a species.

Non-verbal behavior is as important as verbal communication; it complements and enriches it, and sometimes contradicts and altogether replaces it.

In particular, non-verbal communication plays an important role in entertainment performances: skilled entertainers know when and how to move to convey information effectively, can read the audience's emotional reaction, and can adapt their performance accordingly. Entertainment robots should therefore learn how to communicate effectively with a human audience.

In this thesis, the road towards the integration of social communication skills in entertainment robots is presented.

The specific goal of this work is to develop a communication interface for a musical entertainment robot, enabling the robot to detect and process human non-verbal social signals, both conscious and unconscious. More specifically, the proposed system aims to solve this problem: how can human body language, both conscious and unconscious (emotional), be automatically detected and processed? The advantage of such a system is that an entertainment robot will be able to perceive commands and emotional feedback from the interacting human partner and adapt its performance accordingly.

More interestingly, this system can also be used in several different applications, in the fields of Robotics and Healthcare.

The aim of this work is achieved by:

1) Analyzing human motion carrying communication and emotional information
2) Combining a human tracking system and surface electromyography in a body communication interface system
3) Verifying that the proposed system is effective in recognizing human social information

In summary, this research demonstrates a general approach to the perception of human non-verbal communication signals. The system is reconfigurable and can be further expanded to detect different types of non-verbal communication signals, both conscious and unconscious, and adapted to different scenarios.

The validity of the proposed system is verified with specific experiments in the field of musical and humorous entertainment. The results are promising and indicate that this is an effective approach to enhancing human-robot interaction and taking entertainment robots to the next step. In addition, the methodology can be used more generally to achieve natural human-robot interaction in any application field that would benefit from it, such as education, medicine, and personal services.

The research has been carried out in Japan, the United States, the United Kingdom, and France.

This thesis consists of 7 chapters, which present the background of the issue addressed, the theoretical and empirical notions on which the proposed methodology is based, the specific robot platforms used, the experiments carried out to test and validate the methodology, and a discussion of the limits and possible extensions of this work. The thesis is laid out as follows:

Chapter 1 introduces the background with a detailed analysis of the motivations behind this work. I explain the theory of entertainment, why entertainment is important, and in particular what the current limitations of entertainment robotics are.

Chapter 2 is dedicated to the specific entertainment robot platforms used in the development of this work: the Waseda Flutist robot WF-RVII, the Waseda Saxophonist robot WAS-3, and the Waseda humanoid emotional robot KOBIAN. It describes their characteristics and purpose, and the limitations that the present work is meant to

particular paralanguage and kinesics, conscious and unconscious communication. It contains the basics of emotions and physiological changes related to communication signals, and it explains the methodology proposed for the detection and analysis of both cognitive and emotional kinesics. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 4 is the continuation of the work in Chapter 3. It addresses the problem of recognizing a natural language, in the form of direct, intentional kinesics, and it presents the practical implementation of the proposed general human-robot interaction method on the Waseda Flutist robot WF-RVII, according to its specific needs. In particular, the basic theory of music and musical interaction is presented, together with a novel non-verbal, direct interaction framework. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 5 instead addresses the problem of conscious emotional interaction.

Conscious emotional signals mimic, and are modelled intuitively on, unconscious emotional signals as the subject is used to perceiving them. Unlike symbolic natural-language signs, these signals are not fixed and are more subjective; however, they are more universally recognized by humans as a species. The proposed general human-robot interaction method is tested on the Waseda Saxophonist robot WAS-3, interacting with a dancer. In particular, basic notions of modern and expressive dance are presented, together with an analysis of emotional body movement expressions in dance. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 6 is the continuation of the work in Chapters 3, 4, and 5. It addresses the problem of indirect, unconscious emotional expression, and it presents the practical implementation of the proposed general human-robot interaction method on the Waseda humanoid emotional robot KOBIAN, as an entertainment comedian robot. In particular, the theory of humorous interaction is presented, together with a novel non-verbal, emotional interaction framework. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 7 concludes the thesis. Results are restated and evaluated from a general perspective. Broader considerations and future work are discussed, showing the overall contribution of this thesis as well as possible future research directions.

In conclusion, the following results were achieved. A framework for conscious and subconscious kinesics interaction was developed, allowing robots to understand and interpret direct and indirect human social signals. Specifically, the developed interaction system and interface allowed the Waseda Flutist robot WF-RVII to follow the directions of an orchestra conductor and adapt its performance accordingly, and the Waseda Saxophonist robot WAS-3 to interact intuitively with an expressive dancer. An expanded version of the system would allow the Waseda humanoid emotional robot KOBIAN to detect amusement cues, in the form of spontaneous laughter, from its audience, tuning its humorous performance accordingly.

Through the method for perceiving conscious and subconscious human kinesics proposed in this thesis, robots will be able to perceive natural human social communication signals. This increases the robots' decision-making power during interaction-related tasks, makes them more effective, and widens their field of action. As a result, not only the robots used in this work but the whole field of robotics can take advantage of the concepts in this thesis. According to the interaction mechanisms described in The Media Equation by Reeves and Nass, in fact, “Individuals’ interactions with computers, television, and new media are fundamentally social and natural, just like interactions in real life;” or, better said, the effects on people of interacting with media, and especially with autonomous agents like machines and robots, are often profound, leading them to behave and respond to these media in unexpected ways, of most of which they are completely unaware. For this reason, not only humanoids but any kind of service robot should be able to understand both conscious and subconscious human communication, as people tend to respond to them as they would to another person.

List of figures

Figure 1.1 Entertainment market revenues trend

Figure 1.2 Jordi-stick, an entertainment therapeutic application

Figure 1.3 Educational entertainment infographic

Figure 1.4 The theory of Spatial Presence

Figure 1.5 Flow chart of thesis chapters

Figure 2.1 Waseda Flutist Robot WF-4RVI

Figure 2.2 Lips, oral cavity and vocal cord structures

Figure 2.3 Tonguing mechanism

Figure 2.4 Efficiency of sound generation

Figure 2.5 Attack time comparison

Figure 2.6 Waseda Saxophonist Robot WAS-3

Figure 2.7 Waseda Emotional Expression Robot KOBIAN

Figure 2.8 KOBIAN emotional expression head

Figure 2.9 Direct visual musical interaction principle

Figure 2.10 Direct visual musical interaction levels: basic

Figure 2.11 Direct visual musical interaction levels: extended

Figure 2.12 Unconscious amusement feedback recognition

Figure 3.1 Human social signals classification

Figure 3.2 Typical examples of emotional body postures and facial expressions

Figure 3.3 The ABC of Psychology

Figure 3.4 Personality and social intelligence

Figure 3.7 General CAN hybrid architecture

Figure 3.8 General wireless system architecture

Figure 3.9 WB IMU

Figure 3.10 WB EMG

Figure 4.1 Example of musical score

Figure 4.2 An orchestra conductor, Seiji Ozawa (小澤征爾)

Figure 4.3 Conducting movement representing musical parameters

Figure 4.4 Examples of Tempo notation in BPM

Figure 4.5 Musical conductor

Figure 4.6 Measurement system settings

Figure 4.7 Acceleration data

Figure 4.8 Acceleration norm peak recognition

Figure 4.9 Acceleration peaks detection for the computation of Tempo in BPM

Figure 4.10 Dynamics discrimination by thresholding

Figure 4.11 Principal Component contribution rate

Figure 4.12 Interaction system diagram

Figure 4.13 Calculated Tempo

Figure 4.14 Beat patterns performance comparison

Figure 4.15 Comparison between professional conductor and novices

Figure 4.16 Dynamics discrimination

Figure 4.17 Articulation discrimination

Figure 4.18 Discrimination between Articulations with similar meanings

Figure 4.19 Discrimination between macro groups of similar meanings Articulations

Figure 4.20 Comparison between professional conductor and novices

Figure 4.21 Articulation represented by a student

Figure 4.22 Result of perception experiment

Figure 5.1 Convex polygon from 2D images

Figure 5.2 Limb extensions from 3D images

Figure 5.3 Comparison between 2D and 3D performances

Figure 5.4 Correlation between musical and movement parameters

Figure 5.5 Phrase selection criteria

Figure 5.6 Specific phrase selection system diagram

Figure 5.7 Comparison of two real-time adaptation algorithm performances

Figure 5.8 Interaction system diagram

Figure 5.9 Result of perception experiment

Figure 6.1 Literature by sensor technique, percentages

Figure 6.2 Literature by sensor technique, in number of articles

Figure 6.3 Laughter temporal segmentation

Figure 6.4 Approximate temporal segmentation in seconds

Figure 6.5 Correlation between call duration and call amplitude and position in bout

Figure 6.6 Spectra of a pulse of typical male and female laugh

Figure 6.7 Respiration muscles

Figure 6.8 Respiration parameters during laughter

Figure 6.9 Larynx muscle activation

Figure 6.10 Facial expression in Duchenne and non-Duchenne laughter

Figure 6.11 Trunk movement related to laughter vocalization

Figure 6.12 Laughter body movements depending on various degrees of arousal

Figure 6.13 EMG electrodes positions for laughter measurement

Figure 6.14 Sensor positioning

Figure 6.15 Umbilical IMU PC contribution rate

Figure 6.16 Umbilical EMG PC contribution rate

Figure 6.17 Experiment setup

Figure 6.18 EMG electrodes positions for laughter measurement

List of tables

Table I Examples of Basic Tempo markings

Table II Dynamics Markings

Table III Examples of Basic Articulation markings

Table IV Weight of each parameter

Table V Musical expression implementation

Table VI Ictus detection

Table VII Movement patterns and related evoked emotional states

Table VIII Musical patterns and related evoked emotional states

Table IX Musical patterns and related evoked emotional states

Table X Laughter temporal characteristics

Table XI Laughter frequency characteristics

Table XII Umbilical IMU PCA LOOCV results

Table XIII Umbilical EMG PCA LOOCV results

Table XIV EMG data stream reliability

Table XV Confusion matrix: LD 7 classes

Table XVI Confusion matrix: LD 6 classes

Table XVII Confusion matrix: LD and SVM binary

Table XVIII Classification results

Chapter 1 Entertainment and Robots

1.1 Introduction

This introductory chapter presents the research background, the problem statement, and the specific goals and contributions of this research, and concludes with the outline of the thesis. This research lies at the intersection of Robotics, Affective Computing, and Medicine, as it concerns the study of methodologies and the development of devices that can measure human physiology and recognize and interpret human behavior. It is therefore an interdisciplinary work spanning several fields, from robotics and computer science to medicine, psychology, and cognitive science. A variety of topics in these fields will be covered in this introduction and throughout the thesis. The studies and findings of this research, demonstrated here with a specific application, propose a more general methodology and theory that can be applied to other applications.

1.2 Background

1.2.1 Entertainment and its application

Entertainment is defined in the Oxford Dictionaries as “the action of providing or being provided with amusement or enjoyment” [1].

In fact, the experience of being entertained has become strongly associated with being amused, so that the idea is commonly understood as fun and laughter, although many entertainment activities have serious purposes, as in the case of various forms of ceremony, celebration, religious ritual, or even satire. More generally, entertainment denotes a form of activity that holds the attention of an audience, possibly giving pleasure and enjoyment. Hence, entertaining activities may also be a means of achieving insight and spiritual or intellectual growth.

Humans have engaged in entertaining activities, to various ends, for centuries [2]–[4].

With the advent of cheap and fast means of mass communication, entertainment has come to play an increasingly prominent role in our everyday life, as media provide a growing number of entertainment opportunities and products in response to the variety of individual preferences. Entertainment media have become the driving force of the new world economy, generating trillions of dollars in business revenue each year.

Figure 1.1 Entertainment market revenues trend – Image from http://www.statista.com/

The power of entertainment to naturally attract and hold people's attention can be exploited for several purposes. Advertainment (a neologistic portmanteau of advertising and entertainment) has long been used successfully to promote products or brands by placing specific and recognizable references in entertainment media [5], [6]. The application of entertainment to educational purposes gives rise instead to Edutainment, content designed to convey educational values whilst entertaining the audience [7]. Effective edutainment has existed for millennia in the form of parables and fables conveying ethical concepts and morals. Toys and games are normally the earliest edutainment products a person will come across in her life: manufactured nowadays in a virtually infinite variety and generally labeled according to the expected users’ age range, they can be used to attain and refine skills, to reinforce or expose character traits, and to explore talents and interests [8], [9]. In recent years, there have been several attempts to use Entertainment in the medical field, with entertaining therapeutic activities designed to promote motivation during rehabilitation programs, or to enhance mental health by counteracting the effects of chronic depression [10]–[14].

Figure 1.2 Jordi-stick, an entertainment therapeutic application for cystic fibrosis therapy – Image from Jordi-stick commercial website, http://jordi-airflow.de/der-jordistick/

Figure 1.3 Educational entertainment infographic – Image from An Ethical Island Blog

1.2.2 Entertainment theory

To design and produce effective entertaining products, a deep understanding of the underlying mechanisms and processes of entertainment is necessary.

Dolf Zillmann is the pioneer of research on entertainment: since the late 1970s, he has undertaken systematic research on the motivations, uses, and effects of entertainment.

Over 30 years, Zillmann and his colleagues have developed and refined the theoretical concepts surrounding entertainment, carrying out a series of empirical experiments to discover what entertainment is, how it works, what its effects on the audience are, and why the audience is so attracted to it [15]–[20].

Zillmann defines entertainment as “any activity designed to delight, and, to a smaller degree, enlighten through the exhibition of fortunes and misfortunes of others, but also through the display of special skills by others and/or self” [16]. This definition is based on the general assumption that entertainment audiences are hedonistically oriented: they seek to maintain and foster positive moods and to alter and reverse negative ones.

On this basis, a number of different theories of entertainment have been developed, which can be combined in very interesting ways.

In particular, the main theories used today to explain specific forms of entertainment are:

• Affective-disposition

• Excitation-transfer

• Mood-management

• Selective-exposure

Affective-disposition theory is best at describing the state of the audience during a narrative media experience. Based on Zillmann's idea of empathy, it essentially hypothesizes that the audience, analyzing the events in the narrative, judges the actions of the characters and develops either a positive or a negative affective disposition towards them, building up expectations, hopes, and fears about the characters’ destiny, which eventually result in specific affective reactions depending on the actual story outcome.

Excitation-transfer theory attempts to explain why the audience is attracted by specific contents that lead to burdening experiences during exposure, for example sad or even horror movies. The general idea is that the more stressful and intense the negative experience is, the more intense and gratifying the resulting feelings of relief will be after the exposure.

Mood-management and selective-exposure theories are the most intertwined, and affirm, quite obviously, that the audience selects specific media products rationally to either enhance or alter their mood [18].

These theories have been thoroughly tested, and are strong and sound. However, they are very general, and should be differentiated, refining the initial hypotheses, according to individual differences and situation diversification:

• Instead of a general, unspecified user, specific types of users, differing in age, culture, social status, needs, and individual character, should be accounted for

• Instead of a single, specific leisure-time situation, the different social situations and cultural contexts in which entertainment can be experienced should be imagined and tested

• Instead of prototypical entertainment products, entertainment products with specific characteristics and stimuli elicitation mechanisms should be considered and designed

From this starting point, it has been observed that males and females do not generally prefer or fully understand the same types of entertainment products, and neither do populations of different ages or cultures. This led to the development of specific types of products according to the targeted population and application purposes.

1.2.3 Humor, laughter, and entertainment

As the Oxford Dictionaries definition of entertainment itself indicates, the experience of being entertained has come to be strongly associated with amusement, fun, and laughter.

In fact, it is undeniable that humor and laughter-eliciting activities constitute a large share of entertainment practices. This is directly linked to the theory of mood-management, which posits that people learn, essentially by trial and error, which kind of media entertainment best helps them to repair and improve their emotional condition. When distressed or in an unpleasant mood, they simply engage in activities that have helped them in the past to snap out of such undesirable states.

Not all genres of media entertainment have the same capacity for distress or mood repair. Humor and lighthearted comedy, with its belittlement of everyday problems, is thought to have more of this capacity than, say, crime drama and tragedy. Researchers expected therefore, and found, that many would have developed a tendency to call on comedy for the repair of their gloom [21]–[23].

On the premise that humor, because it often trivializes life problems and converts them to laughing matters, diminishes anguish, it is treated as a potential antidote to stress, both in cognitive and endocrinological terms [24]–[27]. Prolonged aversive experiences are known to instigate increased release of stress hormones, mostly cortisol in some form, and also to impair immunological function and ultimately health. Humor, or a humorous disposition that ensures frequent lighthearted responses to problems, can thus be expected to curb and prevent many of these pathogenic effects. Irrespective of potential health benefits, however, humor appears to hold promise as a mood repairer that enhances the quality of life by carrying with it a more positive outlook, increased initiative, and greater tolerance of adversity.

Introducing humor into the daily routine of people in situations of distress and depression, for example elderly people who are lonely or in poor health, or people with any kind of psychophysical problem, could improve their overall health and help them regain psychological equilibrium [28], [29].

1.2.4 Social interaction and entertainment

Technological advances and the proliferation of interactive media products, which significantly affect both the audience's perception of the media and its expectations about them, have opened a completely new direction in the entertainment experience, and therefore in the concept and design of entertainment products.

Increasingly, users are no longer satisfied with the role of pure observer: in a growing number of situations, they want to be involved in, and have an influence on, the narrative they are exposed to [30].

This should not come as a surprise.

Entertainment is an activity that holds the user’s attention, and to keep attention focused, a stimulus must be delivered repeatedly to continuously capture the subject’s alertness and cognition. Interactivity is a key stimulus for maintaining high levels of alertness and attention, as the subject is forced to focus on and process the ongoing events in order to adjust her response [31]. If the stimuli that create and maintain attention are effective, users become so involved in the entertaining narrative that they are completely immersed in it, experiencing a sensation of being spatially located in the mediated environment and perceiving the media contents as ‘real’ [32]–[34].

Figure 1.4 The theory of Spatial Presence

Image from The Psychology of Video Games website, http://www.psychologyofgames.com/

This is the most powerful characteristic of interactive entertainment media: by immersing themselves in the mediated world, subjects temporarily ignore the physical and social reality in which they are actually living. This can help them cope with the burdens of actual reality, and it is particularly sought after in situations in which the subject’s attitude to reality blocks her chances of improvement or decreases her quality of life.

This naturally leads to the application of entertainment in both the educational and therapeutic fields. Learning or rehabilitation processes, for example, can be perceived as very boring or even hopeless, with tedious lessons and repetitive, seemingly useless exercises that expose the subject’s weaknesses and inabilities. The “gamification” of educational or therapeutic activities would promote attention and improve the learning or healing curve.

1.3 Entertainment robotics

Entertainment robots have existed for centuries, and they were arguably the very first type of robot ever developed.

In fact, the first known robot was created around 400-350 BC by the mathematician Archytas and was an artificial bird. Archytas, who is sometimes referred to as the “father of mechanical engineering,” constructed his robot bird out of wood and used steam to power its movements. In its best recorded run, the robot bird flew about 200 meters before running out of steam [35], [36].

Early examples of robotic art and theater existed also in ancient China as far back as the Han Dynasty (third century BC), with the development of an elaborate functional mechanical orchestra, and other hydraulically actuated mechanical toys, including flying automatons, mechanical animals, angels and dragons, and automated cup-bearers, for the amusement of Emperors [37].

Leonardo da Vinci designed and built several theatrical automata, including a lion which walked onstage and delivered flowers from its breast, and, around 1495, the first known humanoid robot [35]–[38].

Nowadays, technology has made possible the development of more complex robots, with autonomous control systems, able to perform a variety of tasks [36]. Entertainment robotics has also evolved and expanded: mechanical and electronic toys, as well as bigger robot performers and larger mechanical installations such as amusement parks, naturally draw the attention of millions.

However, entertainment robots for educational and therapeutic purposes have to override strong feelings like boredom or pain to engage the audience. Thus, to maximize their entertainment power and offer an engaging and immersive experience that keeps the user’s attention continuously high, they should acquire an important skill: the ability to interact with the audience. But how? Research shows that humans tend to treat and interact with new media, computers and robots as if they were real people [33], [34]. Robots should therefore learn how humans interact socially, and interact with humans using natural and intuitive methods, to avoid adding cognitive load on the user and to offer a completely natural experience.

1.4 Problem statement

Humans generally seek entertainment to enhance their mood and cope with difficult situations in everyday life. In this framework, entertainment can become a powerful tool to promote attention during useful but boring or repetitive tasks, or to enhance the general quality of life during difficult situations. Entertainment robots, which have proved very successful among humans throughout history, could prove very useful in both the educational and the healthcare fields, by keeping students and patients focused during learning or rehabilitation activities, or by supporting challenged individuals with company and humor.

However, in order to override boredom, pain or depression and engage the audience with an immersive experience, entertainment robots must be able to interact naturally with their human counterpart. In particular, they must be able to:

1) understand commands and requests from the human partner

2) perceive the human partner’s interest and amusement levels

1.5 Goal of this research

1.5.1 Aims

In this thesis, the road towards the development of futuristic entertainment robots able to understand and naturally respond to human communication signals is presented.

These robots will interact with the audience keeping their attention high and offering an engaging immersive entertainment experience, and could be used in the educational and healthcare fields, to support learning and rehabilitation processes. The specific goal of this work is to develop a method enabling a robot to perceive both conscious, direct commands and subconscious, emotional human communication signals.

More specifically, in this work the aims are achieved by:

1) developing a new reconfigurable wearable system to capture and analyze human movement carrying communication information

2) combining human motion tracking and physiological measurement data to enhance the recognition of human behavior and emotion

3) verifying that the proposed system can be reconfigured and used in various settings to perceive both conscious, direct and emotional, indirect communication from humans, in particular non-verbal gestural commands and amusement cues.

1.5.2 Contribution and Innovation

This thesis tackles several fascinating challenges. On one hand, it demonstrates how a robot with interaction ability is much more dynamic and entertaining than a standalone robot, and generates higher interest in the audience.

From a scientific point of view, it shows that it is possible to extract non-verbal communication information from natural human motion, and also to detect, identify and measure human emotional behavior.

From a technical point of view, it challenges the development of ecological wearable systems capable of monitoring several human physical and physiological parameters and of analyzing them to extract meaningful communication and emotional messages.

This research demonstrates an innovative approach to the perception of human non-verbal communication signals. This approach explores the potential of a wearable sensor system which is reconfigurable, can be further expanded to detect different types of non-verbal communication signals, both conscious and unconscious, and can be adapted to different scenarios. The validity of the proposed system and methodology is verified with specific experiments in the field of musical and humorous entertainment. The results are extremely promising, and show that this is an effective approach to enhance human-robot interaction and take entertainment robots to the next step. In addition, this methodology can be used as a general means of achieving natural human-robot interaction in the many application fields in which such interaction would be beneficial, such as the educational, medical, and personal service fields.

In conclusion, the following results were achieved. A framework for conscious and subconscious kinesics interaction was developed, that can be used by robots to understand and interpret human direct and indirect social signals. Specifically, the developed interaction system and interface allowed the Waseda Flutist robot WF-RVII to follow the directions of an orchestra conductor and adapt its performance accordingly, and the Waseda Saxophonist robot WAS-3 to interact intuitively with a dancer in a joint emotional performance. An expanded version of the system would allow the Waseda humanoid emotional robot KOBIAN to detect amusement cues, in the form of spontaneous laughter, from its audience, and adapt its humorous performance accordingly.

Through the method for perceiving conscious and unconscious human kinesics proposed in this thesis, it will be possible for robots to perceive natural human social communication signals. This increases the decisional power of the robots during interaction-related tasks, makes them more effective, and widens their field of action. As a result, not only the robots used in this work, but the whole field of robotics can take advantage of the concepts in this thesis. According to the interaction mechanisms described in The Media Equation by Reeves, in fact, “Individuals’ interactions with computers, television, and new media are fundamentally social and natural, just like interactions in real life” [33]. In other words, the effects on people interacting with media, and especially with autonomous agents like machines and robots, are often profound, leading them to behave and respond to these media in unexpected ways, of most of which they are completely unaware. For this reason, not only humanoids, but any kind of service robot should be able to understand both conscious and subconscious human communication, as people tend to naturally respond to them as they would to another person.

1.6 Thesis Outline

The research was carried out in Japan, the United States, the United Kingdom and France.

This thesis consists of 7 chapters, which present the background of the issue addressed, the theoretical and empirical notions on which the proposed methodology is based, the specific robot platforms used, the experiments carried out to test and validate the methodology, and a discussion of the limits and possible extensions of this work. The thesis is laid out as follows:

Chapter 1 introduces the background with a detailed analysis of the motivations underlying this work. I explain the theory of entertainment, why entertainment is important, and in particular the current limitations of entertainment robotics.

Chapter 2 is dedicated to the specific entertainment robot platforms used in the development of this work: the Waseda Flutist robot WF-RVII, the Waseda Saxophonist robot WAS-3, and the Waseda humanoid emotional robot KOBIAN. It describes their characteristics and purpose, and the limitations that the present work is meant to overcome. In particular, it gives an overview of the state of the art of the robots and their specific interaction needs and purposes.

Chapter 3 explains in depth the theory of human social communication, exploring in particular paralanguage and kinesics, and conscious and unconscious communication. It covers the basics of emotions and the physiological changes related to communication signals, and it explains the methodology proposed for the detection and analysis of both cognitive and emotional kinesics. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 4 continues the work of Chapter 3. It addresses the problem of recognizing a natural language, in the form of direct, intentional kinesics, and it presents the practical implementation of the proposed general human-robot interaction method on the Waseda Flutist robot WF-RVII, according to its specific needs. In particular, the basic theory of music and musical interaction is presented, together with a novel non-verbal, direct interaction framework. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 5 instead addresses the problem of conscious emotional interaction. Conscious emotional signals mimic, and are modelled intuitively on, unconscious emotional signals as the subject is used to perceiving them. Unlike symbolic natural language signs, these signals are not fixed and are more subjective; they are, however, more universally recognized by humans as a species. The proposed general human-robot interaction method is tested on the Waseda Saxophonist robot WAS-3, interacting with a dancer. In particular, basic notions of modern and expressive dance are presented, together with an analysis of emotional body movement expressions in dance. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 6 continues the work of Chapters 3, 4 and 5. It addresses the problem of indirect, unconscious emotional expression, and it presents the practical implementation of the proposed general human-robot interaction method on the Waseda humanoid emotional robot KOBIAN, as an entertainment comedian robot. In particular, the theory of humorous interaction is presented, together with a novel non-verbal, emotional interaction framework. At the end of the chapter, the limitations and possible extensions of the proposed approach are discussed.

Chapter 7 concludes the thesis. Results are restated and evaluated from a general perspective. Broader considerations and future work are discussed, showing the overall contribution of this thesis as well as possible future research directions.

Figure 1.5 Flow chart of thesis chapters.

Chapter 2 Waseda entertainment robots

2.1 Introduction

As already pointed out in the introductory chapter, entertainment robots were the very first type of robot ever developed. These early robots were developed mainly as original mechanical art pieces for the amusement of the power elite. Nowadays robots come in many shapes and sizes, optimized to carry out specific tasks [39]. It is no coincidence that modern entertainment and companion robots are mostly shaped like animals or humans: early acceptance studies demonstrate that animaloids and humanoids have high rates of social acceptance and are good at engaging users in cognitive stimulation activities that require advanced social skills [40]–[43].

Entertainment robots need different communication abilities, depending on the type and the purpose of the robot. For example, musical robots must be able to interact naturally with humans without disrupting the entertainment experience, so in this specific case, without interrupting their musical performance. For this reason, they must be able to understand non-verbal silent signals from their human partners and audience.

Comedian or companion robots, instead, can choose to interact more directly with their audience. In order to assess the success of their performance, they should look for unconscious amusement signals from the audience, for example smiles or laughter [44].

In this chapter the Waseda entertainment robots are briefly introduced, and their interaction abilities and needs are described, to show the motivations behind both the specific human-robot communication system implementation choices of this research, and the experiments carried out to validate such system [45].

Research on entertainment robots at Waseda University started more than 30 years ago with the late Professor Kato, who developed a humanoid robot, WABOT (WAseda roBOT), capable of reading musical notation in real time and playing the electronic organ with both hands and both feet, as well as talking freely with people [46]. Besides being naturally attractive and interesting as fine pieces of technology, instrument-playing robots are also comparable to high-level musicians in terms of musical performance, because they are usually designed to overcome the specific difficulties human musicians face with the chosen instrument [46]–[49]. Moreover, their virtually unlimited memory capacity and real-time abilities allow them to store several long musical pieces at a time, or to execute a new score immediately and perfectly. Research on entertainment musical robots continued under the guidance of Professor Takanishi, and two more robots have been developed: first the WF (Waseda Flutist robot) [50], [51] and, most recently, the WAS (WAseda Saxophonist robot) [52], [53].

In recent years another humanoid entertainment robot has been developed at Waseda University, the Emotion Expression Biped Humanoid Robot KOBIAN-RII [54]. This robot was developed to study the natural emotional aspects of social interaction between humans and robots. KOBIAN can walk, talk and express basic human emotions with both postural and facial displays. Its entertainment abilities are less specific than those of WF and WAS, so it should be able to recognize a wider range of human communication signals, in particular unconscious emotional expressions.

2.2 Waseda Flutist No.4 Refined VI

The Waseda Flutist Robot plays the flute at the level of an intermediate musician. In particular, this research has been carried out with the WF-4RVI (Waseda Flutist No.4 Refined VI), the refined fourth version of WF [51].

This version of the robot has a total of 41 DoFs (Figure 2.1), mechanically simulating the human organs involved in flute playing. With the aim of enhancing the expressiveness of its performance, the mechanical design of these simulated organs has been greatly improved compared to the previous versions, and automated algorithms for the generation of emotionally expressive music performance have been developed.

Figure 2.1 Waseda Flutist Robot WF-4RVI (DoF configuration; total: 41 DoFs)

The robot now features human-like vocal cord, lips, oral cavity (Figure 2.2) and tonguing mechanisms (Figure 2.3). These structures proved to be very useful in the production of vibrato and in the effective control of sound attack time and double tonguing. Precise control of sound attack time is essential to reproduce musical expression, and double tonguing is an important technique enabling players to shape note contours and control transitions between notes. The lips mechanism comprises 3 DoFs for the accurate control of the air stream parameters (width, thickness and angle). The lips and oral cavity of the robot reproduce the structure and texture of human lips and oral cavity by using a special thermoplastic rubber, Septon. Inside the oral cavity, an improved tonguing mechanism was implemented to better reproduce the double tonguing technique. A series of experiments was performed to analyze the improvements in the dynamic properties of the sound during play, compared to the previous versions of the robot (Figure 2.4, Figure 2.5). These dynamic properties are very important for reproducing specific musical features with precision, at the same level as a professional human player.

Figure 2.2 Lips, oral cavity and vocal cord structures

Figure 2.3 Tonguing mechanism

Figure 2.4 Efficiency of sound generation

Figure 2.5 Attack time comparison

2.3 Waseda Saxophonist No.3

The development of the WAseda Saxophonist Robot (WAS) started in 2008 [52]. In 2013, the third version of the robot (WAS-3) was completed, offering improved performance stability thanks to the redesigned shape of the oral cavity and the addition of embedded sensors in the fingers.

WAS-3 has 28 DoFs, reproducing the physiology and anatomy of the organs involved in saxophone playing (Figure 2.6).

Figure 2.6 Waseda Saxophonist Robot WAS-3

In particular, this robot is equipped with eye mechanisms to enable the robot to perceive other musicians in its environment and to react to their performance actions.

The mechanism contains two CCD cameras that can be moved in 2 DOF, along the pitch and yaw axes. The mechanical capabilities (velocity, viewing angle) of the mechanism are designed to resemble the characteristics of the human vision apparatus.

The waist has a rotary mechanism with a pitch DoF, designed to send physical full-body signals, such as attack time, to performance partners, as a human musician would.

2.4 Emotions expression robot KOBIAN

KOBIAN is a 48-DoF humanoid robot designed to study the potential of using humanoids to provide both physical and emotional support to elderly people and subjects with impairments during Activities of Daily Living (ADL) [55]. Bipedal walking, humanoid shape and emotional expression abilities make KOBIAN a good candidate to move in a human environment and interact socially with humans at the same level of perception (Figure 2.7).

Figure 2.7 Waseda Emotional Expression Robot KOBIAN

KOBIAN-R, the refined version of the robot, mounts a new head (Figure 2.8) with 24 DoFs to allow complex asymmetric facial expression displays that mimic natural human expressivity [56], [57]. In addition, the head has been equipped with stereotyped emotional marks, as used in Japanese manga to emphasize emotional display [58]–[61].

Figure 2.8 KOBIAN emotional expression head

2.5 Interaction skills

These three robots represent two different types of entertainment robots, designed to adapt to different environments and for different purposes. Since musical production is their main task, the natural application of musical robots is pure musical entertainment or musical edutainment. KOBIAN, on the other hand, has been designed with a variety of possible tasks in mind, and for this reason it is not as specialized. Its main entertainment application could be as a robot partner or comedian, for emotional support and mood management purposes. The communication skills of these robots stem from different needs and must necessarily focus on different aspects of human-robot interaction. The main interaction scenarios, and the resulting interaction problems that these robots have to overcome, are described in more detail below.

2.5.1 Musical interaction

At present, the majority of robotic musicians lack a fundamental ability: dynamically adapting their musical arrangement to partner musicians in real time during a joint performance. This has so far limited the use of robot players to solo shows. Interaction ability would open up many further application possibilities, not only in the entertainment field, but also in the educational field, for example as a robot music teacher or training partner. For this reason, several researchers are now moving in this direction [62]–[67].

Since the very first stages of their development, a long-term goal of Waseda musical robots has been to be naturally integrated with human musicians [45], [46], [49], [68], [69]. This is not a trivial task, because the robots should not only be able to add expressiveness to their performance following the score author’s notes, or by improvisation; they must be able to follow real-time cues given by the other partner musicians or, in case of a bigger ensemble, by the ensemble conductor.

For this purpose, the focus lately has been on perfecting the perceptual abilities of the robots to enable dynamic interaction between robots and humans at the same level of perception. Previous works focused on the direct musical interaction between partners in a small ensemble [68], [69]: this led to the development of an interaction system to recognize direct visual cues from partner musicians during a joint performance (Figure 2.9, Figure 2.10, Figure 2.11). This interaction approach is very limited: it can only be used in small groups, and only for performances that are not very dynamic. For a fully interactive performance, the robot must be able to recognize several different signs representing the various musical features and emotional expressions to be rendered, and vary its performance accordingly.

Figure 2.9 Direct visual musical interaction principle

Figure 2.10 Direct visual musical interaction levels: basic

Figure 2.11 Direct visual musical interaction levels: extended

2.5.2 Entertainment and emotional feedback

In order to be successful, entertainment activities must be engaging and mood boosting. For this reason, human entertainers must tune in to the audience and be able to decipher its subconscious emotional feedback, adjusting their performance accordingly. In particular, comedians normally gauge the degree of amusement of the audience from the amount of applause or laughter they elicit, and adapt their act according to this feedback.
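The feedback loop a comedian follows can be illustrated with a minimal sketch. This is purely illustrative and is not the system developed in this thesis: the names `ActState`, `amusement_score` and `adapt_act`, the definition of the score as the fraction of an observation window spent laughing, and the `threshold` and `patience` values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ActState:
    """Hypothetical state of a comedy act."""
    style_index: int = 0         # which joke style is currently in use
    consecutive_misses: int = 0  # weak audience responses in a row


def amusement_score(laughter_seconds: float, window_seconds: float) -> float:
    """Fraction of the observation window with detected laughter, clamped to [0, 1]."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return max(0.0, min(1.0, laughter_seconds / window_seconds))


def adapt_act(state: ActState, score: float, n_styles: int,
              threshold: float = 0.2, patience: int = 2) -> ActState:
    """Keep the current style while it works; switch after `patience` weak responses."""
    if score >= threshold:
        # The audience is amused: keep the style and reset the miss counter.
        return ActState(state.style_index, 0)
    misses = state.consecutive_misses + 1
    if misses >= patience:
        # Too many weak responses in a row: move on to the next style.
        return ActState((state.style_index + 1) % n_styles, 0)
    return ActState(state.style_index, misses)
```

In this sketch the performer tolerates a single weak response before changing style, which avoids reacting to one isolated miss; a real entertainer, human or robot, would of course rely on much richer cues.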

Thus, to positively affect the mood and behavior of its audience, a proper entertainment robot must first be able to recognize the audience’s emotional state. However, current entertainment robots have very limited interaction abilities, and can only recognize a very limited command vocabulary, or non-natural, stereotyped emotional displays such as fixed facial expressions [70], [71]. Recognition of spontaneous natural emotional displays is much more complicated than recognition of a fixed symbolic language: emotional states and behaviors, although universally recognized among humans, present a very high degree of subjectivity, and their display is also affected by external factors such as the actual environment and the cultural background of the subjects.

Figure 2.12 Unconscious amusement feedback recognition

2.6 Discussion

Depending on their task, entertainment robots need to interact with humans using different strategies, and therefore to detect and process different types of human social signals. Basic interaction abilities for entertainment robots have already been implemented, but these solutions are static and not easily generalized, so that human-robot interaction is not natural from the human point of view. A more robust and general methodology is needed to enable different entertainment robots to interact with humans in a natural way, removing the user’s cognitive and psychological stress and allowing an effective, immersive and amusing experience. In the following chapters I will describe my proposed solution to this problem: a multimodal sensor system to analyze human physical and physiological changes related to social communication, which can be used to recognize and interpret both direct, conscious non-verbal communication and emotional, unconscious expression.

2.7 Conclusions

In this chapter I described the three Waseda entertainment robots developed in the Takanishi laboratory, and illustrated a common problem they face during entertainment performances: natural interaction with the audience. This lays the basis for the next chapters, in which I will describe a novel solution to overcome this problem.
