Keynotes


Yukie Nagai

Project Professor, The University of Tokyo

Bio: Yukie Nagai is a Project Professor at the International Research Center for Neurointelligence at the University of Tokyo. She earned her Ph.D. in Engineering from Osaka University in 2004, after which she worked at the National Institute of Information and Communications Technology, Bielefeld University, and then Osaka University. Since 2019, she has been leading the Cognitive Developmental Robotics Lab at the University of Tokyo. Her research encompasses cognitive developmental robotics, computational neuroscience, and assistive technologies for developmental disorders. Dr. Nagai employs computational methods to investigate the neural mechanisms underlying social cognitive development. In recognition of her work, she was named among the “World’s 50 Most Renowned Women in Robotics” in 2020 (Analytics Insight), “35 Women in Robotics Engineering and Science” in 2022 (IEEE IROS), and “Women In Tech 30” in 2024 (Forbes JAPAN), among other honors.

Title: How People See the World: An Embodied Predictive Processing Theory

Abstract: Patterns of visual attention vary across individuals and contexts. People’s intentions, action goals, and prior knowledge about a scene influence where they look. While extensive research has been conducted on visual attention and eye tracking, much remains unknown about the perceptual world itself—how people experience vision subjectively and how internal and external factors shape these experiences. This talk introduces a neuro-inspired theory of human visual perception based on embodied predictive processing, a framework I have been advocating as a unified theory of cognition. The core idea of embodied predictive processing is that the brain continuously integrates sensory signals from the body with predictive signals from the brain, striving to minimize prediction errors. This is achieved either by updating internal models or by altering actions and, consequently, sensory inputs. Through this process, sensorimotor signals are integrated, creating a dynamic interplay between perception and action. Our experiments with computational neural networks, embedded in robots to simulate brain-body coupling, demonstrate that embodied sensorimotor learning drives the development of visual perception and attention. These findings highlight that visual perception is shaped not only by visual experiences but also by motor experiences. The talk will also explore how neurodiverse individuals perceive the world differently from neurotypical individuals, shedding light on the mechanisms underlying their distinct gaze behaviors.
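To make the core loop of the framework concrete for readers new to predictive processing, here is a minimal toy sketch of the two error-minimization routes the abstract describes: updating the internal model (perception) and changing the action, and thus the sensory input. Everything below (the agent, the environment, the learning rates) is hypothetical and illustrative; it is not the speaker's actual model.

```python
import numpy as np

# Toy prediction-error-minimization loop (illustrative only).
rng = np.random.default_rng(0)

mu = 0.0          # internal model: predicted sensory state
action = 0.0      # motor command that shifts the actual sensory state
target = 1.0      # hidden environmental cause of sensation
eta_model, eta_action = 0.1, 0.05   # hypothetical learning rates

for step in range(200):
    sensation = target + action + 0.01 * rng.standard_normal()
    error = sensation - mu                 # prediction error

    # Two routes to minimize the error, mirroring the abstract:
    mu += eta_model * error                # 1) update the internal model (perception)
    action -= eta_action * error           # 2) act to change the sensory input (action)

print(f"final prediction error: {mu - (target + action):+.4f}")
```

In this sketch both updates drive the same quantity toward zero, which is the sense in which perception and action become dynamically coupled.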



Jean-Marc Odobez

Senior Researcher, Idiap & EPFL; Head of the Perception and Activity Understanding group

Bio: Dr. Jean-Marc Odobez (MSc 1990, PhD 1994) graduated from the École Nationale Supérieure des Télécommunications de Bretagne (ENSTBr, France) in 1990 and obtained his Ph.D. from Rennes University for his dissertation at INRIA, focusing on dynamic scene analysis for scene understanding. After five years as an Assistant Professor at the University of Le Mans, France, he joined Idiap, where he is now the Head of the Perception & Activity Understanding Group. He is also an Adjunct Faculty at EPFL in the School of Engineering and a member of the Electrical Engineering Doctoral Committee (EDEE). His research interests lie in the design of multimodal perception systems, integrating computer vision, statistical machine learning, deep learning, and social sciences to advance activity and behavior recognition, as well as human-human and human-robot interaction modeling. His work has applications in human health assessment, social robotics, and media content analysis. Dr. Odobez has published over 50 journal articles and 160 peer-reviewed conference papers in his field. He has been the principal investigator of more than 18 European and Swiss projects and has led 10 technology transfer projects with SMEs. He holds several patents in computer vision and is the co-founder of Klewel SA (www.klewel.ch) and Eyeware SA (eyeware.tech), a company specializing in eye tracking and attention modeling. He is an IEEE member and serves as an Associate Editor for the Machine Vision and Applications journal. Additionally, he regularly serves as an Area Chair for conferences such as ICMI, ICCV, CVPR, and ECCV.

Title: Looking Through Their Eyes: Decoding Gaze and Attention in Everyday Life

Abstract: Beyond words, non-verbal behaviors (NVB) are known to play important roles in face-to-face interactions. However, decoding NVB is a challenging problem that involves both extracting subtle physical NVB cues and mapping them to higher-level communicative behaviors or social constructs. Gaze, in particular, serves as a fundamental indicator of attention and interest, shaping communication and social signaling, with applications in domains such as human-computer interaction, robotics, and medical diagnosis, notably the assessment of Autism Spectrum Disorder (ASD).
However, estimating others' visual attention, encompassing their gaze and Visual Focus of Attention (VFOA), remains highly challenging, even for humans. It requires not only inferring accurate 3D gaze directions but also understanding the scene context to discern which object in the field of view is actually being looked at. Such context can include people's activities, which provide priors about which objects are likely to be attended, and the scene structure, which reveals obstructions in the line of sight. Recent research has pursued two avenues to address this: the first focuses on improving appearance-based 3D gaze estimation from images and videos, while the second investigates gaze following, the task of inferring where a person looks in an image.
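The geometric core of VFOA estimation can be illustrated with a short sketch: score candidate objects by the angle between an estimated gaze ray and the eye-to-object direction, and accept the best candidate inside a tolerance cone. The positions, object names, and cone width below are hypothetical, and real systems additionally exploit the occlusion and activity priors mentioned above.

```python
import numpy as np

def vfoa_from_gaze(eye_pos, gaze_dir, objects, max_angle_deg=10.0):
    """Pick the object whose direction best matches the gaze ray (toy version)."""
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    best_name, best_angle = None, np.inf
    for name, pos in objects.items():
        to_obj = np.asarray(pos, dtype=float) - eye_pos
        to_obj /= np.linalg.norm(to_obj)
        angle = np.degrees(np.arccos(np.clip(gaze_dir @ to_obj, -1.0, 1.0)))
        if angle < best_angle:
            best_name, best_angle = name, angle
    # None means no object falls inside the tolerance cone (gaze aversion / off-target).
    return best_name if best_angle <= max_angle_deg else None

eye = np.array([0.0, 1.6, 0.0])            # hypothetical eye position (metres)
gaze = np.array([0.3, -0.2, 1.0])          # hypothetical estimated gaze direction
objects = {"screen": (0.5, 1.3, 1.8), "cup": (-0.4, 1.0, 0.9)}
print(vfoa_from_gaze(eye, gaze, objects))  # -> "screen"
```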
This presentation will explore ideas and methodologies addressing both challenges. It first delves into advancements in 3D gaze estimation, including personalized model construction via few-shot learning and gaze-redirection eye synthesis, differential gaze estimation, and the use of social interaction priors for model adaptation. It then introduces recent models for estimating gaze target locations ('where') in real-world settings, including the joint inference of gaze targets for all people in a scene, the inference of social labels such as eye contact and shared attention, and the inference of the semantic category of what is being looked at ('what').
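As a rough intuition for few-shot personalization, one simple baseline is to treat a generic gaze estimator's output as biased per person and fit an affine correction from a handful of calibration pairs by least squares. The sketch below uses simulated data and illustrates only this generic idea, not any specific published model from the talk.

```python
import numpy as np

def fit_affine_correction(pred, true):
    """Fit true ≈ [pred | 1] @ W from k calibration pairs (k x 2 arrays)."""
    X = np.hstack([pred, np.ones((len(pred), 1))])
    W, *_ = np.linalg.lstsq(X, true, rcond=None)   # 3 x 2 weight matrix
    return W

def apply_correction(W, pred):
    return np.hstack([pred, np.ones((len(pred), 1))]) @ W

# k = 5 hypothetical calibration points: generic predictions vs. ground truth
# gaze angles (yaw, pitch in degrees), with a simulated constant per-person bias.
pred = np.array([[1.0, -2.0], [10.5, 4.8], [-9.2, 0.9], [5.1, -7.8], [-4.7, 6.2]])
true = pred + np.array([2.0, -1.5])

W = fit_affine_correction(pred, true)
print(apply_correction(W, pred) - true)            # residuals ~ 0 after adaptation
```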