Human-centered Vision Systems

Instructor: Nicu Sebe (University of Trento, Italy)

Overview

This tutorial aims to give its audience a new perspective on the opportunities of employing contextual data in human-centric video-based applications. It covers topics and case studies in algorithm design for different applications in smart environments and examines the use of contextual data in various forms to improve the efficiency and reliability of the vision processing. Examples of inferring the user's activity, facial expression, eye gaze, gesture, emotion, and intention, as well as of recognizing objects based on user interactions, are used to support the presented topics.

A convergence is occurring in human-centered information systems between, on the one hand, real-time sensing and inference methods that determine the user's activity, location, commands, needs, state, and intentions and, on the other hand, content acquisition, search, and delivery models based on the internet and other media resources. Interfaces that sense and receive the user's commands, and those that render results to the user, need to operate flexibly according to the user's context or preference profile. Vision can offer user interfaces that recognize the user's location, pose, activity, gesture, area of interest, gait, mobility, habitual routines or novel actions, attention, facial expression, mood, emotions, type of clothing, and interaction with the environment, objects, appliances, or other people. While many of these features have been extensively studied by the vision community, the term "human-centered" refers to the role vision can play, through a collection of these features, in offering flexible, adaptive, and context-aware interface and inference operations when working with an individual user.

The course focuses on three aspects of context-driven information fusion in video processing for human-centric applications: (1) interfacing vision processing with high-level data fusion to build up knowledge bases and behavior models; (2) human pose, gaze, activity, facial expression, preferences, behavior modeling, and user feedback as sources of human-centric context; (3) case studies of incorporating vision-based activity and expression recognition algorithms into adaptive systems that learn user preferences and adjust their services accordingly. The course topics and case studies are supported by a large collection of implemented examples covering various layers of processing, from early vision extraction, through intermediate soft decisions in multi-camera processing or latent-space activity recognition, to high-level inference of semantics based on visual cues.
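As a simple illustration of how contextual data can sharpen a vision-based soft decision, the sketch below fuses a per-frame activity likelihood from a vision classifier with a context-derived prior and renormalizes the result. This is not part of the tutorial material: it is a minimal example, and the activity labels, classifier output, and prior values are all hypothetical.

```python
# Illustrative sketch only: fusing a vision classifier's per-frame soft
# decision with a contextual prior. All names and values are hypothetical.
import numpy as np

ACTIVITIES = ["reading", "typing", "phone_call"]

# Per-frame soft decision from a (hypothetical) vision-based activity classifier.
vision_likelihood = np.array([0.2, 0.5, 0.3])

# Contextual prior, e.g. derived from location ("office desk") and time of day.
context_prior = np.array([0.3, 0.6, 0.1])

# Bayesian-style fusion: combine the two sources and renormalize.
posterior = vision_likelihood * context_prior
posterior /= posterior.sum()

for activity, p in zip(ACTIVITIES, posterior):
    print(f"{activity}: {p:.2f}")
```

In a real system the prior would itself be estimated from sensed context (location, time of day, recent activity history), and the fused posterior could be fed back to bias subsequent vision processing.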

Syllabus

The tutorial will provide the participants with an understanding of the key concepts, state-of-the-art techniques, new application opportunities, and open issues in the areas described above. The course is organized according to the following syllabus:

  1. Introduction and motivation:
    1. New paradigms in user-centric design: application spaces in smart environments, ambient intelligence, adaptive systems, intuitive interfaces
    2. Convergence of sensing, inference, and media content delivery in human-centered information systems
    3. Challenges in vision design: processing aspects in algorithm design, user acceptance aspects in application design
  2. Use of context in video processing:
    1. Types of contextual data and sources: environmental versus user-centric context, static versus dynamic context, multi-camera networks, multimodal sensing
    2. Examples in early vision, human pose analysis, user intention detection
  3. Interface of vision and high-level inference:
    1. Data fusion for vision-based inference, role of feedback to vision, knowledge accumulation, observation validation
    2. Interactive learning: role of user in guiding the inference system, query and feedback from user, intuitive user interactions
    3. Semantic labeling based on user observations and context, grounding logic rules with observations
    4. Case study: activity recognition based on environmental context
  4. Human-centric inference:
    1. Human-centric data: algorithms for detection of human pose, gesture, activity, facial expression, gaze, attention, mood, emotions
    2. Case study: human head pose and gaze analysis
    3. Case study: multimodal human emotion analysis
  5. Conclusions and new frontiers