Call for Regular and Special Session Papers (CLOSED)

The Call for Regular and Special Session Papers is now closed.

Authors are invited to submit their papers to the ICME 2020 regular track (including the Special Sessions). Follow the submission instructions carefully and upload your contributions to the submission website. Note the important dates related to the regular program:

  • Submission: 13 Dec 2019, AoE (extended from 29 Nov 2019)
  • Author notification: 6 Mar 2020, AoE
  • Camera-ready submission, copyright forms: 3 Apr 2020, AoE

In addition to the ICME 2020 broad topics of interest, papers can also be submitted to the following Special Sessions:


Deep Representations for Visual Quality Assessment

Description: Finding good signal representations is a fundamental problem in visual quality assessment, where pixel-based quality metrics such as the mean squared error cannot accurately predict the quality scores given by humans. In the past few years, the advent of deep learning algorithms has made it possible to learn these representations from data rather than design them manually, achieving state-of-the-art results in visual quality assessment. In addition to conventional 2D image/video content, data-driven quality metrics are starting to find applications in emerging video formats such as virtual reality, light fields, and point clouds, as well as in several traditional image processing and computer vision domains as a tool to accurately evaluate performance. Compared to existing deep-learning-based visual quality metrics, next-generation data-driven approaches will have to support a larger degree of immersion, including higher spatial/temporal resolution and interactivity. Furthermore, they should adapt to more realistic user experiences, including variable viewing conditions, the use of multiple devices from 2D screens to head-mounted displays, and different kinds of acquisition and coding artefacts. Finally, deep video representations for quality might be tailored to specific application scenarios, such as video surveillance or perception.

In this special session, we aim to gather and discuss some of the most recent and significant results on deep-learning-based methods for visual quality prediction, with a particular focus on applications going beyond conventional 2D video quality assessment.


  • Giuseppe Valenzise, Laboratoire des Signaux et Systèmes (L2S), CNRS, CentraleSupelec, Université Paris-Sud, France.
  • Aladine Chetouani, Laboratoire PRISME, Polytech’Orleans, Université d’Orléans, France.

Graph Neural Networks for Multimedia Representation Learning

Description: Graphs are flexible mathematical structures modeling pairwise relations between data entities, such as brain networks, social networks, transportation networks, knowledge graphs, and irregular geometric data. Graph Neural Networks (GNNs) are generalizations of CNNs to graph-structured data and have been attracting increasing attention due to their great expressive power. Thanks to their convincing performance and high interpretability, GNNs have been widely employed to represent various kinds of multimedia data, such as social networks, 3D point clouds, and natural language.

While GNNs have progressed in multimedia representation learning, it remains theoretically challenging for GNN models to offer satisfactory representations of multimedia data, e.g., in optimizing the underlying graph, designing deep GNN representations, and addressing dynamic graph structures. Moreover, developing scalable GNNs for large-scale multimedia data still requires considerable effort.

This special session serves as a forum for researchers from academia and industry to share their insights and cutting-edge results in GNNs for multimedia representation learning. Submissions on both theoretical developments and applications of GNNs are welcome.

Topics of particular interest include, but are not limited to:

  • Graph learning in GNNs for multimedia representation
  • GNNs for deep multimedia representation learning
  • GNNs for time-varying multimedia representation learning
  • Scalable GNNs for large-scale multimedia data representation
  • Directed GNNs for asymmetric multimedia representation
  • Generative GNN models for multimedia data
  • GNNs for innovative multimedia applications, such as person re-identification, social network analysis, point cloud processing, image retrieval, and recommendation systems


  • Wei Hu, Peking University, China.
  • Zheng Wang, National Institute of Informatics, Japan.
  • Shin’ichi Satoh, National Institute of Informatics, Japan.
  • Chia-Wen Lin, National Tsing Hua University, Taiwan.

Smart Camera

Description: Recent developments in camera design have led to a new definition of the camera, i.e., “a device that calculates images”. This new definition emerges from the complementary fields of computational imaging and computational photography, which combine camera hardware design with computational processing to significantly enhance the capabilities of a camera. The maturity of computational imaging system design, the continuing sophistication and miniaturization of embedded processing, and the recent explosive growth of artificial intelligence technology have driven the success of computational imaging. AI integration in cameras is a very new research area that has emerged over just the past five years and is developing explosively, and we are certain that significant new insights will emerge over the next several years. Thus, discussing how artificial intelligence (AI) technology changes the physical nature of the camera is the primary focus of the “Smart Camera” special session. We hope that this special session will be useful in creating a common understanding for camera designers and software practitioners.


  • Lu Fang, Tsinghua University, China.
  • Xing Lin, Tsinghua University, China.
  • Mohit Gupta, University of Wisconsin-Madison, US.
  • Sunil Jaiswal, K|Lens GmbH, Germany.
  • David J. Brady, Duke University, US.

Recent Advances in Immersive Imaging Technologies

Description: The emergence of 360° video capture devices, light field camera arrays and head-mounted displays has created an entirely new opportunity for content creators to deliver truly immersive experiences to audiences. Technical limitations of the capture devices, e.g., incorrect 3D-to-2D mapping and optical distortions in omnidirectional image acquisition, or optical distortions and low spatial and angular resolution in light field images, reduce the quality of experience on the consumer side. Also, handling the sheer amount of data in all processing steps of light field imaging, 360° video and volumetric video, from capture to display, is a major challenge. In particular, streaming captured immersive content to audiences in high quality is still an unsolved problem. For this reason, coding, visual attention modelling, and new quality metrics play a significant role in ensuring the delivery of high-quality immersive experiences.

The special session is dedicated to recent advances in immersive imaging technologies, in particular in the following research topics:

  • Capturing, processing and rendering of light fields, 360° content and/or volumetric video
  • Coding and streaming of light fields, 360° content and/or volumetric video
  • Visual attention/saliency in light fields, 360° content and/or volumetric video
  • Quality metrics in light fields, 360° content and/or volumetric video


  • Martin Alain, Trinity College Dublin, Ireland.
  • Cagri Ozcinar, Trinity College Dublin, Ireland.
  • Sebastian Knorr, Technical University of Berlin, Germany.
  • Emin Zerman, Trinity College Dublin, Ireland.

Multimodal People Analytics in Media

Description: Media content is diverse, complex and often multimodal. Traditionally, automated analysis of media content has been concerned with indexing, summarizing and organizing large and diverse data. To facilitate a more human-like understanding of media content, it is important to focus on the people depicted in it. For the majority of applications involving people analytics, joint learning and analysis of multimodal signals (e.g., audio and video) is critical. With the advances in deep learning and the availability of media data, the multimedia community is well equipped to address the challenges related to multimodal media processing and analysis, and to explore new applications. We invite full-length papers that extract people analytics from media content using at least two modalities, on topics including (but not limited to) the following:

  • Audiovisual diarization
  • Active speaker identification
  • Multimodal person (re)identification, character discovery
  • Multiview approaches to learning shared representations
  • Action retrieval and recognition in media
  • Affect (e.g., violence, emotion) prediction in media (including social media)
  • Algorithms, architectures and databases for people analytics in media


  • Tanaya Guha, University of Warwick, UK.
  • Naveen Kumar, Disney Research Los Angeles, US.
  • Shri Narayanan, University of Southern California, US.
  • Krishna Somandepalli, University of Southern California, US.

Learning-based Geometry Modeling from Light Fields and Beyond

Description: Light field (LF) imaging devices capture both the intensity and the direction of light rays from the target scene/object, and thus the resulting LF images record both the reflectance and the geometry of the scene/object. In recent years, commercial hand-held LF cameras have entered the market, enabling affordable and convenient acquisition of LF data and making research on LF increasingly popular in both industry and academia. The geometry (or depth) information embedded in LF data provides valuable clues for scene understanding, and alternative solutions for various long-standing computer vision problems, such as segmentation, classification, saliency detection, and motion de-blurring. Additionally, the precise depth measuring capabilities enabled by LF imaging open opportunities for more interesting applications, e.g., material recognition, industrial process control, and autonomous navigation. However, the estimation of scene geometry from LF data remains a challenging problem due to possibly sparse angular sampling and complex factors such as occlusions and texture-less or reflective surfaces. To facilitate more efficient and reliable deployment of LF imaging technology in all the potential areas mentioned here, this special session focuses on investigating how learning-based models could help extract the underlying geometry of light field data in the face of all the aforementioned difficulties, and what novel and interesting applications could be inspired by machine learning frameworks.


  • Dr. Junhui Hou, City University of Hong Kong, China.
  • Dr. Jie Chen, OmniVision Technologies, Inc, Singapore.
  • Prof. Christine Guillemot, INRIA, France.
  • Prof. Jingyi Yu, ShanghaiTech University & University of Delaware, US.

Domain Adaptation for Multimedia Semantic Understanding

Description: Deep neural networks have achieved satisfying performance in various computer vision tasks with large-scale labeled training data. However, in many applications, it is prohibitively difficult to obtain a large number of labels, as labeling is expensive and time-consuming. Due to the presence of dataset bias or domain shift, models trained on one large-scale labeled source domain may not perform well when applied directly to another related but unlabeled target domain.

Meanwhile, recent progress in graphics and simulation infrastructure makes it possible to create large amounts of labeled synthetic data, e.g., with CARLA and GTA-V. Several recent efforts have studied models trained on synthetic data. Unfortunately, while such models perform well on synthetic data, they often do not transfer well to real-world settings. While there are ongoing efforts to make simulations more realistic, it is very difficult to model all the characteristics of real data. Therefore, transferring from the labeled simulation domain to the real-world domain is a promising alternative.

Domain adaptation faces several challenges. First, the labeled source data may be collected from multiple domains. Second, the label set of the target domain may differ from those of the source domains. Third, the labeled source data may be in different modalities, with possible data incompleteness. This special session seeks original contributions reporting the most recent progress on domain adaptation to address the above challenges in multimedia semantic understanding applications.


  • Dr. Sicheng Zhao, University of California, Berkeley, US.
  • Prof. Guiguang Ding, Tsinghua University, China.
  • Prof. Kurt Keutzer, University of California, Berkeley, US.