Multimodal Scene Understanding: Algorithms, Applications and Deep Learning

Multimodal Scene Understanding: Algorithms, Applications and Deep Learning presents recent advances in multi-modal computing, with a focus on computer vision and photogrammetry. It provides the latest algorithms and applications that involve combining multiple sources of information and describes th...


Bibliographic Details
Other Authors: Yang, Michael Ying (editor), Rosenhahn, Bodo (editor), Murino, Vittorio (editor)
Format: Electronic book
Language: English
Published: London, England: Academic Press, [2019]
Edition: 1st edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631015306719
Table of Contents:
  • Front Cover
  • Multimodal Scene Understanding
  • Copyright
  • Contents
  • List of Contributors
  • 1 Introduction to Multimodal Scene Understanding
  • 1.1 Introduction
  • 1.2 Organization of the Book
  • References
  • 2 Deep Learning for Multimodal Data Fusion
  • 2.1 Introduction
  • 2.2 Related Work
  • 2.3 Basics of Multimodal Deep Learning: VAEs and GANs
  • 2.3.1 Auto-Encoder
  • 2.3.2 Variational Auto-Encoder (VAE)
  • 2.3.3 Generative Adversarial Network (GAN)
  • 2.3.4 VAE-GAN
  • 2.3.5 Adversarial Auto-Encoder (AAE)
  • 2.3.6 Adversarial Variational Bayes (AVB)
  • 2.3.7 ALI and BiGAN
  • 2.4 Multimodal Image-to-Image Translation Networks
  • 2.4.1 Pix2pix and Pix2pixHD
  • 2.4.2 CycleGAN, DiscoGAN, and DualGAN
  • 2.4.3 CoGAN
  • 2.4.4 UNIT
  • 2.4.5 Triangle GAN
  • 2.5 Multimodal Encoder-Decoder Networks
  • 2.5.1 Model Architecture
  • 2.5.2 Multitask Training
  • 2.5.3 Implementation Details
  • 2.6 Experiments
  • 2.6.1 Results on NYUDv2 Dataset
  • 2.6.2 Results on Cityscapes Dataset
  • 2.6.3 Auxiliary Tasks
  • 2.7 Conclusion
  • References
  • 3 Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks
  • 3.1 Introduction
  • 3.2 Overview
  • 3.2.1 Image Classification and the VGG Network
  • 3.2.2 Architectures for Pixel-level Labeling
  • 3.2.3 Architectures for RGB and Depth Fusion
  • 3.2.4 Datasets and Benchmarks
  • 3.3 Methods
  • 3.3.1 Datasets and Data Splitting
  • 3.3.2 Preprocessing of the Stanford Dataset
  • 3.3.3 Preprocessing of the ISPRS Dataset
  • 3.3.4 One-channel Normal Label Representation
  • 3.3.5 Color Spaces for RGB and Depth Fusion
  • 3.3.6 Hyper-parameters and Training
  • 3.4 Results and Discussion
  • 3.4.1 Results and Discussion on the Stanford Dataset
  • 3.4.2 Results and Discussion on the ISPRS Dataset
  • 3.5 Conclusion
  • References
  • 4 Learning Convolutional Neural Networks for Object Detection with Very Little Training Data
  • 4.1 Introduction
  • 4.2 Fundamentals
  • 4.2.1 Types of Learning
  • 4.2.2 Convolutional Neural Networks
  • 4.2.2.1 Artificial neuron
  • 4.2.2.2 Artificial neural network
  • 4.2.2.3 Training
  • 4.2.2.4 Convolutional neural networks
  • 4.2.3 Random Forests
  • 4.2.3.1 Decision tree
  • 4.2.3.2 Random forest
  • 4.3 Related Work
  • 4.4 Traffic Sign Detection
  • 4.4.1 Feature Learning
  • 4.4.2 Random Forest Classification
  • 4.4.3 RF to NN Mapping
  • 4.4.4 Fully Convolutional Network
  • 4.4.5 Bounding Box Prediction
  • 4.5 Localization
  • 4.6 Clustering
  • 4.7 Dataset
  • 4.7.1 Data Capturing
  • 4.7.2 Filtering
  • 4.8 Experiments
  • 4.8.1 Training and Test Data
  • 4.8.2 Classification
  • 4.8.3 Object Detection
  • 4.8.4 Computation Time
  • 4.8.5 Precision of Localizations
  • 4.9 Conclusion
  • Acknowledgment
  • References
  • 5 Multimodal Fusion Architectures for Pedestrian Detection
  • 5.1 Introduction
  • 5.2 Related Work
  • 5.2.1 Visible Pedestrian Detection
  • 5.2.2 Infrared Pedestrian Detection
  • 5.2.3 Multimodal Pedestrian Detection
  • 5.3 Proposed Method
  • 5.3.1 Multimodal Feature Learning/Fusion
  • 5.3.2 Multimodal Pedestrian Detection
  • 5.3.2.1 Baseline DNN model
  • 5.3.2.2 Scene-aware DNN model
  • 5.3.3 Multimodal Segmentation Supervision
  • 5.4 Experimental Results and Discussion
  • 5.4.1 Dataset and Evaluation Metric
  • 5.4.2 Implementation Details
  • 5.4.3 Evaluation of Multimodal Feature Fusion
  • 5.4.4 Evaluation of Multimodal Pedestrian Detection Networks
  • 5.4.5 Evaluation of Multimodal Segmentation Supervision Networks
  • 5.4.6 Comparison with State-of-the-Art Multimodal Pedestrian Detection Methods
  • 5.5 Conclusion
  • Acknowledgment
  • References
  • 6 Multispectral Person Re-Identification Using GAN for Color-to-Thermal Image Translation
  • 6.1 Introduction
  • 6.2 Related Work
  • 6.2.1 Person Re-Identification
  • 6.2.2 Color-to-Thermal Translation
  • 6.2.3 Generative Adversarial Networks
  • 6.3 ThermalWorld Dataset
  • 6.3.1 ThermalWorld ReID Split
  • 6.3.2 ThermalWorld VOC Split
  • 6.3.3 Dataset Annotation
  • 6.3.4 Comparison of the ThermalWorld VOC Split with Previous Datasets
  • 6.3.5 Dataset Structure
  • 6.3.6 Data Processing
  • 6.4 Method
  • 6.4.1 Conditional Adversarial Networks
  • 6.4.2 Thermal Segmentation Generator
  • 6.4.3 Relative Thermal Contrast Generator
  • 6.4.4 Thermal Signature Matching
  • 6.5 Evaluation
  • 6.5.1 Network Training
  • 6.5.2 Color-to-Thermal Translation
  • 6.5.2.1 Qualitative comparison
  • 6.5.2.2 Quantitative evaluation
  • 6.5.3 ReID Evaluation Protocol
  • 6.5.4 Cross-modality ReID Baselines
  • 6.5.5 Comparison and Analysis
  • 6.5.6 Applications
  • 6.6 Conclusion
  • Acknowledgments
  • References
  • 7 A Review and Quantitative Evaluation of Direct Visual-Inertial Odometry
  • 7.1 Introduction
  • 7.2 Related Work
  • 7.2.1 Visual Odometry
  • 7.2.2 Visual-Inertial Odometry
  • 7.3 Background: Nonlinear Optimization and Lie Groups
  • 7.3.1 Gauss-Newton Algorithm
  • 7.3.2 Levenberg-Marquardt Algorithm
  • 7.4 Background: Direct Sparse Odometry
  • 7.4.1 Notation
  • 7.4.2 Photometric Error
  • 7.4.3 Interaction Between Coarse Tracking and Joint Optimization
  • 7.4.4 Coarse Tracking Using Direct Image Alignment
  • 7.4.5 Joint Optimization
  • 7.5 Direct Sparse Visual-Inertial Odometry
  • 7.5.1 Inertial Error
  • 7.5.2 IMU Initialization and the Problem of Observability
  • 7.5.3 SIM(3)-based Model
  • 7.5.4 Scale-Aware Visual-Inertial Optimization
  • 7.5.4.1 Nonlinear optimization
  • 7.5.4.2 Marginalization using the Schur complement
  • 7.5.4.3 Dynamic marginalization for delayed scale convergence
  • 7.5.4.4 Measuring scale convergence
  • 7.5.5 Coarse Visual-Inertial Tracking
  • 7.6 Calculating the Relative Jacobians
  • 7.6.1 Proof of the Chain Rule
  • 7.6.2 Derivation of the Jacobian with Respect to Pose in Eq. (7.58)
  • 7.6.3 Derivation of the Jacobian with Respect to Scale and Gravity Direction in Eq. (7.59)
  • 7.7 Results
  • 7.7.1 Robust Quantitative Evaluation
  • 7.7.2 Evaluation of the Initialization
  • 7.7.3 Parameter Studies
  • 7.8 Conclusion
  • References
  • 8 Multimodal Localization for Embedded Systems: A Survey
  • 8.1 Introduction
  • 8.2 Positioning Systems and Perception Sensors
  • 8.2.1 Positioning Systems
  • 8.2.1.1 Inertial navigation systems
  • 8.2.1.2 Global navigation satellite systems
  • 8.2.2 Perception Sensors
  • 8.2.2.1 Visible light cameras
  • 8.2.2.2 IR cameras
  • 8.2.2.3 Event-based cameras
  • 8.2.2.4 RGB-D cameras
  • 8.2.2.5 LiDAR sensors
  • 8.2.3 Heterogeneous Sensor Data Fusion Methods
  • 8.2.3.1 Sensor configuration types
  • 8.2.3.2 Sensor coupling approaches
  • 8.2.3.3 Sensor fusion architectures
  • 8.2.4 Discussion
  • 8.3 State of the Art on Localization Methods
  • 8.3.1 Monomodal Localization
  • 8.3.1.1 INS-based localization
  • 8.3.1.2 GNSS-based localization
  • 8.3.1.3 Image-based localization
  • 8.3.1.4 LiDAR-map based localization
  • 8.3.2 Multimodal Localization
  • 8.3.2.1 Classical data fusion algorithms
  • 8.3.2.2 Reference multimodal benchmarks
  • 8.3.2.3 A panorama of multimodal localization approaches
  • 8.3.2.4 Graph-based localization
  • 8.3.3 Discussion
  • 8.4 Multimodal Localization for Embedded Systems
  • 8.4.1 Application Domain and Hardware Constraints
  • 8.4.2 Embedded Computing Architectures
  • 8.4.2.1 SoC constraints
  • 8.4.2.2 IP modules for SoC
  • 8.4.2.3 SoC
  • 8.4.2.4 FPGA
  • 8.4.2.5 ASIC
  • 8.4.2.6 Discussion
  • 8.4.3 Multimodal Localization in State-of-the-Art Embedded Systems
  • 8.4.3.1 Example of embedded SoC for multimodal localization
  • 8.4.3.2 Smart phones
  • 8.4.3.3 Smart glasses
  • 8.4.3.4 Autonomous mobile robots
  • 8.4.3.5 Unmanned aerial vehicles
  • 8.4.3.6 Autonomous driving vehicles
  • 8.4.4 Discussion
  • 8.5 Application Domains
  • 8.5.1 Scene Mapping
  • 8.5.1.1 Aircraft inspection
  • 8.5.1.2 SenseFly eBee classic
  • 8.5.2 Pedestrian Localization
  • 8.5.2.1 Indoor localization in large-scale buildings
  • 8.5.2.2 Precise localization of mobile devices in unknown environments
  • 8.5.3 Automotive Navigation
  • 8.5.3.1 Autonomous driving
  • 8.5.3.2 Smart factory
  • 8.5.4 Mixed Reality
  • 8.5.4.1 Virtual cane system for visually impaired individuals
  • 8.5.4.2 Engineering, construction and maintenance
  • 8.6 Conclusion
  • References
  • 9 Self-Supervised Learning from Web Data for Multimodal Retrieval
  • 9.1 Introduction
  • 9.1.1 Annotating Data: A Bottleneck for Training Deep Neural Networks
  • 9.1.2 Alternatives to Annotated Data
  • 9.1.3 Exploiting Multimodal Web Data
  • 9.2 Related Work
  • 9.2.1 Contributions
  • 9.3 Multimodal Text-Image Embedding
  • 9.4 Text Embeddings
  • 9.5 Benchmarks
  • 9.5.1 InstaCities1M
  • 9.5.2 WebVision
  • 9.5.3 MIRFlickr
  • 9.6 Retrieval on InstaCities1M and WebVision Datasets
  • 9.6.1 Experiment Setup
  • 9.6.2 Results and Conclusions
  • 9.6.3 Error Analysis
  • 9.6.3.1 Visual features confusion
  • 9.6.3.2 Errors from the dataset statistics
  • 9.6.3.3 Words with different meanings or uses
  • 9.7 Retrieval in the MIRFlickr Dataset
  • 9.7.1 Experiment Setup
  • 9.7.2 Results and Conclusions
  • 9.8 Comparing the Image and Text Embeddings
  • 9.8.1 Experiment Setup
  • 9.8.2 Results and Conclusions
  • 9.9 Visualizing CNN Activation Maps
  • 9.10 Visualizing the Learned Semantic Space with t-SNE
  • 9.10.1 Dimensionality Reduction with t-SNE
  • 9.10.2 Visualizing Both Image and Text Embeddings
  • 9.10.3 Showing Images at the Embedding Locations