Multimodal Scene Understanding: Algorithms, Applications and Deep Learning
Multimodal Scene Understanding: Algorithms, Applications and Deep Learning presents recent advances in multi-modal computing, with a focus on computer vision and photogrammetry. It provides the latest algorithms and applications that involve combining multiple sources of information and describes th...
Other Authors: , ,
Format: Electronic book
Language: English
Published: London, England : Academic Press, [2019]
Edition: 1st edition
Subjects:
View in Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631015306719
Table of Contents:
- Front Cover
- Multimodal Scene Understanding
- Copyright
- Contents
- List of Contributors
- 1 Introduction to Multimodal Scene Understanding
- 1.1 Introduction
- 1.2 Organization of the Book
- References
- 2 Deep Learning for Multimodal Data Fusion
- 2.1 Introduction
- 2.2 Related Work
- 2.3 Basics of Multimodal Deep Learning: VAEs and GANs
- 2.3.1 Auto-Encoder
- 2.3.2 Variational Auto-Encoder (VAE)
- 2.3.3 Generative Adversarial Network (GAN)
- 2.3.4 VAE-GAN
- 2.3.5 Adversarial Auto-Encoder (AAE)
- 2.3.6 Adversarial Variational Bayes (AVB)
- 2.3.7 ALI and BiGAN
- 2.4 Multimodal Image-to-Image Translation Networks
- 2.4.1 Pix2pix and Pix2pixHD
- 2.4.2 CycleGAN, DiscoGAN, and DualGAN
- 2.4.3 CoGAN
- 2.4.4 UNIT
- 2.4.5 Triangle GAN
- 2.5 Multimodal Encoder-Decoder Networks
- 2.5.1 Model Architecture
- 2.5.2 Multitask Training
- 2.5.3 Implementation Details
- 2.6 Experiments
- 2.6.1 Results on NYUDv2 Dataset
- 2.6.2 Results on Cityscapes Dataset
- 2.6.3 Auxiliary Tasks
- 2.7 Conclusion
- References
- 3 Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks
- 3.1 Introduction
- 3.2 Overview
- 3.2.1 Image Classification and the VGG Network
- 3.2.2 Architectures for Pixel-level Labeling
- 3.2.3 Architectures for RGB and Depth Fusion
- 3.2.4 Datasets and Benchmarks
- 3.3 Methods
- 3.3.1 Datasets and Data Splitting
- 3.3.2 Preprocessing of the Stanford Dataset
- 3.3.3 Preprocessing of the ISPRS Dataset
- 3.3.4 One-channel Normal Label Representation
- 3.3.5 Color Spaces for RGB and Depth Fusion
- 3.3.6 Hyper-parameters and Training
- 3.4 Results and Discussion
- 3.4.1 Results and Discussion on the Stanford Dataset
- 3.4.2 Results and Discussion on the ISPRS Dataset
- 3.5 Conclusion
- References
- 4 Learning Convolutional Neural Networks for Object Detection with Very Little Training Data
- 4.1 Introduction
- 4.2 Fundamentals
- 4.2.1 Types of Learning
- 4.2.2 Convolutional Neural Networks
- 4.2.2.1 Artificial neuron
- 4.2.2.2 Artificial neural network
- 4.2.2.3 Training
- 4.2.2.4 Convolutional neural networks
- 4.2.3 Random Forests
- 4.2.3.1 Decision tree
- 4.2.3.2 Random forest
- 4.3 Related Work
- 4.4 Traffic Sign Detection
- 4.4.1 Feature Learning
- 4.4.2 Random Forest Classification
- 4.4.3 RF to NN Mapping
- 4.4.4 Fully Convolutional Network
- 4.4.5 Bounding Box Prediction
- 4.5 Localization
- 4.6 Clustering
- 4.7 Dataset
- 4.7.1 Data Capturing
- 4.7.2 Filtering
- 4.8 Experiments
- 4.8.1 Training and Test Data
- 4.8.2 Classification
- 4.8.3 Object Detection
- 4.8.4 Computation Time
- 4.8.5 Precision of Localizations
- 4.9 Conclusion
- Acknowledgment
- References
- 5 Multimodal Fusion Architectures for Pedestrian Detection
- 5.1 Introduction
- 5.2 Related Work
- 5.2.1 Visible Pedestrian Detection
- 5.2.2 Infrared Pedestrian Detection
- 5.2.3 Multimodal Pedestrian Detection
- 5.3 Proposed Method
- 5.3.1 Multimodal Feature Learning/Fusion
- 5.3.2 Multimodal Pedestrian Detection
- 5.3.2.1 Baseline DNN model
- 5.3.2.2 Scene-aware DNN model
- 5.3.3 Multimodal Segmentation Supervision
- 5.4 Experimental Results and Discussion
- 5.4.1 Dataset and Evaluation Metric
- 5.4.2 Implementation Details
- 5.4.3 Evaluation of Multimodal Feature Fusion
- 5.4.4 Evaluation of Multimodal Pedestrian Detection Networks
- 5.4.5 Evaluation of Multimodal Segmentation Supervision Networks
- 5.4.6 Comparison with State-of-the-Art Multimodal Pedestrian Detection Methods
- 5.5 Conclusion
- Acknowledgment
- References
- 6 Multispectral Person Re-Identification Using GAN for Color-to-Thermal Image Translation
- 6.1 Introduction
- 6.2 Related Work
- 6.2.1 Person Re-Identification
- 6.2.2 Color-to-Thermal Translation
- 6.2.3 Generative Adversarial Networks
- 6.3 ThermalWorld Dataset
- 6.3.1 ThermalWorld ReID Split
- 6.3.2 ThermalWorld VOC Split
- 6.3.3 Dataset Annotation
- 6.3.4 Comparison of the ThermalWorld VOC Split with Previous Datasets
- 6.3.5 Dataset Structure
- 6.3.6 Data Processing
- 6.4 Method
- 6.4.1 Conditional Adversarial Networks
- 6.4.2 Thermal Segmentation Generator
- 6.4.3 Relative Thermal Contrast Generator
- 6.4.4 Thermal Signature Matching
- 6.5 Evaluation
- 6.5.1 Network Training
- 6.5.2 Color-to-Thermal Translation
- 6.5.2.1 Qualitative comparison
- 6.5.2.2 Quantitative evaluation
- 6.5.3 ReID Evaluation Protocol
- 6.5.4 Cross-modality ReID Baselines
- 6.5.5 Comparison and Analysis
- 6.5.6 Applications
- 6.6 Conclusion
- Acknowledgments
- References
- 7 A Review and Quantitative Evaluation of Direct Visual-Inertial Odometry
- 7.1 Introduction
- 7.2 Related Work
- 7.2.1 Visual Odometry
- 7.2.2 Visual-Inertial Odometry
- 7.3 Background: Nonlinear Optimization and Lie Groups
- 7.3.1 Gauss-Newton Algorithm
- 7.3.2 Levenberg-Marquardt Algorithm
- 7.4 Background: Direct Sparse Odometry
- 7.4.1 Notation
- 7.4.2 Photometric Error
- 7.4.3 Interaction Between Coarse Tracking and Joint Optimization
- 7.4.4 Coarse Tracking Using Direct Image Alignment
- 7.4.5 Joint Optimization
- 7.5 Direct Sparse Visual-Inertial Odometry
- 7.5.1 Inertial Error
- 7.5.2 IMU Initialization and the Problem of Observability
- 7.5.3 SIM(3)-based Model
- 7.5.4 Scale-Aware Visual-Inertial Optimization
- 7.5.4.1 Nonlinear optimization
- 7.5.4.2 Marginalization using the Schur complement
- 7.5.4.3 Dynamic marginalization for delayed scale convergence
- 7.5.4.4 Measuring scale convergence
- 7.5.5 Coarse Visual-Inertial Tracking
- 7.6 Calculating the Relative Jacobians
- 7.6.1 Proof of the Chain Rule
- 7.6.2 Derivation of the Jacobian with Respect to Pose in Eq. (7.58)
- 7.6.3 Derivation of the Jacobian with Respect to Scale and Gravity Direction in Eq. (7.59)
- 7.7 Results
- 7.7.1 Robust Quantitative Evaluation
- 7.7.2 Evaluation of the Initialization
- 7.7.3 Parameter Studies
- 7.8 Conclusion
- References
- 8 Multimodal Localization for Embedded Systems: A Survey
- 8.1 Introduction
- 8.2 Positioning Systems and Perception Sensors
- 8.2.1 Positioning Systems
- 8.2.1.1 Inertial navigation systems
- 8.2.1.2 Global navigation satellite systems
- 8.2.2 Perception Sensors
- 8.2.2.1 Visible light cameras
- 8.2.2.2 IR cameras
- 8.2.2.3 Event-based cameras
- 8.2.2.4 RGB-D cameras
- 8.2.2.5 LiDAR sensors
- 8.2.3 Heterogeneous Sensor Data Fusion Methods
- 8.2.3.1 Sensor configuration types
- 8.2.3.2 Sensor coupling approaches
- 8.2.3.3 Sensor fusion architectures
- 8.2.4 Discussion
- 8.3 State of the Art on Localization Methods
- 8.3.1 Monomodal Localization
- 8.3.1.1 INS-based localization
- 8.3.1.2 GNSS-based localization
- 8.3.1.3 Image-based localization
- 8.3.1.4 LiDAR-map based localization
- 8.3.2 Multimodal Localization
- 8.3.2.1 Classical data fusion algorithms
- 8.3.2.2 Reference multimodal benchmarks
- 8.3.2.3 A panorama of multimodal localization approaches
- 8.3.2.4 Graph-based localization
- 8.3.3 Discussion
- 8.4 Multimodal Localization for Embedded Systems
- 8.4.1 Application Domain and Hardware Constraints
- 8.4.2 Embedded Computing Architectures
- 8.4.2.1 SoC constraints
- 8.4.2.2 IP modules for SoC
- 8.4.2.3 SoC
- 8.4.2.4 FPGA
- 8.4.2.5 ASIC
- 8.4.2.6 Discussion
- 8.4.3 Multimodal Localization in State-of-the-Art Embedded Systems
- 8.4.3.1 Example of embedded SoC for multimodal localization
- 8.4.3.2 Smart phones
- 8.4.3.3 Smart glasses
- 8.4.3.4 Autonomous mobile robots
- 8.4.3.5 Unmanned aerial vehicles
- 8.4.3.6 Autonomous driving vehicles
- 8.4.4 Discussion
- 8.5 Application Domains
- 8.5.1 Scene Mapping
- 8.5.1.1 Aircraft inspection
- 8.5.1.2 SenseFly eBee classic
- 8.5.2 Pedestrian Localization
- 8.5.2.1 Indoor localization in large-scale buildings
- 8.5.2.2 Precise localization of mobile devices in unknown environments
- 8.5.3 Automotive Navigation
- 8.5.3.1 Autonomous driving
- 8.5.3.2 Smart factory
- 8.5.4 Mixed Reality
- 8.5.4.1 Virtual cane system for visually impaired individuals
- 8.5.4.2 Engineering, construction and maintenance
- 8.6 Conclusion
- References
- 9 Self-Supervised Learning from Web Data for Multimodal Retrieval
- 9.1 Introduction
- 9.1.1 Annotating Data: A Bottleneck for Training Deep Neural Networks
- 9.1.2 Alternatives to Annotated Data
- 9.1.3 Exploiting Multimodal Web Data
- 9.2 Related Work
- 9.2.1 Contributions
- 9.3 Multimodal Text-Image Embedding
- 9.4 Text Embeddings
- 9.5 Benchmarks
- 9.5.1 InstaCities1M
- 9.5.2 WebVision
- 9.5.3 MIRFlickr
- 9.6 Retrieval on InstaCities1M and WebVision Datasets
- 9.6.1 Experiment Setup
- 9.6.2 Results and Conclusions
- 9.6.3 Error Analysis
- 9.6.3.1 Visual features confusion
- 9.6.3.2 Errors from the dataset statistics
- 9.6.3.3 Words with different meanings or uses
- 9.7 Retrieval in the MIRFlickr Dataset
- 9.7.1 Experiment Setup
- 9.7.2 Results and Conclusions
- 9.8 Comparing the Image and Text Embeddings
- 9.8.1 Experiment Setup
- 9.8.2 Results and Conclusions
- 9.9 Visualizing CNN Activation Maps
- 9.10 Visualizing the Learned Semantic Space with t-SNE
- 9.10.1 Dimensionality Reduction with t-SNE
- 9.10.2 Visualizing Both Image and Text Embeddings
- 9.10.3 Showing Images at the Embedding Locations