Last October, I attended ICCV 2019 (in the motherland!). Here are some thoughts on the conference, cool papers, & key takeaways.
By the numbers
ICCV ’19: 7500 attendees (x 2.41 growth since ’17), 4300 submissions, 1000 accepted papers.
CVPR ’19: 9200 attendees (x 1.4 growth since ’18), 5100 submissions, 1300 accepted papers.
CVPR is still the big brother to ICCV w.r.t. attendance & # of papers (both w/ acceptance rates ~25%). ICCV’s attendance spike in 2019 was likely boosted by being held in Seoul (an easy trip for attendees from Korea & China).
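As a quick sanity check on the "~25%" claim, here is a minimal sketch that recomputes the acceptance rates from the rounded counts above. It also backs out the previous attendance implied by the growth factors. The inputs are the rounded figures from this post, so the outputs are approximate; the function names are just illustrative.

```python
# Rounded figures copied from the counts above; no new data is introduced.
conferences = {
    "ICCV 2019": {"submissions": 4300, "accepted": 1000, "attendees": 7500, "growth": 2.41},
    "CVPR 2019": {"submissions": 5100, "accepted": 1300, "attendees": 9200, "growth": 1.4},
}


def acceptance_rate(accepted, submissions):
    """Fraction of submitted papers that were accepted."""
    return accepted / submissions


def implied_previous_attendance(attendees, growth):
    """Attendance at the earlier edition implied by the quoted growth factor."""
    return attendees / growth


for name, c in conferences.items():
    rate = acceptance_rate(c["accepted"], c["submissions"])
    prev = implied_previous_attendance(c["attendees"], c["growth"])
    print(f"{name}: acceptance {rate:.1%}, implied previous attendance ~{prev:.0f}")
```

With these rounded inputs, ICCV lands at roughly 23% & CVPR at roughly 25%, so "~25%" is a fair summary for both.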
PAMI TC Awards (for 10-year-old papers with significant impact on CV)
- Building Rome in a Day
- This was a sparse 3D-reconstruction project of groundbreaking scale. Unlike Google Earth, where images are controlled (& geographic info is readily available), this project used photos from the “wild” (Flickr) to “reconstruct” Rome & other cities in a day.
- Attribute & Simile Classifiers for Face Verification
- This was also a groundbreaking project: it used SVMs as binary classifiers for “attributes” & “similes”, then used those classifier outputs for face verification/matching.
- Labeled Faces in the Wild (LFW)
- Unlike previous facial datasets (collected in controlled environments), this dataset contained faces from the “wild”. LFW long served as the de facto benchmark for face verification, much as ImageNet has for object recognition.
As at CVPR, applications of deep learning to classical methods dominated ICCV. As seen above, common themes were: object detection, recognition, unsupervised/self-supervised learning, few-shot learning, & GANs.
Best paper: SinGAN
Google/Technion designed a GAN capable of 1) training on a single image, & 2) generating extremely realistic variations of that image. Unlike previous methods, SinGAN can “single-shot train” to generate realistic, complicated outputs (more than simple textures).
Note that, unlike NVIDIA’s face generator, SinGAN is not designed primarily for faces. When applied to faces, SinGAN (with granularity/patch size set by the user) produces the minor, realistic variations seen above.
- FSGAN: Subject Agnostic Face Swapping and Reenactment
- Deepfake research was popular at ICCV. FSGAN can face swap/reenact without training on the source or target image. This means everyday users (for whom there are limited pictures available online) can be targeted.
- Detecting Photoshopped Faces by Scripting Photoshop
- Adobe released a method to detect whether an image has been Photoshopped. Helpful for Tinder maybe?
- StructureFlow: Image Inpainting via Structure-Aware Appearance Flow
- This DL-based image inpainting achieves SOTA results on extremely fine-grained textures.
- Fashion++: Minimal Edits for Outfit Improvement
- Facebook released a GAN that can propose “minimal adjustments” to a model’s outfit to “maximize fashionability”. The “fashionable” data for training the GAN is gathered online (not from Facebook/Instagram). The resulting GAN suggests garments, accessories, fit (e.g. baggy pants), & style (e.g. rolling up the sleeves).
- FACSIMILE: Fast and Accurate Scans From an Image in Less Than a Second
- Amazon Body Labs released a method to generate albedo-texture body scans from a single RGB image.
- Everybody Dance Now
- This was the coolest paper from the conference. This Berkeley paper uses a GAN to “motion transfer” from a source video to a target. The resulting footage is a deepfake that turns the amateur/target into a professional dancer. Just watch:
Interesting robotics papers
- Visual SLAM: Why Bundle Adjust?
- This workshop paper suggests rotation averaging (RA) over bundle adjustment (BA). It states that RA performs as well as BA, but is ~2 orders of magnitude more efficient.
- GLAMpoints: Greedily Learned Accurate Match points
- This is a CNN-based feature descriptor that, on the surface, does better than SIFT. However, its detection success rates are evaluated on the FIRE dataset, which contains medical images of retinas (high motion blur, limited GT) that look very different from robotic data.
- Due to the difficulties of gathering GT on retina images, the researchers used self-supervised learning for evaluations.
- GSLAM: A General SLAM Framework and Benchmark
- This is a great open-source library from China that allows SLAM researchers to build & test SLAM frameworks. See: https://github.com/zdzhaoyong/GSLAM
- Privacy preserving image queries for camera localization
- This is a very interesting paper that obfuscates image features prior to uploading them to the cloud, thereby protecting user privacy against image inversion/reconstruction.
- There was a similar paper from the same author at CVPR 2019 which proposed converting 3D points → 3D lines. This ICCV paper builds on that by proposing to convert 2D points → 2D lines.
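To make the 2D points → 2D lines idea concrete, here is a minimal geometric sketch. It assumes each keypoint is replaced by a line through it whose direction is drawn at random. The keypoint coordinates and function names are hypothetical, & none of the paper's actual pipeline (descriptors, pose solver) is reproduced here.

```python
import math
import random


def lift_point_to_line(x, y, rng):
    """Return (a, b, c) with a*x + b*y + c = 0: a randomly oriented line through (x, y)."""
    theta = rng.uniform(0.0, math.pi)          # random line direction (assumption)
    a, b = -math.sin(theta), math.cos(theta)   # unit normal to that direction
    c = -(a * x + b * y)                       # forces the line through (x, y)
    return a, b, c


def point_line_distance(x, y, line):
    """Distance from (x, y) to the line; (a, b) is unit length, so no normalization."""
    a, b, c = line
    return abs(a * x + b * y + c)


rng = random.Random(0)
keypoints = [(120.5, 64.0), (301.2, 218.7), (15.0, 400.3)]  # hypothetical pixel coords
lines = [lift_point_to_line(x, y, rng) for x, y in keypoints]

for (x, y), (a, b, c) in zip(keypoints, lines):
    # Any other point along the line fits (a, b, c) equally well,
    # so the line alone does not reveal where the keypoint was.
    other = (x + 50.0 * b, y - 50.0 * a)
    print(point_line_distance(x, y, (a, b, c)), point_line_distance(*other, (a, b, c)))
```

Only the `(a, b, c)` triples would be uploaded in this sketch. The original pixel location is one of infinitely many points on each line, which is what makes image inversion harder.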
- SLAMANTIC – Leveraging Semantics to Improve VSLAM in Dynamic Environments
- This was a Naver Labs (the “Korean Google”) paper. Naver has a group in France doing indoor robotics research. While they don’t produce consumer robots, they work on everything from SLAM, semantic segmentation, & object detection to VO.
- This paper attacks the problem of localizing in dynamic environments. If a car that was previously assumed static begins to move (but the robot is stationary), VSLAM assumes the robot itself is moving backwards. To solve this problem, this paper performs semantic segmentation & classifies each object as “static”, “dynamic”, or “semi-dynamic”. These labels/values are then used during VSLAM optimization to improve localization in dynamic environments.
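Here is a toy sketch of why such labels help, under heavy simplification. It estimates a single 1D image shift as a weighted average of feature displacements, with one weight per semantic class. The class weights, observations, & function name are all hypothetical; this is not the paper's actual formulation, just the down-weighting idea.

```python
# Hypothetical per-class confidence that a feature is attached to something static.
CLASS_WEIGHT = {"static": 1.0, "semi-dynamic": 0.5, "dynamic": 0.0}


def estimate_shift(observations, weights=None):
    """Weighted least-squares estimate of one 1D shift.

    observations: list of (semantic_label, displacement_in_pixels).
    weights: optional dict mapping label -> weight; None means every feature counts equally.
    """
    num = den = 0.0
    for label, displacement in observations:
        w = 1.0 if weights is None else weights[label]
        num += w * displacement
        den += w
    if den == 0.0:
        raise ValueError("no usable (non-zero weight) observations")
    return num / den


# Stationary robot: background features do not move, but a car pulls away.
observations = [
    ("static", 0.0),        # building corner
    ("static", 0.0),        # road marking
    ("semi-dynamic", 0.0),  # parked car
    ("dynamic", 3.0),       # moving car
    ("dynamic", 3.0),       # moving car
]

naive = estimate_shift(observations)                   # treats the moving car as static
weighted = estimate_shift(observations, CLASS_WEIGHT)  # ignores dynamic features
print(f"naive shift: {naive:.2f} px, semantically weighted shift: {weighted:.2f} px")
```

The naive estimate reports apparent motion even though the robot never moved, while the weighted estimate stays at zero. A real system would fold such weights into the full pose optimization rather than a 1D average.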
- Daniel Cremers (“Kobe Bryant” of VO/SLAM research) discussed the SOTA in VO/SLAM.
- Per Cremers, DL-based VO/SLAM is promising but not yet SOTA. However, dense VO is all the rage in research & his company (ArtiSense) seems keen on this.
- He is a strong advocate for VIO & seemed skeptical of the prospects of fully end-to-end, learning-based VO/SLAM solutions.
- Marc Pollefeys (ETH Zurich/Microsoft) discussed his “AutoVision” project. This comes out of ETH Zurich & aims to do pure, camera-based navigation (LIDAR is used only for GT).
- They leverage sparse & dense VO for localization. Pollefeys harped on the difficulties of mapping as features change over time.
- China dominates computer vision research. More than 1/3 of accepted papers came from China, while ~30% came from the U.S.
- This is likely a combination of China’s investment in vision/AI research, as well as master’s students in China being required to publish papers to graduate (unlike in the U.S.).
- “Big 3” (Google, Facebook, Amazon) are everywhere. These 3 accounted for 10+% of all papers.
- Notably, Google had the best paper (SinGAN) & discussed MobileNet V3 (which had been around for some time). All 3 companies released papers in the major-theme-areas from the word cloud seen earlier.
- Autonomous driving companies & research are everywhere. Most applied-DL papers were clearly intended for autonomous driving. It was clear this is the hottest area in vision.
- Deepfakes & privacy concerns are real. Methods for deepfake generation & the corresponding detection methods were of high interest.
- On a related note, the SOTA in surveillance/re-ID drew both excitement & concern. There were strong implications that such papers could be used for state-sponsored surveillance of humans & vehicles.
- Applications to embedded platforms are limited. It was hard to find research that targeted, or had been tested on, embedded platforms. While DL methods are attractive for potential robotic applications, nearly all methods assume a powerful GPU (e.g. NVIDIA GTX 1080) & are not yet ready for real-time use on battery-powered robots.
Overall, a fantastic conference, & as a Korean-American, I felt great pride in seeing such a conference held in Seoul. Thanks to all who helped make it happen!