[ECCV 2024] D-SCo paper review

[Title] D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction

[Keyword] 3D Object shape reconstruction, Diffusion

[Journal] ECCV, 2024

[Paper] https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/04261.pdf

[Summary]

3D object point cloud → diffusion
Puts a constraint on the centroid of the object point cloud so that implausible HOI (hand-object interaction) is not reconstructed
Uses a unified dual-stream embedding (hand-object semantic & geometric embeddings) as the condition, which acts as a prior against self-occlusion

1. Intro

Existing category-agnostic 3D object reconstruction mostly relies on SDFs (+ marching cubes), occupancy networks, etc.
- but objects reconstructed with these methods are often over-smoothed or missing fine details
Generating the object point cloud directly with diffusion can address these problems, but it comes with the following 2 main challenges.
1. During sampling, the center of the object point cloud keeps shifting as denoising proceeds
  - if the object center ends up on the back of the hand or inside the hand, the results are implausible and penetration occurs
2. Most prior work uses single-stream sampling conditioned only on 2D image features
  - this does not cope well with self-occlusion caused by the hand, etc.
To address these,
1. Train a hand-conditioned object center estimator
  - during sampling, use it as a guide to keep the center of the object point cloud from shifting
  - since the object center is given, there is no need to reconstruct both object shape and position; training a simple diffusion model that reconstructs shape only is enough → lower computational cost
2. semantic embedding + geometric embedding → unified hand-object semantic embedding (dual-stream)
  - a unified HOI embedding that combines the semantic prior and the geometric prior is used as the condition
  - this acts as a strong prior for self-occluded parts

notation
- $X_t$ : 3D obj points at timestep t
- $\bar{X}_t$ : center of $X_t$
- $\mathcal{M}$ : GT obj transl.
- $\hat{\mathcal{M}}$ : pred obj transl.

Forward process (train)
- Force the center ( $\bar{X}_0$ ) of the target points ( $X_0$ ) to the GT object translation ( $\mathcal{M}$ ), then run the forward process
- The noise schedule follows DDPM
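
A minimal sketch of the two training-time steps above: shift $X_0$ so its centroid equals $\mathcal{M}$, then sample $X_t$ with the standard DDPM closed form. The linear beta schedule, the point count, the translation value, and the helper names `center_to` and `ddpm_forward` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def center_to(points, target):
    """Shift a point cloud (N, 3) so that its centroid equals `target` (3,)."""
    return points - points.mean(axis=0, keepdims=True) + target

def ddpm_forward(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) and return the noise used."""
    noise = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise
    return xt, noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear DDPM schedule
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((2048, 3))        # toy object point cloud
M = np.array([0.05, -0.02, 0.40])          # hypothetical GT object translation
x0 = center_to(x0, M)                      # centroid of X_0 is now M
xt, eps = ddpm_forward(x0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
```
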

Reverse process (inference)
- Object centroid estimation
  1. Estimate camera pose and hand pose from the input image with an off-the-shelf model
  2. Hand pose from (1) → MANO → hand vertices
  3. Feed the input image to a backbone (ResNet-18) and the hand vertices to a PointNet-based network to extract image features and point features, respectively
  4. Concat the two features → 2 MLPs → estimate the 2D & 3D object centroid ( $\hat{\mathcal{M}}$ )
  5. At every step, reset the object centroid to $\hat{\mathcal{M}}$
- Object rotation is not handled separately here.
  - In general, rotation prediction is much harder than translation prediction
  - Prior work observed that, even without a canonical rotation, diffusion reconstructs the rest well as long as the translation is right.
  - So no constraint is placed on rotation
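
The centroid-fixing in step 5 can be sketched as a re-centering applied after every reverse step. This is only an illustration: `denoise_step` is a hypothetical placeholder for one reverse diffusion step of the trained model, and `toy_step` is a stand-in that merely perturbs the points so the example runs.

```python
import numpy as np

def recenter(points, centroid):
    """Shift a point cloud (N, 3) so that its centroid equals `centroid` (3,)."""
    return points - points.mean(axis=0, keepdims=True) + centroid

def sample_with_fixed_centroid(denoise_step, m_hat, num_points, num_steps, rng):
    """Run a reverse loop, resetting the centroid to m_hat after every step."""
    x = recenter(rng.standard_normal((num_points, 3)), m_hat)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)   # hypothetical: one reverse step of the denoiser
        x = recenter(x, m_hat)   # keep the object centroid at the predicted value
    return x

def toy_step(x, t):
    """Stand-in for a real denoiser step; it deliberately drifts the centroid."""
    return 0.9 * x + 0.01

rng = np.random.default_rng(0)
m_hat = np.array([0.05, -0.02, 0.40])      # hypothetical predicted centroid
points = sample_with_fixed_centroid(toy_step, m_hat, num_points=128, num_steps=10, rng=rng)
```
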

Unified Hand-Object Semantic Embedding
- 2D image features are an important cue for object reconstruction.
- but rather than simply using a global image embedding, using point-wise deep image features mapped to each point helps point cloud denoising more directly
- To do this,
  1. img → ResNet (or ViT) → image feature
  2. hand/obj point cloud → Rasterizer → projected 2D point cloud
  3. For the 2D point cloud from (2), look up in (1) the image feature at the pixel each point falls on, and map that feature to the point
  4. Append a one-hot hand/obj encoding as the last channel
- This gives a feature embedding that carries hand and object semantic information point-wise
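
A minimal sketch of steps 2 to 4, under simplifying assumptions: a pinhole projection with nearest-pixel lookup is swapped in for the rasterizer, the intrinsics `K` are assumed to be expressed at feature-map resolution, and the hand/object label is a single binary channel. The function name `pointwise_image_features` is hypothetical.

```python
import numpy as np

def pointwise_image_features(points, is_object, feat_map, K):
    """
    points:    (N, 3) points in the camera frame (z > 0)
    is_object: (N,)   bool, True for object points, False for hand points
    feat_map:  (C, H, W) image feature map
    K:         (3, 3) pinhole intrinsics at feature-map resolution (assumption)
    returns:   (N, C + 1) per-point features with the label as last channel
    """
    C, H, W = feat_map.shape
    uvw = points @ K.T                               # project to homogeneous pixels
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    feats = feat_map[:, v, u].T                      # (N, C) nearest-pixel lookup
    label = is_object[:, None].astype(feats.dtype)   # hand = 0, object = 1
    return np.concatenate([feats, label], axis=1)

K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]])
feat_map = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
points = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]])
emb = pointwise_image_features(points, np.array([True, False]), feat_map, K)
```

A real implementation would more likely use bilinear sampling and handle occluded points through the rasterizer; nearest-pixel lookup is used here only to keep the example short.
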

Hand Articulation Geometric Embedding
- Object shape is strongly affected by hand pose. In other words, hand pose information can serve as a very important constraint when reconstructing object shape
- To do this,
  1. Obtain the rotation and translation of each of the 15 hand joints (via forward kinematics)
  2. Apply all of them to the object point cloud (R * point + transl)
  3. This finally gives a point-wise pose feature in $\mathbb{R}^{N \times 45}$ (15 joints × 3 coordinates)
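
The three steps above can be sketched as follows. Whether the per-joint transforms map into or out of each joint frame is not stated in these notes, so the sketch simply applies `R @ p + t` as written; the function name and the toy inputs are assumptions.

```python
import numpy as np

def articulation_embedding(points, joint_R, joint_t):
    """
    points:  (N, 3)     object points
    joint_R: (15, 3, 3) per-joint rotations (from forward kinematics)
    joint_t: (15, 3)    per-joint translations
    returns: (N, 45)    each point expressed under all 15 joint transforms
    """
    # transformed[n, j] = joint_R[j] @ points[n] + joint_t[j]
    transformed = np.einsum("jab,nb->nja", joint_R, points) + joint_t[None]
    return transformed.reshape(points.shape[0], -1)

rng = np.random.default_rng(0)
points = rng.standard_normal((100, 3))
joint_R = np.tile(np.eye(3), (15, 1, 1))   # toy input: identity rotations
joint_t = rng.standard_normal((15, 3))
geo_emb = articulation_embedding(points, joint_R, joint_t)
```
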

Dual-Stream Denoiser
- Denoise the point cloud using the 2 embeddings obtained so far as conditions.
- A naive way to use the 2 embeddings is to simply concat them and feed them into a single head that controls the noise.
  - but simply concatenating different domains and feeding them into one head degrades model performance, as many earlier works have also shown
- So instead, attach one head per embedding, concat the outputs of the heads, and use the feature obtained by passing this through an MLP
  - This way, a head specialized to each domain can be learned.
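
A minimal sketch of the two-head fusion described above, with randomly initialized weights and no training. Layer sizes, the single linear layer per head, and the class name `DualStreamFusion` are assumptions; the actual denoiser also takes the noisy points and the timestep, which are omitted here for brevity.

```python
import numpy as np

def init_linear(rng, d_in, d_out):
    """Return (weight, bias) for a linear layer with simple scaled init."""
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in), np.zeros(d_out)

def relu(x):
    return np.maximum(x, 0.0)

class DualStreamFusion:
    """One head per embedding, then concat, then a fusion layer (sizes are hypothetical)."""

    def __init__(self, d_sem, d_geo, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.sem_head = init_linear(rng, d_sem, d_hidden)   # semantic stream
        self.geo_head = init_linear(rng, d_geo, d_hidden)   # geometric stream
        self.fuse = init_linear(rng, 2 * d_hidden, d_out)   # fusion layer

    def __call__(self, sem_emb, geo_emb):
        h_sem = relu(sem_emb @ self.sem_head[0] + self.sem_head[1])
        h_geo = relu(geo_emb @ self.geo_head[0] + self.geo_head[1])
        h = np.concatenate([h_sem, h_geo], axis=1)          # concat head outputs
        return h @ self.fuse[0] + self.fuse[1]

rng = np.random.default_rng(1)
sem_emb = rng.standard_normal((256, 65))   # e.g. 64 image channels + 1 label
geo_emb = rng.standard_normal((256, 45))   # 15 joints x 3
model = DualStreamFusion(d_sem=65, d_geo=45, d_hidden=32, d_out=3)
pred_noise = model(sem_emb, geo_emb)       # per-point 3D noise prediction
```
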

Diffusion model
- As in other diffusion models, train to predict the noise added to the perturbed point cloud
- In addition, $L_{mask}$ is used to put extra supervision on the object shape.
  - Project the GT and predicted points to 2D with the rasterizer, then apply an L1 loss between the two.
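
Written out, the objective described above has roughly the following form. The squared L2 norm on the noise term is the usual DDPM choice and the weight $\lambda$ is an assumption; neither is confirmed by these notes. Here $\epsilon$ is the added noise, $\epsilon_\theta$ the denoiser, $c$ the dual-stream condition, $\hat{X}_0$ the predicted points, and $\pi$ the rasterizer projection to 2D.

$$
L = \underbrace{\lVert \epsilon - \epsilon_\theta(X_t, t, c) \rVert_2^2}_{L_{noise}} + \lambda \underbrace{\lVert \pi(\hat{X}_0) - \pi(X_0) \rVert_1}_{L_{mask}}
$$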

Centroid prediction network
- Supervise the 3D point and the 2D point separately, and additionally add a regularizer so that the projected 3D point and the 2D point do not drift far apart.
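
A minimal sketch of such a loss, assuming L1 terms, a pinhole camera with intrinsics `K`, and a hypothetical regularizer weight `w_reg`; the paper's exact norms and weights are not given in these notes.

```python
import numpy as np

def project(p3d, K):
    """Pinhole projection of a 3D point (3,) to pixel coordinates (2,)."""
    uvw = K @ p3d
    return uvw[:2] / uvw[2]

def centroid_loss(pred_3d, pred_2d, gt_3d, gt_2d, K, w_reg=1.0):
    """3D supervision + 2D supervision + 2D/3D consistency regularizer."""
    loss_3d = np.abs(pred_3d - gt_3d).sum()
    loss_2d = np.abs(pred_2d - gt_2d).sum()
    loss_reg = np.abs(project(pred_3d, K) - pred_2d).sum()  # projected 3D vs. predicted 2D
    return loss_3d + loss_2d + w_reg * loss_reg

K = np.array([[500.0, 0.0, 128.0], [0.0, 500.0, 128.0], [0.0, 0.0, 1.0]])
gt_3d = np.array([0.05, -0.02, 0.40])
gt_2d = project(gt_3d, K)
perfect = centroid_loss(gt_3d, gt_2d, gt_3d, gt_2d, K)
shifted = centroid_loss(gt_3d + 0.01, gt_2d, gt_3d, gt_2d, K)
```
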