The story behind the “IKEA Assembly dataset” paper

Every paper has a story behind it. That story is like a cake. It usually has a few layers of science, several layers of people, a lot of emotions mixed into the batter, and the frosting on top is the acceptance, publication, and recognition it gets. Good papers don’t always feel like they are going to end up that way; even if you think the recipe is good, you always wonder what will come out of the oven. This is the first in a series of “behind the paper” war stories – sharing the mess in the kitchen behind the sweet taste of my publications.

Paper: https://arxiv.org/abs/2007.00394 (accepted to WACV 2021, which was supposed to be in Hawaii but was held online 🙁)
Project website: https://ikeaasm.github.io/

How it all began

The story begins in July 2019. I had just arrived in Australia with my wife and two kids (2yo, 6yo), fresh out of my PhD at the Technion (Israel). I was super excited and motivated to start my first “grown-up” job as a research fellow at the ANU node of the ACRV. I met Steve (Prof. Stephen Gould), my supervisor, and we walked to the campus outdoor coffee shop. It was freezing, but I was fired up by the project he described. We were going to build a perception system that would help people assemble IKEA furniture. At the time, I had a few pieces of IKEA furniture in my apartment that I had put together more than once because I had put one of the parts in the wrong way, so it was an easy sell. The general idea was to create something holistic that would capture and extract as much information as we could – multi-view RGB images, depth images, 3D point clouds, human poses, hand poses, part poses, geometry, actions, CAD models, human-object interactions, track everything and everyone, estimate what they had for breakfast and why they are so grumpy. I’m sure Steve’s list was shorter and better defined; this is just how it felt in my head. I was about to be overwhelmed by the amount of work coming my way when Steve mentioned the team.

Meeting the team

The team included Xin (Dr. Xin Yu), Fatemeh (Dr. Fatemeh Saleh), Dylan (Dr. Dylan Campbell), Cristian (soon to officially be Dr. Cristian Rodriguez), Hongdong (Prof. Hongdong Li), and Steve (Prof. Stephen Gould). The first time we were all in one room was at the first “Human, Robots, and Actions” project meeting, but mostly we were two doors away from each other.

Before I arrived in Australia, the team had decided to use the Kinect V2. It is almost as if they asked themselves, which sensor does Itzik like best? (The Kinect V2 creates really good point clouds for its price tag.) It has an IR sensor that we thought about utilizing for automatic labelling using IR ink. The idea was to use something invisible to the human eye (i.e., to the RGB camera) but visible to the IR sensor, so we would be able to label the parts automatically. Well, that didn’t really work. Introducing exhibit A:

Exhibit A: me, holding an IKEA Lack table leg that has a black and white fiducial and a strip of IR ink above it. Can you see the strip?

So, we decided to manually label it later.

During my first month on the project, Xin and I were inseparable. We were leading the charge of collecting the data.

The IKEA dataset guys

We became “The IKEA Dataset guys” – the superheroes who would tap you on the shoulder and get you to assemble furniture on camera. It really felt like a superpower when we got Prof. Richard Hartley to assemble one:

Richard Hartley assembling IKEA furniture

The capture setup was simple yet annoying. It included a huge tripod with two Kinects (top and front views), an additional tripod with another Kinect (side view), and Xin’s computer. Sounds not too bad, but in practice there were so many cables. Speaking of annoying, we initially wanted to use Microsoft’s SDK but soon discovered that it is impossible to capture feeds from multiple sensors simultaneously, so we moved on to libfreenect2, which didn’t have all of the SDK features implemented. We also wanted to capture at a high frame rate, so we pushed most of the processing offline.
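For the technically curious, here is roughly what driving several Kinect V2 sensors through libfreenect2 looks like. This is only a minimal sketch built from the library’s standard (Protonect-style) API, not our actual capture code, which also handled recording to disk, timestamps, and so on:

```cpp
// Minimal sketch: open every connected Kinect V2 with libfreenect2 and
// grab one synchronized RGB/IR/depth frame set per device.
#include <libfreenect2/libfreenect2.hpp>
#include <libfreenect2/frame_listener_impl.h>
#include <iostream>
#include <string>
#include <vector>

int main()
{
  libfreenect2::Freenect2 freenect2;
  const int num_devices = freenect2.enumerateDevices();
  if (num_devices == 0) { std::cerr << "No Kinect V2 devices found\n"; return -1; }

  std::vector<libfreenect2::Freenect2Device*> devices;
  std::vector<libfreenect2::SyncMultiFrameListener*> listeners;

  // Open each sensor (front, top, and side views in our rig).
  for (int i = 0; i < num_devices; ++i) {
    const std::string serial = freenect2.getDeviceSerialNumber(i);
    libfreenect2::Freenect2Device* dev = freenect2.openDevice(serial);
    auto* listener = new libfreenect2::SyncMultiFrameListener(
        libfreenect2::Frame::Color | libfreenect2::Frame::Ir | libfreenect2::Frame::Depth);
    dev->setColorFrameListener(listener);
    dev->setIrAndDepthFrameListener(listener);
    dev->start();
    devices.push_back(dev);
    listeners.push_back(listener);
  }

  // Grab one frame set from each device; in practice the raw frames would be
  // dumped to disk here and all heavy processing done offline.
  for (size_t i = 0; i < devices.size(); ++i) {
    libfreenect2::FrameMap frames;
    listeners[i]->waitForNewFrame(frames);
    libfreenect2::Frame* rgb   = frames[libfreenect2::Frame::Color];
    libfreenect2::Frame* depth = frames[libfreenect2::Frame::Depth];
    std::cout << "device " << i << ": " << rgb->width << "x" << rgb->height
              << " RGB, " << depth->width << "x" << depth->height << " depth\n";
    listeners[i]->release(frames);
  }

  for (auto* dev : devices) { dev->stop(); dev->close(); }
  for (auto* listener : listeners) delete listener;
  return 0;
}
```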

Xin and I checked the file names, frame rate, camera feeds, pushed to git, fixed bugs, did more trial runs, pushed to git, fixed some more issues, pushed to git, scheduled people to assemble, documented, pushed to git (Steve loves git). Fatemeh, Dylan and Cristian helped us pre-assemble fasteners, carry the capturing setup and furniture parts around the building, and disassemble the furniture (someone had to disassemble all of the furniture that people assembled). Sometimes I think we could have also created the “IKEA furniture disassembly dataset” at no additional cost.

We were finally ready. We spent the next two weeks collecting about 200 furniture assembly videos. We collected all day every day. We even collected data at the ACRV’s major events like RVSS and RoboVis, where we got all of the big Profs to assemble furniture (even Prof. Peter Corke, the centre’s director).

The highest high and the lowest low

The climax of the data collection extravaganza was when Steve invited everyone (team + student assemblers) to his house for a grand finale of collecting as much data as we could while munching pizza.

Pizza is always a great motivator to collect more data

The sheer number of people around and the fact that we could set up 3 unique scenes within that single space allowed us to capture an additional 170 assemblies in one day.

We had Steve and his kids assemble. We had my kids assemble. We had a dinosaur assemble. Everyone had a lot of fun.

A dinosaur assembling IKEA furniture

Later that afternoon we loaded up the cars and returned all of the gear to ANU. Everyone was tired but satisfied; it was one of those moments when you could just feel that things are going the right way.

A few hours later, I got the text.

My eyes teared up when I read it. I went through all of the stages of grief. First I was in denial; I thought someone was playing a prank. Then I was angry, mostly at myself (how did we miss it?). Finally, I accepted reality: due to a glitch in the capture setup we had only captured one depth stream instead of three. It was the lowest low.

I vividly remember the next morning. Steve came into my office. Clearly, he didn’t know yet. He was still smiling. I told him. He was still smiling and said “Come on… really?” (denial). He skipped the anger stage (as far as I could tell), and after a few questions trying to understand how it happened and what we did have, he accepted it. I know one thing: the way Steve responded to the bad news was an inspiration to us all. It’s like he knew how badly we all felt and understood that the right thing to do was to pick up the pieces. A true leader (I know he is probably reading this. I am not kissing ass. It was truly a special moment).

The team’s reaction was also amazing. No one blamed anyone. Everyone was supportive. I remember thinking how lucky I was to be a part of this amazing team. The decision was to accept the imperfection and continue with what we have to the next step.

Put a label on it

The next months were focused on labelling the data. Dylan was in charge of human pose, Fatemeh of segmentation and tracking, and I took on the action annotations. Cristian was the team’s “go-to mechanical turk” extraordinaire, and Xin got a position at UTS in Sydney (bye Xin, we still miss you!).

Labelling took a lot of time. We built labelling tools, got students to label, screened turkers, validated annotations, created pseudo ground truths by overfitting models to the data, corrected labels, and validated again, until…

It was finally ready.

We set a goal to submit to WACV 2021 (who doesn’t want a fully funded trip to Hawaii?). I created a shared Overleaf LaTeX document and everyone poured their content in.

We subdivided the dataset into train and test sets. We trained SOTA models as benchmarks. Then we subdivided it again (after realizing that my subdivision code had a bug in it). Then we trained the models again, and then again (to set the parameters to match those reported in the original papers). I really wanted to squeeze in a new action recognition method that could use all of the data streams together, but there was just not enough time.

We submitted the paper.

I remember the relief I felt when clicking the “submit” button. At that point, I couldn’t see the beauty and contribution in the dataset, only its compromises and flaws.

Every end is a new beginning

I think one of the things I enjoyed most about this project is the camaraderie and bonding between the members of the team. Personally, it was also a great way to get to know everyone – no need for a conversation starter if you can simply ask “Hey, did I get you to assemble some furniture yet?”. It was clear to me that the submission of the paper was one of those “every end is a new beginning” moments. The IKEA ASM dataset paper was done, but it was the starting point for (hopefully) many more papers to come.

The paper was accepted.

Remember: Your papers are like your children, even if they are annoying sometimes you love them all.

Bonus memory:
Kids make research more… interesting (Keshet, my 2yo, joining the assembly, while Shaked, my 6yo, is a dinosaur in the background).

What’s next?

Working on a live demo (with a robot!):

Photo credit: Sadegh Aliakbarian, Dylan Campbell, and Yicong Hong