Julian Chibane, MPI-INF: 3D reconstruction using implicit functions

Generally Intelligent

Imbue

Mar 05, 2021

Julian Chibane is a PhD student at the Real Virtual Humans group at the Max Planck Institute for Informatics in Germany. His recent work centers around implicit functions for 3D reconstruction, and his most recent paper at NeurIPS is “Neural Unsigned Distance Fields for Implicit Function Learning.” He also introduced Implicit Feature Networks (IF-Nets) in “Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion.” Julian is open to collaborators interested in similar work, so feel free to reach out!

Highlights

How, surprisingly, the IF-Net architecture learned reasonable representations of humans & objects without priors
A simple observation that led to Neural Unsigned Distance Fields, which handle 3D scenes without a clear inside vs. outside (most scenes!)
Navigating open questions in 3D representation, and the importance of focusing on what's working

Transcript

Kanjun Qiu: How did you get to where you are and develop your initial research interests?

Julian Chibane: As a child, I was very fascinated by image processing. For example, 3D rendering — that was really something I loved and played around with at a really early age — and video editing. I did this with computers. My mother was a computer teacher. In university, I had courses on statistics, computer vision, and 3D point cloud processing. I felt that this was something really cool and was really drawn into this.

Kanjun: What did you find cool about it?

Julian: In 3D point cloud processing, we learned that from images we can reconstruct the world in 3D, and it was pretty amazing for me to go from a 2D projection of the world to something that's 3D, interpretable, and that you can render from different viewpoints. This representation of the world is much more interactive than just an image. You can look at an image, but that's it; you cannot really move the viewpoint or adjust things like that. With a 3D representation, you can do so much more. You might even be able to let an agent walk in this environment or interact with the 3D environment and things like that. That was part of this course, and it was just so fascinating.

Kanjun / Josh: And so after you took this course, you decided to do your PhD in a similar field?

Julian: I also had another course, a computer vision course. It was not so much neural networks and deep learning; it was more of the traditional stuff. My professor told me about the MPI, which is known all over Germany as an institution for doing high-quality research. After my master's, he pointed me towards the MPI, I followed his suggestion, and that's how I became a PhD student, I guess.

Kanjun / Josh: And it seems like you've been pretty focused on [00:02:00] 3D reconstruction this whole time so far. What do you feel like has been most exciting to you in the field?

Julian: Take something like the output of a Kinect, which gives you a point cloud of the front of a person, for example, which might have some nice details. Capturing these details, but also completing the shape of the full human body, that's something I really think is pretty cool.

Kanjun / Josh: Yeah, actually going into, I think it was your first paper, that was one of the things that it looked like it was possible to do. And I also, as I was reading that paper, I think probably Julian will do a better job than me at describing the paper. So I was going to let him describe that. That was one of the things that I thought was the most interesting from that. It did such a good job on this problem that was actually a really hard problem. Like you can't see anything in the back, like you can't do like a normal - you have a surface with a little hole and you like smooth over it and fill it in. You can't like smooth it over the back and fill it in and it just looks really good. So yeah, do you want to tell us a little bit about that first paper? The title of that one, if I remember correctly, was "Implicit Functions in Feature Space for 3D Shape [00:03:00] Reconstruction and Completion."

Julian: Exactly. It's about 3D reconstruction. You have some deficient input. So you maybe have a point cloud just captured from one side and you're missing the back. Such data can come from an Xbox Kinect, for example, which is used for gaming and a bunch of other applications. There are also other 3D scanners, like LiDAR scanners, that are pretty exact, right? But they give you a sparse point cloud for objects that are further away. And then you're interested in getting the full surface out of things like that. That's the application scenario of the paper.

Kanjun / Josh: And what was the story about that paper where you started your PhD? Did you join an existing project or did you come there and wander around for a little while, thinking about what you wanted to do or were you just handed this idea from on high, like go do this? How'd this come about?

Julian: It was my master's thesis. I was going to the MPI and I had several different project proposals from which I could choose. I really liked that proposal, and initially it was a bit different. It developed during the course of the work.

Kanjun / Josh: [00:04:00] What was it at the beginning? And then what happened?

Julian: At the beginning it was actually using implicit functions. It was a thing already, or there were three papers on implicit functions. So it was a really early idea, I would say. These three papers were, if I remember correctly, at the same conference in the same year. So it was Occupancy Networks, DeepSDF and IM-NET. We wanted to build on this idea of representing 3D geometry in order to model garments on top of humans. That was the original idea. We have a human body model, which is parametrized. You can choose the parameters such that you have a specific pose or a specific shape of a person. But this person would not have any clothing on. So it would be just the human body model without clothing. And then we would like to enhance that and put clothing on, and having a nice representation for that would be very important. That was the start of that idea.

Kanjun / Josh: And then what happened? How did it change?

Julian: At the end of the day it didn't change so much. I [00:05:00] guess we had humans with clothing. But we didn't have this body model anymore in the formulation, and we needed to have a step in between. The problem was that these first three papers turned out not to really enable us to model highly detailed clothing. Because you have so much information there: you have wrinkles, you have tiny things on top of the clothing and stuff. And also humans are very different from the datasets that were used in these first three papers. They used mostly rigid objects. So something like a car or something like an airplane, and that's just so much easier to represent than a human body, which can have any pose or any shape. So that was the first real problem that we had to tackle at that point, and that's what we did with IF-Nets. Later on, we could then follow the path of integrating the human body model. But first we had to solve the problem of reconstructing humans.

Kanjun / Josh: The IF-Net architecture is somewhat unique. How did you come up with that?

Julian: First we [00:06:00] applied these other papers, for example, the Occupancy Networks paper, and we found that it really has a lot of trouble representing humans because of their variance in poses. We needed to come up with a formulation that somehow uses the input data that we have, leveraging all the information that is in this input. And we felt like somehow the encoding of the inputs in the previous papers wasn't strong enough in order to really nail the reconstruction. So we had to change the encoding and that's how we started.

Kanjun / Josh: What are some of the sharp edges or hacks that you're a little bit uncomfortable with that are still in there? Or do you feel like, oh, it's pretty robust and we're pretty much done with it?

Julian: I would say actually it's really pretty robust. That's also why we have quite a few follow-up papers on that. There is one thing. The network has to learn for each point in 3D space whether this point is inside the object that you want to reconstruct or outside. That's the basic idea of this implicit learning. So instead of predicting the full [00:07:00] output at once, you predict per point whether this point in 3D space is inside or outside of the object. It can happen that if, during training of the network, you do not sample enough points far away from the surface or far away from the object, the network can be uncertain about the further surroundings, and then you might have some artifacts. That's something you can find. But it's simple to get rid of those. You just need to sample points there as well. And then the network learns not to put artifacts there. That's the only hacky thing that you can find. Other than that, I would say it does what you expect it to do.
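
The sampling fix described here can be sketched in a few lines. This is a toy illustration, not the IF-Net training code: the sphere stand-in for a scanned object, the function names `occupancy` and `sample_training_points`, and the noise scale `sigma` are all assumptions made for the example. It draws some training points close to the surface and some uniformly from the whole bounding box, so that the far-away region also gets labels.

```python
import math
import random


def occupancy(point, center=(0.0, 0.0, 0.0), radius=0.5):
    """Ground-truth label for a toy sphere: 1 if the point is inside, 0 if outside."""
    return 1 if math.dist(point, center) <= radius else 0


def sample_training_points(n_near, n_far, radius=0.5, sigma=0.05, bound=1.0, seed=0):
    """Return (point, label) pairs: n_near close to the surface, n_far anywhere in the box."""
    rng = random.Random(seed)
    samples = []

    # Near-surface samples: a random direction, at a radius jittered around the surface.
    for _ in range(n_near):
        direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in direction))
        r = radius + rng.gauss(0.0, sigma)
        point = tuple(r * c / norm for c in direction)
        samples.append((point, occupancy(point, radius=radius)))

    # Far samples: uniform in the bounding box, so empty space is also supervised.
    for _ in range(n_far):
        point = tuple(rng.uniform(-bound, bound) for _ in range(3))
        samples.append((point, occupancy(point, radius=radius)))

    return samples


samples = sample_training_points(n_near=1000, n_far=200)
print(len(samples), "training points,", sum(label for _, label in samples), "inside")
```

With `n_far=0` the labels would only cover a thin shell around the object, which is the situation in which the artifacts mentioned above can appear.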

Kanjun / Josh: Is there anything you feel like you learned that was surprising from doing this work?

Julian: How strong neural networks can be. Yes, we started with simpler tasks for the paper. We started with doing something like super resolution, where we had a low resolution object and we just wanted to have a bit of a higher resolution of the object, and this worked well. But then we started to do [00:08:00] some more crazy stuff. As I mentioned, we completed human bodies where parts were not there at all. Like an arm was completely missing and things like that. And the network really does something very reasonable there. So it really learns what a human looks like and should look like. That was really surprising for me actually, that these convolutional layers really do learn something pretty reasonable.

Kanjun / Josh: I was really surprised looking at that result and I was like, oh, I wonder was this something that took a long time for you to figure out and get working? Or was it something that actually the architecture just produced? And it sounds like you were also surprised. And so it seems like the architecture actually just learned some reasonable, fundamental representations of the actual objects.

Julian: Exactly. At least not to that degree. I think it was also a bit surprising for my supervisor. In a lot of research papers you always use the prior of human body models, so you have a human body model and then you do something with this. This gives you a prior on what the human looks [00:09:00] like. I was thinking that this is something you really need in order to do some good completions. But it turns out the network can really learn a lot of things just from data without using a parametric model.

Kanjun / Josh: Did you ever test it on non-human objects or like weird objects, like a leaf or nature, trees or anything like that? Because I know those are notoriously hard to reconstruct.

Julian: We did have quite a few experiments on objects other than humans. That's actually something that many people looking at IF-Nets probably miss. It's not only for humans. It basically can work on a bunch of objects of arbitrary classes. We used it on the ShapeNet dataset, and the ShapeNet dataset consists of man-made objects. There are cars in it, airplanes and things like that. We used this dataset to show that on these rigid objects, we can still outperform the prior work. And we used the human data to show that our approach also works on non-rigid objects. We took part in the [00:10:00] ECCV challenge with IF-Nets. There were two challenges. One challenge was on human data and the other challenge was on arbitrary data. In both cases it was 3D data of humans or arbitrary objects, and the arbitrary object class was really anything. So there were some fuzzy toys in there. There were some computers in there, also some human statues, anything. We reconstructed those 3D scans. We applied the model. The input was colored 3D scans, but with holes, either of humans or arbitrary objects, and the task was to complete the surfaces such that these scans do not have holes anymore, but also to complete the color. So that was new; it was an extension of the IF-Net paper. We extended the reconstruction from only surface to surface and color. Surprising to me was, first of all, that the color completion is very reasonable. If there's a hole between your shirt and your jeans, it will not only complete the surface there, but also will complete the [00:11:00] colors, the texture, and it will segment your t-shirt from your shorts, and it will look like a reasonable completion also in the texture space. We trained on these arbitrary objects and we tested the model on objects it had never seen before, and they had holes, right? The reconstruction still could remove these holes and do a pretty reasonable job, although the model had never seen these objects or categories. And even for me as a human, it wasn't easy actually. That was also quite impressive and quite interesting.

Kanjun / Josh: One of the things I was going to ask about the earlier paper, did you ever try training on the human dataset and then testing on the ShapeNet one or reverse to see how much of the dataset bias actually impacts it?

Julian: Not on the IF-Net paper, but we did something along those lines for the NDF paper. And we used a very similar network architecture. So I guess it applies also to some extent to IF-Net and we found that it does generalize quite a bit. We trained on ShapeNet and we tested it on real scenes. So that's not an experiment that actually made it to the [00:12:00] paper. But we tried that just in our experiments and we found that it does quite a reasonable job already.

Kanjun / Josh: That lines up with what you were saying about the second paper as well about working on unseen object classes. That's pretty challenging. I think a lot of these 3D reconstruction things that I've seen in the past are like, oh yeah, it looks pretty good, but you only trained on planes or cars or something. So they all look the same anyway. It's a lot harder to do it the other way.

Kanjun: Why do you think it works so much better than many of the methods out there, or are there other methods that you feel like, oh, this is just as good?

Julian: One very important reason is that we include local information of the inputs. So if we have an input, for example, this point cloud of a human where you have points only on the front of the human, and we want to reconstruct points near the surface of the human, then we really have local information in the network that tells the network how close it is to the inputs. And we have an encoding of this local region of the input, but we also encode more global information. So if you want to complete the backside of a human, [00:13:00] you really need to know where you are relative to the input point cloud, which has maybe only the front. There you need something very global. The main contribution of the paper is to show how to combine local as well as global information from the input.

Kanjun / Josh: That's done with the different multi-level 3D convolutions as well as I think there was some sort of sampling along the axes or something nearby, but it's not just looking at the one point, but there's also some points near it or something.

Julian: Exactly. So first of all, yes, we have a 3D convolutional encoding. The 3D input is encoded by using just the regular 3D convolutional neural network and this convolutional network for each layer of the convolution gives you a feature grid. It's a 3D feature grid encoding your input. The cool thing of these feature grids is that they are aligned with the input, which means you know where you find the corresponding feature to one 3D point in space. For one 3D point in space, you can look up the corresponding features [00:14:00] within these deep feature grids. That's one trick. We can just really look up the features that correspond to your point, which you want to decode in the feature grids. And then we leveraged our knowledge in the machine learning community that the receptive field is growing the more layers you have and early layers will have a low receptive field, which means you have very localized information of your input. And the further you go up the hierarchy, the more global it will be. And we extract features from all the layers which then yields an encoding of your point, which is global, but also local.
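
The lookup described here, reading the feature that corresponds to a 3D point out of several aligned grids, can be sketched roughly as follows. This is a simplified stand-in rather than the IF-Net implementation: real feature grids come out of a 3D CNN and hold many channels per cell, while here each grid holds a single number per cell and is filled by a hypothetical helper `make_grid`. The names `trilinear` and `multi_scale_features` are likewise made up for the example, and grids are assumed to have at least two cells per axis.

```python
def make_grid(n, fn):
    """Build an n^3 grid (nested lists indexed [x][y][z]) by sampling fn on [0, 1]^3."""
    coords = [i / (n - 1) for i in range(n)]
    return [[[fn(x, y, z) for z in coords] for y in coords] for x in coords]


def _cell(c, n):
    """Map a coordinate in [0, 1] to (lower cell index, fractional offset)."""
    t = min(max(c, 0.0), 1.0) * (n - 1)
    i0 = min(int(t), n - 2)
    return i0, t - i0


def trilinear(grid, point):
    """Trilinearly interpolate a cubic grid at a continuous point in [0, 1]^3."""
    n = len(grid)
    (x0, fx), (y0, fy), (z0, fz) = (_cell(c, n) for c in point)
    value = 0.0
    for dx, wx in ((0, 1 - fx), (1, fx)):
        for dy, wy in ((0, 1 - fy), (1, fy)):
            for dz, wz in ((0, 1 - fz), (1, fz)):
                value += wx * wy * wz * grid[x0 + dx][y0 + dy][z0 + dz]
    return value


def multi_scale_features(grids, point):
    """Concatenate the features found at the same 3D location in every grid."""
    return [trilinear(grid, point) for grid in grids]


# Three grids of different resolution standing in for early (fine, local)
# and late (coarse, global) layers of the encoder.
grids = [make_grid(n, lambda x, y, z: x + 2 * y + 3 * z) for n in (8, 4, 2)]
print(multi_scale_features(grids, (0.3, 0.5, 0.7)))
```

Because all grids stay aligned with the input volume, the same query coordinate can be used at every level, and the concatenated vector is what a per-point decoder would consume.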

Kanjun: How are you thinking about follow-up work or building on this? How far can you take it and what are additional problems that you might want to solve with IF-Nets or modifications?

Julian: We've made some follow-up work. A colleague of mine, Bharat, used IF-Nets and made a follow-up paper with a variant of it. And what he did there is he segmented scans of human bodies [00:15:00] into body parts, and this helped to register a parametric human body model to the scans. Why is that useful to have a registered parametric human body model? It's useful because if you have a good registration, you can then start to move your static scan and instead of a static scan, you then have a dynamic scan, which you can change the pose of, for example. And IF-Net was really helpful there in order to predict which body part is present at what location in 3D space. That's one extension of that. That was pretty cool. It was an oral at ECCV, and he really showed that then from these static scans, he could really make crazy poses and that worked quite well.

Kanjun / Josh: One other question on some of the earlier work on IF-Nets, just in general: how long does it take to train, and how long does inference take? How stable was the training? How much data is in the training set?

Julian: I see. So for the human dataset, we used only around 3,000 scans if I remember correctly.

Kanjun / Josh: Not a gigantic dataset.

Julian: [00:16:00] No, actually not, because it's quite expensive to get a lot of high quality human scans. So we used a relatively small one. The ShapeNet dataset, that's quite large, but the objects that you have are pretty rigid, pretty prototypical, and many objects are near duplicates of each other. It's a very different setup actually. So it worked on both. We trained one model each, separately for ShapeNet and for our human dataset. Training time is approximately two days, if you want to have the best possible quality or if you want to have the best number at the end of the day in your paper. But the results are reasonable way before that. You can see that it's converging to something very reasonable where the improvements are minor. That's already after some hours.

Kanjun / Josh: And how many GPUs?

Julian: We trained on a single GPU.

Kanjun / Josh: And then similarly for inference, how long does the prediction for one model take?

Julian: Prediction of one model is not real time. So it's around 30 seconds, in order to get your full prediction, which [00:17:00] means that's meshed then, and that's the final result. And the reason for that is the model is implicit, which means you have to query a lot of points. If you want to have a high resolution and we used the resolution of 256 to the power of three, you have to query a lot of points. And for each of these points, the network has to decide if the point is inside or outside. The inference time depends highly on your output resolution that you want to have. That's really a matter of what you want to have.
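
To put a number on "a lot of points": a grid of 256 to the power of three is 16,777,216 queries, which in practice have to be sent through the network in batches. The sketch below shows only that batching pattern. The names `grid_points`, `evaluate_in_batches` and `toy_predict_batch` are hypothetical, and the toy predictor (a sphere test) stands in for a trained network.

```python
from itertools import islice, product


def grid_points(resolution, low=-1.0, high=1.0):
    """Lazily generate the cell centers of a resolution^3 grid over [low, high]^3."""
    step = (high - low) / resolution
    axis = [low + (i + 0.5) * step for i in range(resolution)]
    return product(axis, repeat=3)


def evaluate_in_batches(predict_batch, resolution, batch_size):
    """Query every grid point, batch_size points at a time; return labels and batch count."""
    points = grid_points(resolution)
    occupancies = []
    n_batches = 0
    while True:
        batch = list(islice(points, batch_size))
        if not batch:
            break
        occupancies.extend(predict_batch(batch))
        n_batches += 1
    return occupancies, n_batches


def toy_predict_batch(batch, radius=0.5):
    """Stand-in for the network: inside/outside of a sphere at the origin."""
    return [1 if x * x + y * y + z * z <= radius * radius else 0 for x, y, z in batch]


print("queries at 256^3:", 256 ** 3)
occupancies, n_batches = evaluate_in_batches(toy_predict_batch, resolution=16, batch_size=1000)
print(len(occupancies), "points in", n_batches, "batches")
```

The cost grows with the cube of the resolution, which is why the inference time depends so strongly on the output resolution you ask for.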

Kanjun / Josh: But that's good to have like a ballpark order of magnitude. I know, I think like for some of the NeRF stuff, it took a surprisingly long time to do some of those things. Like maybe even hours or something. You have to make that little video or whatever. It looks so cool, but it took a lot of compute to actually even just do the inference part, if that makes sense. Do you think there's any other downsides of this particular approach?

Julian: No, actually I would say that's pretty stable. One limitation and that's addressed with the follow-up work is: is it useful to represent objects by occupancy? Is it useful for each point [00:18:00] in 3D space to classify this point as being inside or outside of the shape? There might be some shapes that just do not separate the 3D space only into two regions inside and outside, but separate the space into more regions and have a layering of surfaces. Then it's not really applicable. You might have data where you do not really have ground truth inside and outside. It might be that you have holes in your 3D scans. So it's not really easy to find out when you are inside of an object or outside of an object. Maybe you have a thin wall or you have a very paper thin object. So there, it might be very tricky to define inside and outside. And that's one limitation that IF-Nets shares with all the other approaches that predict occupancy or predict signed distance, because signed distance at the end of the day also has this binary classification into inside and outside.

Kanjun / Josh: That idea, I think is probably what led to your work at NeurIPS, right? The Neural Unsigned Distance Fields for Implicit Function Learning. Do you want to [00:19:00] talk a little bit about that paper and the approach there and some of the results?

Julian: That addresses exactly this problem. We wanted to tackle 3D scenes, which often have holes. Often you do not have a fully complete perfect scan of your 3D scene, but you have holes or you have just a partial scan and it's not really easy to train a model with occupancies where you need inside and outside. That was the setting. What we did with Neural Unsigned Distance Fields, NDF for short, is switch from the SDF representation, which is pretty common, that is, signed distance, to unsigned distance. The sign tells you if you're inside or outside of the object. And if you want to get rid of this inside and outside, because you do not really have it, the first idea of course is to just remove the sign.
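
The difference between the two representations is easy to see on analytic shapes. In this minimal sketch (the shapes and function names are chosen for illustration and do not come from the paper), a closed sphere has both a signed and an unsigned distance, while an open, paper-thin square sheet only has a meaningful unsigned one, because there is no inside to attach a sign to.

```python
import math


def sdf_sphere(p, radius=0.5):
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return math.sqrt(sum(c * c for c in p)) - radius


def udf_sphere(p, radius=0.5):
    """Unsigned distance: the same field with the sign removed."""
    return abs(sdf_sphere(p, radius))


def udf_square_sheet(p, half=0.5):
    """Unsigned distance to a flat square patch in the plane z = 0 (an open surface)."""
    x, y, _ = p
    closest = (min(max(x, -half), half), min(max(y, -half), half), 0.0)
    return math.dist(p, closest)


print(sdf_sphere((0.0, 0.0, 0.0)), udf_sphere((0.0, 0.0, 0.0)))
print(udf_square_sheet((0.0, 0.0, 0.3)), udf_square_sheet((0.0, 0.0, -0.3)))
```

Points above and below the sheet get the same value, so the field describes the surface without ever claiming that one side is "inside".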

Kanjun / Josh: Makes sense.

Julian: And of course, as trivial as this sounds, at the end of the day it brings some challenges if you want to use this representation, [00:20:00] because for SDFs, these signed distances, and for occupancies, you have the marching cubes algorithm. It converts your implicit representation, which tells you inside and outside for each point in 3D space, into something that's more applicable to real world scenarios or use cases: it converts this implicit representation to a mesh. And that's what you cannot really easily do with an unsigned distance field. So in the NDF paper we showed some algorithms that still allow for visualization, for direct rendering into images, or for creating point clouds, very dense point clouds. So a lot of points describing the surface of the object that you're reconstructing. Also, because we can predict these very dense point clouds, we can then use very classical algorithms that connect these dots and create a mesh from there. So we can also mesh at the end of the day.
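
One plausible way to get a dense point cloud out of an unsigned distance field is to start from random points and move each one towards the surface by its predicted distance, against the gradient of the field. The sketch below illustrates that idea under strong simplifications and should not be read as the paper's exact procedure: it uses an analytic sphere field instead of a learned network, a finite-difference gradient instead of backpropagation, and made-up names (`gradient`, `project_to_surface`, `dense_point_cloud`).

```python
import math
import random


def udf_sphere(p, radius=0.5):
    """Toy unsigned distance field: distance to a sphere surface at the origin."""
    return abs(math.sqrt(sum(c * c for c in p)) - radius)


def gradient(f, p, eps=1e-6):
    """Central finite-difference gradient of f at p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def project_to_surface(f, p, steps=5, tol=1e-5):
    """Repeatedly move p by its distance value against the normalized gradient."""
    p = list(p)
    for _ in range(steps):
        d = f(p)
        if d < tol:  # close enough; the gradient is unreliable right at the surface
            break
        g = gradient(f, p)
        norm = math.sqrt(sum(c * c for c in g))
        if norm < 1e-12:  # degenerate gradient, give up on this point
            break
        p = [c - d * gc / norm for c, gc in zip(p, g)]
    return p


def dense_point_cloud(f, n, bound=1.0, seed=0):
    """Project n random points from the bounding box onto the surface of f."""
    rng = random.Random(seed)
    starts = [[rng.uniform(-bound, bound) for _ in range(3)] for _ in range(n)]
    return [project_to_surface(f, p) for p in starts]


cloud = dense_point_cloud(udf_sphere, n=1000)
print(max(udf_sphere(q) for q in cloud))
```

A cloud produced this way can then be handed to a classical surface reconstruction method to obtain a mesh, which is the extra step mentioned above.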

Kanjun / Josh: For anyone who hasn't seen it, it's definitely worth checking it out. The audio description of how cool this paper is doesn't quite do it justice. There's some really great images in [00:21:00] there that kind of show how well this works with things that are a lot more realistic. There's like a drawing of a bus where there's like some seats, it's like a cross section of a bus. And it's really hard to tell what's inside or outside of any of these given objects. And it just does a great job of kind of mapping out the contours that feel sort of intuitively correct. Or there was another really good example. I think you guys had one with a car that had a bunch of internal structure, like the internal seats and the windows and the outside of the car. And you can see the surfaces for this entire thing in a way. And it's like almost impossible to do with these other algorithms that are just looking at the outside of it. It seems much more robust to do things this way. I was very excited about it.

Julian: Thanks. Yeah, thanks. Appreciate it.

Kanjun / Josh: Do you also find it to be like actually a pretty useful, robust representation and like a pretty good extension of these IF-Nets?

Julian: Yes, I would say so, especially if you have these scenarios where you do not really have inside and outside. So there it's really applicable. But of course, if you have inside and outside and you have this ground truth data, then you can just go with the classical style of Occupancy Networks and then you can rely on marching cubes, which is well-developed and fast. [00:22:00] And that's also a very valid approach, right? It depends on your scenario.

Kanjun / Josh: I think the exciting thing for me in a lot of real-world scenarios, it's really hard to get the inside and outside. So like Kanjun's example earlier about trees or leaves, yeah, sure. There isn't an inside to a leaf. You really don't want to be representing that.

Julian: Exactly.

Kanjun / Josh: Or cloth, right? Or a scan of a room? Yes. There's technically an inside and outside, but you're not going to walk to the other side of every single wall to have to get that outside of the entire thing. It's so obnoxious. So it's really nice to be able to not worry about that in a practical setting. I think in a lot of cases.

If I were a practitioner and I was deciding between these unsigned fields versus a signed field, when would you recommend I use one or the other?

Julian: As soon as you have ground truth for occupancies, I would go for that because you can really use these classical marching cubes, which does a great job in [00:23:00] creating surfaces. Because most application scenarios at the end are interested in having a mesh as an output, because it's very lightweight. You can also get this with NDFs but you have to do a step in between. So first you predict a very dense point cloud, and then you mesh, so that takes more time. Therefore, if you can use occupancies and you have good data and no holes and high quality data and everything is perfect, then of course go with an easy approach. But in a more real case scenario, the NDF approach would be the thing to go with.

Kanjun / Josh: So if I don't care about meshes at all, and I maybe only care about representation of objects, the NDF approach is a good one to go with.

Julian: Yeah. Then in any case, so if you're interested in images, so if you want to render or if you want to have a point cloud, that's perfect. Yes. Or again, if you want to have meshes, but you do not have ground truth with occupancies, then it's still the way to go.

Kanjun / Josh: One other thing that I thought was pretty cool about the paper was the connection to regression, the experiments at the end with just random 2D point data and lines and things like that. I thought that was a really [00:24:00] interesting idea. Do you think that's practical at all? Maybe it's also probably worth describing like what exactly some of those setups were and how that works. But then also the question: is this practical or useful in any setting or is it just like a neat trick?

Julian: So yeah, I've actually also been very excited about this. But reviewers didn't like it so much. They didn't appreciate this idea of taking something that comes originally from computer graphics, which is ray tracing, and then using ray tracing for regression. It's just an interesting connection, right? As to the applicability, we didn't do very thorough experiments on all the kinds of regression or functions that you could regress. Representing functions as unsigned distance fields is actually really very interesting, and doing this regression with this ray tracing idea, I think it's really worth looking into and maybe following up on. Because I think with this formulation, the neural network can really represent any function or also other 2D surfaces. Functions are not so complex. You could have a binary [00:25:00] classification. Everything above the function is inside. Everything below is outside, you could do that. But there might be some curves that do not really define an inside and outside, like a spiral, for example. If you apply neural networks to learn to connect these surfaces using this implicit approach, that's very exciting.
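
The connection between ray tracing and regression can be illustrated with a toy setup that is not the paper's experiment. Assume a curve in the plane is represented by its unsigned distance field; to "predict" y for a given x, shoot a vertical ray at that x and step along it by the distance value until the field is close to zero. The straight line used as the curve and the names `udf_line` and `regress_by_ray_marching` are assumptions for the sketch.

```python
import math


def udf_line(point, slope=2.0, intercept=1.0):
    """Unsigned distance from a 2D point to the line y = slope * x + intercept."""
    x, y = point
    return abs(slope * x - y + intercept) / math.sqrt(slope * slope + 1.0)


def regress_by_ray_marching(udf, x, y_min=-10.0, y_max=10.0, tol=1e-7, max_steps=200):
    """March upward along the vertical ray at x; return the first y where the field vanishes."""
    y = y_min
    for _ in range(max_steps):
        d = udf((x, y))
        if d < tol:
            return y
        y += d  # safe step: the curve cannot be closer than d in any direction
        if y > y_max:
            return None  # the ray left the region without hitting the curve
    return None


print(regress_by_ray_marching(udf_line, 1.5))
```

For a multi-valued curve such as a spiral, one could keep marching after the first hit to collect further intersections, which is something a plain function regressor cannot express.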

Kanjun / Josh: Are you currently doing any follow-up work?

Julian: Everybody can follow up on these ideas.

Kanjun / Josh: Switching topics a little bit. If you're done with the papers. One thing I'm really curious about is in this field, whose work has impacted you the most?

Julian: I would say it's these three papers that I mentioned: Occupancy Networks, with Lars Mescheder as first author, DeepSDF and IM-NET, and also PIFu of course. Not to forget about that one. They made it popular, because it's not really true that these are the first papers. There's a paper which is from one year before them. And it's never cited. But it actually does everything that we in this implicit realm are doing. It's called "Deep Volumetric Video from Very Sparse Multi-view Performance [00:26:00] Capture." It's a paper from Habermann from our research group. They actually already do all these things. They predict in a pointwise manner inside and outside, and then they do marching cubes to get a mesh and they predict this from images. So the idea was there one year beforehand, but it got popular only with these three papers, I would say. I guess none of these papers actually cite that paper. So at least I would say it's underappreciated.

Kanjun / Josh: When you first digested these papers and felt like they were very impactful, what were the key salient points that you got out of it? How did it change your view?

Julian: I was doing my master's thesis, so it was for me really the first time getting into this 3D vision research. It was not so easy to understand the impact this really has. And I'm not sure if other people were so certain of the impact of these ideas. Now, one, one and a half, or two years after these papers, I would say they had a really [00:27:00] significant impact on representing 3D scenes. I'm not sure if this was clear at that point already. There are many follow-up papers that use these ideas for other stuff and show that they are applicable. Also of course our paper, IF-Net, and many others; they really made a success out of this idea. And NeRF, right? That was this big paper everyone's talking about. And the next conferences will probably mainly be NeRF conferences. It also builds on this idea of predicting pointwise. So it was really impactful.

Kanjun / Josh: How this group of papers impacted - it sounds just generally the methodology of predicting pointwise is really what told you like oh this is how someone can take it further?

Julian: Yes, that was a breakthrough in representation of 3D scenes, because representing 3D scenes is not so easy as representing a 2D image. For 2D images, we know how to represent them. We use grids and for each point in the grid, we have the RGB [00:28:00] value and that's it. We all agree that this is a good representation for an image, or at least maybe not all of us, but it's a very common representation for an image. But that's a bit different for representing 3D scenes. Especially for learning, because it's not so easy to learn with a mesh. A mesh is a very dominant representation of 3D surfaces, but it's not so easy to apply a neural network to it and things like that. That was why these implicit papers had such an impact, because the right representation wasn't so clear, or it's maybe still not so clear. Then there is the volumetric approach, where you extend a 2D image to 3D by just stacking a lot of images, one after another. Then you have a grid as a 3D volume, a voxel grid. The major drawback is that this approach is really memory intensive. You have to store a lot of values, and that's not so easy to do on our GPUs. Maybe sometime in the future we will go back to voxels, but currently, although our hardware resources are increasing, [00:29:00] that's not that easy. That's why there is such success of this idea.
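
A quick back-of-the-envelope calculation shows why dense voxel grids get expensive. The helper below is purely illustrative (the name `voxel_grid_bytes` and the default of one 32-bit float per voxel are assumptions); it only multiplies out resolution cubed times channels times bytes per value.

```python
def voxel_grid_bytes(resolution, channels=1, bytes_per_value=4):
    """Memory needed to store a dense resolution^3 grid with the given channels."""
    return resolution ** 3 * channels * bytes_per_value


for res in (128, 256, 512, 1024):
    mib = voxel_grid_bytes(res) / 2 ** 20
    print(f"{res}^3 grid, 1 float32 channel: {mib:.0f} MiB")
```

Doubling the resolution multiplies the footprint by eight, and a network that keeps many feature channels per voxel, plus activations for backpropagation, multiplies it again.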

Kanjun / Josh: It seems like voxels - I guess my opinion maybe is like the discrete representations are definitely non-ideal and this is an issue with 2D as well.

Julian: Probably not the hierarchical way. In our brain, we are focusing on where more details are.

Kanjun / Josh: To your point there, I think there is something really interesting about how we decide what to pay attention to, and then how we allocate detail in the area that we choose to pay attention to. It's just a little bit separate from the representation that we use. Some of that implicit prior has been captured in the work that you did, right, in the IF-Nets and even in the unsigned distance fields. I noticed you're forcing it to pay attention to the region that's very close to the surface and spend most of its resources there. You don't want to be encoding, "Oh yeah, it's still air. Yep, still lots of empty air."

Julian: Exactly.

Kanjun / Josh: One thing I was curious about in this field in general is what are some of the most important open questions that you feel like, as a field, it's important to solve?

Julian: It's not really real time. At inference time, we still [00:30:00] discretize, so we are still creating a voxel grid. At the end of the day, we just defer it: we do not do it in the learning phase, and we are not storing this voxel grid on the GPU, but in the end we predict a voxel grid point by point in order to apply marching cubes, for example. That goes for this implicit function work. In NeRF, of course, it's done differently, but there too we have to query many points in order to render an image. One crucial question going forward will be how we can do this more efficiently. Coming back to our example, if we are in free space, we do not need so many samples there, right? We do not want to sample many points in the 3D space of air where nothing is. We have to find something more clever: neglecting free space, for example, and paying more attention to surfaces.
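The deferred discretization described here can be sketched in a few lines. This is an illustration under assumptions, not the IF-Net code: `sphere_occupancy` is a hypothetical stand-in for a trained network, and the marching cubes step that would consume the grid is left out.

```python
def sphere_occupancy(x: float, y: float, z: float, radius: float = 0.5) -> float:
    """Stand-in for a learned implicit function: 1.0 inside a sphere, 0.0 outside."""
    return 1.0 if x * x + y * y + z * z <= radius * radius else 0.0


def query_dense_grid(implicit_fn, resolution: int, lo: float = -1.0, hi: float = 1.0):
    """Evaluate implicit_fn point by point at every vertex of a resolution^3 grid."""
    step = (hi - lo) / (resolution - 1)
    coords = [lo + i * step for i in range(resolution)]
    return [[[implicit_fn(x, y, z) for z in coords] for y in coords] for x in coords]


resolution = 16
grid = query_dense_grid(sphere_occupancy, resolution)
num_queries = resolution ** 3
occupied = sum(value for plane in grid for row in plane for value in row)
print(f"{num_queries} queries, {int(occupied)} of them inside the shape")
```

Every one of the `resolution ** 3` queries costs a network evaluation in the real setting, and in this toy scene most of them land in empty space, which is the inefficiency Julian points to.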

Kanjun / Josh: Pay more attention to surfaces and even more attention to areas where there's a lot of texture or something like that.

Julian: Exactly. It might not be always as easy as that. If you want to represent something that [00:31:00] has no surface or no clear surface, something like fog or something volumetric, then you need also samples in free space or then it becomes ambiguous. What is free space and what not. But maybe sampling more points where you have a higher density of objects.
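One common way to spend fewer queries on free space is a coarse-to-fine pass: evaluate a coarse grid first and refine only the cells whose corners disagree. The sketch below is a swapped-in illustration of that general idea, not a method named in the conversation, and `sphere_occupancy` is again a hypothetical stand-in for a learned function.

```python
def sphere_occupancy(x: float, y: float, z: float, radius: float = 0.5) -> float:
    """Stand-in for a learned implicit function: 1.0 inside a sphere, 0.0 outside."""
    return 1.0 if x * x + y * y + z * z <= radius * radius else 0.0


def cells_needing_refinement(implicit_fn, resolution: int, lo: float = -1.0, hi: float = 1.0):
    """Return (mixed_cells, total_cells) for a coarse grid of resolution^3 cells.

    A cell is "mixed" when its eight corners do not all share the same
    occupancy, meaning the surface passes through it.
    """
    step = (hi - lo) / resolution
    coords = [lo + i * step for i in range(resolution + 1)]
    # Evaluate each grid vertex exactly once.
    occ = {
        (i, j, k): implicit_fn(x, y, z)
        for i, x in enumerate(coords)
        for j, y in enumerate(coords)
        for k, z in enumerate(coords)
    }
    mixed = 0
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                corners = {
                    occ[(i + di, j + dj, k + dk)]
                    for di in (0, 1)
                    for dj in (0, 1)
                    for dk in (0, 1)
                }
                if len(corners) > 1:
                    mixed += 1
    return mixed, resolution ** 3


mixed, total = cells_needing_refinement(sphere_occupancy, 16)
print(f"{mixed} of {total} coarse cells contain the surface and would be refined")
```

As Julian's caveat suggests, this only works when there is a clear surface: a structure thinner than one coarse cell can be missed entirely, and volumetric content such as fog has no corner disagreement to detect.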

Kanjun / Josh: Or sampling more points where it's more important for some downstream tasks, right? Like using attention to allocate towards, do you want to pay attention to the raw shape of this object or just this like little particular piece over here or that sort of stuff as well?

Julian: Exactly. There's quite a bit of work to be done there. Of course it would be cool to have the upsides of predicting pointwise in real-time applications as well, so that we can really use these things instead of just having research results. To go back to the previous question, there are of course more things that can be done. For example, making these implicit representations more interpretable or more interactive, and people are starting to work on this [00:32:00] already. If we think of NeRF, which is learning some continuous representation based on your images, it would be cool if this scene representation were more interpretable, more interactive. To be able to relight a scene would be very cool. There are papers coming out in this direction, but changing the pose of objects or changing the pose of humans is maybe interesting as well. Not only capturing the real world, but also being able to manipulate it. Maybe capturing the scene of interest at different timestamps: an outdoor scene that is rainy one time, for example, and sunny another time. And then maybe being able to interpolate between the settings of your scene and interactively manipulate the scene. I think there would be so many cool opportunities, with so many applications for gaming. For example, maybe creators can capture more of their scenes instead of recreating them artificially, [00:33:00] things like that.

Kanjun / Josh: What is the current state of scene representation and being able to modify some of these parameters?

Julian: So there are follow-up papers to the NeRF representation allowing for relighting and other papers looking into dynamic scenes. So there is work on that. But we can still explore this direction much more in order to really elaborate stuff.

Kanjun / Josh: What are you interested in here?

Julian: Actually this relighting. Before I saw the paper showing that it's already being done, that was something I also wanted to look into. But using these representations for larger scenes, for example, is also interesting. And for large scenes, making them as dynamic and as interactive as possible, and also doing augmentation, putting different objects into the scene.

Kanjun / Josh: Thinking about how to make a larger scale, practical version, where you can swap out the objects and change the location and change the general parameters of the scene and things are like able to deform and [00:34:00] it's like a really robust, good, large scale kind of scene representation would be really interesting.

Julian: Something like we have for human body models, which are of course hand engineered, but maybe discovering some latent parameters of scenes, and then manipulating those parameters to get different styles of the scene or objects.

Kanjun / Josh: That'd be so fun as a designer. Yeah, that'd be super fun to play around with. Blurring the line between the real world and digital objects.

Julian: Exactly. Maybe effectively helping creators to focus less on redundant work and put more attention on the creative aspects of their work, where they can easily get a sunny scene or a rainy scene. That stuff is just handled by some cool mechanism in the background. That would be something very cool.

Kanjun / Josh: There was an Adobe paper demo, where they're doing retexturing rewriting of nature scenes. That one specifically. Separating the appearance of the scene and the structure in the central 2D images. So they were able to very quickly swap [00:35:00] out like rainy, sunny, or take the mountains over here and put them over here. I'm like, it's controlling those things separately, like with a brush kind of being able to select the thing and paint it somewhere else. It's pretty interesting. It is really cool. It'd be really cool to be able to get that ingredient. Exactly. Yeah. That'd be great to see in 3D.

Julian: So I guess that's not the only thing that would be nice to transport from 2D to 3D. There are many cool ideas in 2D that we would like to have in 3D. This 3D field has a lot of potential to grow, and there are many applications that really use and really need these 3D representations. I'm exploring more and more how to get information from the real world into the 3D representation, and how to use this real-world information to design and alter 3D representations and apply them to different use cases.

Kanjun / Josh: What are some controversial opinions you have about research or fields that you feel like other people don't necessarily agree with?

Julian: One thing that can come up in rebuttals, for example when you have to defend your research, or people around [00:36:00] me defend theirs, is that the approach is simple. Sometimes that's actually meant as criticism: the idea is simple, it's too simple or something. That criticism might not be so good, because a simple approach is not a bad thing. It's a good thing to have something simple, as long as it does something reasonable, as long as it really helps. We do not really want to have very elaborate architectures with a lot of parameters doing something. It's a black box. We don't really know. It's better to have something simple, to make simple ideas work, and to do things in smaller steps but with real understanding.

Kanjun / Josh: I couldn't really agree with you more. I've moved towards this opinion over time: no need to add any extra complexity. We barely even understand how or why neural networks work. I think the idea of taking the sign off of a signed distance field and making an unsigned distance field, I think that's great. That's brilliant. Why didn't people who are working with signed distance fields think about that?

Julian: I don't want to claim that we are the first using unsigned distance. This stuff [00:37:00] has been around.
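The difference between the two fields is easy to see on toy shapes. The sketch below uses analytic distances rather than a learned network, and the shapes and function names are assumptions chosen for illustration: a closed sphere has a well-defined sign, while an open square patch has no inside at all, yet its unsigned distance is still well defined.

```python
import math


def sdf_sphere(x: float, y: float, z: float, radius: float = 1.0) -> float:
    """Signed distance to a closed sphere: negative inside, positive outside."""
    return math.sqrt(x * x + y * y + z * z) - radius


def udf_sphere(x: float, y: float, z: float, radius: float = 1.0) -> float:
    """Unsigned distance to the same sphere: the sign is simply dropped."""
    return abs(sdf_sphere(x, y, z, radius))


def udf_square_patch(x: float, y: float, z: float, half_size: float = 1.0) -> float:
    """Unsigned distance to an open square patch in the plane z = 0.

    The patch is a surface with a boundary, so "inside" and "outside"
    are undefined and no signed distance exists for it.
    """
    dx = max(abs(x) - half_size, 0.0)
    dy = max(abs(y) - half_size, 0.0)
    return math.sqrt(dx * dx + dy * dy + z * z)


print(sdf_sphere(0.0, 0.0, 0.0), udf_sphere(0.0, 0.0, 0.0))
print(udf_square_patch(0.0, 0.0, 0.5), udf_square_patch(0.0, 0.0, -0.5))
```

Points above and below the patch get the same value, which is why an unsigned field can describe open surfaces such as a sheet or a wall seen from one side.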

Kanjun / Josh: Making it actually work and showing how well it works and where it works and getting it to work for rendering and for meshing and all these other things. I think that's very valuable. It's a simple idea, but there's a lot of work that goes into making it really actually work. In product development, we have a saying that there's this curve of how much effort you put in. When you have put in very little effort, then your work is basic. And when you have put a medium amount of effort, your work is complex. And then when you have put in an enormous amount of effort, your work is simple. Simple is at the end, and it's not at the beginning. It's true in a lot of design disciplines, and research is a design discipline too. It's a creative discipline.

Julian: And actually this goes not only for the ideas themselves, but also for the presentation, I would say. That's not so much the case in the English language, but for writing in German, people often disagree. I really had to relearn how to write a bit in English because it's very different. In German, if you state a sentence, it already should reflect almost [00:38:00] all possible circumstances. It should reflect the full truth. And of course that makes the sentence very long. You have to use a lot of commas and it's very hard. You need really complex sentences in order to nail down what you want to say. That's very different from writing in English, and I really enjoy that. I just had this discussion with a friend of mine who's writing his master thesis. People in Germany sometimes really tend to have these very long sentences and write texts that are really hard to understand, although of course they are still very much to the point. They're very exact and very expressive. These are definitely upsides, but for the reader, at the end of the day, it's really just hard to parse. I really have a hard time understanding what people are trying to tell you. So I really like this idea that simpler is better. It also goes for me and my writing. Even if I write German now, I write very short, clear sentences. Whether people like it or not, that's the way it is.

Kanjun / Josh: Yeah, I kind of have this theory that constructing good explanations of [00:39:00] things is almost an entirely separate discipline from continuing to make forward research progress and doing novel work. The field needs both: to push the novel work forward and also to construct a ladder for other people to understand what the novel work is. Otherwise you end up with too much understanding debt. It's very difficult for people to really know what's going on, and at some point it becomes very hard to make progress. What other controversial research opinions do you have?

Julian: Another thing is that I really tend to like small steps in research better than having highly engineered, really big architectures that solve one task very well, where it's not really interpretable where the performance is coming from. I think that's why NeRF is so popular, actually. It's because it has a very simple representation. It uses things that are absolutely clear to readers, for example a fully connected network. It's very clear to readers what that is. So they really just use this representation and show that it is powerful. They also include volumetric rendering, which is [00:40:00] also known. So they take key things that are known, but do something very important and very helpful with them. They have something where it is very clear why it's working and why it's not working. In contrast, sometimes you have papers solving some task, but they have four different networks in there that do not even have real names. They are just doing something, nothing interpretable, and yeah, I really tend to have a hard time parsing papers like that.
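The volumetric rendering step Julian mentions is the standard front-to-back compositing rule: each sample along a ray contributes its color weighted by its own opacity and by how much light survives the samples in front of it. The sketch below shows that rule for a single ray; the densities and colors are made-up inputs standing in for the outputs of a fully connected network, and the function name is hypothetical.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered front to back.

    densities: volume density sigma_i at each sample
    colors:    RGB triple at each sample
    deltas:    distance between consecutive samples
    Returns the accumulated RGB color and the remaining transmittance.
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution of this sample
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        transmittance *= 1.0 - alpha             # light left for samples behind
    return rgb, transmittance


# Empty space, then an effectively opaque green sample, then a blue sample behind it.
rgb, remaining = composite_ray(
    densities=[0.0, 1e9, 5.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
print(rgb, remaining)
```

In this toy ray the empty sample contributes nothing and the blue sample is fully hidden behind the opaque green one, so rendering quality depends on placing samples where the density actually is.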

Kanjun / Josh: Josh is really on the side of small modular networks. No giant big architectures. I think it's also interesting to see what happens when you take all of these small things and put them together in some other larger way. Before you do that, you want to understand what each of these pieces is for and why we are really doing this. Otherwise, in the resulting work it can be really hard to interpret what the contributions of the different pieces are. Whereas if you're just taking one fully connected network and getting something out of it, okay, sure, we can see what's contributing and why. I think my favorites, though, are the ones that are simple approaches and get great performance, like the IF-Nets for example, right? Oh, it's a pretty [00:41:00] simple approach. Works really well. I think it's always interesting to see when you can take something so simple and make it work so well.

Julian: But of course that's not always easy to get there. And I guess also it depends highly on where the research of the field is. At some points you just have so many papers already solving problem A and then the next paper of course, has to beat a very high baseline already. So it's much, much harder to make big steps there. It's very valid to have larger capacity models or more engineered models. I fully agree with what you said. It's good to start small, right? Try to have a good understanding of what you're doing or what your model is doing and then start to extend the complexity, maybe not start with a very complex model at the beginning.

Kanjun / Josh: What do you think is something you've learned or something you've observed about what makes a great researcher great that would be unusual for other people to say?

Julian: Being patient with other people's ideas, being patient listening to other ideas. And [00:42:00] first trying to understand theirs before pushing your own, or before fighting for yours. And also having patience to develop your own idea, taking your time to do the things that you want to do.

Kanjun / Josh: How did this come to you? Were there some experiences where you learned or you were not as patient before?

Julian: Yes, starting with my master thesis, I was quite anxious to get some results. But then the course of the work changed. It wasn't as predictable as I had hoped, where we could just engineer the architecture and everything works out at the end. It's not really like that. I had to adapt and change the setting. This wasn't so easy for me. But being patient helped. And of course also having people who believe in you, telling you that what you're doing is still worthwhile. After coming up with IF-Nets while having had a different goal, it wasn't so clear to me at that point, after working so much, that this was still something useful. You have to be reminded that if you change the setting, this is really useful for other people or for a different setting. [00:43:00] So that was also something I had to learn.

Kanjun / Josh: There are these like troughs of sorrow. Oh, I'm so excited about this. Oh, it doesn't work. And like having someone to pull you out of the trough of sorrow is very valuable.

Julian: Yeah, maybe focusing on the positive side. That's also something that can make great people greater. If something is not working, many people tend to try to improve what is not working, and they forget that other stuff is working really well. Instead of pushing what's working, they try to engineer what's not working. Maybe sometimes shifting the focus is also very helpful.

Kanjun / Josh: That resonates with me. It's about being open to finding these new things. I think some of the most interesting discoveries also come from, oh, we were trying to do this thing. It didn't really work, or it kind of worked. But we noticed that this other thing over here was really interesting, and so we just went off in this other direction.

Julian: That's also what research somehow should be, right? In research, if you already know beforehand that everything is working, then it's not really research.

Kanjun / Josh: It's a discovery process.

Julian: You have to be open for this discovery process. Which is not always as easy as [00:44:00] it sounds.

Kanjun / Josh: Totally. Yeah. What are some tips or tricks or hacks that you've learned?

Julian: When you start a project, it's always good to start with the smallest possible architecture or the smallest possible portion of your idea. Even though you have a much higher goal and you want to achieve A, B, C, and D, you start just by achieving A and trying to figure out what needs to be done to do this, instead of trying to hammer the full problem at once. If you do it like that, it's also much easier to test your assumptions and to test your code. Many ideas fail or do not really work out that well because maybe there's some implementation error or some error in your formulation, and you really need a small formulation or a small project in order to track down such issues. Then you develop your problem further based on what you've learned in the beginning. So simplifying your own idea at the beginning of the project is something very helpful.

Kanjun / Josh: [00:45:00] Yeah. That also resonates with me. Having something nice and simple to start with is a great place to start. Did you apply this in either of your papers and how did that work?

Julian: This strategy, I'm applying it in each paper. Always. I always try to start with the smallest possible portion of the idea or with the easiest setup. Even if I want to tackle a very complicated setup, I would first try it on the easiest setup and check what difficulties might be there that I need to solve.

Kanjun / Josh: I was just going to say for IF-Nets, for example, like if you had to get just very concrete, like what was the easiest setup?

Julian: The easiest setup there was comparing against Occupancy Networks results, for example, which cannot really represent humans. These humans had missing arms, for example. The input that we had was maybe a coarse human. What if we just interpolate the information that we have there, with a very, very tiny network or no neural network at all? If [00:46:00] we interpolate this input information, shouldn't that already give a much more accurate pose of the person? In the input there is some arm, for example, that Occupancy Networks is missing. If we do some simple interpolation, shouldn't we get something more reasonable already, just by paying more attention to the data and relying more on local features, without even using a very large learning machinery? So that was the place to start. And then we wanted to extend this. We wanted to complete shapes, so we also had to use more global features. And that is when we really used the 3D CNN.
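The "simple interpolation" starting point can be illustrated with plain trilinear interpolation: read a value at a continuous query point from the eight surrounding grid cells. This is a minimal sketch under assumptions, with one scalar per grid vertex instead of learned feature vectors and a hypothetical function name, not the IF-Net implementation.

```python
def trilinear_interpolate(grid, x: float, y: float, z: float) -> float:
    """Interpolate grid[i][j][k] at a continuous point given in index coordinates.

    The grid needs at least two vertices along each axis. Query points
    outside the grid are clamped to its boundary.
    """
    sizes = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for value, size in zip((x, y, z), sizes):
        value = min(max(value, 0.0), size - 1.0)   # clamp into the grid
        lower = min(int(value), size - 2)          # keep the 2x2x2 cell in range
        base.append(lower)
        frac.append(value - lower)

    result = 0.0
    for dx in (0, 1):
        wx = frac[0] if dx else 1.0 - frac[0]
        for dy in (0, 1):
            wy = frac[1] if dy else 1.0 - frac[1]
            for dz in (0, 1):
                wz = frac[2] if dz else 1.0 - frac[2]
                result += wx * wy * wz * grid[base[0] + dx][base[1] + dy][base[2] + dz]
    return result


# A 2x2x2 grid holding the linear function f = i + 2j + 3k at its vertices.
toy_grid = [[[i + 2 * j + 3 * k for k in range(2)] for j in range(2)] for i in range(2)]
print(trilinear_interpolate(toy_grid, 0.5, 0.25, 0.75))
```

Because the value at a query point depends only on its immediate neighbors in the input grid, local detail such as an arm that is present in the input is carried through to the output rather than being averaged away by a single global code.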

Kanjun / Josh: That's a very simple starting point.

Are there any ideas that you thought should work that did not work as you expected?

Julian: Yes. One further goal was to go from this unsigned distance field to a mesh directly. That is, having a variant of marching cubes, which can do this for occupancies, that works on NDFs. So you have the unsigned distance field and you still want to mesh it directly using some variant of marching cubes. [00:47:00] We worked on that and we found some reasonable results, but it was still creating holes here and there, and it wasn't really working out all that well. We didn't put it in the paper. So I guess this was something I imagined to be easier after having quite a good prediction. But hand engineering should not be underestimated. Having a good hand-engineered algorithm that really nails the meshing is not to be underestimated.
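One reason meshing an unsigned field directly is harder than it looks: marching cubes places vertices where the field crosses the iso-level along a cell edge, and it detects those crossings by comparing the values at the two edge endpoints. The sketch below shows that edge test for iso-level zero on a toy plane; it is an illustration of the general difficulty under assumed names, not the variant Julian's group experimented with.

```python
def edge_crossing(value_a: float, value_b: float):
    """Return t in [0, 1] where the linearly interpolated field hits zero.

    Returns None when both endpoint values lie strictly on the same side
    of zero, which is how a marching-cubes style test decides that an
    edge does not cross the surface.
    """
    if value_a == 0.0:
        return 0.0
    if value_b == 0.0:
        return 1.0
    if (value_a < 0.0) == (value_b < 0.0):
        return None
    return value_a / (value_a - value_b)


def sdf_plane(z: float) -> float:
    """Signed distance to the plane z = 0: negative below, positive above."""
    return z


def udf_plane(z: float) -> float:
    """Unsigned distance to the same plane."""
    return abs(z)


# A cell edge running from z = -0.25 to z = 0.75 crosses the plane a quarter of the way along.
z_start, z_end = -0.25, 0.75
print(edge_crossing(sdf_plane(z_start), sdf_plane(z_end)))  # crossing found
print(edge_crossing(udf_plane(z_start), udf_plane(z_end)))  # no sign change to detect
```

With the unsigned field both endpoint values are positive, so the crossing is invisible to this test, and any workaround has to infer it from other cues such as gradients or small thresholds.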

Kanjun / Josh: One thing we generally ask is, are there any things that you would like other people to reach out to you about looking for collaborators or specific ideas or projects?

Julian: Yeah. So the things that we talked about, these applications of 3D scene representations, their use cases, and their further development, that's what I'm really interested in. Especially also incorporating some geometric or physical principles within those ideas, that's something I really like. People interested in similar things are always welcome to [00:48:00] contact me, and I'm very happy to find collaborators to work on cool stuff.

Kanjun / Josh: Awesome. Great. I really enjoyed our conversation today. I think there was a lot of really cool stuff in there and the papers are great. I would encourage anyone listening to go check them out because our descriptions don't necessarily do the pretty pictures justice. You have code online for all of them, I think as well.

Julian: Yes. Yes.

Kanjun / Josh: Which is also great. I always really appreciate that. So if people want to go play around with them, it's all there. So I really enjoyed this then. Hopefully people listening did too. Thanks for chatting with us. Thanks a ton, Julian. This was great.

Julian: Thanks, I enjoyed it too.
