
The (in)accurate Gradients of Neural Representations

Students: Gabriele Dominici, Daniel Perazzo, Munshi Sanowar Raihan, João Teixeira

TA: Nursena Koprucu

Mentors: Peter Yichen Chen, Rundi Wu, Honglin Chen, Ishit Mehta, Eitan Grinspun

1. Introduction

Implicit neural representations promise infinite resolution, automatic gradients, and memory efficiency. In practice, however, these promises often do not hold. Our project explored one specific drawback of implicit neural representations: noisy gradients. The source code of this project is available on our GitHub repository.

The noisy gradient problem of neural representation has been observed in the context of solving PDEs [1], geometry processing [2,3], topology optimization [4], and 3D reconstruction [5].

1.1. Motivational Example: 1D advection

Our goal is to solve time-dependent PDEs on neural network-based spatial representations. Let’s consider the classic 1D advection equation:

$$\frac{\partial u}{\partial t} + a\,\frac{\partial u}{\partial x} = 0$$

This equation describes the passive advection of some scalar field u carried along at a constant speed a.
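For a constant speed a, the advection equation u_t + a u_x = 0 has the closed-form solution u(x, t) = u0(x - a t): the initial profile is simply shifted. The sketch below illustrates this, and shows that the spatial gradient is transported in the same way. The Gaussian initial condition and the values of `a` and `t` are assumptions chosen for illustration only; they are not the settings used in our experiments.

```python
import math

def u0(x):
    """Hypothetical initial condition: a Gaussian bump centered at x = 0."""
    return math.exp(-x * x)

def du0(x):
    """Analytic spatial derivative of u0."""
    return -2.0 * x * math.exp(-x * x)

def advect(field, a, t):
    """Exact solution of u_t + a u_x = 0: shift the profile by a * t."""
    return lambda x: field(x - a * t)

a, t = 0.5, 2.0           # assumed speed and elapsed time
u_t = advect(u0, a, t)    # advected scalar field
g_t = advect(du0, a, t)   # its gradient, advected the same way

# The bump peak has moved from x = 0 to x = a * t = 1.
print(u_t(1.0), g_t(1.0))
```

A ground-truth pair like `u_t` and `g_t` is the kind of reference a learned field and its gradient can be compared against.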

Fig 1: Advected scalar field (left), and gradient of the advected quantity (right) over time.

We will parameterize each time-discretized spatial field with a neural network. The field quantity at an arbitrary location can be queried via network inference f(x). The weights of this network are updated at each time step with optimization-based time integration [1]. 

1.2. The Problem: Noisy Gradients

Fig 2: Comparisons of advected values and gradients of different neural representations. Ground truth (top row), the sine activation (middle row), and the Gaussian activation (bottom row).

Following the INSR literature, we explore different activation functions for our neural network-based scalar field. The figure above compares the predicted scalar quantity and its gradient for sine [8] and Gaussian [12] activation networks. The predicted scalar quantity matches the ground truth well in both cases, but the gradient of the advected quantity is extremely noisy regardless of the choice of activation function.

In the subsequent sections, we will tackle this problem using several approaches that fall into two categories: pure neural representation (Section 2) and hybrid grid-neural representations (Section 3).

2. Pure Neural Representations

2.1. Tuning Omega

Fig 3: Large omega values cause noisy gradients, while small omega values give smoother gradients.

The omega hyperparameter controls the frequency range that the SIREN network learns [8]. In general, if we reduce the value of omega, the gradients learned by the networks become less noisy. As seen in the figure above, choosing omega = 5 ensures the network has smooth gradients, while choosing a larger value like 50 gives us very noisy gradients. However, finding the optimal omega value is a non-trivial task. Users need to tune this parameter for each problem.
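One way to see why omega matters for gradients is to look at a single sine unit. The sketch below assumes a SIREN-style unit of the form sin(omega * (w * x + b)); by the chain rule its derivative is scaled by omega, so any wiggle in the fitted function is amplified by the same factor in the gradient. The function names and the sampled interval are illustrative assumptions, not code from our repository.

```python
import math

def sine_unit(x, w, b, omega):
    """One SIREN-style unit: sin(omega * (w * x + b))."""
    return math.sin(omega * (w * x + b))

def sine_unit_grad(x, w, b, omega):
    """Analytic d/dx of the unit: omega * w * cos(omega * (w * x + b))."""
    return omega * w * math.cos(omega * (w * x + b))

def max_abs_grad(omega, w=1.0, b=0.0, n=2001):
    """Largest gradient magnitude over n samples in [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return max(abs(sine_unit_grad(x, w, b, omega)) for x in xs)

# Compare the two omega values discussed in the text.
for omega in (5.0, 50.0):
    print(omega, max_abs_grad(omega))
```

With the same weights, the gradient magnitude of the unit grows linearly with omega, which is consistent with the smoother gradients observed for small omega.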

2.2. Finite Differences

Fig 4: Gradients estimated by finite difference method.

Instead of calculating the gradients by taking the derivative of the output with respect to the inputs using autodiff, we can also use finite differences to approximate the gradients. Because finite difference stencils have local compact supports, they are less susceptible to noise. Indeed, the gradients of the finite difference solution are much less noisy than the autodiff version (see figure above).
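The following sketch illustrates the effect on a toy function rather than on a trained network. It assumes the fitted field equals a smooth signal plus a tiny high-frequency perturbation of amplitude `EPS` and frequency `K`; both values, as well as the step size `h`, are assumptions for illustration. The exact derivative (what autodiff would return) amplifies the perturbation by `K`, whereas a central difference with step `h` can change it by at most `EPS / h`.

```python
import math

EPS, K = 1e-3, 1000.0   # assumed perturbation amplitude and frequency

def f(x):
    """Smooth signal plus a tiny high-frequency perturbation."""
    return math.sin(x) + EPS * math.sin(K * x)

def exact_grad(x):
    """Exact derivative of f, i.e. what autodiff would report."""
    return math.cos(x) + EPS * K * math.cos(K * x)

def central_diff(fn, x, h=1e-2):
    """Second-order central finite difference with step h."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

xs = [i * 0.001 for i in range(1000)]
# Error of each gradient estimate against the clean derivative cos(x).
err_exact = max(abs(exact_grad(x) - math.cos(x)) for x in xs)
err_fd = max(abs(central_diff(f, x) - math.cos(x)) for x in xs)
print(err_exact, err_fd)
```

The trade-off is that `h` also limits how sharp a genuine feature the finite difference can resolve, so the step size becomes another parameter to choose.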

2.3. Averaging

Fig 5: Evolution of the gradient of the function over time in the mean gradients approach.

Instead of using the gradients computed by autodiff directly, we spatially average them: for each location, we take the mean of the autodiff gradient at that point and at four neighboring points. Although the gradients seem smoother at the very beginning, after a few timesteps the gradient’s peaks become spuriously taller, degrading the results (Fig 5).
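A minimal 1D sketch of this averaging is given below. The neighbor offsets (one and two steps of size `h` on each side) and the toy noisy gradient are assumptions for illustration; the stencil used in our experiments may differ.

```python
import math

def averaged_gradient(grad_fn, x, h=1e-2):
    """Mean of grad_fn at x and at four neighbors (assumed offsets: -2h, -h, +h, +2h)."""
    offsets = (-2.0 * h, -h, 0.0, h, 2.0 * h)
    return sum(grad_fn(x + o) for o in offsets) / len(offsets)

EPS, K = 1e-3, 1000.0   # assumed perturbation amplitude and frequency

def noisy_grad(x):
    """Toy stand-in for an autodiff gradient: cos(x) plus high-frequency noise."""
    return math.cos(x) + EPS * K * math.cos(K * x)

x = 0.0
print(noisy_grad(x), averaged_gradient(noisy_grad, x), math.cos(x))
```

This only shows the smoothing of a single static gradient; it does not reproduce the degradation over timesteps reported above.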

2.4. Initializing with Ground Truth Gradients

Fig 6: Evolution of the gradient of the function over time where the gradients are initialized to be the ground truth gradients.

At timestep 0, the neural network is trained to predict the initial values of the function we are trying to optimize. In addition, similarly to what is done in Sobolev training [6], we force the model to have the same gradients as the desired function. 
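A loss of this kind can be sketched as a value term plus a gradient term. The code below is an illustrative assumption of its general shape, not our training code: `sobolev_loss`, `grad_weight`, and the toy "model" (a sine with a small high-frequency perturbation) are hypothetical names and values, and the model gradient is supplied analytically where a real implementation would use autodiff.

```python
import math

def sobolev_loss(model, model_grad, target, target_grad, xs, grad_weight=1.0):
    """Mean squared error on values plus weighted mean squared error on gradients."""
    n = len(xs)
    value_term = sum((model(x) - target(x)) ** 2 for x in xs) / n
    grad_term = sum((model_grad(x) - target_grad(x)) ** 2 for x in xs) / n
    return value_term + grad_weight * grad_term

EPS, K = 1e-3, 1000.0   # assumed perturbation amplitude and frequency

def noisy(x):
    """Toy model output: matches sin(x) up to a tiny perturbation."""
    return math.sin(x) + EPS * math.sin(K * x)

def noisy_grad(x):
    """Gradient of the toy model output."""
    return math.cos(x) + EPS * K * math.cos(K * x)

xs = [i * 0.001 for i in range(1000)]
value_only = sobolev_loss(noisy, noisy_grad, math.sin, math.cos, xs, grad_weight=0.0)
with_grad = sobolev_loss(noisy, noisy_grad, math.sin, math.cos, xs, grad_weight=1.0)
print(value_only, with_grad)
```

The value-only loss barely registers the perturbation, while the gradient term penalizes it heavily, which is the motivation for supervising gradients at initialization.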

This method does reduce the noise in the gradient over time (Fig 6), but the information gathered from the extra supervision used in the initialization steps slowly dissolves. Furthermore, these gradients are not always available or not suitable for setting up a loss function in such a manner, making this solution not always applicable. 

3. Hybrid grid-neural representations

Fig 7: Description of the multi-resolution hashgrid (reproduced from [7]).

To address neural representations’ long training time, Müller et al. [7] developed a multi-resolution hash grid approach: instantNGP. In the following section, we explore instantNGP’s gradient quality. We benchmark it on a 2D fluid example. The figure below shows the example’s initial condition:
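As background, the sketch below shows the kind of spatial hash that maps an integer grid cell to a slot in a fixed-size feature table, using the per-dimension constants reported in [7]. The table size and the sample cells are assumptions for illustration, and the interpolation of the looked-up features is omitted.

```python
# Per-dimension constants reported in [7]; the first one is 1.
PRIMES = (1, 2654435761, 805459861)

def hash_index(cell, table_size):
    """Spatial hash of a grid cell with non-negative integer coordinates:
    XOR of (coordinate * prime) over dimensions, modulo the table size."""
    h = 0
    for coord, prime in zip(cell, PRIMES):
        h ^= coord * prime
    return h % table_size

TABLE_SIZE = 2 ** 14   # assumed table size, chosen only for this example
for cell in [(0, 0), (1, 0), (0, 1), (255, 255)]:
    print(cell, hash_index(cell, TABLE_SIZE))
```

Because the table is smaller than the number of cells at fine levels, distinct cells can share a slot, which is one place where a hash grid differs from a dense grid.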

Fig 8: Vortex velocity field (reproduced from [1]).

When we try to fit this initial condition using a pure neural representation (i.e., NOT instantNGP), we observe highly noisy gradients, consistent with the results reported in Section 2.1.

Fig 9: Gradient we obtained using pure neural representations for the proposed 2D vortex problem.

Next, we investigated hybrid neural-grid representations, varying the type of the grids and the type of the hash (see Fig 10).

Fig 10: Different gradients for different resolutions.

The parameters outlined in green obtained the best result. In this case, we use a base resolution of 256 and a 3-level hash grid. 

However, those results only accounted for the initial conditions. If we let the gradients evolve for 99 timesteps according to the PDE, we observe that the gradients start diverging, as can be seen in the video below.

Fig 11: Evolution of the gradients for the 2D fluid example with two vortices. This test was done with a dense grid of resolution 512.

In summary, although our hybrid approach seems excellent for the first timesteps, it also appears to limit the expressivity of the network during subsequent evolution.

Due to these problems in the 2D Fluid scenario, we investigated the gradients’ performance in another environment: the 2D Advection problem.

Fig 12: Comparison of Gradient Magnitudes for Different methods.

We used the same network configurations (base resolution, number of levels, number of hidden layers, and hidden features) as the 2D Fluid problem. The hybrid neural-grid representation performs significantly worse than the pure neural representation using SIREN (see Fig 12). Here the network is asked to fit the initial condition only.

Next, we tuned the configuration parameters. The models performed better on the 2D Advection scenario when the number of levels was increased. The best result was obtained with the following parameters: number of levels: 16; per-level scale: 1.5; base resolution: 16.
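To give a sense of what these parameters imply, the sketch below computes the grid resolution of each level under the geometric progression described in [7], N_l = floor(base_resolution * per_level_scale^l). This formula is an assumption about how the encoding is configured; specific implementations may round slightly differently.

```python
import math

def level_resolutions(base_resolution, per_level_scale, num_levels):
    """Per-level grid resolution, N_l = floor(base * scale ** l), for l = 0..num_levels-1."""
    return [
        math.floor(base_resolution * per_level_scale ** level)
        for level in range(num_levels)
    ]

# Parameters of the best 2D Advection configuration reported above.
resolutions = level_resolutions(base_resolution=16, per_level_scale=1.5, num_levels=16)
print(resolutions)
```

Under this assumption the levels span from a coarse 16-cell grid to a finest level of several thousand cells per axis, so the added levels mostly contribute high-frequency detail.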

Fig 13: Comparison of Gradient Magnitudes for Hash grid’s best parameters.

Even though the hash grid’s performance improved, the figure above demonstrates that the gradients are still much noisier than those of the pure neural representation (SIREN).

As such, we conclude that the hybrid grid-neural representation’s performance is not consistently better than that of the pure neural representation. In fact, it’s sometimes worse, as demonstrated in the 2D advection example.

4. Future Work

Since our tests and other SIREN works [8] show that tuning the omega hyperparameter is essential for gradient quality, one possible next step is to tune it automatically, as proposed in meta-learning techniques [9].

Another possible approach is supervising the gradients. Instead of using the original function to compute the gradients, it is possible to use a separate loss function that certifies the correct gradient values over time. Essentially, we aim to write down the evolution equation of the gradient itself and couple this equation with the original PDE.
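For the 1D advection example with constant speed a, the evolution equation of the gradient can be written down directly. The short derivation below is a sketch for that special case only; for other PDEs the gradient equation will generally contain additional terms.

```latex
% Assumes the 1D advection equation with constant speed a.
% Differentiate the PDE with respect to x and define g := \partial u / \partial x.
\frac{\partial u}{\partial t} + a\,\frac{\partial u}{\partial x} = 0
\quad\Longrightarrow\quad
\frac{\partial}{\partial x}\!\left(\frac{\partial u}{\partial t}\right)
  + a\,\frac{\partial^2 u}{\partial x^2} = 0
\quad\Longrightarrow\quad
\frac{\partial g}{\partial t} + a\,\frac{\partial g}{\partial x} = 0 .
```

In this case the gradient g obeys the same advection equation as u, so a coupled formulation would evolve both fields and penalize any mismatch between g and the spatial derivative of u.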

Future work should also pursue a theoretical understanding of the gradient problem. One possible cause is the global, non-compact support of neural networks, which is what initially motivated us to explore hybrid grid-neural networks. Other works [7, 10, 11] used similar approaches to extract local features to feed into the network, enforcing locality in the neural network.

5. References

Exploring Temporal Consistency and Cross-Modal Interaction of Latent NeRF

Authors: Sana Arastehfar, Erik Ekgasit, Berfin Inal, Maria Stuebner


Latent NeRF (Neural Radiance Fields) is a state-of-the-art generative model capable of synthesizing 3D-consistent 2D images from a combination of 3D-sketch and text guides. In this report, we investigate the temporal consistency of Latent NeRF by generating 3D-consistent images with different sketch guides and exploring the interplay between 3D-sketch guidance and text prompts in the generation process. Our broader motivation is to use diffusion models in animation and dynamic scene generation. Through our experiments, we discover the importance of incorporating both 3D sketch and text information to achieve accurate and consistent results. Additionally, we propose potential enhancements, including the integration of a geometry/shape loss and automatic generation of descriptive text from geometry, to improve the model’s performance.


A Neural Radiance Field (NeRF) is a relatively new representation of geometry. It uses a neural network to approximate a function F: (x, θ) → (c, σ) that models the appearance of a single scene. This function takes a sampled 3D point x and a viewing direction θ derived from a 2D image (with camera information) and returns the color c = (r, g, b) and volume density σ of the shape at that point. This is enough to encode the shape and color of a scene, as well as view-dependent lighting effects, with a radiance field.
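To make the role of c and σ concrete, here is a minimal sketch of the standard NeRF quadrature that composites samples along one camera ray into a pixel color. The function name `composite_ray` and the sample values are illustrative assumptions; this is not code from the Latent NeRF implementation, and the network that would produce the colors and densities is omitted.

```python
import math

def composite_ray(colors, densities, deltas):
    """Alpha-composite samples along one ray, front to back.

    colors:    list of (r, g, b) tuples returned for each sample
    densities: volume density sigma for each sample
    deltas:    distance between consecutive samples
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0   # fraction of light not yet absorbed
    for color, sigma, delta in zip(colors, densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return pixel

# Hypothetical ray: empty space, then a semi-transparent red sample, then a dense green one.
colors = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
densities = [0.0, 1.0, 50.0]
deltas = [0.1, 0.1, 0.1]
print(composite_ray(colors, densities, deltas))
```

Samples with zero density contribute nothing, and dense samples hide whatever lies behind them, which is how the density field encodes shape.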

Traditionally, NeRFs have been trained with sets of images from the real world or images that are rendered using computer graphics software. Recently, image diffusion models such as Stable Diffusion have been able to generate coherent images from just text prompts. Combining diffusion models and NeRFs has led to Latent NeRF, a cutting-edge generative model that uses an image diffusion model to train NeRFs using just a text prompt and an optional guiding 3D shape.

We investigate the use of Latent NeRF to create sequential images and animation. One major challenge in doing so is temporal consistency: it is currently difficult to ensure that diffusion models recreate the same object or character between runs. This week, we examine the temporal consistency of the model’s output and the influence of combining different input modalities.


We conducted a series of experiments to evaluate the temporal consistency of the Latent NeRF model and explore the interaction between sketches and text prompts during image synthesis.

Temporal Consistency Assessment:

We first tested consistency between two poses under the most naive approach, to get a baseline for consistency in Latent NeRF. We started by using sketch shapes to guide the shape of the NeRFs. Each sketch shape is a collection of simple triangle meshes arranged in roughly the same shape as the desired output. Our team’s Blender master, Erik, created sketch shapes of a teddy bear in different poses and configurations to guide the NeRF generation. Figure 1 depicts the original sketch shape of a teddy bear with its left arm raised, and Figure 2 shows the same teddy bear sketch with both arms down. Given the same text prompt, would the NeRFs generated from these sketch shapes look like the same object in different poses? According to the results in Figure 4 and Figure 5, the answer is no. The models have different colors, making them unsuitable to use as frames of animation.

Figure 1: Default teddy bear sketch with left arm up
Figure 2: Teddy bear sketch with both arms down
Figure 3: Teddy bear sketch holding a sword
Figure 4: Default teddy sketch (Figure 1) with the text prompt ‘a lego man’
Figure 5: Teddy with both hands down sketch (Figure 2) with the text prompt ‘a lego man’

Cross-Modal Interaction:

To understand the interplay between sketches and text prompts, we augmented the base sketch with a sword (Figure 3). We experimented with two different text prompts: “a lego man” (following the convention in previous papers, which use “lego man” rather than “lego human”) and “a lego man holding a sword”. Figure 6 illustrates the output for the “a lego man” text prompt, in which only a phantom of the sword was generated. However, with the “a lego man holding a sword” text prompt, the model successfully generated a visible sword (Figure 7). Both the sketch and the text prompt had to include the sword for a visible sword to be generated, suggesting the importance of cross-modal interaction.

Figure 6: Teddy holding a sword sketch with the text prompt ‘a lego man’
Figure 7: Teddy holding a sword sketch with the text prompt ‘a lego man holding a sword’

Additionally, we wanted to assess the level of support required from sketch-guidance while using text-guidance. To test this, we generated various lego humans based on specific instructions, such as ‘a lego man with right arm up’, ‘a lego man with left arm up’, while keeping the sketch guidance consistent with the same teddy sketch in Figure 1. As seen in Figures 8 and 9, despite the text prompts requesting opposite actions, the outputs are quite similar to each other, suggesting that text-guidance itself is not sufficient to generate the desired output.

Figure 8: Default teddy sketch (Fig. 1) with the text prompt ‘a lego man with right arm up’
Figure 9: Default teddy sketch (Fig. 1) with the text prompt ‘a lego man with left arm up’

Our observations indicate that Latent NeRF’s temporal consistency can be improved by incorporating additional constraints and interactions between different input modalities. This suggests that using a long description of the desired character could enforce temporal consistency between poses.


Building on our observations and discussions, we have some ideas as to how to improve temporal consistency in Latent-NeRF. We could integrate a geometry/shape loss to enhance the model’s ability to maintain consistency between generated images. Additionally, we could develop a mechanism to automatically extract descriptive words from geometry to use as complementary text prompts. Finally, we could use Stable Diffusion concepts to guide consistency.

Potential Experiments:

Moving forward, we would like to explore consistency when generating objects with similar geometry (e.g., different keywords such as “sword” or “stick” with the same sketch). To that end, we plan to repeat the same experiment with different combinations of geometry and text. Additionally, we would like to investigate the model’s performance with text prompts that do not match the geometry, such as “stick” or “apple” when the sketch in fact displays a sword, to evaluate the robustness of cross-modal interactions. Lastly, we think it would be interesting to find or create a Stable Diffusion concept and apply it to the generation of two different poses.


Our exploration of Latent NeRF’s temporal consistency and cross-modal interactions highlights the importance of combining sketch and text guides to achieve consistent and accurate 3D image synthesis. By addressing the observed issues and implementing potential enhancements, the model can be further refined to generate even more realistic and consistent images across different inputs. Eventually, this will help animators, modelers, and content creators to easily generate dynamic NeRFs with articulated characters and scenes.

Technical Journey:

Neural networks are renowned for their substantial computational demands, leading to extended training and inference times. Despite notable advancements in the realm of NeRFs research, which have contributed to improved training and inference speeds, these models continue to exhibit high memory requirements. Similarly, diffusion models also exhibit a pronounced appetite for memory resources. Consequently, the execution of Latent-NeRF necessitates access to machines equipped with GPUs boasting a minimum of 12 GB of VRAM.

The training of a single NeRF entails a substantial time investment, ranging from 30 minutes on an NVIDIA RTX 3090 to 3 hours on Google Colab. Moreover, the implementation of Latent NeRF entailed the integration of numerous dependencies, a process that demanded considerable troubleshooting effort to ensure proper installation. Given the convergence of two distinct areas of graphics research, namely NeRFs and diffusion models, encountering technical challenges during dependency management was a foreseeable aspect of the endeavor. Such technical troubleshooting constitutes an inevitable and crucial facet of the overall research process.

Fortunately, the mentors of this project, Sainan Liu, Ilke Demir, and Olga Guțan, provided valuable guidance in navigating these technical complexities, which significantly expedited the resolution of issues and allowed us to focus more efficiently on the core aspects of our project.


