Human action recognition is a core capability in AI systems designed for safety-critical applications, such as surveillance, eldercare, and industrial monitoring. However, many real-world datasets are limited by data imbalance, privacy constraints, or insufficient coverage of rare but important actions such as falls or accidents. In this post, I present SynthDa, a modular synthetic data augmentation pipeline that bridges the gap between data scarcity and model generalization.

By using synthetic motion sequences and viewpoint randomization, SynthDa improves model robustness without requiring large-scale manual annotations. This deep dive explores how SynthDa generates realistic training data, integrates with the NVIDIA TAO Toolkit, and shows potential improvements in action recognition performance.
This is an NVIDIA AI Technology Centre (Asia-Pacific South) project in collaboration with the Singapore Institute of Technology.
Why synthetic data for human action recognition
Human action recognition is central to applications such as surveillance, assistive robotics, and human-computer interaction. However, traditional video datasets often struggle with three persistent challenges: class imbalance, limited action diversity, and high annotation costs. These issues are especially pronounced when dealing with rare, unusual, or privacy-sensitive behaviors that are difficult or expensive to capture in the real world.
Synthetic data offers a scalable and privacy-preserving solution. Instead of relying on massive real-world collections, we generate variations from a smaller set of real motion samples. This approach enables us to systematically enrich the dataset with examples of uncommon or critical actions, improving model performance on edge cases without overfitting.
By strategically augmenting the training set with up to a controlled proportion of synthetic videos, we strengthen recognition of minority classes while maintaining the natural distribution of actions. This helps avoid the pitfalls of over-balancing, such as increased false positives, and ensures models are better prepared for real-world deployment.
To address this, we introduce SynthDa, a practical framework designed to mitigate data scarcity in real-world situations.

What is SynthDa?
SynthDa is a data augmentation pipeline designed for human action recognition tasks. It's compatible with existing computer vision models and toolkits, such as the NVIDIA TAO Toolkit (action recognition net), and provides two primary augmentation modes:
- Synthetic mix: Interpolate between real-world videos and generative AI sequences to produce synthetic variations.
- Real mix: Interpolate between pairs of real-world sequences to produce synthetic variations.
The pipeline consists of modular components for skeleton retargeting, 3D human rendering, scene randomization, and output video synthesis. Figure 1 shows the high-level architecture of SynthDa.
SynthDa mitigates the challenge of data scarcity by generating synthetic human motion videos and interpolated motion blends between real actions. The pipeline supports randomized environments and customizable camera angles to simulate varied conditions.
Over the years, this project has evolved to incorporate generative AI into its workflow, significantly expanding the diversity and realism of human pose variations. Its modular design makes it highly adaptable: users can integrate SynthDa components into existing pipelines or replace specific modules with their own models, tailoring the system to fit unique requirements or specialized use cases.

Augmentation types and pipeline components
To generate diverse and realistic motion data, AutoSynthDa incorporates multiple augmentation strategies across a modular pipeline.
Synthetic mix: breathing life into motion with synthetic avatars
Synthetic mix begins with something very real: a human motion captured as a 3D pose skeleton. But instead of stopping there, we pass this motion through a creative transformation pipeline. Using the joints2smpl tool, each pose sequence is retargeted to a synthetic avatar that moves convincingly within a virtual environment.
To boost diversity and robustness, every scene is randomized. Avatars appear in different environments, under varying lighting conditions, and from unique camera angles. The result is a rich collection of synthetic videos that retain real-world motion patterns while introducing visual variety that helps models generalize better.
Think of it as motion-capture cosplay unfolding across a dozen digital worlds at once.
Real mix: stitching motions into new realities
Not all data augmentation needs synthetic avatars. Real mix takes pairs of real motion sequences and smoothly blends them into new, naturalistic transitions. This interpolation process does more than just increase volume. It creates plausible variations that could occur in the real world but were never recorded.
By adjusting the interpolation weights and mixing different actions, we can produce subtle intra-class variations or realistic transitions between distinct motions. It’s like choreographing new routines from familiar dance moves: creative, grounded, and data-efficient.
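To make this concrete, here is a minimal sketch of weighted pose blending. It assumes each motion is stored as a NumPy array of 3D joint coordinates with shape (frames, joints, 3) and that the two sequences are already temporally aligned; the function name and array layout are illustrative, not SynthDa's exact implementation.

import numpy as np

def blend_pose_sequences(seq_a: np.ndarray, seq_b: np.ndarray, weight: float = 0.5) -> np.ndarray:
    """Linearly interpolate two aligned pose sequences of shape (frames, joints, 3).

    weight=0.0 returns seq_a, weight=1.0 returns seq_b, and values in between
    produce a new, plausible motion that blends both actions.
    """
    frames = min(len(seq_a), len(seq_b))  # guard against small length mismatches
    return (1.0 - weight) * seq_a[:frames] + weight * seq_b[:frames]

# Example usage with two placeholder sequences standing in for real captured motions
walk_a = np.random.rand(120, 22, 3)
walk_b = np.random.rand(120, 22, 3)
variations = [blend_pose_sequences(walk_a, walk_b, w) for w in (0.25, 0.5, 0.75)]

Sweeping the weight across a pair of sources produces a family of in-between motions, which is the idea behind both the real mix and synthetic mix variations.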
Randomized scenes and camera viewpoints
Whether the source is synthetic or real, visual diversity improves robustness. That’s why we randomize object layouts, surface textures, and camera perspectives for every rendered video. These variations help the model handle complex, real-world scenarios without becoming overfitted to narrow conditions.
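As a rough illustration, the following Blender Python sketch randomizes the camera viewpoint and lighting before rendering. It assumes a scene containing objects named "Camera" and "Light" (as in Blender's default scene); the value ranges, object names, and output path are placeholders rather than SynthDa's actual rendering script.

import math
import random
import bpy

scene = bpy.context.scene
camera = bpy.data.objects["Camera"]
light = bpy.data.objects["Light"]

# Place the camera at a random position on a ring around the subject
radius = random.uniform(4.0, 7.0)
angle = random.uniform(0.0, 2.0 * math.pi)
camera.location = (radius * math.cos(angle), radius * math.sin(angle), random.uniform(1.2, 2.5))

# Aim the camera at the world origin, where the avatar is assumed to stand
direction = -camera.location
camera.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()

# Randomize lighting intensity so renders vary in brightness
light.data.energy = random.uniform(200.0, 1000.0)

# Render the animation with the randomized setup; output settings are assumed
# to be configured elsewhere (for example, in the rendering script)
scene.render.filepath = "//output/randomized_view_"
bpy.ops.render.render(animation=True)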
Find more details about scene randomization and customization options in the SynthDa GitHub repository.
Getting started with SynthDa
SynthDa can be installed as a Python package. The following steps summarize the setup process. Further details are on our GitHub repository:
1. Install dependencies
pip install -r requirements.txt
2. Clone the required repositories
Clone and place the following repositories in the project directory:
- StridedTransformer-Pose3D
- text-to-motion
- joints2smpl
- SlowFast
- Download Blender 3.0 from the official Blender release archive
3. Configuration setup
Create an .env file in the root directory with paths to each cloned repository and your API keys.
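For illustration, a minimal .env file might look like the following; the exact variable names expected by SynthDa are documented in the GitHub repository, so treat these keys and paths as placeholders.

# Illustrative placeholders only; see the SynthDa GitHub docs for the exact keys
STRIDED_TRANSFORMER_PATH=/path/to/StridedTransformer-Pose3D
TEXT_TO_MOTION_PATH=/path/to/text-to-motion
JOINTS2SMPL_PATH=/path/to/joints2smpl
SLOWFAST_PATH=/path/to/SlowFast
BLENDER_PATH=/path/to/blender-3.0
API_KEY=your-api-key-here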
4. Download pretrained models
Download all pretrained checkpoints as instructed in the GitHub documentation. Place them in their respective folders for each repository.
5. Verify each component
Before running the full SynthDa pipeline, test each submodule:
- StridedTransformer:
python demo/vis.py --video sample_video.mp4
- text-to-motion:
python gen_motion_script.py --name Comp_v6_KLD01 --text_file input.txt
- joints2smpl:
python fit_seq.py --files test_motion2.npy
- Blender: Test rendering with
./blender -b -P animation_pose.py
6. Generate samples of synthetic data using running.py as your guide

7. Custom optimization loop
A key feature of AutoSynthDa is its support for iterative optimization. Instead of relying on static augmentation, users can dynamically refine synthetic data quality using feedback from downstream model performance.
- Generate pose pairs once. At the start, generate synthetic poses from either:
– A real and real motion source pair, or
– A real and synthetic motion source pair (e.g., using generative AI for pose generation)
These pose pairs are reused for every interpolation; you don't need to regenerate poses for each iteration.
- Define the loop logic. Use model accuracy to guide interpolation weight selection: start with a default weight (e.g., 0.5), then evaluate nearby weights and move toward the one that performs best.
- Implement calculate_new_weight (finite difference approximation). To adjust the interpolation weight, we use a simple finite difference approximation, which estimates the direction of improvement by evaluating model accuracy at slightly higher and lower weights. A sketch follows this list.
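Here is a minimal, self-contained sketch of that update step. The train_and_evaluate placeholder stands in for your own routine that renders interpolated data at a given weight, trains the downstream model, and returns validation accuracy; the step size, learning rate, and toy accuracy curve are illustrative choices rather than SynthDa's exact implementation.

def train_and_evaluate(weight):
    """Placeholder: render interpolated data at this weight, train the downstream
    model, and return validation accuracy. Replace this toy curve (which peaks
    at weight 0.7) with your own training and evaluation routine."""
    return 1.0 - (weight - 0.7) ** 2

def calculate_new_weight(weight, evaluate_fn, delta=0.05, lr=0.5):
    """Nudge the interpolation weight using a finite difference approximation.

    The gradient of accuracy with respect to the weight is estimated by
    evaluating slightly higher and lower weights, then the weight is moved
    in the direction that improves accuracy and clamped to [0, 1].
    """
    acc_up = evaluate_fn(min(weight + delta, 1.0))
    acc_down = evaluate_fn(max(weight - delta, 0.0))
    gradient = (acc_up - acc_down) / (2.0 * delta)
    return min(max(weight + lr * gradient, 0.0), 1.0)

# Optimization loop: start at the default weight and refine iteratively
weight = 0.5
for step in range(5):
    weight = calculate_new_weight(weight, train_and_evaluate)
    print(f"iteration {step}: interpolation weight = {weight:.3f}")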
8. Train and test with the NVIDIA TAO Toolkit
- If you have chosen the NVIDIA TAO Toolkit, you can train and evaluate with the following commands in Docker:
# Sample command for training
action_recognition train -e ./specs/{experiment}.yaml results_dir=./results
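# Sample command for evaluation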
action_recognition evaluate -e ./specs/{experiment}.yaml results_dir=./results dataset.workers=0 evaluate.checkpoint=./{path to trained model}.tlt evaluate.batch_size=1 evaluate.test_dataset_dir=./{path to directory} evaluate.video_eval_mode=center
Learn more about the NVIDIA TAO Toolkit action recognition net.
Real-world users of SynthDa
SynthDa has been adopted and tested by real-world users across research, academia, and industry, demonstrating its practical value in diverse action recognition scenarios.
National Institute of Education (Nanyang Technological University) and Hwa Chong Institution, Singapore
One of the most promising real-world applications of SynthDa is in the education sector, led by researchers at the National Institute of Education (NIE), part of Nanyang Technological University in Singapore. SynthDa is being adopted in an ongoing project to generate synthetic video data for science experiments in school laboratories. This effort aims to train computer vision models that can help teachers by monitoring student activity and identifying critical actions in real time, such as misuse of lab equipment or procedural errors.
This project directly addresses the issue of limited access to real student video data, a common challenge due to privacy regulations and data scarcity. By using synthetic video data that simulates real student behaviors, the team at NIE is working to build safer, more responsive AI-powered instructional tools.
The project has since been extended to include Hwa Chong Institution, a leading secondary school in Singapore, where researchers plan to pilot AI models trained with synthetic data in actual classroom settings. Hwa Chong Institution is currently working with the NIE research team, contributing real-world footage of students performing science experiments for training and research purposes. Through this collaboration, SynthDa plays a foundational role in demonstrating how synthetic data can make AI systems viable and ethical in educational environments, where protecting student privacy is paramount.
Shiga University, Japan
Researchers from the Yoshihisa Lab at Shiga University (Hikone, Shiga, Japan) have begun exploring SynthDa for generating varied human motion data. By using SynthDa's unique SDG-based interpolation method, they can produce rich pose variations exportable as .mp4 or .fbx files. Their project focuses on using AI technologies such as SynthDa, together with IoT, to ensure safe and secure bicycle use.
Matsuo-Iwasawa Lab, University of Tokyo, Japan
The Matsuo-Iwasawa Laboratory at the University of Tokyo, Japan, focuses on cutting-edge deep learning research, nurturing pioneers to drive future innovations through an ecosystem of education, research, and entrepreneurship. SynthDa is currently being explored in collaboration with the Matsuo-Iwasawa Lab, with a focus on its potential for imitation learning in real-world robotic systems. Beyond robotics, the project also aims to support broader use cases such as adaptive world modeling and multimodal learning in immersive environments.
Future work
Future work includes extending SynthDa to support multi-person actions and fine-tuning for specific domains such as healthcare and robotics. SynthDa is designed to be improved and integrated with your existing workflows; our repository contains the individual components that you can swap and piece together to create your own version of SynthDa. Show us your creations and uses of SynthDa, whether for personal research projects or published research.
For more information, visit the SynthDa GitHub repository and read more about synthetic data. The SynthDa demo is available on Hugging Face.
If you try SynthDa in your projects or benchmarks, let us know and connect through our repository. What applications and use cases can you deploy SynthDa for today?
Acknowledgements
National Institute of Education (Nanyang Technological University), Singapore, Shiga University, Japan, University of Tokyo, Japan, Chek Tien Tan (SIT), Indri Atmosukarto (SIT), and Simon See (NVIDIA)
Special thanks to all the other colleagues who supported SynthDa in various ways.