Requirements
- Allow users to upload CSV files (up to 5GB).
- Validate the data and log all error rows:
- Column values exceeding 200 characters.
- Presence of disallowed special characters.
- A designated set of key columns must be unique (no duplicates within the file or against the database).
- Return to the user:
- The rows with errors.
- Type of error.
- The problematic value.
Overall Solution
The processing flow is divided into four main phases:
1. User Uploads File to S3
- Frontend uploads the file to S3 using a pre-signed URL or S3 SDK.
- Once the upload succeeds, trigger the validation process via EventBridge or by sending a request to the backend.
2. Batch Processing: Data Validation
The validation process is broken into multiple steps to optimize performance and reduce the risk of timeouts and excessive memory usage.
a. Create a Temporary Table in the Database
- Create a temp table mirroring the CSV structure.
- Add extra columns to log errors: column_1_length, column_1_regex, is_error, line_number, etc.
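A minimal sketch of such a table, assuming the two-column CSV used in the LOAD DATA example below (the names, types, and the AUTO_INCREMENT line_number are illustrative; the auto-increment value tracks file order because LOAD DATA inserts rows sequentially):

CREATE TEMPORARY TABLE temp_table (
  line_number BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, -- approximates the row's position in the file
  column_1 TEXT,
  column_2 TEXT,
  id_key VARCHAR(255), -- whichever CSV column(s) must be unique; loaded from the file like the others
  column_1_length INT DEFAULT 0,
  column_2_length INT DEFAULT 0,
  column_1_regex TINYINT DEFAULT 0,
  is_error TINYINT DEFAULT 0
);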
b. Load File from S3 into the Temp Table
- Use LOAD DATA LOCAL INFILE to import the file directly into the database (if S3 is mounted to local).
- Optionally preprocess to remove newline characters (\r\n) embedded in values.
- Temporarily disable indexes to speed up data loading, then re-enable them (see the sketch after the LOAD DATA statement below).
LOAD DATA LOCAL INFILE '/path_to_file.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(@column_1, @column_2) -- read raw values into user variables so SET can derive cleaned values and validation flags
SET
  column_1 = REPLACE(REPLACE(@column_1, '\r', ''), '\n', ''),
  column_2 = REPLACE(REPLACE(@column_2, '\r', ''), '\n', ''),
  column_1_length = CHAR_LENGTH(@column_1),
  column_2_length = CHAR_LENGTH(@column_2),
  column_1_regex = @column_1 REGEXP '[^a-zA-Z0-9+\\-_.:=?&/]+',
  is_error = IF(CHAR_LENGTH(@column_1) > 200
             OR CHAR_LENGTH(@column_2) > 200
             OR @column_1 REGEXP '[^a-zA-Z0-9+\\-_.:=?&/]+', 1, 0);
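As noted in the bullet above, index maintenance can be switched off around the load. A minimal sketch (ALTER TABLE ... DISABLE KEYS affects only non-unique indexes and is effective for MyISAM tables; for InnoDB, a common alternative is to create the temp table without secondary indexes and add them after loading):

ALTER TABLE temp_table DISABLE KEYS;
-- run the LOAD DATA LOCAL INFILE statement here
ALTER TABLE temp_table ENABLE KEYS;

-- InnoDB alternative: add secondary indexes only after the data is loaded
ALTER TABLE temp_table ADD INDEX idx_id_key (id_key);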
c. Retrieve Error Rows and Generate an Error Report
- Select the rows where is_error = 1 and write them to an error file along with the line number and an error description.
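A minimal sketch of the report query, assuming the helper columns populated by the LOAD DATA example above (the error messages are illustrative; CONCAT_WS skips NULL arguments, so only the checks that failed appear in the description):

SELECT
  line_number,
  CONCAT_WS('; ',
    IF(column_1_length > 200, 'column_1 exceeds 200 characters', NULL),
    IF(column_2_length > 200, 'column_2 exceeds 200 characters', NULL),
    IF(column_1_regex > 0, 'column_1 contains disallowed characters', NULL)
  ) AS error_description,
  column_1,
  column_2
FROM temp_table
WHERE is_error = 1
ORDER BY line_number;
-- SELECT ... INTO OUTFILE can write this result directly to a file on the database server if preferred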
d. Check for Duplicates Within the File
SELECT id_key, GROUP_CONCAT(line_number) AS duplicated_lines
FROM temp_table
GROUP BY id_key
HAVING COUNT(*) > 1;
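Note that GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), which may matter for heavily duplicated keys. If duplicated rows should also be kept out of the final insert, the same grouping can flag them; a sketch (the derived table is required because MySQL does not allow updating a table while selecting from it directly in a subquery):

UPDATE temp_table t
JOIN (
  SELECT id_key
  FROM temp_table
  GROUP BY id_key
  HAVING COUNT(*) > 1
) dup ON dup.id_key = t.id_key
SET t.is_error = 1;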
e. Check for Duplicates Against the Database
Compare id_key in temp_table against the main table in your system.
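A minimal sketch, assuming the main table is named main_table and shares the id_key column (both names are illustrative):

SELECT t.line_number, t.id_key
FROM temp_table t
JOIN main_table m ON m.id_key = t.id_key;

-- optionally flag these rows as errors so they are excluded from the final insert
UPDATE temp_table t
JOIN main_table m ON m.id_key = t.id_key
SET t.is_error = 1;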
3. Insert Valid Data into the Main Database
- Only include rows without errors.
- Use INSERT INTO or REPLACE INTO depending on the use case.
- Optionally add logic for default values, audit fields, etc.
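A minimal sketch of the final insert, assuming main_table has matching columns (the column names and the audit field are illustrative):

INSERT INTO main_table (column_1, column_2, created_at)
SELECT column_1, column_2, NOW()
FROM temp_table
WHERE is_error = 0;

-- or, if rows with an existing unique key should be overwritten:
-- REPLACE INTO main_table (column_1, column_2, created_at)
-- SELECT column_1, column_2, NOW()
-- FROM temp_table
-- WHERE is_error = 0;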
4. Clean Up Temporary Table
- Drop the temp table after processing to free up resources.
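For example:

DROP TEMPORARY TABLE IF EXISTS temp_table;

(A temporary table is also dropped automatically when the database session that created it ends.)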
Tips & Best Practices
- Disable indexes during large data loads, then re-enable afterward (significantly reduces loading time).
- Ensure the CSV is correctly encoded (UTF-8, without BOM).
- Avoid line-by-line validation in code — use SQL for efficient bulk validation.
- Log all errors in one pass so users can correct them all at once, rather than iteratively.
Conclusion
Handling large file uploads isn’t just about raising size limits. You need a smart, efficient processing flow that ensures performance and scalability. With this approach, you can handle even larger files if needed.
Would you like help implementing a similar solution in code or cloud architecture?
Credit: Ton That Bach (Technical Leader at Mynavi TechTus Vietnam)