Experience Handling Large CSV Uploads (5GB) with Complex Validation

Requirements

  • Allow users to upload CSV files (up to 5GB).
  • Validate the data and log all error rows:
    • Columns exceeding 200 characters.
    • Presence of disallowed special characters.
    • A defined set of columns must be unique: no duplicates within the file and none against existing records in the database.
  • Return to the user:
    • The rows with errors.
    • Type of error.
    • The problematic value.

Overall Solution

The processing flow is divided into the following main phases:

1. User Uploads File to S3

  • The frontend uploads the file to S3 using a pre-signed URL or the S3 SDK.
  • Once the upload succeeds, trigger the validation process via an S3 event routed through EventBridge, or by sending a request to the backend.

2. Batch Processing: Data Validation

The validation process is broken into multiple steps to optimize performance and reduce the risk of timeouts and excessive memory use.

a. Create a Temporary Table in the Database

  • Create a temp table mirroring the CSV structure.
  • Add extra columns to log errors: column_1_length, column_1_regex, is_error, line_number, etc.
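
A minimal sketch of such a table, assuming two data columns plus an id_key used later for the duplicate checks (all names here are illustrative, not the exact schema):

-- Regular staging table, dropped in the cleanup step. A TEMPORARY table also works,
-- but MySQL does not allow referring to a TEMPORARY table twice in the same query,
-- which matters for the self-join duplicate checks further down.
CREATE TABLE temp_table (
  line_number     BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- approximates the CSV row order during LOAD DATA
  id_key          VARCHAR(255),
  column_1        TEXT,
  column_2        TEXT,
  column_1_length INT     DEFAULT 0,   -- filled by the LOAD DATA ... SET clause below
  column_2_length INT     DEFAULT 0,
  column_1_regex  TINYINT DEFAULT 0,   -- 1 if column_1 contains a disallowed character
  is_error        TINYINT DEFAULT 0
);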

b. Load File from S3 into the Temp Table

  • Use LOAD DATA LOCAL INFILE to import the file directly into the database (the file must be readable from the machine issuing the query, e.g. after downloading it from S3 or mounting the bucket locally).
  • Optionally preprocess to remove newline characters (\r\n) embedded in values.
  • Temporarily disable indexes to speed up data loading, then re-enable them (see the sketch after the LOAD DATA statement below).

LOAD DATA LOCAL INFILE '/path_to_file.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(column_1, column_2)
SET
  column_1 = REPLACE(REPLACE(column_1, '\r', ''), '\n', ''),
  column_2 = REPLACE(REPLACE(column_2, '\r', ''), '\n', ''),
  column_1_length = CHAR_LENGTH(column_1),
  column_2_length = CHAR_LENGTH(column_2),
  column_1_regex = column_1 REGEXP '[^a-zA-Z0-9+\-_.:=?&/]+',
  is_error = IF(column_1_length > 200 OR column_2_length > 200 OR column_1_regex > 0, 1, 0);
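
For the index tip above, one possible approach. Note that ALTER TABLE ... DISABLE KEYS only defers maintenance of non-unique indexes and only on MyISAM tables; on InnoDB the usual equivalent is to drop secondary indexes before the load and recreate them afterwards.

-- MyISAM: defer non-unique index maintenance until after the bulk load
ALTER TABLE temp_table DISABLE KEYS;

-- ... run the LOAD DATA LOCAL INFILE statement here ...

-- Rebuild the deferred indexes in a single pass
ALTER TABLE temp_table ENABLE KEYS;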

c. Retrieve Error Rows and Generate an Error Report

  • Select rows with is_error = 1, write them to an error file with row number and error description.
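
A sketch of that query, assuming the flag columns from the load step (the error labels are illustrative; a row failing several checks would need one output line per check):

SELECT
  line_number,
  CASE
    WHEN column_1_length > 200 THEN 'column_1 exceeds 200 characters'
    WHEN column_1_regex  > 0   THEN 'column_1 contains disallowed characters'
    WHEN column_2_length > 200 THEN 'column_2 exceeds 200 characters'
  END AS error_type,
  column_1 AS problematic_value  -- or the specific offending column, depending on the check
FROM temp_table
WHERE is_error = 1
ORDER BY line_number;

The result set can be streamed to an error file from application code, or written directly with SELECT ... INTO OUTFILE if the server is permitted to write files.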

d. Check for Duplicates Within the File

SELECT id_key, GROUP_CONCAT(line_number) AS duplicated_lines
FROM temp_table
GROUP BY id_key
HAVING COUNT(*) > 1;
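
If duplicated rows should also appear in the same error report, they can be flagged with a follow-up update (this self-join works on a regular staging table, which is one reason to prefer it over a true TEMPORARY table):

UPDATE temp_table t
JOIN (
  SELECT id_key
  FROM temp_table
  GROUP BY id_key
  HAVING COUNT(*) > 1
) dup ON dup.id_key = t.id_key
SET t.is_error = 1;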

e. Check for Duplicates Against the Database

Compare id_key in temp_table against the main table in your system.
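
A possible query for this comparison, assuming the main table is named main_table and shares the id_key column (both names are assumptions):

-- Every row returned already exists in the database
SELECT t.line_number, t.id_key
FROM temp_table t
JOIN main_table m ON m.id_key = t.id_key;

As with the in-file duplicates, these rows can be flagged with is_error = 1 so they are excluded from the final insert.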

3. Insert Valid Data into the Main Database

  • Only include rows without errors.
  • Use INSERT INTO or REPLACE INTO depending on the use case.
  • Optionally add logic for default values, audit fields, etc.
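
A sketch of the final insert, assuming the earlier steps flagged every invalid or duplicated row with is_error = 1 (table and column names are illustrative):

INSERT INTO main_table (id_key, column_1, column_2)
SELECT id_key, column_1, column_2
FROM temp_table
WHERE is_error = 0;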

4. Clean Up Temporary Table

  • Drop the temp table after processing to free up resources.
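
Assuming the regular staging table sketched earlier, the cleanup is a single statement:

DROP TABLE IF EXISTS temp_table;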

Tips & Best Practices

  • Disable indexes during large data loads, then re-enable afterward (significantly reduces loading time).
  • Ensure the CSV is correctly encoded (UTF-8, without BOM).
  • Avoid line-by-line validation in code — use SQL for efficient bulk validation.
  • Log all errors in one pass so users can correct them all at once, rather than iteratively.

Conclusion

Handling large file uploads isn’t just about raising size limits. You need a smart, efficient processing flow that ensures performance and scalability. With this approach, you can handle even larger files if needed.


Credit: Ton That Bach (Technical Leader in Mynavi TechTus Vietnam)
