Requirements
- Allow users to upload CSV files (up to 5GB).
- Validate the data and log all error rows:
- Column values exceeding 200 characters.
- Presence of disallowed special characters.
- A designated set of key columns must be unique (no duplicates within the file or against the database).
- Return to the user:
- The rows with errors.
- Type of error.
- The problematic value.
Overall Solution
The processing flow is divided into four main phases:
1. User Uploads File to S3
- Frontend uploads the file to S3 using a pre-signed URL or S3 SDK.
- Once the upload succeeds, trigger the validation process via EventBridge or by sending a request to the backend.
2. Batch Processing: Data Validation
The validation process is broken into multiple steps to optimize performance and reduce the risk of timeouts and excessive memory usage.
a. Create a Temporary Table in the Database
- Create a temp table mirroring the CSV structure.
- Add extra columns to log errors: column_1_length, column_1_regex, is_error, line_number, etc.
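A minimal sketch of such a table, assuming the two-column CSV used in the LOAD DATA example below (the names, types, and the AUTO_INCREMENT line_number are illustrative; the auto-increment value tracks file order because LOAD DATA inserts rows sequentially):

CREATE TEMPORARY TABLE temp_table (
  line_number BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, -- approximates the row's position in the file
  column_1 TEXT,
  column_2 TEXT,
  id_key VARCHAR(255), -- whichever CSV column(s) must be unique; loaded from the file like the others
  column_1_length INT DEFAULT 0,
  column_2_length INT DEFAULT 0,
  column_1_regex TINYINT DEFAULT 0,
  is_error TINYINT DEFAULT 0
);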
b. Load File from S3 into the Temp Table
- Use LOAD DATA LOCAL INFILE to import the file directly into the database (if S3 is mounted to local).
- Optionally preprocess to remove newline characters (\r\n) embedded in values.
- Temporarily disable indexes to speed up data loading, then re-enable them (see the sketch after the LOAD DATA statement below).
LOAD DATA LOCAL INFILE '/path_to_file.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(@column_1, @column_2) -- read raw values into user variables so SET can derive cleaned values and validation flags
SET
  column_1 = REPLACE(REPLACE(@column_1, '\r', ''), '\n', ''),
  column_2 = REPLACE(REPLACE(@column_2, '\r', ''), '\n', ''),
  column_1_length = CHAR_LENGTH(@column_1),
  column_2_length = CHAR_LENGTH(@column_2),
  column_1_regex = @column_1 REGEXP '[^a-zA-Z0-9+\\-_.:=?&/]+',
  is_error = IF(CHAR_LENGTH(@column_1) > 200
             OR CHAR_LENGTH(@column_2) > 200
             OR @column_1 REGEXP '[^a-zA-Z0-9+\\-_.:=?&/]+', 1, 0);
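As noted in the bullet above, index maintenance can be switched off around the load. A minimal sketch (ALTER TABLE ... DISABLE KEYS affects only non-unique indexes and is effective for MyISAM tables; for InnoDB, a common alternative is to create the temp table without secondary indexes and add them after loading):

ALTER TABLE temp_table DISABLE KEYS;
-- run the LOAD DATA LOCAL INFILE statement here
ALTER TABLE temp_table ENABLE KEYS;

-- InnoDB alternative: add secondary indexes only after the data is loaded
ALTER TABLE temp_table ADD INDEX idx_id_key (id_key);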
c. Retrieve Error Rows and Generate an Error Report
- Select the rows where is_error = 1 and write them to an error file along with the line number and an error description.
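A minimal sketch of the report query, assuming the helper columns populated by the LOAD DATA example above (the error messages are illustrative; CONCAT_WS skips NULL arguments, so only the checks that failed appear in the description):

SELECT
  line_number,
  CONCAT_WS('; ',
    IF(column_1_length > 200, 'column_1 exceeds 200 characters', NULL),
    IF(column_2_length > 200, 'column_2 exceeds 200 characters', NULL),
    IF(column_1_regex > 0, 'column_1 contains disallowed characters', NULL)
  ) AS error_description,
  column_1,
  column_2
FROM temp_table
WHERE is_error = 1
ORDER BY line_number;
-- SELECT ... INTO OUTFILE can write this result directly to a file on the database server if preferred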
d. Check for Duplicates Within the File
SELECT id_key, GROUP_CONCAT(line_number) AS duplicated_lines
FROM temp_table
GROUP BY id_key
HAVING COUNT(*) > 1;
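Note that GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), which may matter for heavily duplicated keys. If duplicated rows should also be kept out of the final insert, the same grouping can flag them; a sketch (the derived table is required because MySQL does not allow updating a table while selecting from it directly in a subquery):

UPDATE temp_table t
JOIN (
  SELECT id_key
  FROM temp_table
  GROUP BY id_key
  HAVING COUNT(*) > 1
) dup ON dup.id_key = t.id_key
SET t.is_error = 1;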
e. Check for Duplicates Against the Database
Compare id_key in temp_table against the main table in your system.
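A minimal sketch, assuming the main table is named main_table and shares the id_key column (both names are illustrative):

SELECT t.line_number, t.id_key
FROM temp_table t
JOIN main_table m ON m.id_key = t.id_key;

-- optionally flag these rows as errors so they are excluded from the final insert
UPDATE temp_table t
JOIN main_table m ON m.id_key = t.id_key
SET t.is_error = 1;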
3. Insert Valid Data into the Main Database
- Only include rows without errors.
- Use INSERT INTO or REPLACE INTO depending on the use case.
- Optionally add logic for default values, audit fields, etc.
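A minimal sketch of the final insert, assuming main_table has matching columns (the column names and the audit field are illustrative):

INSERT INTO main_table (column_1, column_2, created_at)
SELECT column_1, column_2, NOW()
FROM temp_table
WHERE is_error = 0;

-- or, if rows with an existing unique key should be overwritten:
-- REPLACE INTO main_table (column_1, column_2, created_at)
-- SELECT column_1, column_2, NOW()
-- FROM temp_table
-- WHERE is_error = 0;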
4. Clean Up Temporary Table
- Drop the temp table after processing to free up resources.
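For example:

DROP TEMPORARY TABLE IF EXISTS temp_table;

(A temporary table is also dropped automatically when the database session that created it ends.)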
Tips & Best Practices
- Disable indexes during large data loads, then re-enable afterward (significantly reduces loading time).
- Ensure the CSV is correctly encoded (UTF-8, without BOM).
- Avoid line-by-line validation in code — use SQL for efficient bulk validation.
- Log all errors in one pass so users can correct them all at once, rather than iteratively.
Conclusion
Handling large file uploads isn’t just about raising size limits. You need a smart, efficient processing flow that ensures performance and scalability. With this approach, you can handle even larger files if needed.
Would you like help implementing a similar solution in code or cloud architecture?
Credit: Ton That Bach (Technical Leader at Mynavi TechTus Vietnam)