Experience Handling Large CSV Uploads (5GB) with Complex Validation

Requirements

  • Allow users to upload CSV files (up to 5GB).
  • Validate the data and log all error rows:
    • Columns exceeding 200 characters.
    • Presence of disallowed special characters.
    • A set of key columns must be unique: no duplicates within the file, and none against existing records in the database.
  • Return to the user:
    • The rows with errors.
    • Type of error.
    • The problematic value.

Overall Solution

The processing flow is divided into three main phases:

1. User Uploads File to S3

  • The frontend uploads the file to S3 using a pre-signed URL or the S3 SDK.
  • Once the upload succeeds, trigger the validation process via EventBridge or by sending a request to the backend.

2. Batch Processing: Data Validation

The validation process is broken into multiple steps to optimize performance and reduce the risk of timeouts and memory exhaustion.

a. Create a Temporary Table in the Database

  • Create a temp table mirroring the CSV structure.
  • Add extra columns to log errors: column_1_length, column_1_regex, is_error, line_number, etc.
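
A minimal sketch of what this staging table could look like, assuming two data columns and an id_key used for the later duplicate checks (all names and types here are illustrative, not taken from a real schema):

-- Illustrative staging table; adjust columns and types to match the actual CSV.
CREATE TABLE temp_table (
  line_number     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- filled automatically in load order, used in the error report
  id_key          VARCHAR(255),           -- business key checked for duplicates (in practice one of the CSV columns)
  column_1        TEXT,
  column_2        TEXT,
  column_1_length INT DEFAULT 0,          -- CHAR_LENGTH of column_1, filled during the load
  column_2_length INT DEFAULT 0,
  column_1_regex  TINYINT(1) DEFAULT 0,   -- 1 if column_1 contains disallowed characters
  is_error        TINYINT(1) DEFAULT 0,   -- 1 if any validation rule failed
  KEY idx_id_key (id_key)                 -- speeds up the duplicate checks later
);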

b. Load File from S3 into the Temp Table

  • Use LOAD DATA LOCAL INFILE to bulk-import the file into the temp table (the file must be reachable as a local path, e.g. via an S3 mount or after downloading it from S3).
  • Optionally preprocess to remove newline characters (\r\n) embedded in values.
  • Temporarily disable indexes to speed up data loading, then re-enable them.

-- Raw fields are read into the user variables @column_1 and @column_2 so they can be
-- cleaned and measured before being written to the table columns.
LOAD DATA LOCAL INFILE '/path_to_file.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(@column_1, @column_2)
SET
  column_1        = REPLACE(REPLACE(@column_1, '\r', ''), '\n', ''),
  column_2        = REPLACE(REPLACE(@column_2, '\r', ''), '\n', ''),
  column_1_length = CHAR_LENGTH(REPLACE(REPLACE(@column_1, '\r', ''), '\n', '')),
  column_2_length = CHAR_LENGTH(REPLACE(REPLACE(@column_2, '\r', ''), '\n', '')),
  column_1_regex  = @column_1 REGEXP '[^a-zA-Z0-9+_.:=?&/-]+',
  is_error        = IF(CHAR_LENGTH(REPLACE(REPLACE(@column_1, '\r', ''), '\n', '')) > 200
                       OR CHAR_LENGTH(REPLACE(REPLACE(@column_2, '\r', ''), '\n', '')) > 200
                       OR @column_1 REGEXP '[^a-zA-Z0-9+_.:=?&/-]+', 1, 0);

c. Retrieve Error Rows and Generate an Error Report

  • Select rows with is_error = 1 and write them to an error file together with the row number and an error description.
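
One possible shape for that query, reusing the illustrative columns from the staging-table sketch above (a row with several problems only reports the first matching rule here; a UNION of one query per rule would list them all):

SELECT
  line_number,
  CASE
    WHEN column_1_length > 200 THEN 'column_1 exceeds 200 characters'
    WHEN column_2_length > 200 THEN 'column_2 exceeds 200 characters'
    WHEN column_1_regex = 1    THEN 'column_1 contains disallowed characters'
  END AS error_type,
  CASE
    WHEN column_1_length > 200 OR column_1_regex = 1 THEN column_1
    ELSE column_2
  END AS problematic_value
FROM temp_table
WHERE is_error = 1
ORDER BY line_number;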

d. Check for Duplicates Within the File

SELECT id_key, GROUP_CONCAT(line_number) AS duplicated_lines
FROM temp_table
GROUP BY id_key
HAVING COUNT(*) > 1;
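
If duplicated rows should also be excluded from the final insert, one way to do it (again using the illustrative staging table) is to flag them in place; the derived table avoids MySQL's restriction on updating a table while selecting from it in a subquery:

UPDATE temp_table t
JOIN (
  SELECT id_key
  FROM temp_table
  GROUP BY id_key
  HAVING COUNT(*) > 1
) dup ON dup.id_key = t.id_key
SET t.is_error = 1;   -- every line sharing a duplicated id_key is marked as an error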

e. Check for Duplicates Against the Database

Compare id_key in temp_table against the main table in your system.
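
A sketch of that comparison, assuming the production table is called main_table and shares the id_key column (both names are placeholders):

-- Report rows whose id_key already exists in the main table ...
SELECT t.line_number, t.id_key
FROM temp_table t
JOIN main_table m ON m.id_key = t.id_key;

-- ... and flag them so the final insert skips them.
UPDATE temp_table t
JOIN main_table m ON m.id_key = t.id_key
SET t.is_error = 1;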

3. Insert Valid Data into the Main Database

  • Only include rows without errors.
  • Use INSERT INTO or REPLACE INTO depending on the use case.
  • Optionally add logic for default values, audit fields, etc.
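
A minimal version of that insert, again with the placeholder table and column names used above (the created_at audit field is an assumption):

INSERT INTO main_table (id_key, column_1, column_2, created_at)
SELECT id_key, column_1, column_2, NOW()   -- defaults and audit fields can be added here
FROM temp_table
WHERE is_error = 0;                        -- only rows that passed every validation rule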

4. Clean Up Temporary Table

  • Drop the temp table after processing to free up resources.
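
The corresponding cleanup statement, using the same placeholder name:

DROP TABLE IF EXISTS temp_table;   -- releases the storage held by the staging data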

Tips & Best Practices

  • Disable indexes during large data loads and re-enable them afterward; this significantly reduces loading time (see the sketch after this list).
  • Ensure the CSV is correctly encoded (UTF-8, without BOM).
  • Avoid line-by-line validation in code — use SQL for efficient bulk validation.
  • Log all errors in one pass so users can correct them all at once, rather than iteratively.
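
For the index tip, MySQL offers ALTER TABLE ... DISABLE KEYS / ENABLE KEYS. Note that it only affects non-unique indexes on MyISAM tables; for InnoDB, a common alternative is to create the staging table without secondary indexes and add them after the load:

ALTER TABLE temp_table DISABLE KEYS;   -- skip index maintenance during the bulk load
-- ... run LOAD DATA LOCAL INFILE here ...
ALTER TABLE temp_table ENABLE KEYS;    -- rebuild the indexes in one pass afterwards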

Conclusion

Handling large file uploads isn’t just about raising size limits. You need a smart, efficient processing flow that ensures performance and scalability. With this approach, you can handle even larger files if needed.

Credit: Ton That Bach (Technical Leader in Mynavi TechTus Vietnam)
