RGB Coders

Split a Big JSON File into Multiple CSV Files

In today’s data-driven world, handling large JSON files is a common task for many developers and data analysts. JSON (JavaScript Object Notation) files are often used to store structured data, but when these files become too large, it can be challenging to work with them efficiently. One effective solution is to split a big JSON file into multiple CSV files, which are more manageable and can be easily imported into various data analysis tools. In this article, we will walk you through the process of splitting a big JSON file into multiple CSV files, step by step.

In this guide, we will demonstrate how to do this in a few simple steps.

Full Video Tutorial

Prerequisites

Before we begin, ensure that you have the following prerequisites in place:

  • A big JSON file that you want to split.
  • Python installed on your system.
  • Pip, a package manager for Python.
  • Two required libraries: tqdm and pandas.

Step 1: Install Python

If you haven’t already installed Python, visit the official Python website (https://www.python.org/downloads/) and download the latest version for your operating system.

Step 2: Install Pip and Required Libraries

After installing Python, you need to install Pip, the package manager for Python, along with the required libraries. (Recent Python installers already bundle Pip, in which case you can skip the first two commands.) Open your command prompt or terminal and run the following commands:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
pip install tqdm
pip install pandas

These commands install Pip and the two libraries the script depends on, tqdm (for the progress bar) and pandas (for writing the CSV files).

Step 3: Create a Python File

Now, navigate to the directory where your big JSON file is located. Create a new Python file named split.py in the same directory. You can use any text editor or integrated development environment (IDE) to create this file.

Step 4: Copy and Paste Python Code

In the split.py file, copy and paste the following Python code:

Method 1 (for large, flat JSON data without nested elements — the script reads the file line by line, so it works best when each record sits on its own line, as in JSON Lines):

An example of a simple data structure:

[
  {"user_id": 1, "user_name": "John", "age": 30, "gender": "Male", "location": "New York"},
  {"user_id": 2, "user_name": "Alice", "age": 25, "gender": "Female", "location": "Los Angeles"},
  {"user_id": 3, "user_name": "Bob", "age": 35, "gender": "Male", "location": "Chicago"},
  {"user_id": 4, "user_name": "Eva", "age": 28, "gender": "Female", "location": "Houston"}
]

import json
import os
import pandas as pd
from tqdm import tqdm

# Define the path to the JSON file
json_file_path = "data.json"
# Define the maximum number of records per CSV file
records_per_csv = 50000
# Create the output folder if it doesn't exist
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)

# Get the total number of lines in the JSON file
with open(json_file_path, "r", encoding="utf-8") as json_file:
    total_lines = sum(1 for _ in json_file)

# Open the JSON file for reading
with open(json_file_path, "r", encoding="utf-8") as json_file:
    # Create a JSON parser
    json_parser = json.JSONDecoder()
    
    # Initialize variables for batch processing
    current_csv_index = 1
    current_records_count = 0
    current_csv_data = []
    
    # Create a progress bar
    progress_bar = tqdm(total=total_lines)
    
    while True:
        # Read a line from the JSON file
        line = json_file.readline()
        if not line:
            break  # End of file (readline returns an empty string)
        
        # Update the progress bar
        progress_bar.update(1)
        
        if not line.strip():
            continue  # Skip blank lines
        
        try:
            # Parse the JSON data for the current line
            record = json_parser.raw_decode(line)[0]
            current_csv_data.append(record)  # Add the record to the batch
        except json.JSONDecodeError:
            print(f"Skipping invalid JSON line: {line}")
            continue  # Skip invalid lines and continue with the next line

        # Check if the maximum number of records per CSV is reached
        current_records_count += 1
        if current_records_count >= records_per_csv:
            # If batch is ready, convert it to a DataFrame and write to CSV
            if current_csv_data:  # Only process if there is data
                df = pd.DataFrame(current_csv_data)
                csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
                df.to_csv(csv_file_path, index=False)
                
                # Reset variables for the next batch
                current_csv_index += 1
                current_records_count = 0
                current_csv_data = []

    # Check if there are remaining records in the last batch
    if current_csv_data:
        df = pd.DataFrame(current_csv_data)
        csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
        df.to_csv(csv_file_path, index=False)

    # Close the progress bar
    progress_bar.close()
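Before running Method 1 on your real data, you can verify the batching behavior on a tiny sample. The sketch below writes a small JSON Lines file and splits it the same way the script does; the names sample.jsonl, sample_output, and the batch size of 3 are illustrative placeholders:

```python
import json
import os
import pandas as pd

# Write a small JSON Lines sample: one record per line
records = [{"user_id": i, "user_name": f"user{i}"} for i in range(1, 8)]
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Split into CSV files of at most 3 records each, mirroring the script's batching
os.makedirs("sample_output", exist_ok=True)
batch_size = 3
for index, start in enumerate(range(0, len(records), batch_size), start=1):
    df = pd.DataFrame(records[start:start + batch_size])
    df.to_csv(os.path.join("sample_output", f"part_{index}.csv"), index=False)

print(sorted(os.listdir("sample_output")))  # ['part_1.csv', 'part_2.csv', 'part_3.csv']
```

With 7 records and a batch size of 3, you should see three parts: two with 3 rows and a final one with the remaining row.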

Method 2 (for handling JSON data with nested elements):

An example of a nested array structure:

{
  "users": [
    {
      "user_id": 1,
      "user_name": "John",
      "age": 30,
      "gender": "Male",
      "location": "New York",
      "date": "2023-09-30",
      "vehicles": [
        {
          "vehicle_id": 101,
          "make": "Toyota",
          "model": "Camry",
          "year": 2022
        },
        {
          "vehicle_id": 102,
          "make": "Honda",
          "model": "Civic",
          "year": 2021
        }
      ]
    },
    {
      "user_id": 2,
      "user_name": "Alice",
      "age": 25,
      "gender": "Female",
      "location": "Los Angeles",
      "date": "2023-09-30",
      "vehicles": [
        {
          "vehicle_id": 103,
          "make": "Ford",
          "model": "Escape",
          "year": 2022
        }
      ]
    },
    {
      "user_id": 3,
      "user_name": "Bob",
      "age": 35,
      "gender": "Male",
      "location": "Chicago",
      "date": "2023-09-30",
      "vehicles": []
    },
    {
      "user_id": 4,
      "user_name": "Eva",
      "age": 28,
      "gender": "Female",
      "location": "Houston",
      "date": "2023-09-30",
      "vehicles": [
        {
          "vehicle_id": 104,
          "make": "Tesla",
          "model": "Model 3",
          "year": 2023
        }
      ]
    }
  ]
}

import json
import os
import pandas as pd
from tqdm import tqdm

# Define the path to the JSON file
json_file_path = "data.json"

# Define the maximum number of records per CSV file
records_per_csv = 1000

# Create the output folder if it doesn't exist
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)

# Create a progress bar
progress_bar = None

# Initialize variables
current_csv_index = 1
current_records_count = 0
current_csv_data = []

# Open the JSON file
with open(json_file_path, "r", encoding="utf-8") as json_file:
    data = json.load(json_file)

    if isinstance(data, list):
        # If the JSON data is a list, assume it's a list of JSON objects
        for record in data:
            # Update the progress bar
            if progress_bar is None:
                progress_bar = tqdm(total=len(data))
            progress_bar.update(1)

            # Append the record to the current CSV batch
            current_csv_data.append(record)

            # Check if the maximum number of records per CSV is reached
            current_records_count += 1
            if current_records_count >= records_per_csv:
                # Convert the current batch to a DataFrame using pandas
                df = pd.json_normalize(current_csv_data)

                # Export the DataFrame to CSV in the output folder
                csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
                df.to_csv(csv_file_path, index=False)

                # Reset variables for the next batch
                current_csv_index += 1
                current_records_count = 0
                current_csv_data = []

    elif isinstance(data, dict):
        # If the JSON data is a dictionary, check if there's only one key (parent object)
        if len(data) == 1:
            parent_key = list(data.keys())[0]
            parent_data = data[parent_key]

            if isinstance(parent_data, list):
                # If the parent data is a list, assume it contains child objects
                for child_record in parent_data:
                    # Update the progress bar
                    if progress_bar is None:
                        progress_bar = tqdm(total=len(parent_data))
                    progress_bar.update(1)

                    # Append the child record to the current CSV batch
                    current_csv_data.append(child_record)

                    # Check if the maximum number of records per CSV is reached
                    current_records_count += 1
                    if current_records_count >= records_per_csv:
                        # Convert the current batch to a DataFrame using pandas
                        df = pd.json_normalize(current_csv_data)

                        # Export the DataFrame to CSV in the output folder
                        csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
                        df.to_csv(csv_file_path, index=False)

                        # Reset variables for the next batch
                        current_csv_index += 1
                        current_records_count = 0
                        current_csv_data = []

# Check if there are remaining records in the last batch
if current_csv_data:
    # Convert the remaining batch to a DataFrame using pandas
    df = pd.json_normalize(current_csv_data)

    # Export the DataFrame to CSV in the output folder
    csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
    df.to_csv(csv_file_path, index=False)

# Close the progress bar
if progress_bar:
    progress_bar.close()
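One thing to be aware of with Method 2: pd.json_normalize flattens nested dictionaries into dotted columns, but a nested list such as vehicles stays as a single column holding the whole list. If you instead want one CSV row per vehicle, json_normalize accepts record_path and meta parameters. A minimal sketch (the sample user record is illustrative):

```python
import pandas as pd

user = {
    "user_id": 1,
    "user_name": "John",
    "vehicles": [
        {"vehicle_id": 101, "make": "Toyota"},
        {"vehicle_id": 102, "make": "Honda"},
    ],
}

# json_normalize alone keeps the vehicles list in a single column...
flat = pd.json_normalize([user])
print(flat.columns.tolist())  # ['user_id', 'user_name', 'vehicles']

# ...while record_path/meta expands one row per vehicle
per_vehicle = pd.json_normalize([user], record_path="vehicles",
                                meta=["user_id", "user_name"])
print(per_vehicle.columns.tolist())  # ['vehicle_id', 'make', 'user_id', 'user_name']
```

Which shape you want depends on how the CSV files will be consumed downstream.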

Step 5: Update the Variables

Set the json_file_path variable to the name of your JSON file.

Set the records_per_csv variable to the number of records you want in each CSV file.

Step 6: Run the Python Script

Open your terminal or command prompt, navigate to the directory where your split.py file is located, and run the Python script using one of the following commands:

python3 split.py

Or

python split.py

This will execute the script and split the JSON file into multiple CSV files.

Wait until the process is completed.

Step 7: Check the Output Directory

After running the script, check the specified output directory for the generated CSV files. You can now use these smaller CSV files for your data analysis tasks.
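If you later need all the data in a single DataFrame again, the parts can be read back and concatenated. A sketch assuming the output folder and part_*.csv naming used by the scripts above (the two sample parts are created here only so the example runs on its own):

```python
import glob
import os
import pandas as pd

# Create two small parts so the example is self-contained
os.makedirs("output", exist_ok=True)
pd.DataFrame({"user_id": [1, 2]}).to_csv("output/part_1.csv", index=False)
pd.DataFrame({"user_id": [3]}).to_csv("output/part_2.csv", index=False)

# Collect every part and stitch them back into one DataFrame
part_files = sorted(glob.glob(os.path.join("output", "part_*.csv")))
combined = pd.concat((pd.read_csv(p) for p in part_files), ignore_index=True)
print(len(combined))  # 3
```

Note that sorted() orders the names lexicographically, so part_10.csv would sort before part_2.csv; for many parts, zero-padded names or a numeric sort key avoid surprises.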
