In today’s data-driven world, handling large JSON files is a common task for developers and data analysts. JSON (JavaScript Object Notation) files are often used to store structured data, but once they grow too large they become difficult to work with efficiently. One effective solution is to split a big JSON file into multiple smaller CSV files, which are easier to manage and can be imported into most data analysis tools. In this guide, we will walk through the process step by step.
Full Video Tutorial
Prerequisites
Before we begin, ensure that you have the following prerequisites in place:
- A big JSON file that you want to split.
- Python installed on your system.
- Pip, a package manager for Python.
- Two required libraries: tqdm and pandas.
Step 1: Install Python
If you haven’t already installed Python, visit the official Python website (https://www.python.org/downloads/) and download the latest version for your operating system.
Step 2: Install Pip and Required Libraries
After installing Python, make sure Pip, Python’s package manager, is available. Recent Python installers include Pip by default; if yours does not, open your command prompt or terminal and run:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
Then install the two required libraries:
pip install tqdm
pip install pandas
These commands install tqdm (for progress bars) and pandas (for data handling and CSV export), both of which the script depends on.
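To confirm the installation worked, you can try importing both libraries (a quick optional check, not part of the tutorial's script):

```shell
python3 -c "import tqdm, pandas; print('libraries OK')"
```

If this prints an ImportError instead, repeat the pip install commands above.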
Step 3: Create a Python File
Now, navigate to the directory where your big JSON file is located and create a new Python file named split.py there. You can use any text editor or integrated development environment (IDE) to create this file.
Step 4: Copy and Paste Python Code
In the split.py file, copy and paste whichever of the following two methods suits your data:
Method 1 (for processing big simple JSON data without nested elements):
An example of a simple data structure:
[
{"user_id": 1, "user_name": "John", "age": 30, "gender": "Male", "location": "New York"},
{"user_id": 2, "user_name": "Alice", "age": 25, "gender": "Female", "location": "Los Angeles"},
{"user_id": 3, "user_name": "Bob", "age": 35, "gender": "Male", "location": "Chicago"},
{"user_id": 4, "user_name": "Eva", "age": 28, "gender": "Female", "location": "Houston"}
]
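To see what the script will do with records like these, here is a minimal sketch (using hypothetical sample data matching the structure above) of how pandas turns a list of flat dictionaries into CSV text:

```python
import pandas as pd

# Hypothetical sample records with the same flat structure as above
records = [
    {"user_id": 1, "user_name": "John", "age": 30, "gender": "Male", "location": "New York"},
    {"user_id": 2, "user_name": "Alice", "age": 25, "gender": "Female", "location": "Los Angeles"},
]

# pandas maps each dictionary key to a CSV column, one row per record
csv_text = pd.DataFrame(records).to_csv(index=False)
print(csv_text)
```

The first line of the output is the header (user_id,user_name,age,gender,location), followed by one comma-separated row per record.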
import json
import os

import pandas as pd
from tqdm import tqdm

# Define the path to the JSON file
json_file_path = "data.json"

# Define the maximum number of records per CSV file
records_per_csv = 50000

# Create the output folder if it doesn't exist
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)

# Get the total number of lines in the JSON file (for the progress bar)
with open(json_file_path, "r", encoding="utf-8") as json_file:
    total_lines = sum(1 for _ in json_file)

# Open the JSON file for reading
with open(json_file_path, "r", encoding="utf-8") as json_file:
    # Create a JSON parser; this method expects one JSON object per line
    json_parser = json.JSONDecoder()

    # Initialize variables for batch processing
    current_csv_index = 1
    current_records_count = 0
    current_csv_data = []

    # Create a progress bar
    progress_bar = tqdm(total=total_lines)

    while True:
        line = json_file.readline()
        if not line:
            break  # End of file
        progress_bar.update(1)
        if not line.strip():
            continue  # Skip empty lines
        try:
            # Parse the JSON object on the current line
            record = json_parser.raw_decode(line.lstrip())[0]
            current_csv_data.append(record)
        except json.JSONDecodeError:
            print(f"Skipping invalid JSON line: {line}")
            continue  # Skip invalid lines and continue with the next one

        # Write a CSV file once the batch is full
        current_records_count += 1
        if current_records_count >= records_per_csv:
            df = pd.DataFrame(current_csv_data)
            csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
            df.to_csv(csv_file_path, index=False)
            # Reset variables for the next batch
            current_csv_index += 1
            current_records_count = 0
            current_csv_data = []

    # Write any remaining records in the last batch
    if current_csv_data:
        df = pd.DataFrame(current_csv_data)
        csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
        df.to_csv(csv_file_path, index=False)

    # Close the progress bar
    progress_bar.close()
Method 2 (for handling JSON data with nested elements; note that it loads the whole file into memory):
An example of a nested array structure:
{
"users": [
{
"user_id": 1,
"user_name": "John",
"age": 30,
"gender": "Male",
"location": "New York",
"date": "2023-09-30",
"vehicles": [
{
"vehicle_id": 101,
"make": "Toyota",
"model": "Camry",
"year": 2022
},
{
"vehicle_id": 102,
"make": "Honda",
"model": "Civic",
"year": 2021
}
]
},
{
"user_id": 2,
"user_name": "Alice",
"age": 25,
"gender": "Female",
"location": "Los Angeles",
"date": "2023-09-30",
"vehicles": [
{
"vehicle_id": 103,
"make": "Ford",
"model": "Escape",
"year": 2022
}
]
},
{
"user_id": 3,
"user_name": "Bob",
"age": 35,
"gender": "Male",
"location": "Chicago",
"date": "2023-09-30",
"vehicles": []
},
{
"user_id": 4,
"user_name": "Eva",
"age": 28,
"gender": "Female",
"location": "Houston",
"date": "2023-09-30",
"vehicles": [
{
"vehicle_id": 104,
"make": "Tesla",
"model": "Model 3",
"year": 2023
}
]
}
]
}
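Method 2 relies on pandas’ json_normalize, which flattens nested dictionaries into columns but leaves lists (like vehicles above) in a single column. If you instead want one output row per nested item, json_normalize accepts a record_path argument; a minimal sketch with a hypothetical single-user record:

```python
import pandas as pd

# Hypothetical record shaped like one entry of the "users" array above
record = {
    "user_id": 1,
    "user_name": "John",
    "vehicles": [
        {"vehicle_id": 101, "make": "Toyota"},
        {"vehicle_id": 102, "make": "Honda"},
    ],
}

# One output row per vehicle, with the user fields repeated via `meta`
df = pd.json_normalize(record, record_path="vehicles", meta=["user_id", "user_name"])
print(df.to_csv(index=False))
```

This produces two rows (one per vehicle) with columns vehicle_id, make, user_id, user_name, whereas the script below keeps one row per user and stores the vehicles list in a single column.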
import json
import os

import pandas as pd
from tqdm import tqdm

# Define the path to the JSON file
json_file_path = "data.json"

# Define the maximum number of records per CSV file
records_per_csv = 1000

# Create the output folder if it doesn't exist
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)

# Initialize variables for batch processing
current_csv_index = 1
current_records_count = 0
current_csv_data = []
progress_bar = None

# Load the whole JSON file into memory
with open(json_file_path, "r", encoding="utf-8") as json_file:
    data = json.load(json_file)

# Work out where the list of records lives
if isinstance(data, list):
    # A top-level list is assumed to be a list of JSON objects
    records = data
elif isinstance(data, dict) and len(data) == 1:
    # A dictionary with a single key is assumed to be a parent object
    # wrapping the list of records (e.g. {"users": [...]})
    parent_key = list(data.keys())[0]
    records = data[parent_key]
else:
    records = []

if isinstance(records, list):
    # Create a progress bar
    progress_bar = tqdm(total=len(records))
    for record in records:
        progress_bar.update(1)
        # Append the record to the current CSV batch
        current_csv_data.append(record)
        current_records_count += 1
        # Write a CSV file once the batch is full
        if current_records_count >= records_per_csv:
            # json_normalize flattens nested objects into columns
            df = pd.json_normalize(current_csv_data)
            csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
            df.to_csv(csv_file_path, index=False)
            # Reset variables for the next batch
            current_csv_index += 1
            current_records_count = 0
            current_csv_data = []

# Write any remaining records in the last batch
if current_csv_data:
    df = pd.json_normalize(current_csv_data)
    csv_file_path = os.path.join(output_folder, f"part_{current_csv_index}.csv")
    df.to_csv(csv_file_path, index=False)

# Close the progress bar
if progress_bar:
    progress_bar.close()
Step 5: Adjust the Variables
Change the json_file_path variable to match the name of your JSON file, and set the records_per_csv variable to the number of records you want in each CSV file.
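As a rough sizing guide, the number of part files produced is the total record count divided by records_per_csv, rounded up (the figures below are hypothetical):

```python
import math

total_records = 120_000   # hypothetical number of records in your JSON file
records_per_csv = 50_000  # the value set in split.py

# Each full batch becomes one CSV; a partial final batch adds one more
num_files = math.ceil(total_records / records_per_csv)
print(num_files)  # 3 part files: part_1.csv .. part_3.csv
```

Pick a records_per_csv small enough that each part opens comfortably in your analysis tool.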
Step 6: Run the Python Script
Open your terminal or command prompt, navigate to the directory where your split.py file is located, and run the script using one of the following commands:
python3 split.py
Or
python split.py
This will execute the script and split the JSON file into multiple CSV files; wait until the process completes.
Step 7: Check the Output Directory
After running the script, check the specified output directory for the generated CSV files. You can now use these smaller CSV files for your data analysis tasks.
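If you want a quick sanity check that no records were lost, you can sum the row counts across the generated parts. A sketch of the idea, which builds two small hypothetical part files in a temporary folder to stand in for the script's output:

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in for the script's output folder, with two hypothetical parts
output_folder = tempfile.mkdtemp()
pd.DataFrame({"user_id": [1, 2]}).to_csv(os.path.join(output_folder, "part_1.csv"), index=False)
pd.DataFrame({"user_id": [3]}).to_csv(os.path.join(output_folder, "part_2.csv"), index=False)

# The total across all parts should match the record count in the source JSON
total_rows = sum(len(pd.read_csv(path)) for path in glob.glob(os.path.join(output_folder, "part_*.csv")))
print(total_rows)  # 3
```

Point output_folder at the script's real "output" directory to run the same check on your data.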