Upload Your CSVs for Datatalk

Follow the steps below to prepare and upload your CSVs for Datatalk. This guide will help you prepare your CSVs, include table and column descriptions, and write a declaration file for Datatalk to use. This guide uses a subset of the 2023-2024 FEC data as an example, but you can use any CSVs you have.

Step 1: Put all CSVs in a folder

First, gather all the CSVs you'd want to expose to Datatalk in a folder. We will refer to this folder as workdir. The CSV file names would be used as the table names in Datatalk. We recommend giving them meaningful names (e.g. pac_and_party_summary_2023_2024.csv instead of csv_1.csv). Your workdir should have the following structure:

workdir/
├── pac_and_party_summary_2023_2024.csv
├── all_candidates_2023_2024.csv
├── candidate_master_2023_2024.csv
└── ...

Step 2 and 3 below are optional. Step 2 asks you to prepare a header file for each CSV describing the data types and descriptions of the columns, and Step 3 asks you to prepare a declaration file that describes the tables and where to find their associated header files.
You can choose to skip these steps and directly jump to Step 4 to upload these CSVs to Datatalk. However, providing these files will help the LLM understand the descriptions of the tables and their columns more accurately, especically with larger datasets where column names alone may not be enough to describe the data.

Step 2: Prepare a header file for each CSV (Optional)

It is optional to prepare a header file for each CSV. Each header file should be a CSV containing the following fields (Each field is optional. You can choose to include only the fields you need):

  • column_name, for example: CMTE_ID
  • data_type, which can be either in ORACLE or POSTGRESQL format, for example: VARCHAR(9)
  • description, for example:
    A 9-character alpha-numeric code assigned to a committee by the Federal Election Commission. Committee IDs are unique and an ID for a specific committee always remains the same.

  • If data_type is not provided, Datatalk will infer it from the data itself. We recommend providing the data_type for large datasets where inferring the data type may take a while. description is used as context for the LLM to choose the correct column(s) to operate on.

    For example, here is an example header file:
    column_name,data_type,description
    CMTE_ID,VARCHAR(9),A 9-character alpha-numeric code assigned to a committee by the Federal Election Commission. Committee IDs are unique and an ID for a specific committee always remains the same.
    ...
    

    You also do not need to include a header file for each CSV. If you do not provide a header file, Datatalk will infer the data type from the data itself. After you have prepared your header files, put them in your workdir folder. For example:
    workdir/
    ├── pac_and_party_summary_2023_2024.csv
    ├── pac_and_party_summary_2023_2024_header.csv
    ├── all_candidates_2023_2024.csv
    ├── candidate_master_2023_2024.csv
    └── candidate_master_2023_2024_header.csv
    └── ...
    

    Here, pac_and_party_summary_2023_2024_header.csv and candidate_master_2023_2024_header.csv are provided, but all_candidates_2023_2024.csv is not.
    Step 3: Prepare a declaration file (Optional)

    You can prepare a declaration file that describes the CSVs and their associated header files in your workdir. This step is optional if you skiped Step 2, but necessary if you provided header files for your CSVs. The declaration file should be a CSV file that contains the following fields for each CSV:

  • csv_filepath, the name of the CSV file in your workdir. For example: csv_1.csv
  • csv_filepath_header, the name of the header file for the CSV file in your workdir. For example: csv_1_header.csv
  • (Optional) description, a description for the table represented by the CSV file. For example:
    The PAC and party summary file contains summary financial information for each PAC and party committee. The file has one record per PAC and Party committee and shows information about the committee, total receipts and disbursements, receipts and disbursements broken down by type, contributions to other committees, independent expenditures made by the committee, and other financial summary information.
  • (Optional) delimiter, the delimiter for the CSV file. Defaults to , if not provided.
  • (Optional) special_processing_fcn, A special processing function (in Python) to apply to the CSV file if your CSV contains special characters. Defaults to None if not provided. For example: lambda x: x.replace(r'\|', '|')
  • (Optional) quotechar, the quote character for the CSV file. Defaults to " if not provided.

  • For example, here is an example declaration file:

    csv_filepath,csv_filepath_header,description,delimiter,special_processing_fcn,quotechar
    pac_and_party_summary_2023_2024.csv,pac_and_party_summary_2023_2024_header.csv,The PAC and party summary file contains summary financial information for each PAC and party committee. The file has one record per PAC and Party committee and shows information about the committee, total receipts and disbursements, receipts and disbursements broken down by type, contributions to other committees, independent expenditures made by the committee, and other financial summary information.,|,lambda x: x.replace(r'\|', '|'),
    all_candidates_2023_2024.csv,,The all candidates file contains summary financial information for each candidate who raised or spent money during the period, regardless of when they are up for election....,|,lambda x: x.replace(r'\|', '|'),
    candidate_master_2023_2024.csv,candidate_master_2023_2024_header.csv,The candidate master file contains basic information for each candidate, including:...,|,lambda x: x.replace(r'\|', '|'),
                            


    Here, for instance, we set the delimiter to | and the special processing function to lambda x: x.replace(r'\|', '|') due to the unique file format of the FEC data. Note: If you do include a declaration file, make sure to name it to datatalk_declaration.csv.
    Step 4: Upload Your CSVs Here

    If there are any errors, an error message will appear. Please correct the issues and try uploading again.

    Selected Files:

    This name will be used to identify your database in Datatalk. We suggest giving it a specific and descriptive name.

    Step 5s: Request Collection Addition

    After uploading your file, email genie@cs.stanford.edu to request the addition of your collection. You will receive a confirmation email with details on how to access the search API once your collection is added.