Follow the steps below to prepare and upload your CSVs for Datatalk. This guide will help you prepare your CSVs, include table and column descriptions, and write a declaration file for Datatalk to use. This guide uses a subset of the 2023-2024 FEC data as an example, but you can use any CSVs you have.
First, gather all the CSVs you'd want to expose to Datatalk in a folder. We will refer to this folder as workdir
.
The CSV file names would be used as the table names in Datatalk. We recommend giving them meaningful names (e.g. pac_and_party_summary_2023_2024.csv
instead of csv_1.csv
).
Your workdir
should have the following structure:
workdir/ ├── pac_and_party_summary_2023_2024.csv ├── all_candidates_2023_2024.csv ├── candidate_master_2023_2024.csv └── ...Step 2 and 3 below are optional. Step 2 asks you to prepare a header file for each CSV describing the data types and descriptions of the columns, and Step 3 asks you to prepare a declaration file that describes the tables and where to find their associated header files. You can choose to skip these steps and directly jump to Step 4 to upload these CSVs to Datatalk. However, providing these files will help the LLM understand the descriptions of the tables and their columns more accurately, especically with larger datasets where column names alone may not be enough to describe the data.
It is optional to prepare a header file for each CSV. Each header file should be a CSV containing the following fields (Each field is optional. You can choose to include only the fields you need):
CMTE_ID
VARCHAR(9)
If data_type is not provided, Datatalk will infer it from the data itself. We recommend providing the data_type for large datasets where inferring the data type may take a while. description is used as context for the LLM to choose the correct column(s) to operate on.
For example, here is an example header file:column_name,data_type,description CMTE_ID,VARCHAR(9),A 9-character alpha-numeric code assigned to a committee by the Federal Election Commission. Committee IDs are unique and an ID for a specific committee always remains the same. ...You also do not need to include a header file for each CSV. If you do not provide a header file, Datatalk will infer the data type from the data itself. After you have prepared your header files, put them in your
workdir
folder. For example:
workdir/ ├── pac_and_party_summary_2023_2024.csv ├── pac_and_party_summary_2023_2024_header.csv ├── all_candidates_2023_2024.csv ├── candidate_master_2023_2024.csv └── candidate_master_2023_2024_header.csv └── ...Here,
pac_and_party_summary_2023_2024_header.csv
and candidate_master_2023_2024_header.csv
are provided, but all_candidates_2023_2024.csv
is not.
You can prepare a declaration file that describes the CSVs and their associated header files in your workdir
.
This step is optional if you skiped Step 2, but necessary if you provided header files for your CSVs.
The declaration file should be a CSV file that contains the following fields for each CSV:
workdir
. For example:
csv_1.csv
workdir
. For example:
csv_1_header.csv
,
if not provided.
None
if not provided. For example:
lambda x: x.replace(r'\|', '|')
"
if not provided.
For example, here is an example declaration file:
csv_filepath,csv_filepath_header,description,delimiter,special_processing_fcn,quotechar pac_and_party_summary_2023_2024.csv,pac_and_party_summary_2023_2024_header.csv,The PAC and party summary file contains summary financial information for each PAC and party committee. The file has one record per PAC and Party committee and shows information about the committee, total receipts and disbursements, receipts and disbursements broken down by type, contributions to other committees, independent expenditures made by the committee, and other financial summary information.,|,lambda x: x.replace(r'\|', '|'), all_candidates_2023_2024.csv,,The all candidates file contains summary financial information for each candidate who raised or spent money during the period, regardless of when they are up for election....,|,lambda x: x.replace(r'\|', '|'), candidate_master_2023_2024.csv,candidate_master_2023_2024_header.csv,The candidate master file contains basic information for each candidate, including:...,|,lambda x: x.replace(r'\|', '|'),Here, for instance, we set the delimiter to
|
and the special processing function to lambda x: x.replace(r'\|', '|')
due to the unique file format of the FEC data.
Note: If you do include a declaration file, make sure to name it to datatalk_declaration.csv
.
If there are any errors, an error message will appear. Please correct the issues and try uploading again.
This name will be used to identify your database in Datatalk. We suggest giving it a specific and descriptive name.
After uploading your file, email genie@cs.stanford.edu to request the addition of your collection. You will receive a confirmation email with details on how to access the search API once your collection is added.