    dataCommons Python API

    Introduction

    This document introduces the dataCommons Python API, which is used to query dataCommons data using Python programs.

    The Python Query API provides functions that extract structured information from dataCommons programmatically as pandas dataframes. Dataframes give access to the data processing, analysis, and visualization tools provided by packages such as Pandas, NumPy, SciPy, and Matplotlib.

    A tutorial and an introduction to the dataCommons knowledge graph are here. This example uses the APIs to query data, and pandas dataframes for common data analysis tasks.

    The code for the Python Query API is here.

    The main components of the API are:

    APIs: expand, get_populations, and get_observations. These APIs are intended to be used in sequence: expand, followed by get_populations, followed by get_observations.
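
    The intended call order can be sketched as a small helper. This is a minimal sketch rather than part of the library: build_population_table is a hypothetical name, and client is assumed to be a dataCommons client exposing the three methods with the signatures documented on this page. The argument values are the ones used in the examples in the sections below.

```python
def build_population_table(client, table):
    """Chain the three APIs in their intended order.

    Hypothetical helper: `client` is assumed to expose expand,
    get_populations and get_observations as documented on this page,
    and `table` is a dataframe with a 'country' seed column.
    """
    # Step 1: expand each country into the states it contains.
    table = client.expand(
        table,
        arc_name='containedInPlace',
        seed_col_name='country',
        new_col_name='state',
        outgoing=False,
    )
    # Step 2: attach a StatisticalPopulation of type Person to each state.
    table = client.get_populations(
        table,
        seed_col_name='state',
        new_col_name='total_pop_dcid',
        population_type='Person',
    )
    # Step 3: fetch the count observation for each population.
    table = client.get_observations(
        table,
        seed_col_name='total_pop_dcid',
        new_col_name='pop_count',
        start_date='2012-01-01',
        end_date='2016-01-01',
        measured_property='count',
        stats_type='count',
    )
    return table
```

    Each step uses the column created by the previous step as its seed column, which is why the order matters.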

    Convenience functions: These functions are commonly used in analyses of cities and states, and are provided for convenience.

    APIs

    expand

    Function Signature

    expand(self, pd_table, arc_name, seed_col_name, new_col_name, outgoing=True, max_rows=100)

    For the given dataframe pd_table, add a new column with values for the given property. The existing pandas dataframe should include a column containing entity IDs for a certain schema.org type. This function populates a new column (new_col_name) with property values for the entities and adds additional rows if a property has repeated values.

    Example

    We begin with a dataframe with a column for countries containing one row for the United States.

    data = pd.DataFrame({'country' : ['Country', 'dc/2sffw13']})

    expand() may be used to add a 'state' column populated with US states.

    data = dc.expand(data, arc_name = 'containedInPlace', seed_col_name = 'country', new_col_name = 'state', outgoing = False)

    Arguments

    self: The dataCommons client.

    pd_table: The pandas dataframe to add columns to.

    arc_name: The property to query. In other words, the relationship connecting the entities in the seed column to the new column entities.

    seed_col_name: The name of the column in the dataframe that contains entities on one side of the relationship.

    new_col_name: The name of the column to create. This column will be populated with entities on the other end of the relationship.

    outgoing: If True, the link points from the entities in the seed column to the entities in the new column. Defaults to True, meaning that by default the property named in arc_name is a property of the seed column entities.

    max_rows: The maximum number of rows to add. Defaults to 100. Increasing this number might lead to slower performance and timeouts.

    Returns the dataframe with the added column.
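
    The effect of outgoing and of repeated property values can be illustrated with a toy, library-free sketch. Everything here is hypothetical: TRIPLES is a made-up in-memory graph, the IDs are not real dcids, and toy_expand works on a list of dict rows rather than a dataframe. How the real expand() treats seeds with no matching values, or applies max_rows, is not specified on this page, so the toy behavior in those cases is an assumption.

```python
# Toy in-memory graph of (subject, property, object) triples.
# The IDs are made up for illustration; they are not real dcids.
TRIPLES = [
    ('state/A', 'containedInPlace', 'country/X'),
    ('state/B', 'containedInPlace', 'country/X'),
    ('city/1', 'containedInPlace', 'state/A'),
]


def toy_expand(rows, arc_name, seed_col_name, new_col_name,
               outgoing=True, max_rows=100):
    """Mimic expand() on a list of dict rows instead of a dataframe."""
    result = []
    for row in rows:
        seed = row[seed_col_name]
        for subj, prop, obj in TRIPLES:
            if prop != arc_name:
                continue
            # outgoing=True: the seed is the subject, follow the arc forward.
            # outgoing=False: the seed is the object, follow the arc backward.
            if outgoing and subj == seed:
                value = obj
            elif not outgoing and obj == seed:
                value = subj
            else:
                continue
            # One output row per matching value, so repeated values add rows.
            result.append({**row, new_col_name: value})
            if len(result) >= max_rows:
                return result
    return result


rows = [{'country': 'country/X'}]
states = toy_expand(rows, 'containedInPlace', 'country', 'state',
                    outgoing=False)
```

    Here containedInPlace points from a state to its country, so starting from a country requires outgoing=False. The single seed row becomes two rows in states, one per contained state.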

    get_populations

    Function Signature

    get_populations(self, pd_table, seed_col_name, new_col_name, population_type, max_rows=100, **kwargs)

    Creates a new column in the dataframe for the described StatisticalPopulation.

    Arguments

    self: The dataCommons client.

    pd_table: The pandas dataframe to add columns to.

    seed_col_name: The name of the column in the dataframe that contains the entities to query.

    new_col_name: The name of the column to create. This column will be populated with StatisticalPopulation entities.

    population_type: The schema.org type associated with the StatisticalPopulation. For example, if interested in statistics about people, use "Person".

    max_rows: The maximum number of rows to add.

    kwargs: The properties to use when defining the StatisticalPopulation. For example, gender="Female" to limit the population to women.

    Returns the dataframe with the added column.

    Example

    Add a column with StatisticalPopulations with populationType 'Person'. This is done for each row in the 'state' column we created earlier. We call the new column 'total_pop_dcid'.

    data = dc.get_populations(data, seed_col_name = 'state', new_col_name = 'total_pop_dcid', population_type = 'Person')

    This adds a 'total_pop_dcid' column containing the population associated with each entry in the 'state' column.
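
    The kwargs argument relies on ordinary Python keyword arguments: each extra keyword becomes one property constraint on the population. The sketch below shows only that mechanism. describe_population is a hypothetical helper, not a library function, and the dictionary layout it returns is an assumption made for illustration.

```python
def describe_population(population_type, **kwargs):
    """Collect the constraints that define a population (hypothetical helper)."""
    # The population type is always present.
    constraints = {'populationType': population_type}
    # Every extra keyword, e.g. gender='Female', narrows the population.
    constraints.update(kwargs)
    return constraints


everyone = describe_population('Person')
women = describe_population('Person', gender='Female')
```

    With no extra keywords the population is all entities of the given type; passing gender='Female' narrows it to women, as in the kwargs description above.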

    get_observations

    Function Signature

    get_observations(self, pd_table, seed_col_name, new_col_name, start_date, end_date, measured_property, stats_type, max_rows=100)

    Create a new column with values for an observation of the given property. The existing pandas dataframe (pd_table) should include a column (seed_col_name) containing entity IDs for a certain schema.org type. This function populates a new column (new_col_name) with property values (measured_property and stats_type) for the entities. The statistics cover the period between start_date and end_date.

    Arguments

    pd_table: Pandas dataframe that contains entity information.

    seed_col_name: The column that contains the population dcid.

    new_col_name: New column name.

    start_date: The start date of the observation, for example '2012-01-01'.

    end_date: The end date of the observation, for example '2016-01-01'.

    measured_property: The measured property of the observation.

    stats_type: The statistical type, such as "Median".

    max_rows: The maximum number of rows returned by the query results.

    kwargs: keyword properties to define the population.

    Returns the dataframe with the added column.

    Example

    get_observations is used to query for observations associated with the StatisticalPopulation entities. For this example, we're interested in the median age and total population count. Many statistical measures (e.g. median) are stored as properties in the graph. We use the get_observations function to get the total count and the median age for each population referred to in our total_pop_dcid column. For our purposes, we filter the data to only include statistics from 2012-2016. As an example, compare the function calls below to the browser page for the population of 'Person' in 'Ohio'. At the bottom of that page, you can find the Observation nodes for 'median_age' and 'count'.

    Add a 'pop_count' column representing the count of the data in the total_pop_dcid column we created earlier.

    data = dc.get_observations(data, seed_col_name = 'total_pop_dcid', new_col_name = 'pop_count', start_date = '2012-01-01', end_date = '2016-01-01', measured_property = 'count', stats_type = 'count')

    Add a 'median_age' column representing the median of the age data in the total_pop_dcid column we created earlier.

    data = dc.get_observations(data, 'total_pop_dcid', 'median_age', '2012-01-01', '2016-01-01', 'age', 'median')
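
    When several observation columns are needed, the repeated calls can be driven from a list. This is a minimal sketch under stated assumptions: add_observations and OBSERVATION_SPECS are hypothetical names, and client is assumed to be a dataCommons client whose get_observations method matches the signature above.

```python
# Each entry: (new_col_name, measured_property, stats_type),
# taken from the two example calls in this section.
OBSERVATION_SPECS = [
    ('pop_count', 'count', 'count'),
    ('median_age', 'age', 'median'),
]


def add_observations(client, table, specs,
                     start_date='2012-01-01', end_date='2016-01-01'):
    """Add one observation column per spec (hypothetical helper)."""
    for new_col_name, measured_property, stats_type in specs:
        # Every call reuses the population column as the seed column.
        table = client.get_observations(
            table,
            seed_col_name='total_pop_dcid',
            new_col_name=new_col_name,
            start_date=start_date,
            end_date=end_date,
            measured_property=measured_property,
            stats_type=stats_type,
        )
    return table
```

    Calling add_observations(dc, data, OBSERVATION_SPECS) would then issue the same two queries as the example calls, in the same order.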

    Convenience Functions

    These functions are commonly used in analyses of cities and states, and are provided for convenience.

    get_cities

    get_cities(self, state, new_col_name, max_rows=100)

    Get a list of city dcids in a given state.

    Args:

    Returns: A pandas.DataFrame with city dcids.

    get_states

    get_states(self, country, new_col_name, max_rows=100)

    Get a list of state dcids.

    Args:

    Returns: