About the Data

Point map of Airbnb listings, color-coded by room type, in Los Angeles County.
Fig 1. “Los Angeles,” Inside Airbnb. Accessed 28 Feb. 2026.

The Inside Airbnb Los Angeles dataset was last scraped, and subsequently updated, on December 4, 2025 and is the archive version we will be referencing throughout our analysis. The datasets are organized by the primary platform objects, with information for each stored in five unique files: listings.csv.gz, calendar.csv.gz, reviews.csv.gz, neighborhoods.csv, and neighborhoods.geojson. For this project, we utilized the singular listings.csv.gz file, which contains information on each individual case of an Airbnb listing in the area. Prior to cleaning, it included 85 columns dedicated to listing-level and host-level details that Airbnb displays publicly such as listing identifiers, host identifiers and attributes, listing description, link, and geographical indicators. Included are additional listing attributes like property type, availability, capacity, minimum and maximum duration, and review-related measures. After cleaning with R, the dataset was drastically reduced to 10 columns that were relevant to producing our map and answering our research question, used concurrently with our second primary source. Additionally, columns were cleaned to change NA values to the default 0.

With this dataset, we can categorize Airbnb listings into a subdivision county, or CCD, and analyze concentration patterns in Los Angeles County using relevant columns. Using the neighbourhood_cleansed column, we can identify which areas at the neighborhood level contain the highest amount of active listings as well as determine its encompassing CCD. Using latitude and longitude, we create a dot density map to visualize spatial clusters and gradients, as seen here. The room_type column is comprised of two main listing types: entire-homes, which represent the rental of a whole dwelling, and private rooms which represent partial-home rentals. Finally, the reviews_per_month column gives us an idea of guest turnover and activity levels.

Among the five files pertaining to LA County and Airbnb’s in the area, information that would be pertinent in our understanding of the human experience is not included. The files contain no information about guest or resident demographics such as age, race, ethnicity, income, or household size. Additionally, the creator of Inside Airbnb launched this project in an attempt to expose the negative impacts of short-term rentals on residential communities. We encountered an issue with the absence of these variables in the datasets, despite the project’s purpose, as we cannot determine which community profiles align with Airbnb concentrations. As for the listings.csv.gz file, it does not contain cost details and inhibits our ability to analyze the financials of a listing, which would otherwise be the clearest indicator of how housing is an appraised commodity.

Due to the nature of a dataset being only one facet of information to capture a complex system or occurrence, the way information is transformed to be stored as “data” ideologically guides what we acknowledge and what we ignore. The main datasets that are selected to be displayed by Inside Airbnb are (1) listing and host information in listings.csv.gz, (2) listing-by-day records (availability, price, and rules by date) in calendar.csv.gz, (3) guest reviews in reviews.csv.gz, and (4) neighbourhood names and boundaries in neighbourhoods.csv and neighbourhoods.geojson. The ontology establishes an Airbnb listing as the principal unit of how we perceive the world, with all datasets serving as a detailed extension of that unit. Because the dataset centers listings, hosts, and platform-visible signals, it does not include off-platform consequences that have a direct impact on regular people such as increased displacement from unaffordable housing and decreased diversity in communities. If this dataset were our only source, we could determine patterns of Airbnb concentration across Los Angeles County, but we would miss many off-platform effects such as which communities are afflicted and who is ultimately accommodating the tourism.

Fig 2. “Los Angeles County, California,” United States Census Bureau. Accessed 28 Feb. 2026.

The ACS datasets are composed of estimates organized into tables at various geographic levels from state down to ZIP codes and for different types of data. For this project, we use ACS tables at the County Census Division (CCD) level, which offers a meaningful middle ground between the overly broad county level and the overly granular census tract level. Additionally, we selected the tables that would best pertain to our overarching question and finalized the following 3 tables. For tables DP05 and S1901, we extracted estimates and margins of error (MOE) while disregarding the percent estimates and MOE due to redundancy. For table S1501, we extracted only estimates and MOE for race breakdown tabulations and extracted only percent estimates and MOE for poverty status due to lack of tabulation for the latter. Included below are the rows that we have excluded from each table due to the lack of relevance in answering our research question.

With this collection of datasets, we can construct a demographic and socioeconomic profile of a community. We are able to analyze the age and sex distribution of a population as well as its racial and ethnic composition using DP05. With S1901, we can examine the income distribution of communities, identifying areas with severe cost burdens where household incomes are able to afford less compared to their counterparts in surrounding areas. Adding S1501 allows us to explore correlations between educational attainment levels and economic indicators, such as whether areas with higher levels of educational achievement also have higher median incomes or different Airbnb commercialization patterns. We can also compare these characteristics across subdivision counties, observing how a county’s profile differs from that of others. 

However, because the ACS is a sample survey, all figures are estimates with associated margins of error (MOE), not exact counts. This means that small differences between two areas or small changes over time may not be statistically significant as they fall within the 95% confidence interval established by the MOE. To protect respondent confidentiality, the ACS aggregated tables also anonymize all individual identifiers and reduce responses to tabulated counts. Furthermore, the datasets do not capture informal economies or the full experience of housing insecurity, such as the number of people doubled up in housing due to cost.

The ideological effect of this dataset’s ontology, or the way it divides the world into discrete categories, is to create a particular visibility that privileges countable, government-defined characteristics. The choice to measure educational attainment by degrees earned reflects an institutional definition of education, ignoring aptitudes learned through vocational training, apprenticeships, or self-study. The racial categories in DP05, while extensive in an attempt to be inclusive, are a product of a long political history and may not be how individuals personally identify. The financial characteristics in S1901 are presented in standardized income brackets that rely on 12-month recall periods and can miss the effects of recent economic changes. If this dataset were our only source, we would see a community neatly categorized by age, race, income levels, and degree attainment, but we would be blind to the effects of short-term rentals on the sustainability of their livelihoods. In this project, these tables are strongest for substantiating the human experience but they cannot, by themselves, substantiate Airbnb’s role in gentrifying and commodifying homes at the community’s expense.

Data Processing :

When we first set out to understand Airbnb’s impact on Los Angeles communities, we needed to contextualize two different kinds of information: the locations of thousands of individual listings and the demographics of the communities hosting them. Inside Airbnb provided us with the listing coordinates across LA and the U.S. Census Bureau’s American Community Survey provided detailed profiles of local populations. To put into perspective why we approached the datasets in the manner that we did, the Inside Airbnb listings span across 266 distinct neighborhoods. On the contrary, the Census data operated at an ideal scale, dividing the county into 20 CCDs. To address this, we created a custom mapping code program that assigned each of the 266 neighborhoods to its encompassing CCD, ensuring every listing in our final dataset was assigned a CCD and connected to the Census data. From cleaning and combining the datasets with R, in R Studio, we were able to finalize the dataset we would use for our map and visualizations in Tableau.

Word Count:

1,932 words

Read Time:

8–12 minutes