Generating Fake CSV Data With Python
Published: Aug 11, 2021
Last updated: Aug 11, 2021
This is Day 24 of the #100DaysOfPython challenge.
This post will use the Faker library to generate fake data and export it to a CSV file.
We will be emulating one of the free datasets from Kaggle, the Netflix original films IMDB score dataset, to generate something similar.
The final code can be found here.
Prerequisites
- Familiarity with Pipenv. See here for my post on Pipenv.
- Familiarity with JupyterLab. See here for my post on JupyterLab.
Getting started
Let's create the generating-fake-csv-data-with-python directory and install Faker.

    # Make the `generating-fake-csv-data-with-python` directory
    $ mkdir generating-fake-csv-data-with-python
    $ cd generating-fake-csv-data-with-python
    # Create a folder to place your notebook
    $ mkdir docs
    # Init the virtual environment
    $ pipenv --three
    $ pipenv install faker
    $ pipenv install --dev jupyterlab

At this stage, we have the packages that we need.
Now we can start up the notebook server.
    # Startup the notebook server
    $ pipenv run jupyter-lab
    # ... Server is now running on http://localhost:8888/lab
The server will now be up and running.
Creating the notebook
Once on http://localhost:8888/lab, select to create a new Python 3 notebook from the launcher.
Ensure that this notebook is saved in generating-fake-csv-data-with-python/docs/generating-fake-data.ipynb.
We will handle two parts of this mini project:
- Importing Faker and generating data.
- Importing the CSV module and exporting the data to a CSV file.
Before generating our data, we need to look at what we are trying to emulate.
Emulating The Netflix Original Movies IMDB Scores Dataset
Looking at the preview for our dataset, we can see that it contains the following columns and example rows:
Title | Genre | Premiere | Runtime | IMDB Score | Language |
---|---|---|---|---|---|
Enter the Anime | Documentary | August 5, 2019 | 58 | 2.5 | English/Japanese |
Dark Forces | Thriller | August 21, 2020 | 81 | 2.6 | Spanish |
We only have two example rows, but they are enough to make a few assumptions about how we want to emulate the dataset.
- For our languages, we will stick to a single language per film (unlike the example English/Japanese).
- IMDB scores will be between 1 and 5. We won't be too harsh on any movie, so scores will not start from 0.
- Runtimes should emulate a real movie - we can set it to be between 50 and 150 minutes.
- Genres may be something we need to write our own Faker provider for.
- We are going to be okay with nonsense data, so we can just use a string generator for the names.
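The assumptions above can be written down as checks. This is a minimal sketch, not part of the original notebook: check_row is a hypothetical helper, and the row layout (title, genre, premiere, runtime, score, language) is assumed from the column order in the table.

```python
def check_row(row):
    """Return a list of problems for one (title, genre, premiere, runtime, score, language) row."""
    title, genre, premiere, runtime, score, language = row
    problems = []
    if not title.strip():
        problems.append("title should not be empty")
    if "/" in language:
        problems.append("language should be a single language")
    if not 1.0 <= score <= 5.0:
        problems.append("score should be between 1.0 and 5.0")
    if not 50 <= runtime <= 150:
        problems.append("runtime should be between 50 and 150 minutes")
    return problems


# The two example rows from the dataset preview
print(check_row(("Dark Forces", "Thriller", "August 21, 2020", 81, 2.6, "Spanish")))
# []
print(check_row(("Enter the Anime", "Documentary", "August 5, 2019", 58, 2.5, "English/Japanese")))
# ['language should be a single language']
```

A check like this can be run against every generated row later on to confirm the fake data stays within the ranges we chose.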
With this said, let's look at how we can fake this.
Emulating a value for each column
We will create seven cells - one to import Faker and one for each column.
For the first cell, we will import Faker.
from faker import Faker fake = Faker()
Second, we will fake a movie name with words:

    def capitalize(word):
        return word.capitalize()

    words = fake.words()
    capitalized_words = list(map(capitalize, words))
    movie_name = ' '.join(capitalized_words)
    print(movie_name)  # Serve Fear Consider
Third, we will generate a date this decade and use the same format as the example:

    from datetime import datetime

    date = datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")
    print(date)  # April 30, 2020
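One detail worth knowing: %d zero-pads the day, so a generated date can read "April 05, 2020", while the Kaggle example shows "August 5, 2019". If the unpadded form matters, here is a minimal sketch. format_premiere is a hypothetical helper and not part of Faker or the original code.

```python
from datetime import datetime


def format_premiere(moment):
    # moment.day is an int, so it is rendered without a leading zero
    return f"{moment:%B} {moment.day}, {moment.year}"


example = datetime(2019, 8, 5)
print(example.strftime("%B %d, %Y"))  # August 05, 2019
print(format_premiere(example))       # August 5, 2019
```

The helper accepts any datetime, so the value returned by fake.date_time_this_decade() could be passed to it directly.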
Fourth, we will create our own fake data generator for the genre:

    # creating a provider for genre
    from faker.providers import BaseProvider
    import random

    # create new provider class
    class GenereProvider(BaseProvider):
        def movie_genre(self):
            return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])

    # then add new provider to faker instance
    fake.add_provider(GenereProvider)

    # now you can use:
    movie_genre = fake.movie_genre()
    print(movie_genre)  # Horror
Fifth, we will do the same for a language:

    # creating a provider for language
    from faker.providers import BaseProvider
    import random

    # create new provider class
    class LanguageProvider(BaseProvider):
        def language(self):
            return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])

    # then add new provider to faker instance
    fake.add_provider(LanguageProvider)

    # now you can use:
    language = fake.language()
    print(language)  # Spanish
Sixth, we need to generate a runtime:

    # Getting random movie length
    movie_len = random.randrange(50, 150)
    print(movie_len)  # 143
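Note that random.randrange(50, 150) excludes the upper bound, so a runtime of exactly 150 never appears. If the intent is 50 to 150 inclusive, random.randint is one option. The sketch below compares the two using a local random.Random instance (an assumption made here so that the global generator is left untouched).

```python
import random

rng = random.Random(0)  # local generator, used only for this comparison

exclusive = [rng.randrange(50, 150) for _ in range(10_000)]  # 50..149
inclusive = [rng.randint(50, 150) for _ in range(10_000)]    # 50..150

print(all(50 <= value <= 149 for value in exclusive))  # True
print(all(50 <= value <= 150 for value in inclusive))  # True
```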
Lastly, we need a rating with one decimal place between 1.0 and 5.0:

    # Movie rating
    random_rating = round(random.uniform(1.0, 5.0), 1)
    print(random_rating)  # 2.2
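Every run of these cells produces different values. If you would like the same numbers each time, seeding the random module is one way to get there. The sketch below is an assumption about how you might want to work and is not part of the original notebook; sample_movie_numbers is a hypothetical helper. Values that come from Faker itself (the words and dates) use Faker's own generator, which can be seeded separately with Faker.seed().

```python
import random


def sample_movie_numbers(seed):
    # Re-seeding the global generator makes the draws below repeatable
    random.seed(seed)
    runtime = random.randrange(50, 150)
    rating = round(random.uniform(1.0, 5.0), 1)
    return runtime, rating


first = sample_movie_numbers(42)
second = sample_movie_numbers(42)
print(first == second)  # True
```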
Now that we have all our information together, it is time to generate a CSV with 100 entries.
Generating the CSV
We can place everything we know into a final cell to generate some data:

    from datetime import datetime
    import csv
    import random

    from faker import Faker
    from faker.providers import BaseProvider

    class GenereProvider(BaseProvider):
        def movie_genre(self):
            return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])

    class LanguageProvider(BaseProvider):
        def language(self):
            return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])

    fake = Faker()
    fake.add_provider(GenereProvider)
    fake.add_provider(LanguageProvider)

    # Some of this is a bit verbose now, but doing so for the sake of completion
    def capitalize(word):
        return word.capitalize()

    def get_movie_name():
        words = fake.words()
        capitalized_words = list(map(capitalize, words))
        return ' '.join(capitalized_words)

    def get_movie_date():
        return datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")

    def get_movie_len():
        return random.randrange(50, 150)

    def get_movie_rating():
        return round(random.uniform(1.0, 5.0), 1)

    def generate_movie():
        return [get_movie_name(), fake.movie_genre(), get_movie_date(), get_movie_len(), get_movie_rating(), fake.language()]

    # newline='' is what the csv module recommends when opening files for writing
    with open('movie_data.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Title', 'Genre', 'Premiere', 'Runtime', 'IMDB Score', 'Language'])
        # range(100) writes exactly 100 rows (range(1, 100) would write only 99)
        for _ in range(100):
            writer.writerow(generate_movie())
Running the cell will output the CSV file movie_data.csv in our root. The values are random, so yours will differ, but it will look something like this:
    Title,Genre,Premiere,Runtime,IMDB Score,Language
    Discuss According Model,Horror,"February 09, 2020",107,2.6,Japanese
    People Conference Be,Comedy,"April 25, 2020",84,1.8,Chinese
    Forget Great Kind,Drama,"May 22, 2021",128,3.3,Chinese
    Trial Employee Cover,Drama,"February 24, 2020",90,3.6,Spanish
    Choose System We,Drama,"June 29, 2020",102,3.3,Spanish
    Range Laugh Reach,Comedy,"August 09, 2021",92,3.9,Spanish
    Increase Fire Popular,Romance,"May 03, 2020",107,4.1,Japanese
    Show Job Believe,Thriller,"March 13, 2021",62,1.6,English
    Or Power Century,Comedy,"February 29, 2020",146,2.3,Spanish
    Ago Ability Within,Drama,"July 23, 2020",120,4.8,Italian
    Foreign Always Sing,Mystery,"May 16, 2021",112,1.9,English
    Once Movie Artist,Documentary,"February 09, 2020",79,4.1,Hindi
    Near Explain Process,Action,"July 17, 2021",134,2.0,Spanish
    Big Information Grow,Romance,"February 25, 2020",64,4.4,Spanish
    Wind Project Heavy,Drama,"February 20, 2021",128,4.8,English
    Child Form Theory,Mystery,"January 12, 2021",91,3.0,Spanish
    Bring Sport Present,Drama,"March 02, 2021",87,2.7,Hindi
    Themselves That Activity,Action,"August 20, 2020",148,3.0,Spanish
    City Threat Almost,Thriller,"February 16, 2020",107,3.9,Spanish
    See Main Student,Drama,"January 17, 2020",125,1.4,Chinese
    Population Impact Season,Action,"March 19, 2020",109,2.3,Italian
    Manager Thank Truth,Documentary,"February 12, 2021",124,4.1,Hindi
    Child South Believe,Thriller,"April 18, 2020",65,3.9,Italian
    Present Main Themselves,Romance,"September 08, 2020",89,3.8,Hindi
    Maintain Order Old,Drama,"December 14, 2020",110,1.8,Hindi
    Difficult Town Hair,Documentary,"October 12, 2020",51,4.9,Japanese
    Page Hold Discussion,Drama,"November 01, 2020",139,1.9,Chinese
    Style True Car,Comedy,"July 03, 2021",84,5.0,Japanese
    Care Item Sing,Comedy,"November 16, 2020",100,4.9,Japanese
    Do Car Organization,Romance,"February 28, 2021",129,1.1,Japanese
    Learn Service Figure,Documentary,"March 04, 2020",50,2.0,Italian
    Forget Situation Fact,Comedy,"January 22, 2020",52,3.9,English
    Order International Report,Documentary,"December 17, 2020",101,2.2,Chinese
    Another Black Teach,Mystery,"December 08, 2020",96,4.2,Italian
    Professor Watch Throughout,Action,"September 15, 2020",111,4.0,English
    Which Quickly Son,Documentary,"July 02, 2021",98,2.4,Chinese
    Change East Article,Comedy,"March 28, 2020",61,2.4,English
    Partner Individual Local,Romance,"May 07, 2020",149,5.0,English
    Instead Watch Particular,Horror,"May 04, 2020",115,2.3,Hindi
    Democratic Someone Available,Romance,"July 26, 2021",98,1.4,Italian
    Place Would Mind,Drama,"May 09, 2021",141,2.4,Italian
    Likely Economy Weight,Mystery,"February 03, 2021",106,3.1,Hindi
    Could Certain More,Drama,"January 31, 2021",137,4.9,Hindi
    Source Operation Sure,Action,"March 03, 2020",81,3.3,Hindi
    Really Share Treat,Documentary,"August 05, 2020",99,2.2,English
    Edge When Data,Drama,"July 27, 2020",115,1.6,Italian
    Huge Imagine Federal,Romance,"August 08, 2021",141,3.0,Chinese
    Tend Often Collection,Documentary,"June 25, 2020",73,3.2,Chinese
    Wait Major Move,Action,"June 17, 2021",120,2.5,Spanish
    Firm Reason With,Thriller,"July 16, 2021",67,2.6,Spanish
    Significant Fall Travel,Romance,"March 14, 2021",123,2.0,Hindi
    Send Size Eye,Comedy,"June 18, 2021",74,3.5,Spanish
    Describe Hospital She,Drama,"March 14, 2021",90,1.4,Spanish
    Give Drive Better,Mystery,"March 15, 2020",106,1.2,Spanish
    Their Measure Choose,Action,"April 28, 2021",86,2.8,Italian
    Resource Sell Agent,Thriller,"February 08, 2020",50,3.1,Hindi
    Next Plan Soon,Action,"May 16, 2021",93,3.7,Hindi
    Land Allow Simply,Mystery,"May 23, 2021",144,1.0,Hindi
    Friend Total Few,Mystery,"June 12, 2021",93,4.1,Italian
    Role Might Bad,Drama,"December 08, 2020",100,3.5,Japanese
    Opportunity Public Certainly,Horror,"August 07, 2020",76,2.0,Italian
    Else Play Politics,Drama,"August 01, 2021",145,2.5,Italian
    Staff Main West,Documentary,"May 09, 2021",76,2.5,Japanese
    Ready Treat Everything,Drama,"July 24, 2021",121,1.6,Hindi
    Ahead Yourself Crime,Horror,"February 09, 2021",80,4.9,Italian
    Next These Night,Comedy,"February 20, 2020",65,3.4,Hindi
    Line Else Along,Comedy,"February 05, 2020",83,1.8,Hindi
    Degree Continue Green,Documentary,"March 10, 2020",73,3.8,Hindi
    Marriage Until Cover,Thriller,"November 26, 2020",147,4.8,English
    Republican Way Mission,Drama,"April 04, 2021",57,2.9,Chinese
    Prepare Rich Street,Romance,"February 26, 2021",94,2.6,Japanese
    Term Five On,Horror,"September 06, 2020",62,2.7,English
    Sister Manage Relate,Documentary,"August 17, 2020",76,4.4,Hindi
    Scientist Beat Wonder,Horror,"June 23, 2021",137,1.5,Chinese
    Fast Staff If,Romance,"February 05, 2021",148,2.7,Hindi
    Ready Campaign Field,Comedy,"October 25, 2020",147,2.7,Chinese
    Worker State Every,Mystery,"May 17, 2021",104,1.7,English
    Bar Wind Story,Action,"January 28, 2021",108,3.2,Hindi
    At Total Half,Thriller,"December 03, 2020",79,4.4,Spanish
    One Something Focus,Thriller,"June 29, 2020",59,1.2,Japanese
    Play We Impact,Comedy,"March 19, 2020",88,1.3,Hindi
    Message After Again,Comedy,"May 28, 2021",75,4.1,Chinese
    Such Something Information,Comedy,"June 01, 2021",145,2.2,Spanish
    Power Organization Myself,Action,"January 29, 2021",119,1.4,Hindi
    Apply Boy Success,Documentary,"August 06, 2020",93,1.4,Italian
    Evening Production Bar,Romance,"April 13, 2020",102,2.5,Chinese
    Work For Form,Drama,"September 19, 2020",80,4.4,Hindi
    Occur Billion Cover,Documentary,"December 03, 2020",56,3.7,Chinese
    Budget Wall Tv,Horror,"January 02, 2021",135,1.0,English
    Share Beyond Loss,Action,"January 23, 2021",55,1.5,Italian
    Professional Source Make,Horror,"December 08, 2020",107,4.1,Japanese
    To Protect Improve,Mystery,"July 30, 2020",100,3.6,Japanese
    Democratic Hundred Appear,Horror,"August 18, 2020",84,4.3,Hindi
    Face Central Summer,Documentary,"November 25, 2020",63,1.8,Spanish
    Involve Clearly At,Documentary,"November 25, 2020",56,1.5,Italian
    Fall Term Drug,Horror,"April 05, 2020",52,2.2,Chinese
    Fly Language Where,Romance,"May 18, 2021",102,4.4,Chinese
    Service Local Door,Drama,"August 04, 2020",63,1.9,Italian
    Son Avoid Himself,Drama,"July 30, 2020",53,1.8,Hindi
Success!
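You may notice that the Premiere values are wrapped in double quotes. The date contains a comma, so the csv module quotes the field to keep it in one column. The sketch below is an illustration and not part of the original notebook. It writes two of the rows above to an in-memory buffer (an assumption made here to avoid touching movie_data.csv) and reads them back with csv.DictReader.

```python
import csv
import io

header = ["Title", "Genre", "Premiere", "Runtime", "IMDB Score", "Language"]
rows = [
    ["Discuss According Model", "Horror", "February 09, 2020", 107, 2.6, "Japanese"],
    ["People Conference Be", "Comedy", "April 25, 2020", 84, 1.8, "Chinese"],
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

print(buffer.getvalue().splitlines()[1])
# Discuss According Model,Horror,"February 09, 2020",107,2.6,Japanese

# Read the same data back, keyed by the header row
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(len(records))              # 2
print(records[0]["Premiere"])    # February 09, 2020
print(int(records[0]["Runtime"]) + 1)  # 108
```

Everything read back from a CSV is a string, which is why the runtime is passed through int() before doing arithmetic on it.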
Summary
Today's post demonstrated how to use the Faker package to generate fake data and the csv module to export that data to a file.
In future, we may use this approach to make our own data sets to work with and do some data science on them.
Kaggle and Open Data are great resources for data and data visualization when you are not generating your own data.
This "100 Days in Python" series will move towards data science and machine learning from here on out.
Resources and further reading
Photo credit: pawel_czerwinski