Building an All-in-One Web Application for Data Science Using Python: An evaluation of the open-source tool Django

By: Nikhil Widhani, Statistics Canada

With the massive advancements in technology, many tools and techniques used in the data science community are slowly transitioning to open source. Open source tools are no-cost solutions whose codebase can be modified to fit the needs of the user, providing flexibility and reducing costs. Proprietary tools typically get the work done efficiently; however, they can be costly and still need to be adapted to the organization's needs. Proprietary tools also often lack customizability and sometimes solve only part of a problem. More complex problems, like building a collection of microservices for a variety of tasks, may require the use of multiple proprietary tools simultaneously.

Since open source tools give the user full control over the codebase, they allow organizations to apply security patches against new threats and to maintain privacy and data confidentiality. Many open-source projects can also be combined into a one-point solution that serves a larger requirement.

Consider a project that builds a secure pipeline for data ingestion. Separate modules for this process could include regularly fetching new ethically-sourced data, running responsible machine learning algorithms on a regular schedule, and building a dashboard for data visualization. In this article, I'll explain the process of using the Django web framework to create an AIO (all-in-one) production-ready web application that delivers information while protecting the privacy and confidentiality of the data it uses. This particular solution uses a dashboard for clients and an admin GUI (graphical user interface) for developers. The admin GUI can be used to run scheduled tasks at regular intervals to fetch new data, and to run a pipeline at the press of a button.

Django is an open-source web framework based on the Python programming language and is used by multinational corporations. Its core design supports flexibility, securityFootnote 1 and scalability. Apps built using Django can be reused in other projects without rewriting the code; at times, it's as simple as copying the code from one project to another. To build this type of web app, we'll use Django toolkits and plugins to develop a flexible architecture, while keeping scalability in mind. The proposed architecture is illustrated in Figure 1.

Figure 1: Proposed architecture for building an AIO web app

Description - Figure 1: Proposed architecture for building an AIO web app

Proposed architecture for building an AIO web app. The database works with the Django backend (which includes the REST API (application programming interface), pipeline, Redis server and task scheduler) and then onward to the frontend, as developed by its users or administrator. Each component in this diagram is explained below.

Components

The components of the architecture are the database, the backend, the pipeline, the task scheduler and the REST API (application programming interface). I'll elaborate on each of these components and their functions.

Database

A database collects and stores data. It's the key component of a project, since all parts of a project are linked to the data. Databases can be classified as relational or non-relational. Relational databases store data in a structured manner within tables and allow relationships with other tables. By default, Django supports relational databases such as PostgreSQL, SQLite and MySQL, all based on SQL (structured query language). The choice of which to use depends on the type of project; PostgreSQL may be more suitable for larger projects, while SQLite may be more suitable for smaller projects such as those with fewer than 100,000 rows. MySQL is typically used when capabilities of PostgreSQL, such as support for geospatial data or advanced data types, aren't required. Non-relational databases do not use a tabular schema to store data. Instead, they use object models specific to the type of data being stored. Elasticsearch, for example, is a modern document-oriented database commonly used to search millions of documents quickly; it can be a good fit when an extensive amount of data must be stored and queries need to be executed in real time. In this project, I'll use SQLite, which is appropriate for the size of the task.
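
As a concrete illustration, the snippet below shows Django's default database configuration, which points at a local SQLite file; swapping in PostgreSQL later is a matter of changing this one dictionary. The project layout and file name are the standard Django defaults.

```python
# settings.py -- Django's default database configuration (SQLite).
# Moving to PostgreSQL later only requires editing this dictionary.
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "db.sqlite3",  # single-file database, suited to smaller projects
    }
}
```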

Backend

The backend is where most of the work is performed. Using Django as the backend framework means the work builds on improvements made during 15 years of steady development since its release on July 21, 2005.Footnote 2 Django users can also benefit from a large online community, and many third-party packages are available to suit a variety of tasks. In the following sections, I'll present the packages and toolkits used to build an AIO web app for data science.

Pipeline

In the field of data science, a pipeline is a set of code or processes that converts raw data into usable data. Pipelines are sometimes used to extract and structure data from a raw format; other times, they apply machine learning models to extract valuable insights from the data. There are many use cases for pipelines, but the selection of a pipeline depends on the requirements of the project. Often, projects in the domain of data science involve a set of Python code that can be executed to perform these operations.
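
To make the idea concrete, here is a minimal, hypothetical pipeline sketch: each step is a plain Python function, so the same code can later be called from the command line, a scheduled task or the admin GUI. The file format and column handling are assumptions for illustration, not part of the original project.

```python
# pipeline.py -- a minimal, hypothetical extract-and-clean pipeline.
import csv
from pathlib import Path


def extract(raw_path: Path) -> list[dict]:
    """Read raw CSV rows into a list of dictionaries."""
    with open(raw_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[dict]:
    """Normalize column names and drop incomplete rows."""
    return [
        {key.strip().lower(): value.strip() for key, value in row.items()}
        for row in rows
        if all(value and value.strip() for value in row.values())
    ]


def run(raw_path: Path) -> list[dict]:
    """Run the full pipeline: raw file in, usable records out."""
    return transform(extract(raw_path))
```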

Task Scheduler

A task scheduler is used to perform asynchronous tasks that run in the background without halting other processes of the application. For time-consuming tasks such as data pipeline processing, it is often preferable to handle the work in the background with the help of threads, without blocking the normal request-response flow.Footnote 3 For example, if your web application requires you to upload a scanned PDF document that needs to be pre-processed before invoking an extraction module, the user will see a loading page until the file is processed. Sometimes the loading page appears for longer than expected and a timeout error is returned; this happens because browsers and web requests are not designed to wait endlessly for a page to load. The problem can be solved by pre-processing the PDF data in the background and sending the user a notification when the processing is complete. Task schedulers are a convenient way to implement this behaviour.

By default, Django doesn't support task scheduling; however, there are third-party Django apps such as Django-CeleryFootnote 4 that can integrate asynchronous task queues. The package can be installed with the pip install django-celery command, and extensive documentation is available for the library when needed. A broker server such as Redis is required to facilitate communication between the task scheduler and the backend. Redis stores the messages produced by the Celery task queue application that describe the job at hand. Redis is also used to store results from Celery queues, which can then be fetched from the frontend to present progress updates to the user. Django-Celery supports use cases such as the following (a minimal setup sketch follows the list):

  • Running scheduled data-retrieval cron jobs to update the database with new data (e.g., retrieve new data daily at 4 am).
  • Cleaning and restructuring data.
  • Running machine learning models whenever new data is ingested.
  • Updating the database and reflecting the changes on the dashboard.
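
The sketch below shows what the first use case might look like with Celery and a local Redis broker. The project name, broker URL and task body are assumptions; only the Celery application and crontab scheduling APIs are as documented.

```python
# celery_app.py -- a minimal Celery setup sketch for a Django project.
from celery import Celery
from celery.schedules import crontab

# Redis acts as both the message broker and the result backend here;
# the URL assumes a local Redis server on the default port.
app = Celery(
    "myproject",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)


@app.task
def fetch_new_data():
    """Placeholder for the first use case: fetch new data and update the database."""
    ...


# Celery beat schedule: run fetch_new_data every day at 4 am.
app.conf.beat_schedule = {
    "fetch-new-data-daily": {
        "task": fetch_new_data.name,
        "schedule": crontab(hour=4, minute=0),
    },
}
```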

Figure 2: Flow diagram of a task scheduler

Description - Figure 2: Flow diagram of a task scheduler

A process flow diagram of a task scheduler. The Redis server acts as a medium that can start tasks and keep track of their progress.
Three boxes at the top, labelled "Task 1 – Completed", "Task 2 – Running" and "Task 3 – Running", each point toward the lower box labelled "Redis Server / REST API".

Figure 3: Task scheduler user interface

Description - Figure 3: Task scheduler user interface

A task scheduler user interface for the frontend dashboard. The table shows useful information such as the task name, status and success value. The Action column contains a triangular 'play' button to execute a task and a square 'stop' button to stop a running task.
The tasks and their Status, Success and Action values are as follows.
Task 1 – Fetch New Data: Completed status; Success is True; Action displays a triangular white and blue play button.
Task 2 – Clean PDF: Running status; Success is False; Action displays a black square stop button.
Task 3 – Extract Data: Running status; Success is False; Action displays a black square stop button.

Running a pipeline this way makes better use of resources, since pipeline tasks can be scheduled to run overnight when fewer users are on the app. Separating production tasks from development tasks also makes development more efficient, and all distinct parts of the project can be combined in a single production-ready application. Finally, running the pipeline this way gives users additional control over the product, since they can execute the pipeline whenever they wish.

REST API

APIs are microservices that simplify software development. It's often useful to separate different functions of an application into small web services. For example, when looking up an address by postal code, a REST API for address lookup returns all street addresses within the postal code area. The response is often in JSON (JavaScript Object Notation) format, an industry standard understood by many applications and programming languages. A REST API supports multiple methods, most commonly GET, POST, PUT and DELETE. Each method separates logic and helps structure code; the methods are responsible for getting, adding, updating and deleting data, respectively. Within a web application, a REST API acts as middleware between the database and the frontend, protecting the database from security vulnerabilities.

In Django, one can build APIs with a simple JSON HTTP response. However, there is a dedicated open-source toolkit known as Django REST framework that can be used for building APIs.Footnote 5 Some benefits of using Django REST framework include the following (a minimal sketch follows the list):

  • a web-browsable API out of the box,
  • built-in authentication policies for data security,
  • prebuilt serializers that easily convert Python objects to JSON objects, and
  • extensive documentation, along with customizability and support from various other Django plugins.
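
As a minimal sketch of these benefits, the code below defines a serializer and viewset for a hypothetical Task model (with name, status and success fields, matching Figure 3); the router then exposes GET, POST, PUT and DELETE endpoints with a browsable API for free. The model and app names are assumptions.

```python
# api.py -- a minimal Django REST framework sketch for a hypothetical Task model.
from rest_framework import serializers, viewsets
from rest_framework.routers import DefaultRouter

from myapp.models import Task  # hypothetical model with name, status and success fields


class TaskSerializer(serializers.ModelSerializer):
    """Converts Task objects to and from JSON."""

    class Meta:
        model = Task
        fields = ["id", "name", "status", "success"]


class TaskViewSet(viewsets.ModelViewSet):
    """Provides list, create, retrieve, update and delete actions for tasks."""

    queryset = Task.objects.all()
    serializer_class = TaskSerializer


# urls.py -- the router generates the URL patterns, e.g. /tasks/ and /tasks/<id>/.
router = DefaultRouter()
router.register(r"tasks", TaskViewSet)
urlpatterns = router.urls
```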

For use cases similar to the one presented in this article, Django REST framework can be used to build data APIs, as well as APIs for fetching the progress of tasks and pipelines while they run in the background. This data can be integrated into the frontend to show progress, as illustrated in Figure 4.

Figure 4: Processing PDF Files

Description - Figure 4: Processing PDF Files

A GUI progress bar showing the progress of code that processes PDF files. The bar is at 70%. As each PDF file is processed, the slider size, percentage value and processed-file count update to show the status of the pipeline.
The text at the top of the bar reads "Executing PDF extraction pipeline", and the text below the progress bar reads "7/10 files processed. 2 minutes remaining."
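
One way such a progress API might look is sketched below: a small endpoint that the frontend polls to update the Figure 4 progress bar. It assumes the Celery and Redis setup described earlier and that the running task reports progress through Celery's update_state mechanism; the view name and payload fields are illustrative.

```python
# views.py -- a hypothetical endpoint for polling background-task progress.
from celery.result import AsyncResult
from rest_framework.decorators import api_view
from rest_framework.response import Response


@api_view(["GET"])
def task_progress(request, task_id):
    """Return the current state of a Celery task from the Redis result backend."""
    result = AsyncResult(task_id)
    payload = {"task_id": task_id, "state": result.state}
    # A task can report custom progress with update_state(meta={...});
    # result.info carries that metadata (or an exception if the task failed).
    if isinstance(result.info, dict):
        payload.update(result.info)  # e.g. {"processed": 7, "total": 10}
    return Response(payload)
```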

Security

Security should always be considered when reviewing framework options. In this article, I talk about the Django Python framework and some of its packages. Django is a framework with many security features. It supports deployment behind HTTPS, with traffic encrypted using SSL (secure sockets layer). The Django ORM (object-relational mapping) also protects against SQL injection, where users could otherwise execute unwanted SQL code in the database. In addition, Django validates incoming requests against the 'ALLOWED_HOSTS' value in the settings file, which lists the hostnames the site may be served from. This setting prevents HTTP Host header attacks, in which a fake hostname is used.
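
As a small sketch, the settings below show how these protections are switched on in a Django project; the hostname is a placeholder, and the HTTPS flags assume the app is deployed behind an SSL/TLS-terminating server.

```python
# settings.py -- security-related settings discussed above (hostname is hypothetical).
ALLOWED_HOSTS = ["dashboard.example.gc.ca"]  # reject requests with forged Host headers

# Enforce HTTPS once the site is deployed behind SSL/TLS.
SECURE_SSL_REDIRECT = True     # redirect plain HTTP requests to HTTPS
SESSION_COOKIE_SECURE = True   # send the session cookie over HTTPS only
CSRF_COOKIE_SECURE = True      # send the CSRF cookie over HTTPS only
```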

At Statistics Canada, many of the Python web apps are deployed on the Advanced Analytics Workspace (AAW), which follows STATCAN guidelines for security, privacy and data confidentiality. Standard network protection elements like firewalls, traffic management, reduced user rights for application code and separation of concerns are all employed to reduce the overall attack surface for any potential vector. Further, RBAC (role-based access control) is used to restrict access to the dashboard, preventing unwanted users from accessing apps. Various methods for scanning and vulnerability detection are also employed to ensure the continued operation of secure applications.

Conclusion

In this article, we covered one way of building a web app that can automatically run a pipeline and that includes a dashboard component for clients. We also discussed modern techniques and tools that can be used with Python programming to achieve these results.

The frontend component can be built using Django with a combination of languages such as HTML (HyperText Markup Language), JavaScript and CSS (Cascading Style Sheets). Other frameworks such as React can also be used, and this kind of backend can be integrated with Microsoft Power BI with the help of an API. Although the choice of frontend is highly subjective, using Django templates is useful because it serves the purpose of building an all-inclusive web app. With this framework, the possibilities are endless. This architecture mostly helps to reduce coding, resources and maintenance, and facilitates reusable apps for similar projects. It consists of many components that are not locked to a single application and can be reused elsewhere with the help of a microservice architecture.

The Data Science Network article Deploying Your Machine Learning Project as a Service expands on the project development practices mentioned here. The techniques and concepts showcased in that article are also helpful during the various stages of project development and can be incorporated when deploying applications. Most of the resources are linked in the references.

For additional information, please contact us at statcan.datascience-sciencedesdonnees.statcan@statcan.gc.ca.
