Cloud & Local Data Integrity Monitor

Automated Data Synchronization & Integrity Solution

A Python-based automation tool that checks data integrity and synchronization between a centralized cloud spreadsheet (Google Sheets) and documents in the local file system. It automates comparison, filtering, and reporting to identify missing or mismatched files.

Project Context

Developed during my student position at Ludan Group to address critical data integrity challenges between cloud-based master documents and local file systems.

Position: Student Intern

Organization: Ludan Group

Duration: July 2024 – May 2025

Focus: Data integrity, automation, and cloud integration

Project Overview

The problem this project addresses is drift between two sources of truth: a master document list maintained in Google Sheets and the documents actually stored in a local directory structure.

The core script automates the comparison, filtering, and reporting of documents, identifying any missing or mismatched files between the master document list in Google Sheets and the documents present in the local directory structure.
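The comparison step can be pictured as a set difference between the names listed in the sheet and the names found on disk. The snippet below is a minimal sketch of that idea, not the project's actual code: `find_missing` and the sample document names are hypothetical, and it assumes both sides have already been normalized to comparable strings.

```python
def find_missing(expected, found):
    """Return names that appear in the sheet but not on disk, sorted."""
    return sorted(set(expected) - set(found))


# Hypothetical, already-normalized names from each side.
expected = ["DOC-001", "DOC-002", "DOC-003"]  # from the master sheet
found = ["DOC-001", "DOC-003", "DOC-999"]     # from the local scan

missing = find_missing(expected, found)
print(missing)
```

Swapping the arguments would give the reverse check (files on disk that the sheet does not list).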

Key Features

Google Sheets Integration: Seamless integration with Google Sheets API using gspread for accessing master document lists

Secure Authentication: OAuth 2.0 and Service Account authentication for secure API access

Automated Data Extraction: Extracts document number, binder, and revision information from sheets

File System Scanning: Comprehensive scanning of local directories and subdirectories

Intelligent Normalization: Advanced file name and folder name normalization for accurate matching

Difference Reporting: Generates comprehensive reports (missing.txt) of discrepancies

Enhanced Monitoring: Custom JavaScript monitoring scripts for data tracking and visualization

Technology Stack

Component         Technology                   Role
Primary Language  Python 3.x                   Core logic, data processing, and automation
Cloud API         Google Sheets API            Read access to master document sheets
Authentication    OAuth 2.0 / Service Account  Secure API access and authorization
Main Libraries    gspread, google-auth         Sheets interaction and authentication
Monitoring        JavaScript                   Data monitoring and visualization scripts

Setup and Installation

Prerequisites

Python 3.x installed on your system

Dependencies

pip install gspread google-auth

Google Sheets Setup

  1. Enable Google Sheets API in your Google Cloud Console
  2. Create a Service Account and download the JSON credentials
  3. Rename the downloaded file to credentials.json
  4. Share your Google Sheet with the service account email address
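For step 4, the address to share the sheet with is stored in the credentials file itself, under the `client_email` field of the service account JSON. The helper below is a small sketch for reading it; `service_account_email` is a hypothetical name, and the demo uses a fake credentials file written to a temporary directory rather than a real key.

```python
import json
import tempfile
from pathlib import Path


def service_account_email(credentials_path="credentials.json"):
    """Return the client_email stored in a service account JSON key file."""
    data = json.loads(Path(credentials_path).read_text(encoding="utf-8"))
    return data["client_email"]


# Demo with a fake key file (the address below is a placeholder, not a real account).
with tempfile.TemporaryDirectory() as tmp:
    fake_key = Path(tmp) / "credentials.json"
    fake_key.write_text(
        json.dumps({"type": "service_account",
                    "client_email": "monitor@example-project.iam.gserviceaccount.com"}),
        encoding="utf-8",
    )
    email = service_account_email(fake_key)

print(email)
```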

Configuration

Edit main.py to configure your settings:

root_folder = "C:\\path\\to\\local\\documents"
sheet_id = "YOUR_GOOGLE_SHEET_ID_HERE"

Execution

python main.py

Output Files

dict.txt: Contains all documents extracted from Google Sheets

all files.txt: Complete listing of all local files found in the directory

missing.txt: Comprehensive report of missing documents and discrepancies

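Each of these files can be produced by the same kind of routine: take a collection of names and write one per line. The sketch below assumes that simple one-entry-per-line layout, which the project may or may not use; `write_report` is a hypothetical helper, and the demo writes to a temporary directory.

```python
import tempfile
from pathlib import Path


def write_report(path, entries):
    """Write the entries sorted, one per line, and return how many were written."""
    entries = sorted(entries)
    text = "\n".join(entries) + ("\n" if entries else "")
    Path(path).write_text(text, encoding="utf-8")
    return len(entries)


# Demo: write a small missing.txt into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "missing.txt"
    count = write_report(report, ["DOC-002", "DOC-001"])
    content = report.read_text(encoding="utf-8")
```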
Core Python Implementation

The solution includes several key functions:

list_files_and_folders()

Recursively scans the file system and lists all files and folders in the target directory.
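A recursive scan of this kind is commonly built on `os.walk`. The version below is a sketch under assumptions: the signature, the `(folders, files)` return shape, and the choice to return paths relative to the root are all illustrative and may differ from the function in main.py.

```python
import os
import tempfile


def list_files_and_folders(root_folder):
    """Walk root_folder and return (folders, files) as sorted relative paths."""
    folders, files = [], []
    for current, dirnames, filenames in os.walk(root_folder):
        rel = os.path.relpath(current, root_folder)
        for name in dirnames:
            folders.append(os.path.normpath(os.path.join(rel, name)))
        for name in filenames:
            files.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(folders), sorted(files)


# Demo on a throwaway directory tree: one binder folder holding one document.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "binder1"))
    open(os.path.join(tmp, "binder1", "doc.pdf"), "w").close()
    folders, files = list_files_and_folders(tmp)
```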

remove_keys_ending_with()

Filters out keys from dictionaries that end with specific suffixes, used for data cleaning.
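One straightforward way to do this filtering is a dictionary comprehension with `str.endswith`, which accepts a tuple of suffixes. The sketch below assumes the function returns a new dictionary rather than modifying the input; the parameter names and the `_OLD` suffix are made up for illustration.

```python
def remove_keys_ending_with(data, suffixes):
    """Return a copy of data without keys ending in any of the given suffixes."""
    if isinstance(suffixes, str):
        suffixes = (suffixes,)  # allow a single suffix passed as a plain string
    suffixes = tuple(suffixes)
    return {k: v for k, v in data.items() if not str(k).endswith(suffixes)}


# Hypothetical sheet data where superseded entries carry an "_OLD" suffix.
docs = {"DOC-001": "A", "DOC-002_OLD": "B", "DOC-003": "C"}
cleaned = remove_keys_ending_with(docs, ["_OLD"])
print(cleaned)
```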

file_name_fix()

Normalizes and cleans file names to ensure consistent matching between cloud and local files.
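The exact normalization rules are not listed here, so the following is only an illustration of the general approach. It assumes three rules that are guesses, not the project's documented behavior: drop the file extension, treat runs of whitespace and underscores as a single space, and compare case-insensitively.

```python
import os
import re


def file_name_fix(name):
    """Normalize a file name for matching (assumed rules, see the note above)."""
    stem, _ext = os.path.splitext(name)   # assumption: the extension is ignored
    stem = re.sub(r"[\s_]+", " ", stem)   # assumption: "_" and spaces are equivalent
    return stem.strip().lower()           # assumption: matching ignores case


print(file_name_fix("  DOC_001  Rev A.PDF"))
```

With rules like these, `DOC_001 Rev A.pdf` on disk and `doc 001 rev a` in the sheet compare as equal.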

folder_name_fix()

Normalizes folder names for accurate path matching and comparison.
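As with file names, the real rules are not spelled out, so this sketch rests on assumptions: it keeps only the last path component, accepts both `\` and `/` as separators, collapses repeated whitespace, and lowercases the result. The sample path is hypothetical.

```python
import re


def folder_name_fix(path):
    """Normalize a folder path to a comparable folder name (assumed rules)."""
    parts = [p for p in re.split(r"[\\/]+", path) if p]  # split on \ or /
    name = parts[-1] if parts else ""                     # keep the last component
    return re.sub(r"\s+", " ", name).strip().lower()


print(folder_name_fix("C:\\docs\\Binder  12\\"))
```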

get_data()

Retrieves data from Google Sheets using the API and processes it for comparison.
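A possible shape for this step, using gspread calls that exist in the library (`gspread.service_account`, `Client.open_by_key`, `Worksheet.get_all_records`). Everything else is an assumption made for the sketch: the column headers `Document Number`, `Binder`, and `Revision`, the use of the first worksheet, the returned dictionary layout, and the `rows_to_documents` helper. The row processing is kept separate so it can be tried without credentials.

```python
def rows_to_documents(rows):
    """Map document number -> binder/revision info, skipping rows with no number."""
    documents = {}
    for row in rows:
        number = str(row.get("Document Number", "")).strip()  # assumed header
        if not number:
            continue
        documents[number] = {
            "binder": str(row.get("Binder", "")).strip(),      # assumed header
            "revision": str(row.get("Revision", "")).strip(),  # assumed header
        }
    return documents


def get_data(sheet_id, credentials_path="credentials.json"):
    """Read the first worksheet of the sheet and return the processed documents."""
    import gspread  # imported here so rows_to_documents works without gspread

    client = gspread.service_account(filename=credentials_path)
    worksheet = client.open_by_key(sheet_id).sheet1  # assumption: first tab
    return rows_to_documents(worksheet.get_all_records())


# Offline demo with rows shaped like get_all_records() output (made-up values).
sample_rows = [
    {"Document Number": "DOC-001", "Binder": 3, "Revision": "B"},
    {"Document Number": "", "Binder": "", "Revision": ""},
]
documents = rows_to_documents(sample_rows)
```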