50 Python Scripts for HDFS Performance Tuning (2026)

A field-tested toolkit for monitoring, diagnosing and tuning HDFS clusters. Every script is short, copy-paste ready, and labelled READ (safe) or WRITE (mutating — dry-run first).

📅 Updated May 2026  |  ⏱ 25 min read  |  🎯 Intermediate to Senior

Setup & Prerequisites

All scripts use a small set of well-known libraries. Install once, then every snippet below runs as-is.

pip install hdfs requests pandas python-dateutil tabulate snakebite-py3
# Optional, for Kerberized clusters
pip install requests-kerberos

🟢 READ-ONLY scripts

  • Monitoring, fsck parsing, JMX queries
  • Block reports, quota reports
  • Safe to schedule in cron

🟠 WRITE / MUTATING scripts

  • Compaction, deletion, rebalancing
  • Always dry-run first
  • Run during a low-traffic window
💡 Connection helper used everywhere

Throughout this guide we reuse one client object; define it once and import it in every script:
from hdfs import InsecureClient
client = InsecureClient('http://namenode-host:9870', user='hdfs')
For Kerberos: from hdfs.ext.kerberos import KerberosClient.
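If you want the "define once, import everywhere" pattern as an actual module, a minimal sketch could look like this (the hdfs_client.py filename and the NameNode URL are placeholders):

# hdfs_client.py: shared connection helper (filename is an example)
from hdfs import InsecureClient

NAMENODE_URL = 'http://namenode-host:9870'   # adjust to your active NameNode

client = InsecureClient(NAMENODE_URL, user='hdfs')

# On a Kerberized cluster, swap in the Kerberos client instead:
# from hdfs.ext.kerberos import KerberosClient
# client = KerberosClient(NAMENODE_URL)

Every later snippet can then do from hdfs_client import client instead of re-creating the connection.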

Section 1 — NameNode Health & Heap (Scripts 1–7)

01 NameNode Heap Usage Monitor READ

Pulls live JMX heap stats. NameNode heap pressure is the #1 silent killer — once it crosses 80% you start seeing GC pauses that block every client RPC.

import requests, json
url = "http://namenode-host:9870/jmx?qry=java.lang:type=Memory"
heap = requests.get(url).json()['beans'][0]['HeapMemoryUsage']
used_gb = heap['used'] / 1024**3
max_gb  = heap['max']  / 1024**3
pct = used_gb / max_gb * 100
print(f"NN Heap: {used_gb:.1f}/{max_gb:.1f} GB  ({pct:.1f}%)")
if pct > 80: print("WARN: NameNode heap above 80%")
02 FsImage Object Count & Growth Trend READ

Tracks total inodes — every file + directory + block consumes ~150 bytes of NN heap. Keep a daily snapshot to catch runaway growth before it crashes the NN.

import requests, csv, datetime as dt
j = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState").json()
b = j['beans'][0]
row = [dt.date.today(), b['FilesTotal'], b['BlocksTotal'], b['CapacityUsed']]
with open('/var/log/hdfs_growth.csv', 'a') as f:
    csv.writer(f).writerow(row)
print("Files:", b['FilesTotal'], " Blocks:", b['BlocksTotal'])
03 RPC Queue Time Spike Detector READ

If RpcQueueTimeAvgTime climbs above 50 ms, clients are waiting on the NN — usually GC or lock contention. Alert immediately.

import requests
m = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020").json()['beans'][0]
qt = m['RpcQueueTimeAvgTime']
print(f"RPC queue avg: {qt:.2f} ms")
if qt > 50: print("ALERT: NameNode is slow to respond")
04 Safe Mode Status Checker READ

Run during cluster startup or after maintenance. If the NN stays in safe mode, it has not yet received enough block reports or too many blocks are under-replicated.

import subprocess
out = subprocess.check_output(['hdfs','dfsadmin','-safemode','get']).decode()
print(out)
if 'ON' in out:
    print("WARN: cluster still in safe mode")
05 EditLog Backlog Monitor READ

Standby NN must apply edits to stay in sync. Lag > 1000 edits means failover will be slow — investigate JournalNode disk speed.

import requests, json
def txid(host):
    b = requests.get(f"http://{host}:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo").json()['beans'][0]
    return int(json.loads(b['JournalTransactionInfo'])['LastAppliedOrWrittenTxId'])
lag = txid('active-nn') - txid('standby-nn')
print("Standby edit lag:", lag, "transactions")
if lag > 1000: print("WARN: standby NN is falling behind; check JournalNode disk speed")
06 GC Pause Time Tracker READ

Long G1 pauses freeze the NameNode. Scrape the collector's CollectionTime every minute and alert on deltas above 2 s/min.

import requests, time
def gc(): return requests.get("http://namenode-host:9870/jmx?qry=java.lang:type=GarbageCollector,name=G1%20Old%20Generation").json()['beans'][0]
a = gc(); time.sleep(60); b = gc()
delta_ms = b['CollectionTime'] - a['CollectionTime']
print(f"G1 Old GC in last 60s: {delta_ms} ms")
if delta_ms > 2000: print("ALERT: GC pressure")
07 NameNode Failover History READ

Parses ZKFC log for HA failover events — frequent flips signal an unstable cluster (often network or JN disk).

import re, glob
for logfile in glob.glob('/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-*.log'):
    with open(logfile) as f:
        for line in f:
            if 'Becoming active' in line or 'Becoming standby' in line:
                ts = re.match(r'^(\S+ \S+)', line)
                print(ts.group(1) if ts else '?', '-', line.strip()[-80:])

Section 2 — DataNode Performance & Disk (Scripts 8–14)

08 DataNode Disk Usage Per Volume READ

Surfaces the imbalance between disks inside one DataNode — fixed by enabling the disk balancer (dfs.disk.balancer.enabled=true).

import requests, json
for dn in ['dn1:9864','dn2:9864','dn3:9864']:
    j = requests.get(f"http://{dn}/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo").json()
    vols = json.loads(j['beans'][0]['VolumeInfo'])    # per-volume usage, returned as a JSON string
    for vol, info in vols.items():
        print(dn, vol, f"used {info['usedSpace']/1024**3:.1f} GB, free {info['freeSpace']/1024**3:.1f} GB")
09 Slow DataNode Detector (Outlier Read Latency) READ

Compares ReadBlockOpAvgTime across all DNs. Anything at 3× the median is dragging your Spark/MR shuffle.

import requests, statistics
hosts = ['dn1','dn2','dn3','dn4']
lat = {h: requests.get(f"http://{h}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity-*").json()['beans'][0]['ReadBlockOpAvgTime'] for h in hosts}
med = statistics.median(lat.values())
for h,v in lat.items():
    if v > 3*med: print(f"SLOW DN: {h} = {v:.1f}ms (median {med:.1f}ms)")
10 DataNode Liveness Heartbeat Lag READ

Stale DataNodes (heartbeat > 30 s) cause the NN to mark blocks under-replicated. Alert before they go DEAD at 10 min.

import requests
j = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState").json()['beans'][0]
print("Stale DNs:", j['NumStaleDataNodes'], " Dead:", j['NumDeadDataNodes'])
11 Disk Failure Predictor (smartctl + Python) READ

Run on every DN — flags disks where reallocated sectors are growing. Replace before HDFS marks them failed.

import subprocess, re
out = subprocess.check_output(['smartctl','-A','/dev/sda']).decode()
m = re.search(r'Reallocated_Sector_Ct.*?(\d+)\s*$', out, re.M)
if m and int(m.group(1)) > 5:
    print("Disk /dev/sda has reallocated sectors:", m.group(1))
12 DataNode Xceiver Count Monitor READ

Each in-flight read/write uses one xceiver thread. Default cap is 4096 — once you hit 80%, raise dfs.datanode.max.transfer.threads.

import requests
for dn in ['dn1','dn2','dn3']:
    j = requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo").json()['beans'][0]
    print(dn, "Xceivers:", j['XceiverCount'])
13 DataNode Volume Failure Listener READ

Polls each DN for failed volumes — a DN with a failed disk is still alive but at reduced capacity.

import requests
for dn in ['dn1','dn2','dn3']:
    j = requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=FSDatasetState").json()['beans'][0]
    if j.get('NumFailedVolumes',0) > 0:
        print(f"FAILED VOL on {dn}: {j['NumFailedVolumes']}")
14 Per-DataNode Throughput Snapshot READ

Pulls bytes-read and bytes-written per minute. Use it to spot cold DNs (rack mis-config) or hot DNs (skewed placement).

import requests, time
def get(dn): return requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity-*").json()['beans'][0]
a={h:get(h) for h in ['dn1','dn2','dn3']}
time.sleep(60)
b={h:get(h) for h in ['dn1','dn2','dn3']}
for h in a:
    rd=(b[h]['BytesRead']-a[h]['BytesRead'])/1024**2
    wr=(b[h]['BytesWritten']-a[h]['BytesWritten'])/1024**2
    print(f"{h}: read {rd:.1f} MB/min, write {wr:.1f} MB/min")

Section 3 — Blocks & Replication (Scripts 15–21)

15 Missing & Corrupt Block Reporter READ

Parses hdfs fsck / output. Missing blocks = data loss — page on-call immediately.

import subprocess, re
out = subprocess.check_output(['hdfs','fsck','/','-files','-blocks']).decode()
def grab(label):
    m = re.search(label + r':\s+(\d+)', out)
    return int(m.group(1)) if m else 0
miss  = grab(r'Missing blocks')
corr  = grab(r'Corrupt blocks')
under = grab(r'Under[- ]replicated blocks')
print(f"missing={miss} corrupt={corr} under-replicated={under}")
if miss or corr: print("PAGE ONCALL")
16 Under-Replicated Block Trigger WRITE

Re-triggers replication for files flagged by fsck (e.g., after a DataNode decommission) by resetting their replication factor. The apply line is commented out; review the printed plan before uncommenting.

import subprocess
lines = subprocess.check_output(['hdfs','fsck','/','-list-corruptfileblocks']).decode().splitlines()
for line in lines:
    # output lines look like "<block id>\t<path>"; pick out the HDFS path token
    path = next((tok for tok in line.split() if tok.startswith('/')), None)
    if path:
        print('setrep 3 on', path)
        # subprocess.run(['hdfs','dfs','-setrep','3',path])  # uncomment to apply
17 Over-Replicated Block Cleaner WRITE

Files left at replication 5 from old jobs waste storage. Drop them back to 3 — saves ~40% on those paths.

from hdfs import InsecureClient
client = InsecureClient('http://namenode-host:9870', user='hdfs')
for path,_,files in client.walk('/data/legacy'):
    for f in files:
        full=f"{path}/{f}"
        st=client.status(full)
        if st['replication']>3:
            print(f"setrep 3: {full}")
            # client.set_replication(full, 3)
18 Block Placement Policy Auditor READ

Verifies the 3 replicas of every block are split across at least 2 racks. Same-rack replicas defeat HDFS's fault tolerance.

import subprocess, re
out = subprocess.check_output(['hdfs','fsck','/important','-files','-blocks','-racks']).decode()
for blk in re.findall(r'BP-[^\s]+ .*?\n', out):
    racks=re.findall(r'/[^/\s]+/[^/\s]+', blk)
    if len(set(racks))<2: print("SAME-RACK:", blk[:80])
19 Block Size Distribution READ

Histogram of file sizes. Lots of files < 64 MB on a 128 MB block size means you're wasting NN heap and reading inefficiently.

from hdfs import InsecureClient
import collections
client = InsecureClient('http://namenode-host:9870', user='hdfs')
buckets=collections.Counter()
for p,_,files in client.walk('/data'):
    for f in files:
        sz=client.status(f"{p}/{f}")['length']
        b='<1MB' if sz<1e6 else '1-64MB' if sz<64e6 else '64-128MB' if sz<128e6 else '>128MB'
        buckets[b]+=1
print(buckets)
20 Replication Lag Tracker READ

Watches PendingReplicationBlocks — sustained values > 0 for hours mean DNs can't keep up.

import requests
j = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem").json()['beans'][0]
print("Pending:", j['PendingReplicationBlocks'], " UnderRepl:", j['UnderReplicatedBlocks'])
21 Erasure Coding Conversion Helper WRITE

Converts cold paths from 3× replication to RS-6-3-1024k erasure coding — saves 50% storage. Only safe for read-mostly data.

import subprocess
COLD='/data/archive/2023'; EC_TMP=COLD+'_ec'
# set the EC policy on a new target directory, then rewrite existing files into it
subprocess.run(['hdfs','dfs','-mkdir','-p',EC_TMP])
subprocess.run(['hdfs','ec','-setPolicy','-path',EC_TMP,'-policy','RS-6-3-1024k'])
subprocess.run(['hadoop','distcp','-update','-skipcrccheck',COLD,EC_TMP])
# after verifying EC_TMP, swap directories so readers pick up the erasure-coded copy
# subprocess.run(['hdfs','dfs','-mv',COLD,COLD+'_replicated'])
# subprocess.run(['hdfs','dfs','-mv',EC_TMP,COLD])

Section 4 — Small Files Detection & Cleanup (Scripts 22–28)

⚠️ Why small files kill HDFS

Each file consumes ~150 bytes of NameNode heap. 100M small files = 15 GB of heap just for metadata, plus task-scheduling overhead in Spark/MR. Hunt them aggressively.
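As a back-of-the-envelope check, you can estimate metadata heap from the same FSNamesystemState bean used in script 2 (the 150-byte figure is a rule of thumb, not an exact constant):

import requests
b = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState").json()['beans'][0]
objects = b['FilesTotal'] + b['BlocksTotal']     # inodes + blocks held in NameNode heap
est_gb  = objects * 150 / 1024**3                # ~150 bytes per object (rule of thumb)
print(f"~{est_gb:.1f} GB of NN heap needed for metadata alone ({objects:,} objects)")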
22 Small File Top-N Reporter READ

Lists the top directories by count of files smaller than 32 MB — the worst offenders against the NN.

from hdfs import InsecureClient
import collections
client = InsecureClient('http://namenode-host:9870', user='hdfs')
counts=collections.Counter()
for p,_,files in client.walk('/'):
    for f in files:
        if client.status(f"{p}/{f}")['length'] < 32*1024*1024:
            counts[p]+=1
for d,c in counts.most_common(20): print(f"{c:>8}  {d}")
23 HAR (Hadoop Archive) Compactor WRITE

Bundles thousands of small files into a single .har archive — readable from MR/Spark, drops NN inode count by 1000×.

import subprocess
SRC='/raw/2023'
subprocess.run(['hadoop','archive','-archiveName','raw2023.har','-p','/raw','2023','/archive'])
# delete originals only after verifying the har
# subprocess.run(['hdfs','dfs','-rm','-r',SRC])
24 Parquet Compaction with Spark WRITE

The right fix for partitioned tables — read-coalesce-write. Targets ~256 MB output files.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('compact').getOrCreate()
df = spark.read.parquet('/data/events/dt=2024-12-01')
target = 256 * 1024 * 1024                     # aim for ~256 MB output files
n = max(1, int(df.count() * 200 / target))     # assumes ~200 bytes per row; adjust for your schema
df.coalesce(n).write.mode('overwrite').parquet('/data/events/dt=2024-12-01')
25 Empty File & Empty Dir Sweeper WRITE

Empty files still cost NN heap. Sweep them weekly.

from hdfs import InsecureClient
client = InsecureClient('http://namenode-host:9870', user='hdfs')
for p,dirs,files in client.walk('/tmp'):
    for f in files:
        if client.status(f"{p}/{f}")['length']==0:
            print('rm', f"{p}/{f}")
            # client.delete(f"{p}/{f}")
26 SequenceFile Bundler for Logs WRITE

Old approach but still useful for one-off log dumps. Bundles many tiny logs into one binary file with key=filename.

import subprocess, glob
streaming_jar = glob.glob('/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar')[0]
subprocess.run(['hadoop','jar',streaming_jar,
  '-input','/logs/2024-12','-output','/logs/2024-12.seq',
  '-mapper','/bin/cat','-reducer','NONE',
  '-outputformat','org.apache.hadoop.mapred.SequenceFileOutputFormat'])
27 Per-Owner Small-File Leaderboard READ

Names and shames the team or service account creating the most small files. Best motivational tool you have.

from hdfs import InsecureClient
import collections
client = InsecureClient('http://namenode-host:9870', user='hdfs')
by_owner=collections.Counter()
for p,_,files in client.walk('/'):
    for f in files:
        st=client.status(f"{p}/{f}")
        if st['length']<16*1024*1024:
            by_owner[st['owner']]+=1
for o,c in by_owner.most_common(10): print(f"{c:>10}  {o}")
28 Spark Streaming Output Coalescer WRITE

Streaming jobs love producing one tiny file per micro-batch. Apply this in foreachBatch to keep files at ~128 MB.

def write_batch(df, batch_id):
    # coalesce each micro-batch into a few larger files before writing
    df.coalesce(4).write.mode('append').parquet('/stream/events')

streaming_df.writeStream.foreachBatch(write_batch).start()   # streaming_df: your structured-streaming DataFrame

Section 5 — Storage Analysis & Reporting (Scripts 29–35)

29 Top-N Largest Directories READ

Equivalent of du -h --max-depth=2 but cluster-wide. Run weekly and post to a Slack channel.

import subprocess
out = subprocess.check_output(['hdfs','dfs','-du','/']).decode().splitlines()
# first column is the raw byte count, which sorts cleanly (unlike -h output)
sized = [(int(l.split()[0]), l) for l in out if l.strip()]
for _, l in sorted(sized, reverse=True)[:20]: print(l)
30 Cold Data Finder (mtime > 180 days) READ

Finds candidates for archive tier or erasure coding. Most clusters have 30%+ data untouched in 6 months.

from hdfs import InsecureClient
import datetime as dt
client = InsecureClient('http://namenode-host:9870', user='hdfs')
cutoff=(dt.datetime.now()-dt.timedelta(days=180)).timestamp()*1000
cold=0
for p,_,files in client.walk('/data'):
    for f in files:
        st=client.status(f"{p}/{f}")
        if st['modificationTime'] < cutoff:
            cold += st['length']
print(f"Cold data: {cold/1e12:.2f} TB not modified in 180 days")
31 Storage Tier (HOT/WARM/COLD) Mover WRITE

Sets storage policy on cold paths and runs the Mover to physically migrate blocks to ARCHIVE-typed disks.

import subprocess
COLD='/data/archive'
subprocess.run(['hdfs','storagepolicies','-setStoragePolicy','-path',COLD,'-policy','COLD'])
subprocess.run(['hdfs','mover','-p',COLD])
32 Duplicate File Hash Scanner READ

Uses HDFS's built-in checksum to find byte-identical files across the cluster. Surprisingly common after careless distcp.

from hdfs import InsecureClient
import collections
client=InsecureClient('http://namenode-host:9870',user='hdfs')
dup=collections.defaultdict(list)
for p,_,files in client.walk('/data'):
    for f in files:
        c=client.checksum(f"{p}/{f}")
        dup[c['bytes']].append(f"{p}/{f}")
for h,paths in dup.items():
    if len(paths)>1: print('DUP:', paths)
33 Per-User Storage Cost Report READ

Multiplies per-user bytes by replication factor and a $/GB/month rate. Send to finance.

from hdfs import InsecureClient
import collections
client=InsecureClient('http://namenode-host:9870',user='hdfs')
RATE=0.023  # $/GB/month
by={}
for p,_,files in client.walk('/'):
    for f in files:
        st=client.status(f"{p}/{f}")
        b=st['length']*st['replication']
        by[st['owner']]=by.get(st['owner'],0)+b
for u,b in sorted(by.items(),key=lambda x:-x[1])[:15]:
    print(f"{u:<20} {b/1e9:>10.1f} GB  ${b/1e9*RATE:>8.2f}/mo")
34 DataNode Free Space Forecast READ

Linear projection of when the cluster will fill at the current 7-day growth rate.

import csv, datetime as dt
rows=list(csv.reader(open('/var/log/hdfs_growth.csv')))[-7:]
days=[(dt.datetime.fromisoformat(r[0])-dt.datetime.fromisoformat(rows[0][0])).days for r in rows]
used=[int(r[3])/1e12 for r in rows]
slope=(used[-1]-used[0])/max(1,days[-1])   # TB of growth per day over the window
CAPACITY_TB=200                            # adjust to your cluster's usable capacity
free_tb=CAPACITY_TB-used[-1]
print(f"Days until full: {free_tb/slope:.0f}" if slope>0 else "No growth over the last 7 days")
35 Fastest-Growing Path Detector READ

Compares hdfs dfs -du snapshots day over day. Finds the path eating storage the fastest.

import subprocess, json, os
today={l.split()[-1]:int(l.split()[0]) for l in subprocess.check_output(['hdfs','dfs','-du','/data']).decode().splitlines() if l.strip()}
yest=json.load(open('/tmp/du_yest.json')) if os.path.exists('/tmp/du_yest.json') else {}
for p,sz in today.items():
    delta=sz-yest.get(p,sz)
    if delta>10*1e9: print(f"{p}: +{delta/1e9:.1f} GB in 24h")
json.dump(today,open('/tmp/du_yest.json','w'))

Section 6 — Quota, Snapshots & Trash (Scripts 36–42)

36 Quota Headroom Report READ

Lists every quota-enabled path with current consumption — flag anything > 90%.

import subprocess
# -count -q columns: QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
out=subprocess.check_output(['hdfs','dfs','-count','-q','/data/*']).decode()
for l in out.splitlines():
    parts=l.split()
    if len(parts)>=8 and parts[2]!='none':
        space_quota, used, path = int(parts[2]), int(parts[6]), parts[-1]
        print(f"{path:<40} {used/1e9:>8.1f} GB used of {space_quota/1e9:.1f} GB quota ({used/space_quota*100:.0f}%)")
37 Auto-Quota Setter for New Tenants WRITE

Creates a tenant directory and applies space + name quotas in one shot — prevents one team filling the cluster.

import subprocess
def add_tenant(name, space_tb=10, files=1_000_000):
    p=f'/tenants/{name}'
    subprocess.run(['hdfs','dfs','-mkdir','-p',p])
    subprocess.run(['hdfs','dfsadmin','-setSpaceQuota',f'{space_tb}t',p])
    subprocess.run(['hdfs','dfsadmin','-setQuota',str(files),p])
add_tenant('analytics-team', 20, 5_000_000)
38 Snapshot Age Auditor READ

Lists snapshots older than the retention policy. Stale snapshots silently pin blocks that would otherwise be freed.

import subprocess, datetime as dt, re
out=subprocess.check_output(['hdfs','dfs','-ls','/data/.snapshot']).decode()   # point at a snapshottable directory
for l in out.splitlines():
    m=re.match(r'.*?(\d{4}-\d{2}-\d{2})', l)
    if m:
        age=(dt.date.today()-dt.date.fromisoformat(m.group(1))).days
        if age>30: print(f"OLD SNAPSHOT ({age}d):", l[-80:])
39 Snapshot Diff Reporter READ

Compares two snapshots — shows how much data changed and what got deleted. Great for backup verification.

import subprocess
out=subprocess.check_output(['hdfs','snapshotDiff','/data/sales','s1','s2']).decode()
print(out)
40 Trash Cleaner (force expunge) WRITE

Trash retention is commonly set to 24 h (fs.trash.interval=1440), but trash often grows to TBs. Force expunge weekly during off-hours.

import subprocess
subprocess.run(['hdfs','dfs','-expunge'])
print("Trash expunged")
41 Per-User Trash Size Reporter READ

Surfaces users whose .Trash/ is wasting space. Send them a polite reminder.

import subprocess
out=subprocess.check_output(['hdfs','dfs','-du','-s','-h','/user/*/.Trash']).decode()
print(out)
42 ACL & Permission Drift Detector READ

Compares current ACLs against a golden reference file. Catches accidental chmod 777 on prod paths.

import subprocess, json
ref=json.load(open('/etc/hdfs_acl_baseline.json'))
for path,expected in ref.items():
    actual=subprocess.check_output(['hdfs','dfs','-getfacl',path]).decode()
    if expected not in actual:
        print(f"DRIFT: {path}\nexpected: {expected}\nactual:\n{actual}")

Section 7 — Balancer, I/O & Cluster Tuning (Scripts 43–50)

43 Cluster Imbalance Score READ

Compares each DN's % full to the cluster average. Anything > 10% off triggers the balancer.

import subprocess, re, statistics
out=subprocess.check_output(['hdfs','dfsadmin','-report']).decode()
nodes=re.findall(r'DFS Used%\s*:\s*([\d.]+)%', out)
pcts=[float(n) for n in nodes]
mean=statistics.mean(pcts)
worst=max(pcts,key=lambda p:abs(p-mean))
print(f"avg {mean:.1f}%  worst {worst:.1f}%  spread {max(pcts)-min(pcts):.1f}%")
44 Throttled HDFS Balancer Runner WRITE

Runs the balancer at off-peak with bandwidth capped to 50 MB/s/DN — prevents balancing from saturating the network.

import subprocess
subprocess.run(['hdfs','dfsadmin','-setBalancerBandwidth','52428800'])
subprocess.run(['hdfs','balancer','-threshold','5','-policy','datanode'])
45 Disk Balancer Plan Generator WRITE

Different from the cluster balancer — this rebalances across disks within one DN. Critical after replacing disks.

import subprocess
for dn in ['dn1','dn2','dn3']:
    subprocess.run(['hdfs','diskbalancer','-plan',dn])
    # -plan prints the path of the generated plan file; point -execute at that exact path
    subprocess.run(['hdfs','diskbalancer','-execute',f'/system/diskbalancer/{dn}.plan.json'])
46 Short-Circuit Read Verifier READ

Confirms dfs.client.read.shortcircuit=true is actually firing. Saves up to 30% on local reads (Spark, HBase).

import requests
for dn in ['dn1','dn2','dn3']:
    j=requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity-*").json()['beans'][0]
    sc=j.get('TotalReadsFromLocalClient',0)
    rd=j.get('BlocksRead',1)
    print(f"{dn}: {sc/rd*100:.1f}% short-circuit reads")
47 Pipeline Timeout Auditor READ

Greps DN logs for SocketTimeoutException on the write pipeline — usually means a slow disk or saturated NIC.

import subprocess, collections, glob
logs=glob.glob('/var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log')
out=subprocess.check_output(['grep','-h','SocketTimeoutException',*logs]).decode()
hosts=collections.Counter(l.split()[-1] for l in out.splitlines() if 'remote' in l)
for h,c in hosts.most_common(10): print(f"{c:>6}  {h}")
48 Configuration Drift Checker READ

Compares running config against the version-controlled baseline. Catches the "someone hot-edited hdfs-site.xml" problem.

import subprocess, json
baseline=json.load(open('/etc/hdfs_baseline.json'))
for k,v in baseline.items():
    actual=subprocess.check_output(['hdfs','getconf','-confKey',k]).decode().strip()
    if actual!=v: print(f"DRIFT {k}: expected={v} actual={actual}")
49 DistCp Throughput Tuner WRITE

Sweet spot for cross-cluster copy: many small mappers + bandwidth cap. This script estimates mappers from source size.

import subprocess
SRC='/data/2024'; DST='hdfs://dr-namenode:9820/data/2024'
size_bytes=int(subprocess.check_output(['hdfs','dfs','-du','-s',SRC]).decode().split()[0])
size_gb=max(1, size_bytes//1024**3)
mappers=min(200, max(20, size_gb//5))
subprocess.run(['hadoop','distcp','-m',str(mappers),'-bandwidth','100','-update',SRC,DST])
50 End-to-End Cluster Health Dashboard (one script) READ

Combines the most useful checks above into a single morning-coffee script. Pipe its output to Slack.

import requests, subprocess
nn="http://namenode-host:9870"
def jmx(q): return requests.get(f"{nn}/jmx?qry={q}").json()['beans'][0]
fs=jmx("Hadoop:service=NameNode,name=FSNamesystem")
state=jmx("Hadoop:service=NameNode,name=FSNamesystemState")   # dead/stale DataNode counts live here
mem=jmx("java.lang:type=Memory")['HeapMemoryUsage']
print("=== HDFS Daily Report ===")
print(f"Files     : {fs['FilesTotal']:>12,}")
print(f"Blocks    : {fs['BlocksTotal']:>12,}")
print(f"Missing   : {fs['MissingBlocks']:>12,}")
print(f"UnderRepl : {fs['UnderReplicatedBlocks']:>12,}")
print(f"NN Heap   : {mem['used']/mem['max']*100:>11.1f}%")
print(f"DeadNodes : {state['NumDeadDataNodes']:>12,}")
print(f"StaleNodes: {state['NumStaleDataNodes']:>12,}")
fsck=subprocess.check_output(['hdfs','fsck','/','-list-corruptfileblocks']).decode()
print(f"Corrupt   : {fsck.count('blk_'):>12,}")

How To Productionise These Scripts

📅 Schedule (cron / Airflow; see the DAG sketch after this list)

  • Hourly: 1, 3, 10, 15, 20 (NN health, DN heartbeat, RPC)
  • Daily 06:00: 2, 22, 27, 29, 33, 36, 50
  • Weekly Sun 02:00: 17, 23, 24, 25, 31, 38, 40, 44
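If you schedule with Airflow, a minimal hourly DAG sketch could look like this (the DAG id, script names, and /opt/hdfs-toolkit path are placeholders; the daily and weekly tiers get their own DAGs the same way):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id='hdfs_hourly_checks', start_date=datetime(2026, 1, 1),
         schedule_interval='@hourly', catchup=False) as dag:
    for name in ['01_nn_heap', '03_rpc_queue', '10_dn_heartbeat', '15_fsck_report', '20_repl_lag']:
        BashOperator(task_id=name, bash_command=f'python /opt/hdfs-toolkit/{name}.py')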

🚨 Alert thresholds

  • NN heap > 80% — page
  • Missing blocks > 0 — page
  • RPC queue > 50 ms — warn
  • Imbalance > 15% — warn
  • Cold data > 30% — review monthly
✅ Quick wins: pick these 5 first

Most clusters see immediate impact from #1 NN Heap monitor, #15 fsck reporter, #22 small-file top-N, #43 imbalance score, and #50 daily health dashboard. Wire those into Slack first (a webhook sketch follows), then add the rest.
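Posting to Slack needs nothing more than an incoming webhook; here is a minimal sketch, where the webhook URL and script path are placeholders:

import subprocess, requests
WEBHOOK = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'      # placeholder: create an incoming webhook in Slack
report = subprocess.check_output(['python', '/opt/hdfs-toolkit/50_daily_dashboard.py']).decode()  # hypothetical path
requests.post(WEBHOOK, json={'text': f"```{report}```"})      # triple backticks render as a code block in Slack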

Frequently Asked Questions

FAQ Do I need root or hdfs user to run these?

Read-only JMX scripts work as any user with HTTP access to NameNode/DataNode UIs. The hdfs dfsadmin, fsck, balancer, and storagepolicies commands require the hdfs superuser. Run them via sudo -u hdfs or schedule under that user in cron.

FAQ How do I run these on a Kerberized cluster?

Two changes: replace InsecureClient with KerberosClient from hdfs.ext.kerberos, and run kinit with a keytab before invoking the script. For JMX queries, install requests-kerberos and pass HTTPKerberosAuth() as the auth parameter.
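A minimal sketch of both changes (it assumes you have already run kinit and that SPNEGO is enabled on the NameNode web UI):

from hdfs.ext.kerberos import KerberosClient
import requests
from requests_kerberos import HTTPKerberosAuth

client = KerberosClient('http://namenode-host:9870')          # WebHDFS calls now use your Kerberos ticket
jmx = requests.get('http://namenode-host:9870/jmx?qry=java.lang:type=Memory',
                   auth=HTTPKerberosAuth()).json()            # SPNEGO auth for the JMX servlet
print(jmx['beans'][0]['HeapMemoryUsage']['used'])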

FAQ Will these work on Cloudera CDP / HDP / EMR?

Yes — the JMX endpoints, hdfs CLI, and WebHDFS API are identical across distributions. The only thing that changes is the NameNode hostname/port (CDP often uses 9871 with TLS; EMR uses 9870 over HTTP).

FAQ What about cloud-native HDFS replacements (S3, ADLS, GCS)?

Most monitoring scripts (small files, replication) don't apply, but the storage analysis ones (cold data, dup hash, top-N largest) translate directly using boto3 / azure-storage-blob / google-cloud-storage. Same logic, different SDK.
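For example, the cold-data finder (script 30) translates to S3 roughly like this (the bucket name and prefix are placeholders):

import boto3, datetime as dt
s3 = boto3.client('s3')
cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=180)
cold = 0
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-data-lake', Prefix='data/'):   # placeholders
    for obj in page.get('Contents', []):
        if obj['LastModified'] < cutoff:
            cold += obj['Size']
print(f"Cold data in S3: {cold/1e12:.2f} TB not modified in 180 days")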

📦 Want These as a Ready-To-Run Repo?

Get the full HDFS Performance Toolkit — all 50 scripts, packaged with a Makefile, a Dockerfile, an Airflow DAG, and Slack alert templates. Plus 100 Hadoop interview questions and the 300Q Big Data PDF bundle.

Read More on the Blog → Get the 300Q PDF Bundle
