50 Python Scripts for HDFS Performance Tuning (2026)
A field-tested toolkit for monitoring, diagnosing and tuning HDFS clusters. Every script is short, copy-paste ready, and labelled READ (safe) or WRITE (mutating — dry-run first).
📋 What's Covered (50 Scripts in 8 Sections)
- Setup & Prerequisites
- NameNode Health & Heap (Scripts 1–7)
- DataNode Performance & Disk (Scripts 8–14)
- Blocks & Replication (Scripts 15–21)
- Small Files Detection & Cleanup (Scripts 22–28)
- Storage Analysis & Reporting (Scripts 29–35)
- Quota, Snapshots & Trash (Scripts 36–42)
- Balancer, I/O & Cluster Tuning (Scripts 43–50)
Setup & Prerequisites
All scripts use a small set of well-known libraries. Install once, then every snippet below runs as-is.
pip install hdfs requests pandas python-dateutil tabulate snakebite-py3
# Optional, for Kerberized clusters
pip install requests-kerberos
🟢 READ-ONLY scripts
- Monitoring, fsck parsing, JMX queries
- Block reports, quota reports
- Safe to schedule in cron
🟠 WRITE / MUTATING scripts
- Compaction, deletion, rebalancing
- Always dry-run first
- Run during low-traffic window
from hdfs import InsecureClient
client = InsecureClient('http://namenode-host:9870', user='hdfs')
For Kerberos: from hdfs.ext.kerberos import KerberosClient.
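A minimal sketch of the Kerberized setup, assuming you have already obtained a ticket with kinit and WebHDFS is on the usual 9870 port:
import requests
# Kerberized variant — assumes a valid ticket (kinit with a keytab) is in the credential cache.
from hdfs.ext.kerberos import KerberosClient
client = KerberosClient('http://namenode-host:9870')
print(client.status('/'))   # quick smoke test: fetch root directory metadata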
Section 1 — NameNode Health & Heap (Scripts 1–7)
Pulls live JMX heap stats. NameNode heap pressure is the #1 silent killer — once it crosses 80% you start seeing GC pauses that block every client RPC.
import requests, json
url = "http://namenode-host:9870/jmx?qry=java.lang:type=Memory"
heap = requests.get(url).json()['beans'][0]['HeapMemoryUsage']
used_gb = heap['used'] / 1024**3
max_gb = heap['max'] / 1024**3
pct = used_gb / max_gb * 100
print(f"NN Heap: {used_gb:.1f}/{max_gb:.1f} GB ({pct:.1f}%)")
if pct > 80: print("WARN: NameNode heap above 80%")
Tracks total inodes — every file + directory + block consumes ~150 bytes of NN heap. Keep a daily snapshot to catch runaway growth before it crashes the NN.
import requests, csv, datetime as dt
j = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState").json()
b = j['beans'][0]
row = [dt.date.today(), b['FilesTotal'], b['BlocksTotal'], b['CapacityUsed']]
with open('/var/log/hdfs_growth.csv', 'a') as f:
csv.writer(f).writerow(row)
print("Files:", b['FilesTotal'], " Blocks:", b['BlocksTotal'])
If RpcQueueTimeAvgTime climbs above 50 ms, clients are waiting on the NN — usually GC or lock contention. Alert immediately.
import requests
m = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020").json()['beans'][0]
qt = m['RpcQueueTimeAvgTime']
print(f"RPC queue avg: {qt:.2f} ms")
if qt > 50: print("ALERT: NameNode is slow to respond")
Run during cluster startup or after maintenance. If NN stays in safe mode, your block report is incomplete or under-replicated.
import subprocess
out = subprocess.check_output(['hdfs','dfsadmin','-safemode','get']).decode()
print(out)
if 'ON' in out:
print("WARN: cluster still in safe mode")
Standby NN must apply edits to stay in sync. Lag > 1000 edits means failover will be slow — investigate JournalNode disk speed.
import requests, json
def txid(host):
    # JournalTransactionInfo on the NameNodeInfo bean is a JSON string holding the last applied/written txid
    j = requests.get(f"http://{host}:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo").json()['beans'][0]
    return int(json.loads(j['JournalTransactionInfo'])['LastAppliedOrWrittenTxId'])
lag = txid('active-nn') - txid('standby-nn')   # placeholder hostnames
print("Standby edit lag (transactions):", lag)
if lag > 1000: print("WARN: standby is falling behind — check JournalNode disks")
Long G1 pauses freeze the NameNode. Scrape the G1 Old Generation collector's CollectionTime every minute and alert on deltas above 2 s/min.
import requests, time
def gc(): return requests.get("http://namenode-host:9870/jmx?qry=java.lang:type=GarbageCollector,name=G1%20Old%20Generation").json()['beans'][0]
a = gc(); time.sleep(60); b = gc()
delta_ms = b['CollectionTime'] - a['CollectionTime']
print(f"G1 Old GC in last 60s: {delta_ms} ms")
if delta_ms > 2000: print("ALERT: GC pressure")
Parses ZKFC log for HA failover events — frequent flips signal an unstable cluster (often network or JN disk).
import re, glob
# open() does not expand wildcards — glob the ZKFC log files first
for logfile in glob.glob('/var/log/hadoop/hdfs/hadoop-hdfs-zkfc-*.log'):
    with open(logfile) as f:
        for line in f:
            if 'Becoming active' in line or 'Becoming standby' in line:
                print(re.match(r'^(\S+ \S+).*', line).group(1), '-', line.strip()[-80:])
Section 2 — DataNode Performance & Disk (Scripts 8–14)
Surfaces the imbalance between disks inside one DataNode — fixed by enabling the disk balancer (dfs.disk.balancer.enabled=true).
import requests
for dn in ['dn1:9864','dn2:9864','dn3:9864']:
j = requests.get(f"http://{dn}/jmx?qry=Hadoop:service=DataNode,name=FSDatasetState").json()
vols = j['beans'][0]['StorageInfo']
print(dn, '->', vols[:200])
Compares SendDataPacketAvgTime across all DNs. Anything 3× the median is dragging your Spark/MR shuffle.
import requests, statistics
hosts = ['dn1','dn2','dn3','dn4']
lat = {h: requests.get(f"http://{h}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity-*").json()['beans'][0]['SendDataPacketAvgTime'] for h in hosts}
med = statistics.median(lat.values())
for h,v in lat.items():
if v > 3*med: print(f"SLOW DN: {h} = {v:.1f}ms (median {med:.1f}ms)")
Stale DataNodes (heartbeat > 30 s) cause the NN to mark blocks under-replicated. Alert before they go DEAD at 10 min.
import requests
# NumStaleDataNodes / NumDeadDataNodes live on the FSNamesystemState bean
j = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState").json()['beans'][0]
print("Stale DNs:", j['NumStaleDataNodes'], " Dead:", j['NumDeadDataNodes'])
Run on every DN — flags disks where reallocated sectors are growing. Replace before HDFS marks them failed.
import subprocess, re
out = subprocess.check_output(['smartctl','-A','/dev/sda']).decode()
m = re.search(r'Reallocated_Sector_Ct.*?(\d+)\s*$', out, re.M)
if m and int(m.group(1)) > 5:
print("Disk /dev/sda has reallocated sectors:", m.group(1))
Each in-flight read/write uses one xceiver thread. Default cap is 4096 — once you hit 80%, raise dfs.datanode.max.transfer.threads.
import requests
for dn in ['dn1','dn2','dn3']:
j = requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeInfo").json()['beans'][0]
print(dn, "Xceivers:", j['XceiverCount'])
Polls each DN for failed volumes — a DN with a failed disk is still alive but at reduced capacity.
import requests
for dn in ['dn1','dn2','dn3']:
j = requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=FSDatasetState").json()['beans'][0]
if j.get('NumFailedVolumes',0) > 0:
print(f"FAILED VOL on {dn}: {j['NumFailedVolumes']}")
Pulls bytes-read and bytes-written per minute. Use it to spot cold DNs (rack mis-config) or hot DNs (skewed placement).
import requests, time
def get(dn): return requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity-*").json()['beans'][0]
a={h:get(h) for h in ['dn1','dn2','dn3']}
time.sleep(60)
b={h:get(h) for h in ['dn1','dn2','dn3']}
for h in a:
rd=(b[h]['BytesRead']-a[h]['BytesRead'])/1024**2
wr=(b[h]['BytesWritten']-a[h]['BytesWritten'])/1024**2
print(f"{h}: read {rd:.1f} MB/min, write {wr:.1f} MB/min")
Section 3 — Blocks & Replication (Scripts 15–21)
Parses hdfs fsck / output. Missing blocks = data loss — page on-call immediately.
import subprocess, re
out = subprocess.check_output(['hdfs','fsck','/','-files','-blocks']).decode()
miss = int(re.search(r'Missing blocks:\s+(\d+)', out).group(1))
corr = int(re.search(r'Corrupt blocks:\s+(\d+)', out).group(1))
under= int(re.search(r'Under[- ]replicated blocks:\s+(\d+)', out).group(1))  # hyphenated in newer fsck output
print(f"missing={miss} corrupt={corr} under-replicated={under}")
if miss or corr: print("PAGE ONCALL")
Forces re-replication for files stuck under-replicated (e.g., after a DataNode decommission). The apply line is commented out — review the printed list before uncommenting.
import subprocess
# fsck prints one 'Under replicated' line per affected file; re-setting the target
# replication nudges the NameNode to schedule the missing copies.
out = subprocess.check_output(['hdfs','fsck','/']).decode()
for line in out.splitlines():
    if 'Under replicated' in line:
        path = line.split(':')[0].strip()
        print('setrep 3 on', path)
        # subprocess.run(['hdfs','dfs','-setrep','3',path])  # uncomment to apply
Files left at replication 5 from old jobs waste storage. Drop them back to 3 — saves ~40% on those paths.
from hdfs import InsecureClient
client = InsecureClient('http://namenode-host:9870', user='hdfs')
for path,_,files in client.walk('/data/legacy'):
for f in files:
full=f"{path}/{f}"
st=client.status(full)
if st['replication']>3:
print(f"setrep 3: {full}")
# client.set_replication(full, 3)
Verifies the 3 replicas of every block are split across at least 2 racks. Same-rack replicas defeat HDFS's fault tolerance.
import subprocess, re
out = subprocess.check_output(['hdfs','fsck','/important','-files','-blocks','-racks']).decode()
for blk in re.findall(r'BP-[^\s]+ .*?\n', out):
    # capture only the rack name from each "/rack/host:port" replica location
    racks = re.findall(r'/([^/\s,\]]+)/[^/\s,\]]+', blk)
    if racks and len(set(racks)) < 2: print("SAME-RACK:", blk[:80])
Histogram of file sizes. Lots of files < 64 MB on a 128 MB block size means you're wasting NN heap and reading inefficiently.
from hdfs import InsecureClient
import collections
client = InsecureClient('http://namenode-host:9870', user='hdfs')
buckets=collections.Counter()
for p,_,files in client.walk('/data'):
for f in files:
sz=client.status(f"{p}/{f}")['length']
b='<1MB' if sz<1e6 else '1-64MB' if sz<64e6 else '64-128MB' if sz<128e6 else '>128MB'
buckets[b]+=1
print(buckets)
Watches PendingReplicationBlocks — sustained values > 0 for hours mean DNs can't keep up.
import requests
j = requests.get("http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem").json()['beans'][0]
print("Pending:", j['PendingReplicationBlocks'], " UnderRepl:", j['UnderReplicatedBlocks'])
Converts cold paths from 3× replication to RS-6-3-1024k erasure coding — saves 50% storage. Only safe for read-mostly data.
import subprocess
COLD = '/data/archive/2023'
subprocess.run(['hdfs','ec','-setPolicy','-path',COLD,'-policy','RS-6-3-1024k'])
# rewrite existing files so the policy applies
subprocess.run(['hadoop','distcp','-update','-skipcrccheck',COLD,COLD+'_ec'])
Section 4 — Small Files Detection & Cleanup (Scripts 22–28)
Lists the top directories by count of files smaller than 32 MB — the worst offenders against the NN.
from hdfs import InsecureClient
import collections
client = InsecureClient('http://namenode-host:9870', user='hdfs')
counts=collections.Counter()
for p,_,files in client.walk('/'):
for f in files:
if client.status(f"{p}/{f}")['length'] < 32*1024*1024:
counts[p]+=1
for d,c in counts.most_common(20): print(f"{c:>8} {d}")
Bundles thousands of small files into a single .har archive — readable from MR/Spark, drops NN inode count by 1000×.
import subprocess
SRC = '/raw/2023'
subprocess.run(['hadoop','archive','-archiveName','raw2023.har','-p','/raw','2023','/archive'])
# delete originals only after verifying the har
# subprocess.run(['hdfs','dfs','-rm','-r',SRC])
The right fix for partitioned tables — read-coalesce-write. Targets ~256 MB output files.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('compact').getOrCreate()
df = spark.read.parquet('/data/events/dt=2024-12-01')
n = max(1, int(df.count() * 200 / (256 * 1024 * 1024)))  # ~200 bytes/row estimate, targeting ~256 MB files
df.coalesce(n).write.mode('overwrite').parquet('/data/events/dt=2024-12-01')
Empty files still cost NN heap. Sweep them weekly.
from hdfs import InsecureClient
client = InsecureClient('http://namenode-host:9870', user='hdfs')
for p,dirs,files in client.walk('/tmp'):
for f in files:
if client.status(f"{p}/{f}")['length']==0:
print('rm', f"{p}/{f}")
# client.delete(f"{p}/{f}")
Old approach but still useful for one-off log dumps. Bundles many tiny logs into one binary file with key=filename.
import subprocess, glob
# subprocess does not expand wildcards — locate the streaming jar first
streaming_jar = glob.glob('/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar')[0]
subprocess.run(['hadoop','jar', streaming_jar,
                '-input','/logs/2024-12','-output','/logs/2024-12.seq',
                '-mapper','/bin/cat','-reducer','NONE',
                '-outputformat','org.apache.hadoop.mapred.SequenceFileOutputFormat'])
Names and shames the team or service account creating the most small files. Best motivational tool you have.
from hdfs import InsecureClient
import collections
client = InsecureClient('http://namenode-host:9870', user='hdfs')
by_owner=collections.Counter()
for p,_,files in client.walk('/'):
for f in files:
st=client.status(f"{p}/{f}")
if st['length']<16*1024*1024:
by_owner[st['owner']]+=1
for o,c in by_owner.most_common(10): print(f"{c:>10} {o}")
Streaming jobs love producing one tiny file per micro-batch. Apply this in foreachBatch to keep files at ~128 MB.
def write_batch(df, batch_id):
    # coalesce each micro-batch; tune the factor so output files land near ~128 MB
    df.coalesce(4).write.mode('append').parquet('/stream/events')
# 'streaming' is your existing streaming DataFrame (e.g. spark.readStream...)
streaming.writeStream.foreachBatch(write_batch).start()
Section 5 — Storage Analysis & Reporting (Scripts 29–35)
Cluster-wide equivalent of du --max-depth=1 on the root. Run weekly and post to a Slack channel.
import subprocess
# use raw byte counts (-du without -h) so sorting is numeric, then print in GB
out = subprocess.check_output(['hdfs','dfs','-du','/']).decode().splitlines()
sized = [(int(l.split()[0]), l.split()[-1]) for l in out if l.strip()]
for sz, path in sorted(sized, reverse=True)[:20]:
    print(f"{sz/1e9:>10.1f} GB  {path}")
Finds candidates for archive tier or erasure coding. Most clusters have 30%+ data untouched in 6 months.
from hdfs import InsecureClient
import datetime as dt
client = InsecureClient('http://namenode-host:9870', user='hdfs')
cutoff=(dt.datetime.now()-dt.timedelta(days=180)).timestamp()*1000
cold=0
for p,_,files in client.walk('/data'):
for f in files:
st=client.status(f"{p}/{f}")
        if st['modificationTime'] < cutoff:
            cold += st['length']
print(f"Cold data (untouched > 180 days): {cold/1e12:.2f} TB")
Sets storage policy on cold paths and runs the Mover to physically migrate blocks to ARCHIVE-typed disks.
import subprocess
COLD = '/data/archive'
subprocess.run(['hdfs','storagepolicies','-setStoragePolicy','-path',COLD,'-policy','COLD'])
subprocess.run(['hdfs','mover','-p',COLD])
Uses HDFS's built-in checksum to find byte-identical files across the cluster. Surprisingly common after careless distcp.
from hdfs import InsecureClient
import collections
client=InsecureClient('http://namenode-host:9870',user='hdfs')
dup=collections.defaultdict(list)
for p,_,files in client.walk('/data'):
for f in files:
c=client.checksum(f"{p}/{f}")
dup[c['bytes']].append(f"{p}/{f}")
for h,paths in dup.items():
if len(paths)>1: print('DUP:', paths)
Multiplies per-user bytes by replication factor and a $/GB/month rate. Send to finance.
from hdfs import InsecureClient
import collections
client=InsecureClient('http://namenode-host:9870',user='hdfs')
RATE=0.023 # $/GB/month
by={}
for p,_,files in client.walk('/'):
for f in files:
st=client.status(f"{p}/{f}")
b=st['length']*st['replication']
by[st['owner']]=by.get(st['owner'],0)+b
for u,b in sorted(by.items(),key=lambda x:-x[1])[:15]:
print(f"{u:<20} {b/1e9:>10.1f} GB ${b/1e9*RATE:>8.2f}/mo")
Linear projection of when the cluster will fill at the current 7-day growth rate.
import csv, datetime as dt
rows=list(csv.reader(open('/var/log/hdfs_growth.csv')))[-7:]
days=[(dt.datetime.fromisoformat(r[0])-dt.datetime.fromisoformat(rows[0][0])).days for r in rows]
used=[int(r[3])/1e12 for r in rows]
slope = (used[-1] - used[0]) / max(1, days[-1])   # TB per day over the last week
TOTAL_TB = 200                                     # set to your cluster's raw capacity
free_tb = TOTAL_TB - used[-1]
print(f"Days until full: {free_tb/slope:.0f}" if slope > 0 else "No growth over the last 7 days")
Compares hdfs dfs -du snapshots day over day. Finds the path eating storage the fastest.
import subprocess, json, os
today={l.split()[2]:int(l.split()[0]) for l in subprocess.check_output(['hdfs','dfs','-du','/data']).decode().splitlines()}
yest=json.load(open('/tmp/du_yest.json')) if os.path.exists('/tmp/du_yest.json') else {}
for p,sz in today.items():
delta=sz-yest.get(p,sz)
if delta>10*1e9: print(f"{p}: +{delta/1e9:.1f} GB in 24h")
json.dump(today,open('/tmp/du_yest.json','w'))
Section 6 — Quota, Snapshots & Trash (Scripts 36–42)
Lists every quota-enabled path with current consumption — flag anything > 90%.
import subprocess
# -count -q columns: QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
out = subprocess.check_output(['hdfs','dfs','-count','-q','/data/*']).decode()
for l in out.splitlines():
    parts = l.split()
    if len(parts) >= 8 and parts[2] != 'none':
        quota_gb, used_gb = int(parts[2])/1e9, int(parts[6])/1e9
        flag = '  <-- over 90%' if used_gb > 0.9 * quota_gb else ''
        print(f"{parts[-1]:<40} {used_gb:,.1f} / {quota_gb:,.1f} GB{flag}")
Creates a tenant directory and applies space + name quotas in one shot — prevents one team filling the cluster.
import subprocess
def add_tenant(name, space_tb=10, files=1_000_000):
p=f'/tenants/{name}'
subprocess.run(['hdfs','dfs','-mkdir','-p',p])
subprocess.run(['hdfs','dfsadmin','-setSpaceQuota',f'{space_tb}t',p])
subprocess.run(['hdfs','dfsadmin','-setQuota',str(files),p])
add_tenant('analytics-team', 20, 5_000_000)
Lists snapshots older than the retention policy. Stale snapshots silently lock blocks from deletion.
import subprocess, datetime as dt, re
# .snapshot only exists under snapshottable paths, so enumerate those first
dirs = [l.split()[-1] for l in subprocess.check_output(['hdfs','lsSnapshottableDir']).decode().splitlines() if l.strip()]
for d in dirs:
    out = subprocess.check_output(['hdfs','dfs','-ls',f'{d}/.snapshot']).decode()
    for l in out.splitlines():
        m = re.search(r'(\d{4}-\d{2}-\d{2})', l)
        if m:
            age = (dt.date.today() - dt.date.fromisoformat(m.group(1))).days
            if age > 30: print(f"OLD SNAPSHOT ({age}d):", l[-80:])
Compares two snapshots — shows how much data changed and what got deleted. Great for backup verification.
import subprocess
out = subprocess.check_output(['hdfs','snapshotDiff','/data/sales','s1','s2']).decode()
print(out)
Trash retention (fs.trash.interval) is commonly configured to 24 h, but trash often grows to TBs. Force expunge weekly during off-hours.
import subprocess
subprocess.run(['hdfs','dfs','-expunge'])
print("Trash expunged")
Surfaces users whose .Trash/ is wasting space. Send them a polite reminder.
import subprocess
out = subprocess.check_output(['hdfs','dfs','-du','-s','-h','/user/*/.Trash']).decode()
print(out)
Compares current ACLs against a golden reference file. Catches accidental chmod 777 on prod paths.
import subprocess, json
ref=json.load(open('/etc/hdfs_acl_baseline.json'))
for path,expected in ref.items():
actual=subprocess.check_output(['hdfs','dfs','-getfacl',path]).decode()
if expected not in actual:
print(f"DRIFT: {path}\nexpected: {expected}\nactual:\n{actual}")
Section 7 — Balancer, I/O & Cluster Tuning (Scripts 43–50)
Compares each DN's % full to the cluster average. Anything > 10% off triggers the balancer.
import subprocess, re, statistics
out=subprocess.check_output(['hdfs','dfsadmin','-report']).decode()
nodes = re.findall(r'DFS Used%\s*:\s*([\d.]+)%', out)
pcts = [float(n) for n in nodes[1:]]   # skip the first match (cluster-wide summary line)
mean=statistics.mean(pcts)
worst=max(pcts,key=lambda p:abs(p-mean))
print(f"avg {mean:.1f}% worst {worst:.1f}% spread {max(pcts)-min(pcts):.1f}%")
Runs the balancer at off-peak with bandwidth capped to 50 MB/s/DN — prevents balancing from saturating the network.
import subprocess
subprocess.run(['hdfs','dfsadmin','-setBalancerBandwidth','52428800'])   # 50 MB/s per DN
subprocess.run(['hdfs','balancer','-threshold','5','-policy','datanode'])
Different from the cluster balancer — this rebalances across disks within one DN. Critical after replacing disks.
import subprocess
for dn in ['dn1','dn2','dn3']:
    subprocess.run(['hdfs','diskbalancer','-plan',dn])
    # -plan prints where it wrote the plan JSON (a timestamped dir under /system/diskbalancer);
    # pass that exact path to -execute
    subprocess.run(['hdfs','diskbalancer','-execute',f'/system/diskbalancer/{dn}.plan.json'])
Confirms dfs.client.read.shortcircuit=true is actually firing. Saves up to 30% on local reads (Spark, HBase).
import requests
for dn in ['dn1','dn2','dn3']:
j=requests.get(f"http://{dn}:9864/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity-*").json()['beans'][0]
sc=j.get('TotalReadsFromLocalClient',0)
rd=j.get('BlocksRead',1)
print(f"{dn}: {sc/rd*100:.1f}% short-circuit reads")
Greps DN logs for SocketTimeoutException on the write pipeline — usually means a slow disk or saturated NIC.
import subprocess, glob, collections
# subprocess does not expand wildcards — glob the DN log files first
logs = glob.glob('/var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log')
out = subprocess.check_output(['grep','-h','SocketTimeoutException'] + logs).decode()
hosts = collections.Counter(l.split()[-1] for l in out.splitlines() if 'remote' in l)
for h,c in hosts.most_common(10): print(f"{c:>6} {h}")
Compares running config against the version-controlled baseline. Catches the "someone hot-edited hdfs-site.xml" problem.
import subprocess, json
baseline = json.load(open('/etc/hdfs_baseline.json'))
for k, v in baseline.items():
    actual = subprocess.check_output(['hdfs','getconf','-confKey',k]).decode().strip()
    if actual != v: print(f"DRIFT {k}: expected={v} actual={actual}")
Sweet spot for cross-cluster copy: many small mappers + bandwidth cap. This script estimates mappers from source size.
import subprocess
SRC = '/data/2024'; DST = 'hdfs://dr-namenode:9820/data/2024'
size_bytes = int(subprocess.check_output(['hdfs','dfs','-du','-s',SRC]).decode().split()[0])
size_gb = size_bytes // 1024**3
mappers = min(200, max(20, size_gb // 5))
subprocess.run(['hadoop','distcp','-m',str(mappers),'-bandwidth','100','-update',SRC,DST])
Combines the most useful checks above into a single morning-coffee script. Pipe its output to Slack.
import requests, subprocess
nn="http://namenode-host:9870"
def jmx(q): return requests.get(f"{nn}/jmx?qry={q}").json()['beans'][0]
fs=jmx("Hadoop:service=NameNode,name=FSNamesystem")
mem=jmx("java.lang:type=Memory")['HeapMemoryUsage']
print("=== HDFS Daily Report ===")
print(f"Files : {fs['FilesTotal']:>12,}")
print(f"Blocks : {fs['BlocksTotal']:>12,}")
print(f"Missing : {fs['MissingBlocks']:>12,}")
print(f"UnderRepl : {fs['UnderReplicatedBlocks']:>12,}")
print(f"NN Heap : {mem['used']/mem['max']*100:>11.1f}%")
print(f"DeadNodes : {fs['NumDeadDataNodes']:>12,}")
print(f"StaleNodes: {fs['NumStaleDataNodes']:>12,}")
fsck=subprocess.check_output(['hdfs','fsck','/','-list-corruptfileblocks']).decode()
print(f"Corrupt : {fsck.count('blk_'):>12,}")
How To Productionise These Scripts
📅 Schedule (cron / Airflow — see the DAG sketch after this list)
- Hourly: 1, 3, 10, 15, 20 (NN health, DN heartbeat, RPC)
- Daily 06:00: 2, 22, 27, 29, 33, 36, 50
- Weekly Sun 02:00: 17, 23, 24, 25, 31, 38, 40, 44
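A minimal Airflow sketch for the hourly tier — the DAG name and script paths below are placeholders for wherever you save these snippets:
# Hypothetical scheduling sketch (Airflow 2.x); adjust paths and task list to your setup.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("hdfs_hourly_checks", start_date=datetime(2026, 1, 1),
         schedule_interval="@hourly", catchup=False) as dag:
    BashOperator(task_id="nn_heap",   bash_command="python /opt/hdfs-toolkit/nn_heap.py")
    BashOperator(task_id="rpc_queue", bash_command="python /opt/hdfs-toolkit/rpc_queue.py")
    BashOperator(task_id="stale_dns", bash_command="python /opt/hdfs-toolkit/stale_datanodes.py")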
🚨 Alert thresholds (see the checker sketch after this list)
- NN heap > 80% — page
- Missing blocks > 0 — page
- RPC queue > 50 ms — warn
- Imbalance > 15% — warn
- Cold data > 30% — review monthly
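A minimal checker that encodes the paging thresholds above and posts to Slack — the webhook URL is a placeholder:
import requests
NN = "http://namenode-host:9870"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder — use your own webhook
def jmx(q): return requests.get(f"{NN}/jmx?qry={q}").json()['beans'][0]
mem = jmx("java.lang:type=Memory")['HeapMemoryUsage']
fs  = jmx("Hadoop:service=NameNode,name=FSNamesystem")
rpc = jmx("Hadoop:service=NameNode,name=RpcActivityForPort8020")
alerts = []
if mem['used'] / mem['max'] * 100 > 80: alerts.append("PAGE: NN heap above 80%")
if fs['MissingBlocks'] > 0:             alerts.append("PAGE: missing blocks detected")
if rpc['RpcQueueTimeAvgTime'] > 50:     alerts.append("WARN: RPC queue above 50 ms")
if alerts:
    requests.post(SLACK_WEBHOOK, json={"text": "\n".join(alerts)})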
Frequently Asked Questions
What permissions do the scripts need?
Read-only JMX scripts work as any user with HTTP access to the NameNode/DataNode UIs. The hdfs dfsadmin, fsck, balancer, and storagepolicies commands require the hdfs superuser. Run them via sudo -u hdfs or schedule them under that user in cron.
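For the superuser-only commands, wrapping the call works like this — assuming the invoking account has passwordless sudo to hdfs:
import subprocess
# Assumes passwordless sudo to the hdfs account on this host.
out = subprocess.check_output(['sudo','-u','hdfs','hdfs','dfsadmin','-report']).decode()
print(out.splitlines()[0])   # e.g. "Configured Capacity: ..."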
How do I run these against a Kerberized cluster?
Two changes: replace InsecureClient with KerberosClient from hdfs.ext.kerberos, and run kinit with a keytab before invoking the script. For JMX queries, install requests-kerberos and pass HTTPKerberosAuth() as the auth parameter.
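A hedged sketch of the JMX side with requests-kerberos — this assumes SPNEGO is enabled on the NameNode web UI and a valid ticket is in the cache:
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL
# Assumes SPNEGO-protected web UIs; switch to https/9871 if your distro enables TLS.
url = "http://namenode-host:9870/jmx?qry=java.lang:type=Memory"
heap = requests.get(url, auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL)).json()['beans'][0]['HeapMemoryUsage']
print(f"NN heap used: {heap['used']/1024**3:.1f} GB")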
Do these work on CDP, EMR, and other distributions?
Yes — the JMX endpoints, hdfs CLI, and WebHDFS API are identical across distributions. The only thing that changes is the NameNode hostname/port (CDP often uses 9871 with TLS; EMR uses 9870 over HTTP).
Can I reuse these for S3, ADLS, or GCS?
Most of the HDFS-specific monitoring scripts (small files, replication) don't apply, but the storage-analysis ones (cold data, duplicate hashes, top-N largest) translate directly using boto3 / azure-storage-blob / google-cloud-storage. Same logic, different SDK.
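For example, the top-N-largest logic maps onto S3 with boto3 roughly like this — the bucket name is hypothetical:
import boto3
# Same "sort by size, print the top 20" idea as the HDFS version; 'my-data-lake' is a placeholder bucket.
s3 = boto3.client('s3')
objs = []
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-data-lake'):
    objs.extend(page.get('Contents', []))
for o in sorted(objs, key=lambda x: x['Size'], reverse=True)[:20]:
    print(f"{o['Size']/1e9:>8.2f} GB  {o['Key']}")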
📦 Want These as a Ready-To-Run Repo?
Get the full HDFS Performance Toolkit — all 50 scripts, packaged with a Makefile, a Dockerfile, an Airflow DAG, and Slack alert templates. Plus 100 Hadoop interview questions and the 300Q Big Data PDF bundle.
Read More on the Blog → Get the 300Q PDF Bundle