Tuesday 30 August 2016

Gathering Logs from "Nutanix NCC Health Check"

To gather the logs from Nutanix NCC, follow the simple steps below. In version 4.5, NCC is built into the Nutanix controller VM, so no additional installation steps are required.

1) Log in to the Nutanix cluster using the admin username and password.

2) Go to Home => Hardware => Table. The controller VM IPs are listed here, in case you don't remember them that very nanosecond. :-)

3) Next, SSH to one of the controller VMs using PuTTY. The default credentials are username nutanix and password nutanix/4u, in case the password has not been changed.

4) Run the command : ncc health_checks run_all

5) Take a break, come back, and hopefully the health check will have completed. Now we need to move the logs to our local computer in case you need to share them.

6) I use WinSCP to download the logs from the controller VM to my local computer. Install WinSCP and, when prompted for the IP, enter the controller VM IP address. Don't forget to select SCP and enter the same username and password.

7) Browse to the path below, where the log is stored. You can copy it by dragging and dropping it to your local computer location on the left:
/home/nutanix/data/logs/ncc-output-latest.log
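
For anyone who prefers a script to PuTTY and WinSCP, steps 4 to 7 could be sketched in Python roughly as below. This is only a sketch, assuming an OpenSSH-style `ssh` and `scp` client is available on the local machine. The CVM IP `10.0.0.10` and the helper names are made up for illustration, and depending on how the CVM shell is set up, `ncc` may not be on the PATH of a non-interactive SSH session.

```python
import subprocess

NCC_LOG = "/home/nutanix/data/logs/ncc-output-latest.log"  # path from step 7


def ncc_run_command(cvm_ip, user="nutanix"):
    """Build the ssh command that runs the health checks on a CVM (step 4)."""
    return ["ssh", f"{user}@{cvm_ip}", "ncc health_checks run_all"]


def ncc_fetch_command(cvm_ip, dest=".", user="nutanix"):
    """Build the scp command that copies the latest NCC log locally (step 7)."""
    return ["scp", f"{user}@{cvm_ip}:{NCC_LOG}", dest]


def collect_ncc_log(cvm_ip, dest="."):
    """Run the checks, then download the log. ssh/scp prompt for the password."""
    subprocess.run(ncc_run_command(cvm_ip), check=True)
    subprocess.run(ncc_fetch_command(cvm_ip, dest), check=True)


# 10.0.0.10 is a placeholder; use a CVM IP from Home => Hardware => Table.
print(" ".join(ncc_fetch_command("10.0.0.10")))
```

The helpers only build the command lines, so they can be inspected before `collect_ncc_log` actually connects to a CVM.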

Hope this information helps.



XX

Sunday 14 August 2016

" Cannot truncate SQL logs for database: model " Veeam Backup

I was setting up Veeam for one of our clients and found the warning message below in a SQL Server backup job. The steps below helped me resolve the issue. Hope they help you too.

Issue:
 You will see the below warning message in a Veeam backup Job with SQL VMs. " Failed to truncate Microsoft SQL Server transaction logs. Details:  Error code: 0x80004005 " .

The complete error message is as below:

Failed to truncate Microsoft SQL Server transaction logs. Details:  Error code: 0x80004005
Failed to invoke func [TruncateSqlLogs]: Unspecified error. Failed to process 'TruncateSQLLog' command.
Failed to truncate SQL server transaction logs for instances:  . See guest helper log. .
 Error code: 0x80004005
Failed to invoke func [TruncateSqlLogs]: Unspecified error. Failed to process 'TruncateSQLLog' command.
Failed to truncate SQL server transaction logs for instances:  See guest helper log.


Verification:
1) On the SQL VM, open the Guest Agent log file located at " C:\ProgramData\Veeam\Backup\VeeamGuestHelper***.log " and verify whether you are seeing the message below:

 5604                                  Major version: 12
 5604                                  Database tempdb has not been backed up yet.
 5604                                  Database found: master. Recovery model: 3. Is readonly: false. State: 0.
 5604                                  Database found: tempdb. Recovery model: 3. Is readonly: false. State: 0.
 5604                                  Database found: model. Recovery model: 1. Is readonly: false. State: 0.
 5604  WARN                    Cannot truncate SQL logs for database: model. Code = 0x80040e14
 5604  WARN                    Code meaning = IDispatch error #3092
 5604  WARN                    Source = Microsoft OLE DB Provider for SQL Server
 5604  WARN                    Description = BACKUP LOG cannot be performed because there is no current database backup.
 5604  WARN                    No OLE DB Error Information found: hr = 0x80004005
 5604                                  Database found: msdb. Recovery model: 3. Is readonly: false. State: 0.
 5604                                  Database found: TEST. Recovery model: 1. Is readonly: false. State: 0.
 5604                                  Database found: omar. Recovery model: 1. Is readonly: false. State: 0.
 5604                              Truncating database logs (SQL instance: ). User: XXXXX.. Ok.
 5604                              Truncating database logs (SQL instance: ). User: NT AUTHORITY\SYSTEM.
5604  INFO                            Connecting to mssql, connection string: Provider='sqloledb';Data Source='(local)';Integrated Security='SSPI';Persist Security Info=False, timeout: 15
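
If the guest helper log is long, the warning can be picked out with a few lines of Python. The sketch below is based only on the line format shown in the excerpt above; the function name is made up, and other Veeam versions may word the message differently.

```python
import re

# Matches lines such as:
#   "5604  WARN   Cannot truncate SQL logs for database: model. Code = 0x80040e14"
TRUNCATE_WARNING = re.compile(
    r"Cannot truncate SQL logs for database: (?P<db>.+?)\. Code = (?P<code>0x[0-9a-fA-F]+)"
)


def failed_truncations(log_lines):
    """Return (database, error code) pairs for every truncation warning."""
    hits = []
    for line in log_lines:
        match = TRUNCATE_WARNING.search(line)
        if match:
            hits.append((match.group("db"), match.group("code")))
    return hits


sample = [
    " 5604    Database found: model. Recovery model: 1. Is readonly: false. State: 0.",
    " 5604  WARN    Cannot truncate SQL logs for database: model. Code = 0x80040e14",
    " 5604  WARN    Description = BACKUP LOG cannot be performed because there is no current database backup.",
]
print(failed_truncations(sample))  # [('model', '0x80040e14')]
```

To check a real log, pass the lines of the file to `failed_truncations` instead of the sample list.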


Solution :
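 The resolution to the issue was pretty simple. The log above shows the model database in recovery model 1 (full) with no database backup taken yet, which is why BACKUP LOG fails for it.
1) Log in to the SQL Server instance using SQL Server Management Studio.
2) Go to System Databases => right-click the model database => Tasks => Back Up => provide a name for the backup and start the backup.
3) Once an initial backup was taken, the warning message disappeared.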
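
The same initial full backup can also be taken with a T-SQL `BACKUP DATABASE ... TO DISK` statement instead of the Management Studio dialog. The small Python sketch below only builds that statement as text; the helper name and the backup path are placeholders I made up, and the statement still has to be run in a query window or through your usual SQL client.

```python
def backup_database_sql(database, backup_file):
    """Build a T-SQL full-backup statement for the given database."""
    # Escape ] inside a bracketed identifier and ' inside a string literal.
    name = database.replace("]", "]]")
    path = backup_file.replace("'", "''")
    return f"BACKUP DATABASE [{name}] TO DISK = N'{path}';"


# The backup path is a placeholder; point it at a folder SQL Server can write to.
print(backup_database_sql("model", r"C:\Backup\model_full.bak"))
# BACKUP DATABASE [model] TO DISK = N'C:\Backup\model_full.bak';
```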


XX

Saturday 4 June 2016

Nutanix Bible notes

Well, my learning has started with the "Nutanix Bible", which contains a wealth of information about how Nutanix has been designed and about its components. It took me a whole day to go through the entire content, not that I can remember all of it now. I will be going through it a second time, hopefully over the next weekend. I wish it were always so easy to find good information on all the other solutions I work on. :-)

What I wanted to do here is make an index of all the tips that have been provided in the "Nutanix Bible" for easy reference. All the credit here goes to the author of the site, "Steven Poitras"; he has done an excellent job.

1) Pro tip: 
For larger or distributed deployments (e.g. more than one cluster or multiple sites) it is recommended to use Prism Central to simplify operations and provide a single management UI for all clusters / sites.

2) Pro tip:
You can determine the current Prism leader by running 'curl localhost:2019/prism/leader' on any CVM.

3) Pro tip:
You can also get cluster wide upgrade status from any Nutanix CVM by running 'host_upgrade --status'.  The detailed per host status is logged to ~/data/logs/host_upgrade.out on each CVM.

4) Pro tip:
For larger deployments Glance should run on at least two Acropolis Clusters per site. This will provide Image Repo HA in the case of a cluster outage and ensure the images will always be available when not in the Image Cache.

5) Pro tip:
Check NTP if a service is seen as state 'down' in OpenStack Manager (Admin UI or CLI) even though the service is running in the OVM. Many services have a requirement for time to be in sync between the OpenStack Controller and Acropolis OVM.

6) Pro tip:
Data resiliency state will be shown in Prism on the dashboard page.

You can also check data resiliency state via the cli:
# Node status
ncli cluster get-domain-fault-tolerance-status type=node

# Block status
ncli cluster get-domain-fault-tolerance-status type=rackable_unit

These should always be up to date, however to refresh the data you can kick off a Curator partial scan.

7) Pro tip
You can override the default strip size (4/1 for “RF2 like” or 4/2 for “RF3 like”) via NCLI ‘ctr [create / edit] … erasure-code=<N>/<K>’ where N is the number of data blocks and K is the number of parity blocks.

8) Pro tip
It is always recommended to have a cluster size which has at least 1 more node than the combined strip size (data + parity) to allow for rebuilding of the strips in the event of a node failure. This eliminates any computation overhead on reads once the strips have been rebuilt (automated via Curator). For example, a 4/1 strip should have at least 6 nodes in the cluster. The previous table follows this best practice.
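
The sizing rule in this tip is simple arithmetic, sketched here in Python (the function name is mine, not something from the Nutanix Bible):

```python
def min_cluster_nodes(data_blocks, parity_blocks):
    """Recommended minimum node count for an N/K erasure-coding strip."""
    # Combined strip size (data + parity) plus one extra node for rebuilds.
    return data_blocks + parity_blocks + 1


print(min_cluster_nodes(4, 1))  # 6, matching the 4/1 example in the tip
print(min_cluster_nodes(4, 2))  # 7
```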

9) Pro tip
Erasure Coding pairs perfectly with inline compression which will add to the storage savings. I leverage inline compression + EC in my environments.

10) Pro tip
Almost always use inline compression (compression delay = 0) as it will only compress larger / sequential writes and not impact random write performance. Inline compression also pairs perfectly with erasure coding.

11) Pro tip
Use performance tier deduplication on your base images (you can manually fingerprint them using vdisk_manipulator) to take advantage of the unified cache.

Use capacity tier deduplication for P2V / V2V, when using Hyper-V since ODX does a full data copy, or when doing cross-container clones (not usually recommended as a single container is preferred).

In most other cases compression will yield the highest capacity savings and should be used instead.

12) Pro tip
Create multiple PDs for various services tiers driven by a desired RPO/RTO.  For file distribution (e.g. golden images, ISOs, etc.) you can create a PD with the files to replicate.

13) Pro tip
Group dependent application or service VMs in a consistency group to ensure they are recovered in a consistent state (e.g. App and DB).

14) Pro tip
The snapshot schedule should be equal to your desired RPO.

15) Pro tip
The retention policy should equal the number of restore points required per VM/file.

16) Pro tip
Ensure the target site has ample capacity (compute/storage) to handle a full site failure.  In certain cases replication/DR between racks within a single site can also make sense.

17) Pro tip
When using a remote site configured with a proxy, always utilize the cluster IP as that will always be hosted by the Prism Leader and available, even if CVM(s) go down.

18) Pro tip
Use reserve hosts when:
You have homogeneous clusters (all hosts DO have the same amount of RAM)
Consolidation ratio is higher priority than performance

Use reserve segments when:
You have heterogeneous clusters (all hosts DO NOT have the same amount of RAM)
Performance is higher priority than consolidation ratio.

19) Pro tip
You can override or manually set the number of reserved failover hosts with the following ACLI command:
acli ha.update num_reserved_hosts=<NUM_RESERVED>

20) Pro tip
Keep your hosts balanced when using segment based reservation. This will give the highest utilization and ensure not too many segments are reserved.

21) Pro tip
In ideal cases the hit rates should be above 80-90%+ if the workload is read heavy for the best possible read performance.

22) Pro tip
When looking at any potential performance issues I always look at the following:

Avg. latency
Avg. op size
Avg. outstanding

For more specific details the vdisk_stats page holds a plethora of information.

23) Pro tip
If you're seeing high read latency take a look at the read source for the vDisk and take a look where the I/Os are being served from.  In most cases high latency could be caused by reads coming from HDD (Estore HDD).

24) Pro tip
Random I/Os will be written to the Oplog, sequential I/Os will bypass the Oplog and be directly written to the Extent Store (Estore).

Tuesday 16 February 2016

" partedUtil Failed with message:Error " During ESXi Installation on HP ConvergedSystem 242-HC

Recently, I was delivering a POC using the HP ConvergedSystem 242-HC and ran into the issue below. I was not able to find any solution online or on HPE's website, and hence decided to share this information in case anyone else runs into it. I don't think all customers will face this, as the 242-HC comes pre-installed with VMware ESXi. If you have had to reinstall the entire unit from scratch, you might face this issue on the 242-HC/250-HC, and the same solution should apply.

Hopefully, when I have the time I will write a review about this product.

You can find more information about HP's Hyper Converged Systems here : HP ConvergedSystem 200-HC StoreVirtual

Issue : 
1) The 242-HC comes in a 2U chassis with 4 nodes, and each node has 6 drives of its own: 2 of them are SSDs and the remaining 4 are SAS. To install ESXi and run the HP StoreVirtual VM, we need to create a "Logical Volume" using the SAS disks and then create a logical drive of 150GB. On this 150GB partition we install ESXi first, and then on the remaining space we create a VMFS volume onto which we import the StoreVirtual VSA appliance. The problem is that after you create the logical drive, the ESXi installation fails with the error message below.
Figure 1: Operation Failed.


Solution:
1) The solution that worked for me was to create the 150GB logical drive with the "Parity Initialization Method" changed to Rapid instead of Default. The drive will take more time to initialize this way, so please be patient.

Find below the details about what this setting does:

Rapid Parity Initialization:
When you create a logical drive, you must initialize the parity using Rapid Parity Initialization.

RAID levels that use parity (RAID 5, RAID 6 (ADG), RAID 50, and RAID 60) require that the parity blocks be initialized to valid values. Valid parity data is required to enable enhanced data protection through background surface scan analysis and higher performance write operations. Two initialization methods are available:  

• Default – Initializes parity blocks in the background while the logical drive is available for access by the operating system. A lower RAID level results in faster parity initialization.

• Rapid – Overwrites both the data and parity blocks in the foreground. The logical drive remains invisible and unavailable to the operating system until the parity initialization process completes. All parity groups are initialized in parallel, but initialization is faster for single parity groups (RAID 5 and RAID 6). RAID level does not affect system performance during rapid initialization.
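
To make "valid parity values" a little more concrete, here is a toy Python sketch of single-parity XOR as used conceptually by RAID 5. It is an illustration only and says nothing about how the Smart Array controller actually implements this; RAID 6 adds a second, different parity calculation that is not shown, and the function names and sample bytes are made up.

```python
from functools import reduce


def xor_parity(blocks):
    """Compute a stripe's parity block as the byte-wise XOR of its blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving_blocks, parity):
    """Recover one missing data block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(stripe)
print(parity.hex())  # ee22
print(rebuild_block(stripe[:2], parity) == stripe[2])  # True
```

If the parity block held leftover garbage instead of the XOR of the data blocks, the rebuild in the last line would return wrong data, which is why the parity has to be initialized before it can be relied on.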

Figure 2: Parity Initialization Method

Hope the information helps. As usual, if you have any questions, please leave a comment.


xx